Friday, February 8, 2013

Are Twitter Followers Real || Modeling Changes in the Number of Twitter Followers

What type of pattern would one expect with an increase in the number of a person's Twitter followers? Would one expect a pattern? Would one be able to model an increase in one's Twitter followers similar to population growth?

A month or two or three ago the Pope signed up for Twitter, and I was waiting for him to post a tweet because I really wanted to see what he would say. So I kept a tab open in my browser with his Twitter feed and refreshed it once in a while to see if he had posted anything. I noticed weird fluctuations in the number of followers the Pope had, and I wanted to understand them, so I decided to monitor a few Twitter accounts and see how the number of followers changed over time (Perl Script for Grabbing Followers).

I sampled maybe slightly too often: about once every 15-30 seconds, depending on lag and my desire not to DDoS Twitter.

So what model would you expect to fit the growth of the number of Twitter followers that a person has?

Tough question.
The other issue is that we only have data for the days in question, not since the account was opened.
From when an account is opened until some time afterward, we would expect the follower count to follow an exponential model. That is over the long term. I am thinking more about the short term. How do we expect the number of followers to fluctuate day-to-day or over the course of a week?

I think we can assume that the increase in the number of followers is a "stochastic" process, and that the count of new followers over a fixed period of time is roughly Poisson. I mean, let's be honest here: rare events and independence of events are tough conditions to meet, but we need to make those assumptions for our data, at least for a basic model.

We will start off with some data on LeBron James, or KingJames as he calls himself on Twitter.
(HEHE, who calls themself King? weird. I mean he throws a ball at a circle for a living...)

Here is some sample data from December:

Time From Start (s)    Number of Followers
16                     6,622,356
36                     6,622,391
52                     6,622,356
68                     6,622,395
84                     6,622,396

What immediately jumped out at me was the fluctuation: in a period of less than a minute the count flips multiple times between 6,622,356 and 6,622,391ish. I looked this up online, and some suggest it is due to Twitter removing followers from fake accounts. That doesn't seem to be the case, because we always end up at the higher number in the fluctuation. What is going on here?
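To make the back-and-forth concrete, here is a minimal Python sketch (not the Perl script used for this post) that differences the five sample rows from the table above. The names `samples` and `deltas` are made up for illustration.

```python
# Sample rows from the table above: (seconds from start, follower count).
samples = [
    (16, 6_622_356),
    (36, 6_622_391),
    (52, 6_622_356),
    (68, 6_622_395),
    (84, 6_622_396),
]


def deltas(rows):
    """Return (time, change in count since the previous sample) pairs."""
    return [(t1, c1 - c0) for (_, c0), (t1, c1) in zip(rows, rows[1:])]


for t, change in deltas(samples):
    print(f"{t:>3} s  {change:+d}")
```

The count drops by exactly what it gained one sample earlier and then climbs back past it, which is the flip-flop described above.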



Next, what is this repeating sinusoidal-like pattern? Some Twitter filtering? Time-dependent fluctuations? We will see when we generate our model.

The unfortunate part is that the data fluctuates up and down, so we need to fix that. I filtered the data by staying with the highest number until a higher number was reached (Perl script here). The data looks the same; it just doesn't have all those downward spikes like the graph above.

The average was about 0.08 new followers per second, but I can't calculate a useful Poisson distribution from a per-second rate that small, so I found the average time it took to gain 5 new followers, which was 64 seconds. Then I calculated the probability distribution for x new followers in a 64-second window, with average = 5. Poisson: P(x) = e^(-average) * average^x / x!
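The filter and the Poisson formula are both short enough to sketch in Python. This is an illustration under assumptions, not the Perl script linked above: `running_max` and `poisson_pmf` are hypothetical names, and the input is just the five sample counts from the table earlier in the post.

```python
import math


def running_max(counts):
    """Keep the highest count seen so far, which removes the downward spikes."""
    filtered, best = [], None
    for c in counts:
        best = c if best is None else max(best, c)
        filtered.append(best)
    return filtered


def poisson_pmf(x, average):
    """P(x events in one window) = e^(-average) * average^x / x!"""
    return math.exp(-average) * average ** x / math.factorial(x)


counts = [6_622_356, 6_622_391, 6_622_356, 6_622_395, 6_622_396]
print(running_max(counts))

# Probability of seeing x new followers in a 64-second window (average = 5).
for x in range(11):
    print(x, round(poisson_pmf(x, 5), 4))
```

With an average of 5 per window, the most likely counts are 4 and 5, each with a probability of about 0.175.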


I generated the model by writing a Perl script that drew a pseudo-random number between 0 and 10000 and mapped it onto the probabilities (as percentages) multiplied by 100. Multiplied by 100 because then the sum equals 10000 (Code for model here). A lot easier to generate a model using numbers greater than 1.
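The same idea can be sketched in Python: scale the cumulative Poisson probabilities to integers out of 10000, draw a random integer, and see which bin it lands in. This is a guess at the approach, not the linked Perl code. The cutoff `max_x=30`, the seed, and the use of 0..9999 for the draw are all assumptions made for this example.

```python
import math
import random


def poisson_pmf(x, average):
    """P(x events in one window) = e^(-average) * average^x / x!"""
    return math.exp(-average) * average ** x / math.factorial(x)


def build_thresholds(average, scale=10_000, max_x=30):
    """Cumulative Poisson probabilities scaled to integers in [0, scale]."""
    thresholds, cumulative = [], 0.0
    for x in range(max_x + 1):
        cumulative += poisson_pmf(x, average)
        thresholds.append(round(cumulative * scale))
    thresholds[-1] = scale  # lump the tiny remaining tail into the last bin
    return thresholds


def draw(thresholds, rng, scale=10_000):
    """Pick a random integer in [0, scale) and return the bin it falls in."""
    r = rng.randrange(scale)
    for x, upper in enumerate(thresholds):
        if r < upper:
            return x
    return len(thresholds) - 1


rng = random.Random(0)  # fixed seed so the run is repeatable
thresholds = build_thresholds(5)
windows = [draw(thresholds, rng) for _ in range(20_000)]
print(sum(windows) / len(windows))  # simulated average per 64-second window
```

Working with integer thresholds avoids comparing against very small fractions, which is presumably the appeal of using numbers greater than 1.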




My initial conclusion is that the data follows a Poisson model pretty well, which makes me conclude that the increase in followers is real. What is with all this variability, though?

I could do something crazy here like a Poisson Markov model to try to build a better model, but I don't think that would tell me anything more about the data. What I want to do next is build a time-dependent Poisson function to replicate those ripples, if I can. I can do it in a program; I just want to figure out how to actually write the equation. I could try multiplying by a time-dependent sine function, but the problem is that it doesn't exactly look sinusoidal. It actually looks like there are multiple sine-like functions in play here: short term (on the order of tens of thousands of seconds) and long term (on the order of two hundred thousand seconds). OK, I need to collect more data and try a Fourier-transform-like process, such as the Lomb-Scargle method, to see if I can pinpoint these periodicities.
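One candidate way to write that equation is as a Poisson process whose rate varies with time (an inhomogeneous Poisson process). The two-sine form of the rate below is only an assumption to illustrate the idea. The amplitudes a_1, a_2, periods T_1, T_2 and phases \phi_1, \phi_2 are placeholders that would have to be fitted, and nothing here has been checked against the data.

```latex
\[
\lambda(t) = \lambda_0 \left[ 1
  + a_1 \sin\!\left( \frac{2\pi t}{T_1} + \phi_1 \right)
  + a_2 \sin\!\left( \frac{2\pi t}{T_2} + \phi_2 \right) \right],
\qquad |a_1| + |a_2| \le 1
\]
\[
P\bigl( N(t+\Delta) - N(t) = x \bigr) = \frac{e^{-\Lambda}\,\Lambda^{x}}{x!},
\qquad \Lambda = \int_{t}^{t+\Delta} \lambda(s)\,ds
\]
```

Here \lambda_0 is the baseline rate (about 0.08 followers per second above), T_1 and T_2 stand for the short and long ripple periods, and the constraint on the amplitudes keeps the rate from going negative. With a_1 = a_2 = 0 this reduces to the ordinary Poisson formula with average = \lambda_0 \Delta.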