In [2]:
figsize(12.5, 4)
import scipy.stats as stats
Most of you are likely familiar with the git-repository website Github. An observed phenomenon on Github is the scale-free nature of the popularity of repositories. Here, for lack of a better measure, we use the number of stars and forks to measure popularity. This is not a bad measure, but it ignores page-views and downloads, and it tends to overemphasize older repositories. Since we will be studying all repositories and not a single one, the absence of these measures is not as relevant.
Contained in this folder is a Python script for scraping data from Github on the popularity of repos. The script requires the Requests and BeautifulSoup libraries, but if you don't have those installed, the ./data folder provides the same data from a previous date (February 18, 2013 at last pull). The data is the fraction of repositories with stars greater than or equal to $2^k,\; k = 0,...,15$, and the fraction of repositories with forks greater than or equal to $2^k,\; k = 0,...,15$.
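For reference, the star/fork thresholds described above presumably look like the sketch below; this is an assumption about the arrays produced by github_datapull.py (the observed fractions themselves must still come from the script, run in the next cell, or from the cached files in ./data).
In [ ]:
# a minimal sketch of the thresholds described above: 2**k for k = 0, ..., 15;
# the corresponding observed fractions (repo_with_stars, repo_with_forks)
# come from github_datapull.py or the cached files in ./data
import numpy as np
stars_to_explore = 2 ** np.arange(16)   # 1, 2, 4, ..., 32768
forks_to_explore = 2 ** np.arange(16)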
In [1]:
%run github_datapull.py
In [3]:
plt.plot(stars_to_explore, repo_with_stars, label="stars")
plt.plot(forks_to_explore, repo_with_forks, label="forks")
plt.legend(loc="lower right")
plt.title("Popularity of Repos (as measured by stars and forks)")
plt.xlabel("$K$")
plt.ylabel("number of repos with stars/forks $\geq K$")
plt.xlim(-200, 35000)
Out[3]:
Clearly, we need to adjust the scale of this plot as most of the action is hidden. The number of repos falls very quickly. We will put it on a log-log plot.
In [4]:
plt.plot(np.log2(stars_to_explore + 1), np.log2(repo_with_stars + 1), 'o-', label="stars")
plt.plot(np.log2(forks_to_explore + 1), np.log2(repo_with_forks + 1), 'o-', label="forks")
plt.legend(loc="upper right")
plt.title("Log-Log plot of Popularity of Repos (as measured by stars and forks)")
plt.xlabel("$\log_2{K}$")
plt.ylabel("$\log_2$(number of repos with stars/forks $\geq K$)")
Out[4]:
Both characteristics look like a straight line when plotted on a log-log plot. What does this mean? Denote the fraction of repos with greater than or equal to $k$ stars (or forks) as $P(k)$. So in the above plot, $\log_2{P(k)}$ is on the y-axis and $\log_2{k}$ is on the x-axis. The above linear relationship can be written as:
$$ \log_2{P(k)} = \beta\log_2{k} + \alpha$$

Rearranging by raising 2 to the power of both sides:
$$ P(k) = 2^{\alpha} k^{\beta} = C k^{\beta}, \;\; C = 2^{\alpha}$$

This relationship is very interesting. It is called a power-law, and it occurs very frequently in social datasets. Why does it occur so frequently in social datasets? It has much to do with a "winner-take-all" or "winner-take-most" effect. Winners in a power-law environment are components that take a disproportionate amount of the popularity, and keep winning. In terms of the popularity of repos, winning repos are repos that are of very good quality (initially are winners), and are shared and talked about often (keep winning).
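To make the linear relationship concrete, here is a quick least-squares fit of the log-log star data. This is only a naive sketch that reuses the arrays from github_datapull.py; as discussed further below, naive fits like this tend to misjudge $\beta$.
In [ ]:
# naive estimate of beta (slope) and alpha (intercept) from the log-log data;
# the +1 guards against log2(0), as in the plotting cell above
log_k = np.log2(stars_to_explore + 1)
log_P = np.log2(repo_with_stars + 1)
beta_hat, alpha_hat = np.polyfit(log_k, log_P, 1)
print("naive fit: beta = %.2f, C = 2**%.2f" % (beta_hat, alpha_hat))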
The above plot also tells us that the majority of repos have very few stars and forks, only a handful have hundreds, and an incredibly small number have thousands. This is not so obvious from browsing Github's website, where you see some repos with 36000+ stars, but fail to see the millions that do not have any stars (as they are not popular, they won't be common on your tour of the site).
Distributions like this are also said to have fat tails, i.e. the probability does not drop quickly as we extend into the tail of the distribution, even though most of the probability mass is still concentrated near zero.
The heaviness of the tail and the strength of the "winner-take-all" effect are both controlled by the $\beta$ parameter: the smaller the magnitude of $\beta$, the more pronounced these effects. Below is a list of phenomena that follow a power-law, along with an approximate $\beta$ exponent for each [1]. Recall though that we never observe these numbers; we must infer them from the data.
Phenomenon | Assumed Exponent
---|---
Frequency of word use | -1.2
Number of hits on website | -1.4
US book sales | -1.5
Intensity of wars | -0.8
Net worth of Americans | -1.1
Github Stars | ??
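A quick numerical illustration of how the exponent controls the tail, using the pure power-law form $P(k) = k^{\beta}$ (i.e. $C = 1$) and two of the exponents above:
In [ ]:
# fraction of the k >= 10 population that also reaches k >= 1000,
# for a heavier-tailed and a thinner-tailed exponent
for b in (-0.8, -1.5):
    print("beta = %.1f: P(k >= 1000)/P(k >= 10) = %.4f" % (b, 1000.**b / 10.**b))
With $\beta = -0.8$, about 2.5% of the items that reach 10 also reach 1000, compared with 0.1% when $\beta = -1.5$: the smaller-magnitude exponent keeps far more mass deep in the tail.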
It is very easy to overestimate the magnitude of the true parameter $\beta$. This is because the tail events (the events of 500+ stars) are very rare. For example, suppose in our Github dataset we observe only 100 samples. With surprisingly high probability (about a third of the time), all of these samples will have fewer than 31 stars. This is because approximately 99% of all repos, i.e. (number of all repos $-$ number of repos with 31 or more stars)/(number of all repos), have fewer than 31 stars. Thus we would have no samples in our dataset from the tail of the distribution. If I then told you that there existed a repo with 36000+ stars, you would call me crazy, as it would be about 1000 times larger than the most popular repo you had observed. You would assign a $\beta$ of very large magnitude to your dataset (recall that a large-magnitude $\beta$ means thinner tails). Similarly, with about the same probability we would not see repos more popular than 64 stars if we had a sample of 1000. Taking this to its logical conclusion, how confident should we be that there does not exist a theoretical repo that could attain 72000+ stars, or 150000+ stars, one which would push the magnitude of the estimated $\beta$ down even further?
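The probability quoted above is simple arithmetic to check, using the $\approx 1\%$ tail fraction stated in the paragraph:
In [ ]:
# chance that none of n independent samples lands in a tail of probability p
p, n = 0.01, 100          # ~1% of repos have 31+ stars; 100 samples observed
print((1 - p) ** n)       # ~0.37: about a third of the time we see no tail events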
The model below uses a Yule-Simon likelihood, $P(k \;|\; \rho) = \rho \, B(k, \rho + 1)$, where $B$ is the Beta function, with an Exponential prior placed on $\rho$.
In [83]:
from scipy.special import beta
import pymc as mc

# Exponential prior on the Yule-Simon parameter rho
param = mc.Exponential("param", 1)

@mc.stochastic(dtype=int, observed=True)
def yule_simon(value=repo_with_stars, rho=param):
    """Yule-Simon likelihood: p(k | rho) = rho * B(k, rho + 1)"""
    def logp(value, rho):
        # total log-probability over the observed values
        return np.sum(np.log(rho) + np.log(beta(value, rho + 1)))

    def random(rho):
        # generative process: geometric draw with a log-exponential mixing rate
        W = stats.expon.rvs(scale=1. / rho)
        return stats.geom.rvs(np.exp(-W))

model = mc.Model([param, yule_simon])
mcmc = mc.MCMC(model)
mcmc.sample(10000, 8000)  # 10000 iterations, first 8000 discarded as burn-in
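Assuming the sampler above has run, the posterior samples of $\rho$ can be pulled out of the trace in the usual PyMC 2 way (a sketch only):
In [ ]:
# posterior samples of rho (the draws kept after burn-in)
rho_samples = mcmc.trace("param")[:]
print("posterior mean of rho: %.3f" % rho_samples.mean())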
In [84]:
# scratch: evaluate the Beta-function term of the Yule-Simon logp directly
def logp(value, rho):
    return np.log(rho) + np.log(beta(value, rho + 1))

beta(repo_with_stars, 1.3)
Out[84]:
In [20]:
x = np.linspace(1, 50, 200)
plt.plot(x, np.exp(-(x - 1)**2), label="Normal distribution")
plt.plot(x, x**(-2), label=r"Power law, $\beta = -2$")
plt.plot(x, x**(-1), label=r"Power law, $\beta = -1$")
plt.legend()
Out[20]:
In [1]:
from IPython.core.display import HTML
def css_styling():
styles = open("../styles/custom.css", "r").read()
return HTML(styles)
css_styling()
Out[1]:
In [167]:
import pymc as mc

# flat prior on the power-law exponent
beta = mc.Uniform("beta", -100, 100)

@mc.observed
def survival(value=y_, beta=beta):
    # binned power-law likelihood: P(bin [i, i+1)) = i**beta - (i+1)**beta,
    # with the observed count for that bin stored in value[i-1]
    return np.sum([value[i - 1] * np.log((i + 0.)**beta - (i + 1.)**beta)
                   for i in range(1, 99)])
In [168]:
model = mc.Model([survival, beta])
#map_ = mc.MAP(model)
#map_.fit()
mcmc = mc.MCMC(model)
mcmc.sample(50000, 40000)
In [155]:
from pymc.Matplot import plot as mcplot
mcplot(mcmc)
In [17]:
stars_to_explore[1:]
Out[17]:
In [149]:
# simulate 50,000 draws from a Pareto distribution with shape 2.5
a = stats.pareto.rvs(2.5, size=(50000, 1))
In [150]:
hist(a, bins=100)
print
In [165]:
# survival counts: number of simulated samples >= i, for i = 1, ..., 99
y = [(a >= i).sum() for i in range(1, 100)]
In [166]:
y_ = -np.diff(y)  # counts falling in each unit bin [i, i+1)
print y_
print y
In [112]:
b = -2.3
In [113]:
np.sum([y_[i - 1] * np.log((i + 0.)**b - (i + 1.)**b) for i in range(1, 7)]) + y[-1] * np.log(7)
Out[113]:
In [114]:
y_
Out[114]:
In [116]:
np.append( y_, y[-1] )
Out[116]:
In [129]:
mc.Uninformative?