Further reading (not mandatory): if you want to deepen your theoretical knowledge of power laws, you can consult the references below.
In [ ]:
%pylab inline
A continuous real random variable $X$ has a power-law distribution if its tail decays as a power of $x$:
$$ \mathbb{P}\left\{ X > x \right\} = \left( \frac{x_{min}}{x} \right)^\beta, \quad \forall x \ge x_{min}. $$
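As a quick numerical illustration (a sketch, not part of the original material): samples with this survival function can be drawn by inverse-transform sampling, since $X = x_{min} U^{-1/\beta}$ with $U$ uniform on $(0,1)$ satisfies $\mathbb{P}\{X > x\} = (x_{min}/x)^\beta$. The sample size and evaluation point are arbitrary choices.

```python
import numpy as np

# Inverse-transform sampling: if U ~ Uniform(0, 1), then
# X = xmin * U**(-1/beta) has survival function P{X > x} = (xmin/x)**beta.
rng = np.random.default_rng(0)
xmin, beta = 1.0, 1.5
u = rng.uniform(size=100_000)
samples = xmin * u ** (-1.0 / beta)

# Empirical survival function at x = 2 vs. the theoretical value.
x = 2.0
empirical = np.mean(samples > x)
theoretical = (xmin / x) ** beta  # 2**-1.5, about 0.354
print(empirical, theoretical)
```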
In [ ]:
# Power-law parameters
xmin = 1.
t = arange(xmin, 20, 0.01)
beta = 1.5
# Geometric-law parameters: same mean as the power law (1/p = beta/(beta-1) = 3)
p = (beta - 1) / beta
n = arange(0, 20, 1)
# CCDF
fig = figure(figsize=(12,8))
clf()
plot(t, pow(xmin/t, beta), label = r'Power law ($\beta = 3/2, x_{min} = 1$)', linewidth=2.0)
fill_between(t, 0, pow(xmin/t, beta), alpha = .4)
bar(n, pow(1-p, n), 1, color = "sandybrown", edgecolor = "white",
label = r'Geometric ($p = 1/3$)', linewidth=2.0)
axis([0, 20, 0, 1.])
xticks(fontsize = 20)
yticks(fontsize = 20)
xlabel("$x$", fontsize = 20)
ylabel("Survival function", fontsize = 20)
legend(fontsize = 20)
show()
In [ ]:
# CCDF on a log-log scale
t = arange(xmin, 100, 0.01)
n = arange(0, 100, 1)
fig = figure(figsize=(12,8))
ax = fig.add_subplot(111)
ax.loglog(t, pow(xmin/t, beta), label = r'Power law ($\beta = 3/2, x_{min} = 1$)', linewidth=2.0)
fill_between(t, 0, pow(xmin/t, beta), alpha = .4)
ax.bar(n, pow(1-p, n), 1, color = "sandybrown", edgecolor = "white",
label = r'Geometric ($p = 1/3$)', linewidth=2.0)
ax.set_xscale("log")
ax.set_yscale("log")
axis([1, 100, 1e-3, 1.])
xticks(fontsize = 20)
yticks(fontsize = 20)
xlabel("$x$", fontsize = 20)
ylabel("Survival function", fontsize = 20)
legend(loc = 3, fontsize = 20)
show()
In [ ]:
t = arange(xmin, 50, 0.01)
n = arange(1, 50, 1)
s = arange(0, 50, 0.01)
# PDF
fig = figure(figsize=(12,8))
ax = fig.add_subplot(111)
ax.loglog(t, beta * pow(xmin, beta) / pow(t, beta + 1),
label = r'Power law ($\beta = 3/2, x_{min} = 1$)', linewidth=2.0)
fill_between(t, 0, beta * pow(xmin, beta) / pow(t, beta + 1), alpha = .4)
ax.bar(n, p * pow(1-p, n-1), 1, color = "sandybrown", edgecolor = "white",
label = r'Geometric ($p = 1/3$)', linewidth=2.0)
ax.set_xscale("log")
ax.set_yscale("log")
axis([1, 50, 1e-3, 1.5])
xticks(fontsize = 20)
yticks(fontsize = 20)
xlabel("$t$", fontsize = 20)
ylabel("Probability density function (pdf)", fontsize = 20)
legend(loc = 3, fontsize = 20)
show()
In [ ]:
t = arange(xmin, 50, 0.01)
n = arange(1, 50, 1)
# CDF
fig = figure(figsize=(12,8))
ax = fig.add_subplot(111)
ax.loglog(t, 1 - pow(xmin/t, beta), label = r'Power law ($\beta = 3/2, x_{min} = 1$)', linewidth=2.0)
fill_between(t, 0, 1 - pow(xmin/t, beta), alpha = .4)
ax.bar(n, 1 - pow(1-p, n), 1, color = "sandybrown", edgecolor = "white",
label = r'Geometric ($p = 1/3$)', linewidth=2.0)
ax.set_xscale("log")
ax.set_yscale("log")
axis([1, 50, 1e-2, 1])
xticks(fontsize = 20)
yticks(fontsize = 20)
xlabel("$x$", fontsize = 20)
ylabel("Cumulative distribution function (cdf)", fontsize = 20)
legend(loc = 4, fontsize = 20)
show()
A discrete random variable $X$ taking values in $\{1,\ldots,N\}$ has a Zipf distribution with parameter $\tau > 0$ if
$$
\mathbb{P}\{ X = k \}
= \frac{ \frac1{k^\tau} }{ \sum_{n=1}^N \frac1{n^\tau} },
\quad \forall k = 1,\ldots,N.
$$
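A minimal numerical sketch of this definition (the values $N = 100$ and $\tau = 1.2$ are arbitrary choices): build the normalized probability mass function and note that consecutive probabilities decay as a power of the rank.

```python
import numpy as np

# Zipf pmf on {1, ..., N}: P{X = k} = k**-tau / sum_n n**-tau.
N, tau = 100, 1.2
k = np.arange(1, N + 1)
weights = k ** (-tau)
pmf = weights / weights.sum()

# The probabilities sum to 1, and P{X = 2k} / P{X = k} = 2**-tau
# for every k: the decay depends only on the rank ratio.
print(pmf[:3])
```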
A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener, "Graph structure in the Web," Computer Networks 33(1-6), 309-320, 2000.
Concretely:
"... there are many small elements contained within the Web, but few large ones. A few sites consist of millions of pages, but millions of sites only contain a handful of pages. Few sites contain millions of links, but many sites have one or two. Millions of users flock to a few select sites, giving little attention to millions of others."
L.A. Adamic and B.A. Huberman, "Zipf's law and the Internet," Glottometrics 3, 143-150, 2002.
Word frequency
"Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.: the rank-frequency distribution is an inverse relation." See Wikipedia, Zipf's law.
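The rank-frequency table described above is straightforward to compute; here is a sketch on a tiny made-up corpus (a real test of Zipf's law would of course require a large body of text).

```python
from collections import Counter

# Rank-frequency table for a toy corpus (hypothetical text, for
# illustration only).
text = "the cat sat on the mat the cat ran"
counts = Counter(text.split())
ranked = counts.most_common()  # [(word, count), ...] sorted by frequency

for rank, (word, count) in enumerate(ranked, start=1):
    print(rank, word, count)
# Under Zipf's law, count is approximately C / rank for some constant C,
# so the most frequent word occurs about twice as often as the second.
```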
$$
\frac{c_i}{c_{i-1}} = \frac{\alpha + \frac{1-\alpha}{2} (i-1)}{1 + \alpha + \frac{1-\alpha}{2} i},
\quad \forall i > 1.
$$
Rewriting the ratio,
$$
\frac{c_i}{c_{i-1}}
= 1 - \frac{3 - \alpha}{2 + 2 \alpha + (1 - \alpha)i}
= 1 - \frac{1}{i} \, \frac{3 - \alpha}{1 - \alpha} + O\left( \frac{1}{i^2} \right).
$$
Setting $\beta = \frac{3 - \alpha}{1 - \alpha}$ and iterating the recursion,
$$
c_i = c_1 \prod_{j=2}^i \left( 1 - \frac{\beta}{j} + O\left( \frac{1}{j^2} \right) \right).
$$
Taking logarithms,
\begin{align*}
\ln\left( \prod_{j=1}^i \left( 1 - \frac{\beta}{j} \right) \right)
&= \sum_{j=1}^i \ln\left( 1 - \frac{\beta}{j} \right)
\approx \sum_{j=1}^i - \frac{\beta}{j}, \\
&= - \beta \sum_{j=1}^i \frac{1}{j}
\approx - \beta \ln(i),
\end{align*}
so that $c_i \approx c_1 \, i^{-\beta}$: the proportions $c_i$ follow a power law.
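As a numerical sanity check (a sketch; the value $\alpha = 1/2$ is an arbitrary choice): iterating the exact ratio $c_i / c_{i-1}$ above should reproduce the $i^{-\beta}$ decay with $\beta = (3-\alpha)/(1-\alpha)$.

```python
import numpy as np

# Iterate the exact recursion for c_i and compare the resulting decay
# with the predicted power law i**(-beta), beta = (3 - alpha)/(1 - alpha).
alpha = 0.5
beta = (3 - alpha) / (1 - alpha)  # = 5 for alpha = 1/2
imax = 10_000
c = np.empty(imax + 1)
c[1] = 1.0
for i in range(2, imax + 1):
    num = alpha + (1 - alpha) / 2 * (i - 1)
    den = 1 + alpha + (1 - alpha) / 2 * i
    c[i] = c[i - 1] * num / den

# The log-log slope over the last decade should approach -beta.
slope = (np.log(c[imax]) - np.log(c[imax // 10])) / (np.log(imax) - np.log(imax // 10))
print(slope)  # close to -beta = -5
```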