Last time (see 'long-tail-distributions-001') I talked about power law distributions in the context of long tail distributions as a whole. Before launching into a discussion of power law distributions, I want to reiterate that power laws are just one type of long tail distribution.
Long tail distributions, and even power law distributions, show up with some regularity in empirical Twitter data. Distributions such as "the number of likes a Tweet receives" seem to follow a long tail distribution (possibly even a power law distribution).
Understanding how to recognize long tail distributions, how to represent them, how to test for them, and where they might be coming from could help us identify the underlying processes behind the emergence of certain patterns in Twitter data.
Today's RST is brought to you in part by the lovely Dr. MEJ Newman and his 2006 paper: Power laws, Pareto distributions and Zipf's law. This notebook is by me, Fiona Pigott.
In [ ]:
# Plotting library
import matplotlib.pyplot as plt
%matplotlib inline
# mathematics
from math import exp, pi, sqrt, log
from numpy import linspace, hstack, round, random, arange, mean, logspace, argmax, array
from collections import Counter
from random import sample, choice
from random import uniform
#from functools import reduce
In [ ]:
def power_law(x, alpha):
    '''
    Evaluate an (unnormalized) power law x**(-alpha) at each x value.
    x: iterable of x values
    alpha: parameter alpha
    '''
    return [point**(-alpha) for point in x]

# x values
x = linspace(1,5,10000)
alphas = linspace(1,3,5)
# plot the power law for several values of alpha
for alpha in alphas:
    _ = plt.plot(x, power_law(x, alpha), '-', markersize = 1)
_ = plt.ylabel("Probability mass at x")
_ = plt.xlabel("x")
_ = plt.title("Power Law Distribution")
_ = plt.legend(["alpha = {}".format(round(a,2)) for a in alphas], loc = 1)
In [ ]:
# the same curves on log-log axes, where a power law is a straight line
for alpha in alphas:
    _ = plt.loglog(x, power_law(x, alpha), '-', markersize = 1)
_ = plt.ylabel("Probability mass at x")
_ = plt.xlabel("x")
_ = plt.title("Power Law Distribution (log-log axes)")
_ = plt.legend(["alpha = {}".format(round(a,2)) for a in alphas], loc = 1)
I'm going to generate a sample of power-law distributed data using a transformation method: applying a functional transformation to uniformly distributed random data to get power-law distributed random data.
I want to show how sampling a power law is subject to lots of noise--especially in the tail--and how to see that in your representations of data.
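To do that, I'll use the recipe from Newman's paper: if r is a random number drawn uniformly from [0, 1), then

x = xmin * (1 - r)**(-1/(alpha - 1))

is a power-law distributed random number with exponent alpha on x >= xmin. That's exactly the transformation applied in the next cell.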
In [ ]:
# generate 10**6 samples from a uniform distribution on [0,1)
xmin = 1
r = random.uniform(0,1,10**6)
# transform to power-law distributed samples: x = xmin*(1-r)**(-1/(alpha-1))
# set alpha
alpha = 2.5
power_law_samples = [xmin*(1-sample)**(-1/(alpha-1)) for sample in r]
From Newman:
A histogram of a quantity with a power-law distribution appears as a straight line when plotted on logarithmic scales. Just making a simple histogram, however, and plotting it on log scales to see if it looks straight is, in most cases, a poor way to proceed.
Simply plotting a histogram of an empirical distribution to determine whether or not it is power law distributed is "a poor way to proceed" because the low-frequency, very large values in the tail of the distribution make the right-hand side of the plot extremely noisy. Newman suggests a few better ways of looking at power law distributions: logarithmically binned histograms, the cumulative distribution function, and a maximum likelihood fit of the exponent. We'll walk through each of them below.
In [ ]:
def simple_histogram(samples, bin_sizes = None, num_bins = None):
    # sort the samples
    sorted_samples = sorted(samples)
    # grab the max for the bins
    max_val = max(samples)
    # min is 0, but not inclusive
    min_val = 0
    # either create bins by interval or by size
    if bin_sizes is not None:
        intervals = arange(min_val,max_val+bin_sizes,bin_sizes)
    if num_bins is not None:
        intervals = linspace(min_val,max_val,num_bins)
    # list of tuples defining the bins
    list_bins = list(zip(intervals,intervals[1:]))
    dict_bins = {x: 0 for x in list_bins}
    # put each sample into a bin
    i = 0
    for s in sorted_samples:
        if (s > list_bins[i][0]) and (s <= list_bins[i][1]):
            dict_bins[list_bins[i]] += 1
        else:
            while not ((s > list_bins[i][0]) and (s <= list_bins[i][1])):
                i += 1
            dict_bins[list_bins[i]] += 1
    return dict_bins

def plot_histogram(histogram):
    histogram_items = sorted(list(histogram.items()),key = lambda x:x[0])
    x_vals = [mean([x[0][0],x[0][1]]) for x in histogram_items]
    y_vals = [x[1] for x in histogram_items]
    return x_vals, y_vals
In [ ]:
x,y = plot_histogram(simple_histogram(power_law_samples,num_bins = 100))
_ = plt.plot(x,y,'.')
_ = plt.title("A simple histogram")
_ = plt.ylabel("# of samples")
_ = plt.xlabel("x")
From Newman:
Notice how noisy the results get in the tail towards the right-hand side of the panel. This happens because the number of samples in the bins becomes small and statistical fluctuations are therefore large as a fraction of sample number. The power-law distribution dwindles in this region, meaning that each bin only has a few samples in it, if any.
Noise in the tail is one of the reasons that a log-log histogram isn't always the clearest way to plot a power law distribution.
In [ ]:
x,y = plot_histogram(simple_histogram(power_law_samples,num_bins = 100))
_ = plt.loglog(x,y,'.-')
_ = plt.title("Log-log histogram")
_ = plt.ylabel("# of samples")
_ = plt.xlabel("x")
In [ ]:
def log_histogram(samples, num_bins = None):
    # sort the samples
    sorted_samples = sorted(samples)
    # grab the max for the bins
    max_val = max(samples)
    # the smallest bin edge is 10**0 = 1 (all of our samples are >= xmin = 1)
    min_val = 0
    # create logarithmically spaced bins
    intervals = logspace(min_val,log(max_val+10,10),num_bins,base = 10)
    # list of tuples defining the bins
    list_bins = list(zip(intervals,intervals[1:]))
    dict_bins = {x: 0 for x in list_bins}
    # put each sample into a bin
    i = 0
    for s in sorted_samples:
        if (s > list_bins[i][0]) and (s <= list_bins[i][1]):
            dict_bins[list_bins[i]] += 1
        else:
            while not ((s > list_bins[i][0]) and (s <= list_bins[i][1])):
                i += 1
            dict_bins[list_bins[i]] += 1
    return dict_bins

def plot_log_histogram(histogram):
    # return the geometric center of each bin and the count normalized by bin width
    histogram_items = sorted(list(histogram.items()),key = lambda x:x[0][0])
    x_vals = [10**mean([log(x[0][0],10),log(x[0][1],10)]) for x in histogram_items]
    y_vals = [x[1]/(x[0][1] - x[0][0]) for x in histogram_items]
    return x_vals, y_vals
In [ ]:
x,y = plot_log_histogram(log_histogram(power_law_samples,num_bins = 100))
_ = plt.loglog(x,y,'.')
_ = plt.title("A log-log histogram with logarithmic binning")
_ = plt.ylabel("# of samples")
_ = plt.xlabel("x")
In [ ]:
def cumulative_dist(samples,bin_sizes = None, num_bins = None):
    # count the number of samples with a value greater than x
    # sort the samples
    sorted_samples = array(sorted(samples))
    # grab the max for the bins
    max_val = max(samples)
    # min is 0, but not inclusive
    min_val = 0
    # either create bins by interval or by size
    if bin_sizes is not None:
        intervals = arange(min_val,max_val+bin_sizes,bin_sizes)
    if num_bins is not None:
        intervals = linspace(min_val,max_val,num_bins)
    # for each value of x, count the samples greater than x
    cum_dist = []
    for i in intervals:
        cum_dist.append(int((sorted_samples > i).sum()))
    return intervals, cum_dist
In [ ]:
x,y = cumulative_dist(power_law_samples,num_bins = 100)
_ = plt.loglog(x,y,'.')
_ = plt.title("Cumulative distribution")
_ = plt.ylabel("# of samples > x")
_ = plt.xlabel("x")
In [ ]:
xmin = min(power_law_samples)
est_alpha = 1 + (len(power_law_samples) * sum([log(x/xmin) for x in power_law_samples])**(-1))
print("The estimated alpha is: {}".format(est_alpha))
print("The actual alpha is: {}".format(alpha))
From Newman:
A rich-get-richer mechanism in which the most populous cities or best-selling books get more inhabitants or sales in proportion to the number that they already have.
I'm particularly interested in this generative process because it seems like it could be a good hypothesis for explaining Twitter platform distributions like "the number of Likes/RTs per Tweet."
In networks, this process is commonly labeled as "preferential attachment."
There is a functional form for the distribution resulting from the Yule process, but I don't think that's as informative as seeing the process in action. Let's simulate a simple rich-get-richer process.
In [ ]:
# simulate the Yule (rich-get-richer) process
# start with a pool of 10**6 nodes, none of which have any links yet
# I'm only interested in counting degree distributions, so I will simply record each link by
# augmenting the degree count of 2 nodes by 1
N = {x:0 for x in range(0,10**6)}
# we'll add nodes to the network 10**6 - 1 times:
# in order, each new node gets a single link
# the probability that its link connects to an existing node is proportional to that node's degree
# (note: the probability of a node with no links getting a new link is 0)
in_network_nodes = [0]
# nodes_by_degree holds each node once per unit of degree (plus node 0 as a seed),
# so choosing uniformly from this list chooses a node proportionally to its degree
nodes_by_degree = [0]
for x in range(1,10**6):
    # add the new node and count the endpoint of its one link
    in_network_nodes.append(x)
    N[x] += 1
    # the new link connects to an existing node with probability proportional to that node's degree
    gets_a_link = choice(nodes_by_degree)
    N[gets_a_link] += 1
    nodes_by_degree.extend([x,gets_a_link])
In [ ]:
x,y = plot_histogram(simple_histogram(list(N.values()),num_bins = 1000))
_ = plt.loglog(x,y,'.')
_ = plt.title("Log-log histogram of the degree distribution")
_ = plt.ylabel("# of samples")
_ = plt.xlabel("x (node degree)")
In [ ]:
degrees = list(N.values())
xmin = min(degrees)
est_alpha = 1 + (len(degrees) * sum([log(x/xmin) for x in degrees])**(-1))
print("The estimated alpha is: {}".format(est_alpha))