In [8]:
import matplotlib.pyplot as plt
import numpy as np
from nbsutils import imp
import time
For a normal distribution, the mean, median, and mode are all fine choices because they're all about the same. An example of where this applies reasonably well is performance on a difficult test:
In [82]:
hard_test_scores = np.random.randn(1000)*10+50
easy_test_scores = np.random.randn(1000)*5+75
print "average:\t{}\t{}".format( int(np.mean(easy_test_scores)), int(np.mean(hard_test_scores)) )
print "median:\t\t{}\t{}".format( int(np.median(easy_test_scores)), int(np.median(hard_test_scores)) )
print "most common:\t{}\t{}".format( np.bincount(map(int, easy_test_scores)).argmax(), np.bincount(map(int, hard_test_scores)).argmax() )
plt.hist(hard_test_scores, label="test A", alpha=0.5);
plt.hist(easy_test_scores, label="test B", alpha=0.5);
plt.title("Test scores for 1000 students"); plt.xlabel("Scores"); plt.legend()
plt.show()
The distribution of test scores gives us context for evaluating how well a particular test taker performed. Knowing that someone got a score of 70 isn't meaningful in and of itself. On test A, a score of 70 is great, while on test B a score of 70 is slightly below average.
The fact that the normal distribution is symmetric and has a single well-defined peak makes it easy to compare relative performance across two different tests. We can capture everything we need to know about the distribution in just two numbers: the mean and the standard deviation.
These two distribution parameters allow us to precisely quantify the relative performance of any particular test taker. Knowing the mean allows you to determine if a score is normal. Knowing the standard deviation allows you to determine how far from normal that score is. How good is a score of 70 on test A? Well, the average score is 50 and the standard deviation is 10, so a score of 70 on test A is two standard deviations above the mean. That's better than about 98% of all test takers of test A.
The average score on test B, however, is 75, with a standard deviation of 5 (scores are less spread out than on test A). A score of 70 is one standard deviation below the average.
So how well would you have to score on test B to have the equivalent of a score of 70 on test A?
These types of comparisons are so common in statistics that they have a special name: the z-score. The z-score is the number of standard deviations above (positive scores) or below (negative scores) the average value.
The z-score for a score of 70 on test A is 2 (2 standard deviations above the average score on that test). What score on test B has a z-score of 2?
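Worked through as a quick sketch (these helper functions are illustrative, not part of the original notebook), the z-score formula can be applied in both directions: standardize a raw score, or recover the raw score that corresponds to a given z-score on another test.

```python
def z_score(raw, mean, std):
    """Number of standard deviations a raw score lies above the mean."""
    return (raw - mean) / float(std)

def raw_score(z, mean, std):
    """Invert the z-score: the raw score at z standard deviations above the mean."""
    return mean + z * std

# A 70 on test A (mean 50, std 10) is two standard deviations above average...
z_a = z_score(70, 50, 10)            # 2.0
# ...so the equivalent score on test B (mean 75, std 5) is:
equivalent_b = raw_score(z_a, 75, 5)  # 85.0
```

So a score of 85 on test B carries the same relative standing as a 70 on test A.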
One way to visualize the z-score is to imagine taking the peak of a distribution's histogram and moving it to the zero mark on the x-axis. Then you stretch or shrink the distribution so that about 95% of the values lie between -2 and +2. Here's what the distributions for test A and test B look like when we do that shifting and shrinking:
In [44]:
plt.hist((hard_test_scores-50)/10, label="test A", alpha=0.5);
plt.hist((easy_test_scores-75)/5, label="test B", alpha=0.5);
plt.title("Test scores for 1000 students"); plt.xlabel("Scores"); plt.legend()
plt.show()
To do this shifting and shrinking we just subtract the average value from every score in the distribution and divide by the standard deviation. By doing this we're plotting the z-score for every test taker rather than the raw score.
$$\text{z-score} = \frac{\text{raw score} - \text{average}}{\text{standard deviation}}$$

Notice that when we plot a histogram of the z-scores, the distributions from both tests lie on top of each other. A raw score of 70 on test A is not equal to a raw score of 70 on test B, but a z-score of 2 on test A is equivalent to a z-score of 2 on test B.
z-scores are a great way to build context into a metric, but they're really only meant to be used with single-peaked, symmetric, normal distributions. For metric values that follow a normal distribution, the mean, median, and mode are all the same, so it's clear how to define what a "normal" value should be. Many real-world distributions, however, are very asymmetric. Take, for example, the number of Twitter followers for each Twitter account tracked by Next Big Sound.
In [185]:
onemonth_ago = time.time()-30*86400
six_months_ago = time.time()-180*86400
impdb = imp.Connection()
query = """SELECT entity_id, {}(value) AS val
FROM idx_entity
WHERE metric_id={}
AND count_type='{}'
AND unix_seconds>{}
GROUP BY entity_id"""
twitter_total = impdb.fetchAll( query.format("MAX", 28, "t", onemonth_ago) )
impdb.close()
In [78]:
print "average:\t{}".format( int(np.mean(twitter_total['val'])) )
print "median:\t\t{}".format( int(np.median(twitter_total['val'])) )
print "most common:\t{}".format( np.bincount(map(int, twitter_total['val'])).argmax() )
print "max value:\t{}".format( int(np.max(twitter_total['val'])) )
print "percent of accounts with less than 1000 followers:\t{}%".format( int( 100.0*len(twitter_total[twitter_total['val']<1000])/len(twitter_total)))
The mean, median, and mode are vastly different, spanning four orders of magnitude. Most Twitter handles have fewer than 500 followers, while the most popular Twitter accounts have more than 10 million. Here's what the histogram of the distribution looks like:
In [67]:
twitter_total['val'].plot( kind='hist', bins=100)
plt.title("Follower Count for the 200k Twitter Accounts"); plt.xlabel("Number of Twitter Followers");
plt.show()
The large majority of Twitter accounts fall in the first bin. Even if we zoom in on the far left side of the histogram, the distribution is still highly skewed:
In [68]:
twitter_total['val'].plot( kind='hist', bins=10000, xlim=[0,100000])
plt.title("Follower Count for the 200k Twitter Accounts"); plt.xlabel("Number of Twitter Followers");
plt.show()
Twitter follower counts are a good example of a log-normal distribution. On a linear scale, the distribution is unmanageably skewed, but on a log scale, the distribution is actually normal (hence the name):
In [133]:
twitter_total['val'].plot( kind='hist', bins=np.logspace(0.1, int(np.log10(np.max(twitter_total['val'])))+1, 50), logx=True)
plt.title("Follower Count for the 200k Twitter Accounts Log10 Scale"); plt.xlabel("Number of Twitter Followers");
plt.show()
When dealing with metrics that are log-normally distributed, it is often best to work with the logarithm of those values. Then, rather than trying to describe our distribution with the average value (which can be heavily influenced by just a few values within the distribution), we can use the average log value.
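As a sketch of why this matters (on synthetic data, not the NBS metrics): the plain average of a log-normal sample is dragged far above its median by a handful of huge values, while the average log value sits right at the center of the log-scale histogram.

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic "follower counts": log10 of the values is normal with mean 3, std 1
followers = 10 ** rng.normal(3.0, 1.0, size=200000)

plain_mean = followers.mean()          # inflated by the long right tail
median = np.median(followers)          # ~10**3
avg_log = np.log10(followers).mean()   # ~3.0, the center on a log scale

# The average log value lines up with the median on a log scale,
# while the plain mean lands several times higher.
```

This is exactly the mean/median/mode gap we saw in the real follower counts above: the plain mean is dominated by a few enormous accounts, but the average log value describes a typical account.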
After we perform the log transform on our metric values, we can then use the z-score of those log-transformed values to determine how well an artist is performing on a given network. This z-score also allows us to compare different networks with each other.
Below are the log-transformed distributions for total Twitter followers and daily Wikipedia page views, followed by a plot of the distribution of z-scores for both metrics.
In [147]:
impdb = imp.Connection()
wiki_avg_daily = impdb.fetchAll( query.format("AVG", 41, "d", onemonth_ago) )
impdb.close()
In [154]:
plt.hist(np.log10(twitter_total['val']), label="Twitter Followers", alpha=0.5, bins=50);
plt.hist(np.log10(wiki_avg_daily['val']), label="Wikipedia Pageviews", alpha=0.5, bins=15);
plt.title("Metric Count Histogram (Log Scale)"); plt.xlabel("Log10(Metric Value)"); plt.legend()
plt.show()
In [164]:
log_twitter = np.log10(twitter_total['val'])
log_wiki = np.log10(wiki_avg_daily['val'])
plt.hist((log_twitter-np.mean(log_twitter))/np.std(log_twitter), label="Twitter Followers", alpha=0.5, bins=50);
plt.hist((log_wiki-np.mean(log_wiki))/np.std(log_wiki), label="Wikipedia Pageviews", alpha=0.5, bins=15);
plt.title("Metric Distributions"); plt.xlabel("z-score"); plt.legend()
plt.show()
Lady Gaga's social media metrics will always sit above the 99.99th percentile, but the engagement with that online audience may fluctuate from month to month. If we plot Instagram's "Likes" metric against the number of Instagram followers for 10,000 artists on NBS, we see that the two are highly correlated.
In [215]:
impdb = imp.Connection()
Inst_Followers = impdb.fetchAll( query.format("MAX", 256, "t", onemonth_ago) )
Inst_Likes = impdb.fetchAll( query.format("AVG", 254, "d", six_months_ago) )
impdb.close()
In [238]:
Inst = Inst_Followers.merge(Inst_Likes,on="entity_id")
Inst.columns = ['entity_id', 'Followers', 'Likes']
Inst_sample = Inst[(Inst['Followers']>1) & (Inst['Likes']>0.05)]
Inst_sample = Inst_sample.loc[np.random.choice(Inst_sample.index.values, 10000, replace=False)]
In [271]:
plt.scatter( np.log(Inst_sample['Followers']), np.log(Inst_sample['Likes']), alpha=0.05)
plt.xlabel("Log(Instagram Followers)"); plt.ylabel("Log(Instagram Likes)");
A best-fit line through this data gives us a function relating audience size to audience engagement:
In [270]:
p1 = np.polyfit(np.log(Inst_sample['Followers']), np.log(Inst_sample['Likes']),1).tolist()
x = np.linspace(2,18,50)
y = p1[0]*x+p1[1]
plt.scatter( np.log(Inst_sample['Followers']), np.log(Inst_sample['Likes']), alpha=0.1)
plt.plot(x,y,color='black')
plt.xlabel("Log(Instagram Followers)"); plt.ylabel("Log(Instagram Likes)");
Anyone falling above this line is seeing more engagement than we would expect given their audience size; anyone falling below it is seeing less. Plotting a histogram of the distance from the best-fit line, we see that it's normal-ish, so we can again use z-scores to describe artists' relative engagement on a particular network:
In [268]:
# Vertical distance of each artist from the best-fit line (log-scale residuals)
inst_flat = np.log(Inst_sample['Likes']) - (p1[0]*np.log(Inst_sample['Followers']) + p1[1])
plt.hist(inst_flat.tolist(), bins=30)
plt.show()
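To make the residual z-score concrete, here is a self-contained sketch on synthetic data (the variable names and the synthetic slope are illustrative, not from the NBS database): fit a line through log-log data, take each point's vertical distance from that line, and standardize those residuals.

```python
import numpy as np

rng = np.random.RandomState(1)
n = 10000
log_followers = rng.normal(8.0, 2.0, size=n)
# Engagement scales roughly as a power law of audience size, plus noise
log_likes = 0.9 * log_followers - 2.0 + rng.normal(0.0, 0.5, size=n)

# Best-fit line through the log-log scatter
slope, intercept = np.polyfit(log_followers, log_likes, 1)

# Residual: how far each artist sits above or below the line
residuals = log_likes - (slope * log_followers + intercept)

# Standardized residuals: engagement relative to what audience size predicts
engagement_z = (residuals - residuals.mean()) / residuals.std()
# engagement_z > 0: more likes than the follower count alone would suggest;
# engagement_z < 0: fewer.
```

Because the residuals are roughly normal, this engagement z-score inherits all the nice properties discussed above: an artist two standard deviations above the line is out-engaging about 98% of comparably sized accounts.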