All numbers used in this exercise come from Yelp's academic dataset challenge, to which we've added some fake traffic data.
In this notebook we're going to do a bit of data science using pandas, matplotlib, and some fake Yelp traffic data. We're going to create a better SEO strategy for our platform geared toward optimizing for expected future traffic in various markets, and in the process learn about "long-tail" or power-law data.
This notebook is the full version where all the steps are filled out. If you're looking for the "give me a challenge" version this isn't what you want.
Let's start with some problem framing. Your SEO analysts tell you they need to decide whether to focus their work on low-review or high-review businesses this quarter. You want to have a coherent market strategy, so you ask your web-ops team to get you some numbers on incoming traffic versus the number of reviews you've seen historically.
Here we'll use the Yelp academic dataset to get our business information, and an included CSV file, yelp_dataset_cumulative_value_of_a_review_in_pageviews_per_year.csv, to represent the web-ops team's response on incoming traffic. We've already segmented this data by market "saturation", a rough gauge of how saturated we think Yelp usage is in a given city.
Here we're just doing some basic python importing. For this tutorial you'll need the following python packages: pandas, matplotlib, and numpy.
This tutorial is written for python 2.7.
In [1]:
%load_ext autoreload
In [3]:
%autoreload 2
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import json
pd.set_option('display.mpl_style', 'default')
plt.rcParams['figure.figsize'] = (12.0, 8.0)
Jupyter notebooks are a really wonderful way to merge code, data, and notes into a single cohesive analysis. The most important bit to understand for this tutorial is that notebooks run code in "cells". This text is in a cell for writing notes, called a markdown cell. Here we can use HTML tags and web markdown syntax.
In [4]:
# This is a code cell. Here text needs to be either python code or commented out like this.
a = 5
print 1
When we run a code cell (shift-enter) the notebook runs that bit of code; when we run a markdown cell, what happens? Try clicking on this to enter edit mode, doing a bit of editing, then hitting shift-enter. How does HTML tagging work in this sort of cell? Can you make this BOLD?
Edit base_file_path below to point to where you've saved the yelp_academic_dataset_business.json and yelp_dataset_cumulative_value_of_a_review_in_pageviews_per_year.csv files.
In [5]:
base_file_path = '../data/'
In [6]:
yelp_biz_dataset_file = base_file_path + 'yelp_academic_dataset_business.json'
yelp_market_saturation_file = base_file_path + 'yelp_dataset_cumulative_value_of_a_review_in_pageviews_per_year.csv'
Let's start by importing our market pageview data:
In [7]:
market_data = pd.read_csv(yelp_market_saturation_file, index_col=0)
The above loads the CSV into a pandas dataframe, the most important object in pandas. It is easiest to think of a pandas dataframe as a special form of spreadsheet. Unlike a spreadsheet, we have to interact with our dataframe through code. Let's take a look at it first.
In [8]:
# To look at a dataframe use the .head() command. By default this shows up to the first 5 rows.
market_data.head()
Out[8]:
This dataframe only has 3 rows so that's all we see. We can always pass a number to df.head() to get more rows if they exist.
So what can we say from this dataframe? First, what is in it? Let's look at a subsection of columns; we do this by passing a list of the columns we'd like to see, enclosed in square brackets.
In [9]:
market_data[[
'market_saturation',
'mean_impact_for_first_ten_reviews',
'mean_impact_for_higher_reviews'
]].head()
Out[9]:
So what do these numbers mean? For each additional review on a business in a mid-market city we expect to get, on average, an extra 135.426 cumulative page-views per year if that review is one of the first 10 reviews for that business, and 84.14 additional page-views per year if that review comes after those first 10. These are cumulative in this exercise because they take into account that more page-views lead to more reviews later, and they represent the total cumulative impact per year of a new review.
Does that make sense intuitively? We might imagine that the first 10 reviews allow a business to go from completely undiscovered to discovered, so they generate a lot of traffic as people discover that a place exists. On the other hand we don't see nearly as much impact past the first 10 reviews. That makes sense too: suppose a place had 200 reviews, how much do we expect traffic to improve if it got another review?
For this tutorial I've simplified this to just a binary cut. In reality there's a time-dependent surface underlying what's going on but for the exercise let's keep it simple as we'll still capture the basic idea.
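To make the arithmetic concrete, here's a tiny illustrative sketch (not part of the original analysis) that plugs in the mid-market means quoted above: 135.426 page-views per year for one of the first ten reviews and 84.14 afterwards. The function name and the example business are made up for illustration.
In [ ]:
# Illustrative only: expected yearly page-view impact for a hypothetical
# mid-market business, using the mean values quoted in the text above.
mean_impact_first_ten = 135.426  # page-views per year per review, reviews 1-10
mean_impact_later = 84.14        # page-views per year per review, reviews 11+

def expected_added_pageviews(current_reviews, new_reviews):
    # Sum the expected impact of each new review, switching rates after review 10.
    total = 0.0
    for i in range(current_reviews + 1, current_reviews + new_reviews + 1):
        total += mean_impact_first_ten if i <= 10 else mean_impact_later
    return total

# A business going from 8 to 12 reviews gets two "early" and two "later" reviews:
print expected_added_pageviews(8, 4)  # 2 * 135.426 + 2 * 84.14 = 439.132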
So what are the other columns in our market data? Let's transpose (rotate) our dataframe using the .T command. Notice how I can chain commands in pandas with the dot syntax.
In [10]:
market_data.head().T
Out[10]:
So we understand the mean impact rows in this dataframe, but what are these lower and upper bound rows? These represent our best guess at the limits of where the true impact value lies given our data; they represent our uncertainty.
If you don't know any statistics, no problem: we can think of our uncertainty as simply representing the bounds on where we'd be willing to place our bets. We can understand intuitively what gives rise to this: if we examined just one business, we wouldn't be very sure what to expect for the millions of other businesses on Yelp; rather, we'd want to capture that we had very limited information.
Diving a bit more into the statistics, these bounds are our 95% credible interval. This sounds really fancy but is actually pretty easy to understand.
Let's unpack what a credible interval is in plain English using a simple coin toss; while this might seem a bit of a non sequitur, coin tosses are a great way to illustrate basic concepts.
Imagine, I tell you I have a coin and ask you to guess how often you think it will land heads if I flip it 100 times. Did you guess around 50? Why?
What you just did is called setting a "prior": you used your knowledge (prior belief) of other coins to estimate that most coins flip heads and tails with about even chances. Yay. So now I flip the coin 100 times and it comes up heads 52 times; do you think this is still a fair coin?
I imagine that you still do, but why? It didn't come up heads exactly 50 times, so why do we still believe it is a fair coin? Of course a fair coin won't always give exactly 50 heads in 100 flips; there's going to be some "wiggle" each time you do it. That wiggle is what in stats we capture with our credible interval. A 95% credible interval represents the region of values we'd expect to come up 95% of the time; occasionally, 5% of the time, our fair coin will come up with more extreme numbers of heads, but we expect 95% of the probability to be within the bounds.
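If you'd rather see that wiggle with code than with words, here's a quick simulation sketch (purely illustrative, using the numpy we imported above): flip a fair coin 100 times, repeat the experiment many times, and look at where the middle 95% of the head counts land.
In [ ]:
# Simulate many runs of "flip a fair coin 100 times" and find the interval
# that contains 95% of the head counts.
n_flips = 100
n_experiments = 10000
heads = np.random.binomial(n_flips, 0.5, size=n_experiments)
lower, upper = np.percentile(heads, [2.5, 97.5])
print lower, upper  # roughly 40 and 60 heads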
Let's return to our dataframe.
In [11]:
market_data.head().T
Out[11]:
We can now understand what our lower and upper bounds on impact mean: these are our 95% credible interval on the mean impact. Sometimes businesses will see less impact from their reviews and sometimes they'll see more than the mean impact, but 95% of the time their impact will be between the lower and upper bound.
One of the things we can do in Jupyter notebooks is define helper functions that live throughout the life of the notebook. Here I define a basic function to read in a JSON file, which for formatting reasons is a bit harder than reading a CSV. JSON is a file format that's a bit more flexible than CSV, but the price we pay is that we need to put in more work to read it.
Here I define a function with python. A function represents a reusable piece of code. Once you run this cell you'll be able to call json_reading_function from any other cell below or above this one; cells in Jupyter notebooks care about run order, not place on the page. If you need to modify this function you'll need to re-run this cell (shift-enter) to update the in-memory copy of it in your notebook.
In [12]:
def json_reading_function(file_path):
    # Each line of the file is a separate JSON record; build one dataframe from all of them.
    with open(file_path) as f:
        df = pd.DataFrame(json.loads(line) for line in f)
    return df
If you run the above cell does the number to the left of the cell change? What do you think this represents?
In [13]:
biz_data = json_reading_function(yelp_biz_dataset_file)
Now let's make a simple plot from our loaded biz_data. We can first look at the columns available using .head() like we did previously.
In [14]:
biz_data.head(3).T
Out[14]:
Ah, lots of information here. From the Yelp academic dataset guide we know that each row in this dataframe represents just one business at a specific moment in time; it's like a snapshot. Here we get the business location information, its categories and city, its hours, name, whether it is still open and operational, and its review count and average rating. Finally we get information about its attributes as a JSON blob in one of the dataframe columns.
Pandas provides a super useful and easy interface for making plots. Let's do a quick histogram of our review count and set the scale to logarithmic on the y-axis.
In [15]:
ax = biz_data.review_count.hist()
ax.set_yscale('log')
ax.set_xlabel('Number of reviews per biz')
ax.set_ylabel('Number of businesses')
Out[15]:
One of the things we can do with histograms is bin them with different numbers of bins like so:
In [16]:
ax = biz_data.review_count.hist(bins=1000)
ax.set_yscale('log')
ax.set_xlabel('Number of reviews per biz')
ax.set_ylabel('Number of businesses')
Out[16]:
How are these two plots different? Why? Would the story you felt you got from each plot be different?
Let's try some other basic plotting to get the hang of things. Suppose we wanted to plot the number of reviews versus the average rating?
In [17]:
ax = biz_data.plot(x='review_count', y='stars', linestyle='', marker='.', alpha=0.2)
ax.set_xlim([0, biz_data.review_count.max()])
ax.set_xlabel('Number of reviews')
ax.set_ylabel('Average Star Rating')
Out[17]:
What story about businesses does this tell us? Can your group think of any reasons the plot might have this shape? Try playing around with some of the above arguments to the plot function, what do they do and why?
Ok, let's return to the problem at hand: a coherent SEO strategy. Our SEO data comes segmented by market saturation, so let's see if we can corral our business data into market saturation levels. Here we'll use the .groupby and .agg (aggregate) syntax to slice up our data.
In [18]:
city_state_groups = biz_data.groupby(['state', 'city'])
In a groupby we ask pandas to set up artificial groups within our dataframe; when we operate on the groups we don't have to worry about anything "crossing over" from one group to another.
In [19]:
city_state_data = city_state_groups.agg({
'business_id': pd.Series.nunique,
'review_count': 'sum'
})
In [20]:
city_state_data.head(10).T
Out[20]:
Hmm, those are some odd cities, and how come Montreal only has 4 reviews and 1 biz? Something's off. Let's do a quick sort to figure it out.
In [21]:
city_state_data.sort_values(by='review_count', ascending=False).head(10).T
Out[21]:
Ah, our data aren't fully clean; that's pretty common, for example Montreal vs Montréal. Since we're just doing a minor example we won't spend too much time cleaning our data, but let's quickly see how the number of reviews stacks up.
In [22]:
def clean_city_str(city_str):
    # Normalize city names by stripping spaces and lowercasing.
    temp = city_str.replace(' ', '')
    temp = temp.lower()
    return temp
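As an aside, clean_city_str only strips spaces and lowercases, so 'Montréal' and 'Montreal' would still map to different keys. If you wanted to fold accented names together too, one possible sketch (using the standard-library unicodedata module; this isn't part of the tutorial's cleaning step) is:
In [ ]:
import unicodedata

def clean_city_str_no_accents(city_str):
    # Decompose accented characters and drop the combining marks,
    # e.g. u'Montr\xe9al' -> 'montreal'.
    temp = unicodedata.normalize('NFKD', unicode(city_str))
    temp = temp.encode('ascii', 'ignore')
    return temp.replace(' ', '').lower()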
In [23]:
biz_data['city_key'] = biz_data.city.apply(lambda x: clean_city_str(x))
In [24]:
city_state_data = biz_data.groupby(['state', 'city_key']).agg({
    'business_id': pd.Series.nunique,
    'review_count': 'sum'
})
# Keep state and city_key available as regular columns as well as in the index.
city_state_data['state'] = city_state_data.index.get_level_values('state')
city_state_data['city_key'] = city_state_data.index.get_level_values('city_key')
city_state_data.rename(columns={'business_id': 'num_biz'}, inplace=True)
In [25]:
fig, ax = plt.subplots(figsize=(20,10))
city_state_data.sort_values(by='review_count', ascending=False).head(100).review_count.plot(kind='bar', ax=ax)
ax.set_yscale('log')
ax.set_ylabel('review_count')
Out[25]:
Uh-oh. More long-tail data, as you may have guessed by now these sorts of patterns are super common when we look at real world applications and web data. Time for us to start thinking about how to deal with it.
Let's examine a few simple steps for working with long-tail data before moving back to our problem. The first trick is normalization. When we normalize we want to take out extraneous dependencies. For example, we know our cities above probably have different populations, and we should think about that before trying to determine what we mean by market saturation. The raw number of reviews in, say, Lead, SD and Las Vegas isn't a good indicator of market saturation simply because the populations, and thus the number of businesses in them, are vastly different. We can take out this population effect by simply dividing the total reviews by the number of businesses to get the average number of reviews per business in a city.
In [27]:
city_state_data['mean_num_reviews_per_biz'] = (
city_state_data.review_count / city_state_data.num_biz.astype('float')
)
In [28]:
fig, ax = plt.subplots(figsize=(20,10))
city_state_data.sort_values(
by='mean_num_reviews_per_biz',
ascending=False
).mean_num_reviews_per_biz.plot(kind='bar', ax=ax)
ax.set_yscale('log')
ax.set_ylabel('mean_num_reviews_per_biz')
Out[28]:
So what's wrong with the above, why might the average be misleading here? Think about it for a second then look at the next cell.
In [29]:
city_state_data.sort_values(
by='mean_num_reviews_per_biz',
ascending=False
).head(20)
Out[29]:
So most of our extremely high reviews-per-biz cities only have a tiny number of businesses, so these likely aren't terribly informative. How can we treat these? We need to take into account how many businesses are present as well.
In [30]:
def standard_err(x):
    # Standard error of the mean: standard deviation divided by sqrt(sample size).
    return np.std(x) / np.sqrt(len(x))
In [31]:
city_state_data['std_err'] = biz_data.groupby(['state', 'city_key'])['review_count'].agg(standard_err)
Now to use this information. We'll use the standard error to set a lower bound on the mean; we're technically treating these as normal distributions, but this is just an approximation. In reality we would use something like scipy's stats module to get confidence bounds, or bootstrap, but for the tutorial let's keep it simple.
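For reference, here's a rough sketch of what the bootstrap alternative mentioned above could look like for one city's review counts: resample the per-business counts with replacement, compute the mean of each resample, and take a low percentile as the lower bound. The function and the example city below are hypothetical, not part of this tutorial's analysis.
In [ ]:
def bootstrap_lower_bound(values, n_boot=1000, pct=2.5):
    # Resample with replacement, take the mean of each resample, and use a
    # low percentile of those means as a lower bound on the true mean.
    values = np.asarray(values)
    boot_means = [
        np.random.choice(values, size=len(values), replace=True).mean()
        for _ in range(n_boot)
    ]
    return np.percentile(boot_means, pct)

# Hypothetical usage for a single city's businesses:
# bootstrap_lower_bound(biz_data[biz_data.city_key == 'lasvegas'].review_count)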
In [32]:
# cut to eliminate places with no deviation.
city_state_data = city_state_data[city_state_data.std_err > 0]
In [33]:
city_state_data['lower_conf_mean_review'] = city_state_data.apply(
lambda x: max(x.mean_num_reviews_per_biz - 1.96 * x.std_err, 0),
1
)
In [34]:
ax = city_state_data.lower_conf_mean_review.hist(bins=100)
ax.set_ylabel('Number of cities')
ax.set_xlabel('lower bound on mean reviews per biz')
Out[34]:
So we're finally ready to ask our basic question: how can we make a coherent statement of where to focus our SEO efforts? To recap:
Critical to how we answer a question is identifying what we wish to use to measure success. Here we'll use a very simple proxy: we want to generate the maximal new page-views per year via our SEO strategy. We don't know how much work various types of SEO are, but we want to get a good sense of the trade-offs.
From our per-city data we need to define saturation levels. Looking at the distribution, let's make the following distinctions.
In [35]:
fig, ax = plt.subplots()
city_state_data.lower_conf_mean_review.hist(bins=100, ax=ax, color='#0073bb')
ax.set_ylabel('Number of cities')
ax.set_xlabel('lower bound on mean reviews per biz')
# Add transparent rectangles
head_patch = plt.matplotlib.patches.Rectangle((0,0), 1.5, 40, alpha=0.25, color='#41a700')
middle_patch = plt.matplotlib.patches.Rectangle((1.5,0), 7, 40, alpha=0.25, color='#0073bb')
tail_patch = plt.matplotlib.patches.Rectangle((8.5,0), 70, 40, alpha=0.25, color='#d32323')
ax.add_patch(head_patch)
ax.add_patch(middle_patch)
ax.add_patch(tail_patch)
# Add text annotations
ax.text(0.5,25,"New Market", color='#41a700', fontsize=16, rotation=90)
ax.text(7,25,"Mid Market", color='#0073bb', fontsize=16, rotation=90)
ax.text(30,25,"Saturated Market", color='#d32323', fontsize=16)
Out[35]:
Let's now apply these boundaries to our cities.
In [36]:
def assign_market_saturation(lower_bound_mean_reviews):
    # Bucket a city by its lower bound on mean reviews per business.
    saturation_level = None
    if lower_bound_mean_reviews <= 3:
        saturation_level = 'new_market'
    elif lower_bound_mean_reviews <= 9:
        saturation_level = 'mid_market'
    else:
        saturation_level = 'saturated_market'
    return saturation_level
In [37]:
city_state_data['market_saturation'] = city_state_data.lower_conf_mean_review.apply(
assign_market_saturation
)
In [38]:
city_state_data.sort_values(by='review_count', ascending=False).head().T
Out[38]:
Cool, now that we know the market saturation for each city, we can join it onto our per-business data.
In [39]:
state_city_saturation_df = city_state_data[['state', 'city_key', 'market_saturation']]
In [40]:
biz_data['city_key'] = biz_data.city.apply(clean_city_str)
In [41]:
# Merge on market saturation data
biz_data = biz_data.merge(state_city_saturation_df, how='inner', on=['state', 'city_key'])
So let's examine the impact of our focus now. To gauge the impact we might see, suppose we could add 1 review to every business with fewer than 10 reviews, or 1 review to every business with 10 or more reviews; what would we expect to see in page-views over the course of the year?
In [42]:
lookup_key = {
('<10', 'lower'): 'lower_bound_impact_for_first_ten_reviews',
('<10', 'mean'): 'mean_impact_for_first_ten_reviews',
('<10', 'upper'): 'upper_bound_impact_for_first_ten_reviews',
('10+', 'lower'): 'lower_bound_impact_for_higher_reviews',
('10+', 'mean'): 'mean_impact_for_higher_reviews',
('10+', 'upper'): 'upper_bound_impact_for_higher_reviews'
}
In [43]:
def yearly_impact_of_one_more_review(biz_row, impact='mean'):
    # Look up the expected pageview impact for this business's market saturation
    # and review-count bucket.
    impact_measures = market_data[market_data.market_saturation == biz_row.market_saturation]
    if biz_row.review_count >= 10:
        lookup_review_count = '10+'
    else:
        lookup_review_count = '<10'
    return impact_measures[lookup_key[(lookup_review_count, impact)]].values[0]
In [45]:
biz_data['mean_added_yearly_pageviews'] = biz_data.apply(
lambda x: yearly_impact_of_one_more_review(x),
1
)
biz_data['lower_added_yearly_pageviews'] = biz_data.apply(
lambda x: yearly_impact_of_one_more_review(x, 'lower'),
1
)
biz_data['upper_added_yearly_pageviews'] = biz_data.apply(
lambda x: yearly_impact_of_one_more_review(x, 'upper'),
1
)
In [46]:
biz_data['review_bucket'] = biz_data.review_count.apply(lambda x: '10+' if x >= 10 else '<10')
In [47]:
results = biz_data.groupby(['review_bucket']).agg(
{
'mean_added_yearly_pageviews': 'sum',
'lower_added_yearly_pageviews': 'sum',
'upper_added_yearly_pageviews': 'sum'
}
)
results
Out[47]:
In [48]:
ax = results.T.plot(kind='box')
ax.set_ylabel('Expected Pageviews')
ax.set_ylim([0, 3000000])
ax.set_xlabel('Review segment to target')
Out[48]:
So at first glance it looks like we're better off focusing on less-reviewed businesses, but are there any subtleties here?
In [51]:
by_market_saturation_results = biz_data.groupby(['market_saturation', 'review_bucket']).agg(
{
'mean_added_yearly_pageviews': 'sum',
'lower_added_yearly_pageviews': 'sum',
'upper_added_yearly_pageviews': 'sum'
}
)
In [52]:
by_market_saturation_results[
[
'mean_added_yearly_pageviews',
'lower_added_yearly_pageviews',
'upper_added_yearly_pageviews'
]
]
Out[52]:
In [53]:
ax = by_market_saturation_results[
[
'mean_added_yearly_pageviews',
'lower_added_yearly_pageviews',
'upper_added_yearly_pageviews'
]
].T.plot(kind='box')
ax.set_yscale('log')
plt.xticks(rotation=70)
Out[53]:
We can now get a rough estimate of the trade-offs we'd be willing to make in SEO. If we're told it's much harder to get early reviews, we'd know whether that effort is worth the ROI, and by what factor, in new, mid, and saturated markets.
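One way to put a rough number on that factor is to compare the summed expected page-views in the two review buckets directly. Here's a quick sketch using the results and by_market_saturation_results dataframes from above (the exact ratios will depend on your data):
In [ ]:
# Overall ratio of expected yearly page-views: targeting low-review businesses
# versus high-review businesses.
overall_ratio = (
    results.loc['<10', 'mean_added_yearly_pageviews'] /
    results.loc['10+', 'mean_added_yearly_pageviews']
)
print overall_ratio

# The same comparison within each market saturation level.
per_market = by_market_saturation_results['mean_added_yearly_pageviews'].unstack('review_bucket')
print per_market['<10'] / per_market['10+']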