- Patrick Phelps - Manager of Data Science @ Yelp
- Frances Haugen - Product Manager @ Pinterest

All numbers used in this exercise are part of Yelp's academic dataset challenge, to which we've added some **fake traffic data**.

In this notebook we're going to do a bit of data science using pandas, matplotlib, and some fake Yelp traffic data. We're going to create a better SEO strategy for our platform geared around optimizing for expected future traffic in various markets and in the process learn about "long-tail" or power-law data.

This notebook is the **full version** where all the steps are filled out. If you're looking for the "give me a challenge" version this isn't what you want.

Let's start with some problem framing. Your SEO analysts tell you they need to decide if they should focus their work on low review businesses or high review businesses this quarter. You want to have a coherent market strategy so you ask your web-ops team to get you some numbers on incoming traffic versus the number of reviews you've seen historically.

Here we'll use the Yelp academic dataset to get our business information and an included CSV file, yelp_dataset_cumulative_value_of_a_review_in_pageviews_per_year.csv, to represent the web-ops team's response on incoming traffic. We've already segmented this data by market "saturation", which is a rough gauge of how saturated we think Yelp usage is in a given city.

Here we're just doing some basic Python imports. For this tutorial you'll need the following Python packages:

- pandas
- matplotlib
- numpy
- json

This tutorial is written for Python 2.7.

In [1]:

```
%load_ext autoreload
```

In [3]:

```
%autoreload 2
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import json
pd.set_option('display.mpl_style', 'default')
plt.rcParams['figure.figsize'] = (12.0, 8.0)
```

Jupyter notebooks are a really wonderful way to merge code, data, and notes into a single cohesive analysis. The most important bit to understand for this tutorial is that notebooks run code in "cells". This text is in a cell for writing notes called a markdown cell. Here we can use HTML tags and standard Markdown syntax.

In [4]:

```
# This is a code cell: text here needs to either be Python code or commented out like this.
a = 5
print 1
```

In [5]:

```
base_file_path = '../data/'
```

In [6]:

```
yelp_biz_dataset_file = base_file_path + 'yelp_academic_dataset_business.json'
yelp_market_saturation_file = base_file_path + 'yelp_dataset_cumulative_value_of_a_review_in_pageviews_per_year.csv'
```

Let's start by importing our market pageview data:

In [7]:

```
market_data = pd.read_csv(yelp_market_saturation_file, index_col=0)
```

In [8]:

```
# To look at a dataframe use the .head() command. By default this shows up to the first 5 rows.
market_data.head()
```

Out[8]:

This dataframe only has 3 rows so that's all we see. We can always pass a number to df.head() to get more rows if they exist.

So what can we say from this dataframe? First, what is in it? Let's look at a subset of its columns first; we do this by passing a list of the columns we'd like to see, enclosed in square brackets.

In [9]:

```
market_data[[
    'market_saturation',
    'mean_impact_for_first_ten_reviews',
    'mean_impact_for_higher_reviews'
]].head()
```

Out[9]:

So what do these numbers mean? For each additional review on a business in a mid-market city we expect to get (on average, at the mean) an extra 135.426 cumulative page-views per year if that review is one of the first 10 reviews for that business, and an extra 84.14 page-views per year if that review comes after those first 10 reviews. These numbers are cumulative in this exercise because they take into account that more page-views lead to more reviews later; they represent the total cumulative impact per year of one new review.
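As a quick back-of-the-envelope sketch (a hypothetical business, using only the mid-market means quoted above), suppose a mid-market business went from 8 reviews to 12 reviews:

```
# Hypothetical: a mid-market business going from 8 to 12 reviews.
mean_impact_first_ten = 135.426  # cumulative page-views per year, per review, within the first 10
mean_impact_higher = 84.14       # cumulative page-views per year, per review, after the first 10

# Reviews 9 and 10 fall in the first-ten bucket; reviews 11 and 12 fall in the higher bucket.
extra_pageviews_per_year = 2 * mean_impact_first_ten + 2 * mean_impact_higher
print extra_pageviews_per_year  # roughly 439 extra cumulative page-views per year
```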

Does that make sense intuitively? We might imagine that the first 10 reviews allow a business to go from completely undiscovered to discovered, so they generate a lot of traffic as people discover that a place exists. On the other hand we don't see nearly as much impact past the first 10 reviews. That makes sense too: suppose a place had 200 reviews, how much would we expect traffic to improve if it got one more review?

For this tutorial I've simplified this to just a binary cut. In reality there's a time-dependent surface underlying what's going on, but for this exercise let's keep it simple, as we'll still capture the basic idea.

So what are the other columns in our market data? Let's transpose (rotate) our dataframe using the .T command. Notice how I can chain commands in pandas with the dot syntax.

In [10]:

```
market_data.head().T
```

Out[10]:

So we understand the mean impact rows in this dataframe, but what are these lower and upper bound columns? These represent our best guess at the limits of where the true impact value lies given our data; they represent our uncertainty.

If you don't know any statistics, no problem: we can think of our uncertainty as simply representing the bounds on where we'd be willing to place our bets. We can understand intuitively what gives rise to this. If we examined just one business we wouldn't be very sure what to expect for every one of the millions of businesses on Yelp; rather, we'd want to capture that we had very limited information.

Diving a bit more into the statistics, these bounds are our 95% credible interval. This sounds really fancy but is actually pretty easy to understand.

Let's unpack what a credible interval is in plain English using a simple coin toss. While this might seem a bit of a non sequitur, coin tosses are a great way to illustrate basic concepts.

Imagine, I tell you I have a coin and ask you to guess how often you think it will land heads if I flip it 100 times. Did you guess around 50? Why?


What you just did is called setting a "prior": you used your knowledge (prior belief) of other coins to estimate that most coins flip heads and tails with about even chances. Yay! So now I flip the coin 100 times and it comes up heads 52 times. Do you think this is still a fair coin?

I imagine that you still do, but why? It didn't come up heads exactly 50 times, so why do we still believe it is a fair coin? Of course a fair coin can sometimes give not-exactly-50 heads in 100 flips; there's going to be some "wiggle" each time you do it. That wiggle is what in stats we call our credible interval. A 95% credible interval represents the region of values we'd expect to come up 95% of the time. Occasionally, 5% of the time, our fair coin will come up with a more extreme number of heads, but we expect 95% of the probability to be within the bounds.
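If you'd like to see that wiggle concretely, here's a small simulation sketch (purely an illustration, reusing the numpy import from the setup cell above): flip a fair coin 100 times, repeat the experiment many times, and look at where the middle 95% of head counts land.

```
# Simulate 10,000 experiments of 100 fair-coin flips each.
n_flips = 100
n_experiments = 10000
heads = np.random.binomial(n=n_flips, p=0.5, size=n_experiments)

# The central 95% of outcomes -- roughly 40 to 60 heads for a fair coin,
# so 52 heads is well within the "wiggle" we expect.
print np.percentile(heads, 2.5), np.percentile(heads, 97.5)
```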

Let's return to our dataframe.

In [11]:

```
market_data.head().T
```

Out[11]:

In [12]:

```
def json_reading_function(file_path):
    # The business dataset has one JSON record per line.
    with open(file_path) as f:
        df = pd.DataFrame(json.loads(line) for line in f)
    return df
```

In [13]:

```
biz_data = json_reading_function(yelp_biz_dataset_file)
```

In [14]:

```
biz_data.head(3).T
```

Out[14]:

Pandas provides a super useful and easy interface for making plots. Let's do a quick histogram of our review counts and set the y-axis to a logarithmic scale.

In [15]:

```
ax = biz_data.review_count.hist()
ax.set_yscale('log')
ax.set_xlabel('Number of reviews per biz')
ax.set_ylabel('Number of businesses')
```

Out[15]:

One of the things we can do with histograms is bin them with different numbers of bins like so:

In [16]:

```
ax = biz_data.review_count.hist(bins=1000)
ax.set_yscale('log')
ax.set_xlabel('Number of reviews per biz')
ax.set_ylabel('Number of businesses')
```

Out[16]:

How are these two plots different? Why? Would the story you felt you got from each plot be different?

Let's try some other basic plotting to get the hang of things. Suppose we wanted to plot the number of reviews versus the average rating?

In [17]:

```
ax = biz_data.plot(x='review_count', y='stars', linestyle='', marker='.', alpha=0.2)
ax.set_xlim([0, biz_data.review_count.max()])
ax.set_xlabel('Number of reviews')
ax.set_ylabel('Average Star Rating')
```

Out[17]:

Ok, let's return to the problem at hand: a coherent SEO strategy. Our SEO data comes segmented by market saturation, so let's see if we can corral our business data into market saturation levels. Here we'll use the .groupby and .agg (aggregate) syntax to slice up our data.
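If .groupby and .agg are new to you, here's a tiny toy example (made-up rows, not from the Yelp dataset) showing the pattern we're about to apply: group rows by a key, then aggregate each group's columns with different functions.

```
# Toy data: three businesses across two made-up cities.
toy = pd.DataFrame({
    'city': ['springfield', 'springfield', 'shelbyville'],
    'business_id': ['a', 'b', 'c'],
    'review_count': [10, 4, 7]
})

# One row per city: count distinct businesses and sum their reviews.
print toy.groupby('city').agg({
    'business_id': pd.Series.nunique,
    'review_count': 'sum'
})
```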

In [18]:

```
city_state_groups = biz_data.groupby(['state', 'city'])
```

In [19]:

```
city_state_data = city_state_groups.agg({
    'business_id': pd.Series.nunique,
    'review_count': 'sum'
})
```

In [20]:

```
city_state_data.head(10).T
```

Out[20]:

In [21]:

```
city_state_data.sort_values(by='review_count', ascending=False).head(10).T
```

Out[21]:

In [22]:

```
def clean_city_str(city_str):
    temp = city_str.replace(' ', '')
    temp = temp.lower()
    return temp
```

In [23]:

```
biz_data['city_key'] = biz_data.city.apply(lambda x: clean_city_str(x))
```

In [24]:

```
city_state_data = biz_data.groupby(['state', 'city_key'], as_index=[False, False]).agg({
    'business_id': pd.Series.nunique,
    'review_count': 'sum',
    'state': 'first',
    'city_key': 'first'
})
city_state_data.rename(columns={'business_id': 'num_biz'}, inplace=True)
```

In [25]:

```
fig, ax = plt.subplots(figsize=(20, 10))
city_state_data.sort_values(by='review_count', ascending=False).head(100).review_count.plot(kind='bar', ax=ax)
ax.set_yscale('log')
ax.set_ylabel('review_count')
```

Out[25]:

In [27]:

```
city_state_data['mean_num_reviews_per_biz'] = (
    city_state_data.review_count / city_state_data.num_biz.astype('float')
)
```

In [28]:

```
fig, ax = plt.subplots(figsize=(20, 10))
city_state_data.sort_values(
    by='mean_num_reviews_per_biz',
    ascending=False
).mean_num_reviews_per_biz.plot(kind='bar', ax=ax)
ax.set_yscale('log')
ax.set_ylabel('mean_num_reviews_per_biz')
```

Out[28]:

In [29]:

```
city_state_data.sort_values(
    by='mean_num_reviews_per_biz',
    ascending=False
).head(20)
```

Out[29]:

In [30]:

```
def standard_err(x):
    # Standard error of the mean: standard deviation over the square root of the sample size.
    return np.std(x) / np.sqrt(len(x))
```

In [31]:

```
city_state_data['std_err'] = biz_data.groupby(['state', 'city_key']).agg({
    'review_count': standard_err
})
```

In [32]:

```
# Cut to eliminate places with no deviation.
city_state_data = city_state_data[city_state_data.std_err > 0]
```

In [33]:

```
city_state_data['lower_conf_mean_review'] = city_state_data.apply(
    lambda x: max(x.mean_num_reviews_per_biz - 1.96 * x.std_err, 0),
    1
)
```

In [34]:

```
ax = city_state_data.lower_conf_mean_review.hist(bins=100)
ax.set_ylabel('Number of cities')
ax.set_xlabel('lower bound on mean reviews per biz')
```

Out[34]:

So we're finally ready to ask our basic question: how can we make a coherent statement of where to focus our SEO efforts? To recap:

- We have data on how much traffic we hope to drive per review added, segmented by market saturation, in our market_data dataframe.
- We have data, per city in the academic dataset, on how saturated each market might be in terms of the number of reviews per business, in our city_state_data dataframe.
- We have data on the number of reviews per business currently in each city in our biz_data dataframe.

Critical to how we answer a question is identifying what we wish to use to measure success. Here we'll use a very simple proxy: we want to generate the maximal new page-views per year via our SEO strategy. We don't know how much work the various types of SEO are, but what we want to generate is a good sense of the trade-offs.
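In symbols, for a candidate strategy that targets a set of businesses $B$ (each getting one extra review), where the impact values are the per-review page-view figures from our market_data dataframe and the review bucket is just whether the business currently has fewer than 10 reviews or 10 or more, the quantity we'll compare is roughly

$$\text{expected new page-views per year} \approx \sum_{b \in B} \text{impact}\big(\text{market saturation of } b,\ \text{review bucket of } b\big)$$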

From our per-city data we need to define saturation levels. Looking at the distribution, let's make the following distinctions.

In [35]:

```
fig, ax = plt.subplots()
city_state_data.lower_conf_mean_review.hist(bins=100, ax=ax, color='#0073bb')
ax.set_ylabel('Number of cities')
ax.set_xlabel('lower bound on mean reviews per biz')
# Add transparent rectangles
head_patch = plt.matplotlib.patches.Rectangle((0, 0), 1.5, 40, alpha=0.25, color='#41a700')
middle_patch = plt.matplotlib.patches.Rectangle((1.5, 0), 7, 40, alpha=0.25, color='#0073bb')
tail_patch = plt.matplotlib.patches.Rectangle((8.5, 0), 70, 40, alpha=0.25, color='#d32323')
ax.add_patch(head_patch)
ax.add_patch(middle_patch)
ax.add_patch(tail_patch)
# Add text annotations
ax.text(0.5, 25, "New Market", color='#41a700', fontsize=16, rotation=90)
ax.text(7, 25, "Mid Market", color='#0073bb', fontsize=16, rotation=90)
ax.text(30, 25, "Saturated Market", color='#d32323', fontsize=16)
```

Out[35]:

Let's now apply these boundaries to our cities.

In [36]:

```
def assign_market_saturation(lower_bound_mean_reviews):
    saturation_level = None
    if lower_bound_mean_reviews <= 3:
        saturation_level = 'new_market'
    elif lower_bound_mean_reviews <= 9:
        saturation_level = 'mid_market'
    else:
        saturation_level = 'saturated_market'
    return saturation_level
```

In [37]:

```
city_state_data['market_saturation'] = city_state_data.lower_conf_mean_review.apply(
    assign_market_saturation
)
```

In [38]:

```
city_state_data.sort_values(by='review_count', ascending=False).head().T
```

Out[38]:

Cool, now that we know per city the market saturation we can join it onto our per business data.

In [39]:

```
state_city_saturation_df = city_state_data[['state', 'city_key', 'market_saturation']]
```

In [40]:

```
biz_data['city_key'] = biz_data.city.apply(clean_city_str)
```

In [41]:

```
# Merge on market saturation data
biz_data = biz_data.merge(state_city_saturation_df, how='inner', on=['state', 'city_key'])
```

So let's examine the impact of our focus now. To gauge the possible impact, suppose we could add 1 review to every business with fewer than 10 reviews, or 1 review to every business with 10 or more reviews. What would we expect to see in page-views over the course of the year?

In [42]:

```
lookup_key = {
    ('<10', 'lower'): 'lower_bound_impact_for_first_ten_reviews',
    ('<10', 'mean'): 'mean_impact_for_first_ten_reviews',
    ('<10', 'upper'): 'upper_bound_impact_for_first_ten_reviews',
    ('10+', 'lower'): 'lower_bound_impact_for_higher_reviews',
    ('10+', 'mean'): 'mean_impact_for_higher_reviews',
    ('10+', 'upper'): 'upper_bound_impact_for_higher_reviews'
}
```

In [43]:

```
def yearly_impact_of_one_more_review(biz_row, impact='mean'):
    # Pull the impact figures for this business's market saturation level.
    impact_measures = market_data[market_data.market_saturation == biz_row.market_saturation]
    if biz_row.review_count >= 10:
        lookup_review_count = '10+'
    else:
        lookup_review_count = '<10'
    return impact_measures[lookup_key[(lookup_review_count, impact)]].values[0]
```

In [45]:

```
biz_data['mean_added_yearly_pageviews'] = biz_data.apply(
    lambda x: yearly_impact_of_one_more_review(x),
    1
)
biz_data['lower_added_yearly_pageviews'] = biz_data.apply(
    lambda x: yearly_impact_of_one_more_review(x, 'lower'),
    1
)
biz_data['upper_added_yearly_pageviews'] = biz_data.apply(
    lambda x: yearly_impact_of_one_more_review(x, 'upper'),
    1
)
```

In [46]:

```
biz_data['review_bucket'] = biz_data.review_count.apply(lambda x: '10+' if x >= 10 else '<10')
```

In [47]:

```
results = biz_data.groupby(['review_bucket']).agg(
    {
        'mean_added_yearly_pageviews': 'sum',
        'lower_added_yearly_pageviews': 'sum',
        'upper_added_yearly_pageviews': 'sum'
    }
)
results
```

Out[47]:

In [48]:

```
ax = results.T.plot(kind='box')
ax.set_ylabel('Expected Pageviews')
ax.set_ylim([0, 3000000])
ax.set_xlabel('Review segment to target')
```

Out[48]:

In [51]:

```
by_market_saturation_results = biz_data.groupby(['market_saturation', 'review_bucket']).agg(
    {
        'mean_added_yearly_pageviews': 'sum',
        'lower_added_yearly_pageviews': 'sum',
        'upper_added_yearly_pageviews': 'sum'
    }
)
```

In [52]:

```
by_market_saturation_results[
    [
        'mean_added_yearly_pageviews',
        'lower_added_yearly_pageviews',
        'upper_added_yearly_pageviews'
    ]
]
```

Out[52]:

In [53]:

```
ax = by_market_saturation_results[
    [
        'mean_added_yearly_pageviews',
        'lower_added_yearly_pageviews',
        'upper_added_yearly_pageviews'
    ]
].T.plot(kind='box')
ax.set_yscale('log')
plt.xticks(rotation=70)
```

Out[53]: