Working with Data in Pandas

  • Patrick Phelps - Manager of Data Science @ Yelp
  • Frances Haugen - Product Manager @ Pinterest


All numbers used in this exercise are part of Yelp's academic dataset challenge, to which we've added some fake traffic data.

In this notebook we're going to do a bit of data science using pandas, matplotlib, and some fake Yelp traffic data. We're going to create a better SEO strategy for our platform geared around optimizing for expected future traffic in various markets and in the process learn about "long-tail" or power-law data.

This notebook is the full version where all the steps are filled out. If you're looking for the "give me a challenge" version this isn't what you want.

The Problem

Let's start with some problem framing. Your SEO analysts tell you they need to decide if they should focus their work on low review businesses or high review businesses this quarter. You want to have a coherent market strategy so you ask your web-ops team to get you some numbers on incoming traffic versus the number of reviews you've seen historically.

Here we'll use the Yelp academic dataset to get our business information, and an included CSV, the yelp_dataset_cumulative_value_of_a_review_in_pageviews_per_year file, to represent the web-ops team's response to you on incoming traffic. This data is already segmented by market "saturation," a rough gauge of how saturated we think Yelp usage is in a given city.

Setting Up our Notebook:

Here we're just doing some basic python importing. For this tutorial you'll need the following python packages:

  1. pandas
  2. matplotlib
  3. numpy
  4. json (part of the python standard library)

This tutorial is written for Python 2.7.

In [1]:
%load_ext autoreload

In [3]:
%autoreload 2
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import json

pd.set_option('display.mpl_style', 'default')
plt.rcParams['figure.figsize'] = (12.0, 8.0)

A tiny bit about jupyter notebooks

Jupyter notebooks are a really wonderful way to merge code, data, and notes into a single cohesive analysis. The most important bit to understand for this tutorial is that notebooks run code in "cells." This text is in a cell for writing notes called a markdown cell. Here we can use HTML tags and web markdown syntax.

In [4]:
# This is a code cell: here, text needs to either be python code or commented out like this.
a = 5
print 1


When we run a code cell (shift-enter) the notebook runs that bit of code; when we run a markdown cell, what happens? Try clicking on this to enter edit mode, doing a bit of editing, then hitting shift-enter. How does HTML tagging work in this sort of cell? Can you make this BOLD?

Loading the Data

Edit base_file_path below to point to where you've saved the yelp_academic_dataset_business.json file and the yelp_dataset_cumulative_value_of_a_review_in_pageviews_per_year.csv.

In [5]:
base_file_path = '../data/'

In [6]:
yelp_biz_dataset_file = base_file_path + 'yelp_academic_dataset_business.json'
yelp_market_saturation_file = base_file_path + 'yelp_dataset_cumulative_value_of_a_review_in_pageviews_per_year.csv'

Let's start by importing our market pageview data:

In [7]:
market_data = pd.read_csv(yelp_market_saturation_file, index_col=0)

The above loads the csv into a pandas dataframe, the most important data structure in pandas. It is easiest to think of a pandas dataframe as a special form of a spreadsheet; unlike a spreadsheet, we interact with our dataframe through code. Let's take a look at it first.
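If you've never met a dataframe before, here's a tiny toy example (made-up numbers, not the Yelp data) showing the spreadsheet analogy:

```python
import pandas as pd

# A toy dataframe: each dict key becomes a column, each list entry a row.
toy = pd.DataFrame({
    'city': ['Pittsburgh', 'Las Vegas'],
    'review_count': [10, 25],
})

print(toy.shape)                  # (2, 2): 2 rows by 2 columns
print(toy['review_count'].sum())  # 35
```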

In [8]:
# To look at a dataframe use the .head() command.  By default this shows up to the first 5 rows.
market_data.head()
lower_bound_impact_for_first_ten_reviews mean_impact_for_first_ten_reviews upper_bound_impact_for_first_ten_reviews lower_bound_impact_for_higher_reviews mean_impact_for_higher_reviews upper_bound_impact_for_higher_reviews market_saturation
0 82.357716 135.426633 188.495550 55.497045 84.137615 97.778184 mid_market
1 23.898631 55.844838 87.791045 16.705135 22.057810 27.410485 new_market
2 36.188677 50.271565 64.354453 36.418009 44.243010 47.068010 saturated_market

This dataframe only has 3 rows so that's all we see. We can always pass a number to df.head() to get more rows if they exist.

So what can we say from this dataframe? First, what is in it? Let's look at a subsection of columns first; we do this by passing a list of the columns we'd like to see, enclosed in square brackets.

In [9]:
market_data[['market_saturation',
             'mean_impact_for_first_ten_reviews',
             'mean_impact_for_higher_reviews']]
market_saturation mean_impact_for_first_ten_reviews mean_impact_for_higher_reviews
0 mid_market 135.426633 84.137615
1 new_market 55.844838 22.057810
2 saturated_market 50.271565 44.243010

So what do these numbers mean? For each additional review on a business in a mid-market city we expect to get (on average or at the mean) an extra 135.426 cumulative page-views per year if that review is one of the first 10 reviews for that business and 84.14 additional page-views per year if that review is after those first 10 reviews. These are cumulative in this exercise because they take into account that more page-views = more reviews later and represent the total cumulative impact per year of a new review.
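As a quick sanity check on that arithmetic, plugging in the mean values from the table above (a side calculation, not part of the original notebook):

```python
# Mean impacts for a mid-market city, read off the market_data table above.
mean_impact_first_ten = 135.426633
mean_impact_higher = 84.137615

# A mid-market business with 3 reviews gains 2 more: both land within the
# first ten, so each is worth the larger "first ten reviews" rate.
extra_pageviews_young_biz = 2 * mean_impact_first_ten   # ~270.9 per year

# The same 2 reviews on a business that already has 12 reviews use the
# smaller "higher reviews" rate instead.
extra_pageviews_mature_biz = 2 * mean_impact_higher     # ~168.3 per year
```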

Does that make sense intuitively? We might imagine that the first 10 reviews allow a business to go from completely undiscovered to discovered, so they generate a lot of traffic as people discover that a place exists. On the other hand, we don't see nearly as much impact past the first 10 reviews. That makes sense too: suppose a place had 200 reviews, how much do we expect traffic to improve if it got another review?

For this tutorial I've simplified this to just a binary cut. In reality there's a time-dependent surface underlying what's going on but for the exercise let's keep it simple as we'll still capture the basic idea.

So what are the other columns in our market data? Let's transpose (rotate) our dataframe using the .T command. Notice how I can chain commands in pandas with the dot syntax.

In [10]:
market_data.T
0 1 2
lower_bound_impact_for_first_ten_reviews 82.3577 23.8986 36.1887
mean_impact_for_first_ten_reviews 135.427 55.8448 50.2716
upper_bound_impact_for_first_ten_reviews 188.496 87.791 64.3545
lower_bound_impact_for_higher_reviews 55.497 16.7051 36.418
mean_impact_for_higher_reviews 84.1376 22.0578 44.243
upper_bound_impact_for_higher_reviews 97.7782 27.4105 47.068
market_saturation mid_market new_market saturated_market

A little bit (I promise) on statistics

So we understand the mean impact rows in this dataframe, but what are these lower and upper bound columns? These represent our best guess at the limits of where the true impact value lies given our data; they represent our uncertainty.

If you don't know any statistics, no problem: we can think of our uncertainty as simply representing the bounds on where we'd be willing to place our bets. We can understand intuitively what gives rise to this: if we examined just one business, we wouldn't be very sure what to expect for every one of the millions of businesses on Yelp; rather, we'd want to capture that we had very limited information.

Diving a bit more into the statistics, these bounds are our 95% credible interval. This sounds really fancy but is actually pretty easy to understand.

Let's unpack what a credible interval is in plain English using a simple coin toss. While this might seem a bit of a non sequitur, coin tosses are a great way to illustrate basic concepts.

Imagine, I tell you I have a coin and ask you to guess how often you think it will land heads if I flip it 100 times. Did you guess around 50? Why?


What you just did is called setting a "prior": you used your knowledge (prior belief) of other coins to estimate that most coins land heads and tails with about even chances. Yay! So now I flip the coin 100 times and it comes up heads 52 times; do you think this is still a fair coin?

I imagine that you still do, but why? It didn't come up heads exactly 50 times, so why do we still believe it is a fair coin? Of course, a fair coin can sometimes produce not-exactly-50 heads in 100 flips; there's going to be some "wiggle" each time you do it. That wiggle is what in stats we call our credible interval. A 95% credible interval represents the region of values we'd expect to come up 95% of the time; occasionally, 5% of the time, our fair coin will come up with more extreme numbers of heads, but we expect 95% of the probability to be within the bounds.
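We can check this coin-flip intuition with a quick simulation (a side demonstration, not part of the original exercise):

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed so the numbers are repeatable

# 10,000 experiments of 100 fair coin flips each; count the heads in each.
head_counts = rng.binomial(n=100, p=0.5, size=10000)

# The interval containing the middle 95% of outcomes:
lower, upper = np.percentile(head_counts, [2.5, 97.5])
print(lower, upper)  # roughly 40 and 60, so 52 heads is entirely unsurprising
```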

Let's return to our dataframe.

In [11]:
market_data.T
0 1 2
lower_bound_impact_for_first_ten_reviews 82.3577 23.8986 36.1887
mean_impact_for_first_ten_reviews 135.427 55.8448 50.2716
upper_bound_impact_for_first_ten_reviews 188.496 87.791 64.3545
lower_bound_impact_for_higher_reviews 55.497 16.7051 36.418
mean_impact_for_higher_reviews 84.1376 22.0578 44.243
upper_bound_impact_for_higher_reviews 97.7782 27.4105 47.068
market_saturation mid_market new_market saturated_market

We can now understand what our lower and upper bounds on impact mean: these are our 95% credible interval on the mean impact. Sometimes businesses will see less impact from their reviews and sometimes they'll see more than the mean impact, but 95% of the time their impact will be between the lower and upper bound.

Working with json

One of the things we can do in Jupyter notebooks is define helper functions that live throughout the life of the notebook. Here I define a basic function to read in a JSON file, which for formatting reasons is a bit harder than a CSV. JSON is a file format that's a bit more flexible than CSV, but the price we pay for that flexibility is that we need to put in more work to read it.

Here I define a function with python. A function represents a reusable piece of code. Once you run this cell you'll be able to call json_reading_function from any other cell below or above this one; cells in jupyter notebooks care about run order, not place on the page. If you need to modify this function you'll need to re-run this cell (shift-enter) to update the copy in your notebook's working memory.

In [12]:
def json_reading_function(file_path):
    with open(file_path) as f:
        df = pd.DataFrame(json.loads(line) for line in f)
    return df

If you run the above cell does the number to the left of the cell change? What do you think this represents?

In [13]:
biz_data = json_reading_function(yelp_biz_dataset_file)

Now let's make a simple plot from our loaded biz_data. We can first look at the columns available using .head() like we did previously.

In [14]:
biz_data.head(3).T
0 1 2
attributes {u'Take-out': True, u'Drive-Thru': False, u'Ou... {u'Happy Hour': True, u'Accepts Credit Cards':... {}
business_id 5UmKMjUEUNdYWqANhGckJw UsFtqoBl7naz8AVUBZMjQQ 3eu6MEFlq2Dg7bQh8QbdOg
categories [Fast Food, Restaurants] [Nightlife] [Auto Repair, Automotive]
city Dravosburg Dravosburg Dravosburg
full_address 4734 Lebanon Church Rd\nDravosburg, PA 15034 202 McClure St\nDravosburg, PA 15034 1 Ravine St\nDravosburg, PA 15034
hours {u'Tuesday': {u'close': u'21:00', u'open': u'1... {} {}
latitude 40.3543 40.3506 40.351
longitude -79.9007 -79.8868 -79.8891
name Mr Hoagie Clancy's Pub Joe Cislo's Auto
neighborhoods [] [] []
open True True True
review_count 4 4 3
stars 4.5 3.5 5
state PA PA PA
type business business business

Ah, lots of information here. From the Yelp academic dataset guide we know that each row in this dataframe represents just one business at a specific moment in time; it's like a snapshot. Here we get the business's location information, its categories and city, its hours, name, whether it is still open and operational, and its review count and average rating. Finally, we get information about its attributes as a JSON blob in one of the dataframe columns.

Making a histogram

Pandas provides a super useful and easy interface for making plots. Let's do a quick histogram of our review count and set the y-axis to a logarithmic scale.

In [15]:
ax = biz_data.review_count.hist()
ax.set_yscale('log')
ax.set_xlabel('Number of reviews per biz')
ax.set_ylabel('Number of businesses')

<matplotlib.text.Text at 0x129431890>

One of the things we can do with histograms is bin them with different numbers of bins like so:

In [16]:
ax = biz_data.review_count.hist(bins=1000)
ax.set_xlabel('Number of reviews per biz')
ax.set_ylabel('Number of businesses')

<matplotlib.text.Text at 0x1294a1e10>

How are these two plots different? Why? Would the story you felt you got from each plot be different?

Let's try some other basic plotting to get the hang of things. Suppose we wanted to plot number of reviews versus average rating?

In [17]:
ax = biz_data.plot(x='review_count', y='stars', linestyle='', marker='.', alpha=0.2)
ax.set_xlim([0, biz_data.review_count.max()])
ax.set_xlabel('Number of reviews')
ax.set_ylabel('Average Star Rating')

<matplotlib.text.Text at 0x12b8d5fd0>

What story about businesses does this tell us? Can your group think of any reasons the plot might have this shape? Try playing around with some of the above arguments to the plot function, what do they do and why?

Working with Groupby

Ok, let's return to our problem at hand: a coherent SEO strategy. Our traffic data comes segmented by market saturation, so let's see if we can corral our business data into market saturation levels too. Here we'll use the .groupby and .agg (aggregate) syntax to slice up our data.

In [18]:
city_state_groups = biz_data.groupby(['state', 'city'])

In a groupby we ask pandas to set up artificial groups within our dataframe; when we operate on the groups we don't have to worry about anything "crossing over" from one group to another.
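To make that concrete, here's a tiny toy groupby (made-up numbers, not the Yelp data):

```python
import pandas as pd

toy = pd.DataFrame({
    'state': ['PA', 'PA', 'NV'],
    'city': ['Pittsburgh', 'Pittsburgh', 'Las Vegas'],
    'review_count': [4, 6, 25],
})

# Each unique (state, city) pair becomes its own group; 'sum' runs inside
# each group independently, so Pittsburgh never mixes with Las Vegas.
per_city = toy.groupby(['state', 'city']).agg({'review_count': 'sum'})

print(per_city.loc[('PA', 'Pittsburgh'), 'review_count'])  # 10
print(per_city.loc[('NV', 'Las Vegas'), 'review_count'])   # 25
```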

In [19]:
city_state_data = city_state_groups.agg({
        'business_id': pd.Series.nunique,
        'review_count': 'sum'
})

In [20]:
city_state_data.head(10).T
state AL AR AZ
city Montreal La Paz Avondale Ahwatukee Ahwatukee Foothills Village Anthem Apache Junction Arcadia Arlington
review_count 4 4 23 3 547 4 2180 2004 119 10
business_id 1 1 1 1 13 1 123 157 2 2

Hmm, those are some odd cities, and how come Montreal only has 4 reviews and 1 biz? Something's off. Let's do a quick sort to figure it out.

In [21]:
city_state_data.sort_values(by='review_count', ascending=False).head(10).T

city Las Vegas Phoenix Scottsdale Charlotte Tempe Pittsburgh Henderson Montréal Chandler Mesa
review_count 957676 330437 179521 131833 95650 94942 85463 69973 64921 63897
business_id 17422 10629 5139 5188 2773 3337 2839 3891 2425 3190

Ah, our data aren't fully clean; that's pretty common (for example, Montreal vs Montréal). Since we're just doing a minor example we won't spend too much time cleaning our data, but let's quickly see how the number of reviews stacks up.

In [22]:
def clean_city_str(city_str):
    temp = city_str.replace(' ', '')
    temp = temp.lower()
    return temp

In [23]:
biz_data['city_key'] = x: clean_city_str(x))

In [24]:
city_state_data = biz_data.groupby(['state', 'city_key'], as_index=[False, False]).agg({
        'business_id': pd.Series.nunique,
        'review_count': 'sum',
        'state': 'first',
        'city_key': 'first'
})
city_state_data.rename(columns={'business_id': 'num_biz'}, inplace=True)

In [25]:
fig, ax = plt.subplots(figsize=(20,10))
city_state_data.sort_values(by='review_count', ascending=False).head(100).review_count.plot(kind='bar', ax=ax)

<matplotlib.text.Text at 0x12b924f90>

Uh-oh. More long-tail data, as you may have guessed by now these sorts of patterns are super common when we look at real world applications and web data. Time for us to start thinking about how to deal with it.

Tricks for dealing with long-tail data

Let's examine a few simple tricks for working with long-tail data before moving back to our problem. The first trick is normalization. When we normalize we want to take out extraneous dependencies. For example, we know our cities above probably have very different populations, which we should account for before trying to determine what we mean by market saturation. The raw number of reviews in, say, Lead, SD versus Las Vegas isn't a good indicator of relative market saturation simply because the populations, and thus the numbers of businesses, are vastly different. We can take out this population effect by simply dividing total reviews by the number of businesses, giving the average number of reviews per business in a city.

In [27]:
city_state_data['mean_num_reviews_per_biz'] = (
    city_state_data.review_count / city_state_data.num_biz.astype('float')
)

In [28]:
fig, ax = plt.subplots(figsize=(20,10))
    by='mean_num_reviews_per_biz', ascending=False
).head(100).mean_num_reviews_per_biz.plot(kind='bar', ax=ax)

<matplotlib.text.Text at 0x12d2c4410>

So what's wrong with the above, why might the average be misleading here? Think about it for a second then look at the next cell.

In [29]:
    by='mean_num_reviews_per_biz', ascending=False
).head(20)
state review_count num_biz city_key mean_num_reviews_per_biz
state city_key
TX dallas TX 990 1 dallas 990.000000
PA lowerlawrenceville PA 182 1 lowerlawrenceville 182.000000
AZ northscottsdale AZ 128 1 northscottsdale 128.000000
blackcanyoncity AZ 379 3 blackcanyoncity 126.333333
goldfield AZ 116 1 goldfield 116.000000
NV n.lasvegas NV 200 2 n.lasvegas 100.000000
PA pittsburgh/waterfront PA 99 1 pittsburgh/waterfront 99.000000
AZ sedona AZ 97 1 sedona 97.000000
CA burbank CA 87 1 burbank 87.000000
AZ phoenix-ahwatukee AZ 81 1 phoenix-ahwatukee 81.000000
PA greentree PA 65 1 greentree 65.000000
AZ scottdale AZ 249 4 scottdale 62.250000
PA lawrenceville PA 121 2 lawrenceville 60.500000
AZ arcadia AZ 119 2 arcadia 59.500000
NV lasvegas NV 957737 17430 lasvegas 54.947619
AZ grandcanyon AZ 52 1 grandcanyon 52.000000
NV clark NV 193 4 clark 48.250000
whenderson NV 47 1 whenderson 47.000000
AZ glendaleaz AZ 46 1 glendaleaz 46.000000
SC ft.mill SC 45 1 ft.mill 45.000000

So most of our extremely-high-reviews-per-biz cities have only a tiny number of businesses, so these likely aren't terribly informative. How can we treat these? We need to take into account our knowledge of how many businesses are present as well.

In [30]:
def standard_err(x):
    return np.std(x)/np.sqrt(len(x))

In [31]:
city_state_data['std_err'] = biz_data.groupby(['state', 'city_key']).agg({
        'review_count': standard_err
}).review_count

Now to use this information. We'll use the standard error to set a lower bound on the mean. We're technically treating these as normal distributions, which is just an approximation; in reality we would use something like scipy's stats module to get confidence bounds, or bootstrap, but for the tutorial let's keep it simple.
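For the curious, here's roughly what the bootstrap alternative mentioned above looks like, sketched with numpy on a made-up sample of per-business review counts (in practice you'd reach for scipy or statsmodels):

```python
import numpy as np

rng = np.random.RandomState(42)
sample = np.array([4, 4, 23, 3, 547, 4, 12, 7])  # made-up review counts

# Resample with replacement many times and record each resample's mean;
# the spread of those means approximates our uncertainty about the true mean.
boot_means = [rng.choice(sample, size=len(sample), replace=True).mean()
              for _ in range(5000)]

lower, upper = np.percentile(boot_means, [2.5, 97.5])
# (lower, upper) is a 95% bootstrap interval on the mean reviews per biz;
# note how wide it is: one 547-review outlier dominates a sample this small.
```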

In [32]:
# cut to eliminate places with no deviation.
city_state_data = city_state_data[city_state_data.std_err > 0]

In [33]:
city_state_data['lower_conf_mean_review'] = city_state_data.apply(
    lambda x: max(x.mean_num_reviews_per_biz - 1.96 * x.std_err, 0),
    axis=1
)

In [34]:
ax = city_state_data.lower_conf_mean_review.hist(bins=100)
ax.set_ylabel('Number of cities')
ax.set_xlabel('lower bound on mean reviews per biz')

<matplotlib.text.Text at 0x12dd52350>

Putting it all together

So we're finally ready to ask our basic question: how can we make a coherent statement of where to focus our SEO efforts? To recap:

  1. We have data on how much traffic we hope to drive per review added, in our market_data dataframe, segmented by market saturation.
  2. We have data, per city in the academic dataset, on how saturated the various markets might be in terms of number of reviews per biz, in our city_state_data dataframe.
  3. We have data on the number of reviews per business currently in each city in our biz_data dataframe.

Defining a clear objective

Critical to how we answer a question is identifying what we wish to use to measure success. Here we'll use a very simple proxy: we want to generate the maximal new page-views per year via our SEO strategy. We don't know how much work various types of SEO are, but what we want is a good sense of the trade-offs.

Assign market-saturation levels to cities

From our per-city data we need to define saturation levels. Looking at the distribution, let's make the following distinctions.

In [35]:
fig, ax = plt.subplots()
city_state_data.lower_conf_mean_review.hist(bins=100, ax=ax, color='#0073bb')
ax.set_ylabel('Number of cities')
ax.set_xlabel('lower bound on mean reviews per biz')

# Add transparent rectangles marking the three market segments
head_patch = plt.matplotlib.patches.Rectangle((0,0), 1.5, 40, alpha=0.25, color='#41a700')
middle_patch = plt.matplotlib.patches.Rectangle((1.5,0), 7, 40, alpha=0.25, color='#0073bb')
tail_patch = plt.matplotlib.patches.Rectangle((8.5,0), 70, 40, alpha=0.25, color='#d32323')
for patch in (head_patch, middle_patch, tail_patch):
    ax.add_patch(patch)

# Add text annotations
ax.text(0.5,25,"New Market", color='#41a700', fontsize=16, rotation=90)
ax.text(7,25,"Mid Market", color='#0073bb', fontsize=16, rotation=90)
ax.text(30,25,"Saturated Market", color='#d32323', fontsize=16)

<matplotlib.text.Text at 0x1299abf90>

Let's now apply these boundaries to our data cities.

In [36]:
def assign_market_saturation(lower_bound_mean_reviews):
    saturation_level = None
    if lower_bound_mean_reviews <= 3:
        saturation_level = 'new_market'
    elif lower_bound_mean_reviews <= 9:
        saturation_level = 'mid_market'
    else:
        saturation_level = 'saturated_market'
    return saturation_level

In [37]:
city_state_data['market_saturation'] = city_state_data.lower_conf_mean_review.apply(
    assign_market_saturation)

In [38]:
city_state_data.sort_values(by='review_count', ascending=False).head().T

state NV AZ AZ NC AZ
city_key lasvegas phoenix scottsdale charlotte tempe
state NV AZ AZ NC AZ
review_count 957737 330460 179521 131853 95653
num_biz 17430 10632 5139 5190 2774
city_key lasvegas phoenix scottsdale charlotte tempe
mean_num_reviews_per_biz 54.9476 31.0816 34.9331 25.4052 34.482
std_err 1.37225 0.740482 1.05808 0.729656 1.38092
lower_conf_mean_review 52.258 29.6303 32.8592 23.9751 31.7754
market_saturation saturated_market saturated_market saturated_market saturated_market saturated_market

Cool, now that we know each city's market saturation, we can join it onto our per-business data.

In [39]:
state_city_saturation_df = city_state_data[['state', 'city_key', 'market_saturation']]

In [40]:
biz_data['city_key'] = x: clean_city_str(x))

In [41]:
# Merge on market saturation data
biz_data = biz_data.merge(state_city_saturation_df, how='inner', on=['state', 'city_key'])

Where should we focus?

So let's examine the impact of our focus now. To gauge it, suppose we could add 1 review to all businesses below 10 reviews, or 1 review to all businesses at or above 10 reviews: what would we expect to see in page-views over the course of the year?

In [42]:
lookup_key = {
    ('<10', 'lower'): 'lower_bound_impact_for_first_ten_reviews',
    ('<10', 'mean'): 'mean_impact_for_first_ten_reviews',
    ('<10', 'upper'): 'upper_bound_impact_for_first_ten_reviews',
    ('10+', 'lower'): 'lower_bound_impact_for_higher_reviews',
    ('10+', 'mean'): 'mean_impact_for_higher_reviews',
    ('10+', 'upper'): 'upper_bound_impact_for_higher_reviews'
}

In [43]:
def yearly_impact_of_one_more_review(biz_row, impact='mean'):
    impact_measures = market_data[market_data.market_saturation == biz_row.market_saturation]
    if biz_row.review_count >= 10:
        lookup_review_count = '10+'
    else:
        lookup_review_count = '<10'
    return impact_measures[lookup_key[(lookup_review_count, impact)]].values[0]

In [45]:
biz_data['mean_added_yearly_pageviews'] = biz_data.apply(
    lambda x: yearly_impact_of_one_more_review(x),
    axis=1)
biz_data['lower_added_yearly_pageviews'] = biz_data.apply(
    lambda x: yearly_impact_of_one_more_review(x, 'lower'),
    axis=1)
biz_data['upper_added_yearly_pageviews'] = biz_data.apply(
    lambda x: yearly_impact_of_one_more_review(x, 'upper'),
    axis=1)

In [46]:
biz_data['review_bucket'] = biz_data.review_count.apply(lambda x: '10+' if x >= 10 else '<10')

In [47]:
results = biz_data.groupby(['review_bucket']).agg({
        'mean_added_yearly_pageviews': 'sum',
        'lower_added_yearly_pageviews': 'sum',
        'upper_added_yearly_pageviews': 'sum'
})
results


upper_added_yearly_pageviews lower_added_yearly_pageviews mean_added_yearly_pageviews
10+ 1.806990e+06 1.384398e+06 1.693171e+06
<10 2.766950e+06 1.511487e+06 2.139218e+06

In [48]:
ax = results.T.plot(kind='box')
ax.set_ylabel('Expected Pageviews')
ax.set_ylim([0, 3000000])
ax.set_xlabel('Review segment to target')

<matplotlib.text.Text at 0x12cdfea50>

So at first glance it looks like we're better off focusing on less-reviewed businesses, but are there any subtleties here?

In [51]:
by_market_saturation_results = biz_data.groupby(['market_saturation', 'review_bucket']).agg({
        'mean_added_yearly_pageviews': 'sum',
        'lower_added_yearly_pageviews': 'sum',
        'upper_added_yearly_pageviews': 'sum'
})

In [52]:
by_market_saturation_results
mean_added_yearly_pageviews lower_added_yearly_pageviews upper_added_yearly_pageviews
market_saturation review_bucket
mid_market 10+ 5.662461e+04 3.734951e+04 6.580472e+04
<10 2.361840e+05 1.436319e+05 3.287362e+05
new_market 10+ 7.940812e+02 6.013849e+02 9.867774e+02
<10 7.092294e+03 3.035126e+03 1.114946e+04
saturated_market 10+ 1.635753e+06 1.346447e+06 1.740198e+06
<10 1.895942e+06 1.364820e+06 2.427064e+06

In [53]:
ax = by_market_saturation_results[
    ['mean_added_yearly_pageviews', 'lower_added_yearly_pageviews', 'upper_added_yearly_pageviews']
].plot(kind='bar')
ax.set_ylabel('Expected Pageviews')
plt.xticks(rotation=45)

(array([1, 2, 3, 4, 5, 6]), <a list of 6 Text xticklabel objects>)

We can now get a rough estimate of the trade-offs we'd be willing to make in SEO. If we're told it's much harder to get early reviews, we now know by what factor that extra effort would need to pay off in mid, new, and saturated markets.
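For example, reading the mean column off the overall results table above, we can compute the break-even cost ratio between the two strategies (a quick side calculation, using the totals printed earlier):

```python
# Mean expected added yearly pageviews, from the results table above:
impact_low_review_biz = 2.139218e6   # 1 more review on every '<10' business
impact_high_review_biz = 1.693171e6  # 1 more review on every '10+' business

# If winning an early review costs more than ~1.26x what a later review
# costs, targeting the high-review segment becomes the better bet.
breakeven_cost_ratio = impact_low_review_biz / impact_high_review_biz
print(round(breakeven_cost_ratio, 2))  # 1.26
```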