Introduction to Data Analysis and Plotting in Python

This tutorial is...

  • introductory and geared towards people new to Python, but familiar with STATA, MATLAB, R, etc.
  • interactive with simple exercises along the way

We'll be introducing Pandas for data handling and analysis, and using Matplotlib, Seaborn and Bokeh for visualization

Please execute sequentially

What is pandas?

Pandas provides objects for working with data - Series and DataFrame

Data structures have methods for manipulating data eg. indexing, sorting, grouping, filling in missing data

Pandas does not provide modeling tools eg. regression, prediction

  • These tools are found in packages such as scikit-learn and statsmodels, which are built on top of pandas

DataFrames

A DataFrame combines multiple 'columns' of data into a two-dimensional object, similar to a spreadsheet


In [1]:
from IPython.display import Image

Image('./dataframe.jpg')


Out[1]:

We will create a DataFrame by reading in a CSV file and assigning it to the variable name data

Info on the data set

  • The data come from 'Assessing the Rate of Replication in Economics', American Economics Review: Papers & Proceedings, 2017
  • The dataframe data contains 70 empirical papers from AER's 100th volume
  • Web of Science: number of Web of Science citations
  • Top 200: number of Top 200 economics journal citations
  • Google Scholar: number of Google Scholar citations
  • Replications: number of citations that are replications of the paper
  • Extensions: number of citations that are extensions of the paper
  • Both: number of citations that are either replications or extensions of the paper

In [2]:
import pandas as pd  # Import the package

data = pd.read_csv('data.csv')
data.head()


Out[2]:
ID Title Web of Science Top 200 Google Scholar Field Replications Extensions Both
0 2 Learning about a New Technology: Pineapple in ... 176 108.0 1246 Development 1.0 10.0 11.0
1 3 Multiple-Product Firms and Product Switching 111 74.0 764 Labor/IO 1.0 2.0 3.0
2 4 Momma's Got the Pill": How Anthony Comstock an... 23 15.0 98 Labor/IO 2.0 0.0 2.0
3 5 Matching and Sorting in Online Dating 75 23.0 326 Labor/IO 0.0 5.0 5.0
4 6 Entry, Exit, and Investment-Specific Technical... 9 9.0 38 Macro/Int/Trade 0.0 0.0 0.0

In [3]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 9 columns):
ID                70 non-null int64
Title             70 non-null object
Web of Science    70 non-null int64
Top 200           70 non-null float64
Google Scholar    70 non-null int64
Field             70 non-null object
Replications      70 non-null float64
Extensions        70 non-null float64
Both              70 non-null float64
dtypes: float64(4), int64(3), object(2)
memory usage: 5.0+ KB

We can access individual columns of data, returning a Series


In [4]:
data['Top 200'].head()


Out[4]:
0    108.0
1     74.0
2     15.0
3     23.0
4      9.0
Name: Top 200, dtype: float64

We can also select multiple columns, returning a new dataframe


In [5]:
data[['Title', 'Top 200']].head()


Out[5]:
Title Top 200
0 Learning about a New Technology: Pineapple in ... 108.0
1 Multiple-Product Firms and Product Switching 74.0
2 Momma's Got the Pill": How Anthony Comstock an... 15.0
3 Matching and Sorting in Online Dating 23.0
4 Entry, Exit, and Investment-Specific Technical... 9.0

We can add a new column to our dataframe like so


In [6]:
data['Total'] = data['Web of Science'] + data['Top 200'] + data['Google Scholar']
data.head()


Out[6]:
ID Title Web of Science Top 200 Google Scholar Field Replications Extensions Both Total
0 2 Learning about a New Technology: Pineapple in ... 176 108.0 1246 Development 1.0 10.0 11.0 1530.0
1 3 Multiple-Product Firms and Product Switching 111 74.0 764 Labor/IO 1.0 2.0 3.0 949.0
2 4 Momma's Got the Pill": How Anthony Comstock an... 23 15.0 98 Labor/IO 2.0 0.0 2.0 136.0
3 5 Matching and Sorting in Online Dating 75 23.0 326 Labor/IO 0.0 5.0 5.0 424.0
4 6 Entry, Exit, and Investment-Specific Technical... 9 9.0 38 Macro/Int/Trade 0.0 0.0 0.0 56.0

If we want to know the average number of Web of Science citations...


In [7]:
data.mean()


Out[7]:
ID                 52.485714
Web of Science     42.071429
Top 200            22.257143
Google Scholar    227.571429
Replications        0.742857
Extensions          1.728571
Both                2.471429
Total             291.900000
dtype: float64

.describe() returns useful summary statistics


In [8]:
data.describe()


Out[8]:
ID Web of Science Top 200 Google Scholar Replications Extensions Both Total
count 70.000000 70.000000 70.000000 70.000000 70.000000 70.000000 70.000000 70.000000
mean 52.485714 42.071429 22.257143 227.571429 0.742857 1.728571 2.471429 291.900000
std 31.465068 43.070660 23.182131 257.258375 1.741944 2.868729 3.794433 319.403935
min 2.000000 1.000000 0.000000 7.000000 0.000000 0.000000 0.000000 8.000000
25% 23.250000 17.250000 9.000000 73.250000 0.000000 0.000000 0.000000 101.000000
50% 48.500000 28.500000 15.000000 139.000000 0.000000 0.000000 1.000000 178.000000
75% 81.750000 51.000000 23.750000 267.250000 1.000000 2.750000 3.750000 330.750000
max 104.000000 195.000000 108.000000 1246.000000 9.000000 16.000000 18.000000 1530.000000

Exercises

Find the means of Top 200 and Google Scholar citations


In [ ]:

data[['Top 200', 'Google Scholar']].mean()
Top 200            22.257143
Google Scholar    227.571429
dtype: float64

Create a table of summary statistics of Top 200 citations and Google Scholar citations


In [ ]:

data[['Top 200', 'Google Scholar']].describe()
Top 200 Google Scholar
count 70.000000 70.000000
mean 22.257143 227.571429
std 23.182131 257.258375
min 0.000000 7.000000
25% 9.000000 73.250000
50% 15.000000 139.000000
75% 23.750000 267.250000
max 108.000000 1246.000000

Create a table of summary statistics of the paper's fields (according to the variable Field)


In [ ]:

data['Field'].describe()
count                  70
unique                  5
top       Macro/Int/Trade
freq                   25
Name: Field, dtype: object

Selecting and filtering

We can use integer slicing to select rows as follows


In [9]:
data[:5]


Out[9]:
ID Title Web of Science Top 200 Google Scholar Field Replications Extensions Both Total
0 2 Learning about a New Technology: Pineapple in ... 176 108.0 1246 Development 1.0 10.0 11.0 1530.0
1 3 Multiple-Product Firms and Product Switching 111 74.0 764 Labor/IO 1.0 2.0 3.0 949.0
2 4 Momma's Got the Pill": How Anthony Comstock an... 23 15.0 98 Labor/IO 2.0 0.0 2.0 136.0
3 5 Matching and Sorting in Online Dating 75 23.0 326 Labor/IO 0.0 5.0 5.0 424.0
4 6 Entry, Exit, and Investment-Specific Technical... 9 9.0 38 Macro/Int/Trade 0.0 0.0 0.0 56.0

We might want to find the top cited papers

First we will sort our values by a column in the dataframe


In [10]:
pd.set_option('max_colwidth', 100)  # adjust column width

data[['Title', 'Google Scholar']].sort_values(by='Google Scholar', ascending=False)[:5]


Out[10]:
Title Google Scholar
0 Learning about a New Technology: Pineapple in Ghana 1246
21 The Macroeconomic Effects of Tax Changes: Estimates Based on a New Measure of Fiscal Shocks 1245
33 What Causes Industry Agglomeration? Evidence from Coagglomeration Patterns 766
1 Multiple-Product Firms and Product Switching 764
15 Social Preferences, Beliefs, and the Dynamics of Free Riding in Public Goods Experiments 736

Another way to select rows is to use row labels, ie. set a row index

Similar to the column labels, we can add row labels (the index)


In [11]:
data.set_index('Title').head()


Out[11]:
ID Web of Science Top 200 Google Scholar Field Replications Extensions Both Total
Title
Learning about a New Technology: Pineapple in Ghana 2 176 108.0 1246 Development 1.0 10.0 11.0 1530.0
Multiple-Product Firms and Product Switching 3 111 74.0 764 Labor/IO 1.0 2.0 3.0 949.0
Momma's Got the Pill": How Anthony Comstock and Griswold v. Connecticut Shaped US Childbearing 4 23 15.0 98 Labor/IO 2.0 0.0 2.0 136.0
Matching and Sorting in Online Dating 5 75 23.0 326 Labor/IO 0.0 5.0 5.0 424.0
Entry, Exit, and Investment-Specific Technical Change 6 9 9.0 38 Macro/Int/Trade 0.0 0.0 0.0 56.0

Note: we haven't actually changed data


In [12]:
data.head()


Out[12]:
ID Title Web of Science Top 200 Google Scholar Field Replications Extensions Both Total
0 2 Learning about a New Technology: Pineapple in Ghana 176 108.0 1246 Development 1.0 10.0 11.0 1530.0
1 3 Multiple-Product Firms and Product Switching 111 74.0 764 Labor/IO 1.0 2.0 3.0 949.0
2 4 Momma's Got the Pill": How Anthony Comstock and Griswold v. Connecticut Shaped US Childbearing 23 15.0 98 Labor/IO 2.0 0.0 2.0 136.0
3 5 Matching and Sorting in Online Dating 75 23.0 326 Labor/IO 0.0 5.0 5.0 424.0
4 6 Entry, Exit, and Investment-Specific Technical Change 9 9.0 38 Macro/Int/Trade 0.0 0.0 0.0 56.0

We need to reassign the variable name data to the new copy


In [13]:
# data.reset_index(inplace=True)   # Uncomment this if you receive an error

data = data.set_index('Title')   # Can also use data.set_index('Title', inplace=True)
data.head()


Out[13]:
ID Web of Science Top 200 Google Scholar Field Replications Extensions Both Total
Title
Learning about a New Technology: Pineapple in Ghana 2 176 108.0 1246 Development 1.0 10.0 11.0 1530.0
Multiple-Product Firms and Product Switching 3 111 74.0 764 Labor/IO 1.0 2.0 3.0 949.0
Momma's Got the Pill": How Anthony Comstock and Griswold v. Connecticut Shaped US Childbearing 4 23 15.0 98 Labor/IO 2.0 0.0 2.0 136.0
Matching and Sorting in Online Dating 5 75 23.0 326 Labor/IO 0.0 5.0 5.0 424.0
Entry, Exit, and Investment-Specific Technical Change 6 9 9.0 38 Macro/Int/Trade 0.0 0.0 0.0 56.0

In [14]:
data.loc['Matching and Sorting in Online Dating']


Out[14]:
ID                       5
Web of Science          75
Top 200                 23
Google Scholar         326
Field             Labor/IO
Replications             0
Extensions               5
Both                     5
Total                  424
Name: Matching and Sorting in Online Dating, dtype: object

In [15]:
data.loc[['Matching and Sorting in Online Dating', 
          'Learning about a New Technology: Pineapple in Ghana']]


Out[15]:
ID Web of Science Top 200 Google Scholar Field Replications Extensions Both Total
Title
Matching and Sorting in Online Dating 5 75 23.0 326 Labor/IO 0.0 5.0 5.0 424.0
Learning about a New Technology: Pineapple in Ghana 2 176 108.0 1246 Development 1.0 10.0 11.0 1530.0

Alternatively, we can filter our dataframe (select rows) using boolean conditions


In [16]:
data.reset_index(inplace=True)

data['Field'] == 'Development'


Out[16]:
0      True
1     False
2     False
3     False
4     False
5     False
6     False
7      True
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
      ...  
40    False
41    False
42     True
43    False
44    False
45     True
46    False
47    False
48    False
49    False
50    False
51    False
52    False
53    False
54    False
55     True
56    False
57    False
58    False
59    False
60    False
61    False
62    False
63    False
64     True
65    False
66    False
67    False
68    False
69    False
Name: Field, Length: 70, dtype: bool

Selecting rows with this boolean condition will return only rows of the dataframe where Field == 'Development' is True


In [17]:
data[data['Field'] == 'Development']


Out[17]:
Title ID Web of Science Top 200 Google Scholar Field Replications Extensions Both Total
0 Learning about a New Technology: Pineapple in Ghana 2 176 108.0 1246 Development 1.0 10.0 11.0 1530.0
7 Multinationals and Anti-Sweatshop Activism 9 21 8.0 134 Development 0.0 2.0 2.0 163.0
42 Determinants of Redistributive Politics: An Empirical Analysis of Land Reforms in West Bengal, I... 64 12 9.0 63 Development 0.0 1.0 1.0 84.0
45 Watta Satta: Bride Exchange and Women's Welfare in Rural Pakistan 73 6 4.0 37 Development 0.0 0.0 0.0 47.0
55 Inherited Trust and Growth 85 102 44.0 536 Development 3.0 3.0 6.0 682.0
64 Can Higher Prices Stimulate Product Use? Evidence from a Field Experiment in Zambia 97 44 21.0 265 Development 0.0 0.0 0.0 330.0

In [18]:
data[(data['Field'] == 'Development') & (data['Web of Science'] > 100)]


Out[18]:
Title ID Web of Science Top 200 Google Scholar Field Replications Extensions Both Total
0 Learning about a New Technology: Pineapple in Ghana 2 176 108.0 1246 Development 1.0 10.0 11.0 1530.0
55 Inherited Trust and Growth 85 102 44.0 536 Development 3.0 3.0 6.0 682.0

Exercises

How many papers in the sample have over 100 Google Scholar citations?

  • Hint: use len() to find the length of a dataframe

In [ ]:

len(data[data['Google Scholar'] > 100])
44

How many 'Labor/IO' papers have over 100 Google Scholar citations?


In [ ]:

len(data[(data['Google Scholar'] > 100) & (data['Field'] == 'Labor/IO')])
10

Grouping and aggregating data

We might want to summarize our data by grouping it by fields

To do this, we will use the .groupby() function


In [19]:
data.reset_index(inplace=True)

grouped = data.groupby('Field')
grouped


Out[19]:
<pandas.core.groupby.DataFrameGroupBy object at 0x114a2b588>

To return an aggregated dataframe, we need to specify the function we would like pandas to use to aggregate our groups


In [20]:
grouped.mean()


Out[20]:
index ID Web of Science Top 200 Google Scholar Replications Extensions Both Total
Field
Applied (gen.) 32.428571 49.285714 33.428571 13.142857 199.714286 0.857143 0.571429 1.428571 246.285714
Behavioral/Exp. 42.631579 65.000000 48.842105 24.000000 202.789474 1.631579 3.421053 5.052632 275.631579
Development 35.500000 55.000000 60.166667 32.333333 380.166667 0.666667 2.666667 3.333333 472.666667
Labor/IO 23.846154 35.923077 46.384615 26.461538 254.384615 0.615385 0.769231 1.384615 327.230769
Macro/Int/Trade 34.200000 51.880000 32.760000 18.880000 203.640000 0.120000 1.040000 1.160000 255.280000

In [21]:
grouped['Both'].mean()


Out[21]:
Field
Applied (gen.)     1.428571
Behavioral/Exp.    5.052632
Development        3.333333
Labor/IO           1.384615
Macro/Int/Trade    1.160000
Name: Both, dtype: float64

In [22]:
grouped['Both'].agg(['mean', 'median', 'count'])


Out[22]:
mean median count
Field
Applied (gen.) 1.428571 1.0 7
Behavioral/Exp. 5.052632 4.0 19
Development 3.333333 1.5 6
Labor/IO 1.384615 1.0 13
Macro/Int/Trade 1.160000 0.0 25

A list of built-in aggregatation functions can be found here

Plotting

Pandas provides a built-in plotting command to create matplotlib plots from your dataframes

To display our plots, we need to import first matplotlib


In [23]:
import matplotlib.pyplot as plt

grouped['Google Scholar'].mean().plot(kind='bar')
plt.show()



In [24]:
data.plot(x='Google Scholar', y='Both', kind='scatter', alpha=0.6)
plt.ylabel('Total replications and extensions')
plt.xlabel('Number of citations')
plt.grid()
plt.show()


Exercise

Plot the average number of replications and extensions for each field, sorted from highest to lowest


In [ ]:

grouped['Both'].mean().sort_values(ascending=False).plot(kind='bar')
plt.title('Average number of replications/extensions by field')
plt.show()

Another plotting library: seaborn

  • seaborn is a plotting library built on top of matplotlib

  • It is geared towards producing pretty plots for statistical applications

  • You can find an example gallery of seaborn plots here


In [26]:
import seaborn as sns  # Import the package

In [27]:
sns.jointplot(x='Google Scholar', y='Both', data=data)
plt.show()



In [28]:
sns.kdeplot(data=data[data['Replications'] > 0].loc[:, 'Web of Science'], cumulative=True, bw=2)
sns.kdeplot(data=data[data['Replications'] == 0].loc[:, 'Web of Science'], cumulative=True, linestyle='--', bw=2)
plt.xlim(0, 200)
plt.xlabel('Web of Science Citations')
plt.ylabel('CDF')
plt.legend(['One or more replications', 'No replications'])
plt.show()


Bokeh example

Bokeh is a Python library that makes creating interactive plots super easy - an example gallery is here


In [30]:
from bokeh.plotting import figure, output_notebook, show
from bokeh.models import ColumnDataSource, HoverTool
output_notebook()

# Add data source
source = ColumnDataSource(data[['Title', 'Web of Science', 'Replications', 'Extensions']])
colors = ['red', 'blue']

# Add title tooltips
hover = HoverTool(tooltips=[
    ("Title", "@Title"),
    ("(x,y)", "($x, $y)")
])

# Create figure
p = figure(tools=[hover])

# Plot scatter
for to_plot, color in zip(['Replications', 'Extensions'], colors):
    p.circle(x='Web of Science', y=to_plot, source=source, size=7, alpha=0.5, legend=[to_plot.title()], color=color)

p.legend.click_policy = 'hide'
p.xaxis.axis_label = 'Number of citations'
p.yaxis.axis_label = 'Number of replications or extensions'

show(p)


Loading BokehJS ...

Other resources

Here are some free, online resources for learning pandas/matplotlib: