This tutorial is...
We'll be introducing Pandas for data handling and analysis, and using Matplotlib, Seaborn and Bokeh for visualization
Please execute sequentially
Pandas provides objects for working with data - Series and DataFrame
Data structures have methods for manipulating data eg. indexing, sorting, grouping, filling in missing data
Pandas does not provide modeling tools eg. regression, prediction
scikit-learn and statsmodels, which are built on top of pandas
In [1]:
from IPython.display import Image
Image('./dataframe.jpg')
Out[1]:
We will create a DataFrame by reading in a CSV file and assigning it to the variable name data
data contains 70 empirical papers from AER's 100th volumeWeb of Science: number of Web of Science citationsTop 200: number of Top 200 economics journal citationsGoogle Scholar: number of Google Scholar citationsReplications: number of citations that are replications of the paperExtensions: number of citations that are extensions of the paperBoth: number of citations that are either replications or extensions of the paper
In [2]:
import pandas as pd # Import the package
data = pd.read_csv('data.csv')
data.head()
Out[2]:
In [3]:
data.info()
We can access individual columns of data, returning a Series
In [4]:
data['Top 200'].head()
Out[4]:
We can also select multiple columns, returning a new dataframe
In [5]:
data[['Title', 'Top 200']].head()
Out[5]:
We can add a new column to our dataframe like so
In [6]:
data['Total'] = data['Web of Science'] + data['Top 200'] + data['Google Scholar']
data.head()
Out[6]:
If we want to know the average number of Web of Science citations...
In [7]:
data.mean()
Out[7]:
.describe() returns useful summary statistics
In [8]:
data.describe()
Out[8]:
In [ ]:
data[['Top 200', 'Google Scholar']].mean()
Top 200 22.257143
Google Scholar 227.571429
dtype: float64
In [ ]:
data[['Top 200', 'Google Scholar']].describe()
| Top 200 | Google Scholar | |
|---|---|---|
| count | 70.000000 | 70.000000 |
| mean | 22.257143 | 227.571429 |
| std | 23.182131 | 257.258375 |
| min | 0.000000 | 7.000000 |
| 25% | 9.000000 | 73.250000 |
| 50% | 15.000000 | 139.000000 |
| 75% | 23.750000 | 267.250000 |
| max | 108.000000 | 1246.000000 |
In [ ]:
data['Field'].describe()
count 70
unique 5
top Macro/Int/Trade
freq 25
Name: Field, dtype: object
We can use integer slicing to select rows as follows
In [9]:
data[:5]
Out[9]:
We might want to find the top cited papers
First we will sort our values by a column in the dataframe
In [10]:
pd.set_option('max_colwidth', 100) # adjust column width
data[['Title', 'Google Scholar']].sort_values(by='Google Scholar', ascending=False)[:5]
Out[10]:
Another way to select rows is to use row labels, ie. set a row index
Similar to the column labels, we can add row labels (the index)
In [11]:
data.set_index('Title').head()
Out[11]:
Note: we haven't actually changed data
In [12]:
data.head()
Out[12]:
We need to reassign the variable name data to the new copy
In [13]:
# data.reset_index(inplace=True) # Uncomment this if you receive an error
data = data.set_index('Title') # Can also use data.set_index('Title', inplace=True)
data.head()
Out[13]:
In [14]:
data.loc['Matching and Sorting in Online Dating']
Out[14]:
In [15]:
data.loc[['Matching and Sorting in Online Dating',
'Learning about a New Technology: Pineapple in Ghana']]
Out[15]:
Alternatively, we can filter our dataframe (select rows) using boolean conditions
In [16]:
data.reset_index(inplace=True)
data['Field'] == 'Development'
Out[16]:
Selecting rows with this boolean condition will return only rows of the dataframe where Field == 'Development' is True
In [17]:
data[data['Field'] == 'Development']
Out[17]:
In [18]:
data[(data['Field'] == 'Development') & (data['Web of Science'] > 100)]
Out[18]:
In [ ]:
len(data[data['Google Scholar'] > 100])
44
In [ ]:
len(data[(data['Google Scholar'] > 100) & (data['Field'] == 'Labor/IO')])
10
We might want to summarize our data by grouping it by fields
To do this, we will use the .groupby() function
In [19]:
data.reset_index(inplace=True)
grouped = data.groupby('Field')
grouped
Out[19]:
To return an aggregated dataframe, we need to specify the function we would like pandas to use to aggregate our groups
In [20]:
grouped.mean()
Out[20]:
In [21]:
grouped['Both'].mean()
Out[21]:
In [22]:
grouped['Both'].agg(['mean', 'median', 'count'])
Out[22]:
A list of built-in aggregatation functions can be found here
Pandas provides a built-in plotting command to create matplotlib plots from your dataframes
To display our plots, we need to import first matplotlib
In [23]:
import matplotlib.pyplot as plt
grouped['Google Scholar'].mean().plot(kind='bar')
plt.show()
In [24]:
data.plot(x='Google Scholar', y='Both', kind='scatter', alpha=0.6)
plt.ylabel('Total replications and extensions')
plt.xlabel('Number of citations')
plt.grid()
plt.show()
In [ ]:
grouped['Both'].mean().sort_values(ascending=False).plot(kind='bar')
plt.title('Average number of replications/extensions by field')
plt.show()
seabornseaborn is a plotting library built on top of matplotlib
It is geared towards producing pretty plots for statistical applications
You can find an example gallery of seaborn plots here
In [26]:
import seaborn as sns # Import the package
In [27]:
sns.jointplot(x='Google Scholar', y='Both', data=data)
plt.show()
In [28]:
sns.kdeplot(data=data[data['Replications'] > 0].loc[:, 'Web of Science'], cumulative=True, bw=2)
sns.kdeplot(data=data[data['Replications'] == 0].loc[:, 'Web of Science'], cumulative=True, linestyle='--', bw=2)
plt.xlim(0, 200)
plt.xlabel('Web of Science Citations')
plt.ylabel('CDF')
plt.legend(['One or more replications', 'No replications'])
plt.show()
Bokeh is a Python library that makes creating interactive plots super easy - an example gallery is here
In [30]:
from bokeh.plotting import figure, output_notebook, show
from bokeh.models import ColumnDataSource, HoverTool
output_notebook()
# Add data source
source = ColumnDataSource(data[['Title', 'Web of Science', 'Replications', 'Extensions']])
colors = ['red', 'blue']
# Add title tooltips
hover = HoverTool(tooltips=[
("Title", "@Title"),
("(x,y)", "($x, $y)")
])
# Create figure
p = figure(tools=[hover])
# Plot scatter
for to_plot, color in zip(['Replications', 'Extensions'], colors):
p.circle(x='Web of Science', y=to_plot, source=source, size=7, alpha=0.5, legend=[to_plot.title()], color=color)
p.legend.click_policy = 'hide'
p.xaxis.axis_label = 'Number of citations'
p.yaxis.axis_label = 'Number of replications or extensions'
show(p)
Here are some free, online resources for learning pandas/matplotlib: