Another in a series of notebooks that describe Pandas' powerful data management tools. In this one we summarize our data in a variety of ways. Which is more interesting than it sounds.
Outline:
value_counts
method. Note: requires internet access to run.
This IPython notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course Data Bootcamp.
In [2]:
import sys # system module
import pandas as pd # data package
import matplotlib.pyplot as plt # graphics module
import datetime as dt # date and time module
import numpy as np # foundation for Pandas
%matplotlib inline
# check versions (overkill, but why not?)
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('Today: ', dt.date.today())
We use the IMF's data on government debt again, specifically its World Economic Outlook database, commonly referred to as the WEO. We focus on government debt expressed as a percentage of GDP, variable code GGXWDG_NGDP
.
The central question here is how the debt of Argentina, which defaulted in 2001, compared to other countries. Was it a matter of too much debt or something else?
In [4]:
url1 = 'http://www.imf.org/external/pubs/ft/weo/2015/02/weodata/'
url2 = 'WEOOct2015all.xls'
url = url1 + url2
weo = pd.read_csv(url, sep='\t',
usecols=[1,2] + list(range(19,45)),
thousands=',',
na_values=['n/a', '--'])
print('Variable dtypes:\n', weo.dtypes.head(6), sep='')
In [5]:
# select debt variable
variables = ['GGXWDG_NGDP']
db = weo[weo['WEO Subject Code'].isin(variables)]
# drop variable code column (they're all the same)
db = db.drop('WEO Subject Code', axis=1)
# set index to country code
db = db.set_index('ISO')
# name columns
db.columns.name = 'Year'
# transpose
dbt = db.T
# see what we have
dbt.head()
Out[5]:
Example. Let's try a simple graph of the dataframe dbt
. The goal is to put Argentina in perspective by plotting it along with many other countries.
In [9]:
fig, ax = plt.subplots()
dbt.plot(ax=ax,
legend=False, color='blue', alpha=0.3,
ylim=(0,150)
)
ax.set_ylabel('Percent of GDP')
ax.set_xlabel('')
ax.set_title('Government debt', fontsize=14, loc='left')
dbt['ARG'].plot(ax=ax, color='black', linewidth=1.5)
Out[9]:
Exercise.
In [ ]:
Exercise. Do the same graph with Greece (GRC) as the country of interest. How does it differ? Why do you think that is?
In [ ]:
Let's step back a minute. What we're trying to do is compare Argentina to other countries. What's the best way to do that? This isn't a question with an obvious best answer, but we can try some things, see how they look. One thing we could do is compare Argentina to the mean or median. Or to some other feature of the distribution.
We work up to this by looking first at some features of the distribution of government debt numbers across countries. Some of this we've seen, so is new.
In [6]:
dbt.shape
Out[6]:
In [20]:
# count non-missing values
dbt.count(axis=1).plot()
Out[20]:
In [ ]:
Let's take the data for 2001 -- the year of Argentina's default -- and see what how Argentina compares. Was its debt high compare to other countries?
which leads to more questions. How would we compare? Compare Argentina to the mean or median? Something else?
Let's see how that works.
In [10]:
# 2001 data
db01 = db['2001']
In [11]:
db01['ARG']
Out[11]:
In [12]:
db01.mean()
Out[12]:
In [13]:
db01.median()
Out[13]:
In [14]:
db01.describe()
Out[14]:
In [15]:
db01.quantile(q=[0.25, 0.5, 0.75])
Out[15]:
In [ ]:
Comment. If we add enough quantiles, we might as well plot the whole distribution. The easiest way to do this is with a histogram.
In [16]:
fig, ax = plt.subplots()
db01.hist(bins=15, ax=ax, alpha=0.35)
ax.set_xlabel('Government Debt (Percent of GDP)')
ax.set_ylabel('Number of Countries')
darg = db01['ARG']
ymin, ymax = ax.get_ylim()
ax.vlines(darg, ymin, ymax, color='blue', lw=2)
Out[16]:
In [ ]:
In [ ]:
In [ ]:
In [21]:
# here we compute the mean across countries at every date
dbt.mean(axis=1).head()
Out[21]:
In [22]:
# or we could do the median
dbt.median(axis=1).head()
Out[22]:
In [23]:
# or a bunch of stats at once
# NB: db not dbt (there's no axix argument here)
db.describe()
Out[23]:
In [24]:
# the other way
dbt.describe()
Out[24]:
In [ ]:
In [ ]:
Example. Let's add the mean to our graph. We make it a dashed line with linestyle='dashed'
.
In [25]:
fig, ax = plt.subplots()
dbt.plot(ax=ax,
legend=False, color='blue', alpha=0.3,
ylim=(0,150)
)
dbt['ARG'].plot(ax=ax, color='black', linewidth=1.5)
ax.set_ylabel('Percent of GDP')
ax.set_xlabel('')
ax.set_title('Government debt', fontsize=14, loc='left')
dbt.mean(axis=1).plot(ax=ax, color='black', linewidth=2, linestyle='dashed')
Out[25]:
Question. Do you think this looks better when the mean varies with time, or when we use a constant mean? Let's try it and see.
In [26]:
dbar = dbt.mean().mean()
dbar
Out[26]:
In [27]:
fig, ax = plt.subplots()
dbt.plot(ax=ax,
legend=False, color='blue', alpha=0.3,
ylim=(0,150)
)
dbt['ARG'].plot(ax=ax, color='black', linewidth=1.5)
ax.set_ylabel('Percent of GDP')
ax.set_xlabel('')
ax.set_title('Government debt', fontsize=14, loc='left')
xmin, xmax = ax.get_xlim()
ax.hlines(dbar, xmin, xmax, linewidth=2, linestyle='dashed')
Out[27]:
Exercise. Which do we like better?
Exercise. Replace the (constant) mean with the (constant) median? Which do you prefer?
In [ ]:
A categorical variable is one that takes on a small number of values. States take on one of fifty values. University students are either grad or undergrad. Students select majors and concentrations.
We're going to do two things with categorical data:
value_counts
method. This is a series method, we apply it to one series/variable at a time. We start with the combined MovieLens data we constructed in the previous notebook.
In [30]:
url = 'http://pages.stern.nyu.edu/~dbackus/Data/mlcombined.csv'
ml = pd.read_csv(url)
print('Dimensions:', ml.shape)
ml.head(10)
Out[30]:
In [ ]:
In [32]:
# which movies have the most ratings?
ml['title'].value_counts().head(10)
Out[32]:
In [41]:
ml['title'].value_counts().head(10).plot.barh(alpha=0.5)
Out[41]:
In [ ]:
In [42]:
# which people have rated the most movies?
ml['userId'].value_counts().head(10)
Out[42]:
In [ ]:
In [ ]:
In [ ]:
Next up: group data by some variable. As an example, how would we compute the average rating of each movie? If you think for a minute, you might think of these steps:
groupby
method. Pandas has tools that make that relatively easy.
In [43]:
# group
g = ml[['title', 'rating']].groupby('title')
type(g)
Out[43]:
In [ ]:
Now that we have a groupby object, what can we do with it?
In [50]:
# the number in each category
g.count().head(10)
Out[50]:
In [48]:
# what type of object have we created?
type(g.count())
Out[48]:
In [ ]:
Add this via Spencer.
Comment. Note that the combination of groupby
and count
created a dataframe with
Exercise. Take the code
counts = ml.groupby(['title', 'movieId')
Without running it, what is the index of counts
? What are its columns?
In [ ]:
In [ ]:
In [46]:
gm = g.mean()
gm.head(10)
Out[46]:
In [ ]:
In [ ]:
In [ ]:
In [59]:
# we can put them together
grouped = g.count()
grouped = grouped.rename(columns={'rating': 'Number'})
grouped['Mean'] = g.mean()
grouped.head(10)
Out[59]:
In [ ]:
In [63]:
grouped.plot.scatter(x='Number', y='Mean')
Out[63]:
In [ ]:
Exercise. Compute the median and add it to the dataframe.
In [ ]:
In [ ]:
The Brandon Rhodes video covers most of this, too.
In [ ]: