This notebook will introduce you to working with data in Python
. You will use packages like Numpy
to manipulate, work and do computations with arrays, matrices, and such, and anipulate data (see my Introduction to Python). But given the needs of economists (and other scientists) it will be advantageous for us to use pandas. pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for Python. pandas allows you to import and process data in many useful ways. It interacts greatly with other packages that complement it making it a very powerful tool for data analysis.
With pandas you can
and much more. Let's start by importing pandas
and use to it download some data and create some of the figures from the lecture notes. Note that when importing pandas
it is accustomed to assign it the alias pd
. I suggest you follow this conventiuon, which will make using other peoples code and snippets easier.
In [1]:
# Let's import pandas and some other basic packages we will use
from __future__ import division
%pylab --no-import-all
%matplotlib inline
import pandas as pd
import numpy as np
The basic structures in pandas
are pd.Series
and pd.DataFrame
. You can think of a pd.Series
as a labeled vector that contains data and has a large set of functions that can be easily performed on it. A pd.DataFrame
is similar a table/matrix of multidimensional data where each column contains a pd.Series
. I know...this may not explain much, so let's start with some actual examples. Let's create two series, one containing some country names and another containing some ficticious data.
In [2]:
countries = pd.Series(['Colombia', 'Turkey', 'USA', 'Germany', 'Chile'], name='country')
print(countries)
print('\n', 'There are ', countries.shape[0], 'countries in this series.')
Notice that we have assinged a name to the series that is different than the name of the variable containing the series. Our print(countries)
statement is showing the series and its contents, its name and the dype of data it contains. Here our series is only composed of strings so it assigns it the object dtype (not important for now, but we will use this later to convert data between types, e.g. strings to integers or floats or the other way around).
Let's create the data using some of the functions we already learned.
In [3]:
np.random.seed(123456)
data = pd.Series(np.random.normal(size=(countries.shape)), name='noise')
print(data)
print('\n', 'The average in this sample is ', data.mean())
Here we have used the mean()
function of the series to compute its mean. There are many other properties/functions for these series including std()
, shape
, count()
, max()
, min()
, etc. You can access these by writing series.name_of_function_or_property
. To see what functions are available you can hit tab
after writing series.
.
Let's create a pd.DataFrame
using these two series.
In [4]:
df = pd.DataFrame([countries, data])
df
Out[4]:
Not exactly what we'd like, but don't worry, we can just transpose it so it has each country with its data in a row.
In [5]:
df = df.T
df
Out[5]:
Now let us add some more data to this dataframe. This is done easily by defining a new columns. Let's create the square of noise
, create the sum of noise
and its square, and get the length of the country's name.
In [6]:
df['noise_sq'] = df.noise**2
df['noise and its square'] = df.noise + df.noise_sq
df['name length'] = df.country.apply(len)
df
Out[6]:
This shows some of the ways in which you can create new data. Especially useful is the apply
method, which applies a function to the series. You can also apply a function to the whole dataframe, which is useful if you want to perform computations using various columns.
Let's see some other ways in which we can interact with dataframes. First, let's select some observations, e.g., all countries in the South America.
In [7]:
# Let's create a list of South American countries
south_america = ['Colombia', 'Chile']
# Select the rows for South American countries
df.loc[df.country.apply(lambda x: x in south_america)]
Out[7]:
Now let's use this to create a dummy indicating whether a country belongs to South America. To understand what is going on let's show the result of the condition for selecting rows.
In [8]:
df.country.apply(lambda x: x in south_america)
Out[8]:
So in the previous selection of rows we told pandas
which rows we wanted or not to be included by passing a series
of booleans (True
, False
). We can use this result to create the dummy, we only need to convert the output to int
.
In [9]:
df['South America'] = df.country.apply(lambda x: x in south_america).astype(int)
Now, let's plot the various series in the dataframe
In [10]:
df.plot()
Out[10]:
Not too nice nor useful. Notice that it assigned the row number to the x-axis labels. Let's change the row labels, which are contained in the dataframe's index
by assigning the country names as the index
.
In [11]:
df = df.set_index('country')
print(df)
df.plot()
Out[11]:
Better, but still not very informative. Below we will improve on this when we work with some real data.
Notice that by using the set_index
function we have assigned the index to the country names. This may be useful to select data. E.g., if we want to see only the row for Colombia
we can
In [12]:
df.loc['Colombia']
Out[12]:
One of the nice features of pandas and its ecology is that it makes obtaining data very easy. In order to exemplify this and also to revisit some of the basic facts of comparative development, let's download some data from various sources. This may require you to create accounts in order to access and download the data (sometimes the process is very simple and does not require an actual project...in other cases you need to propose a project and be approved...usually due to privacy concerns with micro-data). Don't be afraid, all these sources are free and are used a lot in research, so it is good that you learn to use them. Let's start with a list of useful sources.
To study the divergence across countries let's download and plot the historical GDP and population data. In order to keep the data and not having to download it everytime from scratch, we'll create a folder ./data
in the currect directory and save each file there. Also, we'll make sure that if the data does not exist, we download it. We'll use the os
package to create directories.
In [13]:
import os
pathout = './data/'
if not os.path.exists(pathout):
os.mkdir(pathout)
pathgraphs = './graphs/'
if not os.path.exists(pathgraphs):
os.mkdir(pathgraphs)
In [14]:
try:
maddison_new = pd.read_stata(pathout + 'Maddison2018.dta')
maddison_new_region = pd.read_stata(pathout + 'Maddison2018_region.dta')
maddison_new_1990 = pd.read_stata(pathout + 'Maddison2018_1990.dta')
except:
maddison_new = pd.read_stata('https://www.rug.nl/ggdc/historicaldevelopment/maddison/data/mpd2018.dta')
maddison_new.to_stata(pathout + 'Maddison2018.dta', write_index=False, version=117)
maddison_new_region = pd.read_stata('https://www.rug.nl/ggdc/historicaldevelopment/maddison/data/mpd2018_region_data.dta')
maddison_new_region.to_stata(pathout + 'Maddison2018_region.dta', write_index=False, version=117)
maddison_new_1990 = pd.read_stata('https://www.rug.nl/ggdc/historicaldevelopment/maddison/data/mpd2018_1990bm.dta')
maddison_new_1990.to_stata(pathout + 'Maddison2018_1990.dta', write_index=False, version=117)
In [15]:
maddison_new
Out[15]:
This dataset is in long format. Also, notice that the year is not an integer. Let's correct this
In [16]:
maddison_new['year'] = maddison_new.year.astype(int)
maddison_new
Out[16]:
In [17]:
if not os.path.exists(pathout + 'Maddison_original.xls'):
import urllib
dataurl = "http://www.ggdc.net/maddison/Historical_Statistics/horizontal-file_02-2010.xls"
urllib.request.urlretrieve(dataurl, pathout + 'Maddison_original.xls')
This dataset is not very nicely structured for importing, as you can see if you open it in Excel. I suggest you do so, so that you can better see what is going on. Notice that the first two rows really have no data. Also, every second column is empty. Moreover, there are a few empty rows. Let's import the data and clean it so we can plot and analyse it better.
In [18]:
maddison_old_pop = pd.read_excel(pathout + 'Maddison_original.xls', sheet_name="Population", skiprows=2)
maddison_old_pop
Out[18]:
In [19]:
maddison_old_gdppc = pd.read_excel(pathout + 'Maddison_original.xls', sheet_name="PerCapita GDP", skiprows=2)
maddison_old_gdppc
Out[19]:
Let's start by renaming the first column, which has the region/country names
In [20]:
maddison_old_pop.rename(columns={'Unnamed: 0':'Country'}, inplace=True)
maddison_old_gdppc.rename(columns={'Unnamed: 0':'Country'}, inplace=True)
Now let's drop all the columns that do not have data
In [21]:
maddison_old_pop = maddison_old_pop[[col for col in maddison_old_pop.columns if str(col).startswith('Unnamed')==False]]
maddison_old_gdppc = maddison_old_gdppc[[col for col in maddison_old_gdppc.columns if str(col).startswith('Unnamed')==False]]
Now, let's change the name of the columns so they reflect the underlying variable
In [22]:
maddison_old_pop.columns = ['Country'] + ['pop_'+str(col) for col in maddison_old_pop.columns[1:]]
maddison_old_gdppc.columns = ['Country'] + ['gdppc_'+str(col) for col in maddison_old_gdppc.columns[1:]]
In [23]:
maddison_old_pop
Out[23]:
In [24]:
maddison_old_gdppc
Out[24]:
Let's choose the rows that hold the aggregates by region for the main regions of the world.
In [25]:
gdppc = maddison_old_gdppc.loc[maddison_old_gdppc.Country.apply(lambda x: str(x).upper().find('TOTAL')!=-1)].reset_index(drop=True)
gdppc = gdppc.dropna(subset=['gdppc_1'])
gdppc = gdppc.loc[2:]
gdppc['Country'] = gdppc.Country.str.replace('Total', '').str.replace('Countries', '').str.replace('\d+', '').str.replace('European', 'Europe').str.strip()
gdppc = gdppc.loc[gdppc.Country.apply(lambda x: x.find('USSR')==-1 and x.find('West Asian')==-1)].reset_index(drop=True)
gdppc
Out[25]:
Let's drop missing values
In [26]:
gdppc = gdppc.dropna(axis=1, how='any')
gdppc
Out[26]:
Let's convert from wide to long format
In [27]:
gdppc = pd.wide_to_long(gdppc, ['gdppc_'], i='Country', j='year').reset_index()
gdppc
Out[27]:
We can now plot the data. Let's try two different ways. The first uses the plot
function from pandas
. The second uses the package seaborn
, which improves on the capabilities of matplotlib
. The main difference is how the data needs to be organized. Of course, these are not the only ways to plot and we can try others.
In [28]:
import matplotlib as mpl
import seaborn as sns
# Setup seaborn
sns.set()
Let's pivot the table so that each region is a column and each row is a year. This will allow us to plot using the plot
function of the pandas DataFrame
.
In [29]:
gdppc2 = gdppc.pivot_table(index='year',columns='Country',values='gdppc_',aggfunc='sum')
gdppc2
Out[29]:
Ok. Let's plot using the pandas
plot
function.
In [30]:
# Select some colors
mycolors = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]
# Use seaborn to setup a color map to be used by matplotlib
my_cmap = mpl.colors.ListedColormap(sns.color_palette(mycolors).as_hex())
# Set the size of the figure and get a figure and axis object
fig, ax = plt.subplots(figsize=(30,20))
# Plot using the axis ax and colormap my_cmap
gdppc2.loc[1800:].plot(ax=ax, linewidth=8, cmap=my_cmap)
# Change options of axes, legend
ax.tick_params(axis = 'both', which = 'major', labelsize=32)
ax.tick_params(axis = 'both', which = 'minor', labelsize=16)
ax.yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))
ax.legend(prop={'size': 40}).set_title("Region", prop = {'size':40})
# Label axes
ax.set_xlabel('Year', fontsize=36)
ax.set_ylabel('GDP per capita (1990 Int\'l US$)', fontsize=36)
Out[30]:
In [31]:
fig
Out[31]:
Now, let's use seaborn
In [32]:
gdppc['Region'] = gdppc.Country.astype('category')
gdppc['gdppc_'] = gdppc.gdppc_.astype(float)
# Plot
fig, ax = plt.subplots(figsize=(30,20))
sns.lineplot(x='year', y='gdppc_', hue='Region', data=gdppc.loc[gdppc.year>=1800].reset_index(drop=True), alpha=1, lw=8, palette=sns.color_palette(mycolors), style='Region', dashes=False, markers=False)
ax.tick_params(axis = 'both', which = 'major', labelsize=32)
ax.tick_params(axis = 'both', which = 'minor', labelsize=16)
ax.yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))
ax.legend(title='', prop={'size': 40})
ax.set_xlabel('Year', fontsize=36)
ax.set_ylabel('GDP per capita (1990 Int\'l US$)', fontsize=36)
Out[32]:
In [33]:
fig
Out[33]: