In [1]:
import requests
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
Pandas is a Python package for data analysis. Documentation and examples: http://pandas.pydata.org/
To learn how Pandas works, we'll make use of a dataset containing long-run averages of inflation, money growth, and real GDP. The dataset is available here: http://www.briancjenkins.com/data/quantitytheory/csv/qtyTheoryData.csv. Recall that the quantity theory of money implies the following linear relationship between the long-run rate of money growth, the long-run rate of inflation, and the long-run rate of real GDP growth in a country:
\begin{align}
\text{inflation} & = \text{money growth} - \text{real GDP growth}.
\end{align}
Generally, we treat real GDP growth and money supply growth as exogenous, so this is a theory about the determination of inflation.
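One way to see where this relationship comes from is the equation of exchange. The sketch below assumes that the velocity of money is constant, which is the standard textbook simplification. The symbols $M$ (money supply), $V$ (velocity), $P$ (price level), and $Y$ (real GDP) are introduced here for illustration and are not column names in the dataset.
\begin{align}
M V & = P Y \\
\log M + \log V & = \log P + \log Y \\
\frac{\dot{M}}{M} + \frac{\dot{V}}{V} & = \frac{\dot{P}}{P} + \frac{\dot{Y}}{Y} \\
\frac{\dot{P}}{P} & = \frac{\dot{M}}{M} - \frac{\dot{Y}}{Y} \qquad \text{(using } \dot{V}/V = 0 \text{)}
\end{align}
The last line says that inflation equals money growth minus real GDP growth, which is the relationship stated above.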
Now, we could download the data manually, but we might as well use Python to do it. The requests module is good for this.
In [2]:
# Use the requests module to download money growth and inflation data
url = 'http://www.briancjenkins.com/data/quantitytheory/csv/qtyTheoryData.csv'
r = requests.get(url, verify=True)
r.raise_for_status()  # stop here if the download failed
with open('qtyTheoryData.csv', 'wb') as newFile:
    newFile.write(r.content)
In [3]:
import pandas as pd
In [4]:
# Import quantity theory data into a Pandas DataFrame called df with country names as the index.
df = pd.read_csv('qtyTheoryData.csv',index_col=0)
In [5]:
# Print the first 5 rows
print(df.head(5))
In [6]:
# Print the last 5 rows
print(df.tail())
In [7]:
# Print the type of df
print(type(df))
In [8]:
# Print the columns of df
print(df.columns)
In [9]:
# Create a new variable called money equal to the 'money growth' column and print
money = df['money growth']
print(money)
In [10]:
# Print the type of the variable money
print(type(money))
In [11]:
# Print the first 5 rows of just the inflation, money growth, and gdp growth columns
print(df[['inflation','money growth','gdp growth']].head())
The set of row coordinates is the index. Index values can be strings, numbers, or dates.
In [12]:
# Print the index of df
print(df.index)
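To illustrate the point that index values can be strings, numbers, or dates, here is a minimal sketch using made-up data. The names `by_name`, `by_number`, and `by_date` and all values are hypothetical and do not come from the quantity theory dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the quantity theory dataset
by_name = pd.Series([0.02, 0.05], index=['Country A', 'Country B'])
by_number = pd.Series([0.02, 0.05], index=[10, 20])
by_date = pd.Series([0.02, 0.05],
                    index=pd.to_datetime(['2000-01-01', '2001-01-01']))

# .loc looks up a value by its index label, whatever the label type
print(by_name.loc['Country A'])
print(by_number.loc[20])
print(by_date.loc[pd.Timestamp('2001-01-01')])

# The type of the index object depends on the labels it holds
print(type(by_date.index))
```

In each case `.loc` selects by label rather than by position, so `by_number.loc[20]` returns the value labeled 20, not the element at position 20.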
In [13]:
# Create a new variable called usa equal to the 'United States' row and print
usa = df.loc['United States']
print(usa)
In [14]:
# Print the inflation rate of the United States
print(df.loc['United States']['inflation'])
In [15]:
# Print the inflation rate of the United States in a different way
print(df['inflation'].loc['United States'])
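The two chained lookups above can also be written as a single `.loc` call that takes the row label and the column label together. The sketch below uses a small made-up DataFrame called `toy` (a hypothetical stand-in for `df`) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data standing in for df; the values are made up
toy = pd.DataFrame(
    {'inflation': [0.03, 0.10], 'money growth': [0.06, 0.14]},
    index=['Country A', 'Country B'],
)

# Row label and column label in a single .loc call
single_lookup = toy.loc['Country A', 'inflation']

# The two chained forms return the same value
chained_row_first = toy.loc['Country A']['inflation']
chained_col_first = toy['inflation'].loc['Country A']

print(single_lookup, chained_row_first, chained_col_first)
```

For reading a value, all three forms are equivalent. The single-call form is generally the safer habit when assigning values, because chained indexing can end up modifying a temporary copy instead of the original DataFrame.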
In [16]:
# Create a new variable called first equal to the first row in the DataFrame and print
first = df.iloc[0]
print(first)
Create new columns by name.
In [17]:
# Create a new column called 'difference' equal to the money growth column minus the inflation column and print the column
df['difference'] = df['money growth'] - df['inflation']
print(df['difference'])
In [18]:
# Print the summary statistics for df
print(df.describe())
While Pandas' describe() method provides some good summary information, NumPy also has some useful functions for computing statistics. For example, the NumPy function corrcoef() returns the correlation matrix for two series; the off-diagonal entry of that matrix is their coefficient of correlation.
In [19]:
# Print the correlation coefficient for inflation and money growth
print('corr of inflation and money growth',np.corrcoef(df['inflation'],df['money growth'])[0][1])
# Print the correlation coefficient for inflation and real GDP growth
print('corr of inflation and real GDP growth',np.corrcoef(df['inflation'],df['gdp growth'])[0][1])
# Print the correlation coefficient for money growth and real GDP growth
print('corr of money growth and real GDP growth',np.corrcoef(df['money growth'],df['gdp growth'])[0][1])
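As an alternative to calling `np.corrcoef()` once per pair, a DataFrame's `corr()` method computes all pairwise correlations at once. The sketch below uses a made-up DataFrame called `toy` with hypothetical values, so the numbers it prints say nothing about the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the numbers are made up for illustration
toy = pd.DataFrame({
    'inflation': [0.02, 0.05, 0.11, 0.20],
    'money growth': [0.05, 0.09, 0.14, 0.26],
    'gdp growth': [0.03, 0.04, 0.02, 0.05],
})

# Pairwise Pearson correlations for every pair of columns
corr_matrix = toy.corr()
print(corr_matrix)

# One entry of the matrix matches the corresponding np.corrcoef result
from_numpy = np.corrcoef(toy['inflation'], toy['money growth'])[0, 1]
from_pandas = corr_matrix.loc['inflation', 'money growth']
print(np.isclose(from_numpy, from_pandas))
```

The result of `corr()` is itself a DataFrame, so individual coefficients can be picked out by column name with `.loc`.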
sort_values() returns a copy of the original DataFrame sorted by the given column. The optional argument ascending is True by default, which puts the lowest values first; set it to False to put the highest values first.
In [20]:
# Print rows for the countries with the 10 lowest inflation rates
print(df.sort_values('inflation').head(10))
# Print rows for the countries with the 10 lowest money growth rates
print(df.sort_values('money growth').head(10))
In [21]:
# Print rows for the countries with the 10 highest inflation rates
print(df.sort_values('inflation',ascending=False).head(10))
# Print rows for the countries with the 10 highest money growth rates
print(df.sort_values('money growth',ascending=False).head(10))
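The claim that `sort_values()` returns a copy can be checked on a small example. This sketch uses a made-up DataFrame called `toy` with hypothetical country labels and values.

```python
import pandas as pd

# Hypothetical toy data; the values are made up
toy = pd.DataFrame(
    {'inflation': [0.10, 0.02, 0.05]},
    index=['Country A', 'Country B', 'Country C'],
)

# Sorting returns a new DataFrame in each direction
lowest_first = toy.sort_values('inflation')
highest_first = toy.sort_values('inflation', ascending=False)

# The original row order is unchanged
print(list(lowest_first.index))   # ['Country B', 'Country C', 'Country A']
print(list(highest_first.index))  # ['Country A', 'Country C', 'Country B']
print(list(toy.index))            # ['Country A', 'Country B', 'Country C']
```

Because the original is left untouched, the sorted result has to be assigned to a name (or printed directly, as in the cells above) to be of any use.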
sort_index() returns a copy of the original DataFrame sorted by its index. The optional argument ascending is True by default; set it to False to sort the index in descending order.
In [22]:
# Print df with the index in descending alphabetical order
print(df.sort_index(ascending=False))
In [23]:
# Construct a well-labeled scatter plot of inflation against money growth
plt.scatter(df['money growth'],df['inflation'],s=50,alpha = 0.25)
plt.grid()
plt.xlim([-0.2,1.2])
plt.ylim([-0.2,1.2])
plt.xlabel('money growth')
plt.ylabel('inflation')
plt.title('Average inflation against average money growth \nfor '+str(len(df.index))+' countries.')