We review the material we've covered to date: Python fundamentals, data input with Pandas, and graphics with Matplotlib. Questions marked Bonus are more difficult and are there to give the experts something to do.
This IPython notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course Data Bootcamp.
This version was modified by (add your name in bold here). And add your initials to the notebook's name at the top.
In [1]:
# import packages
import pandas as pd # data management
import matplotlib.pyplot as plt # graphics
# IPython command, puts plots in notebook
%matplotlib inline
# check Python version
import datetime as dt
import sys
print('Today is', dt.date.today())
print('What version of Python are we running? \n', sys.version, sep='')
Question 1.
Answers. Enter your answers below:
**name**
Question 2. Describe the type and content of these expressions:
x = 2
y = 3.0
z = "3.0"
x/y
letters = 'abcd'
letters[-1]
xyz = [x, y, z]
xyz[1]
abcd = list(letters)
abcd[-2]
case = {'a': 'A', 'b': 'B', 'c': 'C'}
case['c']
2 >= 1
x == 2
Answers. Enter your answers below:
By content and type we mean the content of the variable or expression and its type as returned by the type() function.
In [2]:
# code cell for experimenting
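One way to check your answers to Question 2 is to print each expression's value next to the result of type(). The sketch below is only an illustration: the expressions dictionary and its labels are our own hypothetical names, not part of the original notebook.

```python
# Illustrative sketch: evaluate the Question 2 expressions and report value and type.
x = 2
y = 3.0
z = "3.0"
letters = 'abcd'
xyz = [x, y, z]
abcd = list(letters)
case = {'a': 'A', 'b': 'B', 'c': 'C'}

# `expressions` is a hypothetical helper dict: label -> evaluated result
expressions = {
    'x/y': x/y,
    'letters[-1]': letters[-1],
    'xyz[1]': xyz[1],
    'abcd[-2]': abcd[-2],
    "case['c']": case['c'],
    '2 >= 1': 2 >= 1,
    'x == 2': x == 2,
}

# repr() keeps the quotes on strings, so '3.0' and 3.0 are easy to tell apart
for label, value in expressions.items():
    print(label, '->', repr(value), type(value))
```

Using repr() rather than plain printing makes the difference between the string "3.0" and the float 3.0 visible in the output.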
Question 3. These get progressively more difficult:
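* Set dollars = '$1,234.5'. What type is dollars?
* Eliminate the dollar sign from dollars.
* Eliminate the comma from dollars.
* Eliminate both the dollar sign and the comma from dollars and convert the result to a float.

In each case, create a code cell that delivers the answer. Please write the question number in a comment in each cell.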
In [3]:
dollars = '$1,234.5'
type(dollars)
Out[3]:
In [4]:
dollars = dollars.replace('$','')
dollars
Out[4]:
In [5]:
dollars = dollars.replace(',','')
dollars
Out[5]:
In [6]:
dollars = float(dollars)
dollars
Out[6]:
In [7]:
# we can glue the pieces together
dollars = '$1,234.5'
dollars = float(dollars.replace('$','').replace(',',''))
dollars
Out[7]:
In [ ]:
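If the same cleanup is needed for many strings, the steps from Question 3 can be wrapped in a small function. This is a minimal sketch, not part of the original answers; the name to_float is hypothetical, and it assumes the only characters to strip are the dollar sign and the comma.

```python
def to_float(text):
    """Strip dollar signs and commas from a string, then convert it to a float.

    Hypothetical helper: it assumes '$' and ',' are the only symbols to remove.
    """
    for symbol in ('$', ','):
        text = text.replace(symbol, '')
    return float(text)

print(to_float('$1,234.5'))   # 1234.5
```

A string with any other non-numeric character (a percent sign, say) would still raise a ValueError in float(), which is usually the behavior you want: bad data fails loudly.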
Question 4. For this problem we set letters = 'abcd' as in problem 2.
* Find and apply a method that converts the lower case letter 'a' to the upper case letter 'A'.
* Write a loop that goes through the elements of letters and prints their upper case versions.
* Write a loop that goes through the elements of letters. On each iteration, print a string consisting of the upper and lower case versions together; e.g., 'Aa'.

In each case, create a code cell that delivers the answer. Please write the question number in a comment in each cell.
In [8]:
'a'.upper()
Out[8]:
In [9]:
letters = 'abcd'
for letter in letters:
    print(letter.upper())
In [10]:
letters = 'abcd'
for letter in letters:
    print(letter.upper()+letter)
Question 5. For this problem xyz is the same as defined in problem 2.
* Write a loop that goes through the elements of xyz and prints them.
* Modify the loop to print both the elements of xyz and their type.

In each case, create a code cell that delivers the answer. Please write the question number in a comment in each cell.
In [11]:
# xyz as defined in problem 2: an int, a float, and a string
xyz = [2, 3.0, '3.0']
for item in xyz:
    print(item)
In [12]:
for item in xyz:
    print(item, type(item))
In [13]:
# print only the elements that are not strings
for item in xyz:
    if type(item) != str:
        print(item, type(item))
In [ ]:
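The last loop in Question 5 filters by comparing type(item) with str. A common alternative is isinstance(), sketched below under the assumption that xyz holds the int, float, and string from Question 2; the name non_strings is ours.

```python
# xyz as in Question 2: an int, a float, and a string
xyz = [2, 3.0, '3.0']

# isinstance() also returns True for subclasses of str,
# which a direct comparison of type(item) with str would miss
non_strings = [item for item in xyz if not isinstance(item, str)]

for item in non_strings:
    print(item, type(item))
```

For the built-in types used here the two tests give the same result; isinstance() is simply the more conventional idiom.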
Question 6. Write code in the cell below that reads the csv file we posted at
http://pages.stern.nyu.edu/~dbackus/Data/debt.csv
Assign the contents of the file to the object debt. The rest of the questions in this notebook will refer to the object debt you create below.
In [14]:
url = 'http://pages.stern.nyu.edu/~dbackus/Data/debt.csv'
debt = pd.read_csv(url)
debt.tail(3)
Out[14]:
In [15]:
# if that failed, you can generate the same data with
data = {'ARG': [137.5, 106.0, 61.8, 47.0, 39.1, 37.3, 48.6],
        'DEU': [59.2, 64.6, 66.3, 64.9, 80.3, 79.0, 73.1],
        'GRC': [98.1, 94.9, 102.9, 108.8, 145.7, 156.5, 177.2],
        'Year': [2002, 2004, 2006, 2008, 2010, 2012, 2014]}
debt = pd.DataFrame(data)
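Question 7. Let's describe the object debt:
* What type of object is debt?
* What are its dimensions?
* What are its column labels? Row labels?
* What dtypes are the columns?

In each case, create a code cell that delivers the answer. Please write the question number in a comment in each cell.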
In [16]:
type(debt)
Out[16]:
In [17]:
debt.shape
Out[17]:
In [18]:
debt.columns
Out[18]:
In [19]:
debt.index
Out[19]:
In [20]:
debt.dtypes
Out[20]:
In [ ]:
In [ ]:
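Question 8. Do the following with debt:
* Set Year as the index.
* Change the column labels from country codes to country names.

The next three get progressively more difficult:
* Compute the mean debt for each country.
* Compute the mean debt for each year.
* Compute the mean debt over both countries and years.

Some simple plots:
* Plot each country's debt against Year using a plot method.

In each case, create a code cell that delivers the answer. Please write the question number in a comment in each cell.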
In [22]:
debt = debt.set_index('Year')
In [23]:
rn = {"ARG": "Argentina", "DEU": "Germany", "GRC": "Greece"}
debt.rename(columns=rn)
Out[23]:
In [24]:
debt.columns = ['Argentina', 'Germany', 'Greece']
debt
Out[24]:
In [25]:
debt.mean()
Out[25]:
In [26]:
debt.mean(axis=1)
Out[26]:
In [27]:
debt.mean().mean()
Out[27]:
In [ ]:
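Two details in the Question 8 answers are easy to miss: rename() returns a new DataFrame rather than changing debt, and the axis argument of mean() decides whether you average down columns or across rows. The sketch below illustrates both with the fallback data from Question 6; the names renamed, country_means, year_means, and overall_mean are ours.

```python
import pandas as pd

# same numbers as the fallback data in Question 6
data = {'ARG': [137.5, 106.0, 61.8, 47.0, 39.1, 37.3, 48.6],
        'DEU': [59.2, 64.6, 66.3, 64.9, 80.3, 79.0, 73.1],
        'GRC': [98.1, 94.9, 102.9, 108.8, 145.7, 156.5, 177.2],
        'Year': [2002, 2004, 2006, 2008, 2010, 2012, 2014]}
debt = pd.DataFrame(data).set_index('Year')

rn = {'ARG': 'Argentina', 'DEU': 'Germany', 'GRC': 'Greece'}
renamed = debt.rename(columns=rn)    # returns a new DataFrame

print(list(debt.columns))            # original labels are unchanged
print(list(renamed.columns))         # new labels live in the returned object

country_means = renamed.mean()           # default axis=0: one mean per column (country)
year_means = renamed.mean(axis=1)        # axis=1: one mean per row (year)
overall_mean = renamed.mean().mean()     # mean of the country means

print(country_means)
print(year_means)
print(overall_mean)
```

The mean of the country means equals the mean over all entries here only because every country has the same number of observations and none are missing.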
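Question 9.
* Create figure and axis objects with plt.subplots().
* Graph public debt over time using our debt data and the axis object we just created.
* Change the line width to 2.
* Change the colors to ['red', 'green', 'blue'].
* Change the lower limit on the y axis to zero.
* Add a title to the graph.
* Add a label to the y axis.
* Make the line for Argentina thicker than the others by plotting a separate line applied to the same axis object.

In each case, create a code cell that delivers the answer. Please write the question number in a comment in each cell.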
In [28]:
fig, ax = plt.subplots()
debt.plot(ax=ax,
          linewidth=2,
          color=['red', 'green', 'blue'])
ax.set_ylim(0)
ax.set_title('Public debt')
ax.set_ylabel('Public Debt (% of GDP)')
debt['Argentina'].plot(ax=ax, linewidth=4, color='red')
Out[28]:
In [ ]:
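Question 10. In the figure of the previous question:
* Change the style to fivethirtyeight.
* Make the lines dashed.
* Set the font size of the title to 14 and move the title to the right.
* Put the legend in the lower left corner.
* Save the figure as a pdf file.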
In [29]:
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots()
debt.plot(ax=ax,
          linewidth=2,
          linestyle='dashed',
          color=['red', 'green', 'blue'])
ax.set_title('Public debt', fontsize=14, loc='right')
ax.set_ylabel('Public Debt (% of GDP)')
ax.legend(loc='lower left')
fig.savefig('debt.pdf')
In [ ]:
Question 11. We ran across this one in the OECD healthcare data. The country names had numbers appended, which served as footnotes in the original spreadsheet but looked dumb when we used them as index labels. The question is how to eliminate them. A short version of the country names is
names = ['Australia 1', 'Canada 2', 'Chile 3', 'United States 1']
Do each of these in a separate code cell:
* Apply the rsplit() method to us = names[-1]. What do you get?
* Use rsplit to split us into two pieces, the country name and the number 1. How would you extract just the country name?
* Use a loop to strip the numbers from all elements of names.
* Use a list comprehension to strip the numbers from all elements of names.

Hints. rsplit means split from the right. One input is the number of splits.
In [30]:
names = ['Australia 1', 'Canada 2', 'Chile 3', 'United States 1']
In [31]:
names[-1].rsplit()
Out[31]:
In [32]:
names[-1].rsplit(maxsplit=1)
Out[32]:
In [33]:
# apologies, this is harder than we thought
for n in range(len(names)):
    item = names[n]
    names[n] = item.rsplit(maxsplit=1)[0]
print(names)
In [34]:
# this one's easier
names = ['Australia 1', 'Canada 2', 'Chile 3', 'United States 1']
[item.rsplit(maxsplit=1)[0] for item in names]
Out[34]:
In [ ]:
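The hint about the number of splits is easiest to see by comparing a few calls side by side. The sketch below also adds a guarded version, since rsplit(maxsplit=1)[0] would cut a multi-word name that has no footnote number. The helper strip_footnote and the test name 'New Zealand' are our own illustrations, not part of the original data.

```python
us = 'United States 1'

print(us.rsplit())             # ['United', 'States', '1']
print(us.rsplit(maxsplit=1))   # ['United States', '1']
print(us.split(maxsplit=1))    # ['United', 'States 1']


def strip_footnote(name):
    """Drop a trailing footnote number if there is one (hypothetical helper)."""
    pieces = name.rsplit(maxsplit=1)
    # only strip when the last piece really is a number
    if len(pieces) == 2 and pieces[1].isdigit():
        return pieces[0]
    return name


names = ['Australia 1', 'Canada 2', 'Chile 3', 'United States 1']
print([strip_footnote(item) for item in names])
print(strip_footnote('New Zealand'))   # left unchanged: no footnote number
```

With no maxsplit, rsplit() and split() give the same list; the direction only matters once the number of splits is limited.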