Data Bootcamp: Exam practice & review (answers)

We review the material we've covered to date: Python fundamentals, data input with Pandas, and graphics with Matplotlib. Questions marked Bonus are more difficult and are there to give the experts something to do.

This IPython notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course Data Bootcamp.

This version was modified by (add your name in bold here). Also add your initials to the notebook's name at the top.

Preliminaries

Import packages and check the date.


In [1]:
# import packages 
import pandas as pd                   # data management
import matplotlib.pyplot as plt       # graphics 

# IPython command, puts plots in notebook 
%matplotlib inline

# check Python version 
import datetime as dt 
import sys
print('Today is', dt.date.today())
print('What version of Python are we running? \n', sys.version, sep='')


Today is 2016-03-28
What version of Python are we running? 
3.5.1 |Anaconda 2.5.0 (64-bit)| (default, Jan 29 2016, 15:01:46) [MSC v.1900 64 bit (AMD64)]

IPython review

We review some of the basics of IPython. You won't be asked about IPython on the exam, but since the exam is an IPython notebook, it's essential for you to be able to work with one.

Question 1.

  1. How do you set/choose the current cell?
  2. How do you edit the current cell?
  3. How do you add a new cell below the current cell?
  4. How do you specify the current cell as code or text?
  5. How do you delete the current cell?
  6. How do you move the current cell up or down?
  7. How do you run the current cell?
  8. Add your name in bold to the bottom of the first cell in this notebook. Bonus: Add a link to your LinkedIn or Facebook page.
  9. How do you save the contents of your notebook?

Answers. Enter your answers below:

  1. click on it
  2. click again
  3. click on the plus (+) at the top
  4. choose the appropriate one in the menu below Help at the top
  5. choose the cell and click on the scissors at the top
  6. choose the cell and click on the up or down arrow at the top
  7. two ways: click on the run cell icon at the top, or shift-enter
  8. **name**
  9. two ways: File, Save and Checkpoint, or Ctrl-S.

Python fundamentals

Question 2. Describe the type and content of these expressions:

  1. x = 2
  2. y = 3.0
  3. z = "3.0"
  4. x/y
  5. letters = 'abcd'
  6. letters[-1]
  7. xyz = [x, y, z]
  8. xyz[1]
  9. abcd = list(letters)
  10. abcd[-2]
  11. case = {'a': 'A', 'b': 'B', 'c': 'C'}
  12. case['c']
  13. 2 >= 1
  14. x == 2

Answers. Enter your answers below:

By content and type we mean the content of the variable or expression and its type as returned by the type() function.

  1. content 2, type int
  2. content 3.0, type float
  3. content '3.0', type str
  4. content 0.6666666666666666 (two-thirds), type float
  5. content 'abcd', type str
  6. content 'd', type str
  7. content [2, 3.0, '3.0'] (the values of x, y, and z), type list
  8. content 3.0 (the value of y), type float
  9. content ['a', 'b', 'c', 'd'], type list
  10. content 'c', type str
  11. content as stated, type dictionary or dict
  12. content 'C', type str
  13. content True, type bool
  14. content True, type bool
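
One way to check these answers is to evaluate each expression and print its value and type. The sketch below does that. The list name checks and the loop are ours, not part of the question.

```python
# definitions from Question 2
x = 2
y = 3.0
z = "3.0"
letters = 'abcd'
xyz = [x, y, z]
abcd = list(letters)
case = {'a': 'A', 'b': 'B', 'c': 'C'}

# (label, value) pairs for the expressions we want to inspect
checks = [('x/y', x/y),
          ('letters[-1]', letters[-1]),
          ('xyz', xyz),
          ('xyz[1]', xyz[1]),
          ('abcd[-2]', abcd[-2]),
          ("case['c']", case['c']),
          ('2 >= 1', 2 >= 1),
          ('x == 2', x == 2)]

for label, value in checks:
    # repr() shows quotes around strings, so '3.0' and 3.0 look different
    print(label, '->', repr(value), type(value))
```

Using repr() matters here: a plain print of z shows 3.0, which hides the fact that it is a string.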

In [2]:
# code cell for experimenting

Question 3. These get progressively more difficult:

  1. What type is dollars = '$1,234.5'?
  2. Find and apply a method that eliminates the dollar sign from dollars.
  3. Find and apply a method that eliminates the comma from dollars.
  4. Eliminate both the dollar sign and comma from dollars and convert the result to a float.
  5. Combine the last three steps in one line.

In each case, create a code cell that delivers the answer. Please write the question number in a comment in each cell.


In [3]:
dollars = '$1,234.5'
type(dollars)


Out[3]:
str

In [4]:
dollars = dollars.replace('$','')
dollars


Out[4]:
'1,234.5'

In [5]:
dollars = dollars.replace(',','')
dollars


Out[5]:
'1234.5'

In [6]:
dollars = float(dollars)
dollars


Out[6]:
1234.5

In [7]:
# we can glue the pieces together 
dollars = '$1,234.5'
dollars = float(dollars.replace('$','').replace(',',''))
dollars


Out[7]:
1234.5
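
If we need the same conversion for many strings, we can wrap the one-liner in a function. This is only a sketch: the function name to_float and the sample list are hypothetical, and it assumes every input looks like '$1,234.5' (an optional dollar sign, with commas as thousands separators).

```python
def to_float(text):
    """Strip dollar signs and commas from text, then convert to float."""
    return float(text.replace('$', '').replace(',', ''))

samples = ['$1,234.5', '$12', '1,000,000.25']   # made-up examples
values = [to_float(s) for s in samples]
print(values)
```

Inputs in other formats, such as negative amounts written in parentheses, would need extra handling that this sketch does not attempt.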

Question 4.

For this problem we set letters = 'abcd' as in problem 2.

  1. Find and apply a method that converts the lower case letter 'a' to the upper case letter 'A'.
  2. Write a loop that goes through the elements of letters and prints their upper case versions.
  3. Bonus: Write a loop that goes through the elements of letters. On each iteration, print a string consisting of the upper and lower case versions together; e.g., 'Aa'.

In each case, create a code cell that delivers the answer. Please write the question number in a comment in each cell.


In [8]:
'a'.upper()


Out[8]:
'A'

In [9]:
letters = 'abcd'
for letter in letters:
    print(letter.upper())


A
B
C
D

In [10]:
letters = 'abcd'
for letter in letters:
    print(letter.upper()+letter)


Aa
Bb
Cc
Dd

Question 5.

For this problem, xyz is the same as defined in problem 2.

  1. Write a loop that goes through the elements of xyz and prints them.
  2. Modify the loop to print both the elements of xyz and their type.
  3. Modify the loop to print only those elements that are not strings.

In each case, create a code cell that delivers the answer. Please write the question number in a comment in each cell.


In [11]:
xyz = [2, 3.0, '3.0']     # x, y, z from problem 2

for item in xyz:
    print(item)


2
3.0
3.0

In [12]:
for item in xyz:
    print(item, type(item))


2 <class 'int'>
3.0 <class 'float'>
3.0 <class 'str'>

In [13]:
for item in xyz:
    if type(item) != str:
        print(item, type(item))


2 <class 'int'>
3.0 <class 'float'>
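
A common alternative to comparing types directly is the built-in isinstance() function. The sketch below assumes xyz as defined in problem 2, with the string '3.0' as its last element. The name non_strings is ours.

```python
xyz = [2, 3.0, '3.0']     # x, y, z from problem 2

non_strings = []
for item in xyz:
    # isinstance(item, str) is True when item is a string
    if not isinstance(item, str):
        non_strings.append(item)
        print(item, type(item))
```

Collecting the results in a list makes it easy to check that only the int and the float survive the filter.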

Data input with Pandas

We explore the public indebtedness of Argentina (country code ARG), Germany (DEU), and Greece (GRC). For each one, we provide the ratio of government debt to GDP for every second year starting in 2002. The data come from the IMF's World Economic Outlook.

Question 6. Write code in the cell below that reads the csv file we posted at

http://pages.stern.nyu.edu/~dbackus/Data/debt.csv

Assign the contents of the file to the object debt.

The rest of the questions in this notebook will refer to the object debt you create below.


In [14]:
url = 'http://pages.stern.nyu.edu/~dbackus/Data/debt.csv'
debt = pd.read_csv(url)
debt.tail(3)


Out[14]:
     ARG   DEU    GRC  Year
4   39.1  80.3  145.7  2010
5   37.3  79.0  156.5  2012
6   48.6  73.1  177.2  2014

In [15]:
# if that failed, you can generate the same data with   
data = {'ARG': [137.5, 106.0, 61.8, 47.0, 39.1, 37.3, 48.6], 
        'DEU': [59.2, 64.6, 66.3, 64.9, 80.3, 79.0, 73.1],   
        'GRC': [98.1, 94.9, 102.9, 108.8, 145.7, 156.5, 177.2],
        'Year': [2002, 2004, 2006, 2008, 2010, 2012, 2014]}  
debt = pd.DataFrame(data)
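
The download and the hand-built fallback can be combined so that the fallback runs automatically when the read fails. This is a sketch, not part of the original answer: the function name load_debt is hypothetical, and we assume a failed read raises either OSError (which covers missing files and urllib's URL errors) or a pandas parser error.

```python
import pandas as pd

# the same numbers as in the fallback cell
data = {'ARG': [137.5, 106.0, 61.8, 47.0, 39.1, 37.3, 48.6],
        'DEU': [59.2, 64.6, 66.3, 64.9, 80.3, 79.0, 73.1],
        'GRC': [98.1, 94.9, 102.9, 108.8, 145.7, 156.5, 177.2],
        'Year': [2002, 2004, 2006, 2008, 2010, 2012, 2014]}

def load_debt(source):
    """Read the csv at source; if that fails, build the DataFrame by hand."""
    try:
        return pd.read_csv(source)
    except (OSError, pd.errors.ParserError):
        return pd.DataFrame(data)

# a file name that should not exist, used here only to exercise the fallback
debt = load_debt('no_such_debt_file.csv')
print(debt.shape)
```

In the notebook you would call load_debt(url) with the url defined in the previous cell.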

Question 7. Let's describe the object debt:

  1. What type of object is debt?
  2. What are its dimensions?
  3. What are its column labels? Row labels?
  4. What dtypes are the columns?

In each case, create a code cell that delivers the answer. Please write the question number in a comment in each cell.


In [16]:
type(debt)


Out[16]:
pandas.core.frame.DataFrame

In [17]:
debt.shape


Out[17]:
(7, 4)

In [18]:
debt.columns


Out[18]:
Index(['ARG', 'DEU', 'GRC', 'Year'], dtype='object')

In [19]:
debt.index


Out[19]:
Int64Index([0, 1, 2, 3, 4, 5, 6], dtype='int64')

In [20]:
debt.dtypes


Out[20]:
ARG     float64
DEU     float64
GRC     float64
Year      int64
dtype: object

Question 8. Do the following with debt:

  1. Set Year as the index.
  2. Change the column labels from country codes to country names. Do this using both a dictionary and a list.
  3. Print the result to verify your changes.

The next three get progressively more difficult:

  1. Compute the mean (average) debt for each country.
  2. Bonus: Compute the mean debt for each year.
  3. Bonus: Compute the mean debt over both countries and years.

Some simple plots:

  1. Plot each country's debt against Year using a plot method.
  2. Change the linewidth to 2.

In each case, create a code cell that delivers the answer. Please write the question number in a comment in each cell.


In [22]:
debt = debt.set_index('Year')

In [23]:
rn = {"ARG": "Argentina", "DEU": "Germany", "GRC": "Greece"}
debt.rename(columns=rn)


Out[23]:
      Argentina  Germany  Greece
Year
2002      137.5     59.2    98.1
2004      106.0     64.6    94.9
2006       61.8     66.3   102.9
2008       47.0     64.9   108.8
2010       39.1     80.3   145.7
2012       37.3     79.0   156.5
2014       48.6     73.1   177.2
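
Note that rename() returns a new DataFrame and leaves the original unchanged unless we assign the result. The sketch below illustrates this with a two-row slice of the data. The names small and renamed are ours.

```python
import pandas as pd

small = pd.DataFrame({'ARG': [137.5, 106.0], 'DEU': [59.2, 64.6]},
                     index=[2002, 2004])
rn = {'ARG': 'Argentina', 'DEU': 'Germany'}

renamed = small.rename(columns=rn)   # new object with the new labels
print(list(small.columns))           # the original still has country codes
print(list(renamed.columns))
```

To keep the change, assign the result back, as in debt = debt.rename(columns=rn).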

In [24]:
debt.columns = ['Argentina', 'Germany', 'Greece']
debt


Out[24]:
      Argentina  Germany  Greece
Year
2002      137.5     59.2    98.1
2004      106.0     64.6    94.9
2006       61.8     66.3   102.9
2008       47.0     64.9   108.8
2010       39.1     80.3   145.7
2012       37.3     79.0   156.5
2014       48.6     73.1   177.2

In [25]:
debt.mean()


Out[25]:
Argentina     68.185714
Germany       69.628571
Greece       126.300000
dtype: float64

In [26]:
debt.mean(axis=1)


Out[26]:
Year
2002    98.266667
2004    88.500000
2006    77.000000
2008    73.566667
2010    88.366667
2012    90.933333
2014    99.633333
dtype: float64

In [27]:
debt.mean().mean()


Out[27]:
88.03809523809524
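
Taking the mean of the country means gives the overall mean here because every country has the same number of observations. The sketch below compares it with two ways of averaging over all entries directly. We rebuild debt from the numbers above so the code runs on its own. The variable names are ours.

```python
import pandas as pd

years = pd.Index([2002, 2004, 2006, 2008, 2010, 2012, 2014], name='Year')
debt = pd.DataFrame({'Argentina': [137.5, 106.0, 61.8, 47.0, 39.1, 37.3, 48.6],
                     'Germany': [59.2, 64.6, 66.3, 64.9, 80.3, 79.0, 73.1],
                     'Greece': [98.1, 94.9, 102.9, 108.8, 145.7, 156.5, 177.2]},
                    index=years)

mean_of_means = debt.mean().mean()     # average the three country means
mean_stacked = debt.stack().mean()     # stack columns into one long Series
mean_array = debt.values.mean()        # mean of the underlying array
print(mean_of_means, mean_stacked, mean_array)
```

If some entries were missing, the mean of means would weight countries equally rather than observations, so the results could differ.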

Python graphics with Matplotlib

We'll continue to use the data in debt. Make sure the index is the year.

Question 9.

  1. Create figure and axis objects with plt.subplots().
  2. Graph public indebtedness over time using our debt data and the axis object we just created.
  3. Change the line width to 2.
  4. Change the colors to ['red', 'green', 'blue'].
  5. Change the lower limit on the y axis to zero.
  6. Add a title to the graph.
  7. Add a label to the y axis -- something like "Public Debt (% of GDP)".
  8. Bonus: Make the line for Argentina thicker than the others. Hint: Do this by plotting a separate line applied to the same axis object.

In each case, create a code cell that delivers the answer. Please write the question number in a comment in each cell.


In [28]:
fig, ax = plt.subplots()
debt.plot(ax=ax,
          linewidth=2,
          color=['red', 'green', 'blue'])
ax.set_ylim(0)
ax.set_title('Public debt')
ax.set_ylabel('Public Debt (% of GDP)')
debt['Argentina'].plot(ax=ax, linewidth=4, color='red')


Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x8a18358>

Optional challenging questions

Good practice, but more than you'll see on the exam.

Question 10. In the figure of the previous question:

  1. Add a title, 14-point font, right-justified.
  2. Put the legend in the lower left corner.
  3. Change the line style to dashed. (This will take some Googling, or a good guess.)
  4. Eliminate the top and right "spines," the lines that outline the figure. [This doesn't make sense with the 538 style, which eliminates all the spines.]
  5. Save the figure as a pdf file.
  6. Change the style to 538.

In [29]:
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots()
debt.plot(ax=ax,
          linewidth=2,
          linestyle='dashed',
          color=['red', 'green', 'blue'])
ax.set_title('Public debt', fontsize=14, loc='right')
ax.set_ylabel('Public Debt (% of GDP)')
ax.legend(loc='lower left')
fig.savefig('debt.pdf')
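
The cell above skips part 4 because the 538 style already hides the spines. With the default style, one way to remove the top and right spines is to turn them off through ax.spines. This is a sketch under our own assumptions: it uses the default style, rebuilds debt so that it runs on its own, and selects the Agg backend so that it also runs outside a notebook (drop that line when working in the notebook).

```python
import matplotlib
matplotlib.use('Agg')                  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

years = pd.Index([2002, 2004, 2006, 2008, 2010, 2012, 2014], name='Year')
debt = pd.DataFrame({'Argentina': [137.5, 106.0, 61.8, 47.0, 39.1, 37.3, 48.6],
                     'Germany': [59.2, 64.6, 66.3, 64.9, 80.3, 79.0, 73.1],
                     'Greece': [98.1, 94.9, 102.9, 108.8, 145.7, 156.5, 177.2]},
                    index=years)

# use the default style for this figure only
with plt.style.context('default'):
    fig, ax = plt.subplots()
    debt.plot(ax=ax, linewidth=2, linestyle='dashed',
              color=['red', 'green', 'blue'])
    ax.spines['top'].set_visible(False)      # hide the top spine
    ax.spines['right'].set_visible(False)    # hide the right spine
    ax.legend(loc='lower left')
```

The left and bottom spines stay visible, so the axes are still outlined on the sides that carry tick labels.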



Question 11. We ran across this one in the OECD healthcare data. The country names had numbers appended, which served as footnotes in the original spreadsheet but looked dumb when we used them as index labels. The question is how to eliminate them. A short version of the country names is

names = ['Australia 1', 'Canada 2', 'Chile 3', 'United States 1']

Do each of these in a separate code cell:

  1. Apply the rsplit() method to us = names[-1]. What do you get?
  2. Consult the documentation for rsplit to split us into two pieces, the country name and the number 1. How would you extract just the country name?
  3. Use a loop to strip the numbers from all of the elements of names.
  4. Use a list comprehension to strip the numbers from all of the elements of names.

Hints. rsplit means split from the right. One input is the number of splits.


In [30]:
names = ['Australia 1', 'Canada 2', 'Chile 3', 'United States 1']

In [31]:
names[-1].rsplit()


Out[31]:
['United', 'States', '1']

In [32]:
names[-1].rsplit(maxsplit=1)


Out[32]:
['United States', '1']

In [33]:
# apologies, this is harder than we thought
for n in range(len(names)):
    item = names[n]
    names[n] = item.rsplit(maxsplit=1)[0]
    
print(names)


['Australia', 'Canada', 'Chile', 'United States']

In [34]:
# this one's easier 
names = ['Australia 1', 'Canada 2', 'Chile 3', 'United States 1']
[item.rsplit(maxsplit=1)[0] for item in names]


Out[34]:
['Australia', 'Canada', 'Chile', 'United States']
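
Since the original problem involved labels in a DataFrame, it may help to know that pandas offers vectorized string methods that do the same thing without an explicit loop. A sketch, assuming the names are held in a pandas Series (the variable names are ours):

```python
import pandas as pd

names = pd.Series(['Australia 1', 'Canada 2', 'Chile 3', 'United States 1'])

# split once from the right on whitespace, then keep the first piece
clean = names.str.rsplit(n=1).str[0]
print(clean.tolist())
```

Here n=1 plays the same role as maxsplit=1 in the string method.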
