Data Bootcamp: Exam practice & review

We review the material we've covered to date: Python fundamentals, data input with Pandas, and graphics with Matplotlib. Questions marked Bonus are more difficult and are there to give the experts something to do.

This IPython notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course Data Bootcamp.

This version was modified by (add your name in bold here). Also add your initials to the notebook's name at the top.

Preliminaries

Import packages and check the date.


In [1]:
# import packages 
import pandas as pd                   # data management
import matplotlib.pyplot as plt       # graphics 

# IPython command, puts plots in notebook 
%matplotlib inline

# check Python version 
import datetime as dt 
import sys
print('Today is', dt.date.today())
print('What version of Python are we running? \n', sys.version, sep='')


Today is 2016-03-20
What version of Python are we running? 
3.5.1 |Anaconda 2.5.0 (64-bit)| (default, Jan 29 2016, 15:01:46) [MSC v.1900 64 bit (AMD64)]

IPython review

We review some of the basics of IPython. You won't be asked about IPython on the exam, but since the exam is an IPython notebook, it's essential for you to be able to work with one.

Question 1.

  1. How do you set/choose the current cell?
  2. How do you edit the current cell?
  3. How do you add a new cell below the current cell?
  4. How do you specify the current cell as code or text?
  5. How do you delete the current cell?
  6. How do you move the current cell up or down?
  7. How do you run the current cell?
  8. Add your name in bold to the bottom of the first cell in this notebook. Bonus: Add a link to your LinkedIn or Facebook page.
  9. How do you save the contents of your notebook?

Answers. Enter your answers below:

Python fundamentals

Question 2. Describe the type and content of these expressions:

  1. x = 2
  2. y = 3.0
  3. z = "3.0"
  4. x/y
  5. letters = 'abcd'
  6. letters[-1]
  7. xyz = [x, y, z]
  8. xyz[1]
  9. abcd = list(letters)
  10. abcd[-2]
  11. case = {'a': 'A', 'b': 'B', 'c': 'C'}
  12. case['c']
  13. 2 >= 1
  14. x == 2

Answers. Enter your answers below:


In [2]:
# code cell for experimenting
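
One way to use the experimenting cell is to let Python report the answers. The sketch below is only an illustration covering a few of the expressions: it uses the built-in type() function to show each object's type next to its content. The loop and the label strings are our own scaffolding, not part of the question.

```python
# definitions taken from Question 2
x = 2
y = 3.0
z = "3.0"
letters = 'abcd'
xyz = [x, y, z]

# a few of the expressions, paired with a label for printing
checks = [('x', x), ('z', z), ('x/y', x/y),
          ('letters[-1]', letters[-1]), ('xyz[1]', xyz[1]), ('x == 2', x == 2)]

# print the label, the type, and the content of each one
for label, value in checks:
    print(label, type(value), value)
```

The remaining expressions can be checked the same way by adding them to the list.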

Question 3. These get progressively more difficult:

  1. What type is dollars = '$1,234.5'?
  2. Find and apply a method that eliminates the dollar sign from dollars.
  3. Find and apply a method that eliminates the comma from dollars.
  4. Eliminate both the dollar sign and comma from dollars and convert the result to a float.
  5. Combine the last three steps in one line.

In each case, create a code cell that delivers the answer. Please write the question number in a comment in each cell.
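
A minimal sketch of one possible approach, assuming the string method replace() is the one we want (other methods, such as strip(), could also handle the dollar sign). The variable names no_dollar, no_comma, and value are our own.

```python
dollars = '$1,234.5'

# Question 3.1: check the type
print(type(dollars))

# Question 3.2: replace the dollar sign with an empty string
no_dollar = dollars.replace('$', '')
print(no_dollar)

# Question 3.3: replace the comma with an empty string
no_comma = dollars.replace(',', '')
print(no_comma)

# Questions 3.4 and 3.5: chain the two replacements and convert to float
value = float(dollars.replace('$', '').replace(',', ''))
print(value)    # 1234.5
```

Chaining works because each replace() call returns a new string, which the next method is then applied to.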


In [ ]:

Question 4.

For this problem we set letters = 'abcd' as in Question 2.

  1. Find and apply a method that converts the lower case letter 'a' to the upper case letter 'A'.
  2. Write a loop that goes through the elements of letters and prints their upper case versions.
  3. Bonus: Write a loop that goes through the elements of letters. On each iteration, print a string consisting of the upper and lower case versions together; e.g., 'Aa'.

In each case, create a code cell that delivers the answer. Please write the question number in a comment in each cell.
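
The sketch below shows one way these steps might look, using the string method upper(). The list pairs is our own addition, there only so the results can be inspected after the loop.

```python
letters = 'abcd'

# Question 4.1: upper() returns an upper case copy of the string
print('a'.upper())

# Question 4.2: loop over the characters and print each in upper case
for letter in letters:
    print(letter.upper())

# Question 4.3 (Bonus): concatenate the upper and lower case versions
pairs = []
for letter in letters:
    pair = letter.upper() + letter
    pairs.append(pair)
    print(pair)
```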


In [ ]:

Question 5.

For this problem xyz is the same as defined in Question 2.

  1. Write a loop that goes through the elements of xyz and prints them.
  2. Modify the loop to print both the elements of xyz and their type.
  3. Modify the loop to print only those elements that are not strings.

In each case, create a code cell that delivers the answer. Please write the question number in a comment in each cell.
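
As a rough illustration, the three loops might look like the following. We use isinstance() to test for strings, though comparing type(item) with str would also work. The list non_strings is our own helper for checking the result.

```python
# xyz as defined in Question 2
x = 2
y = 3.0
z = "3.0"
xyz = [x, y, z]

# Question 5.1: print each element
for item in xyz:
    print(item)

# Question 5.2: print each element with its type
for item in xyz:
    print(item, type(item))

# Question 5.3: print only the elements that are not strings
non_strings = []
for item in xyz:
    if not isinstance(item, str):
        non_strings.append(item)
        print(item)
```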


In [ ]:

Data input with Pandas

We explore the public indebtedness of Argentina (country code ARG), Germany (DEU), and Greece (GRC). For each one, we provide the ratio of government debt to GDP for every second year starting in 2002. The data come from the IMF's World Economic Outlook.

Question 6. Write code in the cell below that reads the csv file we posted at

http://pages.stern.nyu.edu/~dbackus/Data/debt.csv

Assign the contents of the file to the object debt.

The rest of the questions in this notebook will refer to the object debt you create below.
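
A hedged sketch of how pd.read_csv might be used here. To keep the example runnable without a network connection, it reads two rows of CSV text from memory. The header layout (ARG, DEU, GRC, Year) is an assumption based on the fallback data elsewhere in this notebook, and the names csv_text and sample are hypothetical. The commented lines at the end show how the same call would take the posted URL.

```python
import io
import pandas as pd

# two rows copied from the notebook's fallback data; the header layout is assumed
csv_text = """ARG,DEU,GRC,Year
137.5,59.2,98.1,2002
106.0,64.6,94.9,2004
"""

# read_csv accepts a URL, a file path, or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)

# with the posted file, the URL goes in place of the StringIO object:
# url = 'http://pages.stern.nyu.edu/~dbackus/Data/debt.csv'
# debt = pd.read_csv(url)
```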


In [ ]:


In [3]:
# if that failed, you can generate the same data with   
data = {'ARG': [137.5, 106.0, 61.8, 47.0, 39.1, 37.3, 48.6], 
        'DEU': [59.2, 64.6, 66.3, 64.9, 80.3, 79.0, 73.1],   
        'GRC': [98.1, 94.9, 102.9, 108.8, 145.7, 156.5, 177.2],
        'Year': [2002, 2004, 2006, 2008, 2010, 2012, 2014]}  
debt = pd.DataFrame(data)

Question 7. Let's describe the object debt:

  1. What type of object is debt?
  2. What are its dimensions?
  3. What are its column labels? Row labels?
  4. What dtypes are the columns?

In each case, create a code cell that delivers the answer. Please write the question number in a comment in each cell.
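
The sketch below shows attributes that could answer these questions. It rebuilds debt from the fallback data so that it runs on its own. If you read the csv file successfully, skip the first few lines.

```python
import pandas as pd

# fallback data from this notebook
data = {'ARG': [137.5, 106.0, 61.8, 47.0, 39.1, 37.3, 48.6],
        'DEU': [59.2, 64.6, 66.3, 64.9, 80.3, 79.0, 73.1],
        'GRC': [98.1, 94.9, 102.9, 108.8, 145.7, 156.5, 177.2],
        'Year': [2002, 2004, 2006, 2008, 2010, 2012, 2014]}
debt = pd.DataFrame(data)

# Question 7.1: type of object
print(type(debt))

# Question 7.2: dimensions as (rows, columns)
print(debt.shape)

# Question 7.3: column labels and row labels
print(debt.columns)
print(debt.index)

# Question 7.4: dtype of each column
print(debt.dtypes)
```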


In [ ]:

Question 8. Do the following with debt:

  1. Set Year as the index.
  2. Change the column labels from country codes to country names. Do this using both a dictionary and a list.
  3. Print the result to verify your changes.

The next three get progressively more difficult:

  1. Compute the mean (average) debt for each country.
  2. Bonus: Compute the mean debt for each year.
  3. Bonus: Compute the mean debt over both countries and years.

Some simple plots:

  1. Plot each country's debt against Year using a plot method.
  2. Change the linewidth to 2.

In each case, create a code cell that delivers the answer. Please write the question number in a comment in each cell.
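
One possible sketch of the indexing, relabeling, and mean steps (the plots are left out). It rebuilds debt from the fallback data so that it runs on its own. The names by_dict, by_list, country_means, year_means, and overall_mean are our own.

```python
import pandas as pd

# fallback data from this notebook
data = {'ARG': [137.5, 106.0, 61.8, 47.0, 39.1, 37.3, 48.6],
        'DEU': [59.2, 64.6, 66.3, 64.9, 80.3, 79.0, 73.1],
        'GRC': [98.1, 94.9, 102.9, 108.8, 145.7, 156.5, 177.2],
        'Year': [2002, 2004, 2006, 2008, 2010, 2012, 2014]}
debt = pd.DataFrame(data)

# set Year as the index
debt = debt.set_index('Year')

# relabel columns with a dictionary mapping old labels to new ones
names = {'ARG': 'Argentina', 'DEU': 'Germany', 'GRC': 'Greece'}
by_dict = debt.rename(columns=names)

# relabel columns with a list (order must match the existing columns)
by_list = debt.copy()
by_list.columns = ['Argentina', 'Germany', 'Greece']

# print to verify
print(by_dict)

# mean for each country (down each column)
country_means = by_dict.mean()

# mean for each year (across each row)
year_means = by_dict.mean(axis=1)

# mean over all countries and years, computed on the underlying array
overall_mean = by_dict.values.mean()
print(country_means, year_means, overall_mean, sep='\n')
```

The dictionary version is safer when you only want to change some labels. The list version replaces all of them at once.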


In [ ]:

Python graphics with Matplotlib

We'll continue to use the data in debt. Make sure the index is the year.

Question 9.

  1. Create figure and axis objects with plt.subplots().
  2. Graph public indebtedness over time using our debt data and the axis object we just created.
  3. Change the line width to 2.
  4. Change the colors to ['red', 'green', 'blue'].
  5. Change the lower limit on the y axis to zero.
  6. Add a title to the graph.
  7. Add a label to the y axis -- something like "Public Debt (% of GDP)".
  8. Bonus: Make the line for Argentina thicker than the others. Hint: Do this by plotting a separate line applied to the same axis object.

In each case, create a code cell that delivers the answer. Please write the question number in a comment in each cell.
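
The steps above could be put together roughly as follows. This is a sketch, not the only way to do it: the title text is a placeholder of our own, and debt is rebuilt from the fallback data so that the cell runs on its own.

```python
import pandas as pd
import matplotlib.pyplot as plt

# fallback data from this notebook, with Year as the index
data = {'ARG': [137.5, 106.0, 61.8, 47.0, 39.1, 37.3, 48.6],
        'DEU': [59.2, 64.6, 66.3, 64.9, 80.3, 79.0, 73.1],
        'GRC': [98.1, 94.9, 102.9, 108.8, 145.7, 156.5, 177.2],
        'Year': [2002, 2004, 2006, 2008, 2010, 2012, 2014]}
debt = pd.DataFrame(data).set_index('Year')

# create figure and axis objects
fig, ax = plt.subplots()

# plot on that axis, setting line width and colors
debt.plot(ax=ax, linewidth=2, color=['red', 'green', 'blue'])

# lower limit of the y axis
ax.set_ylim(bottom=0)

# title (placeholder text) and y-axis label
ax.set_title('Government debt')
ax.set_ylabel('Public Debt (% of GDP)')

# Bonus: draw Argentina again as a thicker line on the same axis
ax.plot(debt.index, debt['ARG'], color='red', linewidth=4)
```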


In [ ]:

Optional challenging questions

Good practice, but more than you'll see on the exam.

Question 10. In the figure of the previous question:

  1. Add a title, 14-point font, right-justified.
  2. Put the legend in the lower left corner.
  3. Change the line style to dashed. (This will take some Googling, or a good guess.)
  4. Eliminate the top and right "spines," the lines that outline the figure.
  5. Save the figure as a pdf file.
  6. Change the style to 538 (Matplotlib calls this style fivethirtyeight).
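
A sketch of how these changes might be made, offered as one option among several. The title text and the file name debt.pdf are placeholders of our own, and debt is rebuilt from the fallback data so that the cell runs on its own. Note that plt.style.use changes the style for all later figures in the session.

```python
import pandas as pd
import matplotlib.pyplot as plt

# fallback data from this notebook, with Year as the index
data = {'ARG': [137.5, 106.0, 61.8, 47.0, 39.1, 37.3, 48.6],
        'DEU': [59.2, 64.6, 66.3, 64.9, 80.3, 79.0, 73.1],
        'GRC': [98.1, 94.9, 102.9, 108.8, 145.7, 156.5, 177.2],
        'Year': [2002, 2004, 2006, 2008, 2010, 2012, 2014]}
debt = pd.DataFrame(data).set_index('Year')

# style: set it before creating the figure
plt.style.use('fivethirtyeight')

fig, ax = plt.subplots()

# dashed lines
debt.plot(ax=ax, linewidth=2, linestyle='--')

# title in 14-point font, right-justified (placeholder text)
ax.set_title('Government debt', fontsize=14, loc='right')

# legend in the lower left corner
ax.legend(loc='lower left')

# hide the top and right spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# save as a pdf file (placeholder file name)
fig.savefig('debt.pdf')
```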

In [ ]:

Question 11. We ran across this one in the OECD healthcare data. The country names had numbers appended, which served as footnotes in the original spreadsheet but looked dumb when we used them as index labels. The question is how to eliminate them. A short version of the country names is

names = ['Australia 1', 'Canada 2', 'Chile 3', 'United States 1']

Do each of these in a separate code cell:

  1. Apply the rsplit() method to us = names[-1]. What do you get?
  2. Consult the documentation for rsplit to split us into two pieces, the country name and the number 1. How would you extract just the country name?
  3. Use a loop to strip the numbers from all of the elements of names.
  4. Use a list comprehension to strip the numbers from all of the elements of names.
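
One way these steps might go, sketched with the maxsplit argument of rsplit, which limits the number of splits made from the right. The names pieces, country, stripped_loop, and stripped_comp are our own.

```python
names = ['Australia 1', 'Canada 2', 'Chile 3', 'United States 1']

# Question 11.1: with no arguments, rsplit splits at every space
us = names[-1]
print(us.rsplit())          # ['United', 'States', '1']

# Question 11.2: maxsplit=1 splits once, starting from the right
pieces = us.rsplit(maxsplit=1)
country = pieces[0]         # the first piece is the country name
print(pieces, country)

# Question 11.3: the same idea in a loop
stripped_loop = []
for name in names:
    stripped_loop.append(name.rsplit(maxsplit=1)[0])
print(stripped_loop)

# Question 11.4: and as a list comprehension
stripped_comp = [name.rsplit(maxsplit=1)[0] for name in names]
print(stripped_comp)
```

Splitting from the right matters for names like 'United States 1', where a split from the left would break the country name apart.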

In [ ]:


In [ ]: