Goals of this homework: The objective of this homework is to prepare your development environment, get familiar with iPython in a Jupyter notebook, and then do some simple exercises with numpy, matplotlib, and pandas.
Submission Instructions: To submit your homework, rename this notebook as lastname_firstinitial_hw#.ipynb. For example, my homework submission would be: caverlee_j_hw0.ipynb. Submit this notebook via csnet. That's right, we're using csnet!! Your IPython notebook should be completely self-contained, with the results visible in the notebook. We should not have to run any code from the command line, nor should we have to run your code within the notebook (though we reserve the right to do so).
Late submission policy: For this homework (and this one, alone), we will not accept any late submissions. We need to get this baseline homework done so we can move on!
This homework, all subsequent homeworks, and your project will use iPython in a Jupyter notebook. iPython is just an interactive shell for programming in the Python language. A few years ago, the developers created a generic interactive shell called Jupyter that supports languages beyond just Python. This can be a bit confusing, and we'll add a bit to the confusion by referring to iPython and Jupyter interchangeably in this course.
With respect to Python, we do not expect you to have any prior experience. We do, however, expect you to have proficiency in some programming language (so you've seen loops, conditionals, functions, etc.) and a willingness to experiment and learn on your own. Python is a fun language and you should be able to pick up the necessary portions as we move along; however, this may require you to spend some extra cycles consulting online documentation, referring to a Python book, or scouring StackOverflow.
A few basic Python pointers:
We expect your code to be well-documented with appropriate comments. We prefer meaningful variable names and function names. We also expect your code to be compact and sensible -- no super-long lines, nor dense unintelligible lines of code.
In general, Python code is often run from standalone Python modules or files; for this class, we will run Python almost exclusively from within the iPython notebook.
Now that you're ready, let's take a look at this iPython notebook. You'll notice that it is composed of cells. Some cells have text (like this one), while others contain code and comments. This cell is written in Markdown, a simple text-to-HTML language. You can find a cheat sheet for Markdown here: https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet. You can toggle a cell between Markdown and code with the toggle button up there on the menu bar.
The cell below contains code and can be executed by hitting the Play button on the menu bar or by hitting shift + enter.
In [1]:
# this is a comment
# you can execute this cell by hitting shift+enter
# the output will appear immediately below
print('Hello world!')
Since iPython is just an interactive shell around Python, you can define functions. For example:
In [ ]:
def cubed(x):
    """ Return the cube of a
    value """
    return x ** 3

cubed(3)
You can even access the filesystem with commands like ls or pwd:
In [2]:
pwd
Out[2]:
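The bare pwd above works because iPython treats it as a shell-style shortcut, not because it is a Python statement. As a rough sketch, the same information is available from the standard library's os module, which also works in a standalone script. The variable names here are just illustrative.

```python
import os

# Absolute path of the directory the interpreter is running in (what pwd reports)
current_dir = os.getcwd()
print(current_dir)

# Names of the entries in that directory (roughly what ls reports), sorted for readability
entries = sorted(os.listdir(current_dir))
print(entries)
```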
In this part, we're going to get familiar with three important libraries -- Numpy, pandas, and matplotlib. You will use these libraries over and over this semester. This portion is mainly just a super quick intro to each; we expect you to go much deeper as we move forward on future homeworks.
Since Python is an interpreted language, it might not seem like the best choice for data analysis. Luckily, almost all of our data workflow stack is built on top of Numpy, a Python library that adds support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays.
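To make the point about arrays a little more concrete, here is a minimal sketch contrasting an explicit Python loop with a single vectorized NumPy expression. Both compute the same squares; the array contents and variable names are made up for illustration.

```python
import numpy as np

values = np.arange(5)  # the integers 0 through 4

# Pure-Python approach: visit each element in an explicit loop
squares_loop = [v ** 2 for v in values]

# NumPy approach: one vectorized expression applied to the whole array at once
squares_vectorized = values ** 2

print(squares_vectorized.tolist())  # [0, 1, 4, 9, 16]
```

On arrays this small the difference does not matter, but on large arrays the vectorized form is usually much faster because the loop runs inside NumPy's compiled code rather than in the interpreter.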
In [5]:
import numpy as np
print("Make a 4 row x 2 column matrix of random numbers")
x = np.random.random((4, 2))
print(x)
print("")
print("Add 10 to every element")
x = x + 10
print(x)
print("")
print("Get the element at row 3, column 1 (zero-indexed)")
print(x[3, 1])
print("")
print("Get the first row")
print(x[0, :])
print("")
Now, it's your turn. Find the maximum, minimum, and mean of the array. This does not require writing a loop. In the code cell below, type x.m
In [6]:
# your code here
x_max = x.max()
x_min = x.min()
x_mean = x.mean()
print('Max: {} Min: {} Mean: {}'.format(x_max, x_min, x_mean))
That was easy! Now, let's see if you can generate 500 numbers from a uniform distribution between 0 and 10,000, inclusive. That is, each random number could be 0, 1, 2, ..., 10,000 with equal chance. What is the maximum, minimum, and mean of these 500 random numbers? Hint: take a look at np.random
In [13]:
# your code here
rand_arr = np.random.randint(0, 10000 + 1, 500)
rand_max = rand_arr.max()
rand_min = rand_arr.min()
rand_mean = rand_arr.mean()
print('Max: {} Min: {} Mean: {}'.format(rand_max, rand_min, rand_mean))
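One detail worth spelling out: np.random.randint excludes its upper bound, which is why the call above passes 10000 + 1. A hedged alternative, assuming NumPy 1.17 or newer, is the Generator API, where endpoint=True makes the upper bound inclusive. The seed value 0 and the variable names are arbitrary choices for this sketch, not part of the assignment.

```python
import numpy as np

# Seeded generator so the draw is reproducible (the seed 0 is an arbitrary choice)
rng = np.random.default_rng(0)

# endpoint=True makes 10000 a possible value, so no "+ 1" is needed
samples = rng.integers(0, 10000, size=500, endpoint=True)

print(samples.max())
print(samples.min())
print(samples.mean())
```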
The most widespread Python plotting library is matplotlib. With it, you can create graphs, charts, basic maps, and other data visualizations. Later in the semester, we'll use seaborn, a library built on top of matplotlib that provides even more beautiful charts.
Below, we provide some simple x and y coordinates that are then plotted. You should update the plot to include:
In [27]:
# this line prepares IPython for working with matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
# your code here
x = [0, 1, 2, 3, 4]
plt.xlabel('X-axis')
y1 = [10, 12, 10, 10, 12]
y2 = [8, 9, 9, 11, 12]
plt.ylabel('Y-axis')
alice_line, = plt.plot(x, y1, label='Alice')
bob_line, = plt.plot(x, y2, label='Bob')
plt.legend(handles=[alice_line, bob_line])
Out[27]:
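The cell above drives the plot through the pyplot functions (plt.xlabel, plt.plot, and so on). As a minimal sketch, the same chart can also be built with explicit figure and axes objects, which tends to scale better once you have several plots. The title text is a made-up placeholder, and the matplotlib.use("Agg") line is only there so the snippet also runs outside a notebook; drop it when %matplotlib inline is active.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line inside a notebook
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y1 = [10, 12, 10, 10, 12]
y2 = [8, 9, 9, 11, 12]

fig, ax = plt.subplots()
ax.plot(x, y1, label="Alice")
ax.plot(x, y2, label="Bob")
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
ax.set_title("Alice vs. Bob")  # hypothetical title, chosen for illustration
ax.legend()                    # picks up the label= arguments automatically
```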
pandas is the standard Python package for dealing with tabular (or relational) data. You can think of it as a programmatic way of dealing with Excel-like data that has rows and columns.
Let's begin by loading some historical birth data that's stored in a comma-separated value (csv) format:
In [43]:
import pandas as pd
data_path='./births.csv'
# births = your code here -- should read in the csv file births.csv
births = pd.read_csv(data_path, sep=',')
# update the line above, then uncomment the line below
births.head()
Out[43]:
You see that we have data for the year, month, day, and gender. We can group data by different attributes to 'roll-up' the data:
In [44]:
grouped = births.groupby(['year', 'month', 'gender']).sum()
grouped.head()
Out[44]:
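If the roll-up is hard to picture, here is a small sketch on an invented table. The column names mirror the ones described above, except for births, which is an assumed name for the count column; the numbers are made up.

```python
import pandas as pd

# Tiny made-up table; "births" is an assumed name for the count column
toy = pd.DataFrame({
    "year":   [1969, 1969, 1969, 1970],
    "month":  [1, 1, 2, 1],
    "gender": ["F", "M", "F", "F"],
    "births": [10, 12, 7, 9],
})

# Roll up to one row per (year, gender) pair, adding the counts within each group
per_year_gender = toy.groupby(["year", "gender"])["births"].sum()
print(per_year_gender)
```

The two 1969 rows for "F" collapse into a single entry whose value is their sum, which is all that "rolling up" means here.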
Can you find all of the births on April 1 for each year?
In [93]:
# your code here
# Get births in april
aprils = births[births['month'] == 4]
# Get births on the first of april
# (compare against the number 1, not the string '1', since day is parsed as a numeric column)
april_firsts = aprils[aprils['day'] == 1]
# Group by year & then sum the men and women
april_firsts.groupby(['year', 'month']).sum()
Out[93]:
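The two filtering steps can also be combined into a single boolean mask. This is only a sketch on invented data (the births column name and all of the values are assumptions), but the same pattern should carry over to the real table.

```python
import pandas as pd

# Made-up rows; "births" is an assumed name for the count column
toy = pd.DataFrame({
    "year":   [1969, 1969, 1970, 1970],
    "month":  [4, 4, 4, 5],
    "day":    [1, 2, 1, 1],
    "births": [100, 110, 120, 130],
})

# Each comparison yields a boolean Series; & combines them row by row.
# The parentheses are required because & binds more tightly than ==.
is_april_first = (toy["month"] == 4) & (toy["day"] == 1)
april_firsts_toy = toy[is_april_first]

print(april_firsts_toy.groupby("year")["births"].sum())
```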
We can even read csv files from the web. If you are missing any of the packages below, you can install them first, then re-start this notebook.
In [94]:
# reading from a csv file from the web
import pandas as pd
import io
import requests
url = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/bob-ross/elements-by-episode.csv'
r = requests.get(url)
bob_ross_data = pd.read_csv(io.StringIO(r.content.decode('utf-8')))
bob_ross_data.head()
Out[94]:
In this case, we have the Bob Ross data we talked about in class on the first day.
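The decode-then-StringIO step in the cell above can be tried without a network connection. In this sketch, csv_text stands in for the decoded response body; its rows are invented and are not taken from the real file.

```python
import io
import pandas as pd

# Stand-in for r.content.decode('utf-8'); these rows are invented for illustration
csv_text = "EPISODE,TITLE,TREE\nS01E01,First Painting,1\nS01E02,Second Painting,0\n"

# StringIO wraps the string in a file-like object that read_csv can consume
frame = pd.read_csv(io.StringIO(csv_text))
print(frame.shape)  # (2, 3)
```

pd.read_csv can also be given a URL directly, which may be simpler when you do not need the requests response for anything else.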
How many Bob Ross paintings contain a tree?
In [101]:
# your code here
bob_ross_data.dtypes
# get the rows where TREE == 1 and count them
# (len counts rows; .size would count rows times columns)
len(bob_ross_data[bob_ross_data['TREE'] == 1])
Out[101]:
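When counting rows, keep in mind that DataFrame.size and len answer different questions. A small sketch on an invented three-row table (the titles and values are made up) shows the difference:

```python
import pandas as pd

# Invented table with 0/1 indicator columns
toy = pd.DataFrame({
    "TITLE":  ["a", "b", "c"],
    "TREE":   [1, 0, 1],
    "CLOUDS": [0, 1, 1],
})

with_tree = toy[toy["TREE"] == 1]

print(with_tree.size)           # rows times columns: 6
print(len(with_tree))           # rows only: 2
print(int(toy["TREE"].sum()))   # summing a 0/1 column also counts the rows: 2
```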