CSCE 489 :: Data Science and Analytics :: Texas A&M University :: Fall 2016

Homework 0: Welcome to the Thunderdome

100 points [5% of your final grade]

Due: Wednesday, September 7 by 11:59pm [No late submissions accepted]

Goals of this homework: The objective of this homework is to prepare your development environment, get familiar with iPython in a Jupyter notebook, and then do some simple exercises with numpy, matplotlib, and pandas.

Submission Instructions: To submit your homework, rename this notebook as lastname_firstinitial_hw#.ipynb. For example, my homework submission would be: caverlee_j_hw0.ipynb. Submit this notebook via csnet. That's right, we're using csnet!! Your IPython notebook should be completely self-contained, with the results visible in the notebook. We should not have to run any code from the command line, nor should we have to run your code within the notebook (though we reserve the right to do so).

Late submission policy: For this homework (and this one, alone), we will not accept any late submissions. We need to get this baseline homework done so we can move on!

iPython, Python, and our Expectations

This homework, all subsequent homeworks, and your project will use iPython in a Jupyter notebook. iPython is just an interactive shell for programming in the Python language. A few years ago, the developers created a generic interactive shell called Jupyter that supports languages beyond just Python. This can be a bit confusing, and we'll add a bit to the confusion by referring to iPython and Jupyter interchangeably in this course.

With respect to Python, we do not expect you to have any prior experience. We do, however, expect you to have proficiency in some programming language (so you've seen loops, conditionals, functions, etc.) and a willingness to experiment and learn on your own. Python is a fun language and you should be able to pick up the necessary portions as we move along; however, this may require you to spend some extra cycles consulting online documentation, referring to a Python book, or scouring StackOverflow.

A few basic Python pointers:

We expect your code to be well-documented with appropriate comments. We prefer meaningful variable names and function names. We also expect your code to be compact and sensible -- no super-long lines, nor dense unintelligible lines of code.

In general, Python code is often run from standalone Python modules or files; for this class, we will almost exclusively run Python from here within the iPython notebook.

Now that you're ready, let's take a look at this iPython notebook. You'll notice that it is composed of cells. Some cells have text (like this one), while others contain code and comments. This cell is written in Markdown, a simple text-to-HTML language. You can find a cheat sheet for Markdown here: https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet. You can toggle a cell between Markdown and code with the toggle button up there on the menu bar.

The cell below contains code and can be executed by hitting the Play button on the menu bar or by hitting shift + enter.


In [1]:
# this is a comment
# you can execute this cell by hitting shift+enter
# the output will appear immediately below
print 'Hello world!'


Hello world!

Since iPython is just an interactive shell around Python, you can define functions. For example:


In [ ]:
def cubed(x):
    """ Return the cube of a  
        value """
    return x ** 3

cubed(3)
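
As a small follow-up sketch (not part of the assignment), a function defined in one cell can be reused in later cells, for example inside a list comprehension. The name `cubes` is introduced here only for illustration.

```python
def cubed(x):
    """Return the cube of a value."""
    return x ** 3

# Apply the function to several inputs with a list comprehension.
cubes = [cubed(n) for n in range(1, 5)]
print(cubes)  # [1, 8, 27, 64]
```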

You can even access the filesystem with commands like ls or pwd:


In [2]:
pwd


Out[2]:
u'/Users/clayton/Google Drive/TAMU/Classes/CSCE_489/HW0'
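
If you would rather stay in plain Python than use shell-style commands, the standard `os` module offers rough equivalents. This is only a sketch; the variable names are illustrative, and the output depends on where your notebook is running.

```python
import os

# Pure-Python counterparts of the pwd and ls shell commands.
current_dir = os.getcwd()                   # like pwd
entries = sorted(os.listdir(current_dir))   # like ls

print(current_dir)
print(entries[:5])  # the first few entries, if there are any
```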

Getting Started with Numpy, pandas, and matplotlib

In this part, we're going to get familiar with three important libraries -- Numpy, pandas, and matplotlib. You will use these libraries over and over this semester. This portion is mainly just a super quick intro to each; we expect you to go much deeper as we move forward on future homeworks.

Intro to Numpy

Since Python is an interpreted language, it might not seem like the best choice for data analysis. Luckily, almost all of our data workflow stack is built on top of Numpy, a Python library that adds support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays.


In [5]:
import numpy as np

print "Make a 4 row x 2 column matrix of random numbers"
x = np.random.random((4, 2))
print x
print

print "Add 10 to every element"
x = x + 10
print x
print

print "Get the element at row 3, column 1"
print x[3, 1]
print

print "Get the first row"
print x[0, :]
print


Make a 4 row x 2 column matrix of random numbers
[[ 0.35188842  0.17449787]
 [ 0.81086327  0.37468028]
 [ 0.92644188  0.69636354]
 [ 0.88574105  0.3867065 ]]

Add 10 to every element
[[ 10.35188842  10.17449787]
 [ 10.81086327  10.37468028]
 [ 10.92644188  10.69636354]
 [ 10.88574105  10.3867065 ]]

Get the element at row 3, column 1
10.3867065001

Get the first row
[ 10.35188842  10.17449787]
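
Because the matrix above is random, its values change on every run. The following sketch repeats the same indexing operations on a fixed array (the values and the name `grid` are made up for illustration) so the results can be checked by hand. Note that rows and columns are counted from 0.

```python
import numpy as np

# A fixed 4 x 2 array: [[0 1] [2 3] [4 5] [6 7]]
grid = np.arange(8).reshape(4, 2)

last_row_second_col = grid[3, 1]   # row index 3, column index 1 -> 7
first_row = grid[0, :]             # -> [0 1]
first_col = grid[:, 0]             # -> [0 2 4 6]
shifted = grid + 10                # the scalar 10 is added to every element
```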

Now, it's your turn. Find the maximum, minimum, and mean of the array. This does not require writing a loop. In the code cell below, type x.m and press Tab to find built-in operations that may help you out.


In [6]:
# your code here
x_max = x.max()
x_min = x.min()
x_mean = x.mean()

print 'Max:', x_max, 'Min:', x_min, 'Mean:', x_mean


Max: 10.9264418797 Min: 10.1744978743 Mean: 10.5758978514
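
The same reduction methods also accept an `axis` argument, which gives per-column or per-row results without a loop. A minimal sketch on a small hand-made array (the values are illustrative, not from the homework):

```python
import numpy as np

values = np.array([[1.0, 8.0],
                   [3.0, 4.0],
                   [5.0, 2.0]])

overall_max = values.max()         # over every element -> 8.0
col_max = values.max(axis=0)       # one result per column -> [5. 8.]
row_mean = values.mean(axis=1)     # one result per row -> [4.5 3.5 3.5]

# Position (row, column) of the largest element.
max_position = np.unravel_index(values.argmax(), values.shape)
```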

That was easy! Now, let's see if you can generate 500 numbers from a uniform distribution between 0 and 10,000, inclusive. That is, each random number could be 0, 1, 2, ..., 10,000 with equal chance. What is the maximum, minimum, and mean of these 500 random numbers? Hint: take a look at np.random


In [13]:
# your code here
rand_arr = np.random.randint(0, 10000 + 1, 500)
rand_max = rand_arr.max()
rand_min = rand_arr.min()
rand_mean = rand_arr.mean()

print 'Max:', rand_max, 'Min:', rand_min, 'Mean:', rand_mean


Max: 9990 Min: 64 Mean: 5022.274
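
One detail worth spelling out: `np.random.randint(low, high)` draws from the half-open range [low, high), so the upper bound itself is never returned. The sketch below illustrates this; the seed value and variable names are arbitrary choices made here for reproducibility.

```python
import numpy as np

np.random.seed(489)  # arbitrary seed, used only to make reruns repeatable

low, high = 0, 10000
# Add 1 to the upper bound so that `high` itself can be drawn.
samples = np.random.randint(low, high + 1, size=500)

# The same rule on a tiny range: only 0 and 1 can ever appear.
coin_flips = np.random.randint(0, 2, size=1000)
```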

Intro to matplotlib

The most widespread Python plotting library is matplotlib. With it, you can create graphs, charts, basic maps, and other data visualizations. Later in the semester, we'll use seaborn, a library built on top of matplotlib that provides even more beautiful charts.

Below, we provide some simple x and y coordinates that are then plotted. You should update the plot to include:

  • a label for the x-axis (call it 'X-axis')
  • a label for the y-axis (call it 'Y-axis')
  • a label for the x, y1 curve (call it 'Alice')
  • a label for the x, y2 curve (call it 'Bob')
  • a legend in the lower right corner

In [27]:
# this line prepares IPython for working with matplotlib
%matplotlib inline  

import matplotlib.pyplot as plt  

# your code here
x = [0, 1, 2, 3, 4]  
plt.xlabel('X-axis')

y1 = [10, 12, 10, 10, 12]
y2 = [8, 9, 9, 11, 12]
plt.ylabel('Y-axis')

alice_line, = plt.plot(x, y1, label='Alice')
bob_line, = plt.plot(x, y2, label='Bob')
# place the legend in the lower right corner, as requested
plt.legend(handles=[alice_line, bob_line], loc='lower right')


Out[27]:
<matplotlib.legend.Legend at 0x10c8ec710>
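
For comparison, here is a sketch of the same plot using matplotlib's object-oriented interface, where labels and the legend are set on an explicit axes object. This is an alternative style, not the required solution; inside the notebook you would still run `%matplotlib inline` first.

```python
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y1 = [10, 12, 10, 10, 12]
y2 = [8, 9, 9, 11, 12]

fig, ax = plt.subplots()
ax.plot(x, y1, label='Alice')   # the label is what the legend displays
ax.plot(x, y2, label='Bob')
ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
legend = ax.legend(loc='lower right')
```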

Intro to Pandas

pandas is the standard Python package for dealing with tabular (or relational) data. You can think of it as a programmatic way of dealing with Excel-like data that has rows and columns.

Let's begin by loading some historical birth data that's stored in a comma-separated value (csv) format:


In [43]:
import pandas as pd
data_path = './births.csv'
# read the comma-separated file births.csv into a DataFrame
births = pd.read_csv(data_path, sep=',')
# show the first five rows
births.head()


Out[43]:
year month day gender births
0 1969 1 1 F 4046
1 1969 1 1 M 4440
2 1969 1 2 F 4454
3 1969 1 2 M 4548
4 1969 1 3 F 4548

You can see that each row gives the number of births for a given year, month, day, and gender. We can group the data by different attributes to 'roll up' the counts:


In [44]:
grouped = births.groupby(['year', 'month', 'gender']).sum()
grouped.head()


Out[44]:
births
year month gender
1969 1 F 143730
M 150210
2 F 132358
M 138428
3 F 144084
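
To see what the roll-up does on a scale you can verify by hand, the sketch below rebuilds the five rows shown by births.head() and groups them the same way. The name `sample` is introduced here, and the `births` column is selected explicitly so that only it is summed.

```python
import pandas as pd

# The five rows displayed by births.head() earlier in this notebook.
sample = pd.DataFrame({
    'year':   [1969, 1969, 1969, 1969, 1969],
    'month':  [1, 1, 1, 1, 1],
    'day':    [1, 1, 2, 2, 3],
    'gender': ['F', 'M', 'F', 'M', 'F'],
    'births': [4046, 4440, 4454, 4548, 4548],
})

# One total per (year, month, gender) combination.
by_gender = sample.groupby(['year', 'month', 'gender'])['births'].sum()
# F: 4046 + 4454 + 4548 = 13048, M: 4440 + 4548 = 8988
```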

Can you find all of the births on April 1 for each year?


In [93]:
# your code here

# Get births in april
aprils = births[births['month'] == 4]
# Get births on the first of april
april_firsts = aprils[aprils['day'] == '1']
# Group by year & then sum the men and women
april_firsts.groupby(['year', 'month']).sum()


Out[93]:
births
year month
1969 4 9960
1970 4 10002
1971 4 9756
1972 4 7558
1973 4 7495
1974 4 8550
1975 4 8871
1976 4 8336
1977 4 8997
1978 4 8081
1979 4 8126
1980 4 10146
1981 4 9860
1982 4 10118
1983 4 9982
1984 4 8202
1985 4 9834
1986 4 10569
1987 4 10212
1988 4 10298
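
The two filtering steps can also be combined into a single boolean mask. The sketch below uses a made-up mini-table (all values are hypothetical) and assumes that `day` is stored as integers; if your file stores it as text, compare against a string instead.

```python
import pandas as pd

# Hypothetical data, invented only to demonstrate the technique.
toy = pd.DataFrame({
    'year':   [2000, 2000, 2000, 2001, 2001],
    'month':  [4, 4, 5, 4, 4],
    'day':    [1, 1, 1, 1, 2],
    'gender': ['F', 'M', 'F', 'F', 'M'],
    'births': [10, 20, 30, 40, 50],
})

# Parentheses are required because & binds more tightly than ==.
is_april_first = (toy['month'] == 4) & (toy['day'] == 1)

# Sum over gender to get one April 1 total per year.
per_year = toy[is_april_first].groupby('year')['births'].sum()
```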

We can even read csv files from the web. If you are missing any of the packages below, you can install them first, then re-start this notebook.


In [94]:
# reading from a csv file from the web
import pandas as pd
import io
import requests

url = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/bob-ross/elements-by-episode.csv'
r = requests.get(url)
bob_ross_data = pd.read_csv(io.StringIO(r.content.decode('utf-8')))
bob_ross_data.head()


Out[94]:
EPISODE TITLE APPLE_FRAME AURORA_BOREALIS BARN BEACH BOAT BRIDGE BUILDING BUSHES ... TOMB_FRAME TREE TREES TRIPLE_FRAME WATERFALL WAVES WINDMILL WINDOW_FRAME WINTER WOOD_FRAMED
0 S01E01 "A WALK IN THE WOODS" 0 0 0 0 0 0 0 1 ... 0 1 1 0 0 0 0 0 0 0
1 S01E02 "MT. MCKINLEY" 0 0 0 0 0 0 0 0 ... 0 1 1 0 0 0 0 0 1 0
2 S01E03 "EBONY SUNSET" 0 0 0 0 0 0 0 0 ... 0 1 1 0 0 0 0 0 1 0
3 S01E04 "WINTER MIST" 0 0 0 0 0 0 0 1 ... 0 1 1 0 0 0 0 0 0 0
4 S01E05 "QUIET STREAM" 0 0 0 0 0 0 0 0 ... 0 1 1 0 0 0 0 0 0 0

5 rows × 69 columns
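
The `io.StringIO` wrapper is needed because `pd.read_csv` expects a path, a URL, or a file-like object rather than a raw string of CSV text. The sketch below shows the same idea without a network connection; the CSV text is a made-up stand-in for a downloaded response.

```python
import io
import pandas as pd

# Stand-in for the decoded body of an HTTP response (rows are invented).
csv_text = u"EPISODE,TITLE,TREE,WAVES\nS01E01,FIRST,1,0\nS01E02,SECOND,0,1\n"

# StringIO makes the in-memory text behave like an open file.
frame = pd.read_csv(io.StringIO(csv_text))
```

Since `pd.read_csv` also accepts a URL directly, passing `url` to it is usually a shorter option when no custom request handling is needed.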

In this case, we have the Bob Ross data we talked about in class on the first day.

How many Bob Ross paintings contain a tree?


In [101]:
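# your code here
# inspect the column types (not displayed, since it is not the last expression)
bob_ross_data.dtypes
# keep only the rows where TREE == 1, then count those rows;
# .size would count cells (rows x columns) rather than paintings
len(bob_ross_data[bob_ross_data['TREE'] == 1])


Out[101]:
361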
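
When counting matching rows, it helps to keep in mind that `DataFrame.size` is the number of cells (rows times columns), while `len` gives the number of rows. A minimal sketch on a made-up indicator table in the style of the Bob Ross data (the values are invented):

```python
import pandas as pd

toy = pd.DataFrame({
    'EPISODE': ['E1', 'E2', 'E3', 'E4'],
    'TREE':    [1, 0, 1, 1],
    'WAVES':   [0, 1, 0, 0],
})

with_tree = toy[toy['TREE'] == 1]

n_cells = with_tree.size            # 3 rows * 3 columns = 9
n_rows = len(with_tree)             # 3 matching rows
n_by_sum = int(toy['TREE'].sum())   # summing a 0/1 column also gives 3
```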
