In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# The following is optional: set plotting styles
import seaborn; seaborn.set()

Data Analysis with NumPy and Pandas

Outline:

  • Python Lists vs NumPy Arrays
  • Keys to Using NumPy
  • Pandas: Diving into the Series and DataFrame
  • Example: Names in the Wild

Python Lists vs. NumPy Arrays

While the Python language is an excellent tool for general-purpose programming, with a highly readable syntax, rich and powerful data types (strings, lists, sets, dictionaries, arbitrary-length integers, etc.) and a very comprehensive standard library, it was not designed specifically for mathematical and scientific computing. Neither the language nor its standard library has facilities for the efficient representation of multidimensional datasets, tools for linear algebra and general matrix manipulations (an essential building block of virtually all technical computing), or any data visualization facilities.

In particular, Python lists are very flexible containers that can be nested arbitrarily deep and which can hold any Python object in them, but they are poorly suited to represent efficiently common mathematical constructs like vectors and matrices. In contrast, much of our modern heritage of scientific computing has been built on top of libraries written in the Fortran language, which has native support for vectors and matrices as well as a library of mathematical functions that can efficiently operate on entire arrays at once.

Review: Working with Lists

Lists are Python's built-in containers for ordered collections of values:


In [2]:
L = [1, 2, 3, 4, 5]

In [3]:
# Zero-based Indexing
print(L[0], L[1])


1 2

In [4]:
# Indexing from the end
print(L[-1], L[-2])


5 4

In [5]:
# Slicing
L[0:3]


Out[5]:
[1, 2, 3]

In [6]:
# The 0 can be left out
L[:3]


Out[6]:
[1, 2, 3]

In [7]:
# Slicing by a step size
L[0:5:2]


Out[7]:
[1, 3, 5]

In [8]:
# Reversing with a negative step size
L[::-1]


Out[8]:
[5, 4, 3, 2, 1]

In [9]:
# Lists of multiple types
L2 = [1, 'two', 3.14]

In [10]:
# Adding lists together concatenates them:
L + L2


Out[10]:
[1, 2, 3, 4, 5, 1, 'two', 3.14]

Why are lists not good for data-intensive science?

  1. Due to Python lists' flexibility, they are memory-inefficient for storing large amounts of data (see the sketch below).
  2. Due to Python's dynamic typing, operations that loop over large lists are slow.
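
To see point 1 concretely, here is a quick sketch comparing the memory footprint of a list and an equivalent NumPy array (exact sizes vary by platform and Python version). Point 2, speed, is what the next cells demonstrate:

import sys
import numpy as np

L = list(range(1000000))
A = np.arange(1000000)

# getsizeof(L) counts only the array of pointers (~8 MB on a 64-bit
# build); each of the million separate Python int objects costs
# roughly 28 bytes more on top of that.
print(sys.getsizeof(L))

# The NumPy array stores raw 8-byte values in one contiguous buffer,
# so nbytes (~8 MB) is essentially the whole cost.
print(A.nbytes)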

In [11]:
import math

# make a large list of theta values
theta = [0.01 * i for i in range(1000000)]
sin_theta = [math.sin(t) for t in theta]
sin_theta[:10]


Out[11]:
[0.0,
 0.009999833334166664,
 0.01999866669333308,
 0.02999550020249566,
 0.03998933418663416,
 0.04997916927067833,
 0.059964006479444595,
 0.06994284733753277,
 0.0799146939691727,
 0.08987854919801104]

In [12]:
%timeit [math.sin(t) for t in theta]


10 loops, best of 3: 140 ms per loop

Let's take a look at doing essentially the same operation using NumPy.

By convention, we'll import numpy under the shorthand np:


In [13]:
import numpy as np

In [14]:
theta = 0.01 * np.arange(1E6)

sin_theta = np.sin(theta)
sin_theta[:10]


Out[14]:
array([ 0.        ,  0.00999983,  0.01999867,  0.0299955 ,  0.03998933,
        0.04997917,  0.05996401,  0.06994285,  0.07991469,  0.08987855])

In [15]:
%timeit np.sin(theta)


100 loops, best of 3: 14.7 ms per loop

NumPy's version of this is nearly 10x faster than the list-based Python version, and it is arguably simpler as well!

Keys to Using NumPy Effectively

There is a lot of information out there about how to use NumPy. Here we'll just briefly go over some of the key concepts before moving on to using Python for real-world data.

Creating Arrays

There are many, many ways to create NumPy arrays. We'll demonstrate a few here:


In [16]:
# from a list
np.array([1, 2, 3, 4])


Out[16]:
array([1, 2, 3, 4])

In [17]:
# range of numbers, like Python's range()
np.arange(0, 10, 0.5)


Out[17]:
array([ 0. ,  0.5,  1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5,  5. ,
        5.5,  6. ,  6.5,  7. ,  7.5,  8. ,  8.5,  9. ,  9.5])

In [18]:
# evenly spaced numbers between two limits
np.linspace(0, 10, 5)


Out[18]:
array([  0. ,   2.5,   5. ,   7.5,  10. ])

In [19]:
# array of zeros
np.zeros(10)


Out[19]:
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

In [20]:
# array of ones
np.ones(10)


Out[20]:
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

In [21]:
# array of random values
np.random.rand(10)


Out[21]:
array([ 0.06244294,  0.34856405,  0.83474136,  0.06727073,  0.36735175,
        0.45133983,  0.01958005,  0.08204342,  0.75092178,  0.46176409])

Operating on Arrays

Operations on numpy arrays are done element-wise. This means that you don't explicitly have to write for-loops in order to do these operations!


In [22]:
# define some arrays
x = np.arange(5)
y = np.random.random(5)

In [23]:
# addition – add 1 to each
x + 1


Out[23]:
array([1, 2, 3, 4, 5])

In [24]:
# multiplication – multiply each by 2
y * 2


Out[24]:
array([ 1.51538329,  0.45745721,  1.06958137,  1.83321707,  1.53266941])

In [25]:
# two arrays: everything is element-wise
x / y


Out[25]:
array([ 0.        ,  4.37199364,  3.73978093,  3.27293483,  5.21965137])

In [26]:
# exponentiation
np.exp(x)


Out[26]:
array([  1.        ,   2.71828183,   7.3890561 ,  20.08553692,  54.59815003])

In [27]:
# trigonometric functions
np.sin(x)


Out[27]:
array([ 0.        ,  0.84147098,  0.90929743,  0.14112001, -0.7568025 ])

In [28]:
# combining operations
np.cos(x) + np.sin(2 * np.pi * (x - y))


Out[28]:
array([ 1.99883243, -0.45077954, -0.19928728, -0.48967621,  0.34109413])
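
To make the "no explicit for-loops" point concrete, here is a sketch of what a vectorized call like np.exp(x) replaces if written out by hand. NumPy runs the equivalent loop in compiled code, which is why the vectorized form is both shorter and much faster:

import math
import numpy as np

x = np.arange(5)
result = np.empty(len(x))
for i in range(len(x)):
    result[i] = math.exp(x[i])   # one Python-level call per element

assert np.allclose(result, np.exp(x))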

Indexing and Slicing Arrays

Indexing works just like with Python lists:


In [29]:
x


Out[29]:
array([0, 1, 2, 3, 4])

In [30]:
x[0], x[1]


Out[30]:
(0, 1)

In [31]:
x[:3]


Out[31]:
array([0, 1, 2])

In [32]:
x[::2]


Out[32]:
array([0, 2, 4])

In [33]:
x[::-1]


Out[33]:
array([4, 3, 2, 1, 0])

Unlike lists, NumPy arrays can have multiple dimensions, and indexing and slicing work efficiently!


In [34]:
M = np.arange(20).reshape(4, 5)
M


Out[34]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [35]:
M[1, 2]


Out[35]:
7

In [36]:
M[:2, :2]


Out[36]:
array([[0, 1],
       [5, 6]])

In [37]:
M[:, 1:3]


Out[37]:
array([[ 1,  2],
       [ 6,  7],
       [11, 12],
       [16, 17]])
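
One reason multidimensional slicing is efficient: unlike list slicing, a NumPy slice is a view onto the same memory rather than a copy. A short sketch (using a fresh array so we don't disturb M):

A = np.arange(20).reshape(4, 5)
sub = A[:2, :2]     # a view, not a copy
sub[0, 0] = 99      # modifying the view...
print(A[0, 0])      # ...changes A as well: prints 99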

Masking

Another useful way of indexing arrays is to use masks. If we do a boolean operation on some array, the result is a boolean array:


In [38]:
M


Out[38]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [39]:
M < 8


Out[39]:
array([[ True,  True,  True,  True,  True],
       [ True,  True,  True, False, False],
       [False, False, False, False, False],
       [False, False, False, False, False]], dtype=bool)

Boolean mask arrays can be used to select portions of a larger array and to operate on them:


In [40]:
M[M < 8] = 0
M


Out[40]:
array([[ 0,  0,  0,  0,  0],
       [ 0,  0,  0,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [41]:
M[M == 12] *= 2
M


Out[41]:
array([[ 0,  0,  0,  0,  0],
       [ 0,  0,  0,  8,  9],
       [10, 11, 24, 13, 14],
       [15, 16, 17, 18, 19]])

In [42]:
M[M % 2 == 0] = 999
M


Out[42]:
array([[999, 999, 999, 999, 999],
       [999, 999, 999, 999,   9],
       [999,  11, 999,  13, 999],
       [ 15, 999,  17, 999,  19]])
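
Masks can also be combined elementwise with the bitwise operators & and |; a brief sketch (the parentheses are required by operator precedence):

A = np.arange(20).reshape(4, 5)
mask = (A > 3) & (A < 12)   # elementwise AND of two boolean arrays
A[mask]                     # array([ 4,  5,  6,  7,  8,  9, 10, 11])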

As I mentioned, there is much more to numpy arrays, but this has covered the basic pieces needed here!

Pandas: Series and Dataframes

For data-intensive work in Python, the Pandas library has become essential. (The name originally derived from "panel data", though many users probably don't know that.)

Pandas can be thought of as NumPy with built-in labels for rows and columns, but it's also much, much more than that.

Pandas provides this functionality through two fundamental object types, both built upon NumPy arrays: the Series object and the DataFrame object.

Making a Pandas Series

A Series is a basic holder for one-dimensional labeled data. It can be created much as a NumPy array is created:


In [43]:
s = pd.Series([0.1, 0.2, 0.3, 0.4])

The series has a built-in concept of an index, which by default is the numbers 0 through N - 1


In [44]:
s.index


Out[44]:
Int64Index([0, 1, 2, 3], dtype='int64')

We can access series values via the index, just like for NumPy arrays:


In [45]:
s[0]


Out[45]:
0.10000000000000001

Unlike the NumPy array, though, this index can be something other than integers:


In [46]:
s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
s2


Out[46]:
a    0
b    1
c    2
d    3
dtype: int64

In [47]:
s2['c']


Out[47]:
2

In this way, a Series object can be thought of as similar to an ordered dictionary mapping one typed value to another typed value.
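
A few of these dictionary-like behaviors, sketched using the s2 series defined above:

print('c' in s2)        # True -- membership tests the index
print(list(s2.keys()))  # ['a', 'b', 'c', 'd']
s2['e'] = 4             # assigning to a new label extends the series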

In fact, it's possible to construct a series directly from a Python dictionary:


In [48]:
pop_dict = {'California': 38332521,
            'Texas': 26448193,
            'New York': 19651127,
            'Florida': 19552860,
            'Illinois': 12882135}
populations = pd.Series(pop_dict)
populations


Out[48]:
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64

Note that the order of the resulting series need not match the order in which the dictionary was defined; here the entries have been sorted by index.

We can index or slice the populations as expected. Note that, unlike positional slicing, label-based slicing includes the final label:


In [49]:
populations['California']


Out[49]:
38332521

In [50]:
populations['California':'Illinois']


Out[50]:
California    38332521
Florida       19552860
Illinois      12882135
dtype: int64
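
If purely positional slicing is needed on a labeled series, the iloc indexer slices by integer position, with the usual exclusive endpoint (a quick sketch):

populations.iloc[0:2]   # first two rows only; endpoint excluded, as in Python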

DataFrames: Multi-dimensional Data

A DataFrame is essentially a multi-dimensional object for holding labeled data. You can think of it as multiple Series objects which share the same index.

One of the most common ways of creating a dataframe is from a dictionary of arrays or lists. Note that in the IPython notebook with the correct settings, the dataframe will display in a rich HTML view:


In [51]:
data = {'state': ['California', 'Texas', 'New York', 'Florida', 'Illinois'],
        'population': [38332521, 26448193, 19651127, 19552860, 12882135],
        'area':[423967, 695662, 141297, 170312, 149995]}
states = pd.DataFrame(data)
states


Out[51]:
     area  population       state
0  423967    38332521  California
1  695662    26448193       Texas
2  141297    19651127    New York
3  170312    19552860     Florida
4  149995    12882135    Illinois

If we don't like the default integer index, we can set one of the columns as the index instead:


In [52]:
states = states.set_index('state')
states


Out[52]:
              area  population
state
California  423967    38332521
Texas       695662    26448193
New York    141297    19651127
Florida     170312    19552860
Illinois    149995    12882135

To access a Series representing a single column of the data, use dictionary-style indexing:


In [53]:
states['area']


Out[53]:
state
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

To access a row, use the loc indexer, which selects rows by label:


In [54]:
states.loc['California']


Out[54]:
area            423967
population    38332521
Name: California, dtype: int64
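
The positional counterpart is the iloc indexer, sketched here:

states.iloc[0]   # the same row, selected by integer position rather than label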

As you play around with DataFrames, you'll notice that many operations which work on NumPy arrays will also work on dataframes.

For example, there's arithmetic. Let's compute the population density (people per square kilometer, given these areas) and add it as a column to the data:


In [55]:
states['density'] = states['population'] / states['area']
states


Out[55]:
              area  population     density
state
California  423967    38332521   90.413926
Texas       695662    26448193   38.018740
New York    141297    19651127  139.076746
Florida     170312    19552860  114.806121
Illinois    149995    12882135   85.883763

We can even use masking the way we did in NumPy:


In [56]:
states[states['density'] > 100]


Out[56]:
            area  population     density
state
New York  141297    19651127  139.076746
Florida   170312    19552860  114.806121

And we can do things like sorting the rows of the dataframe, then slicing to take the first three:


In [57]:
states.sort_values('density', ascending=False)[:3]


Out[57]:
              area  population     density
state
New York    141297    19651127  139.076746
Florida     170312    19552860  114.806121
California  423967    38332521   90.413926

One useful method is describe, which computes summary statistics for each column:


In [58]:
states.describe()


Out[58]:
                 area       population     density
count        5.000000         5.000000    5.000000
mean    316246.600000  23373367.200000   93.639859
std     242437.411951   9640385.580443   37.672251
min     141297.000000  12882135.000000   38.018740
25%     149995.000000  19552860.000000   85.883763
50%     170312.000000  19651127.000000   90.413926
75%     423967.000000  26448193.000000  114.806121
max     695662.000000  38332521.000000  139.076746

There are many, many more interesting operations that can be done on Series and DataFrame objects, but rather than continue using this toy data, we'll instead move to a real-world example, and illustrate some of the advanced concepts along the way.

Example: Names in the Wild

This example is drawn from Wes McKinney's excellent book on the Pandas library, O'Reilly's Python for Data Analysis.

We'll be taking a look at a freely available dataset: the database of names given to babies in the United States over the last century.

First things first, we need to download the data, which can be found at http://www.ssa.gov/oact/babynames/limits.html. If you uncomment the following commands, they will download and unpack it automatically (note that these are Unix shell commands; they will not work on Windows):


In [59]:
# !curl -O http://www.ssa.gov/oact/babynames/names.zip

In [60]:
# !mkdir -p data/names
# !mv names.zip data/names/
# !cd data/names/ && unzip names.zip

Now we should have a data/names directory which contains a number of text files, one for each year of data:


In [61]:
!ls data/names


NationalReadMe.pdf yob1913.txt        yob1947.txt        yob1981.txt
yob1880.txt        yob1914.txt        yob1948.txt        yob1982.txt
yob1881.txt        yob1915.txt        yob1949.txt        yob1983.txt
yob1882.txt        yob1916.txt        yob1950.txt        yob1984.txt
yob1883.txt        yob1917.txt        yob1951.txt        yob1985.txt
yob1884.txt        yob1918.txt        yob1952.txt        yob1986.txt
yob1885.txt        yob1919.txt        yob1953.txt        yob1987.txt
yob1886.txt        yob1920.txt        yob1954.txt        yob1988.txt
yob1887.txt        yob1921.txt        yob1955.txt        yob1989.txt
yob1888.txt        yob1922.txt        yob1956.txt        yob1990.txt
yob1889.txt        yob1923.txt        yob1957.txt        yob1991.txt
yob1890.txt        yob1924.txt        yob1958.txt        yob1992.txt
yob1891.txt        yob1925.txt        yob1959.txt        yob1993.txt
yob1892.txt        yob1926.txt        yob1960.txt        yob1994.txt
yob1893.txt        yob1927.txt        yob1961.txt        yob1995.txt
yob1894.txt        yob1928.txt        yob1962.txt        yob1996.txt
yob1895.txt        yob1929.txt        yob1963.txt        yob1997.txt
yob1896.txt        yob1930.txt        yob1964.txt        yob1998.txt
yob1897.txt        yob1931.txt        yob1965.txt        yob1999.txt
yob1898.txt        yob1932.txt        yob1966.txt        yob2000.txt
yob1899.txt        yob1933.txt        yob1967.txt        yob2001.txt
yob1900.txt        yob1934.txt        yob1968.txt        yob2002.txt
yob1901.txt        yob1935.txt        yob1969.txt        yob2003.txt
yob1902.txt        yob1936.txt        yob1970.txt        yob2004.txt
yob1903.txt        yob1937.txt        yob1971.txt        yob2005.txt
yob1904.txt        yob1938.txt        yob1972.txt        yob2006.txt
yob1905.txt        yob1939.txt        yob1973.txt        yob2007.txt
yob1906.txt        yob1940.txt        yob1974.txt        yob2008.txt
yob1907.txt        yob1941.txt        yob1975.txt        yob2009.txt
yob1908.txt        yob1942.txt        yob1976.txt        yob2010.txt
yob1909.txt        yob1943.txt        yob1977.txt        yob2011.txt
yob1910.txt        yob1944.txt        yob1978.txt        yob2012.txt
yob1911.txt        yob1945.txt        yob1979.txt        yob2013.txt
yob1912.txt        yob1946.txt        yob1980.txt

Let's take a quick look at one of these files:


In [62]:
!head data/names/yob1880.txt


Mary,F,7065
Anna,F,2604
Emma,F,2003
Elizabeth,F,1939
Minnie,F,1746
Margaret,F,1578
Ida,F,1472
Alice,F,1414
Bertha,F,1320
Sarah,F,1288
Each file is just a comma-separated list of names, genders, and the number of babies given that name in that year.

We can load these files using pd.read_csv, which is specifically designed for this:


In [63]:
names1880 = pd.read_csv('data/names/yob1880.txt')
names1880.head()


Out[63]:
        Mary  F  7065
0       Anna  F  2604
1       Emma  F  2003
2  Elizabeth  F  1939
3     Minnie  F  1746
4   Margaret  F  1578

Oops! Something went wrong. read_csv tried to be smart and used the first line of the file as column headers. Let's fix this by specifying the column names manually:


In [64]:
names1880 = pd.read_csv('data/names/yob1880.txt',
                        names=['name', 'gender', 'births'])
names1880.head()


Out[64]:
        name gender  births
0       Mary      F    7065
1       Anna      F    2604
2       Emma      F    2003
3  Elizabeth      F    1939
4     Minnie      F    1746

That looks better. Now we can start playing with the data a bit.

GroupBy: aggregates on values

First let's think about how we might count the total number of females and males born in the US in 1880.

If you're used to NumPy, you might be tempted to use masking: first build a boolean mask for each gender, then use it to select the corresponding subset of the data:


In [65]:
males = names1880[names1880.gender == 'M']
females = names1880[names1880.gender == 'F']

Now we can take the sum of the births for each of these:


In [66]:
males.births.sum(), females.births.sum()


Out[66]:
(110491, 90993)

But there's an easier way to do this, using one of Pandas' very powerful features: groupby:


In [67]:
grouped = names1880.groupby('gender')
grouped


Out[67]:
<pandas.core.groupby.DataFrameGroupBy object at 0x10eefb890>

This grouped object is an abstract representation of the data, split on the given column. To actually compute something, we need to specify an aggregation to apply across each group. In this case, what we want is the sum:


In [68]:
grouped.sum()


Out[68]:
        births
gender
F        90993
M       110491

We can do other aggregations as well:


In [69]:
grouped.size()


Out[69]:
gender
F          942
M         1058
dtype: int64

In [70]:
grouped.mean()


Out[70]:
            births
gender
F        96.595541
M       104.433837
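
Several aggregations can also be computed in one pass with the agg method; a sketch that returns one column per aggregation function:

grouped['births'].agg(['sum', 'mean', 'max'])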

Or, if we wish, we can get a description of the grouping:


In [71]:
grouped.describe()


Out[71]:
                  births
gender
F      count   942.000000
       mean     96.595541
       std     328.152904
       min       5.000000
       25%       7.000000
       50%      13.000000
       75%      43.750000
       max    7065.000000
M      count  1058.000000
       mean    104.433837
       std     561.232488
       min       5.000000
       25%       7.000000
       50%      12.000000
       75%      41.000000
       max    9655.000000

Concatenating multiple data sources

But so far we've only been looking at a single year. Let's put together all the data from all the years, using Pandas' concat function to concatenate everything into a single dataframe. First we'll write a function which loads one year of data, as we did above:


In [72]:
def load_year(year):
    data = pd.read_csv('data/names/yob{0}.txt'.format(year),
                       names=['name', 'gender', 'births'])
    data['year'] = year
    return data

Now let's load all the data into a list, and call pd.concat on that list:


In [73]:
names = pd.concat([load_year(year) for year in range(1880, 2014)])
names.head()


Out[73]:
        name gender  births  year
0       Mary      F    7065  1880
1       Anna      F    2604  1880
2       Emma      F    2003  1880
3  Elizabeth      F    1939  1880
4     Minnie      F    1746  1880

It looks like we've done it!

Let's start with something easy: we'll use groupby again to see the total number of births per year:


In [74]:
births = names.groupby('year').births.sum()
births.head()


Out[74]:
year
1880    201484
1881    192700
1882    221537
1883    216952
1884    243468
Name: births, dtype: int64

We can use the plot() method to see a quick plot of these (note that because we used the %matplotlib inline magic at the start of the notebook, the resulting plot will be shown inline within the notebook).


In [75]:
births.plot();


The so-called "baby boom" generation after the Second World War is abundantly clear!

We can also use other aggregates: let's count how many distinct name entries appear each year:


In [76]:
names.groupby('year').births.count().plot();


Apparently there's been a huge increase in the diversity of names over time!

groupby can also be used to add columns to the data: we split the data on some key, compute a new column within each group, and recombine the results. Let's add a column giving the frequency of each name within each year & gender:


In [77]:
def add_frequency(group):
    group['birth_freq'] = group.births / group.births.sum()
    return group

names = names.groupby(['year', 'gender']).apply(add_frequency)
names.head()


Out[77]:
        name gender  births  year  birth_freq
0       Mary      F    7065  1880    0.077643
1       Anna      F    2604  1880    0.028618
2       Emma      F    2003  1880    0.022013
3  Elizabeth      F    1939  1880    0.021309
4     Minnie      F    1746  1880    0.019188

Notice that the apply() function iterates over each group and calls a function which modifies the group. The results are then re-assembled into a container which looks like the original dataframe.
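
As an aside, the same column can be computed without a Python-level function by using groupby's transform method, which broadcasts each group's aggregate back to the shape of the original data. A sketch:

totals = names.groupby(['year', 'gender']).births.transform('sum')
names['birth_freq'] = names.births / totals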

Pivot Tables

Next we'll discuss Pivot Tables, which are an even more powerful way of (re)organizing your data.

Let's say that we want to plot the men and women separately. We could do this by using masking, as follows:


In [78]:
men = names[names.gender == 'M']
women = names[names.gender == 'F']

And then we could proceed as above, using groupby to group on the year. But we would end up with two different views of the data. A better way to do this is to use a pivot_table, which is essentially a groupby in multiple dimensions at once:


In [79]:
births = names.pivot_table('births',
                           index='year', columns='gender',
                           aggfunc=sum)
births.head()


Out[79]:
gender       F       M
year
1880     90993  110491
1881     91954  100746
1882    107850  113687
1883    112322  104630
1884    129022  114446
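
To see the connection with groupby, here is a sketch of the same table built by grouping on both keys and then unstacking the inner index level into columns:

names.groupby(['year', 'gender']).births.sum().unstack()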

Note that this has grouped the index by the value of year, and grouped the columns by the value of gender. Let's plot the results now:


In [80]:
births.plot(title='Total Births');


Name Evolution Over Time

Some names have shifted over time from being boys' names to being girls' names. Let's take a look at one of these:


In [81]:
names_to_check = ['Allison', 'Alison']

# filter on just the names we're interested in
births = names[names.name.isin(names_to_check)]

# pivot table to get year vs. gender
births = births.pivot_table('births', index='year', columns='gender')

# fill all NaNs with zeros
births = births.fillna(0)

# normalize each year's counts to gender fractions (each row sums to 1)
births = births.div(births.sum(1), axis=0)

births.plot(title='Fraction of babies named Allison');


We can see that prior to about 1905, all babies named Allison were male. Over the 20th century this reversed, until, by the end of the century, nearly all Allisons were female!

There's some noise in this data: we can smooth it out a bit by using a 5-year rolling mean:


In [82]:
births.rolling(5).mean().plot(title="Allisons: 5-year moving average");


This gives a smoother picture of the transition, and is an example of the bias/variance tradeoff that we'll often see in modeling: a smoother model has less variance (variation due to sampling or other noise), but at the expense of more bias (the model systematically misrepresents the data slightly).

We'll discuss this type of tradeoff more in coming sessions.

Where to Find More

We've just scratched the surface of what can be done with Pandas, but we'll get a chance to play with this more in the breakout session coming up.

For more information on using Pandas, check out the pandas documentation or the book Python for Data Analysis by Pandas creator Wes McKinney.