Agenda

  • Numpy
  • Pandas
  • Lab

Introduction

Download this notebook for our code-along:

From our class GitHub repository "notebooks" directory, download the file "02_numpy_and_pandas.ipynb". Move this to a convenient place where you are grouping your class related materials.

Introduction to Numpy (pronounced "num-pie")

  • Overview
  • ndarray
  • Indexing and Slicing

More info: http://wiki.scipy.org/Tentative_NumPy_Tutorial

Numpy Overview

  • Why use the Numpy package for data?
    • It makes our data science life easier!!
      • Numpy brings decades of C math into Python!
      • Numpy wraps extensive C/C++/Fortran codebases that provide much of our data analysis functionality
      • The ndarray enables easy vectorized math and broadcasting (element-wise operations on arrays of different shapes; see the short broadcasting sketch after the arithmetic examples below)

In [1]:
# from numpy import *  # (commented out) would load every NumPy name into the namespace; see the note below
%matplotlib inline
import numpy as np

A Note About "import as"

from numpy import *

This imports every public NumPy name directly into the current namespace. Note: in deployed code this pollutes the namespace and makes it unclear where functions come from. We show it here only as an example; it can be fine for quick exploration and legibility while learning.

As we'll see, the proper convention is to use:

import numpy as np

And then to call the functions we need through the np prefix:
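
np.arange(0, 10)   # e.g. array([0, 1, ..., 9]); every NumPy call goes through the np prefix
np.zeros((2, 2))   # a 2x2 array of zeros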

Creating ndarrays

An array object represents a multidimensional, homogeneous array of fixed-size items.


In [2]:
# Creating arrays
a = np.zeros((3))
b = np.ones((2,3))
c = np.random.randint(1,10,(2,3,4))
d = np.arange(0,11,1)

What are these functions? Append a question mark in IPython/Jupyter to pull up the docstring:

np.arange?

In [3]:
# Note the way each array is printed:
print("a: ",a)
print("b: ",b)
print("c: ",c)
print("d: ",d)


a:  [0. 0. 0.]
b:  [[1. 1. 1.]
 [1. 1. 1.]]
c:  [[[3 5 2 5]
  [1 2 4 8]
  [3 1 6 1]]

 [[1 3 3 4]
  [8 4 3 2]
  [6 3 9 6]]]
d:  [ 0  1  2  3  4  5  6  7  8  9 10]
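
Every ndarray also carries attributes describing its layout. A quick inspection of the arrays created above (the exact integer dtype depends on your platform):

c.shape   # (2, 3, 4): three dimensions
c.ndim    # 3
c.dtype   # e.g. dtype('int64')
c.size    # 24 elements in total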

Arithmetic on arrays is element-wise


In [4]:
a = np.array([20, 30, 40, 50])
b = np.arange(4)
b


Out[4]:
array([0, 1, 2, 3])

In [5]:
c = a - b
c


Out[5]:
array([20, 29, 38, 47])

In [6]:
b**2


Out[6]:
array([0, 1, 4, 9])
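
Broadcasting extends element-wise arithmetic to arrays of different but compatible shapes: NumPy stretches the smaller operand across the larger one. A minimal sketch with made-up values:

m = np.ones((3, 3))       # 3x3 array of ones
v = np.array([1, 2, 3])   # shape (3,)
m + v                     # v is broadcast across each row, giving rows of [2., 3., 4.]
m * 10                    # a scalar broadcasts to every element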

Indexing, Slicing and Iterating


In [7]:
# one-dimensional arrays work like lists:
a = np.arange(10)**2

In [8]:
a


Out[8]:
array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

In [9]:
a[2:5]


Out[9]:
array([ 4,  9, 16])

Multidimensional arrays are indexed with comma-separated tuples,

using (row, column) conventions and counting, as always in Python, from 0


In [10]:
b = np.random.randint(1,100,(4,4))

In [11]:
b


Out[11]:
array([[17, 59, 98, 10],
       [66, 74, 35, 74],
       [52, 94, 57, 49],
       [58, 64, 16, 56]])

In [12]:
# Guess the output
print(b[2,3])


49

In [13]:
# Guess the output
print(b[0,0])


17

In [14]:
# Guess the output
b[0:3,1]


Out[14]:
array([59, 74, 94])

In [15]:
#Guess the output
b[:,1]


Out[15]:
array([59, 74, 94, 64])

In [16]:
#Guess the output
b[1:3,:]


Out[16]:
array([[66, 74, 35, 74],
       [52, 94, 57, 49]])
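
Boolean masks also work on ndarrays, for both selection and in-place assignment; a short sketch using b from above:

b[b > 50]       # 1-D array of every element greater than 50
b[b > 50] = 0   # set those elements to 0 in place
b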

Introduction to Pandas

Pandas: know it, love it! We will use it extensively.

It lets us do important manipulation and viewing of our data with just a few short lines of code.

  • Object Creation
  • Viewing data
  • Selection
  • Missing data
  • Grouping
  • Reshaping
  • Time series
  • Plotting
  • I/O

pandas.pydata.org

Pandas Overview

Source: pandas.pydata.org


In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [18]:
dates = pd.date_range('20140101',periods=6)
dates


Out[18]:
DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06'],
              dtype='datetime64[ns]', freq='D')

In [19]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
z = pd.DataFrame(index=df.index, columns=df.columns)  # an empty (all-NaN) frame with the same labels
df.columns


Out[19]:
Index(['A', 'B', 'C', 'D'], dtype='object')

In [20]:
# Index, columns, underlying numpy data
print(df)


                   A         B         C         D
2014-01-01  0.776330 -0.577392 -0.625759  0.471408
2014-01-02 -0.434345  0.411061  0.092662  0.336645
2014-01-03  0.106360 -0.524343 -1.393945 -0.866345
2014-01-04  0.417913 -0.276387 -0.202947  0.978929
2014-01-05  1.462674  0.846637  0.587974  0.559929
2014-01-06  0.803688  0.257939  0.856312 -0.309411

In [21]:
df.T


Out[21]:
2014-01-01 00:00:00 2014-01-02 00:00:00 2014-01-03 00:00:00 2014-01-04 00:00:00 2014-01-05 00:00:00 2014-01-06 00:00:00
A 0.776330 -0.434345 0.106360 0.417913 1.462674 0.803688
B -0.577392 0.411061 -0.524343 -0.276387 0.846637 0.257939
C -0.625759 0.092662 -1.393945 -0.202947 0.587974 0.856312
D 0.471408 0.336645 -0.866345 0.978929 0.559929 -0.309411

In [22]:
df


Out[22]:
A B C D
2014-01-01 0.776330 -0.577392 -0.625759 0.471408
2014-01-02 -0.434345 0.411061 0.092662 0.336645
2014-01-03 0.106360 -0.524343 -1.393945 -0.866345
2014-01-04 0.417913 -0.276387 -0.202947 0.978929
2014-01-05 1.462674 0.846637 0.587974 0.559929
2014-01-06 0.803688 0.257939 0.856312 -0.309411

In [23]:
temp = df.T
temp


Out[23]:
2014-01-01 00:00:00 2014-01-02 00:00:00 2014-01-03 00:00:00 2014-01-04 00:00:00 2014-01-05 00:00:00 2014-01-06 00:00:00
A 0.776330 -0.434345 0.106360 0.417913 1.462674 0.803688
B -0.577392 0.411061 -0.524343 -0.276387 0.846637 0.257939
C -0.625759 0.092662 -1.393945 -0.202947 0.587974 0.856312
D 0.471408 0.336645 -0.866345 0.978929 0.559929 -0.309411

In [35]:
df = df.T  # reassigns df to its transpose; running this cell a second time restores the original orientation shown in the later cells

In [36]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': 'foo'})

df2


Out[36]:
A B C D E
0 1.0 2013-01-02 1.0 3 foo
1 1.0 2013-01-02 1.0 3 foo
2 1.0 2013-01-02 1.0 3 foo
3 1.0 2013-01-02 1.0 3 foo

In [37]:
# With specific dtypes
df2.dtypes


Out[37]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E            object
dtype: object

Viewing Data


In [38]:
df


Out[38]:
A B C D
2014-01-01 0.776330 -0.577392 -0.625759 0.471408
2014-01-02 -0.434345 0.411061 0.092662 0.336645
2014-01-03 0.106360 -0.524343 -1.393945 -0.866345
2014-01-04 0.417913 -0.276387 -0.202947 0.978929
2014-01-05 1.462674 0.846637 0.587974 0.559929
2014-01-06 0.803688 0.257939 0.856312 -0.309411

In [39]:
df.head()


Out[39]:
A B C D
2014-01-01 0.776330 -0.577392 -0.625759 0.471408
2014-01-02 -0.434345 0.411061 0.092662 0.336645
2014-01-03 0.106360 -0.524343 -1.393945 -0.866345
2014-01-04 0.417913 -0.276387 -0.202947 0.978929
2014-01-05 1.462674 0.846637 0.587974 0.559929

In [40]:
df.tail(3)


Out[40]:
A B C D
2014-01-04 0.417913 -0.276387 -0.202947 0.978929
2014-01-05 1.462674 0.846637 0.587974 0.559929
2014-01-06 0.803688 0.257939 0.856312 -0.309411

In [41]:
df.index


Out[41]:
DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06'],
              dtype='datetime64[ns]', freq='D')

In [42]:
df.describe()


Out[42]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.522103 0.022919 -0.114284 0.195193
std 0.652240 0.571638 0.822806 0.667093
min -0.434345 -0.577392 -1.393945 -0.866345
25% 0.184248 -0.462354 -0.520056 -0.147897
50% 0.597121 -0.009224 -0.055142 0.404026
75% 0.796848 0.372780 0.464146 0.537799
max 1.462674 0.846637 0.856312 0.978929

In [43]:
df


Out[43]:
A B C D
2014-01-01 0.776330 -0.577392 -0.625759 0.471408
2014-01-02 -0.434345 0.411061 0.092662 0.336645
2014-01-03 0.106360 -0.524343 -1.393945 -0.866345
2014-01-04 0.417913 -0.276387 -0.202947 0.978929
2014-01-05 1.462674 0.846637 0.587974 0.559929
2014-01-06 0.803688 0.257939 0.856312 -0.309411

In [44]:
z = df.sort_values(by="B")
z


Out[44]:
A B C D
2014-01-01 0.776330 -0.577392 -0.625759 0.471408
2014-01-03 0.106360 -0.524343 -1.393945 -0.866345
2014-01-04 0.417913 -0.276387 -0.202947 0.978929
2014-01-06 0.803688 0.257939 0.856312 -0.309411
2014-01-02 -0.434345 0.411061 0.092662 0.336645
2014-01-05 1.462674 0.846637 0.587974 0.559929

Selection


In [45]:
df[['A','B']]


Out[45]:
A B
2014-01-01 0.776330 -0.577392
2014-01-02 -0.434345 0.411061
2014-01-03 0.106360 -0.524343
2014-01-04 0.417913 -0.276387
2014-01-05 1.462674 0.846637
2014-01-06 0.803688 0.257939

In [46]:
df[0:3]


Out[46]:
A B C D
2014-01-01 0.776330 -0.577392 -0.625759 0.471408
2014-01-02 -0.434345 0.411061 0.092662 0.336645
2014-01-03 0.106360 -0.524343 -1.393945 -0.866345

In [47]:
# By label
df.loc[dates[0]]


Out[47]:
A    0.776330
B   -0.577392
C   -0.625759
D    0.471408
Name: 2014-01-01 00:00:00, dtype: float64

In [48]:
# multi-axis by label
df.loc[:,['A','B']]


Out[48]:
A B
2014-01-01 0.776330 -0.577392
2014-01-02 -0.434345 0.411061
2014-01-03 0.106360 -0.524343
2014-01-04 0.417913 -0.276387
2014-01-05 1.462674 0.846637
2014-01-06 0.803688 0.257939

In [49]:
# Date Range
df.loc['20140102':'20140104',['B']]


Out[49]:
B
2014-01-02 0.411061
2014-01-03 -0.524343
2014-01-04 -0.276387

In [50]:
# Fast access to scalar
df.at[dates[1],'B']


Out[50]:
0.41106066924913387
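
Selection by integer position works the same way through .iloc; a brief sketch:

df.iloc[1]          # second row as a Series
df.iloc[0:3, 0:2]   # first three rows, first two columns
df.iloc[1, 1]       # scalar at row 1, column 1 (the same value as df.at[dates[1],'B'] above)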

Operations


In [51]:
df.describe()


Out[51]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.522103 0.022919 -0.114284 0.195193
std 0.652240 0.571638 0.822806 0.667093
min -0.434345 -0.577392 -1.393945 -0.866345
25% 0.184248 -0.462354 -0.520056 -0.147897
50% 0.597121 -0.009224 -0.055142 0.404026
75% 0.796848 0.372780 0.464146 0.537799
max 1.462674 0.846637 0.856312 0.978929

In [52]:
df.mean(), df.mean(1)  # column means (axis=0) and row means (axis=1)
# there are pandas methods for min, max, etc. as well


Out[52]:
(A    0.522103
 B    0.022919
 C   -0.114284
 D    0.195193
 dtype: float64, 2014-01-01    0.011147
 2014-01-02    0.101506
 2014-01-03   -0.669568
 2014-01-04    0.229377
 2014-01-05    0.864303
 2014-01-06    0.402132
 Freq: D, dtype: float64)

Boolean Indexing


In [53]:
df


Out[53]:
A B C D
2014-01-01 0.776330 -0.577392 -0.625759 0.471408
2014-01-02 -0.434345 0.411061 0.092662 0.336645
2014-01-03 0.106360 -0.524343 -1.393945 -0.866345
2014-01-04 0.417913 -0.276387 -0.202947 0.978929
2014-01-05 1.462674 0.846637 0.587974 0.559929
2014-01-06 0.803688 0.257939 0.856312 -0.309411

In [54]:
df[df.A < 0] # Basically a 'where' operation


Out[54]:
A B C D
2014-01-02 -0.434345 0.411061 0.092662 0.336645

In [55]:
df.A < 0


Out[55]:
2014-01-01    False
2014-01-02     True
2014-01-03    False
2014-01-04    False
2014-01-05    False
2014-01-06    False
Freq: D, Name: A, dtype: bool

In [56]:
# Two conditions combined with AND (&)
df[(df.A < 0) & (df.B > .5) ]


Out[56]:
A B C D

In [57]:
# Two conditions ORed together
df[(df.A < -1) | (df.D > 0)]


Out[57]:
A B C D
2014-01-01 0.776330 -0.577392 -0.625759 0.471408
2014-01-02 -0.434345 0.411061 0.092662 0.336645
2014-01-04 0.417913 -0.276387 -0.202947 0.978929
2014-01-05 1.462674 0.846637 0.587974 0.559929
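
Conditions can also be inverted with ~; a quick sketch:

df[~(df.A < 0)]   # rows where A is not negative, i.e. the complement of the earlier filter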

Merge


In [58]:
np.random.randn(10,4)


Out[58]:
array([[ 0.11306038, -1.77140811,  0.07280936, -0.14668009],
       [-0.68492678, -0.19256862, -0.86461244,  0.31432357],
       [ 0.47392041,  0.27729701,  0.71623443,  1.94010053],
       [ 0.19398862,  1.05932087,  1.23240103,  0.87530107],
       [ 0.28308285,  0.6611329 , -0.17589002, -0.58357836],
       [-0.4042394 , -0.43824254,  1.01473845, -1.44265652],
       [-1.12937782, -0.47272561, -0.47982388,  0.43282168],
       [ 0.87037526,  1.4418829 , -1.69837435, -0.86826889],
       [ 0.41261133,  0.84335971, -0.66000055,  0.2896288 ],
       [-2.33263201, -0.29674019,  0.06493032,  0.6763015 ]])

In [59]:
##Concatenating pandas objects together
# create a dataframe to use as an example
df2 = pd.DataFrame(np.random.randn(10,4))
df2


Out[59]:
0 1 2 3
0 1.424766 -0.027515 1.084747 -0.368493
1 2.032886 0.599020 -1.533923 0.538320
2 1.715762 -1.033603 0.326035 0.718633
3 -1.021294 -0.601098 0.109353 -0.357211
4 -0.656773 2.830440 0.138265 0.227426
5 0.717391 -0.914680 -1.471915 0.862612
6 -1.196063 -0.376883 0.711914 0.946665
7 -0.398913 -1.525097 0.397494 1.337845
8 0.548093 -1.060878 -0.183606 -1.448564
9 -1.576554 0.335038 -0.562043 -1.016224

In [60]:
# Break it into pieces
pieces = [df2[:3], df2[3:7],df2[7:]]
pieces


Out[60]:
[          0         1         2         3
 0  1.424766 -0.027515  1.084747 -0.368493
 1  2.032886  0.599020 -1.533923  0.538320
 2  1.715762 -1.033603  0.326035  0.718633,
           0         1         2         3
 3 -1.021294 -0.601098  0.109353 -0.357211
 4 -0.656773  2.830440  0.138265  0.227426
 5  0.717391 -0.914680 -1.471915  0.862612
 6 -1.196063 -0.376883  0.711914  0.946665,
           0         1         2         3
 7 -0.398913 -1.525097  0.397494  1.337845
 8  0.548093 -1.060878 -0.183606 -1.448564
 9 -1.576554  0.335038 -0.562043 -1.016224]

In [61]:
pd.concat(pieces)


Out[61]:
0 1 2 3
0 1.424766 -0.027515 1.084747 -0.368493
1 2.032886 0.599020 -1.533923 0.538320
2 1.715762 -1.033603 0.326035 0.718633
3 -1.021294 -0.601098 0.109353 -0.357211
4 -0.656773 2.830440 0.138265 0.227426
5 0.717391 -0.914680 -1.471915 0.862612
6 -1.196063 -0.376883 0.711914 0.946665
7 -0.398913 -1.525097 0.397494 1.337845
8 0.548093 -1.060878 -0.183606 -1.448564
9 -1.576554 0.335038 -0.562043 -1.016224

In [ ]:
# DataFrames can also be combined with SQL-style joins (pd.merge / .join) and by appending rows (pd.concat)
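
A minimal merge sketch on a shared key (these small frames and the 'key' column are made up for illustration):

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
pd.merge(left, right, on='key')   # one row per matching key, with lval and rval side by side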

Setting


In [62]:
df_posA = df.copy()  # .copy() gives an independent DataFrame; plain assignment would only alias df

df_posA[df_posA.A < 0] = -1*df_posA

In [65]:
print(df_posA)
print(df)


                   A         B         C         D
2014-01-01  0.776330 -0.577392 -0.625759  0.471408
2014-01-02  0.434345 -0.411061 -0.092662 -0.336645
2014-01-03  0.106360 -0.524343 -1.393945 -0.866345
2014-01-04  0.417913 -0.276387 -0.202947  0.978929
2014-01-05  1.462674  0.846637  0.587974  0.559929
2014-01-06  0.803688  0.257939  0.856312 -0.309411
                   A         B         C         D
2014-01-01  0.776330 -0.577392 -0.625759  0.471408
2014-01-02 -0.434345  0.411061  0.092662  0.336645
2014-01-03  0.106360 -0.524343 -1.393945 -0.866345
2014-01-04  0.417913 -0.276387 -0.202947  0.978929
2014-01-05  1.462674  0.846637  0.587974  0.559929
2014-01-06  0.803688  0.257939  0.856312 -0.309411

In [ ]:
# Setting a new column aligns the data by index
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20140102',periods=6))

In [ ]:
s1

In [ ]:
df['F'] = s1

In [ ]:
df

Missing Data


In [66]:
# Add a column with missing data
df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ['E'])

In [68]:
df1.loc[dates[0]:dates[1],'E'] = 1

In [69]:
df1


Out[69]:
A B C D E
2014-01-01 0.776330 -0.577392 -0.625759 0.471408 1.0
2014-01-02 -0.434345 0.411061 0.092662 0.336645 1.0
2014-01-03 0.106360 -0.524343 -1.393945 -0.866345 NaN
2014-01-04 0.417913 -0.276387 -0.202947 0.978929 NaN

In [70]:
# find where values are null
pd.isnull(df1)


Out[70]:
A B C D E
2014-01-01 False False False False False
2014-01-02 False False False False False
2014-01-03 False False False False True
2014-01-04 False False False False True
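
Missing values can then be dropped or filled in; a short sketch:

df1.dropna(how='any')   # keep only rows with no NaN (here, the first two)
df1.fillna(value=5)     # or replace every NaN with 5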

Applying functions


In [71]:
df


Out[71]:
A B C D
2014-01-01 0.776330 -0.577392 -0.625759 0.471408
2014-01-02 -0.434345 0.411061 0.092662 0.336645
2014-01-03 0.106360 -0.524343 -1.393945 -0.866345
2014-01-04 0.417913 -0.276387 -0.202947 0.978929
2014-01-05 1.462674 0.846637 0.587974 0.559929
2014-01-06 0.803688 0.257939 0.856312 -0.309411

In [72]:
df.apply(np.cumsum)


Out[72]:
A B C D
2014-01-01 0.776330 -0.577392 -0.625759 0.471408
2014-01-02 0.341984 -0.166331 -0.533097 0.808053
2014-01-03 0.448344 -0.690674 -1.927042 -0.058292
2014-01-04 0.866257 -0.967061 -2.129989 0.920637
2014-01-05 2.328931 -0.120424 -1.542015 1.480567
2014-01-06 3.132619 0.137515 -0.685703 1.171156

In [73]:
df.apply(lambda x: x.max() - x.min())


Out[73]:
A    1.897019
B    1.424029
C    2.250257
D    1.845274
dtype: float64

In [74]:
# Built in string methods
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()


Out[74]:
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
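
A couple more of the vectorized string methods, shown only for illustration:

s.str.contains('a')   # True/False per element; NaN stays NaN
s.str.len()           # length of each string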

Grouping


In [75]:
df3 = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                    'B': ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                    'C': np.random.randn(8),
                    'D': np.random.randn(8)})

In [76]:
df3


Out[76]:
A B C D
0 foo one -0.400972 0.014368
1 bar one 1.535471 0.669310
2 foo two -0.961644 1.858687
3 bar three -1.455985 -0.564102
4 foo two -0.104956 -0.235652
5 bar two -0.771425 -0.090535
6 foo one -0.568128 0.491529
7 foo three -1.153396 -0.006657

In [78]:
df3.groupby(['A','B']).mean()


Out[78]:
C D
A B
bar one 1.535471 0.669310
three -1.455985 -0.564102
two -0.771425 -0.090535
foo one -0.484550 0.252948
three -1.153396 -0.006657
two -0.533300 0.811517
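
Groupby composes with other aggregations too; a brief sketch:

df3.groupby('A')[['C','D']].sum()               # sums of C and D within each A group
df3.groupby('A')['C'].agg(['mean', 'count'])    # several aggregations of one column at once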

Reshaping


In [ ]:
# You can also stack or unstack levels

In [79]:
a = df3.groupby(['A','B']).sum()
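
Using the grouped frame a from the cell above, a quick unstack/stack sketch:

a.unstack()           # pivot the inner index level ('B') into columns
a.unstack().stack()   # stack it back into an (A, B) row MultiIndex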

In [ ]:
# Pivot tables
# pd.pivot_table(df3, values=['C','D'], index=['A'], columns=['B'])

Time Series


In [80]:
import pandas as pd
import numpy as np

In [ ]:
# 100 Seconds starting on January 1st
rng = pd.date_range('1/1/2014', periods=100, freq='S')

In [ ]:
# Give each second a random value
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

In [ ]:
ts

In [ ]:
# Built-in resampling
ts.resample('1Min').mean()  # resample from 1-second to 1-minute frequency, taking the mean of each bin

In [ ]:
# Many additional time series features
# ts.<TAB>  # use tab completion to explore them

Plotting


In [ ]:
ts.plot()

In [ ]:
def randwalk(startdate, points):
    # cumulative sum of standard-normal daily steps starting at startdate
    ts = pd.Series(np.random.randn(points), index=pd.date_range(startdate, periods=points))
    ts = ts.cumsum()
    ts.plot()
    return ts

In [ ]:
# Using pandas to make a simple random walker by repeatedly running:
a=randwalk('1/1/2012',1000)

In [ ]:
# The pandas plot function labels the axes and series by default

In [ ]:
df = pd.DataFrame(np.random.randn(100, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')

I/O

I/O is straightforward with, for example, pd.read_csv or df.to_csv
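
A minimal round-trip sketch (the file name here is just an example):

df.to_csv('example.csv')                  # write the frame, index included
pd.read_csv('example.csv', index_col=0)   # read it back, using the first column as the index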

Next Steps

Recommended Resources

  • Official Pandas Tutorials: Wes & Company's selection of tutorials and lectures
  • Julia Evans Pandas Cookbook: Great resource with examples from weather, bikes and 311 calls
  • Learn Pandas Tutorials: A great series of Pandas tutorials from Dave Rojas
  • Research Computing Python Data PYNBs: A super awesome set of Python notebooks from a meetup-based course exclusively devoted to pandas