Agenda

  • Numpy
  • Pandas
  • Lab

Introduction

Download this notebook for our code-along:

From our class GitHub repository "notebooks" directory, download the file "02_numpy_and_pandas.ipynb". Move this to a convenient place where you are grouping your class related materials.

Introduction to Numpy (pronounced "num-pie")

  • Overview
  • ndarray
  • Indexing and Slicing

More info: http://wiki.scipy.org/Tentative_NumPy_Tutorial

Numpy Overview

  • Why use the Numpy package for data?
    • It makes our data science life easier!!
      • Numpy brings decades of C math into Python!
      • Numpy wraps extensive C/C++/Fortran codebases that provide much of our data analysis functionality
      • The ndarray enables easy vectorized math and broadcasting (element-wise operations on arrays of different shapes; see the short broadcasting sketch after the arithmetic examples below)

In [1]:
# from numpy import *  # (commented out) would load every NumPy name into the namespace; see the note below
%matplotlib inline
import numpy as np

A Note About "import as"

from numpy import *

This imports every public NumPy name directly into the current namespace. Note: in deployed code this pollutes the namespace and makes it unclear where functions come from. We show it here only as an example; it can be fine for quick exploration and legibility while learning.

As we'll see, the proper convention is to use:

import numpy as np

And then to call the functions we need through the np prefix:
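
np.arange(0, 10)   # e.g. array([0, 1, ..., 9]); every NumPy call goes through the np prefix
np.zeros((2, 2))   # a 2x2 array of zeros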

Creating ndarrays

An array object represents a multidimensional, homogeneous array of fixed-size items.


In [2]:
# Creating arrays
a = np.zeros((3))
b = np.ones((2,3))
c = np.random.randint(1,10,(2,3,4))
d = np.arange(0,11,1)

What are these functions? Append a question mark in IPython/Jupyter to pull up the docstring:

np.arange?

In [3]:
# Note the way each array is printed:
print("a: ",a)
print("b: ",b)
print("c: ",c)
print("d: ",d)


a:  [0. 0. 0.]
b:  [[1. 1. 1.]
 [1. 1. 1.]]
c:  [[[3 5 2 5]
  [1 2 4 8]
  [3 1 6 1]]

 [[1 3 3 4]
  [8 4 3 2]
  [6 3 9 6]]]
d:  [ 0  1  2  3  4  5  6  7  8  9 10]
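
Every ndarray also carries attributes describing its layout. A quick inspection of the arrays created above (the exact integer dtype depends on your platform):

c.shape   # (2, 3, 4): three dimensions
c.ndim    # 3
c.dtype   # e.g. dtype('int64')
c.size    # 24 elements in total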

Arithmetic on arrays is element-wise


In [4]:
a = np.array([20, 30, 40, 50])
b = np.arange(4)
b


Out[4]:
array([0, 1, 2, 3])

In [5]:
c = a - b
c


Out[5]:
array([20, 29, 38, 47])

In [6]:
b**2


Out[6]:
array([0, 1, 4, 9])
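
Broadcasting extends element-wise arithmetic to arrays of different but compatible shapes: NumPy stretches the smaller operand across the larger one. A minimal sketch with made-up values:

m = np.ones((3, 3))       # 3x3 array of ones
v = np.array([1, 2, 3])   # shape (3,)
m + v                     # v is broadcast across each row, giving rows of [2., 3., 4.]
m * 10                    # a scalar broadcasts to every element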

Indexing, Slicing and Iterating


In [7]:
# one-dimensional arrays work like lists:
a = np.arange(10)**2

In [8]:
a


Out[8]:
array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

In [9]:
a[2:5]


Out[9]:
array([ 4,  9, 16])

Multidimensional arrays are indexed with comma-separated tuples,

using (row, column) conventions and counting, as always in Python, from 0


In [10]:
b = np.random.randint(1,100,(4,4))

In [11]:
b


Out[11]:
array([[17, 59, 98, 10],
       [66, 74, 35, 74],
       [52, 94, 57, 49],
       [58, 64, 16, 56]])

In [12]:
# Guess the output
print(b[2,3])


49

In [13]:
# Guess the output
print(b[0,0])


17

In [14]:
# Guess the output
b[0:3,1]


Out[14]:
array([59, 74, 94])

In [15]:
#Guess the output
b[:,1]


Out[15]:
array([59, 74, 94, 64])

In [16]:
#Guess the output
b[1:3,:]


Out[16]:
array([[66, 74, 35, 74],
       [52, 94, 57, 49]])
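
Boolean masks also work on ndarrays, for both selection and in-place assignment; a short sketch using b from above:

b[b > 50]       # 1-D array of every element greater than 50
b[b > 50] = 0   # set those elements to 0 in place
b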

Introduction to Pandas

Pandas: know it, love it! We will use it extensively.

It lets us do important manipulation and viewing of our data with just a few short lines of code.

  • Object Creation
  • Viewing data
  • Selection
  • Missing data
  • Grouping
  • Reshaping
  • Time series
  • Plotting
  • I/O

pandas.pydata.org

Pandas Overview

Source: pandas.pydata.org


In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [18]:
dates = pd.date_range('20140101',periods=6)
dates


Out[18]:
DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06'],
              dtype='datetime64[ns]', freq='D')

In [19]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
z = pd.DataFrame(index=df.index, columns=df.columns)  # an empty (all-NaN) frame with the same labels
df.columns


Out[19]:
Index(['A', 'B', 'C', 'D'], dtype='object')

In [20]:
# Index, columns, underlying numpy data
print(df)


                   A         B         C         D
2014-01-01  0.776330 -0.577392 -0.625759  0.471408
2014-01-02 -0.434345  0.411061  0.092662  0.336645
2014-01-03  0.106360 -0.524343 -1.393945 -0.866345
2014-01-04  0.417913 -0.276387 -0.202947  0.978929
2014-01-05  1.462674  0.846637  0.587974  0.559929
2014-01-06  0.803688  0.257939  0.856312 -0.309411

In [21]:
df.T


Out[21]:
2014-01-01 00:00:00 2014-01-02 00:00:00 2014-01-03 00:00:00 2014-01-04 00:00:00 2014-01-05 00:00:00 2014-01-06 00:00:00
A 0.776330 -0.434345 0.106360 0.417913 1.462674 0.803688
B -0.577392 0.411061 -0.524343 -0.276387 0.846637 0.257939
C -0.625759 0.092662 -1.393945 -0.202947 0.587974 0.856312
D 0.471408 0.336645 -0.866345 0.978929 0.559929 -0.309411

In [22]:
df


Out[22]:
A B C D
2014-01-01 0.776330 -0.577392 -0.625759 0.471408
2014-01-02 -0.434345 0.411061 0.092662 0.336645
2014-01-03 0.106360 -0.524343 -1.393945 -0.866345
2014-01-04 0.417913 -0.276387 -0.202947 0.978929
2014-01-05 1.462674 0.846637 0.587974 0.559929
2014-01-06 0.803688 0.257939 0.856312 -0.309411

In [23]:
temp = df.T
temp


Out[23]:
2014-01-01 00:00:00 2014-01-02 00:00:00 2014-01-03 00:00:00 2014-01-04 00:00:00 2014-01-05 00:00:00 2014-01-06 00:00:00
A 0.776330 -0.434345 0.106360 0.417913 1.462674 0.803688
B -0.577392 0.411061 -0.524343 -0.276387 0.846637 0.257939
C -0.625759 0.092662 -1.393945 -0.202947 0.587974 0.856312
D 0.471408 0.336645 -0.866345 0.978929 0.559929 -0.309411

In [35]:
df = df.T  # reassigns df to its transpose; running this cell a second time restores the original orientation shown in the later cells

In [36]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': 'foo'})

df2


Out[36]:
A B C D E
0 1.0 2013-01-02 1.0 3 foo
1 1.0 2013-01-02 1.0 3 foo
2 1.0 2013-01-02 1.0 3 foo
3 1.0 2013-01-02 1.0 3 foo

In [37]:
# With specific dtypes
df2.dtypes


Out[37]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E            object
dtype: object

Viewing Data


In [38]:
df


Out[38]:
A B C D
2014-01-01 0.776330 -0.577392 -0.625759 0.471408
2014-01-02 -0.434345 0.411061 0.092662 0.336645
2014-01-03 0.106360 -0.524343 -1.393945 -0.866345
2014-01-04 0.417913 -0.276387 -0.202947 0.978929
2014-01-05 1.462674 0.846637 0.587974 0.559929
2014-01-06 0.803688 0.257939 0.856312 -0.309411

In [39]:
df.head()


Out[39]:
A B C D
2014-01-01 0.776330 -0.577392 -0.625759 0.471408
2014-01-02 -0.434345 0.411061 0.092662 0.336645
2014-01-03 0.106360 -0.524343 -1.393945 -0.866345
2014-01-04 0.417913 -0.276387 -0.202947 0.978929
2014-01-05 1.462674 0.846637 0.587974 0.559929

In [40]:
df.tail(3)


Out[40]:
A B C D
2014-01-04 0.417913 -0.276387 -0.202947 0.978929
2014-01-05 1.462674 0.846637 0.587974 0.559929
2014-01-06 0.803688 0.257939 0.856312 -0.309411

In [41]:
df.index


Out[41]:
DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06'],
              dtype='datetime64[ns]', freq='D')

In [42]:
df.describe()


Out[42]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.522103 0.022919 -0.114284 0.195193
std 0.652240 0.571638 0.822806 0.667093
min -0.434345 -0.577392 -1.393945 -0.866345
25% 0.184248 -0.462354 -0.520056 -0.147897
50% 0.597121 -0.009224 -0.055142 0.404026
75% 0.796848 0.372780 0.464146 0.537799
max 1.462674 0.846637 0.856312 0.978929

In [43]:
df


Out[43]:
A B C D
2014-01-01 0.776330 -0.577392 -0.625759 0.471408
2014-01-02 -0.434345 0.411061 0.092662 0.336645
2014-01-03 0.106360 -0.524343 -1.393945 -0.866345
2014-01-04 0.417913 -0.276387 -0.202947 0.978929
2014-01-05 1.462674 0.846637 0.587974 0.559929
2014-01-06 0.803688 0.257939 0.856312 -0.309411

In [44]:
z = df.sort_values(by="B")
z


Out[44]:
A B C D
2014-01-01 0.776330 -0.577392 -0.625759 0.471408
2014-01-03 0.106360 -0.524343 -1.393945 -0.866345
2014-01-04 0.417913 -0.276387 -0.202947 0.978929
2014-01-06 0.803688 0.257939 0.856312 -0.309411
2014-01-02 -0.434345 0.411061 0.092662 0.336645
2014-01-05 1.462674 0.846637 0.587974 0.559929

Selection


In [45]:
df[['A','B']]


Out[45]:
A B
2014-01-01 0.776330 -0.577392
2014-01-02 -0.434345 0.411061
2014-01-03 0.106360 -0.524343
2014-01-04 0.417913 -0.276387
2014-01-05 1.462674 0.846637
2014-01-06 0.803688 0.257939

In [46]:
df[0:3]


Out[46]:
A B C D
2014-01-01 0.776330 -0.577392 -0.625759 0.471408
2014-01-02 -0.434345 0.411061 0.092662 0.336645
2014-01-03 0.106360 -0.524343 -1.393945 -0.866345

In [47]:
# By label
df.loc[dates[0]]


Out[47]:
A    0.776330
B   -0.577392
C   -0.625759
D    0.471408
Name: 2014-01-01 00:00:00, dtype: float64

In [48]:
# multi-axis by label
df.loc[:,['A','B']]


Out[48]:
A B
2014-01-01 0.776330 -0.577392
2014-01-02 -0.434345 0.411061
2014-01-03 0.106360 -0.524343
2014-01-04 0.417913 -0.276387
2014-01-05 1.462674 0.846637
2014-01-06 0.803688 0.257939

In [49]:
# Date Range
df.loc['20140102':'20140104',['B']]


Out[49]:
B
2014-01-02 0.411061
2014-01-03 -0.524343
2014-01-04 -0.276387

In [50]:
# Fast access to scalar
df.at[dates[1],'B']


Out[50]:
0.41106066924913387
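
Selection by integer position works the same way through .iloc; a brief sketch:

df.iloc[1]          # second row as a Series
df.iloc[0:3, 0:2]   # first three rows, first two columns
df.iloc[1, 1]       # scalar at row 1, column 1 (the same value as df.at[dates[1],'B'] above)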

Operations


In [51]:
df.describe()


Out[51]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.522103 0.022919 -0.114284 0.195193
std 0.652240 0.571638 0.822806 0.667093
min -0.434345 -0.577392 -1.393945 -0.866345
25% 0.184248 -0.462354 -0.520056 -0.147897
50% 0.597121 -0.009224 -0.055142 0.404026
75% 0.796848 0.372780 0.464146 0.537799
max 1.462674 0.846637 0.856312 0.978929

In [52]:
df.mean(), df.mean(1)  # column means (axis=0) and row means (axis=1)
# there are pandas methods for min, max, etc. as well


Out[52]:
(A    0.522103
 B    0.022919
 C   -0.114284
 D    0.195193
 dtype: float64, 2014-01-01    0.011147
 2014-01-02    0.101506
 2014-01-03   -0.669568
 2014-01-04    0.229377
 2014-01-05    0.864303
 2014-01-06    0.402132
 Freq: D, dtype: float64)

Boolean Indexing


In [53]:
df


Out[53]:
A B C D
2014-01-01 0.776330 -0.577392 -0.625759 0.471408
2014-01-02 -0.434345 0.411061 0.092662 0.336645
2014-01-03 0.106360 -0.524343 -1.393945 -0.866345
2014-01-04 0.417913 -0.276387 -0.202947 0.978929
2014-01-05 1.462674 0.846637 0.587974 0.559929
2014-01-06 0.803688 0.257939 0.856312 -0.309411

In [54]:
df[df.A < 0] # Basically a 'where' operation


Out[54]:
A B C D
2014-01-02 -0.434345 0.411061 0.092662 0.336645

In [55]:
df.A < 0


Out[55]:
2014-01-01    False
2014-01-02     True
2014-01-03    False
2014-01-04    False
2014-01-05    False
2014-01-06    False
Freq: D, Name: A, dtype: bool

In [56]:
# Two conditions combined with AND (&)
df[(df.A < 0) & (df.B > .5) ]


Out[56]:
A B C D

In [57]:
# Two conditions ORed together
df[(df.A < -1) | (df.D > 0)]


Out[57]:
A B C D
2014-01-01 0.776330 -0.577392 -0.625759 0.471408
2014-01-02 -0.434345 0.411061 0.092662 0.336645
2014-01-04 0.417913 -0.276387 -0.202947 0.978929
2014-01-05 1.462674 0.846637 0.587974 0.559929
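
Conditions can also be inverted with ~; a quick sketch:

df[~(df.A < 0)]   # rows where A is not negative, i.e. the complement of the earlier filter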

Merge


In [58]:
np.random.randn(10,4)


Out[58]:
array([[ 0.11306038, -1.77140811,  0.07280936, -0.14668009],
       [-0.68492678, -0.19256862, -0.86461244,  0.31432357],
       [ 0.47392041,  0.27729701,  0.71623443,  1.94010053],
       [ 0.19398862,  1.05932087,  1.23240103,  0.87530107],
       [ 0.28308285,  0.6611329 , -0.17589002, -0.58357836],
       [-0.4042394 , -0.43824254,  1.01473845, -1.44265652],
       [-1.12937782, -0.47272561, -0.47982388,  0.43282168],
       [ 0.87037526,  1.4418829 , -1.69837435, -0.86826889],
       [ 0.41261133,  0.84335971, -0.66000055,  0.2896288 ],
       [-2.33263201, -0.29674019,  0.06493032,  0.6763015 ]])

In [59]:
##Concatenating pandas objects together
# create a dataframe to use as an example
df2 = pd.DataFrame(np.random.randn(10,4))
df2


Out[59]:
0 1 2 3
0 1.424766 -0.027515 1.084747 -0.368493
1 2.032886 0.599020 -1.533923 0.538320
2 1.715762 -1.033603 0.326035 0.718633
3 -1.021294 -0.601098 0.109353 -0.357211
4 -0.656773 2.830440 0.138265 0.227426
5 0.717391 -0.914680 -1.471915 0.862612
6 -1.196063 -0.376883 0.711914 0.946665
7 -0.398913 -1.525097 0.397494 1.337845
8 0.548093 -1.060878 -0.183606 -1.448564
9 -1.576554 0.335038 -0.562043 -1.016224

In [60]:
# Break it into pieces
pieces = [df2[:3], df2[3:7],df2[7:]]
pieces


Out[60]:
[          0         1         2         3
 0  1.424766 -0.027515  1.084747 -0.368493
 1  2.032886  0.599020 -1.533923  0.538320
 2  1.715762 -1.033603  0.326035  0.718633,
           0         1         2         3
 3 -1.021294 -0.601098  0.109353 -0.357211
 4 -0.656773  2.830440  0.138265  0.227426
 5  0.717391 -0.914680 -1.471915  0.862612
 6 -1.196063 -0.376883  0.711914  0.946665,
           0         1         2         3
 7 -0.398913 -1.525097  0.397494  1.337845
 8  0.548093 -1.060878 -0.183606 -1.448564
 9 -1.576554  0.335038 -0.562043 -1.016224]

In [61]:
pd.concat(pieces)


Out[61]:
0 1 2 3
0 1.424766 -0.027515 1.084747 -0.368493
1 2.032886 0.599020 -1.533923 0.538320
2 1.715762 -1.033603 0.326035 0.718633
3 -1.021294 -0.601098 0.109353 -0.357211
4 -0.656773 2.830440 0.138265 0.227426
5 0.717391 -0.914680 -1.471915 0.862612
6 -1.196063 -0.376883 0.711914 0.946665
7 -0.398913 -1.525097 0.397494 1.337845
8 0.548093 -1.060878 -0.183606 -1.448564
9 -1.576554 0.335038 -0.562043 -1.016224

In [ ]:
# DataFrames can also be combined with SQL-style joins (pd.merge / .join) and by appending rows (pd.concat)
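
A minimal merge sketch on a shared key (these small frames and the 'key' column are made up for illustration):

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
pd.merge(left, right, on='key')   # one row per matching key, with lval and rval side by side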

Setting


In [62]:
df_posA = df.copy()  # .copy() gives an independent DataFrame; plain assignment would only alias df

df_posA[df_posA.A < 0] = -1*df_posA

In [65]:
print(df_posA)
print(df)


                   A         B         C         D
2014-01-01  0.776330 -0.577392 -0.625759  0.471408
2014-01-02  0.434345 -0.411061 -0.092662 -0.336645
2014-01-03  0.106360 -0.524343 -1.393945 -0.866345
2014-01-04  0.417913 -0.276387 -0.202947  0.978929
2014-01-05  1.462674  0.846637  0.587974  0.559929
2014-01-06  0.803688  0.257939  0.856312 -0.309411
                   A         B         C         D
2014-01-01  0.776330 -0.577392 -0.625759  0.471408
2014-01-02 -0.434345  0.411061  0.092662  0.336645
2014-01-03  0.106360 -0.524343 -1.393945 -0.866345
2014-01-04  0.417913 -0.276387 -0.202947  0.978929
2014-01-05  1.462674  0.846637  0.587974  0.559929
2014-01-06  0.803688  0.257939  0.856312 -0.309411

In [ ]:
# Setting a new column aligns the data by index
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20140102',periods=6))

In [ ]:
s1

In [ ]:
df['F'] = s1

In [ ]:
df

Missing Data


In [66]:
# Add a column with missing data
df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ['E'])

In [68]:
df1.loc[dates[0]:dates[1],'E'] = 1

In [69]:
df1


Out[69]:
A B C D E
2014-01-01 0.776330 -0.577392 -0.625759 0.471408 1.0
2014-01-02 -0.434345 0.411061 0.092662 0.336645 1.0
2014-01-03 0.106360 -0.524343 -1.393945 -0.866345 NaN
2014-01-04 0.417913 -0.276387 -0.202947 0.978929 NaN

In [70]:
# find where values are null
pd.isnull(df1)


Out[70]:
A B C D E
2014-01-01 False False False False False
2014-01-02 False False False False False
2014-01-03 False False False False True
2014-01-04 False False False False True
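
Missing values can then be dropped or filled in; a short sketch:

df1.dropna(how='any')   # keep only rows with no NaN (here, the first two)
df1.fillna(value=5)     # or replace every NaN with 5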

Applying functions


In [71]:
df


Out[71]:
A B C D
2014-01-01 0.776330 -0.577392 -0.625759 0.471408
2014-01-02 -0.434345 0.411061 0.092662 0.336645
2014-01-03 0.106360 -0.524343 -1.393945 -0.866345
2014-01-04 0.417913 -0.276387 -0.202947 0.978929
2014-01-05 1.462674 0.846637 0.587974 0.559929
2014-01-06 0.803688 0.257939 0.856312 -0.309411

In [72]:
df.apply(np.cumsum)


Out[72]:
A B C D
2014-01-01 0.776330 -0.577392 -0.625759 0.471408
2014-01-02 0.341984 -0.166331 -0.533097 0.808053
2014-01-03 0.448344 -0.690674 -1.927042 -0.058292
2014-01-04 0.866257 -0.967061 -2.129989 0.920637
2014-01-05 2.328931 -0.120424 -1.542015 1.480567
2014-01-06 3.132619 0.137515 -0.685703 1.171156

In [73]:
df.apply(lambda x: x.max() - x.min())


Out[73]:
A    1.897019
B    1.424029
C    2.250257
D    1.845274
dtype: float64

In [74]:
# Built in string methods
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()


Out[74]:
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
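
A couple more of the vectorized string methods, shown only for illustration:

s.str.contains('a')   # True/False per element; NaN stays NaN
s.str.len()           # length of each string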

Grouping


In [75]:
df3 = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                    'B': ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                    'C': np.random.randn(8),
                    'D': np.random.randn(8)})

In [76]:
df3


Out[76]:
A B C D
0 foo one -0.400972 0.014368
1 bar one 1.535471 0.669310
2 foo two -0.961644 1.858687
3 bar three -1.455985 -0.564102
4 foo two -0.104956 -0.235652
5 bar two -0.771425 -0.090535
6 foo one -0.568128 0.491529
7 foo three -1.153396 -0.006657

In [78]:
df3.groupby(['A','B']).mean()


Out[78]:
C D
A B
bar one 1.535471 0.669310
three -1.455985 -0.564102
two -0.771425 -0.090535
foo one -0.484550 0.252948
three -1.153396 -0.006657
two -0.533300 0.811517
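
Groupby composes with other aggregations too; a brief sketch:

df3.groupby('A')[['C','D']].sum()               # sums of C and D within each A group
df3.groupby('A')['C'].agg(['mean', 'count'])    # several aggregations of one column at once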

Reshaping


In [ ]:
# You can also stack or unstack levels

In [79]:
a = df3.groupby(['A','B']).sum()
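
Using the grouped frame a from the cell above, a quick unstack/stack sketch:

a.unstack()           # pivot the inner index level ('B') into columns
a.unstack().stack()   # stack it back into an (A, B) row MultiIndex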

In [ ]:
# Pivot tables
# pd.pivot_table(df3, values=['C','D'], index=['A'], columns=['B'])

Time Series


In [80]:
import pandas as pd
import numpy as np

In [ ]:
# 100 Seconds starting on January 1st
rng = pd.date_range('1/1/2014', periods=100, freq='S')

In [ ]:
# Give each second a random value
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

In [ ]:
ts

In [ ]:
# Built-in resampling
ts.resample('1Min').mean()  # resample from 1-second to 1-minute frequency, taking the mean of each bin

In [ ]:
# Many additional time series features
# ts.<TAB>  # use tab completion to explore them

Plotting


In [ ]:
ts.plot()

In [ ]:
def randwalk(startdate, points):
    # cumulative sum of standard-normal daily steps starting at startdate
    ts = pd.Series(np.random.randn(points), index=pd.date_range(startdate, periods=points))
    ts = ts.cumsum()
    ts.plot()
    return ts

In [ ]:
# Using pandas to make a simple random walker by repeatedly running:
a=randwalk('1/1/2012',1000)

In [ ]:
# The pandas plot function labels the axes and series by default

In [ ]:
df = pd.DataFrame(np.random.randn(100, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')

I/O

I/O is straightforward with, for example, pd.read_csv or df.to_csv
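
A minimal round-trip sketch (the file name here is just an example):

df.to_csv('example.csv')                  # write the frame, index included
pd.read_csv('example.csv', index_col=0)   # read it back, using the first column as the index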

Next Steps

Recommended Resources

  • Official Pandas Tutorials: Wes & Company's selection of tutorials and lectures
  • Julia Evans Pandas Cookbook: Great resource with examples from weather, bikes and 311 calls
  • Learn Pandas Tutorials: A great series of Pandas tutorials from Dave Rojas
  • Research Computing Python Data PYNBs: A super awesome set of Python notebooks from a meetup-based course exclusively devoted to pandas