I wrote this notebook while figuring out what tools Python offers for solving data science problems, i.e. those kinds of programming tasks where spreadsheet-like data has to be handled and analyzed with a statistical mindset. Consequently, the following code snippets mainly use packages like numpy, pandas and matplotlib. It comprises the topics:
Acknowledgment: Much of the code below is inspired by the very helpful book "Python for Data Analysis" by Wes McKinney, published by O'Reilly.
(c) Florian Hoppe 2013
In [4]:
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import matplotlib.pyplot as plt
import numpy.random as rand
from datetime import datetime, timedelta
import dateutil.parser
import misc as hlp
import scipy
In [4]:
%pylab inline
In [101]:
rand.randint(0,2,size=(5,2))
Out[101]:
In [102]:
rand.normal(0, 1, size=(10,2))
Out[102]:
In [110]:
random_data = np.concatenate([rand.normal(0,.5,(100,2)),rand.normal(5,1,(200,2))])
plt.scatter(random_data[:,0],random_data[:,1])
Out[110]:
In [33]:
np.random.randn(2,3) # normal distribution
np.random.rand(2,3) # uniform distribution
Out[33]:
In [28]:
np.random.permutation(10)
Out[28]:
In [120]:
t = np.arange(5)
np.random.shuffle(t)
t
Out[120]:
In [2]:
n = 10000
v1 = rand.normal(0,.5,n)
v2 = rand.normal(10,.75,n)
v3 = v1*3+10
v4 = rand.uniform(-10,10,n)
v5 = v1*v2
v6 = v1+v2+v3
mat = np.concatenate([v1,v2,v3,v4,v5,v6]).reshape(6,n).T
mat.shape
Out[2]:
In [59]:
v1.std()==v1.var()**.5
Out[59]:
In [3]:
np.cov(mat.T)
Out[3]:
In [4]:
np.corrcoef(mat.T)
Out[4]:
In [5]:
df = DataFrame({'v1':v1,'v2':v2,'v3':v3,'v4':v4,'v5':v5,'v6':v6})
df[:3]
Out[5]:
In [77]:
df.cov()
Out[77]:
In [6]:
df.cov()
# corr is the standardized cov: each cov entry has to be divided by the product of the standard deviations of the two variables involved
# df.cov().div(df.std()[0]*df.std()[1])
Out[6]:
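The commented-out line above divides by a single scalar, which only covers one pair of columns; to recover the full correlation matrix from the covariance matrix, every entry has to be divided by the outer product of the column standard deviations. A small sketch (not part of the original notebook) that should reproduce df.corr():
In [ ]:
df.cov() / np.outer(df.std(), df.std())  # elementwise division; the result should equal df.corr()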
In [26]:
df.corr()
Out[26]:
In [8]:
U,S,V = np.linalg.svd(df)
In [16]:
Series(S)
Out[16]:
In [14]:
V = DataFrame(V)
V[V<0.001]=0
V
Out[14]:
In [27]:
df.corr()
Out[27]:
In [29]:
stddf = df.sub(df.mean()) # demean the columns (the covariance is unaffected by this shift)
print(stddf.cov())
In [30]:
latent, coeff = np.linalg.eig(stddf.cov()) # eigenvalues/eigenvectors of the covariance matrix (the PCA recipe)
In [35]:
DataFrame(coeff)
Out[35]:
For sparse matrices:
In [6]:
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import matplotlib.pyplot as plt
import numpy.random as rand
from datetime import datetime, timedelta
import dateutil.parser
from scipy.stats import pareto
import scipy.stats
import random as rnd
In [29]:
import scipy.sparse
import scipy.sparse.linalg
vals = np.array([1,2,3], dtype=np.float64)
rowi = np.array([1,5,10])
coli = np.array([5,13,100])
sparseM = scipy.sparse.coo_matrix((vals,(rowi,coli)),shape=(max(rowi)+1,max(coli)+1))
In [40]:
sparseM.shape
Out[40]:
In [31]:
(U,d,V) = scipy.sparse.linalg.svds(sparseM) # use parameter 'k=' to limit number of computed singular vectors
In [39]:
U.shape
Out[39]:
In [35]:
d.shape
Out[35]:
In [34]:
V.shape
Out[34]:
In [ ]:
# check the reconstruction; svds returns V already transposed (shape k x n):
np.allclose(sparseM.toarray(), U.dot(np.diag(d)).dot(V))
In [195]:
# fault-tolerant value retrieval:
d = {'a':1,'b':2}
d.get('t',-1) # returns -1 if the key is not in the dictionary
Out[195]:
In [2]:
x = np.array(range(3))
print(x.dot(x))
x * x # elementwise *
Out[2]:
In [8]:
x = np.array([0, 1])
y = np.array([2, 3])
x * y
In [2]:
e = np.identity(3)
e
Out[2]:
In [3]:
m = np.arange(6).reshape(3,2)
m
Out[3]:
In [133]:
m[0,:]
Out[133]:
In [128]:
m[:,:1]
Out[128]:
In [7]:
# elementwise multiplication
np.multiply(e,np.arange(9).reshape((3,3))) # == e * np.arange(9).reshape((3,3))
Out[7]:
In [125]:
# real matrix multiplication
np.dot(e,m)
Out[125]:
In [130]:
E = np.matrix(e)
M = np.matrix(m)
E*M
Out[130]:
In [132]:
M[0,:]
Out[132]:
In [131]:
M[:,0]
Out[131]:
Use sparse matrices for big tables:
In [ ]:
sparseM = scipy.sparse.coo_matrix((values, (row_indices, column_indices)), shape=(max(row_indices)+1, max(column_indices)+1))
In [10]:
arr_1d = np.random.normal(size=3)
arr_1d
Out[10]:
In [11]:
arr_1d[:, np.newaxis]
Out[11]:
In [9]:
arr_1d[np.newaxis, :]
Out[9]:
In [12]:
arr = randn(3, 4, 5)
arr.shape
Out[12]:
In [17]:
depth_means = arr.mean(2)
print(depth_means.shape)
print(depth_means[:, :, np.newaxis].shape)
In [15]:
demean = arr - depth_means[:, :, np.newaxis]
demean.mean(2)
Out[15]:
In [22]:
arr = np.zeros((4, 3))
arr[:2] = [[-1.37], [0.509]]
arr
Out[22]:
In [24]:
arr = np.zeros((4, 3))
arr[:,:2] = [-1.37, 0.509]
arr
Out[24]:
In [6]:
ser = Series([1,4,"three",5.])
print(ser.values)
In [9]:
ser = Series([1,4,"three",5.], index=['dim1','dim2','name','val'])
print(ser.index)
In [18]:
print(ser['val'])
print(ser[(ser == 5)|(ser == 1)])
In [19]:
ser2 = Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000})
print(ser2)
In [21]:
ser3 = Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}, index = ['Texas','Ohio','Berlin','Oregon','Utah'])
print(ser3)
In [22]:
ser3.index = ['T','O','B','OR','U']
print(ser3)
In [80]:
ser4 = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
print(ser4)
print(ser4.reindex(['a','aa','b','c','d','e'],fill_value=99))
In [81]:
ser5 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
ser5.reindex(range(6), method='ffill')
Out[81]:
In [75]:
ser4
Out[75]:
In [45]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002],'popp': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = DataFrame(data, columns=['year', 'state', 'popp', 'debt'], index=['one', 'two', 'three', 'four', 'five'])
df
Out[45]:
In [49]:
df.ix['four'] # == df.ix[3]
Out[49]:
In [71]:
print(df.ix['one','state'])
df.ix[['one','two'],['state','popp']]
Out[71]:
In [48]:
df['popp'] == df.popp
Out[48]:
In [51]:
df['newcol'] = np.arange(5)
df
Out[51]:
In [53]:
del df['debt']
df
Out[53]:
In [85]:
df.T.drop(['year','newcol'])
Out[85]:
In [83]:
df.T.drop('one',axis=1)
Out[83]:
In [56]:
df.values
Out[56]:
In [67]:
2.4 in df.values
Out[67]:
In [91]:
df.T.ix['year':'popp'] # label-based slicing: the last element, i.e. 'popp', is included
Out[91]:
In [96]:
df[df.popp > 2]
Out[96]:
In [100]:
df[df.popp>2].newcol.sum()
Out[100]:
In [5]:
dn = DataFrame(np.arange(12).reshape(4,3),columns=['col2','col1','col3'],index=['1','2','4','3'])
dn
Out[5]:
In [6]:
func_on_series = lambda x:x.max()
dn.apply(func_on_series)
Out[6]:
In [7]:
dn.apply(func_on_series,axis=1)
Out[7]:
In [8]:
dn.applymap(lambda x:x+1) # == dn + 1
Out[8]:
In [9]:
dn.sort('col1',ascending=0)
Out[9]:
In [10]:
dn.sort(['col1','col2'],ascending=[0,1])
Out[10]:
In [121]:
dn.sort_index(ascending=False).sort_index(axis=1)
Out[121]:
In [124]:
dn.col1.order()
Out[124]:
In [133]:
dn.col2 = [34,34,3,99]
dn
Out[133]:
In [134]:
dn.sort_index(by=['col2','col1'],ascending=False)
Out[134]:
In [142]:
dn2 = dn.reindex([str(x) for x in [1,2,4,3,4]])
dn2
Out[142]:
In [144]:
dn2.ix['4']
Out[144]:
In [21]:
dn
Out[21]:
In [22]:
dn.idxmax()
Out[22]:
In [153]:
dn2.count()
Out[153]:
In [23]:
Series([1,2,np.NAN,2]).count()
Out[23]:
In [24]:
dn[dn==4] = np.nan
dn
Out[24]:
In [25]:
dn.col1.isnull()
Out[25]:
In [28]:
np.where(dn.col1.isnull(),-100,dn.col1)
Out[28]:
In [162]:
type(12.8)
Out[162]:
In [15]:
data = DataFrame([[1., 6.5, 3.], [1., np.NAN, np.NAN],[np.NAN, np.NAN, np.NAN], [np.NAN, 6.5, 3.]])
data
Out[15]:
In [17]:
data.fillna(method='bfill') # or ffill
Out[17]:
In [10]:
data.dropna(how='all')
Out[10]:
In [14]:
data.fillna(0, inplace=True) # inplace=True modifies the original data
data
Out[14]:
In [141]:
data = DataFrame({'vals':rand.normal(0,1,size=10),'weights':rand.randint(2,size=10)})
data['results']=data.vals*data.weights
data
Out[141]:
In [142]:
data.vals.corr(data.results)
Out[142]:
In [ ]:
pd.read_csv('some.csv', names=['a', 'b', 'c', 'd', 'message'], header=None) # otherwise the first line would define the column names
pd.read_table('some.txt', sep='\t', index_col='message') # the column 'message' becomes the index of the table
pd.read_csv('some.csv', na_values={'message':'xxx','a':'NULL'}, converters={'message':lambda x:x[0:1]}) # per column you can define which entries represent NaN and how values are parsed by a custom function
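Since some.csv is not part of this notebook, here is a self-contained sketch of the same na_values/converters options using an in-memory CSV (the file content and column names are made up for illustration):
In [ ]:
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO        # Python 3
csv_text = "a,b,message\n1,2,hello\nNULL,4,xxx\n"
pd.read_csv(StringIO(csv_text),
            na_values={'message': ['xxx'], 'a': ['NULL']},  # per-column markers that become NaN
            converters={'b': lambda x: int(x) * 10})        # per-column custom parsing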
In [ ]:
import sys
data.to_csv(sys.stdout, index=False, header=False, sep='|') # write to stdout instead of a file
In [ ]:
# JSON: read in lists of dicts to get rows where keys become columns
tweets = DataFrame(list_of_dicts, columns=list_of_dicts[0].keys())
data.save('binary.data') # pandas only format
data = pd.load('binary.data')
store = pd.HDFStore('mydata.h5') # common scientific format HDF5 good for: write once, read many
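The list_of_dicts above is a placeholder; a minimal made-up example of the pattern (each dict becomes a row, the keys become columns):
In [ ]:
list_of_dicts = [{'user': 'alice', 'retweets': 3}, {'user': 'bob', 'retweets': 7}]
DataFrame(list_of_dicts, columns=['user', 'retweets'])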
In [ ]:
from pandas.io.parsers import TextParser
from lxml.html import parse
from urllib2 import urlopen
parsed = parse(urlopen('http://finance.yahoo.com/q/op?s=AAPL+Options'))
doc = parsed.getroot()
# helper from the book: return the text of all cells (td or th tags) of a table row
def _unpack(row, kind='td'):
    elts = row.findall('.//%s' % kind)
    return [val.text_content() for val in elts]
# create a DataFrame from a table tag in the html document:
def parse_options_data(doc):
    tables = doc.findall('.//table')     # select the tables
    table = tables[0]
    rows = table.findall('.//tr')        # select the rows of the table
    header = _unpack(rows[0], kind='th') # extract the header row
    data = [_unpack(r) for r in rows[1:]]
    return TextParser(data, names=header).get_chunk() # parse into a DataFrame
In [ ]:
import pandas.io.sql as sql
import sqlite3
con = sqlite3.connect(':myDB:')
sql.read_frame('select * from test', con)
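The cell above assumes a database with a table called test already exists; the following sketch first builds a throwaway in-memory database so the query actually returns something (sql.read_frame is the pandas 0.x function used here):
In [ ]:
con = sqlite3.connect(':memory:')  # temporary in-memory database
con.execute('create table test (a integer, b text)')
con.executemany('insert into test values (?, ?)', [(1, 'x'), (2, 'y')])
con.commit()
sql.read_frame('select * from test', con)  # returns a DataFrame with columns a and b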
In [ ]:
# a.combine_first(b)
# a's values prioritized, use values from b to fill holes
# data1.update(data2, overwrite=False)
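To make the two commented calls above concrete, a small made-up sketch (a, b, data1 and data2 are invented here):
In [ ]:
a = Series([1.0, np.nan, 3.0], index=['x', 'y', 'z'])
b = Series([10.0, 20.0, 30.0], index=['x', 'y', 'z'])
print(a.combine_first(b))  # a's values win, the hole at 'y' is filled from b
data1 = DataFrame({'v': [1.0, np.nan]})
data2 = DataFrame({'v': [100.0, 200.0]})
data1.update(data2, overwrite=False)  # with overwrite=False only NaNs in data1 are filled
data1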
In [ ]:
pd.merge(df1, df2, on='key', how='left') # use a list to join on multiple keys; 'inner' is the default, 'right' and 'outer' are the other options for the how parameter
df1.join(df2, on='key') # similar, but join matches the 'key' column of df1 against the index of df2 (and defaults to a left join)
df1.join([df2,df3]) # join three DataFrames
pd.merge(left1, right1, left_on='key', right_index=True) # joins tables by using column 'key' of left table to match with index of right table
left1.join(right1) # same but in a neat way
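A tiny runnable version of the merge/join calls above (df_l and df_r are invented example frames):
In [ ]:
df_l = DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
df_r = DataFrame({'key': ['a', 'b', 'd'], 'rval': [10, 20, 40]})
print(pd.merge(df_l, df_r, on='key', how='left'))  # keep all rows of df_l, NaN where df_r has no match
df_l.join(df_r.set_index('key'), on='key')         # same idea: match the 'key' column of df_l against df_r's index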
In [7]:
arr = np.arange(12).reshape((3, 4))
print(arr)
print(np.concatenate([arr,arr],axis=1))
print(np.concatenate([arr,arr]))
In [39]:
s1 = Series([0, 1], index=['a', 'b'])
s2 = Series([2, 3, 4, 5], index=['c', 'd', 'e', 'b'])
def psdf(data):
    print("/////////////////\n%s" % data)
psdf(pd.concat([s2, s1]))
psdf(pd.concat([s1, s2],axis=1))
psdf(pd.concat([s1, s2],axis=1,join='inner'))
In [31]:
# rotating between row and column indices
data = DataFrame(np.arange(6).reshape((2, 3)), index=pd.Index(['Ohio', 'Colorado'], name='state'), columns=pd.Index(['one', 'two', 'three'], name='number'))
psdf(data)
psdf(data.stack())
psdf(data.stack().unstack())
In [56]:
data = DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [1, 1, 2, 3, 3, 4, 4]})
psdf(data)
psdf(data.drop_duplicates())
In [60]:
data['k3'] = range(7)
psdf(data)
psdf(data.replace([1,2],[111,222]))
psdf(data.replace({1:111,2:222})) # should do the same but is not working
In [47]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
Out[47]:
In [50]:
cats.labels
Out[50]:
In [51]:
cats.levels
Out[51]:
In [53]:
pd.value_counts(cats)
Out[53]:
In [54]:
cats = pd.cut(ages,4) # create four equal-width bins
print(cats.levels)
pd.value_counts(cats)
Out[54]:
In [55]:
cats = pd.qcut(ages,4) # create bins that each contain (roughly) the same number of elements, i.e. quartiles
print(cats.levels)
pd.value_counts(cats)
Out[55]:
In [117]:
import re
data = Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com', 'Rob': 'rob@gmail.com', 'Wes': np.nan})
data
Out[117]:
In [82]:
data.str[:2]
Out[82]:
In [83]:
data.str.join(sep=".")
Out[83]:
In [77]:
data.str.replace("c",'999') # replace every 'c' in each string
Out[77]:
In [119]:
matches = data.str.match('([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})', flags=re.IGNORECASE)
matches.str[1]
Out[119]:
In [16]:
randomwalk = rand.randn(40)
fig = plt.figure()
ax1 = fig.add_subplot(2,2,1)
ax1.plot(randomwalk.cumsum(),'o-k')
Out[16]:
In [17]:
comp1 = np.random.normal(0, 1, size=200) # N(0, 1)
comp2 = np.random.normal(10, 2, size=200) # N(10, 4)
values = Series(np.concatenate([comp1, comp2]))
values.hist(bins=100, alpha=0.3, color='k', normed=True)
values.plot(kind='kde', style='k--')
Out[17]:
In [18]:
# making some histogram:
s = Series(np.random.randn(100000))
hist = np.histogram(s,bins=100)
plt.bar(hist[1][:-1],hist[0],width=.1)
Out[18]:
In [154]:
tmpdata = DataFrame(np.random.normal(size=(100,3)))
tmpdata['cat'] = 'A'
data = DataFrame(np.random.normal(10,2,size=(150,3)))
data['cat']='B'
tmpdata2 = DataFrame(np.random.normal(50,2,size=(150,3)))
tmpdata2['cat']='C'
data = pd.concat([tmpdata,data,tmpdata2])
del tmpdata
data[:4]
Out[154]:
In [19]:
hlp.plot_categorical_data(data,[0,1])
Out[19]:
In [20]:
fig, ax = plt.subplots()
ax.scatter(data[data.cat=='A'][0],data[data.cat=='A'][1],marker='.',facecolor='b')
ax.scatter(data[data.cat=='B'][0],data[data.cat=='B'][1],marker='x',facecolor='r')
plt.show()
In [21]:
pd.scatter_matrix(data)
Out[21]:
In [18]:
df = DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a','a'],'key2' : ['one', 'two', 'one', 'two', 'one','three'],'data1' : np.random.randn(6),'data2' : np.random.randn(6)})
df
Out[18]:
In [35]:
df.groupby('key1').min()
Out[35]:
In [22]:
for name,group in df.groupby('key2'):
    print("Group %s:" % name)
    print(group)
In [21]:
pieces = dict(list(df.groupby('key1')))
pieces['a']
Out[21]:
In [37]:
df.groupby('key2')['data1'].count() # == ... .size()
Out[37]:
In [29]:
people = DataFrame(np.random.randn(5, 5),columns=['a', 'b', 'c', 'd', 'e'],index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people
Out[29]:
In [30]:
# Grouping by groups of keys:
people.groupby({'a':'aa','b':'bb','c':'aa','d':'aa','e':'bb'},axis=1).sum()
Out[30]:
In [40]:
# Grouping by function:
people.groupby(lambda idx:idx[0]).first()
Out[40]:
In [34]:
# call multiple and customized aggregation functions:
df.groupby('key2').agg([('Mean','mean'),'sum',('User function', lambda arr: min(arr))])
Out[34]:
In [44]:
# call different aggregation functions on different columns:
df.groupby('key2').agg({'data1':[min,max],'data2':sum})
Out[44]:
In [53]:
df.groupby('key1').transform(abs)
Out[53]:
In [54]:
def demean(arr):
    return arr - arr.mean()
df.groupby('key1').transform(demean)
Out[54]:
In [55]:
df.groupby('key1').transform(demean).mean()
Out[55]:
In [56]:
df.groupby('key1').apply(lambda dataframe:dataframe.describe())
Out[56]:
In [58]:
frame = DataFrame({'data1': np.random.randn(1000),'data2': np.random.randn(1000)})
factor = pd.cut(frame.data1, 4)
factor[:10]
Out[58]:
In [67]:
grouped = frame.data2.groupby(factor)
grouped.min()
Out[67]:
In [80]:
def get_stats(group,some_parameter):
    return {'min': group.min() + some_parameter, 'max': group.max(), 'count': group.count(), 'mean': group.mean()}
grouped.apply(get_stats, some_parameter=100).unstack()
Out[80]:
In [184]:
tips = pd.read_csv('ch08/tips.csv')
tips[:10]
Out[184]:
In [163]:
tips.pivot_table(rows=['sex', 'smoker']) # == tips.groupby(['sex','smoker']).mean()
Out[163]:
In [177]:
tips.pivot_table(['tip','total_bill'],rows=['day','smoker'],cols='sex')
Out[177]:
In [183]:
tips.pivot_table(['tip','total_bill'],rows=['day','smoker'],cols='sex',aggfunc=sum,margins=True)
Out[183]:
In [193]:
# counting how often a certain pair of values occurs:
ser1 = Series(rand.randint(0,2,10))
ser2 = Series(rand.randint(10,12,10))
ser3 = Series(rand.randint(20,22,10))
dt = DataFrame([ser1,ser2,ser3])
print(dt)
pd.crosstab(ser1,ser2,margins=True)
Out[193]:
In [194]:
pd.crosstab([ser1,ser2],ser3,margins=True)
Out[194]:
In [42]:
now = datetime.now()
now.year, now.month, now.day, now.hour, now.minute
Out[42]:
In [51]:
delta = now-datetime.now()
delta.days
Out[51]:
In [53]:
now.strftime('%d.%m.%Y')
Out[53]:
In [56]:
datetime.strptime("2012-12-15", '%Y-%m-%d')
Out[56]:
In [67]:
dateutil.parser.parse("07:56:59 / 12. Sep 1976", dayfirst=True)
Out[67]:
In [72]:
# parse arrays of date strings with pandas:
pd.to_datetime(['7/6/2011',None])
Out[72]:
In [103]:
ts = Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts[:5]
Out[103]:
In [104]:
ts['2000-02']
Out[104]:
In [105]:
ts['2000-01-05':'2000-01-08']
Out[105]:
In [ ]:
# use ts.between_time to select between hours of a day
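A short made-up example of between_time, using an hourly index so there are several timestamps per day:
In [ ]:
intraday = Series(np.arange(48), index=pd.date_range('1/1/2000', periods=48, freq='H'))
intraday.between_time('09:00', '12:00')  # keep only the rows whose time of day falls into this window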
In [84]:
pd.date_range('1/1/2000',periods=2,freq="D")
Out[84]:
In [91]:
pd.date_range('1/1/2000',periods=3,freq="WOM-3FRI") # the third Friday of every month, i.e. there are lots of frequency options!
Out[91]:
In [90]:
ts[:'2000-01-2'].resample('H')
Out[90]:
In [92]:
# to compute the (percentage) change between consecutive data points use something like: ts / ts.shift(1) - 1
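A quick check (using the ts series created above) that the shift-based formula matches pandas' built-in pct_change:
In [ ]:
manual = ts / ts.shift(1) - 1
print(manual[:3])
print(ts.pct_change()[:3])  # built-in equivalent of the formula above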
In [98]:
import pytz
ts = pd.date_range('3/9/2012 9:30', periods=3, freq='D').tz_localize('CET') # == pd.date_range('3/9/2012 9:30', periods=3, freq='D', tz='CET')
ts
Out[98]:
In [100]:
ts.tz_convert('US/Eastern')
Out[100]:
In [101]:
ts.tz_convert('Asia/Shanghai')
Out[101]:
In [102]:
# periods are time spans:
p = pd.Period(2007, freq='A-DEC') # i.e. (1.1.2007 - 31.12.2007)
p + 1 # i.e. (1.1.2008 - 31.12.2008)
Out[102]:
In [22]:
import pandas.io.data as web
data = web.get_data_yahoo('GOOG', '2013-01-01')
In [23]:
data[-10:]
Out[23]:
In [ ]: