version 0.2, May 2016
This notebook is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Special thanks go to Donne Martin and to Wes McKinney's Python for Data Analysis book.
In [1]:
import pandas as pd
import numpy as np
Create a Series:
In [2]:
ser_1 = pd.Series([1, 1, 2, -3, -5, 8, 13])
ser_1
Out[2]:
Get the array representation of a Series:
In [3]:
ser_1.values
Out[3]:
Index objects are immutable and hold the axis labels and metadata such as names and axis names.
Get the index of the Series:
In [4]:
ser_1.index
Out[4]:
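A minimal sketch of the immutability claim above (illustrative, not part of the original notebook): assigning to an element of an Index raises a TypeError.
idx = ser_1.index
try:
    idx[0] = 99          # Index objects do not support item assignment
except TypeError as err:
    print(err)           # e.g. "Index does not support mutable operations"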
Create a Series with a custom index:
In [5]:
ser_2 = pd.Series([1, 1, 2, -3, -5], index=['a', 'b', 'c', 'd', 'e'])
ser_2
Out[5]:
Get a value from a Series:
In [6]:
ser_2[4] == ser_2['e']
Out[6]:
Get a set of values from a Series by passing in a list:
In [7]:
ser_2[['c', 'a', 'b']]
Out[7]:
Get values greater than 0:
In [8]:
ser_2[ser_2 > 0]
Out[8]:
Scalar multiply:
In [9]:
ser_2 * 2
Out[9]:
Apply a NumPy math function:
In [10]:
np.exp(ser_2)
Out[10]:
A Series is like a fixed-length, ordered dict.
Create a Series by passing in a dict:
In [11]:
dict_1 = {'foo' : 100, 'bar' : 200, 'baz' : 300}
ser_3 = pd.Series(dict_1)
ser_3
Out[11]:
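A small illustration of the dict-like behavior (a sketch, not from the original notebook): membership tests check the index labels, and the Series can be converted back to a dict.
'foo' in ser_3     # True: 'in' checks the index labels, like a dict's keys
ser_3.to_dict()    # {'foo': 100, 'bar': 200, 'baz': 300}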
Reorder a Series by passing in an index (values for indices not found are set to NaN):
In [12]:
index = ['foo', 'bar', 'baz', 'qux']
ser_4 = pd.Series(dict_1, index=index)
ser_4
Out[12]:
Check for NaN with the pandas method:
In [13]:
pd.isnull(ser_4)
Out[13]:
Check for NaN with the Series method:
In [14]:
ser_4.isnull()
Out[14]:
A Series automatically aligns differently indexed data in arithmetic operations:
In [15]:
ser_3 + ser_4
Out[15]:
Name a Series:
In [16]:
ser_4.name = 'foobarbazqux'
Name a Series index:
In [17]:
ser_4.index.name = 'label'
In [18]:
ser_4
Out[18]:
Rename a Series' index in place:
In [19]:
ser_4.index = ['fo', 'br', 'bz', 'qx']
ser_4
Out[19]:
A DataFrame is a tabular data structure containing an ordered collection of columns. Each column can have a different type. DataFrames have both row and column indices and are analogous to a dict of Series. Row and column operations are treated roughly symmetrically. Columns returned when indexing a DataFrame are views of the underlying data, not a copy. To obtain a copy, use the Series' copy method.
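A minimal sketch of the view-versus-copy point above (the DataFrame here is illustrative, not from the notebook):
df_tmp = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
col_view = df_tmp['x']          # may be a view of the underlying data
col_copy = df_tmp['x'].copy()   # independent copy; modifying it leaves df_tmp unchanged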
Create a DataFrame:
In [20]:
data_1 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
          'year' : [2012, 2013, 2014, 2014, 2015],
          'pop' : [5.0, 5.1, 5.2, 4.0, 4.1]}
df_1 = pd.DataFrame(data_1)
df_1
Out[20]:
Create a DataFrame specifying a sequence of columns:
In [21]:
df_2 = pd.DataFrame(data_1, columns=['year', 'state', 'pop'])
df_2
Out[21]:
As with Series, columns that are not present in the data are filled with NaN:
In [22]:
df_3 = pd.DataFrame(data_1, columns=['year', 'state', 'pop', 'unempl'])
df_3
Out[22]:
Retrieve a column by key, returning a Series:
In [23]:
df_3['state']
Out[23]:
Retrieve a column by attribute, returning a Series:
In [24]:
df_3.year
Out[24]:
Retrieve a row by position:
In [25]:
df_3.iloc[0]
Out[25]:
Update a column by assignment:
In [26]:
df_3['unempl'] = np.arange(5)
df_3
Out[26]:
Assign a Series to a column (note that if you assign a list or array, its length must match the DataFrame; a Series is instead aligned on the index):
In [27]:
unempl = pd.Series([6.0, 6.0, 6.1], index=[2, 3, 4])
df_3['unempl'] = unempl
df_3
Out[27]:
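To illustrate the length requirement mentioned above (a sketch; the values are made up): assigning a list that is shorter than the DataFrame raises a ValueError, whereas the Series above is simply aligned on the index.
try:
    df_3['unempl'] = [6.0, 6.0]   # only 2 values for a 5-row DataFrame
except ValueError as err:
    print(err)                    # e.g. "Length of values does not match length of index"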
Assign to a column that doesn't exist to create a new column:
In [28]:
df_3['state_dup'] = df_3['state']
df_3
Out[28]:
Delete a column:
In [29]:
del df_3['state_dup']
df_3
Out[29]:
Create a DataFrame from a nested dict of dicts (the keys in the inner dicts are unioned and sorted to form the index in the result, unless an explicit index is specified):
In [30]:
pop = {'VA' : {2013 : 5.1, 2014 : 5.2},
       'MD' : {2014 : 4.0, 2015 : 4.1}}
df_4 = pd.DataFrame(pop)
df_4
Out[30]:
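To illustrate the "unless an explicit index is specified" clause above (the extra year is illustrative):
pd.DataFrame(pop, index=[2014, 2015, 2016])   # 2016 is absent from the inner dicts, so its row is NaN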
Transpose the DataFrame:
In [31]:
df_4.T
Out[31]:
Create a DataFrame from a dict of Series:
In [32]:
data_2 = {'VA' : df_4['VA'][1:],
          'MD' : df_4['MD'][2:]}
df_5 = pd.DataFrame(data_2)
df_5
Out[32]:
Set the DataFrame index name:
In [33]:
df_5.index.name = 'year'
df_5
Out[33]:
Set the DataFrame columns name:
In [34]:
df_5.columns.name = 'state'
df_5
Out[34]:
Return the data contained in a DataFrame as a 2D ndarray:
In [35]:
df_5.values
Out[35]:
If the columns have different dtypes, the 2D ndarray's dtype will accommodate all of the columns:
In [36]:
df_3.values
Out[36]:
Reindexing creates a new object with the data conformed to a new index. Any missing values are set to NaN.
In [37]:
df_3
Out[37]:
Reindexing rows returns a new frame with the specified index:
In [38]:
df_3.reindex(list(reversed(range(0, 6))))
Out[38]:
Missing values can be set to something other than NaN:
In [39]:
df_3.reindex(range(6), fill_value=0)
Out[39]:
Fill values when reindexing ordered data such as a time series:
In [40]:
ser_5 = pd.Series(['foo', 'bar', 'baz'], index=[0, 2, 4])
In [41]:
ser_5.reindex(range(5), method='ffill')
Out[41]:
In [42]:
ser_5.reindex(range(5), method='bfill')
Out[42]:
Reindex columns:
In [43]:
df_3.reindex(columns=['state', 'pop', 'unempl', 'year'])
Out[43]:
Reindex rows and columns, filling missing rows with 0:
In [44]:
df_3.reindex(index=list(reversed(range(0, 6))),
             fill_value=0,
             columns=['state', 'pop', 'unempl', 'year'])
Out[44]:
Reindex rows and columns, assigning the result to a new DataFrame:
In [45]:
df_6 = df_3.reindex(index=range(0, 7), columns=['state', 'pop', 'unempl', 'year'])
df_6
Out[45]:
Drop rows from a Series or DataFrame:
In [46]:
df_7 = df_6.drop([0, 1])
df_7
Out[46]:
Drop columns from a DataFrame:
In [47]:
df_7 = df_7.drop('unempl', axis=1)
df_7
Out[47]:
Series indexing is similar to NumPy array indexing with the added bonus of being able to use the Series' index values.
In [48]:
ser_2
Out[48]:
Select a value from a Series:
In [49]:
ser_2[0] == ser_2['a']
Out[49]:
Select a slice from a Series:
In [50]:
ser_2[1:4]
Out[50]:
Select specific values from a Series:
In [51]:
ser_2[['b', 'c', 'd']]
Out[51]:
Select from a Series based on a filter:
In [52]:
ser_2[ser_2 > 0]
Out[52]:
Select a slice from a Series with labels (note the end point is inclusive):
In [53]:
ser_2['a':'c']
Out[53]:
Assign to a Series slice (note the end point is inclusive):
In [54]:
ser_2['a':'b'] = 0
ser_2
Out[54]:
Pandas supports indexing into a DataFrame.
In [55]:
df_6
Out[55]:
Select specified columns from a DataFrame:
In [56]:
df_6[['pop', 'unempl']]
Out[56]:
Select a slice from a DataFrame:
In [57]:
df_6[:2]
Out[57]:
Select from a DataFrame based on a filter:
In [58]:
df_6[df_6['pop'] > 5]
Out[58]:
Perform a scalar comparison on a DataFrame:
In [59]:
df_6 > 5
Out[59]:
Perform a scalar comparison on a DataFrame and retain the values that pass the filter:
In [60]:
df_6[df_6 > 5]
Out[60]:
Select a slice of rows from a DataFrame (note the end point is inclusive):
In [61]:
df_6.loc[2:3]
Out[61]:
Select a slice of rows from a specific column of a DataFrame:
In [62]:
df_6.loc[0:2, 'pop']
Out[62]:
Select rows based on a comparison on a specific column:
In [63]:
df_6.loc[df_6.unempl > 5.0]
Out[63]:
Adding Series objects with different indexes results in the union of the indexes, with NaN for labels that do not overlap:
In [64]:
np.random.seed(0)
ser_6 = pd.Series(np.random.randn(5),
                  index=['a', 'b', 'c', 'd', 'e'])
ser_6
Out[64]:
In [65]:
np.random.seed(1)
ser_7 = pd.Series(np.random.randn(5),
                  index=['a', 'c', 'e', 'f', 'g'])
ser_7
Out[65]:
In [66]:
ser_6 + ser_7
Out[66]:
Set a fill value instead of NaN for indices that do not overlap:
In [67]:
ser_6.add(ser_7, fill_value=0)
Out[67]:
Adding DataFrame objects with different row or column indexes results in the union of those indexes, with NaN for labels that do not overlap:
In [68]:
np.random.seed(0)
df_8 = pd.DataFrame(np.random.rand(9).reshape((3, 3)),
                    columns=['a', 'b', 'c'])
df_8
Out[68]:
In [69]:
np.random.seed(1)
df_9 = pd.DataFrame(np.random.rand(9).reshape((3, 3)),
                    columns=['b', 'c', 'd'])
df_9
Out[69]:
In [70]:
df_8 + df_9
Out[70]:
Set a fill value instead of NaN for indices that do not overlap:
In [71]:
df_10 = df_8.add(df_9, fill_value=0)
df_10
Out[71]:
Like NumPy, pandas supports arithmetic operations between DataFrames and Series.
Match the index of the Series on the DataFrame's columns, broadcasting down the rows:
In [72]:
ser_8 = df_10.iloc[0]
df_11 = df_10 - ser_8
df_11
Out[72]:
Match the index of the Series on the DataFrame's columns, broadcasting down the rows and taking the union of the indices that do not match:
In [73]:
ser_9 = pd.Series(range(3), index=['a', 'd', 'e'])
ser_9
Out[73]:
In [74]:
df_11 - ser_9
Out[74]:
Broadcast over the columns and match the rows (axis=0) by using an arithmetic method:
In [75]:
df_10
Out[75]:
In [76]:
ser_10 = pd.Series([100, 200, 300])
ser_10
Out[76]:
In [77]:
df_10.sub(ser_10, axis=0)
Out[77]:
NumPy ufuncs (element-wise array methods) operate on pandas objects:
In [78]:
df_11 = np.abs(df_11)
df_11
Out[78]:
Apply a function on 1D arrays to each column:
In [79]:
func_1 = lambda x: x.max() - x.min()
df_11.apply(func_1)
Out[79]:
Apply a function on 1D arrays to each row:
In [80]:
df_11.apply(func_1, axis=1)
Out[80]:
Apply a function and return a DataFrame:
In [81]:
func_2 = lambda x: pd.Series([x.min(), x.max()], index=['min', 'max'])
df_11.apply(func_2)
Out[81]:
Apply an element-wise Python function to a DataFrame:
In [82]:
func_3 = lambda x: '%.2f' %x
df_11.applymap(func_3)
Out[82]:
Apply an element-wise Python function to a Series:
In [83]:
df_11['a'].map(func_3)
Out[83]:
In [84]:
ser_4
Out[84]:
Sort a Series by its index:
In [85]:
ser_4.sort_index()
Out[85]:
Sort a Series by its values:
In [86]:
ser_4.sort_values()
Out[86]:
In [87]:
df_12 = pd.DataFrame(np.arange(12).reshape((3, 4)),
                     index=['three', 'one', 'two'],
                     columns=['c', 'a', 'b', 'd'])
df_12
Out[87]:
Sort a DataFrame by its index:
In [88]:
df_12.sort_index()
Out[88]:
Sort a DataFrame by columns in descending order:
In [89]:
df_12.sort_index(axis=1, ascending=False)
Out[89]:
Sort a DataFrame's values by column:
In [90]:
df_12.sort_values(by=['d', 'c'])
Out[90]:
Ranking is similar to numpy.argsort, except that ties are broken by assigning each tied group the mean of the ranks it spans (for example, two values tied for ranks 4 and 5 each receive 4.5):
In [91]:
ser_11 = pd.Series([7, -5, 7, 4, 2, 0, 4, 7])
ser_11 = ser_11.sort_values()
ser_11
Out[91]:
In [92]:
ser_11.rank()
Out[92]:
Rank a Series according to the order in which values appear in the data:
In [93]:
ser_11.rank(method='first')
Out[93]:
Rank a Series in descending order, using the maximum rank for the group:
In [94]:
ser_11.rank(ascending=False, method='max')
Out[94]:
DataFrames can rank over rows or columns.
In [95]:
df_13 = pd.DataFrame({'foo' : [7, -5, 7, 4, 2, 0, 4, 7],
                      'bar' : [-5, 4, 2, 0, 4, 7, 7, 8],
                      'baz' : [-1, 2, 3, 0, 5, 9, 9, 5]})
df_13
Out[95]:
Rank a DataFrame over rows:
In [96]:
df_13.rank()
Out[96]:
Rank a DataFrame over columns:
In [97]:
df_13.rank(axis=1)
Out[97]:
Labels do not have to be unique in Pandas:
In [98]:
ser_12 = pd.Series(range(5), index=['foo', 'foo', 'bar', 'bar', 'baz'])
ser_12
Out[98]:
In [99]:
ser_12.index.is_unique
Out[99]:
Select Series elements:
In [100]:
ser_12['foo']
Out[100]:
Select DataFrame elements:
In [101]:
df_14 = pd.DataFrame(np.random.randn(5, 4),
                     index=['foo', 'foo', 'bar', 'bar', 'baz'])
df_14
Out[101]:
In [102]:
df_14.loc['bar']
Out[102]:
Unlike NumPy arrays, Pandas descriptive statistics automatically exclude missing data. NaN values are excluded unless the entire row or column is NA.
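A minimal sketch of the NaN-skipping behavior (the frame below is illustrative, not part of the notebook):
df_nan = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                       'y': [2.0, 5.0, np.nan]})
df_nan.sum()               # NaN values are skipped: x -> 4.0, y -> 7.0
df_nan.sum(skipna=False)   # keep NaN: both sums become NaN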
In [103]:
df_15 = pd.DataFrame(np.random.randn(10, 3),
                     columns=['a', 'b', 'c'])
df_15['cat1'] = (np.random.rand(10) * 3).round(0)
df_15['cat2'] = (np.random.rand(10)).round(0)
In [104]:
df_15
Out[104]:
In [105]:
df_15.sum()
Out[105]:
In [106]:
df_15.sum(axis=1)
Out[106]:
In [107]:
df_15.mean(axis=0)
Out[107]:
In [108]:
df_15.max()
Out[108]:
In [109]:
df_15.idxmax()
Out[109]:
In [110]:
df_15['a'].describe()
Out[110]:
In [111]:
df_15['cat1'].value_counts()
Out[111]:
In [112]:
pd.pivot_table(df_15, index='cat1', aggfunc=np.mean)
Out[112]:
In [113]:
pd.pivot_table(df_15, index='cat1', columns='cat2', values='b', aggfunc=np.sum)
Out[113]: