Forked from 10 Minutes to pandas
In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
http://pandas.pydata.org/pandas-docs/stable/10min.html#object-creation
Creating a Series by passing a list of values, letting pandas create a default integer index:
In [2]:
s = pd.Series([1,3,5,np.nan,6,8])
s
Out[2]:
Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns:
In [3]:
dates = pd.date_range('20130101', periods=6)
dates
Out[3]:
In [4]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
Out[4]:
Creating a DataFrame by passing a dict of objects that can be converted to series-like.
In [5]:
df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })
df2
Out[5]:
In [6]:
df2.dtypes
Out[6]:
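Since the cell outputs are not reproduced here, the following is a minimal, self-contained sketch of how the dict-based constructor broadcasts scalars and keeps a separate dtype per column. The names demo and dtype_names are hypothetical, and the exact dtype reported for the timestamp and string columns may differ between pandas versions, so the comments describe them only loosely.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring df2: scalar values are broadcast to the
# length implied by the array-like values (4 rows here).
demo = pd.DataFrame({
    'A': 1.,                                                   # scalar float
    'B': pd.Timestamp('20130102'),                             # scalar timestamp
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo',                                                # scalar string
})

# Each column keeps its own dtype; collect them as plain strings.
dtype_names = {col: str(dt) for col, dt in demo.dtypes.items()}
print(dtype_names)
```

Columns C, D, and E keep the dtypes they were given explicitly, while A, B, and F get whatever dtype pandas infers for the scalar.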
If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here’s a subset of the attributes that will be completed:
In [13]: df2.<TAB>
df2.A df2.boxplot
df2.abs df2.C
df2.add df2.clip
df2.add_prefix df2.clip_lower
df2.add_suffix df2.clip_upper
df2.align df2.columns
df2.all df2.combine
df2.any df2.combineAdd
df2.append df2.combine_first
df2.apply df2.combineMult
df2.applymap df2.compound
df2.as_blocks df2.consolidate
df2.asfreq df2.convert_objects
df2.as_matrix df2.copy
df2.astype df2.corr
df2.at df2.corrwith
df2.at_time df2.count
df2.axes df2.cov
df2.B df2.cummax
df2.between_time df2.cummin
df2.bfill df2.cumprod
df2.blocks df2.cumsum
df2.bool df2.D
As you can see, the columns A, B, C, and D are automatically tab completed. E is there as well; the rest of the attributes have been truncated for brevity.
In [7]:
df.head()
Out[7]:
In [8]:
df.tail(3)
Out[8]:
In [9]:
df.index
Out[9]:
In [10]:
df.columns
Out[10]:
In [11]:
df.values
Out[11]:
In [12]:
df.describe()
Out[12]:
Transposing your data
In [13]:
df.T
Out[13]:
Sorting by an axis
In [14]:
df.sort_index(axis=1, ascending=False)
Out[14]:
Sorting by values
In [15]:
df.sort_values(by='B')
Out[15]:
While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods .at, .iat, .loc, and .iloc. (The older .ix indexer is deprecated and was removed in pandas 1.0.)
In [16]:
df['A']
Out[16]:
In [17]:
df[0:3]
Out[17]:
In [18]:
df['20130102':'20130104']
Out[18]:
For getting a cross section using a label
SKIPPING FOR NOW. http://pandas.pydata.org/pandas-docs/stable/10min.html#selection-by-label
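As a rough stand-in for the skipped section, here is a minimal sketch of label-based selection with .loc and .at. It builds its own small frame (the names frame, row, cols, block, and value are hypothetical) and uses deterministic values rather than the random data in df, so the results are easy to check by eye.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
# Deterministic values 0..23 instead of random numbers.
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

row = frame.loc[dates[0]]                             # cross section for one label (a Series)
cols = frame.loc[:, ['A', 'B']]                       # all rows, two columns selected by label
block = frame.loc['20130102':'20130104', ['A', 'B']]  # label slices include BOTH endpoints
value = frame.at[dates[0], 'A']                       # fast scalar access by label
```

Note that the label slice returns three rows (both endpoints included), unlike positional slices, which stop before the end.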
Select via the position of the passed integers
In [19]:
df.iloc[3]
Out[19]:
By integer slices, acting similar to numpy/python
In [20]:
df.iloc[3:5,0:2]
Out[20]:
By lists of integer position locations, similar to the numpy/python style
In [21]:
df.iloc[[1,2,4],[0,2]]
Out[21]:
For slicing rows explicitly
In [22]:
df.iloc[1:3,:]
Out[22]:
For slicing columns explicitly
In [23]:
df.iloc[:,1:3]
Out[23]:
For getting a value explicitly
In [24]:
df.iloc[1,1]
Out[24]:
For getting fast access to a scalar (equivalent to the prior method)
In [25]:
df.iat[1,1]
Out[25]:
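The positional accessors above can be compared on a small deterministic frame. This is only an illustrative sketch; frame, by_iloc, by_iat, and half_open are hypothetical names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list('ABCD'))

by_iloc = frame.iloc[1, 1]         # general positional indexer
by_iat = frame.iat[1, 1]           # scalar-only positional accessor
half_open = frame.iloc[3:5, 0:2]   # positional slices exclude the end, like Python lists

print(by_iloc, by_iat)
print(half_open)
```

Both scalar lookups return the same cell; .iat only skips the overhead of handling slices and lists.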
Using a single column’s values to select data.
In [26]:
df[df.A > 0]
Out[26]:
A where operation for getting.
In [27]:
df > 0
Out[27]:
In [28]:
df[df > 0]
Out[28]:
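Because the outputs are missing, a small sketch may help show what the where operation returns. The frame below is a hypothetical example with fixed values; the point is that the result keeps the original shape and fills non-matching cells with NaN.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1.0, -2.0, 3.0], 'B': [-1.0, 5.0, -6.0]})

mask = frame > 0            # boolean DataFrame of the same shape
kept = frame[mask]          # same shape; cells failing the test become NaN
same = frame.where(mask)    # explicit spelling of the same operation

print(kept)
```

This differs from df[df.A > 0], which uses a boolean Series and drops whole rows instead of masking individual cells.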
Using the isin() method for filtering:
In [29]:
df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']
df2
Out[29]:
In [30]:
df2[df2['E'].isin(['two','four'])]
Out[30]:
In [31]:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
s1
Out[31]:
In [32]:
df['F'] = s1
Setting values by label
In [33]:
df.at[dates[0],'A'] = 0
Setting values by position
In [34]:
df.iat[0,1] = 0
Setting by assigning with a numpy array
In [35]:
df.loc[:,'D'] = np.array([5] * len(df))
The result of the prior setting operations
In [36]:
df
Out[36]:
A where operation with setting.
In [37]:
df2 = df.copy()
df2[df2 > 0] = -df2
df2
Out[37]:
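The effect of the masked assignment can be checked on fixed values. This sketch uses a hypothetical frame and assumes purely numeric columns, in which case the result should match negating the absolute value of every cell.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1.0, -2.0, 3.0], 'B': [-1.0, 5.0, -6.0]})

flipped = frame.copy()              # work on a copy so `frame` is untouched
flipped[flipped > 0] = -flipped     # only the positive cells are overwritten

# Assumption: for numeric data this is the same as -abs(x) elementwise.
expected = -frame.abs()

print(flipped)
```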
http://pandas.pydata.org/pandas-docs/stable/10min.html#missing-data
pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations.
Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.
In [38]:
df
Out[38]:
In [39]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1
Out[39]:
To drop any rows that have missing data.
In [40]:
df1.dropna(how='any')
Out[40]:
Filling missing data
In [41]:
df1.fillna(value=5)
Out[41]:
To get the boolean mask where values are nan
In [42]:
pd.isnull(df1)
Out[42]:
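The missing-data behaviour described in this section can be sketched on a tiny frame with known values. All names below (frame, wider, col_means, complete, filled, mask) are hypothetical; the sketch only illustrates that NaN is skipped in computations and that reindex, dropna, and fillna return new objects.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'E': [1.0, np.nan, np.nan]})

wider = frame.reindex(columns=['A', 'E', 'Z'])  # new column Z is all NaN; `frame` is unchanged
col_means = frame.mean()                        # NaN is skipped: the mean of E is 1.0
complete = frame.dropna(how='any')              # keeps only rows with no missing value
filled = frame.fillna(value=5)                  # returns a new frame
mask = frame.isna()                             # same boolean mask as pd.isnull(frame)

print(col_means)
```

None of these calls modify frame in place, so the original missing values are still there afterwards.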
In [43]:
df.mean()
Out[43]:
Same operation on the other axis
In [44]:
df.mean(axis=1)
Out[44]:
Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension.
In [45]:
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
s
Out[45]:
In [46]:
df.sub(s, axis='index')
Out[46]:
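To make the alignment and broadcasting concrete, here is a minimal sketch with constant data instead of random numbers. The frame and series are hypothetical; the idea is that the Series is matched to the row labels and then applied to every column.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=4)
frame = pd.DataFrame(np.ones((4, 2)) * 10, index=dates, columns=['A', 'B'])

# shift(2) moves values two rows down, leaving NaN in the first two slots.
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=dates).shift(2)

# axis='index' aligns s with the row labels and broadcasts it across columns.
result = frame.sub(s, axis='index')

print(result)
```

Rows where the shifted Series holds NaN come out as NaN in every column, since the missing value propagates through the subtraction.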
In [47]:
df.apply(np.cumsum)
Out[47]:
In [48]:
df.apply(lambda x: x.max() - x.min())
Out[48]:
In [49]:
s = pd.Series(np.random.randint(0, 7, size=10))
s
Out[49]:
In [50]:
s.value_counts()
Out[50]:
In [51]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
In [52]:
s.str.lower()
Out[52]:
In [57]:
df = pd.DataFrame(np.random.randn(10, 4))
df
Out[57]:
In [58]:
pieces = [df[:3], df[3:7], df[7:]]
pieces
Out[58]:
In [59]:
pd.concat(pieces)
Out[59]:
In [61]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
left
Out[61]:
In [62]:
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
right
Out[62]:
In [63]:
pd.merge(left, right, on='key')
Out[63]:
In [65]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df
Out[65]:
In [67]:
s = df.iloc[3]
s
Out[67]:
In [68]:
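# DataFrame.append was removed in pandas 2.0; concatenate the row as a one-row frame instead
pd.concat([df, s.to_frame().T], ignore_index=True)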
Out[68]:
http://pandas.pydata.org/pandas-docs/stable/10min.html#grouping
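By “group by” we are referring to a process involving one or more of the following steps: splitting the data into groups based on some criteria, applying a function to each group independently, and combining the results into a data structure.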
In [69]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
df
Out[69]:
Grouping and then applying the sum function to the resulting groups.
In [71]:
df.groupby('A').sum()
Out[71]:
Grouping by multiple columns forms a hierarchical index, and the function is then applied to each group.
In [72]:
df.groupby(['A','B']).sum()
Out[72]:
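The split, apply, and combine steps can be spelled out by hand and compared with the groupby shortcut. This is an illustrative sketch on a small hypothetical frame with fixed values; it selects the numeric column C explicitly so the string column is left out of the sum.

```python
import pandas as pd

frame = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'two'],
    'C': [1, 2, 3, 4],
})

# Manual split-apply-combine, for illustration only.
pieces = {key: group['C'] for key, group in frame.groupby('A')}   # split
sums = {key: values.sum() for key, values in pieces.items()}      # apply
manual = pd.Series(sums, name='C')                                # combine

# The idiomatic one-liner that does all three steps.
grouped = frame.groupby('A')['C'].sum()

# Grouping by two keys yields a result with a hierarchical (Multi)Index.
by_two = frame.groupby(['A', 'B'])['C'].sum()

print(grouped)
print(by_two)
```

The manual version and the one-liner give the same totals; the one-liner is what you would normally write.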
In [53]:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
Out[53]:
On DataFrame, plot() is a convenience to plot all of the columns with labels:
In [54]:
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')
Out[54]:
http://pandas.pydata.org/pandas-docs/stable/10min.html#getting-data-in-out
In [55]:
df.to_csv('foo.csv')
In [56]:
pd.read_csv('foo.csv').head()
Out[56]: