In [1]:

    
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
from IPython.display import Image
import matplotlib
%matplotlib inline
matplotlib.style.use('ggplot')



In [2]:

    
%%javascript
$.getScript('misc/kmahelona_ipython_notebook_toc.js')

see Installation_Customization_Resources.txt for useful infos and links

IPython: beyond plain Python

IPython provides a rich toolkit to help you make the most out of using Python, with:

TAB COMPLETION
Powerful Python shells (terminal and Qt-based).
A web-based notebook with the same core features but support for code, text, mathematical expressions, inline plots and other rich media.
Support for interactive data visualization and use of GUI toolkits.
Flexible, embeddable interpreters to load into your own projects.
Easy to use, high performance tools for parallel computing.

http://ipython.org/ipython-doc/stable/interactive/tutorial.html

http://ipython.readthedocs.io/en/stable/

Jupyter notebook

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.

Checkout the "command palette" for useful commands and shortcuts.



In [3]:

    
from IPython.core.display import HTML
HTML("<iframe src=https://jupyter.readthedocs.io/en/latest/_images/notebook_components.png width=800 height=400></iframe>")









    Out[3]:

Esc and Enter to toggle between "Command" and "Edit" mode --> see colour indicator of frame surrounding the cell
when in "Command" mode:
- use cursor keys to navigate
- a to insert cell Above
- b to insert cell Below
- dd to delete cell
when in "Edit" mode:
- Tab for tab completion of file-names
- Shift + Tab in parenthesis of function to display docstring
- Cmd + / toggle line comment

Running code, getting help

In the notebook, to run a cell of code, hit Shift + Enter. This executes the cell and puts the cursor in the next cell below, or makes a new one if you are at the end. Alternately, you can use:

Alt + Enter to force the creation of a new cell unconditionally (useful when inserting new content in the middle of an existing notebook).
Control + Enter executes the cell and keeps the cursor in the same cell, useful for quick experimentation of snippets that you don't need to keep permanently.
Esc and Enter to toggle between "Command" and "Edit" mode --> see colour indicator of frame surrounding the cell
h for Help or checkout "Help" --> "Keyboard shortcuts" from the menu
Shift + Tab inside of the brackets of a function to get the Docstring
Cmd + / to toggle line comment

Shell commands



In [4]:

    
!pwd









    



/Users/dblyon/modules/cpr/PandasIntro



In [5]:

    
!ls -la









    



total 3496
drwxr-xr-x  17 dblyon  staff     578 Apr 28 15:21 .
drwxr-xr-x  14 dblyon  staff     476 Apr 27 23:12 ..
drwxr-xr-x  13 dblyon  staff     442 Apr 27 23:23 .git
-rw-r--r--   1 dblyon  staff    1045 Apr 27 23:12 .gitignore
drwxr-xr-x   7 dblyon  staff     238 Apr 27 23:21 .ipynb_checkpoints
-rw-r--r--   1 dblyon  staff   18747 Apr 28 15:02 Exercises_part_A.ipynb
-rw-r--r--   1 dblyon  staff   67287 Apr 28 15:03 Exercises_part_A_with_Solutions.ipynb
-rw-r--r--   1 dblyon  staff   20111 Apr 27 23:20 Exercises_part_B.ipynb
-rw-r--r--   1 dblyon  staff  263580 Apr 28 15:00 Exercises_part_B_with_Solutions.ipynb
-rw-r--r--   1 dblyon  staff    3867 Apr 27 23:12 Installation_Customization_Resources.txt
-rw-r--r--   1 dblyon  staff    1067 Apr 27 23:12 LICENSE
-rw-r--r--   1 dblyon  staff  685284 Apr 27 23:12 Pandas_Cheat_Sheet.pdf
-rw-r--r--   1 dblyon  staff  700478 Apr 28 15:21 Pandas_Introduction.ipynb
-rw-r--r--@  1 dblyon  staff    3429 Apr 27 23:16 Readme.md
-rw-r--r--   1 dblyon  staff    3478 Apr 27 23:12 Readme.txt
drwxr-xr-x   9 dblyon  staff     306 Apr 27 23:12 data
drwxr-xr-x   4 dblyon  staff     136 Apr 27 23:12 misc



In [6]:

    
var = !ls
len(var), var[:3]









    Out[6]:





(12,
 ['Exercises_part_A.ipynb',
  'Exercises_part_A_with_Solutions.ipynb',
  'Exercises_part_B.ipynb'])



In [7]:

    
%%bash
echo "My shell is:" $SHELL
echo "My disk usage is:"
df -h









    



My shell is: /bin/zsh
My disk usage is:
Filesystem                          Size   Used  Avail Capacity  iused      ifree %iused  Mounted on
/dev/disk1                         465Gi  404Gi   61Gi    87%  1939768 4293027511    0%   /
devfs                              190Ki  190Ki    0Bi   100%      658          0  100%   /dev
map -hosts                           0Bi    0Bi    0Bi   100%        0          0  100%   /net
map auto_home                        0Bi    0Bi    0Bi   100%        0          0  100%   /home
localhost:/IasP10O30clRK_kqj-ftyr  465Gi  465Gi    0Bi   100%        0          0  100%   /Volumes/MobileBackups
/dev/disk2s1                       1.8Ti  414Gi  1.4Ti    23%  3392373   11868959   22%   /Volumes/Speedy
/dev/disk3s1                       2.7Ti  2.3Ti  476Gi    83% 13196722 4281770557    0%   /Volumes/Data

helpful IPython commands

command	description
?	Introduction and overview of IPython’s features.
%quickref	Quick reference.
help	Python’s own help system.
?object	Details about ‘object’, use ‘??object’ for extra details.



In [8]:

    
# ?math.log



In [9]:

    
help(math.log)









    



Help on built-in function log in module math:

log(...)
    log(x[, base])
    
    Return the logarithm of x to the given base.
    If the base not specified, returns the natural logarithm (base e) of x.

press Shift + Tab inside of the brackets to get the Docstring



In [10]:

    
# math.log()

Magic



In [11]:

    
# %magic



In [12]:

    
values = range(1, 1001)



In [13]:

    
%%timeit 
results = []
for val in values:
    new_val = math.log(val, 10)
    results.append(new_val)









    



1000 loops, best of 3: 346 µs per loop



In [14]:

    
%%timeit 
results = [math.log(val, 10) for val in values]









    



1000 loops, best of 3: 258 µs per loop



In [15]:

    
%%timeit 
results = np.log10(values)









    



The slowest run took 6.30 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 59.2 µs per loop

Introduction to Pandas

pandas is a Python package providing fast, flexible, and expressive data structures designed to work with relational or labeled data both. It is a fundamental high-level building block for doing practical, real world data analysis in Python.

pandas is well suited for:

Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
Ordered and unordered (not necessarily fixed-frequency) time series data.
Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

Key features:

Easy handling of missing data
Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
Powerful, flexible group by functionality to perform split-apply-combine operations on data sets
Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
Intuitive merging and joining data sets
Flexible reshaping and pivoting of data sets
Hierarchical labeling of axes
Robust IO tools for loading data from flat files, Excel files, databases, and HDF5
Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

Pandas cheat sheet

(download the pdf or find it in the git repo of the current workshop)

https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf

Object creation

Creating a Series by passing a list of values, letting pandas create a default integer index:



In [16]:

    
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s









    Out[16]:





0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating a DataFrame by passing a numpy array, with labeled columns:



In [17]:

    
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
df

Creating a DataFrame by passing a dict of objects that can be converted to series-like.



In [18]:

    
df2 = pd.DataFrame({'A' : 1.,                    
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D' : np.array([3] * 4, dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : 'foo',
                    'tinyRick': pd.Series([2000])})
df2

Having specific dtypes (data types)



In [19]:

    
df2.dtypes









    Out[19]:





A                  float64
B           datetime64[ns]
C                  float32
D                    int32
E                 category
F                   object
tinyRick           float64
dtype: object

change the dtype



In [20]:

    
df2["D"] = df2["D"].astype('float64')

Viewing Data

See the top & bottom rows of the frame



In [21]:

    
df.head()



In [22]:

    
df.tail(3)

Display the index, columns, and the underlying numpy data



In [23]:

    
df.index









    Out[23]:





RangeIndex(start=0, stop=6, step=1)



In [24]:

    
df.columns









    Out[24]:





Index([u'A', u'B', u'C', u'D'], dtype='object')



In [25]:

    
df.values









    Out[25]:





array([[-0.16989951,  0.28932253,  0.57323584,  1.82355794],
       [ 1.87684958,  1.27069089,  0.88613196,  1.50705393],
       [-0.28295392, -0.4410113 ,  0.18430394,  1.66179244],
       [ 0.22780917,  0.01556611, -1.71968866,  0.07524179],
       [ 1.29375737, -0.55121983,  2.05866642,  0.03974571],
       [ 1.21808418,  1.16472342, -1.38369711, -0.12868522]])

Describe shows a quick statistic summary of your data excluding NaN (Not a Number) values



In [26]:

    
df.describe()

Concise summary of a DataFrame



In [27]:

    
df2.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 7 columns):
A           4 non-null float64
B           4 non-null datetime64[ns]
C           4 non-null float32
D           4 non-null float64
E           4 non-null category
F           4 non-null object
tinyRick    1 non-null float64
dtypes: category(1), datetime64[ns](1), float32(1), float64(3), object(1)
memory usage: 228.0+ bytes

Transposing your data



In [28]:

    
df.T

Sorting by values



In [29]:

    
df.sort_values(by='B')

Selection

important ones:

.loc works on label of index and boolean array
.iloc works on integer position (from 0 to length-1 of the axis)

for completeness/comparison:

.ix first performs label based, if that fails then it falls to integer based --> don't use it (deprecated soon, prone to mistakes)
.at get scalar values. It's a very fast loc
.iat Get scalar values. It's a very fast iloc

CAVEATS: label-based slicing in pandas is inclusive. The primary reason for this is that it is often not possible to easily determine the “successor” or next element after a particular label in an index.

Getting

Selecting a single column, which yields a Series.



In [30]:

    
df["A"]









    Out[30]:





0   -0.169900
1    1.876850
2   -0.282954
3    0.227809
4    1.293757
5    1.218084
Name: A, dtype: float64

equivalent



In [31]:

    
df.A









    Out[31]:





0   -0.169900
1    1.876850
2   -0.282954
3    0.227809
4    1.293757
5    1.218084
Name: A, dtype: float64

selecting with a list of column names, yields a data frame (also with a single column name)



In [32]:

    
df[["A", "C"]]

Selecting via [], which slices the rows.



In [33]:

    
df[0:3]

Selection by Label

df.loc[rows, columns]

.loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found. Allowed inputs are:

A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index. This use is not an integer position along the index)
A list or array of labels ['a', 'b', 'c']
A slice object with labels 'a':'f', (note that contrary to usual python slices, both the start and the stop are included!)
A boolean array
A callable function with one argument (the calling Series, DataFrame or Panel) and that returns valid output for indexing (one of the above)



In [34]:

    
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
df



In [35]:

    
# Let's transpose the DataFrame and change the order of the columns (to force us to the string labels of the novel row index).
# row-index: Strings, columns: Integers
dfx = df.T
import random
columns = dfx.columns.tolist()
random.shuffle(columns)
dfx.columns = columns
dfx



In [36]:

    
dfx.loc["A":"C", ]



In [37]:

    
dfx.loc[["C", "A", "D"], ]



In [38]:

    
dfx.loc[["A", "D"], 0:2]

Selection by Position

df.iloc[rows, columns]

.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing. (this conforms with python/numpy slice semantics). Allowed inputs are:

An integer e.g. 5
A list or array of integers [4, 3, 0]
A slice object with ints 1:7
A boolean array
A callable function with one argument (the calling Series, DataFrame or Panel) and that returns valid output for indexing (one of the above)



In [39]:

    
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
df



In [40]:

    
### transpose
# df = df.T
# df



In [41]:

    
df.iloc[0:3, :]

uncomment the transpose and execute the two cells above again and notice that this doesn't raise an Error compared to label based indexing.



In [42]:

    
df.iloc[[1, 3, 5], 1:3]

Boolean indexing



In [43]:

    
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
df

Using a single column’s values to select data.



In [44]:

    
df[df["A"] > 0]

A where operation for getting.



In [45]:

    
df[df > 0]

This comparison yields a Pandas.Series of Booleans.



In [46]:

    
cond = df["A"] > 0
print type(cond)
cond









    



<class 'pandas.core.series.Series'>






    Out[46]:





0    False
1    False
2    False
3    False
4     True
5     True
Name: A, dtype: bool

Summing boolean arrays can come in handy (True=1, False=0)



In [47]:

    
sum(cond), len(cond) # or cond.sum(), cond.shape[0]









    Out[47]:





(2, 6)

Frequent use case: combining a Series of Bools with specific column names to select data.



In [48]:

    
df.loc[cond, ["A", "C"]]

Let's add a column to the DataFrame



In [49]:

    
df['E'] = ['one', 'one','two','three','four','three']



In [50]:

    
df

Using the isin() method for filtering:



In [51]:

    
df[df['E'].isin(['two','four'])]

Setting

Setting a new column automatically aligns the data by the indexes



In [52]:

    
dates = pd.date_range('20130101', periods=6)
dfx = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
dfx



In [53]:

    
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
s1 = s1.sort_values(ascending=False)
s1









    Out[53]:





2013-01-07    6
2013-01-06    5
2013-01-05    4
2013-01-04    3
2013-01-03    2
2013-01-02    1
dtype: int64



In [54]:

    
dfx["F"] = s1
dfx

Setting values by label



In [55]:

    
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
df['E'] = ['one', 'one','two','three','four','three']
df



In [56]:

    
df.loc[5, "A"] = 0.815

Setting values by position



In [57]:

    
df.iloc[0, 4] = "zero"

delete (drop) some rows



In [58]:

    
df.drop([3, 5])

delete (drop) a column



In [59]:

    
df = df.drop("E", axis=1)
df

Axis

default: axis=0
axis=0 is "column-wise", for each column along the rows, horizontal
axis=1 is "row-wise", for each row along the columns, vertical



In [60]:

    
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
df

A where operation with setting.



In [61]:

    
df[df > 0] = -df
df

Multiply with a scalar.



In [62]:

    
df = df * -1
df

Row wise division



In [63]:

    
df["F"] = df["A"] / df["B"]
df

Missing Data

pandas primarily uses the value np.nan to represent missing data.



In [64]:

    
df.loc[0, ] = np.nan
df.loc[2, ["A", "C"]] = np.nan
df

To drop any rows that have missing data.



In [65]:

    
df.dropna()

Drop only rows where all values are missing.



In [66]:

    
df.dropna(how='all')

Fill missing values



In [67]:

    
df.fillna(value=5)

see replace method to replace values given in 'to_replace' with 'value'.

To get the boolean mask where values are nan



In [68]:

    
df.isnull() # or pd.isnull(df)









    Out[68]:






  
    
      
      A
      B
      C
      D
      F
    
  
  
    
      0
      True
      True
      True
      True
      True
    
    
      1
      False
      False
      False
      False
      False
    
    
      2
      True
      False
      True
      False
      False
    
    
      3
      False
      False
      False
      False
      False
    
    
      4
      False
      False
      False
      False
      False
    
    
      5
      False
      False
      False
      False
      False



In [69]:

    
df.notnull() == -df.isnull()









    Out[69]:






  
    
      
      A
      B
      C
      D
      F
    
  
  
    
      0
      True
      True
      True
      True
      True
    
    
      1
      True
      True
      True
      True
      True
    
    
      2
      True
      True
      True
      True
      True
    
    
      3
      True
      True
      True
      True
      True
    
    
      4
      True
      True
      True
      True
      True
    
    
      5
      True
      True
      True
      True
      True

Duplicates



In [70]:

    
dfx = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
dfx

Drop duplicates except for the first occurrence, considering only a certain column.



In [71]:

    
dfx.drop_duplicates(subset="A")

see also duplicated method to return boolean Series denoting duplicate rows, optionally only considering certain columns



In [72]:

    
df

Operations

Stats

Operations in general exclude missing data.

Performing a descriptive statistic



In [73]:

    
df.mean()









    Out[73]:





A    1.273531
B    0.656253
C    0.473226
D    0.598566
F    2.764622
dtype: float64

On the other axis



In [74]:

    
df.median(axis=1)









    Out[74]:





0         NaN
1    1.089165
2    0.403498
3    1.382410
4    1.184605
5    0.304343
dtype: float64

The full range of basic statistics that are quickly calculable and built into the base Pandas package are:

Function	Description
count	Number of non-null observations
sum	Sum of values
mean	Mean of values
mad	Mean absolute deviation
median	Arithmetic median of values
min	Minimum
max	Maximum
mode	Mode
abs	Absolute Value
prod	Product of values
std	Bessel-corrected sample standard deviation
var	Unbiased variance
sem	Standard error of the mean
skew	Sample skewness (3rd moment)
kurt	Sample kurtosis (4th moment)
quantile	Sample quantile (value at %)
cumsum	Cumulative sum
cumprod	Cumulative product
cummax	Cumulative maximum
cummin	Cumulative minimum

The need for custom functions is minimal unless you have very specific requirements. c.f. http://pandas.pydata.org/pandas-docs/stable/basics.html

e.g. compute pairwise covariance of columns, excluding NA/null values



In [75]:

    
df.cov()

Apply

Applying functions to the data



In [76]:

    
df = pd.DataFrame(np.random.randint(0,100,size=(6, 4)), columns=list('ABCD'))
df



In [77]:

    
df.apply(np.cumsum, axis=0)



In [78]:

    
df.apply(lambda x: x.max() - x.min())









    Out[78]:





A    83
B    63
C    53
D    79
dtype: int64

Merge

pandas provides various facilities for easily combining together Series, DataFrame, and Panel objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

Concat



In [79]:

    
df = pd.DataFrame(np.random.randn(10, 4))
df



In [80]:

    
# break it into multiple pieces
pieces = [df[3:7], df[:3], df[7:]]
pd.concat(pieces)

Join

SQL style merges.



In [81]:

    
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['bar', 'foo'], 'rval': [4, 5]})
left



In [82]:

    
right



In [83]:

    
pd.merge(left, right, on='key')

Append

Append rows to a dataframe.



In [84]:

    
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
s = df.iloc[3]
df



In [85]:

    
df.append(s, ignore_index=True)

Grouping

By “group by” we are referring to a process involving one or more of the following steps

Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure



In [86]:

    
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df

Grouping and then applying a function sum to the resulting groups.



In [87]:

    
df.groupby('A').sum()

Grouping by multiple columns forms a hierarchical index, which we then apply the function.



In [88]:

    
df.groupby(['A','B']).sum()

Series.nlargest: Return the largest n elements. Series.nsmallest: Return the smallest n elements.



In [89]:

    
df.groupby('A')["C"].nlargest(2)









    Out[89]:





A     
bar  1    1.480251
     3    0.640351
foo  4    0.538425
     6   -0.057440
Name: C, dtype: float64

Splitting

Iterate over a groupby object, just to illustrate the split part of the groupby method.



In [90]:

    
grouped = df.groupby(['A','B'])
for name, group in grouped:
    print "Name:", name
    print group
    print "#"*50









    



Name: ('bar', 'one')
     A    B         C         D
1  bar  one  1.480251  0.311373
##################################################
Name: ('bar', 'three')
     A      B         C         D
3  bar  three  0.640351 -0.532001
##################################################
Name: ('bar', 'two')
     A    B         C         D
5  bar  two -0.436455 -1.024308
##################################################
Name: ('foo', 'one')
     A    B         C         D
0  foo  one -0.595697 -0.216727
6  foo  one -0.057440  1.501428
##################################################
Name: ('foo', 'three')
     A      B         C         D
7  foo  three -0.260092 -0.684502
##################################################
Name: ('foo', 'two')
     A    B         C         D
2  foo  two -1.836842 -0.377590
4  foo  two  0.538425  0.530609
##################################################

Applying

Aggregation: computing a summary statistic (or statistics) about each group. Some examples:
- Compute group sums or means
- Compute group sizes / counts
Transformation: perform some group-specific computations and return a like-indexed. Some examples:
- Standardizing data (zscore) within group
- Filling NAs within groups with a value derived from each group
Filtration: discard some groups, according to a group-wise computation that evaluates True or False. Some examples:
- Discarding data that belongs to groups with only a few members
- Filtering out data based on the group sum or mean

Aggregation

"reduces" the DataFrame, meaning the df_original.shape > df_aggregated.shape



In [91]:

    
grouped.aggregate(["sum", "count", "median", "mean"]) # see `Operations` above for more built-in methods

Transformation

returns an object that is indexed the same (same size) as the one being grouped. Thus, the passed transform function should return a result that is the same size as the group chunk.



In [92]:

    
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],  
                   'd': ['z','z','z','o','o','o']})
df['e'] = df['a'] + df['b']
df



In [93]:

    
df['f'] = (df.groupby(['c', 'd'])['e'].transform('sum'))
df



In [94]:

    
assert df.loc[0, "f"] == df.loc[( (df["c"] == "q") & (df["d"] == "z") ), "e"].sum()

Filtration

returns a subset of the original object.



In [95]:

    
df = pd.DataFrame({'A': np.arange(8), 'B': list('aabbbbcc')})
df



In [96]:

    
df.groupby('B').filter(lambda x: len(x) > 2)

Apply

Another frequent operation is applying a function on 1D arrays to each column or row. Apply a custom function to the entire DataFrame or a Series. axis indicates row or column-wise application. (default: axis=0)

apply on a Series performs an element-wise operation



In [97]:

    
def example_custom_function(number):
    if number >= 6:
        return number / 5.0
    elif number <= 3:
        return number * 88
    else:
        return np.nan



In [98]:

    
df["A"].apply(example_custom_function)









    Out[98]:





0      0.0
1     88.0
2    176.0
3    264.0
4      NaN
5      NaN
6      1.2
7      1.4
Name: A, dtype: float64

apply on a DataFrame perform a Series-wise operation



In [99]:

    
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df



In [100]:

    
def example_custom_function_on_series_1(series):
    return series / series.sum()



In [101]:

    
### notice how example_custom_function_on_series_1 return a single value per Series --> transformation
df.apply(example_custom_function_on_series_1)



In [102]:

    
def example_custom_function_on_series_2(series):
    return series.max() - series.min()



In [103]:

    
### notice how example_custom_function_on_series_2 return a single value per Series --> aggregation
df.apply(example_custom_function_on_series_2)









    Out[103]:





A    4.023014
B    2.898413
C    4.006210
D    3.764924
dtype: float64

Applymap

Element-wise Python functions can be used, too. You can do this with applymap:



In [104]:

    
formater = lambda x: '%.2f' % x
df.applymap(formater)

Map

The reason for the name applymap is that Series has a map method for applying an element-wise function:



In [105]:

    
df["A"].map(formater)









    Out[105]:





0    -0.06
1     0.20
2     0.07
3     2.36
4    -1.66
5    -0.89
6    -0.02
7    -0.36
Name: A, dtype: object

Reshaping

Stack

“compresses” a level in the DataFrame’s columns.



In [106]:

    
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2



In [107]:

    
stacked = df2.stack()
stacked









    Out[107]:





first  second   
bar    one     A   -0.916427
               B   -1.554786
       two     A    1.606001
               B   -0.414154
baz    one     A   -0.338968
               B    0.155801
       two     A    0.569637
               B   -2.387479
dtype: float64

Unstack

the inverse operation of stack() is unstack(), which by default unstacks the last level



In [108]:

    
stacked.unstack() # in this particular case equivalent to stacked.unstack(2)



In [109]:

    
stacked.unstack(0)



In [110]:

    
stacked.unstack(1)

Pivot

We can produce pivot tables from this data very easily:



In [111]:

    
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['a', 'b', 'c'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df



In [112]:

    
pd.pivot_table(df, values=['D', "E"], index=['A', 'B'], columns=['C'])

Melt

“Unpivots” a DataFrame from wide format to long format, optionally leaving identifier variables set. This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.



In [113]:

    
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 1, 1: 3, 2: 5},
                   'C': {0: 2, 1: 4, 2: 6}})
df



In [114]:

    
pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])

Plotting

On DataFrame, plot() is a convenience to plot all of the columns with labels. Using %matplotlib inline magic function enables plotting within the jupyter notebook cells (instead of e.g. in a separate Qt window).



In [115]:

    
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                  columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.plot()









    Out[115]:





<matplotlib.axes._subplots.AxesSubplot at 0x10620d550>



In [116]:

    
df.boxplot()









    Out[116]:





<matplotlib.axes._subplots.AxesSubplot at 0x10620d250>



In [117]:

    
df.plot(kind="scatter", x="A", y="B")









    Out[117]:





<matplotlib.axes._subplots.AxesSubplot at 0x10667f2d0>



In [118]:

    
df[["C", "D"]].sum().plot(kind="bar", x="C", y="D")









    Out[118]:





<matplotlib.axes._subplots.AxesSubplot at 0x10677d290>



In [119]:

    
axs = pd.scatter_matrix(df, figsize=(16,12), diagonal='kde')

Getting data in and out

reading data from file

tabular data such as CSV or tab-delimited txt files



In [120]:

    
fn = r"data/Saliva.txt"
df = pd.read_csv(fn, sep='\t')
df.head(3)









    Out[120]:






  
    
      
      TaxName
      species
      Abundance
      SampleCategory
      frequency
      rank
    
  
  
    
      0
      Rothia mucilaginosa
      43675
      9.861078
      Caries
      10.0
      1.0
    
    
      1
      Veillonella atypica
      39777
      7.030844
      Caries
      10.0
      2.0
    
    
      2
      Prevotella histicola
      470565
      6.433449
      Caries
      10.0
      3.0

from Excel



In [121]:

    
xls = pd.ExcelFile('data/microbiome/MID1.xls')
print xls.sheet_names  # see all sheet names
df = xls.parse("Sheet 1", header=None)
df.columns = ["Taxon", "Count"]
df.head()









    



[u'Sheet 1']






    Out[121]:






  
    
      
      Taxon
      Count
    
  
  
    
      0
      Archaea "Crenarchaeota" Thermoprotei Desulfuro...
      7
    
    
      1
      Archaea "Crenarchaeota" Thermoprotei Desulfuro...
      2
    
    
      2
      Archaea "Crenarchaeota" Thermoprotei Sulfoloba...
      3
    
    
      3
      Archaea "Crenarchaeota" Thermoprotei Thermopro...
      3
    
    
      4
      Archaea "Euryarchaeota" "Methanomicrobia" Meth...
      7

a current list of I/O tools:

read_csv
read_excel
read_hdf
read_sql
read_json
read_msgpack (experimental)
read_html
read_gbq (experimental)
read_stata
read_sas
read_clipboard
read_pickle

checkout the documentation http://pandas.pydata.org/pandas-docs/stable/ for more infos

writing data to file



In [122]:

    
fn_out = r"data/Tidy_data.txt"
df.to_csv(fn_out, sep='\t', header=True, index=False)

Miscellaneous

zip example



In [123]:

    
xls = pd.ExcelFile('data/microbiome/MID1.xls')
df = xls.parse("Sheet 1", header=None)
df.columns = ["Taxon", "Count"]
df.head(3)









    Out[123]:






  
    
      
      Taxon
      Count
    
  
  
    
      0
      Archaea "Crenarchaeota" Thermoprotei Desulfuro...
      7
    
    
      1
      Archaea "Crenarchaeota" Thermoprotei Desulfuro...
      2
    
    
      2
      Archaea "Crenarchaeota" Thermoprotei Sulfoloba...
      3



In [124]:

    
df["superkingdom"], df["rest"] = zip(*df["Taxon"].apply(lambda x: x.split(" ", 1)))
df.head(3)









    Out[124]:






  
    
      
      Taxon
      Count
      superkingdom
      rest
    
  
  
    
      0
      Archaea "Crenarchaeota" Thermoprotei Desulfuro...
      7
      Archaea
      "Crenarchaeota" Thermoprotei Desulfurococcales...
    
    
      1
      Archaea "Crenarchaeota" Thermoprotei Desulfuro...
      2
      Archaea
      "Crenarchaeota" Thermoprotei Desulfurococcales...
    
    
      2
      Archaea "Crenarchaeota" Thermoprotei Sulfoloba...
      3
      Archaea
      "Crenarchaeota" Thermoprotei Sulfolobales Sulf...



In [125]:

    
df.superkingdom.unique()









    Out[125]:





array([u'Archaea', u'Bacteria'], dtype=object)

string processing methods

for Series and Index, to make it easy to operate on each element of the array. These methods exclude missing/NA values automatically. These are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods.



In [126]:

    
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s









    Out[126]:





0       A
1       B
2       C
3    Aaba
4    Baca
5     NaN
6    CABA
7     dog
8     cat
dtype: object



In [127]:

    
s.str.lower()









    Out[127]:





0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

Cut

Return indices of half-open bins to which each value of x belongs. The cut function can be useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges.



In [128]:

    
df = pd.DataFrame(np.array([.2, 1.4, 2.5, 6.2, 9.7, np.nan, 2.1]), columns=["value"])
df["category"] = pd.cut(df["value"], 3, retbins=False, labels=["good", "medium", "bad"])
df

rpy2 example

call an R function from Python

https://en.wikipedia.org/wiki/Quantile_normalization

https://bioconductor.org/packages/release/bioc/html/limma.html



In [129]:

    
import rpy2.robjects as robjects
import rpy2.robjects.packages as rpackages
from rpy2.robjects.numpy2ri import numpy2ri
robjects.conversion.py2ri = numpy2ri



In [130]:

    
dict_ = {"Intensity First"  : [5.0, 2.0, 3.0, 4.0],         
         "Intensity Second" : [4.0, 1.0, 4.0, 2.0],
         "Intensity Third"  : [3.0, 4.0, 6.0, 8.0]}
df = pd.DataFrame(dict_, index=list("ABCD"))
df









    Out[130]:






  
    
      
      Intensity First
      Intensity Second
      Intensity Third
    
  
  
    
      A
      5.0
      4.0
      3.0
    
    
      B
      2.0
      1.0
      4.0
    
    
      C
      3.0
      4.0
      6.0
    
    
      D
      4.0
      2.0
      8.0



In [131]:

    
def R_function_normalizeQuantiles():
    rpackages.importr('limma')
    normalizeQuantiles = robjects.r['normalizeQuantiles']
    return normalizeQuantiles



In [132]:

    
def quantile_normalize(df, cols_2_norm):
    """
    :param df: DataFrame
    :param cols_2_norm: ListOfString (Columns to normalize)
    :return: DataFrame
    """
    normalizeQuantiles = R_function_normalizeQuantiles()

    # set Zero to NaN and transform to log-space
    df[cols_2_norm] = df[cols_2_norm].replace(to_replace=0.0, value=np.nan)
    df[cols_2_norm] = np.log10(df[cols_2_norm])

    # quantile normalize
    df[cols_2_norm] = np.array(normalizeQuantiles(df[cols_2_norm].values))

    # Transform back to non-log space and replace NaN with Zero
    df[cols_2_norm] = np.power(10, df[cols_2_norm])
    df[cols_2_norm] = df[cols_2_norm].replace(to_replace=np.nan, value=0.0)

    return df



In [133]:

    
df_norm = quantile_normalize(df, df.columns.tolist())
df_norm









    Out[133]:






  
    
      
      Intensity First
      Intensity Second
      Intensity Third
    
  
  
    
      A
      5.428835
      4.985766
      1.817121
    
    
      B
      1.817121
      1.817121
      2.884499
    
    
      C
      2.884499
      4.985766
      4.578857
    
    
      D
      4.578857
      2.884499
      5.428835

now let's do some EXERCISES

open Exercises_part_A.ipynb and/or Exercises_part_B.ipynb

	A	B	C	D
0	-0.169900	0.289323	0.573236	1.823558
1	1.876850	1.270691	0.886132	1.507054
2	-0.282954	-0.441011	0.184304	1.661792
3	0.227809	0.015566	-1.719689	0.075242
4	1.293757	-0.551220	2.058666	0.039746

	A	B	C	D
count	6.000000	6.000000	6.000000	6.000000
mean	0.693941	0.291345	0.099825	0.829784
std	0.889000	0.780286	1.428150	0.922028
min	-0.282954	-0.551220	-1.719689	-0.128685
25%	-0.070472	-0.326867	-0.991697	0.048620
50%	0.722947	0.152444	0.378770	0.791148
75%	1.274839	0.945873	0.807908	1.623108
max	1.876850	1.270691	2.058666	1.823558

	A	B	C	D
0	-0.169900	0.289323	0.573236	1.823558
1	1.876850	1.270691	0.886132	1.507054
2	-0.282954	-0.441011	0.184304	1.661792

	A	B	C	D
0	-0.008918	-1.209779	1.830216	-1.031679
1	-0.418588	0.792097	-2.936405	-1.203708
2	0.577104	-1.230854	0.214456	0.898157
3	-0.375449	0.339775	0.551374	-1.198762
4	-0.104258	-1.646144	-0.202755	-1.118140
5	0.244080	0.297405	0.080183	-0.080778

	4	0	1	5	2	3
A	-0.008918	-0.418588	0.577104	-0.375449	-0.104258	0.244080
B	-1.209779	0.792097	-1.230854	0.339775	-1.646144	0.297405
C	1.830216	-2.936405	0.214456	0.551374	-0.202755	0.080183
D	-1.031679	-1.203708	0.898157	-1.198762	-1.118140	-0.080778

	A	B	C	D	E	F	tinyRick
0	1.0	2013-01-02	1.0	3	test	foo	2000.0
1	1.0	2013-01-02	1.0	3	train	foo	NaN
2	1.0	2013-01-02	1.0	3	test	foo	NaN
3	1.0	2013-01-02	1.0	3	train	foo	NaN

	A	B	C	D
0	1.695964	-2.460150	-2.087428	0.172406
1	-0.750895	2.017630	-0.725006	1.508415
2	-0.602350	2.511633	-0.181661	-0.463247
3	-0.615276	1.317014	-0.902506	0.125343
4	-1.099925	0.316938	1.285934	2.671944
5	-0.379311	0.610271	-1.756487	-0.195240

	A	B	C	D
0	-0.025316	0.576631	-0.097271	-0.830734
1	-0.216126	1.393101	-1.100897	0.252488
2	-0.519939	0.459839	0.446857	0.165097
3	-0.213917	0.011958	0.156027	1.678674
4	0.919606	0.556492	-1.089998	0.681550
5	0.390535	-1.825435	0.009902	-0.558049

	A	B	C	D
2013-01-01	0.718354	-0.365003	-0.000459	1.352737
2013-01-02	-0.319766	-0.789105	0.011944	-0.096405
2013-01-03	-1.240610	1.632686	-0.356929	-0.272638
2013-01-04	-2.367895	-0.407514	0.725953	-0.658410
2013-01-05	-1.267723	0.004515	-0.893457	0.087957
2013-01-06	-1.475928	-0.788952	-0.431096	-0.456979

	A	B	C	D	E
0	-1.693542	0.091828	-1.580609	0.594735	one
1	0.914928	1.528036	-1.870334	0.285952	one
2	0.731091	0.546496	-0.073345	0.063040	two
3	-0.804229	-0.351444	1.222302	-1.663926	three
4	0.404843	-0.471565	0.748519	-0.597049	four
5	0.238829	1.206530	0.137581	-0.188514	three

	A	B	C	D
0	-3.483162	0.674215	-0.863999	-0.921991
1	-1.089165	0.215477	1.594555	0.613976
2	0.443881	-1.100082	1.185749	0.137363
3	1.382410	-0.396501	-0.025471	-1.714747
4	1.498364	1.264864	0.242646	-0.432940
5	-1.124187	-0.304343	0.030233	0.093802

	A	B	C	D
0	-3.483162	-0.674215	-0.863999	-0.921991
1	-1.089165	-0.215477	-1.594555	-0.613976
2	-0.443881	-1.100082	-1.185749	-0.137363
3	-1.382410	-0.396501	-0.025471	-1.714747
4	-1.498364	-1.264864	-0.242646	-0.432940
5	-1.124187	-0.304343	-0.030233	-0.093802

	A	B	C	D
0	3.483162	0.674215	0.863999	0.921991
1	1.089165	0.215477	1.594555	0.613976
2	0.443881	1.100082	1.185749	0.137363
3	1.382410	0.396501	0.025471	1.714747
4	1.498364	1.264864	0.242646	0.432940
5	1.124187	0.304343	0.030233	0.093802

	A	B	C	D	F
0	5.000000	5.000000	5.000000	5.000000	5.000000
1	1.089165	0.215477	1.594555	0.613976	5.054672
2	5.000000	1.100082	5.000000	0.137363	0.403498
3	1.382410	0.396501	0.025471	1.714747	3.486522
4	1.498364	1.264864	0.242646	0.432940	1.184605
5	1.124187	0.304343	0.030233	0.093802	3.693813

	A	B	C	D	F
0	True	True	True	True	True
1	False	False	False	False	False
2	True	False	True	False	False
3	False	False	False	False	False
4	False	False	False	False	False
5	False	False	False	False	False

	A	B	C	D
0	foo	one	0.373508	2.153941
1	bar	one	0.356375	0.453069
2	foo	two	-0.491507	0.598356
3	bar	three	0.710054	-1.016165
4	foo	two	-0.209080	-1.164680
5	bar	two	0.028916	1.792031
6	foo	one	-0.348100	-1.193166
7	foo	three	0.335917	0.067953

	A	B	C	D	F
A	0.039566	0.080791	-0.080390	0.052278	-0.279206
B	0.080791	0.238247	-0.120797	-0.106147	-0.883364
C	-0.080390	-0.120797	0.569091	-0.073566	0.732454
D	0.052278	-0.106147	-0.073566	0.435256	0.430671
F	-0.279206	-0.883364	0.732454	0.430671	3.675057

	A	B	C	D
0	91	96	38	6
1	146	154	82	23
2	205	211	161	103
3	294	310	227	141
4	318	346	255	211
5	326	434	336	212

	0	1	2	3
0	0.970496	0.246441	0.844668	0.150986
1	0.341910	-0.633616	-1.376025	-0.473660
2	0.792826	-1.636417	-0.346066	-1.565380
3	-0.216165	0.489109	0.885583	-2.011223
4	-0.083781	-1.053215	0.693787	0.111576
5	1.120049	1.054716	-0.910174	-1.367579
6	-0.075761	-0.222491	0.949213	1.122462
7	2.151368	1.404117	0.746708	1.288665
8	1.224616	-0.189300	-1.982512	-0.961140
9	-1.139796	-1.253938	0.838092	0.869218

	A	B	C	D
0	0.969489	-1.604153	-0.248921	1.038086
1	0.412990	-1.012526	0.534236	-1.596054
2	1.018075	1.128761	1.131470	-0.284683
3	0.932039	-0.080776	1.309474	-1.010208
4	0.043121	0.776682	1.384962	-1.658708
5	-0.114250	-0.229319	0.954087	0.559766
6	-0.607813	-1.179375	1.980276	-0.360193
7	-0.468785	0.287706	0.070535	1.491814

	A	B	C	D
0	foo	one	-0.595697	-0.216727
1	bar	one	1.480251	0.311373
2	foo	two	-1.836842	-0.377590
3	bar	three	0.640351	-0.532001
4	foo	two	0.538425	0.530609
5	bar	two	-0.436455	-1.024308
6	foo	one	-0.057440	1.501428
7	foo	three	-0.260092	-0.684502

	A	B	C	D
0	-0.055586	0.565785	-1.166555	0.038104
1	0.200634	0.889308	1.126039	-2.620415
2	0.070676	0.895831	-1.826842	0.290249
3	2.361897	1.921316	0.722863	0.139535
4	-1.661117	0.428333	-2.217039	1.021083
5	-0.892498	2.375512	-2.880172	-1.041927
6	-0.016977	-0.522901	0.154441	0.600143
7	-0.356928	0.623010	-1.054379	1.144509

	A	B	C	D
0	0.158864	0.078842	0.163345	-0.088879
1	-0.573406	0.123925	-0.157672	6.112175
2	-0.201991	0.124834	0.255801	-0.677011
3	-6.750224	0.267735	-0.101218	-0.325468
4	4.747419	0.059688	0.310438	-2.381697
5	2.550730	0.331027	0.403293	2.430318
6	0.048519	-0.072866	-0.021625	-1.399847
7	1.020089	0.086816	0.147638	-2.669591

	A	B	C	D
0	-0.06	0.57	-1.17	0.04
1	0.20	0.89	1.13	-2.62
2	0.07	0.90	-1.83	0.29
3	2.36	1.92	0.72	0.14
4	-1.66	0.43	-2.22	1.02
5	-0.89	2.38	-2.88	-1.04
6	-0.02	-0.52	0.15	0.60
7	-0.36	0.62	-1.05	1.14

		A	B
first	second
bar	one	-0.916427	-1.554786
bar	two	1.606001	-0.414154
baz	one	-0.338968	0.155801
baz	two	0.569637	-2.387479

Table of Contents

IPython: beyond plain Python

Jupyter notebook

Navigation

Running code, getting help

Shell commands

helpful IPython commands

Magic

Introduction to Pandas

Pandas cheat sheet

Object creation

Viewing Data

Selection

Getting

Selection by Label

Selection by Position

Boolean indexing

Setting

Axis

Missing Data

Duplicates

Operations

Stats

Apply

Merge

Concat

Join

Append

Grouping

Splitting

Applying

Aggregation

Transformation

Filtration

Apply

Applymap

Map

Reshaping

Stack

Unstack

Pivot

Melt

Plotting

Getting data in and out

reading data from file

writing data to file

Miscellaneous

zip example

string processing methods

Cut

rpy2 example

now let's do some EXERCISES

	A	B	C	D
0	91	96	38	6
1	146	154	82	23
2	205	211	161	103
3	294	310	227	141
4	318	346	255	211
5	326	434	336	212

	A	B	C	D	E
0	one	a	foo	1.795658	-1.692460
1	one	b	foo	-0.072229	0.080841
2	two	c	foo	0.579050	-0.484044
3	three	a	bar	1.802420	0.343524
4	one	b	bar	-0.163139	-1.922711
5	one	c	bar	-0.411838	0.239472
6	two	a	foo	-0.167926	0.779548
7	three	b	foo	0.091777	-0.745811
8	one	c	foo	0.629784	0.927211
9	one	a	bar	-1.127097	0.671469
10	two	b	bar	-0.560821	-0.043930
11	three	c	bar	-1.625592	-0.932791

	TaxName	species	Abundance	SampleCategory	frequency	rank
0	Rothia mucilaginosa	43675	9.861078	Caries	10.0	1.0
1	Veillonella atypica	39777	7.030844	Caries	10.0	2.0
2	Prevotella histicola	470565	6.433449	Caries	10.0	3.0

	Taxon	Count
0	Archaea "Crenarchaeota" Thermoprotei Desulfuro...	7
1	Archaea "Crenarchaeota" Thermoprotei Desulfuro...	2
2	Archaea "Crenarchaeota" Thermoprotei Sulfoloba...	3
3	Archaea "Crenarchaeota" Thermoprotei Thermopro...	3
4	Archaea "Euryarchaeota" "Methanomicrobia" Meth...	7

	Taxon	Count	superkingdom	rest
0	Archaea "Crenarchaeota" Thermoprotei Desulfuro...	7	Archaea	"Crenarchaeota" Thermoprotei Desulfurococcales...
1	Archaea "Crenarchaeota" Thermoprotei Desulfuro...	2	Archaea	"Crenarchaeota" Thermoprotei Desulfurococcales...
2	Archaea "Crenarchaeota" Thermoprotei Sulfoloba...	3	Archaea	"Crenarchaeota" Thermoprotei Sulfolobales Sulf...

	Intensity First	Intensity Second	Intensity Third
A	5.428835	4.985766	1.817121
B	1.817121	1.817121	2.884499
C	2.884499	4.985766	4.578857
D	4.578857	2.884499	5.428835