In [1]:
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
from IPython.display import Image
import matplotlib
%matplotlib inline
matplotlib.style.use('ggplot')

Table of Contents


In [2]:
%%javascript
$.getScript('misc/kmahelona_ipython_notebook_toc.js')


see Installation_Customization_Resources.txt for useful infos and links

IPython: beyond plain Python

IPython provides a rich toolkit to help you make the most out of using Python, with:

  • TAB COMPLETION
  • Powerful Python shells (terminal and Qt-based).
  • A web-based notebook with the same core features but support for code, text, mathematical expressions, inline plots and other rich media.
  • Support for interactive data visualization and use of GUI toolkits.
  • Flexible, embeddable interpreters to load into your own projects.
  • Easy to use, high performance tools for parallel computing.

Jupyter notebook

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.

Checkout the "command palette" for useful commands and shortcuts.


In [3]:
from IPython.core.display import HTML
HTML("<iframe src=https://jupyter.readthedocs.io/en/latest/_images/notebook_components.png width=800 height=400></iframe>")


Out[3]:
  • Esc and Enter to toggle between "Command" and "Edit" mode --> see colour indicator of frame surrounding the cell

  • when in "Command" mode:

    • use cursor keys to navigate
    • a to insert cell Above
    • b to insert cell Below
    • dd to delete cell
  • when in "Edit" mode:

    • Tab for tab completion of file-names
    • Shift + Tab in parenthesis of function to display docstring
    • Cmd + / toggle line comment

Running code, getting help

In the notebook, to run a cell of code, hit Shift + Enter. This executes the cell and puts the cursor in the next cell below, or makes a new one if you are at the end. Alternately, you can use:

  • Alt + Enter to force the creation of a new cell unconditionally (useful when inserting new content in the middle of an existing notebook).
  • Control + Enter executes the cell and keeps the cursor in the same cell, useful for quick experimentation of snippets that you don't need to keep permanently.
  • Esc and Enter to toggle between "Command" and "Edit" mode --> see colour indicator of frame surrounding the cell
  • h for Help or checkout "Help" --> "Keyboard shortcuts" from the menu
  • Shift + Tab inside of the brackets of a function to get the Docstring
  • Cmd + / to toggle line comment

Shell commands


In [4]:
!pwd


/Users/dblyon/modules/cpr/PandasIntro

In [5]:
!ls -la


total 3496
drwxr-xr-x  17 dblyon  staff     578 Apr 28 15:21 .
drwxr-xr-x  14 dblyon  staff     476 Apr 27 23:12 ..
drwxr-xr-x  13 dblyon  staff     442 Apr 27 23:23 .git
-rw-r--r--   1 dblyon  staff    1045 Apr 27 23:12 .gitignore
drwxr-xr-x   7 dblyon  staff     238 Apr 27 23:21 .ipynb_checkpoints
-rw-r--r--   1 dblyon  staff   18747 Apr 28 15:02 Exercises_part_A.ipynb
-rw-r--r--   1 dblyon  staff   67287 Apr 28 15:03 Exercises_part_A_with_Solutions.ipynb
-rw-r--r--   1 dblyon  staff   20111 Apr 27 23:20 Exercises_part_B.ipynb
-rw-r--r--   1 dblyon  staff  263580 Apr 28 15:00 Exercises_part_B_with_Solutions.ipynb
-rw-r--r--   1 dblyon  staff    3867 Apr 27 23:12 Installation_Customization_Resources.txt
-rw-r--r--   1 dblyon  staff    1067 Apr 27 23:12 LICENSE
-rw-r--r--   1 dblyon  staff  685284 Apr 27 23:12 Pandas_Cheat_Sheet.pdf
-rw-r--r--   1 dblyon  staff  700478 Apr 28 15:21 Pandas_Introduction.ipynb
-rw-r--r--@  1 dblyon  staff    3429 Apr 27 23:16 Readme.md
-rw-r--r--   1 dblyon  staff    3478 Apr 27 23:12 Readme.txt
drwxr-xr-x   9 dblyon  staff     306 Apr 27 23:12 data
drwxr-xr-x   4 dblyon  staff     136 Apr 27 23:12 misc

In [6]:
var = !ls
len(var), var[:3]


Out[6]:
(12,
 ['Exercises_part_A.ipynb',
  'Exercises_part_A_with_Solutions.ipynb',
  'Exercises_part_B.ipynb'])

In [7]:
%%bash
echo "My shell is:" $SHELL
echo "My disk usage is:"
df -h


My shell is: /bin/zsh
My disk usage is:
Filesystem                          Size   Used  Avail Capacity  iused      ifree %iused  Mounted on
/dev/disk1                         465Gi  404Gi   61Gi    87%  1939768 4293027511    0%   /
devfs                              190Ki  190Ki    0Bi   100%      658          0  100%   /dev
map -hosts                           0Bi    0Bi    0Bi   100%        0          0  100%   /net
map auto_home                        0Bi    0Bi    0Bi   100%        0          0  100%   /home
localhost:/IasP10O30clRK_kqj-ftyr  465Gi  465Gi    0Bi   100%        0          0  100%   /Volumes/MobileBackups
/dev/disk2s1                       1.8Ti  414Gi  1.4Ti    23%  3392373   11868959   22%   /Volumes/Speedy
/dev/disk3s1                       2.7Ti  2.3Ti  476Gi    83% 13196722 4281770557    0%   /Volumes/Data

helpful IPython commands

command description
? Introduction and overview of IPython’s features.
%quickref Quick reference.
help Python’s own help system.
?object Details about ‘object’, use ‘??object’ for extra details.

In [8]:
# ?math.log

In [9]:
help(math.log)


Help on built-in function log in module math:

log(...)
    log(x[, base])
    
    Return the logarithm of x to the given base.
    If the base not specified, returns the natural logarithm (base e) of x.

press Shift + Tab inside of the brackets to get the Docstring


In [10]:
# math.log()

Magic


In [11]:
# %magic

In [12]:
values = range(1, 1001)

In [13]:
%%timeit 
results = []
for val in values:
    new_val = math.log(val, 10)
    results.append(new_val)


1000 loops, best of 3: 346 µs per loop

In [14]:
%%timeit 
results = [math.log(val, 10) for val in values]


1000 loops, best of 3: 258 µs per loop

In [15]:
%%timeit 
results = np.log10(values)


The slowest run took 6.30 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 59.2 µs per loop

Introduction to Pandas

pandas is a Python package providing fast, flexible, and expressive data structures designed to work with relational or labeled data both. It is a fundamental high-level building block for doing practical, real world data analysis in Python.

pandas is well suited for:

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time series data.
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

Key features:

  • Easy handling of missing data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes
  • Robust IO tools for loading data from flat files, Excel files, databases, and HDF5
  • Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

Pandas cheat sheet

(download the pdf or find it in the git repo of the current workshop)

Object creation

Creating a Series by passing a list of values, letting pandas create a default integer index:


In [16]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s


Out[16]:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating a DataFrame by passing a numpy array, with labeled columns:


In [17]:
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
df


Out[17]:
A B C D
0 -0.169900 0.289323 0.573236 1.823558
1 1.876850 1.270691 0.886132 1.507054
2 -0.282954 -0.441011 0.184304 1.661792
3 0.227809 0.015566 -1.719689 0.075242
4 1.293757 -0.551220 2.058666 0.039746
5 1.218084 1.164723 -1.383697 -0.128685

Creating a DataFrame by passing a dict of objects that can be converted to series-like.


In [18]:
df2 = pd.DataFrame({'A' : 1.,                    
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D' : np.array([3] * 4, dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : 'foo',
                    'tinyRick': pd.Series([2000])})
df2


Out[18]:
A B C D E F tinyRick
0 1.0 2013-01-02 1.0 3 test foo 2000.0
1 1.0 2013-01-02 1.0 3 train foo NaN
2 1.0 2013-01-02 1.0 3 test foo NaN
3 1.0 2013-01-02 1.0 3 train foo NaN

Having specific dtypes (data types)


In [19]:
df2.dtypes


Out[19]:
A                  float64
B           datetime64[ns]
C                  float32
D                    int32
E                 category
F                   object
tinyRick           float64
dtype: object

change the dtype


In [20]:
df2["D"] = df2["D"].astype('float64')

Viewing Data

See the top & bottom rows of the frame


In [21]:
df.head()


Out[21]:
A B C D
0 -0.169900 0.289323 0.573236 1.823558
1 1.876850 1.270691 0.886132 1.507054
2 -0.282954 -0.441011 0.184304 1.661792
3 0.227809 0.015566 -1.719689 0.075242
4 1.293757 -0.551220 2.058666 0.039746

In [22]:
df.tail(3)


Out[22]:
A B C D
3 0.227809 0.015566 -1.719689 0.075242
4 1.293757 -0.551220 2.058666 0.039746
5 1.218084 1.164723 -1.383697 -0.128685

Display the index, columns, and the underlying numpy data


In [23]:
df.index


Out[23]:
RangeIndex(start=0, stop=6, step=1)

In [24]:
df.columns


Out[24]:
Index([u'A', u'B', u'C', u'D'], dtype='object')

In [25]:
df.values


Out[25]:
array([[-0.16989951,  0.28932253,  0.57323584,  1.82355794],
       [ 1.87684958,  1.27069089,  0.88613196,  1.50705393],
       [-0.28295392, -0.4410113 ,  0.18430394,  1.66179244],
       [ 0.22780917,  0.01556611, -1.71968866,  0.07524179],
       [ 1.29375737, -0.55121983,  2.05866642,  0.03974571],
       [ 1.21808418,  1.16472342, -1.38369711, -0.12868522]])

Describe shows a quick statistic summary of your data excluding NaN (Not a Number) values


In [26]:
df.describe()


Out[26]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.693941 0.291345 0.099825 0.829784
std 0.889000 0.780286 1.428150 0.922028
min -0.282954 -0.551220 -1.719689 -0.128685
25% -0.070472 -0.326867 -0.991697 0.048620
50% 0.722947 0.152444 0.378770 0.791148
75% 1.274839 0.945873 0.807908 1.623108
max 1.876850 1.270691 2.058666 1.823558

Concise summary of a DataFrame


In [27]:
df2.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 7 columns):
A           4 non-null float64
B           4 non-null datetime64[ns]
C           4 non-null float32
D           4 non-null float64
E           4 non-null category
F           4 non-null object
tinyRick    1 non-null float64
dtypes: category(1), datetime64[ns](1), float32(1), float64(3), object(1)
memory usage: 228.0+ bytes

Transposing your data


In [28]:
df.T


Out[28]:
0 1 2 3 4 5
A -0.169900 1.876850 -0.282954 0.227809 1.293757 1.218084
B 0.289323 1.270691 -0.441011 0.015566 -0.551220 1.164723
C 0.573236 0.886132 0.184304 -1.719689 2.058666 -1.383697
D 1.823558 1.507054 1.661792 0.075242 0.039746 -0.128685

Sorting by values


In [29]:
df.sort_values(by='B')


Out[29]:
A B C D
4 1.293757 -0.551220 2.058666 0.039746
2 -0.282954 -0.441011 0.184304 1.661792
3 0.227809 0.015566 -1.719689 0.075242
0 -0.169900 0.289323 0.573236 1.823558
5 1.218084 1.164723 -1.383697 -0.128685
1 1.876850 1.270691 0.886132 1.507054

Selection

important ones:

  • .loc works on label of index and boolean array
  • .iloc works on integer position (from 0 to length-1 of the axis)

for completeness/comparison:

  • .ix first performs label based, if that fails then it falls to integer based --> don't use it (deprecated soon, prone to mistakes)
  • .at get scalar values. It's a very fast loc
  • .iat Get scalar values. It's a very fast iloc

CAVEATS: label-based slicing in pandas is inclusive. The primary reason for this is that it is often not possible to easily determine the “successor” or next element after a particular label in an index.

Getting

Selecting a single column, which yields a Series.


In [30]:
df["A"]


Out[30]:
0   -0.169900
1    1.876850
2   -0.282954
3    0.227809
4    1.293757
5    1.218084
Name: A, dtype: float64

equivalent


In [31]:
df.A


Out[31]:
0   -0.169900
1    1.876850
2   -0.282954
3    0.227809
4    1.293757
5    1.218084
Name: A, dtype: float64

selecting with a list of column names, yields a data frame (also with a single column name)


In [32]:
df[["A", "C"]]


Out[32]:
A C
0 -0.169900 0.573236
1 1.876850 0.886132
2 -0.282954 0.184304
3 0.227809 -1.719689
4 1.293757 2.058666
5 1.218084 -1.383697

Selecting via [], which slices the rows.


In [33]:
df[0:3]


Out[33]:
A B C D
0 -0.169900 0.289323 0.573236 1.823558
1 1.876850 1.270691 0.886132 1.507054
2 -0.282954 -0.441011 0.184304 1.661792

Selection by Label

df.loc[rows, columns]

.loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found. Allowed inputs are:

  • A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index. This use is not an integer position along the index)
  • A list or array of labels ['a', 'b', 'c']
  • A slice object with labels 'a':'f', (note that contrary to usual python slices, both the start and the stop are included!)
  • A boolean array
  • A callable function with one argument (the calling Series, DataFrame or Panel) and that returns valid output for indexing (one of the above)

In [34]:
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
df


Out[34]:
A B C D
0 -0.008918 -1.209779 1.830216 -1.031679
1 -0.418588 0.792097 -2.936405 -1.203708
2 0.577104 -1.230854 0.214456 0.898157
3 -0.375449 0.339775 0.551374 -1.198762
4 -0.104258 -1.646144 -0.202755 -1.118140
5 0.244080 0.297405 0.080183 -0.080778

In [35]:
# Let's transpose the DataFrame and change the order of the columns (to force us to the string labels of the novel row index).
# row-index: Strings, columns: Integers
dfx = df.T
import random
columns = dfx.columns.tolist()
random.shuffle(columns)
dfx.columns = columns
dfx


Out[35]:
4 0 1 5 2 3
A -0.008918 -0.418588 0.577104 -0.375449 -0.104258 0.244080
B -1.209779 0.792097 -1.230854 0.339775 -1.646144 0.297405
C 1.830216 -2.936405 0.214456 0.551374 -0.202755 0.080183
D -1.031679 -1.203708 0.898157 -1.198762 -1.118140 -0.080778

In [36]:
dfx.loc["A":"C", ]


Out[36]:
4 0 1 5 2 3
A -0.008918 -0.418588 0.577104 -0.375449 -0.104258 0.244080
B -1.209779 0.792097 -1.230854 0.339775 -1.646144 0.297405
C 1.830216 -2.936405 0.214456 0.551374 -0.202755 0.080183

In [37]:
dfx.loc[["C", "A", "D"], ]


Out[37]:
4 0 1 5 2 3
C 1.830216 -2.936405 0.214456 0.551374 -0.202755 0.080183
A -0.008918 -0.418588 0.577104 -0.375449 -0.104258 0.244080
D -1.031679 -1.203708 0.898157 -1.198762 -1.118140 -0.080778

In [38]:
dfx.loc[["A", "D"], 0:2]


Out[38]:
0 1 5 2
A -0.418588 0.577104 -0.375449 -0.104258
D -1.203708 0.898157 -1.198762 -1.118140

Selection by Position

df.iloc[rows, columns]

.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing. (this conforms with python/numpy slice semantics). Allowed inputs are:

  • An integer e.g. 5
  • A list or array of integers [4, 3, 0]
  • A slice object with ints 1:7
  • A boolean array
  • A callable function with one argument (the calling Series, DataFrame or Panel) and that returns valid output for indexing (one of the above)

In [39]:
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
df


Out[39]:
A B C D
0 1.695964 -2.460150 -2.087428 0.172406
1 -0.750895 2.017630 -0.725006 1.508415
2 -0.602350 2.511633 -0.181661 -0.463247
3 -0.615276 1.317014 -0.902506 0.125343
4 -1.099925 0.316938 1.285934 2.671944
5 -0.379311 0.610271 -1.756487 -0.195240

In [40]:
### transpose
# df = df.T
# df

In [41]:
df.iloc[0:3, :]


Out[41]:
A B C D
0 1.695964 -2.460150 -2.087428 0.172406
1 -0.750895 2.017630 -0.725006 1.508415
2 -0.602350 2.511633 -0.181661 -0.463247

uncomment the transpose and execute the two cells above again and notice that this doesn't raise an Error compared to label based indexing.


In [42]:
df.iloc[[1, 3, 5], 1:3]


Out[42]:
B C
1 2.017630 -0.725006
3 1.317014 -0.902506
5 0.610271 -1.756487

Boolean indexing


In [43]:
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
df


Out[43]:
A B C D
0 -0.025316 0.576631 -0.097271 -0.830734
1 -0.216126 1.393101 -1.100897 0.252488
2 -0.519939 0.459839 0.446857 0.165097
3 -0.213917 0.011958 0.156027 1.678674
4 0.919606 0.556492 -1.089998 0.681550
5 0.390535 -1.825435 0.009902 -0.558049

Using a single column’s values to select data.


In [44]:
df[df["A"] > 0]


Out[44]:
A B C D
4 0.919606 0.556492 -1.089998 0.681550
5 0.390535 -1.825435 0.009902 -0.558049

A where operation for getting.


In [45]:
df[df > 0]


Out[45]:
A B C D
0 NaN 0.576631 NaN NaN
1 NaN 1.393101 NaN 0.252488
2 NaN 0.459839 0.446857 0.165097
3 NaN 0.011958 0.156027 1.678674
4 0.919606 0.556492 NaN 0.681550
5 0.390535 NaN 0.009902 NaN

This comparison yields a Pandas.Series of Booleans.


In [46]:
cond = df["A"] > 0
print type(cond)
cond


<class 'pandas.core.series.Series'>
Out[46]:
0    False
1    False
2    False
3    False
4     True
5     True
Name: A, dtype: bool

Summing boolean arrays can come in handy (True=1, False=0)


In [47]:
sum(cond), len(cond) # or cond.sum(), cond.shape[0]


Out[47]:
(2, 6)

Frequent use case: combining a Series of Bools with specific column names to select data.


In [48]:
df.loc[cond, ["A", "C"]]


Out[48]:
A C
4 0.919606 -1.089998
5 0.390535 0.009902

Let's add a column to the DataFrame


In [49]:
df['E'] = ['one', 'one','two','three','four','three']

In [50]:
df


Out[50]:
A B C D E
0 -0.025316 0.576631 -0.097271 -0.830734 one
1 -0.216126 1.393101 -1.100897 0.252488 one
2 -0.519939 0.459839 0.446857 0.165097 two
3 -0.213917 0.011958 0.156027 1.678674 three
4 0.919606 0.556492 -1.089998 0.681550 four
5 0.390535 -1.825435 0.009902 -0.558049 three

Using the isin() method for filtering:


In [51]:
df[df['E'].isin(['two','four'])]


Out[51]:
A B C D E
2 -0.519939 0.459839 0.446857 0.165097 two
4 0.919606 0.556492 -1.089998 0.681550 four

Setting

Setting a new column automatically aligns the data by the indexes


In [52]:
dates = pd.date_range('20130101', periods=6)
dfx = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
dfx


Out[52]:
A B C D
2013-01-01 0.718354 -0.365003 -0.000459 1.352737
2013-01-02 -0.319766 -0.789105 0.011944 -0.096405
2013-01-03 -1.240610 1.632686 -0.356929 -0.272638
2013-01-04 -2.367895 -0.407514 0.725953 -0.658410
2013-01-05 -1.267723 0.004515 -0.893457 0.087957
2013-01-06 -1.475928 -0.788952 -0.431096 -0.456979

In [53]:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
s1 = s1.sort_values(ascending=False)
s1


Out[53]:
2013-01-07    6
2013-01-06    5
2013-01-05    4
2013-01-04    3
2013-01-03    2
2013-01-02    1
dtype: int64

In [54]:
dfx["F"] = s1
dfx


Out[54]:
A B C D F
2013-01-01 0.718354 -0.365003 -0.000459 1.352737 NaN
2013-01-02 -0.319766 -0.789105 0.011944 -0.096405 1.0
2013-01-03 -1.240610 1.632686 -0.356929 -0.272638 2.0
2013-01-04 -2.367895 -0.407514 0.725953 -0.658410 3.0
2013-01-05 -1.267723 0.004515 -0.893457 0.087957 4.0
2013-01-06 -1.475928 -0.788952 -0.431096 -0.456979 5.0

Setting values by label


In [55]:
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
df['E'] = ['one', 'one','two','three','four','three']
df


Out[55]:
A B C D E
0 -1.693542 0.091828 -1.580609 0.594735 one
1 0.914928 1.528036 -1.870334 0.285952 one
2 0.731091 0.546496 -0.073345 0.063040 two
3 -0.804229 -0.351444 1.222302 -1.663926 three
4 0.404843 -0.471565 0.748519 -0.597049 four
5 0.238829 1.206530 0.137581 -0.188514 three

In [56]:
df.loc[5, "A"] = 0.815

Setting values by position


In [57]:
df.iloc[0, 4] = "zero"

delete (drop) some rows


In [58]:
df.drop([3, 5])


Out[58]:
A B C D E
0 -1.693542 0.091828 -1.580609 0.594735 zero
1 0.914928 1.528036 -1.870334 0.285952 one
2 0.731091 0.546496 -0.073345 0.063040 two
4 0.404843 -0.471565 0.748519 -0.597049 four

delete (drop) a column


In [59]:
df = df.drop("E", axis=1)
df


Out[59]:
A B C D
0 -1.693542 0.091828 -1.580609 0.594735
1 0.914928 1.528036 -1.870334 0.285952
2 0.731091 0.546496 -0.073345 0.063040
3 -0.804229 -0.351444 1.222302 -1.663926
4 0.404843 -0.471565 0.748519 -0.597049
5 0.815000 1.206530 0.137581 -0.188514

Axis

  • default: axis=0
  • axis=0 is "column-wise", for each column along the rows, horizontal
  • axis=1 is "row-wise", for each row along the columns, vertical

In [60]:
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
df


Out[60]:
A B C D
0 -3.483162 0.674215 -0.863999 -0.921991
1 -1.089165 0.215477 1.594555 0.613976
2 0.443881 -1.100082 1.185749 0.137363
3 1.382410 -0.396501 -0.025471 -1.714747
4 1.498364 1.264864 0.242646 -0.432940
5 -1.124187 -0.304343 0.030233 0.093802

A where operation with setting.


In [61]:
df[df > 0] = -df
df


Out[61]:
A B C D
0 -3.483162 -0.674215 -0.863999 -0.921991
1 -1.089165 -0.215477 -1.594555 -0.613976
2 -0.443881 -1.100082 -1.185749 -0.137363
3 -1.382410 -0.396501 -0.025471 -1.714747
4 -1.498364 -1.264864 -0.242646 -0.432940
5 -1.124187 -0.304343 -0.030233 -0.093802

Multiply with a scalar.


In [62]:
df = df * -1
df


Out[62]:
A B C D
0 3.483162 0.674215 0.863999 0.921991
1 1.089165 0.215477 1.594555 0.613976
2 0.443881 1.100082 1.185749 0.137363
3 1.382410 0.396501 0.025471 1.714747
4 1.498364 1.264864 0.242646 0.432940
5 1.124187 0.304343 0.030233 0.093802

Row wise division


In [63]:
df["F"] = df["A"] / df["B"]
df


Out[63]:
A B C D F
0 3.483162 0.674215 0.863999 0.921991 5.166246
1 1.089165 0.215477 1.594555 0.613976 5.054672
2 0.443881 1.100082 1.185749 0.137363 0.403498
3 1.382410 0.396501 0.025471 1.714747 3.486522
4 1.498364 1.264864 0.242646 0.432940 1.184605
5 1.124187 0.304343 0.030233 0.093802 3.693813

Missing Data

pandas primarily uses the value np.nan to represent missing data.


In [64]:
df.loc[0, ] = np.nan
df.loc[2, ["A", "C"]] = np.nan
df


Out[64]:
A B C D F
0 NaN NaN NaN NaN NaN
1 1.089165 0.215477 1.594555 0.613976 5.054672
2 NaN 1.100082 NaN 0.137363 0.403498
3 1.382410 0.396501 0.025471 1.714747 3.486522
4 1.498364 1.264864 0.242646 0.432940 1.184605
5 1.124187 0.304343 0.030233 0.093802 3.693813

To drop any rows that have missing data.


In [65]:
df.dropna()


Out[65]:
A B C D F
1 1.089165 0.215477 1.594555 0.613976 5.054672
3 1.382410 0.396501 0.025471 1.714747 3.486522
4 1.498364 1.264864 0.242646 0.432940 1.184605
5 1.124187 0.304343 0.030233 0.093802 3.693813

Drop only rows where all values are missing.


In [66]:
df.dropna(how='all')


Out[66]:
A B C D F
1 1.089165 0.215477 1.594555 0.613976 5.054672
2 NaN 1.100082 NaN 0.137363 0.403498
3 1.382410 0.396501 0.025471 1.714747 3.486522
4 1.498364 1.264864 0.242646 0.432940 1.184605
5 1.124187 0.304343 0.030233 0.093802 3.693813

Fill missing values


In [67]:
df.fillna(value=5)


Out[67]:
A B C D F
0 5.000000 5.000000 5.000000 5.000000 5.000000
1 1.089165 0.215477 1.594555 0.613976 5.054672
2 5.000000 1.100082 5.000000 0.137363 0.403498
3 1.382410 0.396501 0.025471 1.714747 3.486522
4 1.498364 1.264864 0.242646 0.432940 1.184605
5 1.124187 0.304343 0.030233 0.093802 3.693813

see replace method to replace values given in 'to_replace' with 'value'.

To get the boolean mask where values are nan


In [68]:
df.isnull() # or pd.isnull(df)


Out[68]:
A B C D F
0 True True True True True
1 False False False False False
2 True False True False False
3 False False False False False
4 False False False False False
5 False False False False False

In [69]:
df.notnull() == -df.isnull()


Out[69]:
A B C D F
0 True True True True True
1 True True True True True
2 True True True True True
3 True True True True True
4 True True True True True
5 True True True True True

Duplicates


In [70]:
dfx = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
dfx


Out[70]:
A B C D
0 foo one 0.373508 2.153941
1 bar one 0.356375 0.453069
2 foo two -0.491507 0.598356
3 bar three 0.710054 -1.016165
4 foo two -0.209080 -1.164680
5 bar two 0.028916 1.792031
6 foo one -0.348100 -1.193166
7 foo three 0.335917 0.067953

Drop duplicates except for the first occurrence, considering only a certain column.


In [71]:
dfx.drop_duplicates(subset="A")


Out[71]:
A B C D
0 foo one 0.373508 2.153941
1 bar one 0.356375 0.453069

see also duplicated method to return boolean Series denoting duplicate rows, optionally only considering certain columns


In [72]:
df


Out[72]:
A B C D F
0 NaN NaN NaN NaN NaN
1 1.089165 0.215477 1.594555 0.613976 5.054672
2 NaN 1.100082 NaN 0.137363 0.403498
3 1.382410 0.396501 0.025471 1.714747 3.486522
4 1.498364 1.264864 0.242646 0.432940 1.184605
5 1.124187 0.304343 0.030233 0.093802 3.693813

Operations

Stats

Operations in general exclude missing data.

Performing a descriptive statistic


In [73]:
df.mean()


Out[73]:
A    1.273531
B    0.656253
C    0.473226
D    0.598566
F    2.764622
dtype: float64

On the other axis


In [74]:
df.median(axis=1)


Out[74]:
0         NaN
1    1.089165
2    0.403498
3    1.382410
4    1.184605
5    0.304343
dtype: float64

The full range of basic statistics that are quickly calculable and built into the base Pandas package are:

Function Description
count Number of non-null observations
sum Sum of values
mean Mean of values
mad Mean absolute deviation
median Arithmetic median of values
min Minimum
max Maximum
mode Mode
abs Absolute Value
prod Product of values
std Bessel-corrected sample standard deviation
var Unbiased variance
sem Standard error of the mean
skew Sample skewness (3rd moment)
kurt Sample kurtosis (4th moment)
quantile Sample quantile (value at %)
cumsum Cumulative sum
cumprod Cumulative product
cummax Cumulative maximum
cummin Cumulative minimum

The need for custom functions is minimal unless you have very specific requirements. c.f. http://pandas.pydata.org/pandas-docs/stable/basics.html

e.g. compute pairwise covariance of columns, excluding NA/null values


In [75]:
df.cov()


Out[75]:
A B C D F
A 0.039566 0.080791 -0.080390 0.052278 -0.279206
B 0.080791 0.238247 -0.120797 -0.106147 -0.883364
C -0.080390 -0.120797 0.569091 -0.073566 0.732454
D 0.052278 -0.106147 -0.073566 0.435256 0.430671
F -0.279206 -0.883364 0.732454 0.430671 3.675057

Apply

Applying functions to the data


In [76]:
df = pd.DataFrame(np.random.randint(0,100,size=(6, 4)), columns=list('ABCD'))
df


Out[76]:
A B C D
0 91 96 38 6
1 55 58 44 17
2 59 57 79 80
3 89 99 66 38
4 24 36 28 70
5 8 88 81 1

In [77]:
df.apply(np.cumsum, axis=0)


Out[77]:
A B C D
0 91 96 38 6
1 146 154 82 23
2 205 211 161 103
3 294 310 227 141
4 318 346 255 211
5 326 434 336 212

In [78]:
df.apply(lambda x: x.max() - x.min())


Out[78]:
A    83
B    63
C    53
D    79
dtype: int64

Merge

pandas provides various facilities for easily combining together Series, DataFrame, and Panel objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

Concat


In [79]:
df = pd.DataFrame(np.random.randn(10, 4))
df


Out[79]:
0 1 2 3
0 0.970496 0.246441 0.844668 0.150986
1 0.341910 -0.633616 -1.376025 -0.473660
2 0.792826 -1.636417 -0.346066 -1.565380
3 -0.216165 0.489109 0.885583 -2.011223
4 -0.083781 -1.053215 0.693787 0.111576
5 1.120049 1.054716 -0.910174 -1.367579
6 -0.075761 -0.222491 0.949213 1.122462
7 2.151368 1.404117 0.746708 1.288665
8 1.224616 -0.189300 -1.982512 -0.961140
9 -1.139796 -1.253938 0.838092 0.869218

In [80]:
# break it into multiple pieces
pieces = [df[3:7], df[:3], df[7:]]
pd.concat(pieces)


Out[80]:
0 1 2 3
3 -0.216165 0.489109 0.885583 -2.011223
4 -0.083781 -1.053215 0.693787 0.111576
5 1.120049 1.054716 -0.910174 -1.367579
6 -0.075761 -0.222491 0.949213 1.122462
0 0.970496 0.246441 0.844668 0.150986
1 0.341910 -0.633616 -1.376025 -0.473660
2 0.792826 -1.636417 -0.346066 -1.565380
7 2.151368 1.404117 0.746708 1.288665
8 1.224616 -0.189300 -1.982512 -0.961140
9 -1.139796 -1.253938 0.838092 0.869218

Join

SQL style merges.


In [81]:
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['bar', 'foo'], 'rval': [4, 5]})
left


Out[81]:
key lval
0 foo 1
1 bar 2

In [82]:
right


Out[82]:
key rval
0 bar 4
1 foo 5

In [83]:
pd.merge(left, right, on='key')


Out[83]:
key lval rval
0 foo 1 5
1 bar 2 4

Append

Append rows to a dataframe.


In [84]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
s = df.iloc[3]
df


Out[84]:
A B C D
0 0.969489 -1.604153 -0.248921 1.038086
1 0.412990 -1.012526 0.534236 -1.596054
2 1.018075 1.128761 1.131470 -0.284683
3 0.932039 -0.080776 1.309474 -1.010208
4 0.043121 0.776682 1.384962 -1.658708
5 -0.114250 -0.229319 0.954087 0.559766
6 -0.607813 -1.179375 1.980276 -0.360193
7 -0.468785 0.287706 0.070535 1.491814

In [85]:
df.append(s, ignore_index=True)


Out[85]:
A B C D
0 0.969489 -1.604153 -0.248921 1.038086
1 0.412990 -1.012526 0.534236 -1.596054
2 1.018075 1.128761 1.131470 -0.284683
3 0.932039 -0.080776 1.309474 -1.010208
4 0.043121 0.776682 1.384962 -1.658708
5 -0.114250 -0.229319 0.954087 0.559766
6 -0.607813 -1.179375 1.980276 -0.360193
7 -0.468785 0.287706 0.070535 1.491814
8 0.932039 -0.080776 1.309474 -1.010208

Grouping

By “group by” we are referring to a process involving one or more of the following steps

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure

In [86]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df


Out[86]:
A B C D
0 foo one -0.595697 -0.216727
1 bar one 1.480251 0.311373
2 foo two -1.836842 -0.377590
3 bar three 0.640351 -0.532001
4 foo two 0.538425 0.530609
5 bar two -0.436455 -1.024308
6 foo one -0.057440 1.501428
7 foo three -0.260092 -0.684502

Grouping and then applying a function sum to the resulting groups.


In [87]:
df.groupby('A').sum()


Out[87]:
C D
A
bar 1.684147 -1.244935
foo -2.211646 0.753217

Grouping by multiple columns forms a hierarchical index, which we then apply the function.


In [88]:
df.groupby(['A','B']).sum()


Out[88]:
C D
A B
bar one 1.480251 0.311373
three 0.640351 -0.532001
two -0.436455 -1.024308
foo one -0.653137 1.284701
three -0.260092 -0.684502
two -1.298417 0.153018

Series.nlargest: Return the largest n elements. Series.nsmallest: Return the smallest n elements.


In [89]:
df.groupby('A')["C"].nlargest(2)


Out[89]:
A     
bar  1    1.480251
     3    0.640351
foo  4    0.538425
     6   -0.057440
Name: C, dtype: float64

Splitting

Iterate over a groupby object, just to illustrate the split part of the groupby method.


In [90]:
grouped = df.groupby(['A','B'])
for name, group in grouped:
    print "Name:", name
    print group
    print "#"*50


Name: ('bar', 'one')
     A    B         C         D
1  bar  one  1.480251  0.311373
##################################################
Name: ('bar', 'three')
     A      B         C         D
3  bar  three  0.640351 -0.532001
##################################################
Name: ('bar', 'two')
     A    B         C         D
5  bar  two -0.436455 -1.024308
##################################################
Name: ('foo', 'one')
     A    B         C         D
0  foo  one -0.595697 -0.216727
6  foo  one -0.057440  1.501428
##################################################
Name: ('foo', 'three')
     A      B         C         D
7  foo  three -0.260092 -0.684502
##################################################
Name: ('foo', 'two')
     A    B         C         D
2  foo  two -1.836842 -0.377590
4  foo  two  0.538425  0.530609
##################################################

Applying

  • Aggregation: computing a summary statistic (or statistics) about each group. Some examples:

    • Compute group sums or means
    • Compute group sizes / counts
  • Transformation: perform some group-specific computations and return a like-indexed. Some examples:

    • Standardizing data (zscore) within group
    • Filling NAs within groups with a value derived from each group
  • Filtration: discard some groups, according to a group-wise computation that evaluates True or False. Some examples:

    • Discarding data that belongs to groups with only a few members
    • Filtering out data based on the group sum or mean

Aggregation

"reduces" the DataFrame, meaning the df_original.shape > df_aggregated.shape


In [91]:
grouped.aggregate(["sum", "count", "median", "mean"]) # see `Operations` above for more built-in methods


Out[91]:
C D
sum count median mean sum count median mean
A B
bar one 1.480251 1 1.480251 1.480251 0.311373 1 0.311373 0.311373
three 0.640351 1 0.640351 0.640351 -0.532001 1 -0.532001 -0.532001
two -0.436455 1 -0.436455 -0.436455 -1.024308 1 -1.024308 -1.024308
foo one -0.653137 2 -0.326569 -0.326569 1.284701 2 0.642350 0.642350
three -0.260092 1 -0.260092 -0.260092 -0.684502 1 -0.684502 -0.684502
two -1.298417 2 -0.649208 -0.649208 0.153018 2 0.076509 0.076509

Transformation

returns an object that is indexed the same (same size) as the one being grouped. Thus, the passed transform function should return a result that is the same size as the group chunk.


In [92]:
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],  
                   'd': ['z','z','z','o','o','o']})
df['e'] = df['a'] + df['b']
df


Out[92]:
a b c d e
0 1 1 q z 2
1 2 2 q z 4
2 3 3 q z 6
3 4 4 q o 8
4 5 5 w o 10
5 6 6 w o 12

In [93]:
df['f'] = (df.groupby(['c', 'd'])['e'].transform('sum'))
df


Out[93]:
a b c d e f
0 1 1 q z 2 12
1 2 2 q z 4 12
2 3 3 q z 6 12
3 4 4 q o 8 8
4 5 5 w o 10 22
5 6 6 w o 12 22

In [94]:
assert df.loc[0, "f"] == df.loc[( (df["c"] == "q") & (df["d"] == "z") ), "e"].sum()

Filtration

returns a subset of the original object.


In [95]:
df = pd.DataFrame({'A': np.arange(8), 'B': list('aabbbbcc')})
df


Out[95]:
A B
0 0 a
1 1 a
2 2 b
3 3 b
4 4 b
5 5 b
6 6 c
7 7 c

In [96]:
df.groupby('B').filter(lambda x: len(x) > 2)


Out[96]:
A B
2 2 b
3 3 b
4 4 b
5 5 b

Apply

Another frequent operation is applying a function on 1D arrays to each column or row. Apply a custom function to the entire DataFrame or a Series. axis indicates row or column-wise application. (default: axis=0)

apply on a Series performs an element-wise operation


In [97]:
def example_custom_function(number):
    if number >= 6:
        return number / 5.0
    elif number <= 3:
        return number * 88
    else:
        return np.nan

In [98]:
df["A"].apply(example_custom_function)


Out[98]:
0      0.0
1     88.0
2    176.0
3    264.0
4      NaN
5      NaN
6      1.2
7      1.4
Name: A, dtype: float64

apply on a DataFrame perform a Series-wise operation


In [99]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df


Out[99]:
A B C D
0 -0.055586 0.565785 -1.166555 0.038104
1 0.200634 0.889308 1.126039 -2.620415
2 0.070676 0.895831 -1.826842 0.290249
3 2.361897 1.921316 0.722863 0.139535
4 -1.661117 0.428333 -2.217039 1.021083
5 -0.892498 2.375512 -2.880172 -1.041927
6 -0.016977 -0.522901 0.154441 0.600143
7 -0.356928 0.623010 -1.054379 1.144509

In [100]:
def example_custom_function_on_series_1(series):
    return series / series.sum()

In [101]:
### notice how example_custom_function_on_series_1 return a single value per Series --> transformation
df.apply(example_custom_function_on_series_1)


Out[101]:
A B C D
0 0.158864 0.078842 0.163345 -0.088879
1 -0.573406 0.123925 -0.157672 6.112175
2 -0.201991 0.124834 0.255801 -0.677011
3 -6.750224 0.267735 -0.101218 -0.325468
4 4.747419 0.059688 0.310438 -2.381697
5 2.550730 0.331027 0.403293 2.430318
6 0.048519 -0.072866 -0.021625 -1.399847
7 1.020089 0.086816 0.147638 -2.669591

In [102]:
def example_custom_function_on_series_2(series):
    return series.max() - series.min()

In [103]:
### notice how example_custom_function_on_series_2 return a single value per Series --> aggregation
df.apply(example_custom_function_on_series_2)


Out[103]:
A    4.023014
B    2.898413
C    4.006210
D    3.764924
dtype: float64

Applymap

Element-wise Python functions can be used, too. You can do this with applymap:


In [104]:
formater = lambda x: '%.2f' % x
df.applymap(formater)


Out[104]:
A B C D
0 -0.06 0.57 -1.17 0.04
1 0.20 0.89 1.13 -2.62
2 0.07 0.90 -1.83 0.29
3 2.36 1.92 0.72 0.14
4 -1.66 0.43 -2.22 1.02
5 -0.89 2.38 -2.88 -1.04
6 -0.02 -0.52 0.15 0.60
7 -0.36 0.62 -1.05 1.14

Map

The reason for the name applymap is that Series has a map method for applying an element-wise function:


In [105]:
df["A"].map(formater)


Out[105]:
0    -0.06
1     0.20
2     0.07
3     2.36
4    -1.66
5    -0.89
6    -0.02
7    -0.36
Name: A, dtype: object

Reshaping

Stack

“compresses” a level in the DataFrame’s columns.


In [106]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2


Out[106]:
A B
first second
bar one -0.916427 -1.554786
two 1.606001 -0.414154
baz one -0.338968 0.155801
two 0.569637 -2.387479

In [107]:
stacked = df2.stack()
stacked


Out[107]:
first  second   
bar    one     A   -0.916427
               B   -1.554786
       two     A    1.606001
               B   -0.414154
baz    one     A   -0.338968
               B    0.155801
       two     A    0.569637
               B   -2.387479
dtype: float64

Unstack

the inverse operation of stack() is unstack(), which by default unstacks the last level


In [108]:
stacked.unstack() # in this particular case equivalent to stacked.unstack(2)


Out[108]:
A B
first second
bar one -0.916427 -1.554786
two 1.606001 -0.414154
baz one -0.338968 0.155801
two 0.569637 -2.387479

In [109]:
stacked.unstack(0)


Out[109]:
first bar baz
second
one A -0.916427 -0.338968
B -1.554786 0.155801
two A 1.606001 0.569637
B -0.414154 -2.387479

In [110]:
stacked.unstack(1)


Out[110]:
second one two
first
bar A -0.916427 1.606001
B -1.554786 -0.414154
baz A -0.338968 0.569637
B 0.155801 -2.387479

Pivot

We can produce pivot tables from this data very easily:


In [111]:
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['a', 'b', 'c'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df


Out[111]:
A B C D E
0 one a foo 1.795658 -1.692460
1 one b foo -0.072229 0.080841
2 two c foo 0.579050 -0.484044
3 three a bar 1.802420 0.343524
4 one b bar -0.163139 -1.922711
5 one c bar -0.411838 0.239472
6 two a foo -0.167926 0.779548
7 three b foo 0.091777 -0.745811
8 one c foo 0.629784 0.927211
9 one a bar -1.127097 0.671469
10 two b bar -0.560821 -0.043930
11 three c bar -1.625592 -0.932791

In [112]:
pd.pivot_table(df, values=['D', "E"], index=['A', 'B'], columns=['C'])


Out[112]:
D E
C bar foo bar foo
A B
one a -1.127097 1.795658 0.671469 -1.692460
b -0.163139 -0.072229 -1.922711 0.080841
c -0.411838 0.629784 0.239472 0.927211
three a 1.802420 NaN 0.343524 NaN
b NaN 0.091777 NaN -0.745811
c -1.625592 NaN -0.932791 NaN
two a NaN -0.167926 NaN 0.779548
b -0.560821 NaN -0.043930 NaN
c NaN 0.579050 NaN -0.484044

Melt

“Unpivots” a DataFrame from wide format to long format, optionally leaving identifier variables set. This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.


In [113]:
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 1, 1: 3, 2: 5},
                   'C': {0: 2, 1: 4, 2: 6}})
df


Out[113]:
A B C
0 a 1 2
1 b 3 4
2 c 5 6

In [114]:
pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])


Out[114]:
A variable value
0 a B 1
1 b B 3
2 c B 5
3 a C 2
4 b C 4
5 c C 6

Plotting

On DataFrame, plot() is a convenience to plot all of the columns with labels. Using %matplotlib inline magic function enables plotting within the jupyter notebook cells (instead of e.g. in a separate Qt window).


In [115]:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                  columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.plot()


Out[115]:
<matplotlib.axes._subplots.AxesSubplot at 0x10620d550>

In [116]:
df.boxplot()


Out[116]:
<matplotlib.axes._subplots.AxesSubplot at 0x10620d250>

In [117]:
df.plot(kind="scatter", x="A", y="B")


Out[117]:
<matplotlib.axes._subplots.AxesSubplot at 0x10667f2d0>

In [118]:
df[["C", "D"]].sum().plot(kind="bar", x="C", y="D")


Out[118]:
<matplotlib.axes._subplots.AxesSubplot at 0x10677d290>

In [119]:
axs = pd.scatter_matrix(df, figsize=(16,12), diagonal='kde')


Getting data in and out

reading data from file

tabular data such as CSV or tab-delimited txt files


In [120]:
fn = r"data/Saliva.txt"
df = pd.read_csv(fn, sep='\t')
df.head(3)


Out[120]:
TaxName species Abundance SampleCategory frequency rank
0 Rothia mucilaginosa 43675 9.861078 Caries 10.0 1.0
1 Veillonella atypica 39777 7.030844 Caries 10.0 2.0
2 Prevotella histicola 470565 6.433449 Caries 10.0 3.0

from Excel


In [121]:
xls = pd.ExcelFile('data/microbiome/MID1.xls')
print xls.sheet_names  # see all sheet names
df = xls.parse("Sheet 1", header=None)
df.columns = ["Taxon", "Count"]
df.head()


[u'Sheet 1']
Out[121]:
Taxon Count
0 Archaea "Crenarchaeota" Thermoprotei Desulfuro... 7
1 Archaea "Crenarchaeota" Thermoprotei Desulfuro... 2
2 Archaea "Crenarchaeota" Thermoprotei Sulfoloba... 3
3 Archaea "Crenarchaeota" Thermoprotei Thermopro... 3
4 Archaea "Euryarchaeota" "Methanomicrobia" Meth... 7

a current list of I/O tools:

  • read_csv
  • read_excel
  • read_hdf
  • read_sql
  • read_json
  • read_msgpack (experimental)
  • read_html
  • read_gbq (experimental)
  • read_stata
  • read_sas
  • read_clipboard
  • read_pickle

checkout the documentation http://pandas.pydata.org/pandas-docs/stable/ for more infos

writing data to file


In [122]:
fn_out = r"data/Tidy_data.txt"
df.to_csv(fn_out, sep='\t', header=True, index=False)

Miscellaneous

zip example


In [123]:
xls = pd.ExcelFile('data/microbiome/MID1.xls')
df = xls.parse("Sheet 1", header=None)
df.columns = ["Taxon", "Count"]
df.head(3)


Out[123]:
Taxon Count
0 Archaea "Crenarchaeota" Thermoprotei Desulfuro... 7
1 Archaea "Crenarchaeota" Thermoprotei Desulfuro... 2
2 Archaea "Crenarchaeota" Thermoprotei Sulfoloba... 3

In [124]:
df["superkingdom"], df["rest"] = zip(*df["Taxon"].apply(lambda x: x.split(" ", 1)))
df.head(3)


Out[124]:
Taxon Count superkingdom rest
0 Archaea "Crenarchaeota" Thermoprotei Desulfuro... 7 Archaea "Crenarchaeota" Thermoprotei Desulfurococcales...
1 Archaea "Crenarchaeota" Thermoprotei Desulfuro... 2 Archaea "Crenarchaeota" Thermoprotei Desulfurococcales...
2 Archaea "Crenarchaeota" Thermoprotei Sulfoloba... 3 Archaea "Crenarchaeota" Thermoprotei Sulfolobales Sulf...

In [125]:
df.superkingdom.unique()


Out[125]:
array([u'Archaea', u'Bacteria'], dtype=object)

string processing methods

for Series and Index, to make it easy to operate on each element of the array. These methods exclude missing/NA values automatically. These are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods.


In [126]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s


Out[126]:
0       A
1       B
2       C
3    Aaba
4    Baca
5     NaN
6    CABA
7     dog
8     cat
dtype: object

In [127]:
s.str.lower()


Out[127]:
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

Cut

Return indices of half-open bins to which each value of x belongs. The cut function can be useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges.


In [128]:
df = pd.DataFrame(np.array([.2, 1.4, 2.5, 6.2, 9.7, np.nan, 2.1]), columns=["value"])
df["category"] = pd.cut(df["value"], 3, retbins=False, labels=["good", "medium", "bad"])
df


Out[128]:
value category
0 0.2 good
1 1.4 good
2 2.5 good
3 6.2 medium
4 9.7 bad
5 NaN NaN
6 2.1 good

rpy2 example

call an R function from Python


In [129]:
import rpy2.robjects as robjects
import rpy2.robjects.packages as rpackages
from rpy2.robjects.numpy2ri import numpy2ri
robjects.conversion.py2ri = numpy2ri

In [130]:
dict_ = {"Intensity First"  : [5.0, 2.0, 3.0, 4.0],         
         "Intensity Second" : [4.0, 1.0, 4.0, 2.0],
         "Intensity Third"  : [3.0, 4.0, 6.0, 8.0]}
df = pd.DataFrame(dict_, index=list("ABCD"))
df


Out[130]:
Intensity First Intensity Second Intensity Third
A 5.0 4.0 3.0
B 2.0 1.0 4.0
C 3.0 4.0 6.0
D 4.0 2.0 8.0

In [131]:
def R_function_normalizeQuantiles():
    rpackages.importr('limma')
    normalizeQuantiles = robjects.r['normalizeQuantiles']
    return normalizeQuantiles

In [132]:
def quantile_normalize(df, cols_2_norm):
    """
    :param df: DataFrame
    :param cols_2_norm: ListOfString (Columns to normalize)
    :return: DataFrame
    """
    normalizeQuantiles = R_function_normalizeQuantiles()

    # set Zero to NaN and transform to log-space
    df[cols_2_norm] = df[cols_2_norm].replace(to_replace=0.0, value=np.nan)
    df[cols_2_norm] = np.log10(df[cols_2_norm])

    # quantile normalize
    df[cols_2_norm] = np.array(normalizeQuantiles(df[cols_2_norm].values))

    # Transform back to non-log space and replace NaN with Zero
    df[cols_2_norm] = np.power(10, df[cols_2_norm])
    df[cols_2_norm] = df[cols_2_norm].replace(to_replace=np.nan, value=0.0)

    return df

In [133]:
df_norm = quantile_normalize(df, df.columns.tolist())
df_norm


Out[133]:
Intensity First Intensity Second Intensity Third
A 5.428835 4.985766 1.817121
B 1.817121 1.817121 2.884499
C 2.884499 4.985766 4.578857
D 4.578857 2.884499 5.428835

now let's do some EXERCISES

open Exercises_part_A.ipynb and/or Exercises_part_B.ipynb