2016-01-28_Pandas


Usual stuff to import


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

from IPython.display import display, HTML

Pandas Python Data Analysis Library

If you find manipulating dataframes in R a bit too cumbersome, why don't you give Pandas a chance. On top of easy and efficient table management, plotting functionality is pretty great.

Data Structures in Pandas

  • Data alignment is intrinsic in pandas.

Series

One-dimensional labelled array which can hold any data type (even Python objects).


In [2]:
series_one = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
series_one


Out[2]:
a    1.897624
b   -1.402098
c    1.763193
d    0.739645
e    0.088261
dtype: float64

In [3]:
series_two = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5})
series_two


Out[3]:
a    1
b    2
c    3
d    4
e    5
dtype: int64
  • Starting from version v0.8.0, pandas supporst non-unique index values

Series is ndarray-like, dick-like, supports vectorized operations and label alignment


In [4]:
series_one[2:4]


Out[4]:
c    1.763193
d    0.739645
dtype: float64

In [5]:
series_one['a']


Out[5]:
1.8976236527526644

In [6]:
series_one + series_two


Out[6]:
a    2.897624
b    0.597902
c    4.763193
d    4.739645
e    5.088261
dtype: float64

In [7]:
series_one * 3


Out[7]:
a    5.692871
b   -4.206294
c    5.289580
d    2.218934
e    0.264783
dtype: float64

DataFrame

DataFrame is a 2-dimensional labelled data structure, like a spreadsheet or SQL table or a dict of Series objects. Obviously, the most used data structure in Pandas and what we'll be discussing more often.


In [8]:
df_one = pd.DataFrame({'one': pd.Series(np.random.rand(5), 
                                        index=['a', 'b', 'c', 'd' , 'e']),
                     'two': pd.Series(np.random.rand(4), 
                                      index=['a', 'b', 'c', 'e'])})
df_one


Out[8]:
one two
a 0.396618 0.835934
b 0.785182 0.740628
c 0.806272 0.258771
d 0.234836 NaN
e 0.787505 0.917104

There are several other constructors for creating a DataFrame object

  • pd.DataFrame.from_records
  • pd.DataFrame.from_dict
  • pd.DataFrame.from_items

Other Pandas data objects which we are not going to talk about are

Panels (3D, 4D, ND)

IO Tools

The Pandas I/O API is a set of nice reader functions which generally return a pandas object

pd.from_csv

Some important parameters

  • sep - Delimiter
  • index_col - Specifies which column to select as index
  • usecols - Specify which columns to read when reading a file
  • compression - Can handle gzip, bz2 compressed text files
  • comment - Comment character
  • names - If header=None, you can specify the names of columns
  • iterator - Return an iterator TextFileReader object

In [9]:
iris = pd.read_csv("iris.csv", index_col=0)
iris.head()


Out[9]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Let's see the power of pandas. We'll use Gencode v24 to demonstrate and read the annotation file.


In [10]:
url = "ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/gencode.v24.primary_assembly.annotation.gtf.gz"
gencode = pd.read_csv(url, compression="gzip", iterator=True, header=None, 
                      sep="\t", comment="#", quoting=3, 
                      usecols=[0, 1, 2, 3, 4, 6])
gencode.get_chunk(10)


Out[10]:
0 1 2 3 4 6
0 chr1 HAVANA gene 11869 14409 +
1 chr1 HAVANA transcript 11869 14409 +
2 chr1 HAVANA exon 11869 12227 +
3 chr1 HAVANA exon 12613 12721 +
4 chr1 HAVANA exon 13221 14409 +
5 chr1 HAVANA transcript 12010 13670 +
6 chr1 HAVANA exon 12010 12057 +
7 chr1 HAVANA exon 12179 12227 +
8 chr1 HAVANA exon 12613 12697 +
9 chr1 HAVANA exon 12975 13052 +

pd.DataFrame.to_csv

Dumps data to a csv file. A lot of optional parameters apply which will help you save the file just like you want.

iris.to_csv("iris_copy.csv")

pd.DataFrame.to_hdf

iris.to_hdf("iris_copy.h5", "df")

Creates a HDF5 file (binary indexed file for faster loading and index filtering during load times). Requires pytables as a depandency if you want to go full on with it's functionality

Reshaping

Almost everyone will be familiar on how much you need to reshape the data if we want to plot it properly. This functionality is also pretty well covered in pandas.

  • pd.melt

In [11]:
planets = pd.read_csv("planets.csv", index_col=0)
planets.head()


Out[11]:
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009

In [12]:
planets_melt = pd.melt(planets, id_vars="method")
planets_melt.head()


Out[12]:
method variable value
0 Radial Velocity number 1
1 Radial Velocity number 1
2 Radial Velocity number 1
3 Radial Velocity number 1
4 Radial Velocity number 1

Indexing and Selecting Data

pd.DataFrame and pd.Series support basic array-like indexing. To get into detail, it's better to use .loc and .iloc


In [13]:
heatmap = pd.read_csv("Heatmap.tsv", sep="\t", index_col=0)
heatmap.head(10)


Out[13]:
sense_0 sense_1 sense_2 sense_3 wafers_0 wafers_1 wafers_2 wafers_3
residues 1.339411 0.581236 0.321386 1.305148 -2.621799 -2.215848 -2.040509 -1.684054
confidence 0.472581 0.219921 0.821433 0.619075 -4.081257 -5.111464 -4.910159 -6.187502
prisons -0.363045 -0.293318 -1.706427 -0.911326 -3.646697 -2.481388 -2.820728 -2.649331
pressure 1.482086 2.047276 2.335837 1.024684 -1.304947 0.452421 -0.533841 -0.326147
feeder -0.859722 0.149037 -0.439567 -3.505245 -3.095876 -3.806153 -3.771297 -4.369893
maneuvers -2.752813 -3.886084 -3.138448 -4.183682 -1.930908 -1.731554 -1.158437 -2.105596
signatures -0.885476 -1.677831 1.203428 -2.455702 1.571333 2.282688 2.750340 0.950576
jacks 4.984442 5.311316 5.373420 5.457087 2.868453 3.221334 3.685987 3.396676
irons -3.453915 -0.457893 -2.560719 -1.945431 -4.068244 -2.694247 -3.386339 -1.385304
mate -1.767560 -1.167568 -0.981631 -1.671402 -3.490988 -3.380942 -2.574154 -4.517326

In [14]:
heatmap.iloc[4:8]


Out[14]:
sense_0 sense_1 sense_2 sense_3 wafers_0 wafers_1 wafers_2 wafers_3
feeder -0.859722 0.149037 -0.439567 -3.505245 -3.095876 -3.806153 -3.771297 -4.369893
maneuvers -2.752813 -3.886084 -3.138448 -4.183682 -1.930908 -1.731554 -1.158437 -2.105596
signatures -0.885476 -1.677831 1.203428 -2.455702 1.571333 2.282688 2.750340 0.950576
jacks 4.984442 5.311316 5.373420 5.457087 2.868453 3.221334 3.685987 3.396676

In [15]:
heatmap.loc[['prisons', 'jacks', 'irons']]


Out[15]:
sense_0 sense_1 sense_2 sense_3 wafers_0 wafers_1 wafers_2 wafers_3
prisons -0.363045 -0.293318 -1.706427 -0.911326 -3.646697 -2.481388 -2.820728 -2.649331
jacks 4.984442 5.311316 5.373420 5.457087 2.868453 3.221334 3.685987 3.396676
irons -3.453915 -0.457893 -2.560719 -1.945431 -4.068244 -2.694247 -3.386339 -1.385304

Almost forgot, HTML conditional formatting just made it into the latest release `0.17.1` and it's pretty awesome. Use a function to your liking or do it with a background gradient</span>


In [16]:
def color_negative_red(val):
    """
    Takes a scalar and returns a string with
    the css property `'color: red'` for negative
    strings, black otherwise.
    """
    color = 'red' if val < 0 else 'black'
    return 'color: %s' % color

# Apply the function like this
heatmap.head(10).style.applymap(color_negative_red)


Out[16]:
sense_0 sense_1 sense_2 sense_3 wafers_0 wafers_1 wafers_2 wafers_3
residues 1.339411 0.581236 0.321386 1.305148 -2.621799 -2.215848 -2.040509 -1.684054
confidence 0.472581 0.219921 0.821433 0.619075 -4.081257 -5.111464 -4.910159 -6.187502
prisons -0.363045 -0.293318 -1.706427 -0.911326 -3.646697 -2.481388 -2.820728 -2.649331
pressure 1.482086 2.047276 2.335837 1.024684 -1.304947 0.452421 -0.533841 -0.326147
feeder -0.859722 0.149037 -0.439567 -3.505245 -3.095876 -3.806153 -3.771297 -4.369893
maneuvers -2.752813 -3.886084 -3.138448 -4.183682 -1.930908 -1.731554 -1.158437 -2.105596
signatures -0.885476 -1.677831 1.203428 -2.455702 1.571333 2.282688 2.75034 0.950576
jacks 4.984442 5.311316 5.37342 5.457087 2.868453 3.221334 3.685987 3.396676
irons -3.453915 -0.457893 -2.560719 -1.945431 -4.068244 -2.694247 -3.386339 -1.385304
mate -1.76756 -1.167568 -0.981631 -1.671402 -3.490988 -3.380942 -2.574154 -4.517326

In [17]:
heatmap.head(10).style.background_gradient(cmap="RdBu_r")


Out[17]:
sense_0 sense_1 sense_2 sense_3 wafers_0 wafers_1 wafers_2 wafers_3
residues 1.339411 0.581236 0.321386 1.305148 -2.621799 -2.215848 -2.040509 -1.684054
confidence 0.472581 0.219921 0.821433 0.619075 -4.081257 -5.111464 -4.910159 -6.187502
prisons -0.363045 -0.293318 -1.706427 -0.911326 -3.646697 -2.481388 -2.820728 -2.649331
pressure 1.482086 2.047276 2.335837 1.024684 -1.304947 0.452421 -0.533841 -0.326147
feeder -0.859722 0.149037 -0.439567 -3.505245 -3.095876 -3.806153 -3.771297 -4.369893
maneuvers -2.752813 -3.886084 -3.138448 -4.183682 -1.930908 -1.731554 -1.158437 -2.105596
signatures -0.885476 -1.677831 1.203428 -2.455702 1.571333 2.282688 2.75034 0.950576
jacks 4.984442 5.311316 5.37342 5.457087 2.868453 3.221334 3.685987 3.396676
irons -3.453915 -0.457893 -2.560719 -1.945431 -4.068244 -2.694247 -3.386339 -1.385304
mate -1.76756 -1.167568 -0.981631 -1.671402 -3.490988 -3.380942 -2.574154 -4.517326

Group-by and apply

You can group data (on both axes) based on a criteria. It returns an iterator but you can directly apply a function without the need to iterate through.

Remember though, you'll get a new index based on what you group with if you directly apply the function without iterating over the groups.

  • pd.DataFrame.groupby

In [18]:
# No need to iter through to apply mean based on species
iris_species_grouped = iris.groupby('species')
iris_species_grouped.mean()


Out[18]:
sepal_length sepal_width petal_length petal_width
species
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026

In [19]:
# The previous iterator has reached it's end, so re-initialize
iris_species_grouped = iris.groupby('species')

for species, group in iris_species_grouped:
    display(HTML(species))
    display(pd.DataFrame(group.mean(axis=0)).T)


setosa
sepal_length sepal_width petal_length petal_width
0 5.006 3.428 1.462 0.246
versicolor
sepal_length sepal_width petal_length petal_width
0 5.936 2.77 4.26 1.326
virginica
sepal_length sepal_width petal_length petal_width
0 6.588 2.974 5.552 2.026

Applying a function

  • pd.DataFrame.apply

In [20]:
pd.DataFrame(iris[[0, 1, 2, 3]].apply(np.std, axis=0)).T


Out[20]:
sepal_length sepal_width petal_length petal_width
0 0.825301 0.434411 1.759404 0.759693

In [21]:
def add_length_width(x):
    """
    Adds up the length and width of the features and returns
    a pd.Series object so as to get a pd.DataFrame
    """
    sepal_sum = x['sepal_length'] + x['sepal_width']
    petal_sum = x['petal_length'] + x['petal_width']
    return pd.Series([sepal_sum, petal_sum, x['species']], 
                    index=['sepal_sum', 'petal_sum', 'species'])

iris.apply(add_length_width, axis=1).head(5)


Out[21]:
sepal_sum petal_sum species
0 8.6 1.6 setosa
1 7.9 1.6 setosa
2 7.9 1.5 setosa
3 7.7 1.7 setosa
4 8.6 1.6 setosa

Filtering (Numeric and String)

There's always need for that. Obviously needed float & int filters but exceptional string filtering options baked in... So much good stuff this..

Inside the pd.DataFrame.loc, you can specify and (&), or (|), not (~) as logical operators. This stuff works and is tested ;)

  • >, <, >=, <=
  • str.contains, str.startswith, str.endswith

In [22]:
iris.loc[iris.sepal_width > 3.5]


Out[22]:
sepal_length sepal_width petal_length petal_width species
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
10 5.4 3.7 1.5 0.2 setosa
14 5.8 4.0 1.2 0.2 setosa
15 5.7 4.4 1.5 0.4 setosa
16 5.4 3.9 1.3 0.4 setosa
18 5.7 3.8 1.7 0.3 setosa
19 5.1 3.8 1.5 0.3 setosa
21 5.1 3.7 1.5 0.4 setosa
22 4.6 3.6 1.0 0.2 setosa
32 5.2 4.1 1.5 0.1 setosa
33 5.5 4.2 1.4 0.2 setosa
37 4.9 3.6 1.4 0.1 setosa
44 5.1 3.8 1.9 0.4 setosa
46 5.1 3.8 1.6 0.2 setosa
48 5.3 3.7 1.5 0.2 setosa
109 7.2 3.6 6.1 2.5 virginica
117 7.7 3.8 6.7 2.2 virginica
131 7.9 3.8 6.4 2.0 virginica

In [23]:
iris.loc[(iris.sepal_width > 3.5) & (iris.species == 'virginica')]


Out[23]:
sepal_length sepal_width petal_length petal_width species
109 7.2 3.6 6.1 2.5 virginica
117 7.7 3.8 6.7 2.2 virginica
131 7.9 3.8 6.4 2.0 virginica

In [24]:
heatmap.loc[heatmap.index.str.contains("due|ver|ap")]


Out[24]:
sense_0 sense_1 sense_2 sense_3 wafers_0 wafers_1 wafers_2 wafers_3
residues 1.339411 0.581236 0.321386 1.305148 -2.621799 -2.215848 -2.040509 -1.684054
maneuvers -2.752813 -3.886084 -3.138448 -4.183682 -1.930908 -1.731554 -1.158437 -2.105596
photograph -0.464363 -1.009428 -0.641627 -1.003007 -1.187375 -1.808720 -0.884953 -1.783674
lapses -1.126825 -1.576013 -0.658668 -1.012810 0.276163 -0.881380 -0.340348 -0.586810
advertisement 1.153506 2.456341 3.059485 1.631144 -0.027503 0.640562 1.308691 -1.703689
shape -0.451413 0.535117 0.531269 -0.389788 -2.545310 -0.865711 -1.586384 -3.111963
slap 3.388655 3.002504 2.907908 3.311611 -0.502354 -0.821947 0.385204 -0.338511

There is a ton of stuff that can be done in Pandas. The online docs is super detailed and amazing. Explore, search, stack overflow it and most probably you'll get what you're looking for. The current version docs (as of this talk) Pandas v0.17.1

Things that I can't cover because of the time constraints

  • Plotting - Uses matplotlib as the backend and makes big data analyses/visualization quicker
  • Lots of mathematical functions to easily use in day to day life