1 - Pandas Python Data Analysis Library

1.1.1 - Series

1.1.2 - DataFrame

1.1.3 - Panels (3D, 4D, ND)

1.2 - IO Tools

1.2.1 - pd.from_csv

1.2.2 - pd.DataFrame.to_csv

1.2.3 - pd.DataFrame.to_hdf

1.3 - Reshaping

1.4 - Indexing and Selecting Data

1.5 - Group-by and apply

1.5.1 - Applying a function

1.6 - Filtering (Numeric and String)

Usual stuff to import



In [1]:

    
import pandas as pd
import numpy as np
import seaborn as sns

from IPython.display import display, HTML

Pandas Python Data Analysis Library

If you find manipulating dataframes in R a bit too cumbersome, why don't you give Pandas a chance. On top of easy and efficient table management, plotting functionality is pretty great.

Data Structures in Pandas

Data alignment is intrinsic in pandas.

Series

One-dimensional labelled array which can hold any data type (even Python objects).



In [2]:

    
series_one = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
series_one









    Out[2]:





a    1.897624
b   -1.402098
c    1.763193
d    0.739645
e    0.088261
dtype: float64



In [3]:

    
series_two = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5})
series_two









    Out[3]:





a    1
b    2
c    3
d    4
e    5
dtype: int64

Starting from version v0.8.0, pandas supporst non-unique index values

Series is ndarray-like, dick-like, supports vectorized operations and label alignment



In [4]:

    
series_one[2:4]









    Out[4]:





c    1.763193
d    0.739645
dtype: float64



In [5]:

    
series_one['a']









    Out[5]:





1.8976236527526644



In [6]:

    
series_one + series_two









    Out[6]:





a    2.897624
b    0.597902
c    4.763193
d    4.739645
e    5.088261
dtype: float64



In [7]:

    
series_one * 3









    Out[7]:





a    5.692871
b   -4.206294
c    5.289580
d    2.218934
e    0.264783
dtype: float64

DataFrame

DataFrame is a 2-dimensional labelled data structure, like a spreadsheet or SQL table or a dict of Series objects. Obviously, the most used data structure in Pandas and what we'll be discussing more often.



In [8]:

    
df_one = pd.DataFrame({'one': pd.Series(np.random.rand(5), 
                                        index=['a', 'b', 'c', 'd' , 'e']),
                     'two': pd.Series(np.random.rand(4), 
                                      index=['a', 'b', 'c', 'e'])})
df_one

There are several other constructors for creating a DataFrame object

pd.DataFrame.from_records
pd.DataFrame.from_dict
pd.DataFrame.from_items

Other Pandas data objects which we are not going to talk about are

Panels (3D, 4D, ND)

IO Tools

The Pandas I/O API is a set of nice reader functions which generally return a pandas object

pd.from_csv

Some important parameters

sep - Delimiter
index_col - Specifies which column to select as index
usecols - Specify which columns to read when reading a file
compression - Can handle gzip, bz2 compressed text files
comment - Comment character
names - If header=None, you can specify the names of columns
iterator - Return an iterator TextFileReader object



In [9]:

    
iris = pd.read_csv("iris.csv", index_col=0)
iris.head()









    Out[9]:






  
    
      
      sepal_length
      sepal_width
      petal_length
      petal_width
      species
    
  
  
    
      0
      5.1
      3.5
      1.4
      0.2
      setosa
    
    
      1
      4.9
      3.0
      1.4
      0.2
      setosa
    
    
      2
      4.7
      3.2
      1.3
      0.2
      setosa
    
    
      3
      4.6
      3.1
      1.5
      0.2
      setosa
    
    
      4
      5.0
      3.6
      1.4
      0.2
      setosa

Let's see the power of pandas. We'll use Gencode v24 to demonstrate and read the annotation file.



In [10]:

    
url = "ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/gencode.v24.primary_assembly.annotation.gtf.gz"
gencode = pd.read_csv(url, compression="gzip", iterator=True, header=None, 
                      sep="\t", comment="#", quoting=3, 
                      usecols=[0, 1, 2, 3, 4, 6])
gencode.get_chunk(10)

pd.DataFrame.to_csv

Dumps data to a csv file. A lot of optional parameters apply which will help you save the file just like you want.

iris.to_csv("iris_copy.csv")

pd.DataFrame.to_hdf

iris.to_hdf("iris_copy.h5", "df")

Creates a HDF5 file (binary indexed file for faster loading and index filtering during load times). Requires pytables as a depandency if you want to go full on with it's functionality

Reshaping

Almost everyone will be familiar on how much you need to reshape the data if we want to plot it properly. This functionality is also pretty well covered in pandas.

pd.melt



In [11]:

    
planets = pd.read_csv("planets.csv", index_col=0)
planets.head()









    Out[11]:






  
    
      
      method
      number
      orbital_period
      mass
      distance
      year
    
  
  
    
      0
      Radial Velocity
      1
      269.300
      7.10
      77.40
      2006
    
    
      1
      Radial Velocity
      1
      874.774
      2.21
      56.95
      2008
    
    
      2
      Radial Velocity
      1
      763.000
      2.60
      19.84
      2011
    
    
      3
      Radial Velocity
      1
      326.030
      19.40
      110.62
      2007
    
    
      4
      Radial Velocity
      1
      516.220
      10.50
      119.47
      2009



In [12]:

    
planets_melt = pd.melt(planets, id_vars="method")
planets_melt.head()









    Out[12]:






  
    
      
      method
      variable
      value
    
  
  
    
      0
      Radial Velocity
      number
      1
    
    
      1
      Radial Velocity
      number
      1
    
    
      2
      Radial Velocity
      number
      1
    
    
      3
      Radial Velocity
      number
      1
    
    
      4
      Radial Velocity
      number
      1

Indexing and Selecting Data

pd.DataFrame and pd.Series support basic array-like indexing. To get into detail, it's better to use .loc and .iloc



In [13]:

    
heatmap = pd.read_csv("Heatmap.tsv", sep="\t", index_col=0)
heatmap.head(10)



In [14]:

    
heatmap.iloc[4:8]



In [15]:

    
heatmap.loc[['prisons', 'jacks', 'irons']]

Almost forgot, HTML conditional formatting just made it into the latest release `0.17.1` and it's pretty awesome. Use a function to your liking or do it with a background gradient</span>



In [16]:

    
def color_negative_red(val):
    """
    Takes a scalar and returns a string with
    the css property `'color: red'` for negative
    strings, black otherwise.
    """
    color = 'red' if val < 0 else 'black'
    return 'color: %s' % color

# Apply the function like this
heatmap.head(10).style.applymap(color_negative_red)



In [17]:

    
heatmap.head(10).style.background_gradient(cmap="RdBu_r")

Group-by and apply

You can group data (on both axes) based on a criteria. It returns an iterator but you can directly apply a function without the need to iterate through.

Remember though, you'll get a new index based on what you group with if you directly apply the function without iterating over the groups.

pd.DataFrame.groupby



In [18]:

    
# No need to iter through to apply mean based on species
iris_species_grouped = iris.groupby('species')
iris_species_grouped.mean()









    Out[18]:






  
    
      
      sepal_length
      sepal_width
      petal_length
      petal_width
    
    
      species
      
      
      
      
    
  
  
    
      setosa
      5.006
      3.428
      1.462
      0.246
    
    
      versicolor
      5.936
      2.770
      4.260
      1.326
    
    
      virginica
      6.588
      2.974
      5.552
      2.026



In [19]:

    
# The previous iterator has reached it's end, so re-initialize
iris_species_grouped = iris.groupby('species')

for species, group in iris_species_grouped:
    display(HTML(species))
    display(pd.DataFrame(group.mean(axis=0)).T)









    




setosa






    






  
    
      
      sepal_length
      sepal_width
      petal_length
      petal_width
    
  
  
    
      0
      5.006
      3.428
      1.462
      0.246
    
  








    




versicolor






    






  
    
      
      sepal_length
      sepal_width
      petal_length
      petal_width
    
  
  
    
      0
      5.936
      2.77
      4.26
      1.326
    
  








    




virginica






    






  
    
      
      sepal_length
      sepal_width
      petal_length
      petal_width
    
  
  
    
      0
      6.588
      2.974
      5.552
      2.026

Applying a function

pd.DataFrame.apply



In [20]:

    
pd.DataFrame(iris[[0, 1, 2, 3]].apply(np.std, axis=0)).T









    Out[20]:






  
    
      
      sepal_length
      sepal_width
      petal_length
      petal_width
    
  
  
    
      0
      0.825301
      0.434411
      1.759404
      0.759693



In [21]:

    
def add_length_width(x):
    """
    Adds up the length and width of the features and returns
    a pd.Series object so as to get a pd.DataFrame
    """
    sepal_sum = x['sepal_length'] + x['sepal_width']
    petal_sum = x['petal_length'] + x['petal_width']
    return pd.Series([sepal_sum, petal_sum, x['species']], 
                    index=['sepal_sum', 'petal_sum', 'species'])

iris.apply(add_length_width, axis=1).head(5)









    Out[21]:






  
    
      
      sepal_sum
      petal_sum
      species
    
  
  
    
      0
      8.6
      1.6
      setosa
    
    
      1
      7.9
      1.6
      setosa
    
    
      2
      7.9
      1.5
      setosa
    
    
      3
      7.7
      1.7
      setosa
    
    
      4
      8.6
      1.6
      setosa

Filtering (Numeric and String)

There's always need for that. Obviously needed float & int filters but exceptional string filtering options baked in... So much good stuff this..

Inside the pd.DataFrame.loc, you can specify and (&), or (|), not (~) as logical operators. This stuff works and is tested ;)

>, <, >=, <=
str.contains, str.startswith, str.endswith



In [22]:

    
iris.loc[iris.sepal_width > 3.5]









    Out[22]:






  
    
      
      sepal_length
      sepal_width
      petal_length
      petal_width
      species
    
  
  
    
      4
      5.0
      3.6
      1.4
      0.2
      setosa
    
    
      5
      5.4
      3.9
      1.7
      0.4
      setosa
    
    
      10
      5.4
      3.7
      1.5
      0.2
      setosa
    
    
      14
      5.8
      4.0
      1.2
      0.2
      setosa
    
    
      15
      5.7
      4.4
      1.5
      0.4
      setosa
    
    
      16
      5.4
      3.9
      1.3
      0.4
      setosa
    
    
      18
      5.7
      3.8
      1.7
      0.3
      setosa
    
    
      19
      5.1
      3.8
      1.5
      0.3
      setosa
    
    
      21
      5.1
      3.7
      1.5
      0.4
      setosa
    
    
      22
      4.6
      3.6
      1.0
      0.2
      setosa
    
    
      32
      5.2
      4.1
      1.5
      0.1
      setosa
    
    
      33
      5.5
      4.2
      1.4
      0.2
      setosa
    
    
      37
      4.9
      3.6
      1.4
      0.1
      setosa
    
    
      44
      5.1
      3.8
      1.9
      0.4
      setosa
    
    
      46
      5.1
      3.8
      1.6
      0.2
      setosa
    
    
      48
      5.3
      3.7
      1.5
      0.2
      setosa
    
    
      109
      7.2
      3.6
      6.1
      2.5
      virginica
    
    
      117
      7.7
      3.8
      6.7
      2.2
      virginica
    
    
      131
      7.9
      3.8
      6.4
      2.0
      virginica



In [23]:

    
iris.loc[(iris.sepal_width > 3.5) & (iris.species == 'virginica')]









    Out[23]:






  
    
      
      sepal_length
      sepal_width
      petal_length
      petal_width
      species
    
  
  
    
      109
      7.2
      3.6
      6.1
      2.5
      virginica
    
    
      117
      7.7
      3.8
      6.7
      2.2
      virginica
    
    
      131
      7.9
      3.8
      6.4
      2.0
      virginica



In [24]:

    
heatmap.loc[heatmap.index.str.contains("due|ver|ap")]









    Out[24]:






  
    
      
      sense_0
      sense_1
      sense_2
      sense_3
      wafers_0
      wafers_1
      wafers_2
      wafers_3
    
  
  
    
      residues
      1.339411
      0.581236
      0.321386
      1.305148
      -2.621799
      -2.215848
      -2.040509
      -1.684054
    
    
      maneuvers
      -2.752813
      -3.886084
      -3.138448
      -4.183682
      -1.930908
      -1.731554
      -1.158437
      -2.105596
    
    
      photograph
      -0.464363
      -1.009428
      -0.641627
      -1.003007
      -1.187375
      -1.808720
      -0.884953
      -1.783674
    
    
      lapses
      -1.126825
      -1.576013
      -0.658668
      -1.012810
      0.276163
      -0.881380
      -0.340348
      -0.586810
    
    
      advertisement
      1.153506
      2.456341
      3.059485
      1.631144
      -0.027503
      0.640562
      1.308691
      -1.703689
    
    
      shape
      -0.451413
      0.535117
      0.531269
      -0.389788
      -2.545310
      -0.865711
      -1.586384
      -3.111963
    
    
      slap
      3.388655
      3.002504
      2.907908
      3.311611
      -0.502354
      -0.821947
      0.385204
      -0.338511

There is a ton of stuff that can be done in Pandas. The online docs is super detailed and amazing. Explore, search, stack overflow it and most probably you'll get what you're looking for. The current version docs (as of this talk) Pandas v0.17.1

Things that I can't cover because of the time constraints

Plotting - Uses matplotlib as the backend and makes big data analyses/visualization quicker
Lots of mathematical functions to easily use in day to day life

	one	two
a	0.396618	0.835934
b	0.785182	0.740628
c	0.806272	0.258771
d	0.234836	NaN
e	0.787505	0.917104

	0	1	2	3	4	6
0	chr1	HAVANA	gene	11869	14409	+
1	chr1	HAVANA	transcript	11869	14409	+
2	chr1	HAVANA	exon	11869	12227	+
3	chr1	HAVANA	exon	12613	12721	+
4	chr1	HAVANA	exon	13221	14409	+
5	chr1	HAVANA	transcript	12010	13670	+
6	chr1	HAVANA	exon	12010	12057	+
7	chr1	HAVANA	exon	12179	12227	+
8	chr1	HAVANA	exon	12613	12697	+
9	chr1	HAVANA	exon	12975	13052	+

	sense_0	sense_1	sense_2	sense_3	wafers_0	wafers_1	wafers_2	wafers_3
residues	1.339411	0.581236	0.321386	1.305148	-2.621799	-2.215848	-2.040509	-1.684054
confidence	0.472581	0.219921	0.821433	0.619075	-4.081257	-5.111464	-4.910159	-6.187502
prisons	-0.363045	-0.293318	-1.706427	-0.911326	-3.646697	-2.481388	-2.820728	-2.649331
pressure	1.482086	2.047276	2.335837	1.024684	-1.304947	0.452421	-0.533841	-0.326147
feeder	-0.859722	0.149037	-0.439567	-3.505245	-3.095876	-3.806153	-3.771297	-4.369893
maneuvers	-2.752813	-3.886084	-3.138448	-4.183682	-1.930908	-1.731554	-1.158437	-2.105596
signatures	-0.885476	-1.677831	1.203428	-2.455702	1.571333	2.282688	2.750340	0.950576
jacks	4.984442	5.311316	5.373420	5.457087	2.868453	3.221334	3.685987	3.396676
irons	-3.453915	-0.457893	-2.560719	-1.945431	-4.068244	-2.694247	-3.386339	-1.385304
mate	-1.767560	-1.167568	-0.981631	-1.671402	-3.490988	-3.380942	-2.574154	-4.517326

	sense_0	sense_1	sense_2	sense_3	wafers_0	wafers_1	wafers_2	wafers_3
feeder	-0.859722	0.149037	-0.439567	-3.505245	-3.095876	-3.806153	-3.771297	-4.369893
maneuvers	-2.752813	-3.886084	-3.138448	-4.183682	-1.930908	-1.731554	-1.158437	-2.105596
signatures	-0.885476	-1.677831	1.203428	-2.455702	1.571333	2.282688	2.750340	0.950576
jacks	4.984442	5.311316	5.373420	5.457087	2.868453	3.221334	3.685987	3.396676

	sense_0	sense_1	sense_2	sense_3	wafers_0	wafers_1	wafers_2	wafers_3
prisons	-0.363045	-0.293318	-1.706427	-0.911326	-3.646697	-2.481388	-2.820728	-2.649331
jacks	4.984442	5.311316	5.373420	5.457087	2.868453	3.221334	3.685987	3.396676
irons	-3.453915	-0.457893	-2.560719	-1.945431	-4.068244	-2.694247	-3.386339	-1.385304

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

	method	number	orbital_period	mass	distance	year
0	Radial Velocity	1	269.300	7.10	77.40	2006
1	Radial Velocity	1	874.774	2.21	56.95	2008
2	Radial Velocity	1	763.000	2.60	19.84	2011
3	Radial Velocity	1	326.030	19.40	110.62	2007
4	Radial Velocity	1	516.220	10.50	119.47	2009

	method	variable	value
0	Radial Velocity	number	1
1	Radial Velocity	number	1
2	Radial Velocity	number	1
3	Radial Velocity	number	1
4	Radial Velocity	number	1

	sepal_length	sepal_width	petal_length	petal_width
species
setosa	5.006	3.428	1.462	0.246
versicolor	5.936	2.770	4.260	1.326
virginica	6.588	2.974	5.552	2.026

Table of Contents