Pandas Overview

  • Data analysis package for use with Python
  • Built around two fundamental data structures
    • 1D - Series
    • 2D - DataFrame
  • Makes it easy to work with labeled (relational) data, such as SQL tables or Excel spreadsheets
  • Documentation: http://pandas.pydata.org/pandas-docs/version/0.19.1/

Import the Pandas Package


In [190]:
import pandas as pd
from pandas import Series, DataFrame

Series

A Series is a one-dimensional object similar to an array. It pairs an array of data with an array of data labels, called the index.

Create series1 and display its index and values:


In [191]:
series1 = Series([1,2,3,4,5])
series1.name = 'MyFirstSeries'
series1.index.name = 'Indx'
series1


Out[191]:
Indx
0    1
1    2
2    3
3    4
4    5
Name: MyFirstSeries, dtype: int64

In [192]:
series1.index


Out[192]:
RangeIndex(start=0, stop=5, step=1, name='Indx')

In [193]:
series1.values


Out[193]:
array([1, 2, 3, 4, 5], dtype=int64)

Create series2 with custom indices:


In [194]:
series2 = Series([10,20,30], index=['a','b','c'])
series2


Out[194]:
a    10
b    20
c    30
dtype: int64

Series indexing and operations


In [195]:
series2.iloc[0] == series2['a']     #compare lookup by position with lookup by label


Out[195]:
True

In [196]:
series1[series1 > 3]           #get values greater than 3


Out[196]:
Indx
3    4
4    5
Name: MyFirstSeries, dtype: int64

In [197]:
series2 / 2                    #scalar division of a series


Out[197]:
a     5.0
b    10.0
c    15.0
dtype: float64

In [198]:
series2.isnull()               #check for nulls (two equivalent forms)
pd.isnull(series2)


Out[198]:
a    False
b    False
c    False
dtype: bool
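
Since series2 has no missing values, every entry above is False. A minimal sketch of the same check on a Series that does contain a gap; the name with_gap and its values are illustrative and not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Illustrative Series with one missing value (np.nan) at label 'b'.
with_gap = pd.Series([10, np.nan, 30], index=['a', 'b', 'c'])

# isnull() returns a boolean mask that is True where a value is missing.
mask = with_gap.isnull()

# Summing the mask counts the missing entries.
n_missing = int(mask.sum())

# dropna() returns a copy without the missing entries.
cleaned = with_gap.dropna()

print(mask)
print(n_missing)
print(cleaned)
```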

In [199]:
labels = ['c','a','b']         #get series values by passing in a list of labels
series2[labels]


Out[199]:
c    30
a    10
b    20
dtype: int64

In [200]:
series1 + series2


Out[200]:
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
a   NaN
b   NaN
c   NaN
dtype: float64
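
The all-NaN result comes from index alignment: arithmetic between two Series matches values by label, and series1 (labels 0 to 4) shares no labels with series2 (labels a to c). A small sketch with partially overlapping labels may make this clearer; the names left and right and their values are assumptions chosen for illustration:

```python
import pandas as pd

# Two illustrative Series that share only the labels 'b' and 'c'.
left = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
right = pd.Series([1, 2, 3], index=['b', 'c', 'd'])

# Plain addition aligns on labels; a label present on only one side gives NaN.
aligned_sum = left + right

# Series.add with fill_value=0 treats a label missing on one side as 0 instead.
filled_sum = left.add(right, fill_value=0)

print(aligned_sum)
print(filled_sum)
```

Whether NaN or a fill value is the better choice depends on whether a missing label should count as "unknown" or as zero.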

Import a DataFrame from a File


In [201]:
df = pd.read_csv("data1.csv", header=None)     #import csv file

In [202]:
df.columns = ["ID","Name", "Birthday"]

In [203]:
df.head()                                      #view top 5 lines


Out[203]:
   ID     Name  Birthday
0  10     Matt    14-Feb
1  20    Sammy    23-Apr
2  30     Ravi    24-Aug
3  40   Donald    12-May
4  50  Bridget     9-Mar
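
The column names can also be supplied at read time through the names parameter of read_csv, which avoids the separate assignment to df.columns. A minimal sketch, assuming the file holds comma-separated rows like those displayed above; the inline text stands in for data1.csv, whose raw contents are not shown in this notebook:

```python
import io
import pandas as pd

# Stand-in for data1.csv (assumed layout: no header row, three comma-separated fields).
csv_text = "10,Matt,14-Feb\n20,Sammy,23-Apr\n30,Ravi,24-Aug\n"

# header=None says the file has no header row; names supplies the column labels.
df_named = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["ID", "Name", "Birthday"],
)

print(df_named)
```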

In [204]:
df.tail(1)                                     #view last line


Out[204]:
    ID  Name  Birthday
9  100   Tom    30-Aug

DataFrame Indexing


In [205]:
df['Name']                 #select a column by name (two equivalent forms)
df.Name


Out[205]:
0       Matt
1      Sammy
2       Ravi
3     Donald
4    Bridget
5      Brett
6        Pam
7        Jon
8        Max
9        Tom
Name: Name, dtype: object

In [206]:
df.iloc[0]                 #select a row by integer position


Out[206]:
ID              10
Name          Matt
Birthday    14-Feb
Name: 0, dtype: object
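
Rows can be selected either by integer position or by index label. With the default 0, 1, 2, ... index the two coincide, so the difference only shows once the index carries other labels. A short sketch using a hypothetical frame named people with made-up string labels:

```python
import pandas as pd

# Hypothetical frame whose index holds string labels instead of 0, 1, 2.
people = pd.DataFrame(
    {'ID': [10, 20, 30], 'Name': ['Matt', 'Sammy', 'Ravi']},
    index=['x', 'y', 'z'],
)

# iloc selects by integer position: the first row, whatever its label is.
first_by_position = people.iloc[0]

# loc selects by index label: the row labeled 'x'.
first_by_label = people.loc['x']

print(first_by_position)
print(first_by_label)
```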

In [207]:
df[df['ID'] < 60]          #index based on values


Out[207]:
   ID     Name  Birthday
0  10     Matt    14-Feb
1  20    Sammy    23-Apr
2  30     Ravi    24-Aug
3  40   Donald    12-May
4  50  Bridget     9-Mar
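
Value-based selection works through a boolean mask, and several conditions can be combined with & (and) or | (or). Each condition needs its own parentheses because these operators bind more tightly than the comparisons. A minimal sketch on a small stand-in frame; df_small and its rows are illustrative:

```python
import pandas as pd

# Small stand-in frame with the same kind of columns as df.
df_small = pd.DataFrame(
    {'ID': [10, 20, 30, 40], 'Name': ['Matt', 'Sammy', 'Ravi', 'Donald']}
)

# Combine two conditions element-wise; the parentheses around each are required.
mask = (df_small['ID'] > 10) & (df_small['ID'] < 40)

# Indexing with the mask keeps only the rows where it is True.
subset = df_small[mask]

print(subset)
```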

Create a new column with a lambda function that adds 1 to each value in 'ID':


In [208]:
df['ID+1'] = df.apply(lambda row: row['ID'] + 1, axis=1)
df


Out[208]:
    ID     Name  Birthday  ID+1
0   10     Matt    14-Feb    11
1   20    Sammy    23-Apr    21
2   30     Ravi    24-Aug    31
3   40   Donald    12-May    41
4   50  Bridget     9-Mar    51
5   60    Brett     8-Sep    61
6   70      Pam    22-Dec    71
7   80      Jon     6-Jun    81
8   90      Max     7-Jan    91
9  100      Tom    30-Aug   101
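
For simple arithmetic like this, a row-wise apply is not strictly needed: operating on the column directly gives the same values and is usually faster, since it avoids calling a Python function once per row. A sketch comparing the two on a small stand-in frame; df_demo is an illustrative name, not the notebook's df:

```python
import pandas as pd

# Stand-in for the first three rows of df.
df_demo = pd.DataFrame({'ID': [10, 20, 30], 'Name': ['Matt', 'Sammy', 'Ravi']})

# Row-wise apply, as in the cell above: the lambda runs once per row.
via_apply = df_demo.apply(lambda row: row['ID'] + 1, axis=1)

# Vectorized form: the addition is applied to the whole column at once.
via_vector = df_demo['ID'] + 1

print(via_apply.tolist())
print(via_vector.tolist())
```

A row-wise apply remains useful when the computation needs several columns and has no direct vectorized equivalent.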