Introduction to Pandas

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Our convention for importing pandas:


In [1]:
import pandas as pd
from pandas import Series, DataFrame

Since Series and DataFrame are used frequently, they should be imported directly by name.

Pandas Data Structures

Series

A Series is a one-dimensional array of values paired with an array of labels, called its index.

You create the simplest Series from a plain list, like this:


In [2]:
ps = Series([4, 2, 1, 3])
print(ps)


0    4
1    2
2    1
3    3
dtype: int64

Get the values and the index like this:


In [3]:
print(ps.values)
print(ps.index)
ps[0]


[4 2 1 3]
RangeIndex(start=0, stop=4, step=1)
Out[3]:
4

To use a custom index, do this:


In [4]:
ps2 = Series([4, 7, -1, 8], index=['a', 'b', 'c', 'd'])
ps2


Out[4]:
a    4
b    7
c   -1
d    8
dtype: int64
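
A short sketch of what a custom index allows, assuming the same data as ps2 above. The names single and subset are illustrative only: values can be looked up by label, and a list of labels returns a new Series in the requested order.

```python
import pandas as pd

# Same values and labels as ps2 above
ps2 = pd.Series([4, 7, -1, 8], index=['a', 'b', 'c', 'd'])

# Look up a single value by its label
single = ps2['b']

# Pass a list of labels to get a new Series in that order
subset = ps2[['d', 'a']]

print(single)
print(subset)
```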

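Often, you want to create a Series from a Python dict. Passing a dict such as the one below to Series uses the dict's keys as the index and its values as the data: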


In [5]:
ps3 = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
ps3


Out[5]:
{'Ohio': 35000, 'Oregon': 16000, 'Texas': 71000, 'Utah': 5000}
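
As a minimal sketch of the dict-to-Series step, assuming the same state figures as above: the names sdata, states and states2, and the 'California' label, are made up for illustration. Row order for a dict-built Series can differ between pandas versions, so the example does not rely on it.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

# The dict keys become the index labels, the dict values become the data
states = pd.Series(sdata)

# An explicit index selects and orders the labels;
# a label with no matching key ('California', hypothetical) gets NaN
states2 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

print(states)
print(states2)
```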

DataFrame

A DataFrame represents a tabular structure. It can be thought of as a dict of Series that share the same index.

A DataFrame can be constructed from a dict of equal-length lists:


In [6]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = DataFrame(data)
df


Out[6]:
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

You can control the column order by passing a sequence of column names:


In [7]:
DataFrame(data, columns=['year', 'state', 'pop'])


Out[7]:
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
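
One related behaviour worth knowing, sketched here under the assumption that data is the same dict as above: if the columns list names a column that is not a key in the dict, pandas creates it and fills it with NaN. The column name 'debt' and the variable df_debt are hypothetical.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

# 'debt' is a hypothetical column name that is not a key in data
df_debt = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'])

# The columns appear in the requested order
print(df_debt.columns.tolist())

# Every value in the extra column is missing
print(df_debt['debt'].isnull().all())
```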

In addition to index and values, a DataFrame has a columns attribute:


In [8]:
print(df.index)
print('')
print(df.values)
print('')
print(df.columns)


RangeIndex(start=0, stop=5, step=1)

[[1.5 'Ohio' 2000]
 [1.7 'Ohio' 2001]
 [3.6 'Ohio' 2002]
 [2.4 'Nevada' 2001]
 [2.9 'Nevada' 2002]]

Index([u'pop', u'state', u'year'], dtype='object')

You can get a specific column like this:


In [9]:
df['state']


Out[9]:
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object

Rows can be retrieved by index label using the loc indexer (the older ix indexer was deprecated in pandas 0.20 and removed in 1.0):


In [10]:
df.loc[0]


Out[10]:
pop       1.5
state    Ohio
year     2000
Name: 0, dtype: object
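
Row selection can be label-based or position-based. The sketch below contrasts the two with loc and iloc on a small frame; the string index labels 'a', 'b', 'c' and the names by_label and by_position are assumptions added for illustration, chosen so that labels and positions are visibly different things.

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                   'year': [2000, 2001, 2001],
                   'pop': [1.5, 1.7, 2.4]},
                  index=['a', 'b', 'c'])  # hypothetical string labels

# loc selects by index label
by_label = df.loc['b']

# iloc selects by integer position, counting from 0
by_position = df.iloc[1]

print(by_label)
print(by_position)
```

With the default integer index used elsewhere in this section, label 0 and position 0 happen to refer to the same row.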

Another common way to create a DataFrame is from a nested dict of dicts or a dict of Series. The outer keys become the columns and the inner keys become the row index:


In [11]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
df2 = DataFrame(pop)
df2


Out[11]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
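
A minimal sketch of how the nested dict maps onto rows and columns, assuming the same pop dict as above. The call to sort_index() is an addition so that the row order does not depend on the pandas version, and the names missing and transposed are illustrative.

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Outer keys -> columns, inner keys -> row index.
# sort_index() keeps the row order independent of the pandas version.
df2 = pd.DataFrame(pop).sort_index()

# Nevada has no entry for 2000, so that cell is NaN
missing = pd.isnull(df2.loc[2000, 'Nevada'])

# .T transposes the frame, swapping rows and columns
transposed = df2.T

print(df2)
print(transposed)
```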

You can pass an explicit index when creating a DataFrame. Labels that are missing from the data, such as 2003 here, are filled with NaN:


In [12]:
df3 = DataFrame(pop, index=[2001, 2002, 2003])
df3


Out[12]:
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN

If a DataFrame’s index and columns have their name attributes set, these will also be displayed:


In [13]:
df2.index.name = 'year'
df2.columns.name = 'state'
df2


Out[13]:
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

A third common input structure is a list of dicts or Series, where each element becomes one row:


In [14]:
films = [{'star': 9.3, 'title': 'The Shawshank Redemption', 'content_rating': 'R'},
         {'star': 9.2, 'title': 'The Godfather', 'content_rating': 'R'},
         {'star': 9.1, 'title': 'The Godfather: Part II', 'content_rating': 'R'}]

df3 = DataFrame(films)
df3


Out[14]:
   content_rating  star                     title
0               R   9.3  The Shawshank Redemption
1               R   9.2             The Godfather
2               R   9.1    The Godfather: Part II
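
The dicts in the list do not need identical keys. As a sketch, assuming a shortened and deliberately altered version of the films list (the second record has had its 'content_rating' key removed for illustration, and records and df_films are made-up names), pandas takes the union of the keys as columns and fills the gaps with NaN.

```python
import pandas as pd

# Hypothetical records: the second one has no 'content_rating' key
records = [{'title': 'The Godfather', 'star': 9.2, 'content_rating': 'R'},
           {'title': 'The Godfather: Part II', 'star': 9.1}]

df_films = pd.DataFrame(records)

# The missing key shows up as NaN in that row
print(df_films)
print(df_films['content_rating'].isnull().tolist())
```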

More on DataFrame manipulation will come later.

Reading Tabular Data Files into Pandas

There are two main functions for reading a delimited text file into a DataFrame: read_table and read_csv. They behave the same except for the default separator: read_csv assumes commas, while read_table assumes tabs.
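
The difference in default separators can be sketched without touching the network. The two strings below are hypothetical in-memory stand-ins for a comma-separated and a tab-separated file, wrapped in io.StringIO so the readers treat them like files.

```python
import io
import pandas as pd

# Hypothetical in-memory data standing in for files on disk
csv_text = "order_id,quantity\n1,1\n2,2\n"
tsv_text = "order_id\tquantity\n1\t1\n2\t2\n"

# read_csv splits on commas by default
from_csv = pd.read_csv(io.StringIO(csv_text))

# read_table splits on tabs by default
from_tsv = pd.read_table(io.StringIO(tsv_text))

# read_csv can read tab-separated text too when given the separator
from_tsv2 = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(from_csv)
```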

You can read a data set using read_table like so:


In [15]:
orders = pd.read_table('https://raw.githubusercontent.com/minhhh/charts/master/pandas/data/chipotle.tsv')
orders.head(5)


Out[15]:
   order_id  quantity                              item_name                                 choice_description  item_price
0         1         1           Chips and Fresh Tomato Salsa                                                NaN       $2.39
1         1         1                                   Izze                                       [Clementine]       $3.39
2         1         1                       Nantucket Nectar                                            [Apple]       $3.39
3         1         1  Chips and Tomatillo-Green Chili Salsa                                                NaN       $2.39
4         2         2                           Chicken Bowl  [Tomatillo-Red Chili Salsa (Hot), [Black Beans...      $16.98

A file does not always have a header row. In this case, pass header=None and either let pandas assign default integer column names or specify the column names yourself with names:


In [16]:
users = pd.read_table('https://raw.githubusercontent.com/minhhh/charts/master/pandas/data/u.user', sep='|', header=None)
users.head(5)


Out[16]:
   0   1  2           3      4
0  1  24  M  technician  85711
1  2  53  F       other  94043
2  3  23  M      writer  32067
3  4  24  M  technician  43537
4  5  33  F       other  15213

In [17]:
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users2 = pd.read_table('https://raw.githubusercontent.com/minhhh/charts/master/pandas/data/u.user', sep='|', header=None, names=user_cols)
users2.head(5)


Out[17]:
   user_id  age  gender  occupation  zip_code
0        1   24       M  technician     85711
1        2   53       F       other     94043
2        3   23       M      writer     32067
3        4   24       M  technician     43537
4        5   33       F       other     15213
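
To see why header=None matters, here is a sketch that uses the first two rows shown above as an in-memory string instead of the remote file. The variable names raw, no_flag, default_names and named are illustrative. Without header=None, pandas treats the first data row as the header and that row is lost from the data.

```python
import io
import pandas as pd

# First two rows of the user data above, as an in-memory stand-in
raw = "1|24|M|technician|85711\n2|53|F|other|94043\n"
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']

# Without header=None the first row is consumed as column names
no_flag = pd.read_csv(io.StringIO(raw), sep='|')

# header=None keeps every row and numbers the columns 0..4
default_names = pd.read_csv(io.StringIO(raw), sep='|', header=None)

# names= supplies meaningful column names instead
named = pd.read_csv(io.StringIO(raw), sep='|', header=None, names=user_cols)

print(len(no_flag), len(default_names), len(named))
```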

You can choose a specific column to be the index instead of the default integer index generated by pandas:


In [18]:
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users3 = pd.read_table('https://raw.githubusercontent.com/minhhh/charts/master/pandas/data/u.user', sep='|', header=None, names=user_cols, index_col='user_id')
users3.head(5)


Out[18]:
         age  gender  occupation  zip_code
user_id
1         24       M  technician     85711
2         53       F       other     94043
3         23       M      writer     32067
4         24       M  technician     43537
5         33       F       other     15213
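
Once a column is the index, rows can be looked up by its values. The sketch below again uses two rows of the user data as an in-memory string, which is an assumption made to keep the example self-contained; occupation and restored are illustrative names.

```python
import io
import pandas as pd

# Two rows of the user data above, as an in-memory stand-in
raw = "1|24|M|technician|85711\n2|53|F|other|94043\n"
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']

users = pd.read_csv(io.StringIO(raw), sep='|', header=None,
                    names=user_cols, index_col='user_id')

# Look up a cell by user_id label and column name
occupation = users.loc[2, 'occupation']

# reset_index() turns user_id back into an ordinary column
restored = users.reset_index()

print(occupation)
print(restored.columns.tolist())
```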

Recipes