Introduction to Pandas

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Our convention for importing pandas:



In [1]:

    
import pandas as pd
from pandas import Series, DataFrame

Since Series and DataFrame are used frequently, they should be imported directly by name.

Pandas Data Structures

Series

A Series is a one-dimensional array of values paired with an index of labels.

You can create the simplest Series like this:



In [2]:

    
ps = Series([4, 2, 1, 3])
print(ps)









    



0    4
1    2
2    1
3    3
dtype: int64

Get the values and the index like this:



In [3]:

    
print(ps.values)
print(ps.index)
ps[0]









    



[4 2 1 3]
RangeIndex(start=0, stop=4, step=1)






    Out[3]:





4

To use a custom index, do this:



In [4]:

    
ps2 = Series([4, 7, -1, 8], index=['a', 'b', 'c', 'd'])
ps2









    Out[4]:





a    4
b    7
c   -1
d    8
dtype: int64
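
As a small illustration of what a custom index makes possible, the sketch below selects values by label instead of by position. It rebuilds ps2 from the cell above; the names value_b, subset and positive are introduced here only for the example.

```python
import pandas as pd

ps2 = pd.Series([4, 7, -1, 8], index=['a', 'b', 'c', 'd'])

# A single label returns the scalar stored under that label.
value_b = ps2['b']

# A list of labels returns a new Series in the requested order.
subset = ps2[['d', 'a']]

# A boolean mask keeps the labels of the entries that pass the test.
positive = ps2[ps2 > 0]

print(value_b)
print(subset)
print(positive)
```

The labels travel with the values, so a filtered or reordered Series still tells you where each value came from.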

Often, you will want to create a Series from a Python dict. The dict keys become the index:



In [5]:

    
ps3 = Series({'Ohio': 35000, 'Oregon': 16000, 'Texas': 71000, 'Utah': 5000})
ps3



    Out[5]:



Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

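When building a Series from a dict you can also pass an explicit index. The following is a minimal sketch of that behavior; the list states (including 'California', which is not in the dict) is a made-up example and not part of the original data.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

# Hypothetical list of labels: 'California' has no entry in sdata,
# and 'Utah' is deliberately left out.
states = ['California', 'Ohio', 'Oregon', 'Texas']

ps4 = pd.Series(sdata, index=states)

# Labels without a matching dict key are filled with NaN.
missing = ps4.isnull()

print(ps4)
print(missing)
```

Only the labels listed in the index are kept, and the values become floats because NaN cannot be stored in an integer column.
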
DataFrame

A DataFrame represents a tabular structure. It can be thought of as a dict of Series.

A DataFrame can be constructed from a dict of equal-length lists:



In [6]:

    
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = DataFrame(data)
df
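
To make the "dict of Series" picture concrete, here is a short sketch that checks it on the same data. The names state_col and df_from_series are introduced only for this example.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)

# Each column is a Series that shares the DataFrame's row index.
state_col = df['state']
is_series = isinstance(state_col, pd.Series)
same_index = state_col.index.equals(df.index)

# Rebuilding the frame from a dict of its own columns gives an equal frame.
df_from_series = pd.DataFrame({name: df[name] for name in df.columns})
frames_match = df_from_series.equals(df)

print(is_series, same_index, frames_match)
```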

You can control the column order by passing a sequence of column names:



In [7]:

    
DataFrame(data, columns=['year', 'state', 'pop'])
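
The columns argument does more than reorder. As a rough sketch, the example below also asks for a column named 'debt', which is a hypothetical name that does not exist in data, to show how pandas handles it.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

# 'debt' is not a key of data; it is requested here only for illustration.
df_cols = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'])

# Columns appear in the requested order, and the unknown one is all NaN.
all_missing = df_cols['debt'].isnull().all()

print(df_cols)
print(all_missing)
```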

In addition to index and values, a DataFrame has a columns attribute:



In [8]:

    
print(df.index)
print()
print(df.values)
print()
print(df.columns)









    



RangeIndex(start=0, stop=5, step=1)

[[1.5 'Ohio' 2000]
 [1.7 'Ohio' 2001]
 [3.6 'Ohio' 2002]
 [2.4 'Nevada' 2001]
 [2.9 'Nevada' 2002]]

Index([u'pop', u'state', u'year'], dtype='object')
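
The quotes around the state names in the values output hint that the array holds generic Python objects. The sketch below, which assumes the same data dict as above, checks the dtype of values for the whole frame and for the numeric columns only.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)

# Mixing strings and numbers forces a single catch-all dtype: object.
mixed = df.values

# Restricting to numeric columns gives a numeric array instead
# (the integer years are upcast to float to match 'pop').
numeric = df[['year', 'pop']].values

print(mixed.dtype, mixed.shape)
print(numeric.dtype, numeric.shape)
```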

You can get a specific column like this:



In [9]:

    
df['state']









    Out[9]:





0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object

Rows can be retrieved by label with the loc indexer (the older ix indexer is deprecated and has been removed from recent pandas releases):



In [10]:

    
df.loc[0]









    Out[10]:





pop       1.5
state    Ohio
year     2000
Name: 0, dtype: object
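
With the default integer index, labels and positions happen to coincide, which hides the difference between the two kinds of row lookup. The sketch below gives the same data a letter index (a made-up choice for illustration) and compares label-based loc with position-based iloc.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

# The letter labels are hypothetical; any unique labels would do.
df_letters = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])

by_label = df_letters.loc['d']      # row whose index label is 'd'
by_position = df_letters.iloc[3]    # fourth row, counting from zero

print(by_label)
print(by_label.equals(by_position))
```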

Another common input for creating a DataFrame is a nested dict of dicts or a dict of Series. The outer keys become the columns and the inner keys become the row index:



In [11]:

    
pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
df2 = DataFrame(pop)
df2
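
Here is a short sketch of what the nested dict produces, using the same pop dict. It only inspects the result; the lookups with loc are there to show where the missing value ends up.

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
df2 = pd.DataFrame(pop)

# Outer keys are the columns; the row index is the union of the inner keys.
columns = list(df2.columns)
years = sorted(df2.index)

# Nevada has no entry for 2000, so that cell is NaN.
nevada_2000_missing = pd.isnull(df2.loc[2000, 'Nevada'])

print(df2)
print(columns, years, nevada_2000_missing)
```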

You can pass an explicit index when creating a DataFrame:



In [12]:

    
df3 = DataFrame(pop, index=[2001, 2002, 2003])
df3
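
The effect of the explicit index is easiest to see by checking which rows survive. This is a minimal sketch using the same pop dict; the name df_years is chosen here only to avoid clashing with other variables.

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

df_years = pd.DataFrame(pop, index=[2001, 2002, 2003])

# Only the requested labels are kept: 2000 is dropped,
# and 2003 (absent from the data) becomes a row of NaN.
has_2000 = 2000 in df_years.index
row_2003_empty = df_years.loc[2003].isnull().all()

print(df_years)
print(has_2000, row_2003_empty)
```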

If a DataFrame’s index and columns have their name attributes set, these will also be displayed:



In [13]:

    
df2.index.name = 'year'
df2.columns.name = 'state'
df2

A third common input structure is a list of dicts or Series:



In [14]:

    
films = [{'star': 9.3, 'title': 'The Shawshank Redemption', 'content_rating': 'R'},
         {'star': 9.2, 'title': 'The Godfather', 'content_rating': 'R'},
         {'star': 9.1, 'title': 'The Godfather: Part II', 'content_rating': 'R'}
         ]
                                                     
df3 = DataFrame(films)
df3









    Out[14]:






  
    
      
	content_rating	star	title
0	R	9.3	The Shawshank Redemption
1	R	9.2	The Godfather
2	R	9.1	The Godfather: Part II

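With a list of dicts, each dict is one row, and the dicts do not need identical keys. The sketch below drops 'content_rating' from the second film on purpose (an alteration made only for this example) to show how the gap is filled.

```python
import pandas as pd

films = [
    {'star': 9.3, 'title': 'The Shawshank Redemption', 'content_rating': 'R'},
    # 'content_rating' is deliberately omitted here for illustration.
    {'star': 9.2, 'title': 'The Godfather'},
]

df_films = pd.DataFrame(films)

# Columns are the union of all keys; a missing key becomes NaN in that row.
rating_missing = pd.isnull(df_films.loc[1, 'content_rating'])

print(df_films)
print(rating_missing)
```
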
More on DataFrame manipulation will come later.

Reading Tabular Data Files into Pandas

There are two main functions for reading a delimited text file into a DataFrame: read_table and read_csv. They accept the same options; the only difference is the default separator, which is a tab for read_table and a comma for read_csv.

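The relationship between the two readers can be checked without touching the network. The sketch below parses a small in-memory, tab-separated string; the three rows are made-up sample data in the spirit of the orders file.

```python
import io

import pandas as pd

# Made-up sample: a header row followed by two tab-separated records.
tsv_text = "order_id\tquantity\titem_name\n1\t1\tIzze\n2\t2\tChicken Bowl\n"

# read_table splits on tabs by default.
from_table = pd.read_table(io.StringIO(tsv_text))

# read_csv gives the same result once told to split on tabs.
from_csv = pd.read_csv(io.StringIO(tsv_text), sep='\t')

same = from_table.equals(from_csv)

print(from_table)
print(same)
```
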
You can read a data set using read_table like so:



In [15]:

    
orders = pd.read_table('https://raw.githubusercontent.com/minhhh/charts/master/pandas/data/chipotle.tsv')
orders.head(5)









    Out[15]:






  
    
      
	order_id	quantity	item_name	choice_description	item_price
0	1	1	Chips and Fresh Tomato Salsa	NaN	$2.39
1	1	1	Izze	[Clementine]	$3.39
2	1	1	Nantucket Nectar	[Apple]	$3.39
3	1	1	Chips and Tomatillo-Green Chili Salsa	NaN	$2.39
4	2	2	Chicken Bowl	[Tomatillo-Red Chili Salsa (Hot), [Black Beans...	$16.98

A file does not always have a header row. In this case, you can let pandas assign default integer column names or specify the column names yourself:



In [16]:

    
users = pd.read_table('https://raw.githubusercontent.com/minhhh/charts/master/pandas/data/u.user', sep='|', header=None)
users.head(5)



    Out[16]:

	0	1	2	3	4
0	1	24	M	technician	85711
1	2	53	F	other	94043
2	3	23	M	writer	32067
3	4	24	M	technician	43537
4	5	33	F	other	15213



In [17]:

    
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users2 = pd.read_table('https://raw.githubusercontent.com/minhhh/charts/master/pandas/data/u.user', sep='|', header=None, names=user_cols)
users2.head(5)



    Out[17]:

	user_id	age	gender	occupation	zip_code
0	1	24	M	technician	85711
1	2	53	F	other	94043
2	3	23	M	writer	32067
3	4	24	M	technician	43537
4	5	33	F	other	15213

You can choose a specific column to be the index instead of the default integer index generated by pandas:



In [18]:

    
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users3 = pd.read_table('https://raw.githubusercontent.com/minhhh/charts/master/pandas/data/u.user', sep='|', header=None, names=user_cols, index_col='user_id')
users3.head(5)



    Out[18]:

	age	gender	occupation	zip_code
user_id
1	24	M	technician	85711
2	53	F	other	94043
3	23	M	writer	32067
4	24	M	technician	43537
5	33	F	other	15213

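The three reading options for a headerless file can be tried offline on a few lines of text. The sketch below uses an in-memory string holding the first three records shown in the user tables; the variable names no_header, named and indexed are introduced just for this example.

```python
import io

import pandas as pd

# First three pipe-separated records from the user tables, no header row.
raw = "1|24|M|technician|85711\n2|53|F|other|94043\n3|23|M|writer|32067\n"
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']

# header=None: pandas numbers the columns 0, 1, 2, ...
no_header = pd.read_table(io.StringIO(raw), sep='|', header=None)

# names=...: supply the column names yourself.
named = pd.read_table(io.StringIO(raw), sep='|', header=None, names=user_cols)

# index_col=...: promote one of the named columns to the row index.
indexed = pd.read_table(io.StringIO(raw), sep='|', header=None,
                        names=user_cols, index_col='user_id')

print(list(no_header.columns))
print(list(named.columns))
print(indexed.loc[2, 'occupation'])
```

Once user_id is the index, rows can be looked up by user id with loc rather than by their position in the file.
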
For reference, the DataFrame df created in In [6] displays as:

	pop	state	year
0	1.5	Ohio	2000
1	1.7	Ohio	2001
2	3.6	Ohio	2002
3	2.4	Nevada	2001
4	2.9	Nevada	2002

