Part 1: Data Structures in Pandas


In [ ]:
"""
----------------------------------------------------------------------
Filename : 01_basic_data_structs.py
Date     : 12th Dec, 2013
Author   : Jaidev Deshpande
Purpose  : To get started with basic data structures in Pandas
Libraries: Pandas 0.12 and its dependencies
----------------------------------------------------------------------
"""

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. http://pandas.pydata.org

There are many useful objects in Pandas:

  • Series
  • DataFrame
  • Panel
  • TimeSeries

Series and DataFrame


In [ ]:
# imports
import pandas as pd
from math import pi

In [ ]:
s = pd.Series(range(10))
print(s)

In [ ]:
print(s[5])

A pandas Series, like a list, doesn't have to be homogeneous.


In [ ]:
s = pd.Series(['foo', None, 3+4j])
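
One way to see what a mixed Series means in practice is to inspect its dtype. The sketch below is only illustrative; the names `mixed` and `numbers` are not part of the original notebook. A Series holding unrelated Python objects falls back to the generic `object` dtype, while a Series built from integers keeps an integer dtype.

```python
import pandas as pd

# A Series holding a str, a None and a complex number.
mixed = pd.Series(['foo', None, 3 + 4j])
print(mixed.dtype)                         # generic object dtype
print([type(v).__name__ for v in mixed])   # Python type of each element

# A Series built from integers only keeps a numeric dtype.
numbers = pd.Series(range(10))
print(numbers.dtype)
```

Numeric dtypes are what make vectorised arithmetic fast, so a mixed Series is convenient but generally slower to compute with.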

The index of a Series can be arbitrary as well.


In [ ]:
inds = ['bar', 1, (1, 2)]
s.index = inds
print(s['bar'], s[1], s[(1, 2)])
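
Because index labels are arbitrary, an integer label need not match the integer position of an element. The following is a minimal sketch of that distinction, assuming the `.loc` (label-based) and `.iloc` (position-based) accessors, which this notebook does not otherwise use; the Series `letters` is hypothetical.

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0, 1, 2.
letters = pd.Series(['x', 'y', 'z'], index=[2, 0, 1])

by_label = letters.loc[0]      # element whose label is 0
by_position = letters.iloc[0]  # element stored first

print(by_label, by_position)
```

Using `.loc` and `.iloc` explicitly avoids any ambiguity about whether an integer is meant as a label or as a position.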

Multiple Series objects can be combined to make a pandas DataFrame, with each Series becoming one column. The pandas DataFrame is similar to the data.frame object in R.


In [ ]:
# two Series that share the default integer index 0..9
s1 = pd.Series(range(10))
s2 = pd.Series(range(10, 20))
# each dict key becomes a column name
df = pd.DataFrame({'A': s1, 'B': s2})
df.head()
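
The two Series above share the same index, so they line up row by row. As a hedged illustration of what happens when they do not, the sketch below builds a DataFrame from two hypothetical Series, `left` and `right`, whose indexes only partly overlap. pandas aligns them on their labels and fills the gaps with NaN.

```python
import pandas as pd

# Two Series whose indexes overlap only on 'b' and 'c'.
left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# The resulting index is the union of both indexes;
# labels missing from one Series become NaN in its column.
frame = pd.DataFrame({'A': left, 'B': right})
print(frame)
```

Note that the columns become floating point here, because NaN cannot be stored in a plain integer column.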

Think of pandas DataFrames as dicts of Series, with the column names acting as keys. Many operations that are valid on a Python dictionary, such as getting, setting and deleting items by key, also work on a pandas DataFrame.


In [ ]:
df['C'] = [str(c) for c in range(20, 30)]
print(df.head())

In [ ]:
print(df['C'])

In [ ]:
del df['A']
print(df.head(10))

In [ ]:
# overwrite the values of the existing column 'B' in place
df.update({'B': range(50, 60)})
print(df.head())
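
The dict analogy has limits, and `update` is one of them. The sketch below, using a hypothetical DataFrame named `table`, shows a few more dict-style operations (`in`, `keys`, `pop`) and then checks what `DataFrame.update` does with a key that is not already a column: unlike `dict.update`, it only overwrites existing columns and ignores the rest.

```python
import pandas as pd

table = pd.DataFrame({'B': range(10, 20)})
table['C'] = [str(c) for c in range(20, 30)]   # add a column by key

print('C' in table)          # membership is tested against column names
print(list(table.keys()))    # keys() returns the column labels

popped = table.pop('C')      # remove a column and get it back as a Series

# 'B' exists and is overwritten; 'Z' does not exist and is ignored.
table.update({'B': range(50, 60), 'Z': range(10)})
print(list(table.columns))
print(table.head())
```

To add a new column, plain assignment such as `table['Z'] = range(10)` is the dict-style operation that works.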

Index Objects

Index objects available in Pandas:

  • Index: The most general Pandas index, often created by default
  • Int64Index: Specialized index for integer values
  • MultiIndex: Hierarchical index
  • DatetimeIndex: Nanosecond timestamps that can be used as indexes
  • PeriodIndex: Specialized indices for timespans

In [ ]:
df.index
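
To make the list above more concrete, here is a small sketch that constructs several kinds of index and prints the class of each. All variable names are illustrative, and the dates are arbitrary. The class reported for the default integer index depends on the installed pandas version, so no particular output is assumed for it.

```python
import pandas as pd

# Index created by default when no index is given.
default_index = pd.Series(range(3)).index

# General-purpose index built from arbitrary labels.
label_index = pd.Index(['a', 'b', 'c'])

# Hierarchical index with two levels.
multi_index = pd.MultiIndex.from_tuples([('x', 1), ('x', 2), ('y', 1)])

# Three consecutive daily timestamps.
date_index = pd.date_range('2013-12-12', periods=3, freq='D')

# Three consecutive monthly periods.
period_index = pd.period_range('2013-12', periods=3, freq='M')

for idx in (default_index, label_index, multi_index, date_index, period_index):
    print(type(idx).__name__, len(idx))
```

Any of these can be passed as the `index` argument when constructing a Series or DataFrame.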

Exercise: Creating Series, DataFrames and indexing them

  1. Create a NumPy array of random values with shape (10, 10).
  2. Convert this array into a DataFrame.
  3. The column names of this DataFrame should be of type str.
  4. Add one more column to the DataFrame using the dict-style assignment demonstrated above (the update method only overwrites columns that already exist).