In [ ]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 50)

Data Structures

pandas introduces two new data structures to Python, Series and DataFrame. Both are built on top of NumPy, so operations on them are vectorized and fast.

Series

A Series is a one-dimensional object similar to an array, list, or column in a table. It will assign a labeled index to each item in the Series. By default, each item will receive an index label from 0 to N, where N is the length of the Series minus one.



In [ ]:

    
# create a Series with an arbitrary list
s = pd.Series([7, 'Pi', 3.14, -3233432, 'Happy Learning!'])
s

Alternatively, you can specify an index to use when creating the Series.



In [ ]:

    
s = pd.Series([7, 'Pi', 3.14, -3233432, 'Happy Learning!'],
              index=['A', 'Z', 'C', 'Y', 'E'])
s
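With a custom index in place, an item can be looked up either by its label or by its position. The following is a minimal sketch that reuses the Series defined above; the variable names `by_label` and `by_position` are only illustrative.

```python
import pandas as pd

s = pd.Series([7, 'Pi', 3.14, -3233432, 'Happy Learning!'],
              index=['A', 'Z', 'C', 'Y', 'E'])

by_label = s.loc['Z']      # look up by index label
by_position = s.iloc[1]    # look up by integer position (0-based)
print(by_label)
print(by_position)
```

Both lookups return the same item here, because 'Z' is the label of the second entry.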

The Series constructor can convert a dictionary as well, using the keys of the dictionary as its index.



In [ ]:

    
d = {'Pagri': 4600, 'Tanggulashan': 4587, 'Ukdungle': 4659, 'Colquechaca': 4692,
     'Hunza Khunjerab Pass': 4693, 'El Aguilar': 4895, 'Wenquan': 5019, 'La Rinconada': 5099}
cities = pd.Series(d)
cities
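When a dictionary is passed together with an explicit index, pandas keeps only the requested keys, in the requested order, and fills any label that is missing from the dictionary with NaN. A small sketch, using a shortened copy of the dictionary above; 'Pune' is a hypothetical label chosen because it is not one of the keys.

```python
import pandas as pd

heights = {'Pagri': 4600, 'Wenquan': 5019, 'La Rinconada': 5099}

# 'Pune' is not a key in the dict, so its value becomes NaN
subset = pd.Series(heights, index=['Wenquan', 'Pagri', 'Pune'])
print(subset)
```

Because NaN is a floating-point value, the resulting Series is stored as floats rather than integers.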

You can use the index to select specific items from the Series ...



In [ ]:

    
cities['Pagri']



In [ ]:

    
cities[['Pagri', 'Colquechaca', 'Wenquan']]

Or you can use boolean indexing for selection.



In [ ]:

    
cities[cities > 4900]



In [ ]:

    
# the comparison yields a boolean Series that can be reused as a mask
greater_than_4900 = cities > 4900

print(greater_than_4900)
print('\n')
print(cities[greater_than_4900])
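Boolean masks can also be combined. The sketch below assumes a shortened copy of the `cities` Series and uses `&` for an element-wise "and"; each comparison needs its own parentheses because `&` binds more tightly than `>` and `<`.

```python
import pandas as pd

cities = pd.Series({'Pagri': 4600, 'El Aguilar': 4895,
                    'Wenquan': 5019, 'La Rinconada': 5099})

# keep items that satisfy both conditions
mid_range = cities[(cities > 4800) & (cities < 5050)]
print(mid_range)
```

The same pattern works with `|` for "or" and `~` for "not".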

You can also change the values in a Series on the fly.



In [ ]:

    
# changing based on the index
print('Old value:', cities['Pagri'])
cities['Pagri'] = 4990
print('New value:', cities['Pagri'])



In [ ]:

    
# changing values using boolean logic
print(cities[cities < 4900])
print('\n')
cities[cities < 4900] = 4890

print(cities[cities < 4900])

What if you aren't sure whether an item is in the Series? You can check using idiomatic Python.



In [ ]:

    
print('Wenquan' in cities)
print('Pune' in cities)
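Note that `in` tests the index labels of a Series, not its values. A brief sketch of the difference, using a shortened copy of `cities`; `isin` is one way to search the values instead.

```python
import pandas as pd

cities = pd.Series({'Wenquan': 5019, 'La Rinconada': 5099})

label_found = 'Wenquan' in cities              # True: 'Wenquan' is an index label
value_as_label = 5019 in cities                # False: 5019 is a value, not a label
value_found = bool(cities.isin([5019]).any())  # True: this searches the values
print(label_found)
print(value_as_label)
print(value_found)
```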

Mathematical operations can be done using scalars and functions.



In [ ]:

    
# divide city values by 3
cities / 3



In [ ]:

    
# square city values
np.square(cities)
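Both kinds of operation are applied element by element and return a new Series that keeps the original index. A minimal sketch, assuming a shortened copy of `cities`:

```python
import numpy as np
import pandas as pd

cities = pd.Series({'Wenquan': 5019, 'La Rinconada': 5099})

halved = cities / 2           # the scalar is applied to every element
squared = np.square(cities)   # a NumPy function; the result is still a Series
print(halved)
print(squared)
```

The original `cities` Series is left unchanged by either operation.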



In [ ]:

    
print(cities[['Pagri', 'Colquechaca', 'Wenquan']])
print('\n')
print(cities[['Pagri', 'Ukdungle', 'Tanggulashan']])
print('\n')
# addition aligns on index labels; labels found in only one operand give NaN
print(cities[['Pagri', 'Colquechaca', 'Wenquan']] + cities[['Pagri', 'Ukdungle', 'Tanggulashan']])
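Adding two Series aligns them on index labels rather than on position, so a label that appears in only one of them produces NaN in the result. The sketch below uses illustrative values for a shortened `cities` Series and shows `Series.add` with `fill_value=0` as one way to treat a missing label as zero instead.

```python
import pandas as pd

cities = pd.Series({'Pagri': 4990, 'Tanggulashan': 4890, 'Ukdungle': 4890,
                    'Colquechaca': 4890, 'Wenquan': 5019})

a = cities[['Pagri', 'Colquechaca', 'Wenquan']]
b = cities[['Pagri', 'Ukdungle', 'Tanggulashan']]

total = a + b                     # only 'Pagri' is in both, the rest become NaN
filled = a.add(b, fill_value=0)   # a missing label counts as 0 instead
print(total)
print(filled)
```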



In [ ]:

    
cities['Pagri'] = np.nan



In [ ]:

    
cities



In [ ]:

    
cities.notnull()



In [ ]:

    
cities.isnull()



In [ ]:

    
print(cities[cities.isnull()])
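Once missing values have been located, they are usually either dropped or replaced. A short sketch with a shortened copy of `cities`; filling with the mean is only one possible choice and is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

cities = pd.Series({'Pagri': np.nan, 'Wenquan': 5019.0, 'La Rinconada': 5099.0})

without_missing = cities.dropna()             # drop entries whose value is NaN
with_default = cities.fillna(cities.mean())   # replace NaN with the mean of the rest
print(without_missing)
print(with_default)
```

Both methods return a new Series and leave `cities` itself untouched.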

DataFrame



In [ ]:

    
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data, columns=['year', 'team', 'wins', 'losses'])
print(football)



In [ ]:

    
! head -n 5 data/body.csv



In [ ]:

    
frm_csv = pd.read_csv('data/body.csv')
print(frm_csv.head(n=5))



In [ ]:

    
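colnames = ['Date', 'Weight', 'BMI', 'Fat', 'BP', 'RHR', 'DS']

# re-read the file, replacing its header row with colnames
frm_csv = pd.read_csv('data/body.csv',
                      na_values=[0, '0/0'],  # treat 0 and '0/0' as missing
                      sep=',',
                      parse_dates=[0],       # parse the first column as dates
                      header=0,
                      names=colnames)

print(frm_csv.head(n=10))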
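The effect of these `read_csv` options can be tried without the original file. The sketch below reads from an in-memory string instead; the two data rows and the lowercase header are hypothetical stand-ins, not the contents of data/body.csv.

```python
import io
import pandas as pd

# hypothetical sample rows; the real file is not reproduced here
sample = io.StringIO(
    u"date,weight,bmi,fat,bp,rhr,ds\n"
    u"2014-01-01,66.2,21.5,0,120/80,58,0\n"
    u"2014-01-02,65.8,21.4,14.1,0/0,60,7\n"
)
colnames = ['Date', 'Weight', 'BMI', 'Fat', 'BP', 'RHR', 'DS']

frame = pd.read_csv(sample,
                    na_values=[0, '0/0'],  # read 0 and '0/0' as missing
                    parse_dates=[0],       # parse the first column as dates
                    header=0,              # row 0 is the old header, discard it
                    names=colnames)        # and use these names instead
print(frame)
print(frame.dtypes)
```

In this sample the `0` in the Fat and DS columns and the `0/0` in the BP column all come back as NaN, while the Date column is parsed into datetime values.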



In [ ]:

    
print(frm_csv.describe())



In [ ]:

    
print(frm_csv.dtypes)



In [ ]:

    
print(frm_csv.tail())



In [ ]:

    
print(frm_csv[10:15])

Selection



In [ ]:

    
frm_csv['Weight'].head()



In [ ]:

    
frm_csv[['Date', 'Weight']].head()



In [ ]:

    
frm_csv[frm_csv.Weight > 65].head()



In [ ]:

    
condition = frm_csv.Weight > 65
print(condition[:5])



In [ ]:

    
frm_csv.Date[frm_csv.Weight > 65].head()
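Row filtering and column selection can also be done in a single step with `.loc`, which takes a row selector and a column selector. A minimal sketch; the small `frame` below is a hypothetical stand-in for `frm_csv` with made-up measurements.

```python
import pandas as pd

# hypothetical measurements standing in for frm_csv
frame = pd.DataFrame({'Date': pd.to_datetime(['2014-01-01', '2014-01-02', '2014-01-03']),
                      'Weight': [64.8, 65.4, 66.1]})

# rows where Weight > 65, restricted to the Date and Weight columns
heavy = frame.loc[frame['Weight'] > 65, ['Date', 'Weight']]
print(heavy)
```

The bracket form `frame['Weight']` is used here instead of `frame.Weight`; both refer to the same column, but the bracket form also works for column names that are not valid Python identifiers.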

DataFrame

A DataFrame is a tabular data structure composed of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can also think of a DataFrame as a group of Series objects that share a row index, with each Series stored under a column name.
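The "group of Series" view can be made concrete by building a DataFrame from Series directly. The sketch below reuses the Bears rows from the football data; the names `wins`, `losses`, and `bears` are only illustrative.

```python
import pandas as pd

wins = pd.Series([11, 8, 10], index=[2010, 2011, 2012])
losses = pd.Series([5, 8, 6], index=[2010, 2011, 2012])

# the dict keys become column names; the shared labels become the row index
bears = pd.DataFrame({'wins': wins, 'losses': losses})
print(bears)
print(type(bears['wins']))
```

Selecting a single column gives back a Series that carries the same row index as the DataFrame.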
