Title: pandas Data Structures
Slug: pandas_data_structures
Summary: pandas Data Structures
Date: 2016-05-01 12:00
Category: Python
Tags: Data Wrangling
Authors: Chris Albon

Import modules


In [1]:
import pandas as pd

Series 101

A Series is a one-dimensional labeled array (similar to a vector in R).

Create a series holding the number of flooding reports


In [2]:
floodingReports = pd.Series([5, 6, 2, 9, 12])
floodingReports


Out[2]:
0     5
1     6
2     2
3     9
4    12
dtype: int64

Note that the first column of numbers (0 to 4) is the index.
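
As a minimal sketch of that split between index and data, the two parts of a series can be pulled out separately. The variable names `reports`, `index_labels`, and `values` are illustrative and not part of the original post.

```python
import pandas as pd

# Same values as floodingReports above; no index is passed,
# so pandas assigns the default integer labels 0..4.
reports = pd.Series([5, 6, 2, 9, 12])

index_labels = list(reports.index)  # the left-hand column of the printout
values = reports.tolist()           # the right-hand column of the printout

print(index_labels)
print(values)
```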

Set county names to be the index of the floodingReports series


In [3]:
floodingReports = pd.Series([5, 6, 2, 9, 12], index=['Cochise County', 'Pima County', 'Santa Cruz County', 'Maricopa County', 'Yuma County'])
floodingReports


Out[3]:
Cochise County        5
Pima County           6
Santa Cruz County     2
Maricopa County       9
Yuma County          12
dtype: int64

View the number of flooding reports in Cochise County


In [4]:
floodingReports['Cochise County']


Out[4]:
5
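
Square-bracket lookup works here because the label is a string. A hedged sketch of the more explicit accessors, `.loc` for labels and `.iloc` for positions, is shown below; the name `flooding` is a stand-in for the series above.

```python
import pandas as pd

flooding = pd.Series(
    [5, 6, 2, 9, 12],
    index=['Cochise County', 'Pima County', 'Santa Cruz County',
           'Maricopa County', 'Yuma County'],
)

by_label = flooding.loc['Cochise County']   # look up by index label
by_position = flooding.iloc[0]              # look up by integer position
several = flooding.loc[['Pima County', 'Yuma County']]  # list of labels gives a Series

print(by_label, by_position)
print(several)
```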

View the counties with more than 6 flooding reports


In [5]:
floodingReports[floodingReports > 6]


Out[5]:
Maricopa County     9
Yuma County        12
dtype: int64
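
The filter above works in two steps: the comparison builds a boolean series, and indexing with it keeps only the rows marked True. A small sketch that makes the intermediate mask visible (the names `flooding`, `mask`, and `mid_range` are illustrative):

```python
import pandas as pd

flooding = pd.Series(
    [5, 6, 2, 9, 12],
    index=['Cochise County', 'Pima County', 'Santa Cruz County',
           'Maricopa County', 'Yuma County'],
)

mask = flooding > 6      # boolean Series with the same index as flooding
high = flooding[mask]    # keeps only the rows where mask is True

# Conditions combine with & and |; each one needs its own parentheses
mid_range = flooding[(flooding > 2) & (flooding < 12)]

print(mask)
print(mid_range)
```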

Create a pandas series from a dictionary

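Note: when you do this, the dict's keys become the series's index. The output below lists the keys alphabetically, as older pandas releases sorted them; recent releases keep the dictionary's insertion order.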


In [6]:
# Create a dictionary
fireReports_dict = {'Cochise County': 12, 'Pima County': 342, 'Santa Cruz County': 13, 'Maricopa County': 42, 'Yuma County' : 52}

# Convert the dictionary into a pd.Series, and view it
fireReports = pd.Series(fireReports_dict)
fireReports


Out[6]:
Cochise County        12
Maricopa County       42
Pima County          342
Santa Cruz County     13
Yuma County           52
dtype: int64
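
The dict constructor also accepts an explicit `index`. The sketch below, which uses a shortened hypothetical dictionary, suggests how that argument selects and orders the keys, and how a label with no matching key ends up as a missing value.

```python
import pandas as pd

# Hypothetical, shortened version of the dictionary above
fire_dict = {'Cochise County': 12, 'Pima County': 342}

# The index argument picks keys in the given order;
# 'Yuma County' is not in the dict, so its value is NaN
fire = pd.Series(fire_dict, index=['Pima County', 'Cochise County', 'Yuma County'])

print(fire)
```

Because NaN is a floating-point value, the resulting series is stored as floats rather than integers.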

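Change the index of a series to shorter names

Note: assigning a list to the index relabels rows by position, so the list order must match the current index order. Stripping the suffix from each existing label avoids that dependency.


In [7]:
# Derive each short name from its own label so values stay with the right county
fireReports.index = fireReports.index.str.replace(' County', '')
fireReports


Out[7]:
Cochise        12
Maricopa       42
Pima          342
Santa Cruz     13
Yuma           52
dtype: int64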
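
When the new names are arbitrary, a mapping passed to `Series.rename` ties each new label to a specific old one, regardless of row order. This is a sketch with a shortened hypothetical series, not the post's own approach.

```python
import pandas as pd

fire = pd.Series({'Cochise County': 12, 'Pima County': 342, 'Yuma County': 52})

# Old label -> new label; the order of this dict does not matter
short_names = {
    'Yuma County': 'Yuma',
    'Cochise County': 'Cochise',
    'Pima County': 'Pima',
}

renamed = fire.rename(index=short_names)  # returns a new Series

print(renamed)
```

`rename` leaves `fire` itself unchanged, so the result has to be assigned to a name to be kept.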

DataFrame 101

DataFrames are two-dimensional labeled tables, similar to R's data frames.

Create a dataframe from a dict of equal-length lists or numpy arrays


In [8]:
data = {'county': ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'], 
        'year': [2012, 2012, 2013, 2014, 2014], 
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data)
df


Out[8]:
       county  reports  year
0     Cochice        4  2012
1        Pima       24  2012
2  Santa Cruz       31  2013
3    Maricopa        2  2014
4        Yuma        3  2014
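
A quick way to check what the constructor built is to look at the frame's shape and pull out a single column, which comes back as a series. The sketch below reuses the same data; the names `n_rows`, `n_cols`, and `reports` are illustrative.

```python
import pandas as pd

data = {'county': ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data)

n_rows, n_cols = df.shape   # one row per list element, one column per dict key
reports = df['reports']     # a single column is a Series
total_reports = reports.sum()

print(n_rows, n_cols, total_reports)
print(df.dtypes)            # each column keeps its own dtype
```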

Set the order of the columns using the columns argument of the constructor


In [9]:
dfColumnOrdered = pd.DataFrame(data, columns=['county', 'year', 'reports'])
dfColumnOrdered


Out[9]:
       county  year  reports
0     Cochice  2012        4
1        Pima  2012       24
2  Santa Cruz  2013       31
3    Maricopa  2014        2
4        Yuma  2014        3
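
Two related behaviors are worth a sketch: a name listed in `columns` that is not a key in the dict yields a column of missing values, and an existing frame can be reordered by indexing with a list of column names. The column name `damage` is hypothetical and used only to show the first case.

```python
import pandas as pd

data = {'county': ['Cochice', 'Pima'],
        'year': [2012, 2012],
        'reports': [4, 24]}

# 'damage' is not a key in data, so that column is filled with NaN
with_extra = pd.DataFrame(data, columns=['county', 'year', 'damage'])

# Reorder (and subset) an existing frame by indexing with a list of names
df = pd.DataFrame(data)
reordered = df[['reports', 'county']]

print(with_extra)
print(reordered)
```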

Add a column


In [10]:
dfColumnOrdered['newsCoverage'] = pd.Series([42.3, 92.1, 12.2, 39.3, 30.2])
dfColumnOrdered


Out[10]:
       county  year  reports  newsCoverage
0     Cochice  2012        4          42.3
1        Pima  2012       24          92.1
2  Santa Cruz  2013       31          12.2
3    Maricopa  2014        2          39.3
4        Yuma  2014        3          30.2
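
Assigning a series as a new column matches rows by index label, not by position. That works above because both the frame and the new series use the default 0 to 4 index. The sketch below uses a hypothetical frame with letter labels to show what happens when the labels differ; a plain list, by contrast, is assigned by position.

```python
import pandas as pd

df = pd.DataFrame({'county': ['Cochice', 'Pima', 'Santa Cruz']},
                  index=['a', 'b', 'c'])

# A list is assigned positionally
df['from_list'] = [42.3, 92.1, 12.2]

# This Series has index 0, 1, 2, which matches none of 'a', 'b', 'c'
df['from_series'] = pd.Series([42.3, 92.1, 12.2])

# Labels that do match are aligned, whatever their order; 'b' gets NaN
df['aligned'] = pd.Series([12.2, 42.3], index=['c', 'a'])

print(df)
```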

Delete a column


In [11]:
del dfColumnOrdered['newsCoverage']
dfColumnOrdered


Out[11]:
       county  year  reports
0     Cochice  2012        4
1        Pima  2012       24
2  Santa Cruz  2013       31
3    Maricopa  2014        2
4        Yuma  2014        3
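
`del` modifies the frame in place. Two alternatives, sketched here on a small hypothetical frame: `DataFrame.drop` returns a new frame and leaves the original untouched, while `DataFrame.pop` removes the column in place and hands it back as a series.

```python
import pandas as pd

df = pd.DataFrame({'county': ['Cochice', 'Pima'],
                   'reports': [4, 24],
                   'newsCoverage': [42.3, 92.1]})

# drop returns a copy without the column; df still has it
trimmed = df.drop(columns=['newsCoverage'])
still_there = 'newsCoverage' in df.columns

# pop removes the column from df and returns its values
coverage = df.pop('newsCoverage')

print(list(trimmed.columns))
print(list(df.columns))
print(coverage.tolist())
```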

Transpose the dataframe


In [12]:
dfColumnOrdered.T


Out[12]:
               0     1           2         3     4
county   Cochice  Pima  Santa Cruz  Maricopa  Yuma
year        2012  2012        2013      2014  2014
reports        4    24          31         2     3
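
One side effect of transposing is worth noting: each column of the result now mixes text and numbers, so the integer dtype of `year` and `reports` is not kept. A sketch on a two-row version of the data, with a numeric-only frame for comparison (the names `transposed` and `numeric_t` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'county': ['Cochice', 'Pima'],
                   'year': [2012, 2012],
                   'reports': [4, 24]},
                  columns=['county', 'year', 'reports'])

transposed = df.T   # rows become columns and columns become rows

# Column 0 now holds 'Cochice', 2012 and 4 together, so it is no longer integer typed
mixed_is_int = pd.api.types.is_integer_dtype(transposed[0])

# Transposing only the numeric columns keeps an integer dtype
numeric_t = df[['year', 'reports']].T
numeric_is_int = pd.api.types.is_integer_dtype(numeric_t[0])

print(transposed.shape)
print(transposed.dtypes)
```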