Title: pandas Data Structures
Slug: pandas_data_structures
Summary: pandas Data Structures
Date: 2016-05-01 12:00
Category: Python
Tags: Data Wrangling
Authors: Chris Albon

Import modules



In [1]:

    
import pandas as pd

Series 101

A Series is a one-dimensional labeled array (similar to a vector in R)

Create a series holding the number of flooding reports



In [2]:

    
floodingReports = pd.Series([5, 6, 2, 9, 12])
floodingReports









    Out[2]:





0     5
1     6
2     2
3     9
4    12
dtype: int64

Note that the first column of numbers (0 to 4) is the index.
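The index and the values can also be inspected separately. A minimal sketch, reusing the numbers above; the variable names reports, index_labels and values are illustrative and not part of the original notebook:

```python
import pandas as pd

# Same values as floodingReports above; `reports` is an illustrative name
reports = pd.Series([5, 6, 2, 9, 12])

# With no index given, pandas assigns integer labels counting up from 0
index_labels = list(reports.index)

# The stored values, converted to a plain Python list
values = reports.tolist()

print(index_labels)
print(values)
```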

Set county names to be the index of the floodingReports series



In [3]:

    
floodingReports = pd.Series([5, 6, 2, 9, 12], index=['Cochise County', 'Pima County', 'Santa Cruz County', 'Maricopa County', 'Yuma County'])
floodingReports









    Out[3]:





Cochise County        5
Pima County           6
Santa Cruz County     2
Maricopa County       9
Yuma County          12
dtype: int64

View the number of flooding reports in Cochise County



In [4]:

    
floodingReports['Cochise County']









    Out[4]:





5
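Square brackets with a label work here because the index holds county names. As a hedged sketch, the same value can be fetched more explicitly with .loc (by label) or .iloc (by position); the names reports, by_label and by_position are illustrative:

```python
import pandas as pd

# Same data as floodingReports above; `reports` is an illustrative name
reports = pd.Series(
    [5, 6, 2, 9, 12],
    index=['Cochise County', 'Pima County', 'Santa Cruz County',
           'Maricopa County', 'Yuma County'],
)

# Label-based lookup: the row whose index label is 'Cochise County'
by_label = reports.loc['Cochise County']

# Position-based lookup: the first row, whatever its label is
by_position = reports.iloc[0]

print(by_label, by_position)
```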

View the counties with more than 6 flooding reports



In [5]:

    
floodingReports[floodingReports > 6]









    Out[5]:





Maricopa County     9
Yuma County        12
dtype: int64
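The filter works because the comparison itself returns a series of True/False values, which is then used to select rows. A small sketch of that intermediate step, plus a combined condition; the names reports, mask and mid_range are illustrative:

```python
import pandas as pd

# Same data as floodingReports above; `reports` is an illustrative name
reports = pd.Series(
    [5, 6, 2, 9, 12],
    index=['Cochise County', 'Pima County', 'Santa Cruz County',
           'Maricopa County', 'Yuma County'],
)

# The comparison produces a boolean Series with the same index
mask = reports > 6

# Conditions combine with & (and) or | (or); each one needs its own parentheses
mid_range = reports[(reports > 4) & (reports < 10)]

print(mask)
print(mid_range)
```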

Create a pandas series from a dictionary

Note: when you do this, the dict's keys become the series' index



In [6]:

    
# Create a dictionary
fireReports_dict = {'Cochise County': 12, 'Pima County': 342, 'Santa Cruz County': 13, 'Maricopa County': 42, 'Yuma County' : 52}

# Convert the dictionary into a pd.Series, and view it
fireReports = pd.Series(fireReports_dict)
fireReports









    Out[6]:





Cochise County        12
Maricopa County       42
Pima County          342
Santa Cruz County     13
Yuma County           52
dtype: int64
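When an explicit index is passed along with a dict, pandas looks up each requested label in the dict, so the index controls both the order and which keys are kept. A minimal sketch under that assumption; fire_counts, wanted and selected are illustrative names, and 'Gila County' is a label deliberately absent from the dict:

```python
import pandas as pd

# A subset of the fire report counts used above
fire_counts = {'Cochise County': 12, 'Pima County': 342, 'Yuma County': 52}

# 'Gila County' is not a key in the dict (illustrative missing label)
wanted = ['Yuma County', 'Cochise County', 'Gila County']

# Labels found in the dict keep their value; missing labels become NaN
selected = pd.Series(fire_counts, index=wanted)

print(selected)
```

Because NaN is a floating point value, the resulting series is stored as floats rather than integers.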

Change the index of a series to shorter names



In [7]:

    
# Strip the " County" suffix from each label so every value keeps its own county
fireReports.index = fireReports.index.str.replace(' County', '', regex=False)
fireReports









    Out[7]:





Cochise        12
Maricopa       42
Pima          342
Santa Cruz     13
Yuma           52
dtype: int64
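Assigning a list to the index relabels rows purely by position, so the new names must be in the same order as the existing rows. As a hedged alternative, rename maps each old label to a new one explicitly and returns a new series; fire, short_names and renamed are illustrative names:

```python
import pandas as pd

# A subset of the fire report counts used above
fire = pd.Series({'Cochise County': 12, 'Pima County': 342, 'Santa Cruz County': 13})

# Explicit old-label to new-label mapping, independent of row order
short_names = {
    'Cochise County': 'Cochise',
    'Pima County': 'Pima',
    'Santa Cruz County': 'Santa Cruz',
}

# rename returns a new Series and leaves `fire` unchanged
renamed = fire.rename(short_names)

print(renamed)
```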

DataFrame 101

DataFrames are like R's data frames

Create a dataframe from a dict of equal-length lists or NumPy arrays



In [8]:

    
data = {'county': ['Cochise', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'],
        'year': [2012, 2012, 2013, 2014, 2014], 
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data)
df
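Each key of the dict becomes a column and each list supplies that column's values, one per row. A short sketch of how the result can be inspected, using a copy of the same data; the names n_rows, n_cols, column_names and reports_column are illustrative:

```python
import pandas as pd

data = {'county': ['Cochise', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data)

# shape is (number of rows, number of columns)
n_rows, n_cols = df.shape

# Column labels, sorted here so the result does not depend on column order
column_names = sorted(df.columns)

# Selecting a single column returns a Series
reports_column = df['reports']

print(n_rows, n_cols, column_names)
print(reports_column.sum())
```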

Set the order of the columns using the columns argument of the DataFrame constructor



In [9]:

    
dfColumnOrdered = pd.DataFrame(data, columns=['county', 'year', 'reports'])
dfColumnOrdered
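The column order of an existing dataframe can also be changed after it is built. A minimal sketch of two ways to do that, assuming the same data as above; reordered and with_extra are illustrative names, and 'population' is a hypothetical column that does not exist in the data:

```python
import pandas as pd

data = {'county': ['Cochise', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data)

# Selecting with a list of column names returns them in that order
reordered = df[['year', 'county', 'reports']]

# reindex also accepts names that do not exist yet; those columns are filled with NaN
with_extra = df.reindex(columns=['county', 'reports', 'population'])

print(reordered)
print(with_extra)
```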

Add a column



In [10]:

    
dfColumnOrdered['newsCoverage'] = pd.Series([42.3, 92.1, 12.2, 39.3, 30.2])
dfColumnOrdered









    Out[10]:

       county  year  reports  newsCoverage
0     Cochise  2012        4          42.3
1        Pima  2012       24          92.1
2  Santa Cruz  2013       31          12.2
3    Maricopa  2014        2          39.3
4        Yuma  2014        3          30.2
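When the new column is a Series, pandas matches its index labels to the dataframe's index rather than going by position. A small sketch of that alignment with made-up values; small_df and coverage are illustrative names:

```python
import pandas as pd

# A three-row dataframe with the default index 0, 1, 2
small_df = pd.DataFrame({'county': ['Pima', 'Yuma', 'Maricopa']})

# Illustrative values labeled 0 and 2 only; label 1 is deliberately missing
coverage = pd.Series([92.1, 39.3], index=[0, 2])

# Values land on the rows with matching labels; row 1 gets NaN
small_df['newsCoverage'] = coverage

print(small_df)
```

Assigning a plain list instead would fill the column by position and requires one value per row.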

Delete a column



In [11]:

    
del dfColumnOrdered['newsCoverage']
dfColumnOrdered
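The del statement removes the column from the dataframe in place. As a hedged alternative, drop returns a new dataframe without the column and leaves the original untouched; the data below is a shortened, illustrative copy, and original and trimmed are illustrative names:

```python
import pandas as pd

original = pd.DataFrame({'county': ['Pima', 'Yuma'],
                         'reports': [24, 3],
                         'newsCoverage': [92.1, 30.2]})

# drop returns a copy without the listed columns
trimmed = original.drop(columns=['newsCoverage'])

print(list(original.columns))
print(list(trimmed.columns))
```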

Transpose the dataframe



In [12]:

    
dfColumnOrdered.T

    Out[12]:

               0     1           2         3     4
county   Cochise  Pima  Santa Cruz  Maricopa  Yuma
year        2012  2012        2013      2014  2014
reports        4    24          31         2     3

    Out[8]:

       county  reports  year
0     Cochise        4  2012
1        Pima       24  2012
2  Santa Cruz       31  2013
3    Maricopa        2  2014
4        Yuma        3  2014
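Transposing swaps the row labels and the column labels, so each original column becomes a row. A minimal sketch on a shortened, illustrative copy of the data; small and transposed are illustrative names:

```python
import pandas as pd

small = pd.DataFrame({'county': ['Pima', 'Yuma'],
                      'reports': [24, 3]})

# .T returns a new dataframe; `small` itself is not modified
transposed = small.T

print(transposed)
print(transposed.shape)
```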