Introduction to Pandas

Professor Robert J. Brunner



Introduction

One of the early criticisms of the Python language from many in the data science arena was the lack of useful data structures for performing data analysis tasks. This stemmed in part from comparisons between the R language and Python, since R has a built-in DataFrame object that greatly simplifies many data analysis tasks. This deficiency was addressed in 2008 by Wes McKinney with the creation of Pandas (the name is derived from panel data). To quote the Pandas documentation:

Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.

Pandas introduces several new data structures like the Series, DataFrame, and Panel that build on top of existing tools like numpy to speed up data analysis tasks. Pandas also provides efficient mechanisms for moving data between in-memory representations and different data formats including CSV and text files, JSON files, SQL databases, HDF5 format files, and even Excel spreadsheets. Pandas also provides support for dealing with missing or incomplete data and aggregating or grouping data.
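The data-movement machinery described above can be sketched with a simple CSV round trip (the file name and values here are arbitrary, chosen only for illustration):

```python
import pandas as pd

# A small example DataFrame (values are arbitrary)
df = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})

# Write to a CSV file and read it back; index=False skips the row labels
df.to_csv('roundtrip.csv', index=False)
df2 = pd.read_csv('roundtrip.csv')

print(df2)
```

The same pattern applies to the other formats via `to_json`/`read_json`, `to_sql`/`read_sql`, and so on.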


Brief introduction to Pandas

Before using Pandas, we must first import the Pandas library:

import pandas as pd

Second, we simply create and use the appropriate Pandas data structure. The two most important data structures for typical data science tasks are the Series and the DataFrame:

  1. Series: a one-dimensional labeled array that can hold any data type such as integers, floating-point numbers, strings, or Python objects. A Series has both a data column and a label column called the index.

  2. DataFrame: a two-dimensional labeled data structure with columns that can hold different data types, similar to a spreadsheet or relational database table.

Pandas also provides a date/time data structure sometimes referred to as a TimeSeries and a three-dimensional data structure known as a Panel.
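A minimal sketch of a date/time-indexed Series, built with pd.date_range (the start date and values here are arbitrary):

```python
import pandas as pd
import numpy as np

# Six consecutive days as the index
dates = pd.date_range('2015-01-01', periods=6, freq='D')
ts = pd.Series(np.arange(6), index=dates)

print(ts)
```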

Series

A Series is useful for holding data that should be accessible via a specific label. You create a Series by passing in an appropriate data set along with an optional index:

values = pd.Series(data, index=idx)

The index varies depending on the type of data passed in to create the Series:

  • if data is a NumPy array, the index should be the same length as the data array. If no index is provided one will be created that enables integer access that mirrors a traditional NumPy array (i.e., zero indexed).

  • if data is a Python dictionary, idx can contain specific labels to indicate which values in the dictionary should be used to create the Series. If no index is specified, an index is created from the sorted dictionary keys.

  • if data is a scalar value, an index must be supplied. In this case, the scalar value will be repeated to ensure that each label in the index has a value in the Series.

These different options are demonstrated in the following code cells.



In [1]:
import pandas as pd
import numpy as np

# We label the random values
s1 = pd.Series(np.random.rand(6), index=['q', 'w', 'e', 'r', 't', 'y'])

print(s1)


q    0.832660
w    0.512689
e    0.516118
r    0.670917
t    0.311595
y    0.888944
dtype: float64

In [2]:
d = {'q': 11, 'w': 21, 'e': 31, 'r': 41}

# We pick out the q, w, and r keys, but have an undefined y key.
s2 = pd.Series(d, index = ['q', 'w', 'r', 'y'])

print(s2)


q    11
w    21
r    41
y   NaN
dtype: float64

In [3]:
# We create a Series from an integer constant with explicit labels
s3 = pd.Series(42, index = ['q', 'w', 'e', 'r', 't', 'y'])

print(s3)

print('\nThe "e" value is ', s3['e'])


q    42
w    42
e    42
r    42
t    42
y    42
dtype: int64

The "e" value is  42

In [4]:
# We can slice like NumPy arrays

print(s1[:-2])

# We can also perform vectorized operations
print('\nSum Series:')
print(s1 + s3)
print('\nSeries operations:')
print(s2 * 5 - 1.2)


q    0.832660
w    0.512689
e    0.516118
r    0.670917
dtype: float64

Sum Series:
q    42.832660
w    42.512689
e    42.516118
r    42.670917
t    42.311595
y    42.888944
dtype: float64

Series operations:
q     53.8
w    103.8
r    203.8
y      NaN
dtype: float64
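As the output above shows, operations on Series with mismatched indices produce NaN for labels that appear in only one operand. A minimal sketch of handling these missing entries (the fill value 0.0 is an arbitrary choice):

```python
import pandas as pd

d = {'q': 11, 'w': 21, 'e': 31, 'r': 41}
s2 = pd.Series(d, index=['q', 'w', 'r', 'y'])  # no 'y' key, so its value is NaN

result = s2 * 5 - 1.2

# Replace missing values with a default, or drop them entirely
print(result.fillna(0.0))
print(result.dropna())
```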

DataFrame

The second major data structure that Pandas provides is the DataFrame, which is a two-dimensional array, where each column is effectively a Series with a shared index. A DataFrame is a very powerful data structure and provides a nice mapping for a number of different data formats (and storage mechanisms). For example, you can easily read data from a CSV file, a fixed width format text file, a JSON file, an HTML file, an HDF file, and a relational database into a Pandas DataFrame. This is demonstrated in the next set of code cells, where we extract data from files we created in the Introduction to Data Formats Notebook.



In [5]:
# Read data from CSV file, and display subset

dfa = pd.read_csv('data.csv', delimiter='|', index_col='iata')

# We can grab the first five rows, and only extract the 'airport' column
print(dfa[['airport', 'city', 'state']].head(5))


                   airport              city state
iata                                              
00M               Thigpen        Bay Springs    MS
00R   Livingston Municipal        Livingston    TX
00V            Meadow Lake  Colorado Springs    CO
01G           Perry-Warsaw             Perry    NY
01J       Hilliard Airpark          Hilliard    FL

In [6]:
# Read data from our JSON file
dfb = pd.read_json('data.json')

# Grab the last five rows
print(dfb[[0, 1, 2, 3, 5, 6]].tail(5))


        0                          1            2   3            5  \
3372  ZEF            Elkin Municipal        Elkin  NC  36.28002361   
3373  ZER  Schuylkill Cty/Joe Zerbey   Pottsville  PA  40.70644889   
3374  ZPH      Zephyrhills Municipal  Zephyrhills  FL  28.22806472   
3375  ZUN                 Black Rock         Zuni  NM  35.08322694   
3376  ZZV       Zanesville Municipal   Zanesville  OH  39.94445833   

                 6  
3372  -80.78606861  
3373  -76.37314667  
3374  -82.15591639  
3375  -108.7917769  
3376  -81.89210528  

In the previous code cells, we read data first from a delimiter separated value file and second from a JSON file into a Pandas DataFrame. In each code cell, we display data contained in the new DataFrame, first by using the head method to display the first few rows, and second by using the tail method to display the last few rows. For the delimiter separated value file, we explicitly specified the delimiter, which is a vertical bar |; the default is to assume a comma as the delimiter. We also explicitly specify that the iata column should be used as the index column, which is how we can refer to rows in the array.

We also explicitly select columns for display in both code cells. In the first code cell, we explicitly name the columns, passing in a list of the names to the DataFrame. Alternatively, in the second code cell, we pass in a list of the column ids, which we must do since we did not create named columns when reading data from the JSON file. The list of integers can be used even if the columns of the array have been assigned names.
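When a file format does not carry column names, they can also be assigned after the read. A minimal sketch, using a small constructed DataFrame standing in for the JSON-derived dfb above (the rows shown are copied from the earlier output):

```python
import pandas as pd

# A small stand-in for the JSON-derived DataFrame above,
# with default integer column labels
dfb = pd.DataFrame([['ZEF', 'Elkin Municipal', 'NC'],
                    ['ZZV', 'Zanesville Municipal', 'OH']])

# Assign meaningful names to the columns after the fact
dfb.columns = ['iata', 'airport', 'state']

print(dfb[['iata', 'state']])
```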

Pandas includes a tremendous amount of functionality, especially for the DataFrame; to learn more, view the detailed documentation. Several useful functions are demonstrated below, however, including information summaries, slicing, and column operations on DataFrames.



In [7]:
# Lets look at the datatypes of each column
dfa.dtypes


Out[7]:
airport     object
city        object
state       object
country     object
lat        float64
long       float64
dtype: object

In [8]:
# We can get a summary of numerical information in the dataframe

dfa.describe()


Out[8]:
               lat         long
count  3376.000000  3376.000000
mean     40.036524   -98.621205
std       8.329559    22.869458
min       7.367222  -176.646031
25%      34.688427  -108.761121
50%      39.434449   -93.599425
75%      43.372612   -84.137519
max      71.285448   145.621384

In [9]:
# Notice the JSON data did not automatically specify data types
dfb.dtypes


Out[9]:
0    object
1    object
2    object
3    object
4    object
5    object
6    object
dtype: object

In [10]:
# This affects the output of the describe method, dfb has no numerical data types.

dfb.describe()


Out[10]:
           0               1     2     3     4            5             6
count   3377            3377  3377  3377  3377         3377          3377
unique  3377            3246  2677    58     6         3376          3376
top      KLG  Jackson County    NA    AK   USA  41.61033333  -88.91561611
freq       1               5    12   263  3372            2             2

In [11]:
# We can slice out rows using the indicated index for dfa

print(dfa.loc[['00V', '11R', '12C']])


                airport              city state country        lat        long
00V         Meadow Lake  Colorado Springs    CO     USA  38.945749 -104.569893
11R   Brenham Municipal           Brenham    TX     USA  30.219000  -96.374278
12C  Rochelle Municipal          Rochelle    IL     USA  41.893001  -89.078290

In [12]:
# We can slice out rows using the row index for dfb
print(dfb[100:105])


       0                   1         2   3    4            5             6
100  11J        Early County   Blakely  GA  USA  31.39698611  -84.89525694
101  11R   Brenham Municipal   Brenham  TX  USA       30.219  -96.37427778
102  12C  Rochelle Municipal  Rochelle  IL  USA  41.89300139     -89.07829
103  12D     Tower Municipal     Tower  MN  USA  47.81833333  -92.29166667
104  12J   Brewton Municipal   Brewton  AL  USA  31.05126306  -87.06796833

In [13]:
# We can also select rows based on boolean tests on columns
print(dfa[(dfa.lat > 48) & (dfa.long < -170)])


       airport      city state country        lat        long
iata                                                         
ADK       Adak      Adak    AK     USA  51.877964 -176.646031
AKA       Atka      Atka    AK     USA  52.220348 -174.206350
GAM    Gambell   Gambell    AK     USA  63.766766 -171.732824
SNP   St. Paul  St. Paul    AK     USA  57.167333 -170.220444
SVA   Savoonga  Savoonga    AK     USA  63.686394 -170.492636

We can also perform numerical operations on a DataFrame, just as was the case with NumPy arrays. To demonstrate this, we create a numerical DataFrame, apply different operations, and view the results.



In [14]:
df = pd.DataFrame(np.random.randn(5, 6))

print(df)


          0         1         2         3         4         5
0  0.504598 -1.761699  0.485144  1.424999 -0.732168  1.917617
1  0.069418 -1.599813 -0.911061 -0.208813  0.212493 -0.943686
2  0.487085  0.967983  1.073010 -0.726211  0.965202 -1.327939
3 -0.775296  0.043955 -0.734958 -0.257474  0.741533  1.819673
4 -0.944408  0.070267  0.167863  0.837726  0.195545  0.475303

In [15]:
# We can operate with basic scalar values

df + 2.5


Out[15]:
          0         1         2         3         4         5
0  3.004598  0.738301  2.985144  3.924999  1.767832  4.417617
1  2.569418  0.900187  1.588939  2.291187  2.712493  1.556314
2  2.987085  3.467983  3.573010  1.773789  3.465202  1.172061
3  1.724704  2.543955  1.765042  2.242526  3.241533  4.319673
4  1.555592  2.570267  2.667863  3.337726  2.695545  2.975303

In [16]:
# And perform more complex scalar operations

-1.0 * df + 3.5


Out[16]:
          0         1         2         3         4         5
0  2.995402  5.261699  3.014856  2.075001  4.232168  1.582383
1  3.430582  5.099813  4.411061  3.708813  3.287507  4.443686
2  3.012915  2.532017  2.426990  4.226211  2.534798  4.827939
3  4.275296  3.456045  4.234958  3.757474  2.758467  1.680327
4  4.444408  3.429733  3.332137  2.662274  3.304455  3.024697

In [17]:
# We can also apply vectorized functions

np.sin(df)


Out[17]:
          0         1         2         3         4         5
0  0.483455 -0.981833  0.466336  0.989390 -0.668483  0.940458
1  0.069362 -0.999579 -0.790154 -0.207299  0.210898 -0.809726
2  0.468052  0.823744  0.878642 -0.664042  0.822164 -0.970655
3 -0.699928  0.043940 -0.670556 -0.254639  0.675419  0.969190
4 -0.810150  0.070209  0.167076  0.743123  0.194301  0.457608

In [18]:
# We can transpose the dataframe

df.T


Out[18]:
          0         1         2         3         4
0  0.504598  0.069418  0.487085 -0.775296 -0.944408
1 -1.761699 -1.599813  0.967983  0.043955  0.070267
2  0.485144 -0.911061  1.073010 -0.734958  0.167863
3  1.424999 -0.208813 -0.726211 -0.257474  0.837726
4 -0.732168  0.212493  0.965202  0.741533  0.195545
5  1.917617 -0.943686 -1.327939  1.819673  0.475303

The above description merely scratches the surface of what you can do with a Pandas Series or a DataFrame. The best way to learn how to effectively use these data structures is to just do it!
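For example, the grouping and aggregation support mentioned in the Introduction can be sketched as follows (the states and latitudes here are made up, loosely modeled on the airport data above):

```python
import pandas as pd

# A toy DataFrame loosely modeled on the airport data above
df = pd.DataFrame({'state': ['AK', 'AK', 'TX', 'TX', 'TX'],
                   'lat': [51.9, 52.2, 30.2, 31.1, 29.5]})

# Group rows by state, then aggregate each group
print(df.groupby('state')['lat'].mean())
print(df.groupby('state').size())
```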

We now will change gears and explore how to use Pandas with a relational database.


Additional References

  1. Pandas Documentation
  2. A slightly dated Pandas tutorial