One of the early criticisms of the Python language from many in the data science community was its lack of useful data structures for performing data analysis tasks. This stemmed in part from comparisons between the R language and Python, since R has a built-in DataFrame object that greatly simplifies many data analysis tasks. This deficiency was addressed in 2008 by Wes McKinney with the creation of Pandas (the name was originally derived from "panel data"). To quote the Pandas documentation:
Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.
Pandas introduces several new data structures, such as the Series, DataFrame, and Panel, that build on top of existing tools like NumPy to speed up data analysis tasks. Pandas also provides efficient mechanisms for moving data between in-memory representations and different data formats, including CSV and text files, JSON files, SQL databases, HDF5 format files, and even Excel spreadsheets. Pandas also provides support for dealing with missing or incomplete data and for aggregating or grouping data.
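The missing-data and grouping support mentioned above can be sketched with a small, hypothetical table (the column names and values here are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical flight-delay table with one missing value
df = pd.DataFrame({'state': ['CO', 'CO', 'TX'],
                   'delay': [5.0, np.nan, 7.0]})

# Replace missing values with a default
filled = df['delay'].fillna(0)
print(filled)

# Aggregate by group; NaN values are skipped when computing the mean
means = df.groupby('state')['delay'].mean()
print(means)
```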
Before using Pandas, we must first import the Pandas library:
import pandas as pd
Second, we simply create and use the appropriate Pandas data structure. The two most important data structures for typical data science tasks are the Series and the DataFrame:
Series: a one-dimensional labeled array that can hold any data type, such as integers, floating-point numbers, strings, or Python objects. A Series has both a data column and a label column called the index.
DataFrame: a two-dimensional labeled data structure with columns that can hold different data types, similar to a spreadsheet or relational database table.
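As a quick illustration of the DataFrame just described, one can build a small frame from a dict of columns (the airport names and elevations below are made up for illustration):

```python
import pandas as pd

# Each dict key becomes a column; columns may hold different types
df = pd.DataFrame({'airport': ['Meadow Lake', 'Brenham Muni'],
                   'elevation': [6880, 307]})
print(df)
```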
Pandas also provides a date/time data structure sometimes referred to as a TimeSeries, and a three-dimensional data structure known as a Panel. (The Panel has since been deprecated in newer versions of Pandas in favor of multi-indexed DataFrames.)
Series
A Series is useful for holding data that should be accessible by using a specific label. You create a Series by passing in an appropriate data set along with an optional index:

values = pd.Series(data, index=idx)

The index varies depending on the type of data passed in to create the Series:
if data is a NumPy array, the index should be the same length as the data array. If no index is provided, one will be created that enables integer access mirroring a traditional NumPy array (i.e., zero-indexed).
if data is a Python dictionary, idx can contain specific labels to indicate which values in the dictionary should be used to create the Series. If no index is specified, an index is created from the sorted dictionary keys (newer versions of Pandas instead preserve the dictionary's insertion order).
if data is a scalar value, an index must be supplied. In this case, the scalar value will be repeated to ensure that each label in the index has a value in the Series.
These different options are demonstrated in the following code cells.
In [1]:
import pandas as pd
import numpy as np
# We label the random values
s1 = pd.Series(np.random.rand(6), index=['q', 'w', 'e', 'r', 't', 'y'])
print(s1)
In [2]:
d = {'q': 11, 'w': 21, 'e': 31, 'r': 41}
# We pick out the q, w, and r keys, but have an undefined y key.
s2 = pd.Series(d, index = ['q', 'w', 'r', 'y'])
print(s2)
In [3]:
# We create a Series from an integer constant with explicit labels
s3 = pd.Series(42, index = ['q', 'w', 'e', 'r', 't', 'y'])
print(s3)
print('\nThe "e" value is ', s3['e'])
In [4]:
# We can slice like NumPy arrays
print(s1[:-2])
# We can also perform vectorized operations
print('\nSum Series:')
print(s1 + s3)
print('\nSeries operations:')
print(s2 * 5 - 1.2)
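One detail worth noting about the arithmetic above: Series operations align on the index labels, not on position. A minimal sketch (with made-up labels) shows that labels present in only one operand produce NaN:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['q', 'w', 'e'])
b = pd.Series([10, 20, 30], index=['w', 'e', 'r'])

# Values are added label-by-label; 'q' and 'r' appear in only one Series
c = a + b
print(c)
```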
DataFrame
The second major data structure that Pandas provides is the DataFrame, which is a two-dimensional array where each column is effectively a Series with a shared index. A DataFrame is a very powerful data structure and provides a nice mapping for a number of different data formats (and storage mechanisms). For example, you can easily read data from a CSV file, a fixed-width format text file, a JSON file, an HTML file, an HDF file, or a relational database into a Pandas DataFrame. This is demonstrated in the next set of code cells, where we extract data from files we created in the Introduction to Data Formats Notebook.
In [5]:
# Read data from CSV file, and display subset
dfa = pd.read_csv('data.csv', delimiter='|', index_col='iata')
# We can grab the first five rows, and only extract the 'airport', 'city', and 'state' columns
print(dfa[['airport', 'city', 'state']].head(5))
In [6]:
# Read data from our JSON file
dfb = pd.read_json('data.json')
# Grab the last five rows
print(dfb[[0, 1, 2, 3, 5, 6]].tail(5))
In the previous code cells, we read data first from a delimiter-separated value file and second from a JSON file into a Pandas DataFrame. In each code cell, we display data contained in the new DataFrame, first by using the head method to display the first few rows, and second by using the tail method to display the last few rows. For the delimiter-separated value file, we explicitly specified the delimiter, which is a vertical bar (|); the default is to assume a comma as the delimiter. We also explicitly specified that the iata column should be used as the index column, which is how we refer to rows in the DataFrame.
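The read_csv options just described can be sketched on an in-memory sample; the rows below are hypothetical and only mimic the layout of the airport data file:

```python
import pandas as pd
from io import StringIO

csv_text = ("iata|airport|city|state\n"
            "00V|Meadow Lake|Colorado Springs|CO\n"
            "11R|Brenham Muni|Brenham|TX\n")

# The vertical bar is the delimiter; the iata column becomes the row index
df = pd.read_csv(StringIO(csv_text), delimiter='|', index_col='iata')
print(df.loc['00V', 'city'])
```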
We also explicitly select columns for display in both code cells. In the first code cell, we name the columns explicitly, passing a list of the names to the DataFrame. In the second code cell, we instead pass a list of integer column labels, since we did not assign column names when reading the JSON file and Pandas labeled the columns with integers by default. When columns do have names, positional selection requires the iloc indexer rather than a list of integers.
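The difference between label-based and positional column selection can be sketched as follows (the frame here is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'airport': ['Meadow Lake'],
                   'city': ['Colorado Springs'],
                   'state': ['CO']})

# Label-based selection uses the column names directly
by_name = df[['airport', 'state']]
print(by_name)

# Positional selection on named columns goes through iloc
by_position = df.iloc[:, [0, 2]]
print(by_position)
```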
Pandas includes a tremendous amount of functionality, especially for the DataFrame; to learn more, view the detailed documentation. Several useful functions are demonstrated below, including information summaries, slicing, and column operations on DataFrames.
In [7]:
# Let's look at the datatypes of each column
dfa.dtypes
Out[7]:
In [8]:
# We can get a summary of numerical information in the dataframe
dfa.describe()
Out[8]:
In [9]:
# Notice the JSON data did not automatically specify data types
dfb.dtypes
Out[9]:
In [10]:
# This affects the output of the describe method, since dfb has no numerical data types.
dfb.describe()
Out[10]:
In [11]:
# We can slice out rows using the indicated index for dfa
print(dfa.loc[['00V', '11R', '12C']])
In [12]:
# We can slice out rows using the row index for dfb
print(dfb[100:105])
In [13]:
# We can also select rows based on boolean tests on columns
print(dfa[(dfa.lat > 48) & (dfa.long < -170)])
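Boolean row selection like the cell above can also be sketched on a small, self-contained frame (the coordinates here are invented):

```python
import pandas as pd

df = pd.DataFrame({'lat': [47.0, 52.0, 61.0],
                   'long': [-122.0, -175.0, -150.0]},
                  index=['SEA', 'ADK', 'ANC'])

# Combine column tests with & (note the parentheses around each test)
subset = df[(df.lat > 48) & (df.long < -170)]
print(subset)
```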
We can also perform numerical operations on a DataFrame, just as was the case with NumPy arrays. To demonstrate this, we create a numerical DataFrame, apply different operations, and view the results.
In [14]:
df = pd.DataFrame(np.random.randn(5, 6))
print(df)
In [15]:
# We can operate with basic scalar values
df + 2.5
Out[15]:
In [16]:
# And perform more complex scalar operations
-1.0 * df + 3.5
Out[16]:
In [17]:
# We can also apply vectorized functions
np.sin(df)
Out[17]:
In [18]:
# We can transpose the dataframe
df.T
Out[18]:
The above description merely scratches the surface of what you can do with a Pandas Series or a DataFrame. The best way to learn how to use these data structures effectively is to just do it!
We now will change gears and explore how to use Pandas with a relational database.