One of the early criticisms of the Python language from many in the data science community was its lack of useful data structures for performing data analysis tasks. This stemmed in part from comparisons between the R language and Python, since R has a built-in DataFrame object that greatly simplifies many data analysis tasks. This deficiency was addressed in 2008 by Wes McKinney with the creation of Pandas (the name was originally derived from "panel data"). To quote the Pandas documentation:
Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.
Pandas introduces several new data structures, such as the Series, DataFrame, and Panel, that build on top of existing tools like NumPy to speed up data analysis tasks. Pandas also provides efficient mechanisms for moving data between in-memory representations and different data formats, including CSV and text files, JSON files, SQL databases, HDF5 format files, and even Excel spreadsheets. Pandas also provides support for dealing with missing or incomplete data and for aggregating or grouping data.
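The missing-data and grouping support mentioned above can be sketched with a small, hypothetical table (the column names and values here are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical flight-delay table with one missing value
df = pd.DataFrame({'state': ['CO', 'CO', 'TX'],
                   'delay': [5.0, np.nan, 7.0]})

# Replace missing values with a default
filled = df['delay'].fillna(0)
print(filled)

# Aggregate by group; NaN values are skipped when computing the mean
means = df.groupby('state')['delay'].mean()
print(means)
```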
Before using Pandas, we must first import the Pandas library:
import pandas as pd
Second, we simply create and use the appropriate Pandas data structure. The two most important data structures for typical data science tasks are the Series and the DataFrame:
Series: a one-dimensional labeled array that can hold any data type, such as integers, floating-point numbers, strings, or Python objects. A Series has both a data column and a label column called the index.
DataFrame: a two-dimensional labeled data structure with columns that can hold different data types, similar to a spreadsheet or relational database table.
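As a quick illustration of the DataFrame just described, one can build a small frame from a dict of columns (the airport names and elevations below are made up for illustration):

```python
import pandas as pd

# Each dict key becomes a column; columns may hold different types
df = pd.DataFrame({'airport': ['Meadow Lake', 'Brenham Muni'],
                   'elevation': [6880, 307]})
print(df)
```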
Pandas also provides a date/time data structure sometimes referred to as a TimeSeries, and a three-dimensional data structure known as a Panel. (The Panel has since been deprecated in newer versions of Pandas in favor of multi-indexed DataFrames.)
Series
A Series is useful for holding data that should be accessible by using a specific label. You create a Series by passing in an appropriate data set along with an optional index:

values = pd.Series(data, index=idx)

The index varies depending on the type of data passed in to create the Series:
if data is a NumPy array, the index should be the same length as the data array. If no index is provided, one will be created that enables integer access mirroring a traditional NumPy array (i.e., zero-indexed).
if data is a Python dictionary, idx can contain specific labels to indicate which values in the dictionary should be used to create the Series. If no index is specified, an index is created from the sorted dictionary keys (newer versions of Pandas instead preserve the dictionary's insertion order).
if data is a scalar value, an index must be supplied. In this case, the scalar value will be repeated to ensure that each label in the index has a value in the Series.
These different options are demonstrated in the following code cells.
In [1]:
import pandas as pd
import numpy as np
# We label the random values
s1 = pd.Series(np.random.rand(6), index=['q', 'w', 'e', 'r', 't', 'y'])
print(s1)
In [2]:
d = {'q': 11, 'w': 21, 'e': 31, 'r': 41}
# We pick out the q, w, and r keys, but have an undefined y key.
s2 = pd.Series(d, index = ['q', 'w', 'r', 'y'])
print(s2)
In [3]:
# We create a Series from an integer constant with explicit labels
s3 = pd.Series(42, index = ['q', 'w', 'e', 'r', 't', 'y'])
print(s3)
print('\nThe "e" value is ', s3['e'])
In [4]:
# We can slice like NumPy arrays
print(s1[:-2])
# We can also perform vectorized operations
print('\nSum Series:')
print(s1 + s3)
print('\nSeries operations:')
print(s2 * 5 - 1.2)
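One detail worth noting about the arithmetic above: Series operations align on the index labels, not on position. A minimal sketch (with made-up labels) shows that labels present in only one operand produce NaN:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['q', 'w', 'e'])
b = pd.Series([10, 20, 30], index=['w', 'e', 'r'])

# Values are added label-by-label; 'q' and 'r' appear in only one Series
c = a + b
print(c)
```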
DataFrame
The second major data structure that Pandas provides is the DataFrame, which is a two-dimensional array where each column is effectively a Series with a shared index. A DataFrame is a very powerful data structure and provides a nice mapping for a number of different data formats (and storage mechanisms). For example, you can easily read data from a CSV file, a fixed-width format text file, a JSON file, an HTML file, an HDF file, or a relational database into a Pandas DataFrame. This is demonstrated in the next set of code cells, where we extract data from files we created in the Introduction to Data Formats Notebook.
In [5]:
# Read data from CSV file, and display subset
dfa = pd.read_csv('data.csv', delimiter='|', index_col='iata')
# We can grab the first five rows, and only extract the 'airport', 'city', and 'state' columns
print(dfa[['airport', 'city', 'state']].head(5))
In [6]:
# Read data from our JSON file
dfb = pd.read_json('data.json')
# Grab the last five rows
print(dfb[[0, 1, 2, 3, 5, 6]].tail(5))
In the previous code cells, we read data first from a delimiter-separated value file and second from a JSON file into a Pandas DataFrame. In each code cell, we display data contained in the new DataFrame, first by using the head method to display the first few rows, and second by using the tail method to display the last few rows. For the delimiter-separated value file, we explicitly specified the delimiter, which is a vertical bar (|); the default is to assume a comma as the delimiter. We also explicitly specified that the iata column should be used as the index column, which is how we refer to rows in the DataFrame.
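The read_csv options just described can be sketched on an in-memory sample; the rows below are hypothetical and only mimic the layout of the airport data file:

```python
import pandas as pd
from io import StringIO

csv_text = ("iata|airport|city|state\n"
            "00V|Meadow Lake|Colorado Springs|CO\n"
            "11R|Brenham Muni|Brenham|TX\n")

# The vertical bar is the delimiter; the iata column becomes the row index
df = pd.read_csv(StringIO(csv_text), delimiter='|', index_col='iata')
print(df.loc['00V', 'city'])
```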
We also explicitly select columns for display in both code cells. In the first code cell, we name the columns explicitly, passing a list of the names to the DataFrame. In the second code cell, we instead pass a list of integer column labels, since we did not assign column names when reading the JSON file and Pandas labeled the columns with integers by default. When columns do have names, positional selection requires the iloc indexer rather than a list of integers.
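The difference between label-based and positional column selection can be sketched as follows (the frame here is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'airport': ['Meadow Lake'],
                   'city': ['Colorado Springs'],
                   'state': ['CO']})

# Label-based selection uses the column names directly
by_name = df[['airport', 'state']]
print(by_name)

# Positional selection on named columns goes through iloc
by_position = df.iloc[:, [0, 2]]
print(by_position)
```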
Pandas includes a tremendous amount of functionality, especially for the DataFrame; to learn more, view the detailed documentation. Several useful functions are demonstrated below, including information summaries, slicing, and column operations on DataFrames.
In [7]:
# Let's look at the datatypes of each column
dfa.dtypes
Out[7]:
In [8]:
# We can get a summary of numerical information in the dataframe
dfa.describe()
Out[8]:
In [9]:
# Notice the JSON data did not automatically specify data types
dfb.dtypes
Out[9]:
In [10]:
# This affects the output of the describe method, since dfb has no numerical data types.
dfb.describe()
Out[10]:
In [11]:
# We can slice out rows using the indicated index for dfa
print(dfa.loc[['00V', '11R', '12C']])
In [12]:
# We can slice out rows using the row index for dfb
print(dfb[100:105])
In [13]:
# We can also select rows based on boolean tests on columns
print(dfa[(dfa.lat > 48) & (dfa.long < -170)])
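Boolean row selection like the cell above can also be sketched on a small, self-contained frame (the coordinates here are invented):

```python
import pandas as pd

df = pd.DataFrame({'lat': [47.0, 52.0, 61.0],
                   'long': [-122.0, -175.0, -150.0]},
                  index=['SEA', 'ADK', 'ANC'])

# Combine column tests with & (note the parentheses around each test)
subset = df[(df.lat > 48) & (df.long < -170)]
print(subset)
```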
We can also perform numerical operations on a DataFrame, just as was the case with NumPy arrays. To demonstrate this, we create a numerical DataFrame, apply different operations, and view the results.
In [14]:
df = pd.DataFrame(np.random.randn(5, 6))
print(df)
In [15]:
# We can operate with basic scalar values
df + 2.5
Out[15]:
In [16]:
# And perform more complex scalar operations
-1.0 * df + 3.5
Out[16]:
In [17]:
# We can also apply vectorized functions
np.sin(df)
Out[17]:
In [18]:
# We can transpose the dataframe
df.T
Out[18]:
The above description merely scratches the surface of what you can do with a Pandas Series or a DataFrame. The best way to learn how to use these data structures effectively is to just do it!
We now will change gears and explore how to use Pandas with a relational database.