What is Pandas?

One of the best options for working with tabular data in Python is the Python Data Analysis Library (a.k.a. Pandas). The Pandas library provides data structures, produces high-quality plots with matplotlib, and integrates nicely with other libraries that use NumPy arrays.

Like NumPy, Pandas introduces some new data types to our Python world, the most important of which is the DataFrame.

Briefly, a DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating-point values, factors and more) in columns. It is similar to a spreadsheet, an SQL table, or R's data.frame. A DataFrame always has an index (0-based by default), which refers to the position of a row in the data structure.

Once we load our data into a DataFrame, Pandas gives us a host of analytical capabilities. Among these are the abilities to:

  • subset/select/query specific rows and or columns of data
  • handling missing data
  • grouping/aggregating data
  • sorting, transforming, pivoting, and "melting" data
  • computing descriptive and summary stats
  • combining data in different DataFrames
  • plotting data

In short, Pandas and its DataFrame object are essentially a one-stop shop for anything data.

Learning by Diving In!

In this notebook, we'll dive in with some examples and then explain what's going on. The highlights include:

  • Creating DataFrames programmatically and then by reading in data; and in doing so reviewing the structure and key elements of a DataFrame
  • Examining key properties and methods of a DataFrame that quickly reveal information about our DataFrame: the number of rows and columns, what types of data it stores, quick summary statistics, etc.
  • Summarizing the data in a DataFrame, including how to quickly count and list unique items in a column.
  • Examining the various ways to handle missing data in a DataFrame
  • Grouping data
  • Subsetting observations (rows) and variables (columns)

It all starts by importing the package...


In [1]:
#Import the package
import pandas as pd

The DataFrame

We'll begin by exploring the key elements of the DataFrame object. Some notions are self-evident, e.g., data are stored in rows and columns, much like a spreadsheet. Others are more nuanced: implicit and explicit indices, copies vs. views, and some others.

Let's begin examining the components of DataFrames by examining two ways they can be created.

DataFrame as a list of lists

First, a DataFrame can be thought of as a list of lists. Below we see an example with 4 sub-lists, each containing 3 items (e.g. the first list, ['Joe', 22, True]). Each of these 4 sub-lists becomes a row in the resulting DataFrame, and each item in a given sub-list becomes a column value.


In [52]:
#Creating a simple data frame as a list of lists
df = pd.DataFrame([['Joe',22,True],['Bob',25,False],['Sue',28,False],['Ken',24,True]],
                  index = [10,20,40,30],
                  columns = ['Name','Age','IsStudent']
                 )
#Display the resulting data frame
df


Out[52]:
Name Age IsStudent
10 Joe 22 True
20 Bob 25 False
40 Sue 28 False
30 Ken 24 True

A few key points here:

  • First is that each sub-list has the same number of elements (3) and the same data types in each position as the other sub-lists. Otherwise we'd end up with missing data or "coerced" data types.
  • Second is that we also explicitly specify an index for the rows (index = [10,20,40,30]). The index allows us to identify a specific row.
  • Likewise, we explicitly set column names with columns = ['Name','Age','IsStudent'], and yes, these allow us to identify specific columns in our DataFrame.
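The first point is easy to demonstrate: with the list-of-lists form, sub-lists of unequal length don't raise an error; pandas pads the short rows with NaN. A minimal sketch with made-up values:

```python
import pandas as pd

# The second sub-list is missing its third item, so pandas fills it with NaN
ragged = pd.DataFrame([['Joe', 22, True], ['Bob', 25]],
                      columns=['Name', 'Age', 'IsStudent'])
print(ragged)
print(ragged.isna().sum().sum())  # 1 -- one missing value
```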

DataFrame as a collection of dictionaries

Another way to build (and think of) a DataFrame is as a set of dictionaries, where each dictionary is a column of data: the dictionary's key is the column name and its value is a list of values:


In [53]:
#Creating a data frame as dictionaries of lists
df = pd.DataFrame({"Name":['Joe','Bob','Sue','Ken'],
                   "Age":[22,25,28,24],
                   "IsStudent":[True,False,False,True]},
                  index = [10,20,40,30]
                 )
df


Out[53]:
Age IsStudent Name
10 22 True Joe
20 25 False Bob
40 28 False Sue
30 24 True Ken

While the order of columns has changed (here they are alphabetized), the data are the same as in the previous DataFrame even though it was created differently. This is just another way to think of the data underlying a DataFrame.

Of note here is that each dictionary has the same number of elements, and the order of those elements matters: otherwise the values in one dictionary/column would lose their proper correspondence with the values in the others.
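The length requirement is stricter in this form: where the list-of-lists form pads missing items with NaN, columns of different lengths here are rejected outright. A quick sketch:

```python
import pandas as pd

# Columns built from lists of different lengths raise a ValueError
try:
    pd.DataFrame({"Name": ['Joe', 'Bob'], "Age": [22]})
except ValueError as err:
    print("ValueError:", err)
```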

Eh, so what...

What does this reveal? List of lists vs. set of dictionaries? Well, it explains how you can extract elements from the DataFrame. Thinking of a DataFrame as a list of lists, getting the value in the 3rd column of the 2nd row is equivalent to getting the 3rd item from the 2nd sub-list.

We can get that value using the DataFrame's iloc indexer (short for integer location), passing the row and column of the location we want.


In [37]:
#Get the 3rd item from the 2nd row; recall Python is zero-based
print(df.iloc[1,2])


Bob

And if we hop over to thinking of a DataFrame as a set of dictionaries, we can target a specific value by specifying the index of the value (row) from the dictionary (column) we want. The row, however, is referred to by the index we assigned, not its implicit index generated by the order in which it was entered.


In [57]:
#Get the value in the 'Name' column corresponding to the row with an index of '20'
print(df['Name'][20])


Bob

We'll return to how we extract data from a DataFrame, but for now just soak in the fact that values in a DataFrame can be referenced by their implicit location (i.e. their row, column coordinates) and by their explicit column name and row index.
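Both access styles can be sketched side by side with the toy DataFrame above. (pandas also provides .loc, the label-based counterpart of .iloc; note that in a modern pandas, dictionary insertion order is preserved, so 'Name' is column 0 here.)

```python
import pandas as pd

df = pd.DataFrame({"Name": ['Joe', 'Bob', 'Sue', 'Ken'],
                   "Age": [22, 25, 28, 24],
                   "IsStudent": [True, False, False, True]},
                  index=[10, 20, 40, 30])

print(df.iloc[1, 0])       # implicit positions: 2nd row, 1st column -> 'Bob'
print(df.loc[20, 'Name'])  # explicit labels: row index 20, column 'Name' -> 'Bob'
print(df['Name'][20])      # column first, then row label -> 'Bob'
```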

Creating a DataFrame from a CSV file

More than likely we'll be reading in data rather than entering it manually, so let's review how files are read into a Pandas DataFrame. Pandas can read many other formats: Excel files, HTML tables, JSON, etc. But let's concentrate on the simplest one, the CSV file, and discuss the key parameters involved.
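A few of the parameters that come up most often can be sketched with an in-memory CSV (the column names below are made up to mirror the survey data; a real call usually just takes a file path):

```python
import pandas as pd
from io import StringIO

# A tiny stand-in for a CSV file on disk
csv_text = "record_id,species_id,weight\n1,NL,32\n2,DM,\n"

df = pd.read_csv(StringIO(csv_text),
                 sep=',',                # field delimiter (',' is the default)
                 header=0,               # row to use for the column names
                 index_col='record_id',  # column to use as the row index
                 na_values=[''])         # extra strings to treat as missing

print(df.shape)                   # (2, 2)
print(df['weight'].isna().sum())  # 1 -- the empty weight became NaN
```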

In the Data folder within our workspace is a file named surveys.csv which holds the data we'll use. If you're curious, this dataset is part of the Portal Teaching data, a subset of the data from Ernst et al., Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA.

The dataset is stored as a .csv file: each row holds information for a single animal, and the columns represent:

Column           Description
record_id        unique ID for the observation
month            month of observation
day              day of observation
year             year of observation
plot_id          ID of a particular plot
species_id       2-letter species code
sex              sex of the animal ("M", "F")
hindfoot_length  length of the hindfoot in mm
weight           weight of the animal in grams

Below, we read in this file, saving the contents to the variable surveys_df.


In [61]:
#Read in the surveys.csv file
surveys_df = pd.read_csv('../data/surveys.csv')

Exploring our Species Survey Data


We can print the entire contents of the data frame by just calling the object.

Remember that in Jupyter notebooks, we can toggle the output by clicking the lightly shaded area to the left of it...


In [ ]:
#Show the data frame's contents
surveys_df

At the bottom of the [long] output above, we see that the data include 35,549 rows and 9 columns.

The first column is the index of the DataFrame. The index is used to identify the position of the data, but it is not an actual column of the DataFrame. It looks like the read_csv function in Pandas read our file properly.


Now, as we often do, let's look at the type of the object we just created.


In [62]:
#Show the object type of the object we just created
type(surveys_df)


Out[62]:
pandas.core.frame.DataFrame

As expected, it’s a DataFrame (or, to use the full name that Python uses to refer to it internally, a pandas.core.frame.DataFrame).

What kind of things does surveys_df contain?

DataFrames have an attribute called dtypes that answers this:


In [63]:
#Show the data types of the columns in our data frame
surveys_df.dtypes


Out[63]:
record_id            int64
month                int64
day                  int64
year                 int64
plot_id              int64
species_id          object
sex                 object
hindfoot_length    float64
weight             float64
dtype: object

All the values in a column have the same type. For example, months have type int64, which is a kind of integer. Cells in the month column cannot have fractional values, but the weight and hindfoot_length columns can, because they have type float64. The object type doesn’t have a very helpful name, but in this case it represents strings (such as ‘M’ and ‘F’ in the case of sex).

Useful Ways to View DataFrame Objects in Python

There are many ways to summarize and access the data stored in DataFrames, using attributes and methods provided by the DataFrame object.

To access an attribute, use the DataFrame object name followed by the attribute name df_object.attribute. Using the DataFrame surveys_df and attribute columns, an index of all the column names in the DataFrame can be accessed with surveys_df.columns.

Methods are called in a similar fashion using the syntax df_object.method(). As an example, surveys_df.head() gets the first few rows of the DataFrame surveys_df using the head() method. With a method, we can supply extra information in the parentheses to control its behaviour.
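The distinction can be sketched with the small DataFrame from earlier (so as not to give away the challenge below):

```python
import pandas as pd

df = pd.DataFrame({"Name": ['Joe', 'Bob', 'Sue', 'Ken'],
                   "Age": [22, 25, 28, 24]})

# Attributes: no parentheses -- just stored information
print(df.shape)
print(list(df.columns))  # ['Name', 'Age']

# Methods: called with parentheses and can take arguments
print(df.head(2))        # just the first 2 rows
```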

Let’s look at the data using these.


*Challenge* - DataFrames

Using our DataFrame surveys_df, try out the attributes & methods below to see what they return.

  1. surveys_df.columns
  2. surveys_df.shape Take note of the output of shape - what format does it return the shape of the DataFrame in?
  3. surveys_df.head() Also, what does surveys_df.head(15) do?
  4. surveys_df.tail()

Use the boxes below to type in the above commands and see what they produce.


In [ ]:
# Challenge 1
surveys_df.

In [ ]:
# Challenge 2

In [ ]:
# Challenge 3

Inspecting the Data in a Pandas DataFrame

We’ve read our data into Python. Next, let’s perform some quick summary statistics to learn more about the data that we’re working with. We might want to know how many animals were collected in each plot, or how many of each species were caught. We can perform summary stats quickly using groups. But first we need to figure out what we want to group by.

Let’s begin by exploring the data in our data frame:

First, examine the column names. (Yes,I know we just did that in the Challenge above...)


In [ ]:
# Look at the column names
surveys_df.columns

We can extract one column of data into a new object by referencing that column as shown here:


In [ ]:
speciesIDs = surveys_df['species_id']

Examining the type of this speciesIDs object reveals another Pandas data type: the Series, which is slightly different from a DataFrame...


In [ ]:
type(speciesIDs)

A Series object is a one-dimensional array, much like a NumPy array, with its own set of properties and functions. The values are indexed, allowing us to extract values at a specific row (try: speciesIDs[5]) or a slice of rows (try: speciesIDs[2:7]).
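A quick sketch with a toy Series (made-up species codes, not the survey data) shows both forms:

```python
import pandas as pd

s = pd.Series(['NL', 'DM', 'PE', 'DM', 'NL'])

print(s[2])    # single value at index 2 -> 'PE'
print(s[2:4])  # slice of rows 2 and 3
```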

We can also use the series.nunique() and series.unique() methods to generate a count of unique values in the series and a list of unique values, respectively.


In [ ]:
#Reveal how many unique species_ID values are in the table
speciesIDs.nunique()

In [ ]:
#List the unique values
speciesIDs.unique()

*Challenge* - Counts and Lists from Data

  1. Create a list of unique plot ID’s found in the surveys data. Call it plot_names. How many unique plots are there in the data? How many unique species are in the data?

  2. What is the difference between len(plot_names) and surveys_df['plot_id'].nunique()?


In [ ]:
# Challenge 1

In [ ]:
# Challenge 2

Grouping Data in Pandas

We often want to calculate summary statistics grouped by subsets or attributes within fields of our data. For example, we might want to calculate the average weight of all individuals per plot.

We can calculate basic statistics for all records in a single column using the syntax below:


In [ ]:
surveys_df['weight'].describe()

We can also extract one specific metric if we wish:


In [ ]:
print(" Min:     ", surveys_df['weight'].min())
print(" Max:     ", surveys_df['weight'].max())
print(" Mean:    ", surveys_df['weight'].mean())
print(" Std Dev: ", surveys_df['weight'].std())
print(" Count:   ", surveys_df['weight'].count())

But if we want to summarize by one or more variables, for example sex, we can use Pandas’ .groupby method. Once we’ve created a grouped object, we can quickly calculate summary statistics for each group.


In [65]:
# Group data by sex
grouped_data = surveys_df.groupby('sex')

In [ ]:
# Show just the grouped means
grouped_data.mean()

In [ ]:
# Or, use the describe function to reveal all summary stats for the grouped data
grouped_data.describe()

Challenge - Summary Data

  1. How many recorded individuals are female (F) and how many are male (M)?
  2. What happens when you group by two columns using the following syntax and then grab mean values:
    • grouped_data2 = surveys_df.groupby(['plot_id','sex'])
    • grouped_data2.mean()
  3. Summarize weight values for each plot in your data. HINT: first group the data by plot (e.g. by_plot = surveys_df.groupby('plot_id')); you can then create summary statistics for just one column with by_plot['weight'].describe()

Challenge #3 should reveal:


In [ ]:


In [ ]: