Working With Pandas DataFrames in Python

Source: http://www.datacarpentry.org/python-ecology-lesson/01-starting-with-data/\

One of the best options for working with tabular data in Python is to use the Python Data Analysis Library (a.k.a. Pandas). The Pandas library provides data structures, produces high quality plots with matplotlib and integrates nicely with other libraries that use NumPy arrays.

We begin by importing the pandas library. By convention, we often import pandas with the pd alias.

Reading CSV Data Using Pandas
What is a DataFrame?
Exploring our Species Data
Useful ways to view DataFrame objects in Python
- Challenge
Generating counts and lists from data in a Pandas DataFrame
- Challenge
Groups in Pandas
- Challenge
Quickly creating summary counts in Pandas
- Challenge- Make a list
Basic Math Functions



In [1]:

    
#Import pandas, using the alias 'pd'
import pandas as pd

Reading CSV Data Using Pandas

In the Data folder within our workspace is a file named surveys.csv which holds the data we'll use for our exercises. If you're curious, this dataset is part of the Portal Teaching data, a subset of the data from Ernst et al Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA.

We are studying the species and weight of animals caught in plots in our study area. The dataset is stored as a .csv file: each row holds information for a single animal, and the columns represent:

Column	Description
record_id	Unique id for the observation
month	month of observation
day	day of observation
year	year of observation
plot_id	ID of a particular plot
species_id	2-letter code
sex	sex of animal (“M”, “F”)
hindfoot_length	length of the hindfoot in mm
weight	weight of the animal in grams

So what is a DataFrame?

A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, factors and more) in columns. It is similar to a spreadsheet or an SQL table or the data.frame in R. A DataFrame always has an index (0-based). An index refers to the position of an element in the data structure.



In [2]:

    
#Read in the csv file as a data frame, holding it in the object called surveys_df
surveys_df = pd.read_csv('../data/surveys.csv')

We can print the entire contents of the data frame by just calling the object.

Remember that in Jupyter notebooks, we can toggle the output by clicking the lightly shaded area to the left of it...



In [3]:

    
#Show the data frame's contents
surveys_df









    Out[3]:






  
    
      
      record_id
      month
      day
      year
      plot_id
      species_id
      sex
      hindfoot_length
      weight
    
  
  
    
      0
      1
      7
      16
      1977
      2
      NL
      M
      32.0
      NaN
    
    
      1
      2
      7
      16
      1977
      3
      NL
      M
      33.0
      NaN
    
    
      2
      3
      7
      16
      1977
      2
      DM
      F
      37.0
      NaN
    
    
      3
      4
      7
      16
      1977
      7
      DM
      M
      36.0
      NaN
    
    
      4
      5
      7
      16
      1977
      3
      DM
      M
      35.0
      NaN
    
    
      5
      6
      7
      16
      1977
      1
      PF
      M
      14.0
      NaN
    
    
      6
      7
      7
      16
      1977
      2
      PE
      F
      NaN
      NaN
    
    
      7
      8
      7
      16
      1977
      1
      DM
      M
      37.0
      NaN
    
    
      8
      9
      7
      16
      1977
      1
      DM
      F
      34.0
      NaN
    
    
      9
      10
      7
      16
      1977
      6
      PF
      F
      20.0
      NaN
    
    
      10
      11
      7
      16
      1977
      5
      DS
      F
      53.0
      NaN
    
    
      11
      12
      7
      16
      1977
      7
      DM
      M
      38.0
      NaN
    
    
      12
      13
      7
      16
      1977
      3
      DM
      M
      35.0
      NaN
    
    
      13
      14
      7
      16
      1977
      8
      DM
      NaN
      NaN
      NaN
    
    
      14
      15
      7
      16
      1977
      6
      DM
      F
      36.0
      NaN
    
    
      15
      16
      7
      16
      1977
      4
      DM
      F
      36.0
      NaN
    
    
      16
      17
      7
      16
      1977
      3
      DS
      F
      48.0
      NaN
    
    
      17
      18
      7
      16
      1977
      2
      PP
      M
      22.0
      NaN
    
    
      18
      19
      7
      16
      1977
      4
      PF
      NaN
      NaN
      NaN
    
    
      19
      20
      7
      17
      1977
      11
      DS
      F
      48.0
      NaN
    
    
      20
      21
      7
      17
      1977
      14
      DM
      F
      34.0
      NaN
    
    
      21
      22
      7
      17
      1977
      15
      NL
      F
      31.0
      NaN
    
    
      22
      23
      7
      17
      1977
      13
      DM
      M
      36.0
      NaN
    
    
      23
      24
      7
      17
      1977
      13
      SH
      M
      21.0
      NaN
    
    
      24
      25
      7
      17
      1977
      9
      DM
      M
      35.0
      NaN
    
    
      25
      26
      7
      17
      1977
      15
      DM
      M
      31.0
      NaN
    
    
      26
      27
      7
      17
      1977
      15
      DM
      M
      36.0
      NaN
    
    
      27
      28
      7
      17
      1977
      11
      DM
      M
      38.0
      NaN
    
    
      28
      29
      7
      17
      1977
      11
      PP
      M
      NaN
      NaN
    
    
      29
      30
      7
      17
      1977
      10
      DS
      F
      52.0
      NaN
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      35519
      35520
      12
      31
      2002
      9
      SF
      NaN
      24.0
      36.0
    
    
      35520
      35521
      12
      31
      2002
      9
      DM
      M
      37.0
      48.0
    
    
      35521
      35522
      12
      31
      2002
      9
      DM
      F
      35.0
      45.0
    
    
      35522
      35523
      12
      31
      2002
      9
      DM
      F
      36.0
      44.0
    
    
      35523
      35524
      12
      31
      2002
      9
      PB
      F
      25.0
      27.0
    
    
      35524
      35525
      12
      31
      2002
      9
      OL
      M
      21.0
      26.0
    
    
      35525
      35526
      12
      31
      2002
      8
      OT
      F
      20.0
      24.0
    
    
      35526
      35527
      12
      31
      2002
      13
      DO
      F
      33.0
      43.0
    
    
      35527
      35528
      12
      31
      2002
      13
      US
      NaN
      NaN
      NaN
    
    
      35528
      35529
      12
      31
      2002
      13
      PB
      F
      25.0
      25.0
    
    
      35529
      35530
      12
      31
      2002
      13
      OT
      F
      20.0
      NaN
    
    
      35530
      35531
      12
      31
      2002
      13
      PB
      F
      27.0
      NaN
    
    
      35531
      35532
      12
      31
      2002
      14
      DM
      F
      34.0
      43.0
    
    
      35532
      35533
      12
      31
      2002
      14
      DM
      F
      36.0
      48.0
    
    
      35533
      35534
      12
      31
      2002
      14
      DM
      M
      37.0
      56.0
    
    
      35534
      35535
      12
      31
      2002
      14
      DM
      M
      37.0
      53.0
    
    
      35535
      35536
      12
      31
      2002
      14
      DM
      F
      35.0
      42.0
    
    
      35536
      35537
      12
      31
      2002
      14
      DM
      F
      36.0
      46.0
    
    
      35537
      35538
      12
      31
      2002
      15
      PB
      F
      26.0
      31.0
    
    
      35538
      35539
      12
      31
      2002
      15
      SF
      M
      26.0
      68.0
    
    
      35539
      35540
      12
      31
      2002
      15
      PB
      F
      26.0
      23.0
    
    
      35540
      35541
      12
      31
      2002
      15
      PB
      F
      24.0
      31.0
    
    
      35541
      35542
      12
      31
      2002
      15
      PB
      F
      26.0
      29.0
    
    
      35542
      35543
      12
      31
      2002
      15
      PB
      F
      27.0
      34.0
    
    
      35543
      35544
      12
      31
      2002
      15
      US
      NaN
      NaN
      NaN
    
    
      35544
      35545
      12
      31
      2002
      15
      AH
      NaN
      NaN
      NaN
    
    
      35545
      35546
      12
      31
      2002
      15
      AH
      NaN
      NaN
      NaN
    
    
      35546
      35547
      12
      31
      2002
      10
      RM
      F
      15.0
      14.0
    
    
      35547
      35548
      12
      31
      2002
      7
      DO
      M
      36.0
      51.0
    
    
      35548
      35549
      12
      31
      2002
      5
      NaN
      NaN
      NaN
      NaN
    
  

35549 rows × 9 columns

At the bottom of the [long] output above, we see that the data includes 33,549 rows and 9 columns.

The first column is the index of the DataFrame. The index is used to identify the position of the data, but it is not an actual column of the DataFrame. It looks like the read_csv function in Pandas read our file properly.

Exploring our Species Survey Data

Now, as we often do, let's look at the type of the object we just created.



In [ ]:

    
#Show the object type of the object we just created
type(surveys_df)

As expected, it’s a DataFrame (or, to use the full name that Python uses to refer to it internally, a pandas.core.frame.DataFrame).

What kind of things does surveys_df contain? DataFrames have an attribute called dtypes that answers this:



In [ ]:

    
#Show the data types of the columns in our data frame
surveys_df.dtypes

All the values in a column have the same type. For example, months have type int64, which is a kind of integer. Cells in the month column cannot have fractional values, but the weight and hindfoot_length columns can, because they have type float64. The object type doesn’t have a very helpful name, but in this case it represents strings (such as ‘M’ and ‘F’ in the case of sex).

Useful Ways to View DataFrame Objects in Python

There are many ways to summarize and access the data stored in DataFrames, using attributes and methods provided by the DataFrame object.

To access an attribute, use the DataFrame object name followed by the attribute name df_object.attribute. Using the DataFrame surveys_df and attribute columns, an index of all the column names in the DataFrame can be accessed with surveys_df.columns.

Methods are called in a similar fashion using the syntax df_object.method(). As an example, surveys_df.head() gets the first few rows in the DataFrame surveys_df using the head() method. With a method, we can supply extra information in the parens to control behaviour.

Let’s look at the data using these.

Challenge - DataFrames

Using our DataFrame surveys_df, try out the attributes & methods below to see what they return.

surveys_df.columns
surveys_df.shape Take note of the output of shape - what format does it return the shape of the DataFrame in?
surveys_df.head() Also, what does surveys_df.head(15) do?
surveys_df.tail()

Use the boxes below to type in the above commands and see what they produce.



In [ ]:

    
#Complete challenge 1
surveys_df.



In [ ]:

    
#Complete challenge 2



In [ ]:

    
#Complete challenge 3



In [ ]:

    
#Complete challenge 3

Generating Counts and Lists From Data In A Pandas DataFrame

We’ve read our data into Python. Next, let’s perform some quick summary statistics to learn more about the data that we’re working with. We might want to know how many animals were collected in each plot, or how many of each species were caught. We can perform summary stats quickly using groups. But first we need to figure out what we want to group by.

Let’s begin by exploring the data in our data frame:

First, examine the column names. (Yes,I know we just did that in the Challenge above...)



In [ ]:

    
# Look at the column names
surveys_df.columns

We can extract one column of data into a new object by referencing that column as shown here:



In [ ]:

    
speciesIDs = surveys_df['species_id']

Examining the type of this speciesIDs object reveals another Pandas data type: the Series which is slightly different than a DataFrame...



In [ ]:

    
type(speciesIDs)

A series object is a one-dimensional array, much like a NumPy array, with its own set of properties and functions. The values are indexed allowing us to extract values at a specific row (try: speciesIDs[5]) or slice of rows (try: species[2:7]).

We can also, using the series.nunique() and series.unique() functions, generate a count of unique values in the series and a list of unique values, respectively.



In [ ]:

    
#Reveal how many unique species_ID values are in the table
speciesIDs.nunique()



In [ ]:

    
#List the unique values
speciesIDs.unique()

Challenge - Counts and Lists from Data

Create a list of unique plot ID’s found in the surveys data. Call it plot_names. How many unique plots are there in the data? How many unique species are in the data?
What is the difference between len(plot_names) and surveys_df['plot_id'].nunique()?



In [ ]:

    
#Challenge 1



In [ ]:

    
#Challenge 2

Groups in Pandas

We often want to calculate summary statistics grouped by subsets or attributes within fields of our data. For example, we might want to calculate the average weight of all individuals per plot.

We can calculate basic statistics for all records in a single column using the syntax below:



In [ ]:

    
surveys_df['weight'].describe()

We can also extract one specific metric if we wish:



In [ ]:

    
print" Min: ", surveys_df['weight'].min()
print" Max: ", surveys_df['weight'].max()
print" Mean: ", surveys_df['weight'].mean()
print" Std Dev: ", surveys_df['weight'].std()
print" Count: ", surveys_df['weight'].count()

But if we want to summarize by one or more variables, for example sex, we can use Pandas’ .groupby method. Once we’ve created a groupby DataFrame, we can quickly calculate summary statistics by a group of our choice.



In [ ]:

    
# Group data by sex
grouped_data = surveys_df.groupby('sex')



In [ ]:

    
# Show just the grouped means
grouped_data.mean()



In [ ]:

    
# Or, use the describe function to reveal all summary stats for the grouped data
grouped_data.describe()

Challenge - Summary Data

How many recorded individuals are female F and how many male M
What happens when you group by two columns using the following syntax and then grab mean values:
- grouped_data2 = surveys_df.groupby(['plot_id','sex'])
- grouped_data2.mean()
Summarize weight values for each plot in your data. HINT: you can use the following syntax to only create summary statistics for one column in your data by_plot['weight'].describe()

Challenge #3 should reveal:



In [ ]:

    
# Challenge 1
grouped_data.count()



In [ ]:

    
# Challenge 2



In [ ]:

    
# Challenge 3

Quickly Creating Summary Counts in Pandas

Let’s next count the number of samples for each species. We can do this in a few ways, but we’ll use groupby combined with a count() method.



In [ ]:

    
# count the number of samples by species
species_counts = surveys_df.groupby('species_id')['record_id'].count()
print(species_counts)

Or, we can also count just the rows that have the species “DO”:



In [ ]:

    
surveys_df.groupby('species_id')['record_id'].count()['DO']

Challenge - Make a list

What's another way to create a list of species and associated count of the records in the data? Hint: you can perform count, min, etc functions on groupby DataFrames in the same way you can perform them on regular DataFrames.



In [ ]:

    
#Challenge

Basic Math Functions

If we wanted to, we could perform math on an entire column of our data. For example let’s multiply all weight values by 2. A more practical use of this might be to normalize the data according to a mean, area, or some other value calculated from our data.



In [ ]:

    
# multiply all weight values by 2
surveys_df['weight']

Quick & Easy Plotting Data Using Pandas

We can plot our summary stats using Pandas, too.



In [ ]:

    
# make sure figures appear inline in Ipython Notebook
%matplotlib inline
# create a quick bar chart
species_counts.plot(kind='bar');

Challenge - Plots

Create a plot of average weight across all species per plot.
Create a plot of total males versus total females for the entire dataset.

Summary Plotting Challenge

Create a stacked bar plot, with weight on the Y axis, and the stacked variable being sex. The plot should show total weight by sex for each plot. Some tips are below to help you solve this challenge:

For more on Pandas plots, visit this link.
You can use the code that follows to create a stacked bar plot but the data to stack need to be in individual columns. Here’s a simple example with some data where ‘a’, ‘b’, and ‘c’ are the groups, and ‘one’ and ‘two’ are the subgroups.



In [ ]:

    
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
pd.DataFrame(d)

We can plot the above with:



In [ ]:

    
# plot stacked data so columns 'one' and 'two' are stacked
my_df = pd.DataFrame(d)
my_df.plot(kind='bar',stacked=True,title="The title of my graph")

You can use the .unstack() method to transform grouped data into columns for each plotting. Try running .unstack() on some DataFrames above and see what it yields.

Start by transforming the grouped data (by plot and sex) into an unstacked layout, then create a stacked plot.



In [ ]:



In [ ]:

	record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight
0	1	7	16	1977	2	NL	M	32.0	NaN
1	2	7	16	1977	3	NL	M	33.0	NaN
2	3	7	16	1977	2	DM	F	37.0	NaN
3	4	7	16	1977	7	DM	M	36.0	NaN
4	5	7	16	1977	3	DM	M	35.0	NaN
5	6	7	16	1977	1	PF	M	14.0	NaN
6	7	7	16	1977	2	PE	F	NaN	NaN
7	8	7	16	1977	1	DM	M	37.0	NaN
8	9	7	16	1977	1	DM	F	34.0	NaN
9	10	7	16	1977	6	PF	F	20.0	NaN
10	11	7	16	1977	5	DS	F	53.0	NaN
11	12	7	16	1977	7	DM	M	38.0	NaN
12	13	7	16	1977	3	DM	M	35.0	NaN
13	14	7	16	1977	8	DM	NaN	NaN	NaN
14	15	7	16	1977	6	DM	F	36.0	NaN
15	16	7	16	1977	4	DM	F	36.0	NaN
16	17	7	16	1977	3	DS	F	48.0	NaN
17	18	7	16	1977	2	PP	M	22.0	NaN
18	19	7	16	1977	4	PF	NaN	NaN	NaN
19	20	7	17	1977	11	DS	F	48.0	NaN
20	21	7	17	1977	14	DM	F	34.0	NaN
21	22	7	17	1977	15	NL	F	31.0	NaN
22	23	7	17	1977	13	DM	M	36.0	NaN
23	24	7	17	1977	13	SH	M	21.0	NaN
24	25	7	17	1977	9	DM	M	35.0	NaN
25	26	7	17	1977	15	DM	M	31.0	NaN
26	27	7	17	1977	15	DM	M	36.0	NaN
27	28	7	17	1977	11	DM	M	38.0	NaN
28	29	7	17	1977	11	PP	M	NaN	NaN
29	30	7	17	1977	10	DS	F	52.0	NaN
...	...	...	...	...	...	...	...	...	...
35519	35520	12	31	2002	9	SF	NaN	24.0	36.0
35520	35521	12	31	2002	9	DM	M	37.0	48.0
35521	35522	12	31	2002	9	DM	F	35.0	45.0
35522	35523	12	31	2002	9	DM	F	36.0	44.0
35523	35524	12	31	2002	9	PB	F	25.0	27.0
35524	35525	12	31	2002	9	OL	M	21.0	26.0
35525	35526	12	31	2002	8	OT	F	20.0	24.0
35526	35527	12	31	2002	13	DO	F	33.0	43.0
35527	35528	12	31	2002	13	US	NaN	NaN	NaN
35528	35529	12	31	2002	13	PB	F	25.0	25.0
35529	35530	12	31	2002	13	OT	F	20.0	NaN
35530	35531	12	31	2002	13	PB	F	27.0	NaN
35531	35532	12	31	2002	14	DM	F	34.0	43.0
35532	35533	12	31	2002	14	DM	F	36.0	48.0
35533	35534	12	31	2002	14	DM	M	37.0	56.0
35534	35535	12	31	2002	14	DM	M	37.0	53.0
35535	35536	12	31	2002	14	DM	F	35.0	42.0
35536	35537	12	31	2002	14	DM	F	36.0	46.0
35537	35538	12	31	2002	15	PB	F	26.0	31.0
35538	35539	12	31	2002	15	SF	M	26.0	68.0
35539	35540	12	31	2002	15	PB	F	26.0	23.0
35540	35541	12	31	2002	15	PB	F	24.0	31.0
35541	35542	12	31	2002	15	PB	F	26.0	29.0
35542	35543	12	31	2002	15	PB	F	27.0	34.0
35543	35544	12	31	2002	15	US	NaN	NaN	NaN
35544	35545	12	31	2002	15	AH	NaN	NaN	NaN
35545	35546	12	31	2002	15	AH	NaN	NaN	NaN
35546	35547	12	31	2002	10	RM	F	15.0	14.0
35547	35548	12	31	2002	7	DO	M	36.0	51.0
35548	35549	12	31	2002	5	NaN	NaN	NaN	NaN