Source: http://www.datacarpentry.org/python-ecology-lesson/01-starting-with-data/
One of the best options for working with tabular data in Python is to use the Python Data Analysis Library (a.k.a. Pandas). The Pandas library provides data structures, produces high quality plots with matplotlib and integrates nicely with other libraries that use NumPy arrays.
We begin by importing the pandas library. By convention, we often import pandas with the pd alias.
In [1]:
#Import pandas, using the alias 'pd'
import pandas as pd
In the Data folder within our workspace is a file named surveys.csv which holds the data we'll use for our exercises. If you're curious, this dataset is part of the Portal Teaching data, a subset of the data from Ernst et al., Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA.
We are studying the species and weight of animals caught in plots in our study area. The dataset is stored as a .csv file: each row holds information for a single animal, and the columns represent:
Column | Description |
---|---|
record_id | Unique id for the observation |
month | month of observation |
day | day of observation |
year | year of observation |
plot_id | ID of a particular plot |
species_id | 2-letter code |
sex | sex of animal (“M”, “F”) |
hindfoot_length | length of the hindfoot in mm |
weight | weight of the animal in grams |
A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, an SQL table, or the data.frame in R. A DataFrame always has an index, which by default is 0-based. The index identifies the position of each row in the data structure.
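To make the idea of columns of different types and a 0-based index concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The name toy_df and its values are made up for illustration and are not taken from surveys.csv.

```python
import pandas as pd

# Hypothetical toy records; column names mirror surveys.csv, values are invented
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3],             # integers
    "species_id": ["NL", "DM", "PF"],   # strings
    "weight": [40.0, 48.5, 7.0],        # floating point values
})

print(toy_df)
print(list(toy_df.index))   # the default index counts rows starting at 0
print(toy_df.iloc[0])       # the first row, selected by position
```

Each column holds a single type, while the DataFrame as a whole mixes several.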
In [2]:
#Read in the csv file as a data frame, holding it in the object called surveys_df
surveys_df = pd.read_csv('../data/surveys.csv')
We can print the entire contents of the data frame by just calling the object.
Remember that in Jupyter notebooks, we can toggle the output by clicking the lightly shaded area to the left of it...
In [3]:
#Show the data frame's contents
surveys_df
Out[3]:
At the bottom of the [long] output above, we see that the data includes 33,549 rows and 9 columns.
The first column is the index of the DataFrame. The index is used to identify the position of the data, but it is not an actual column of the DataFrame. It looks like the read_csv function in Pandas read our file properly.
In [ ]:
#Show the object type of the object we just created
type(surveys_df)
As expected, it’s a DataFrame (or, to use the full name that Python uses to refer to it internally, a pandas.core.frame.DataFrame).
What kind of things does surveys_df contain? DataFrames have an attribute called dtypes that answers this:
In [ ]:
#Show the data types of the columns in our data frame
surveys_df.dtypes
All the values in a column have the same type. For example, months have type int64, which is a kind of integer. Cells in the month column cannot have fractional values, but the weight and hindfoot_length columns can, because they have type float64. The object type doesn’t have a very helpful name, but in this case it represents strings (such as ‘M’ and ‘F’ in the case of sex).
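One reason a column of measurements may be read as float64 even when every recorded value is a whole number is that missing values are stored as NaN, which is a floating point value. The sketch below illustrates this with made-up data; toy_df, whole and with_gap are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data, not read from surveys.csv
toy_df = pd.DataFrame({
    "month": [7, 7, 8],
    "weight": [40.0, None, 7.5],   # None is stored as NaN
})
print(toy_df.dtypes)

whole = pd.Series([1, 2, 3])        # no gaps: an integer dtype
with_gap = pd.Series([1, None, 3])  # one gap: promoted to a float dtype
print(whole.dtype)
print(with_gap.dtype)
```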
There are many ways to summarize and access the data stored in DataFrames, using attributes and methods provided by the DataFrame object.
To access an attribute, use the DataFrame object name followed by the attribute name: df_object.attribute. Using the DataFrame surveys_df and the attribute columns, an index of all the column names in the DataFrame can be accessed with surveys_df.columns.
Methods are called in a similar fashion using the syntax df_object.method(). As an example, surveys_df.head() gets the first few rows in the DataFrame surveys_df using the head() method. With a method, we can supply extra information in the parentheses to control its behaviour.
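The difference between an attribute and a method can be sketched on a small hand-made DataFrame. The name toy_df and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data with seven rows
toy_df = pd.DataFrame({
    "plot_id": [1, 2, 3, 4, 5, 6, 7],
    "weight": [40.0, 48.5, 7.0, 12.0, 33.0, 21.5, 9.0],
})

column_names = toy_df.columns   # attribute: no parentheses
first_rows = toy_df.head()      # method: called with its default argument
first_two = toy_df.head(2)      # method: the argument controls how many rows come back

print(list(column_names))
print(len(first_rows), len(first_two))
```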
Let’s look at the data using these.
Using our DataFrame surveys_df, try out the attributes and methods below to see what they return.
surveys_df.columns
surveys_df.shape
Take note of the output of shape: what format does it return the shape of the DataFrame in?
surveys_df.head()
Also, what does surveys_df.head(15) do?
surveys_df.tail()
Use the boxes below to type in the above commands and see what they produce.
In [ ]:
#Complete challenge 1
surveys_df.
In [ ]:
#Complete challenge 2
In [ ]:
#Complete challenge 3
In [ ]:
#Complete challenge 4
We’ve read our data into Python. Next, let’s perform some quick summary statistics to learn more about the data that we’re working with. We might want to know how many animals were collected in each plot, or how many of each species were caught. We can perform summary stats quickly using groups. But first we need to figure out what we want to group by.
Let’s begin by exploring the data in our data frame:
First, examine the column names. (Yes, I know we just did that in the Challenge above...)
In [ ]:
# Look at the column names
surveys_df.columns
We can extract one column of data into a new object by referencing that column as shown here:
In [ ]:
speciesIDs = surveys_df['species_id']
Examining the type of this speciesIDs object reveals another Pandas data type: the Series, which is slightly different than a DataFrame...
In [ ]:
type(speciesIDs)
A Series object is a one-dimensional array, much like a NumPy array, with its own set of properties and functions. The values are indexed, allowing us to extract the value at a specific row (try: speciesIDs[5]) or a slice of rows (try: speciesIDs[2:7]).
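As a small illustration of indexing and slicing a Series, the sketch below uses a hand-made Series named toy_species (a hypothetical stand-in for speciesIDs, with invented values).

```python
import pandas as pd

# Hypothetical stand-in for the species_id column
toy_species = pd.Series(["NL", "DM", "PF", "PE", "DM", "DM", "PF", "NL"])

one_value = toy_species[5]      # the value at index label 5
some_rows = toy_species[2:7]    # rows 2 up to, but not including, 7

print(one_value)
print(some_rows)
print(toy_species.iloc[5])      # .iloc always selects by position
```

With the default 0-based index, the label 5 and the position 5 refer to the same row; .iloc makes the positional intent explicit when the index is something else.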
We can also use the series.nunique() and series.unique() functions to generate a count of unique values in the series and a list of unique values, respectively.
In [ ]:
#Reveal how many unique species_ID values are in the table
speciesIDs.nunique()
In [ ]:
#List the unique values
speciesIDs.unique()
In [ ]:
#Challenge 1
In [ ]:
#Challenge 2
We often want to calculate summary statistics grouped by subsets or attributes within fields of our data. For example, we might want to calculate the average weight of all individuals per plot.
We can calculate basic statistics for all records in a single column using the syntax below:
In [ ]:
surveys_df['weight'].describe()
We can also extract one specific metric if we wish:
In [ ]:
print(" Min: ", surveys_df['weight'].min())
print(" Max: ", surveys_df['weight'].max())
print(" Mean: ", surveys_df['weight'].mean())
print(" Std Dev: ", surveys_df['weight'].std())
print(" Count: ", surveys_df['weight'].count())
But if we want to summarize by one or more variables, for example sex, we can use Pandas’ .groupby method. Once we’ve created a groupby DataFrame, we can quickly calculate summary statistics by a group of our choice.
In [ ]:
# Group data by sex
grouped_data = surveys_df.groupby('sex')
In [ ]:
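# Show just the grouped means (numeric columns only; recent pandas versions
# raise an error if text columns such as species_id are included)
grouped_data.mean(numeric_only=True)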
In [ ]:
# Or, use the describe function to reveal all summary stats for the grouped data
grouped_data.describe()
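To see what a grouped mean computes, here is a minimal sketch on made-up data in which the group means can be checked by hand. The name toy_df and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; the last record has no recorded sex
toy_df = pd.DataFrame({
    "sex": ["M", "F", "M", "F", None],
    "weight": [40.0, 50.0, 44.0, 54.0, 30.0],
    "hindfoot_length": [32.0, 33.0, 36.0, 35.0, 20.0],
})

# Select the numeric columns of interest, then average within each group
mean_by_sex = toy_df.groupby("sex")[["weight", "hindfoot_length"]].mean()
print(mean_by_sex)
```

By default, rows whose grouping key is missing are left out, so the record without a sex contributes to neither group.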
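How many recorded individuals are female F and how many male M?
What happens when you group by two columns using the following syntax and then take the mean values?
grouped_data2 = surveys_df.groupby(['plot_id','sex'])
grouped_data2.mean()
Summarize the weight values for each plot in your data. Hint: to create summary statistics for a single column, you can use syntax such as by_plot['weight'].describe() (here by_plot stands for the data grouped by plot_id).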
Challenge #3 should reveal:
In [ ]:
# Challenge 1
grouped_data.count()
In [ ]:
# Challenge 2
In [ ]:
# Challenge 3
In [ ]:
# count the number of samples by species
species_counts = surveys_df.groupby('species_id')['record_id'].count()
print(species_counts)
Or, we can also count just the rows that have the species “DO”:
In [ ]:
surveys_df.groupby('species_id')['record_id'].count()['DO']
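The same count can be reached in more than one way. The sketch below, on invented data, compares the groupby approach with a boolean comparison; toy_df and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5],
    "species_id": ["DO", "DM", "DO", "PF", "DO"],
})

# Count records per species, then pick out one species by its label
counts = toy_df.groupby("species_id")["record_id"].count()
do_count = counts["DO"]

# Alternative: compare every row to "DO" and sum the True values
do_count_direct = (toy_df["species_id"] == "DO").sum()

print(do_count, do_count_direct)
```

Note that count() tallies non-missing values of the selected column, so it is worth counting a column that is never empty, such as record_id.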
In [ ]:
#Challenge
In [ ]:
# multiply all weight values by 2
surveys_df['weight'] * 2
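Arithmetic on a column returns a new Series and leaves the DataFrame untouched unless the result is assigned back. A minimal sketch with made-up values follows; toy_df and the weight_kg column are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data, weights in grams
toy_df = pd.DataFrame({"weight": [40.0, 48.5, 7.0]})

doubled = toy_df["weight"] * 2                 # new Series; toy_df is unchanged
toy_df["weight_kg"] = toy_df["weight"] / 1000  # assigning stores the result as a new column

print(doubled)
print(toy_df)
```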
In [ ]:
# make sure figures appear inline in the IPython Notebook
%matplotlib inline
# create a quick bar chart
species_counts.plot(kind='bar');
Create a stacked bar plot, with weight on the Y axis, and the stacked variable being sex. The plot should show total weight by sex for each plot. Some tips are below to help you solve this challenge:
In [ ]:
d = {
    'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
}
pd.DataFrame(d)
We can plot the above with:
In [ ]:
# plot stacked data so columns 'one' and 'two' are stacked
my_df = pd.DataFrame(d)
my_df.plot(kind='bar',stacked=True,title="The title of my graph")
Start by transforming the grouped data (by plot and sex) into an unstacked layout, then create a stacked plot.
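As a hint for what "unstacked" means, the sketch below reshapes a small grouped result so that each value of the second grouping key becomes its own column. It uses invented data (toy_df is a hypothetical name) rather than the survey data, and the plotting call is left as a comment because it needs matplotlib.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "plot_id": [1, 1, 1, 2, 2],
    "sex": ["F", "M", "M", "F", "M"],
    "weight": [10.0, 20.0, 5.0, 7.0, 3.0],
})

# Total weight for each (plot_id, sex) pair: a Series with a two-level index
totals = toy_df.groupby(["plot_id", "sex"])["weight"].sum()

# Move the inner index level (sex) into columns: one row per plot, one column per sex
wide = totals.unstack()
print(wide)

# wide.plot(kind='bar', stacked=True)   # one bar per plot, stacked by sex
```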
In [ ]:
In [ ]: