This lesson introduces Python as an environment for data analysis and visualization. The materials are based on the Data Carpentry Python for Ecologists lesson. However, the lesson focuses on general analysis and visualization of tabular data and is not specific to ecologists or ecological data. As Data Carpentry explains:
Data Carpentry’s aim is to teach researchers basic concepts, skills, and tools for working with data so that they can get more done in less time, and with less pain.
At the end of this lesson, you will be able to:
This lesson will not prepare you to use Python as a general purpose programming language; there are some parts of the language we won't have time to cover. However, at the end of this lesson, you will have a good grasp of Python syntax and be well-prepared to learn the rest of the language, if you desire to do so. Even without seeing all of the Python programming language, you will be prepared to analyze and visualize data in Python using pandas and matplotlib.
As with the Data Carpentry ecology lesson, this lesson is licensed for open use under the CC BY 4.0 license.
Python is a general purpose programming language that allows for the rapid development of scientific workflows. Python's main advantages are:
The only language that computers really understand is machine language, or binary: ones and zeros. Anything we tell computers to do has to be translated to binary for computers to execute.
Python is what we call an interpreted language. This means that computers can translate Python to machine code as they are reading it. This distinguishes Python from languages like C, C++, or Java, which have to be compiled to machine code before they are run. The details aren't important to us; what is important is that we can use Python in two ways:
For this lesson, we'll be using the Python interpreter that is embedded in Jupyter Notebook. Jupyter Notebook is a fancy, browser-based environment for literate programming, the combination of Python scripts with rich text for telling a story about the task you set out to do with Python. This is a powerful way to collect the code, the analysis, the context, and the results in a single place.
The Python interpreter we'll interact with in Jupyter Notebook is the same interpreter we could use from the command line. To launch Jupyter Notebook:
Type jupyter notebook at the command line; then press ENTER.

Let's try out the Python interpreter.
In [1]:
print('Hello, world!')
Alternatively, we could save that one line of Python code to a text file with a *.py
file extension and then execute that file.
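For instance, a minimal sketch of such a script (the filename hello.py is just an example):

```python
# hello.py -- save this single line to a file,
# then run it from a terminal with: python hello.py
print('Hello, world!')
```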
In [4]:
text = 'Data Carpentry' # A character string
number = 42 # An integer number
pi = 3.14159265 # A floating-point number or "float"
Here, we've assigned data to variables using the assignment operator or equal sign. The process of assignment takes a value and stores it under a name that we make up. This way, we can use that stored value again by calling its name.
In [5]:
number
Out[5]:
Note that to recover a variable's stored value, we simply type the name of the variable and hit Enter. (This only works in interactive mode; if we write a script and want it to print out a value, we have to use the print() function.)
Variable names can only include letters, the underscore, and numbers. However, variable names cannot start with numbers.
In [7]:
my_variable = 'some text'
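For example, here are names that are and are not allowed (these names are made up for illustration):

```python
# Valid names: letters, underscores, and digits (not in the first position)
sample_count = 3
_cache2 = 'ok'

# Invalid name: starts with a digit; uncommenting the next line raises a SyntaxError
# 2nd_sample = 5
```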
We can perform mathematical calculations in Python using the basic operators +, -, /, *, and %.
In [8]:
2 + 2
Out[8]:
In [9]:
6 * 7
Out[9]:
In [12]:
5 ** 2
Out[12]:
This is "13 modulo 5" and the result is the remainder, 3.
In [22]:
13 % 5
Out[22]:
We can also use comparison and logical operators. These operators return Boolean values; that is, they determine or describe whether something is True or False.
In [23]:
3 > 4
Out[23]:
In [26]:
5 == 5
Out[26]:
i.e., is it True and True?
In [24]:
True and True
Out[24]:
Chooses the "truth-y" value between the two:
In [25]:
True or False
Out[25]:
True and False, with the first letter capitalized, are special values in Python that mean just what they say.
Much of Python's expressive power and flexibility comes from the way it handles sequences. A sequence could be a sequence of characters in a text string or a sequence of numbers.
A list is Python's built-in data structure for handling general, ordered sequences. Each element can be accessed by its index. Note that, in Python, we start counting from zero, not from one.
In [27]:
numbers = [1, 2, 3]
numbers[0]
Out[27]:
The square brackets are used to slice a sequence by one or more indices. Above, we have asked for the first (the zeroth) element of the numbers
sequence.
A for
loop is a useful way of accessing the elements of a sequence one at a time:
In [28]:
for number in numbers:
print(number)
Indentation is very important in Python. Note that the second line in the above example is indented. This is Python's way of marking a block of code. It's standard to indent by 4 spaces.
To add elements to the end of a list, we can use the append() method:
In [29]:
numbers.append(4)
print(numbers)
Note that there is no output associated with the append()
method; the numbers
sequence is modified in place so we don't need to assign the result to a variable.
Methods are a way to interact with an object in Python. We can invoke a method using the dot, followed by the method name and a list of arguments in parentheses. To find out what methods are available for an object, we can use the built-in help()
function.
In [30]:
help(numbers)
We can also access a list of methods using dir
.
In [31]:
dir(numbers)
Out[31]:
A tuple is similar to a list in that it's an ordered sequence of elements. However, tuples cannot be changed once created; they are "immutable." Tuples are created by placing comma-separated values inside parentheses.
In [33]:
a_tuple = (1,2,3)
another_tuple = ('rabbit', 'mongoose', 'platypus')
still_a_tuple = (1,)
# Note that lists use square brackets
a_list = [1,2,3]
In [1]:
a_tuple = ('a', 'b', 'c', 'd')
a_list = ['a', 'b', 'c', 'd']
a_tuple[0] = 5
a_list[0] = 5
type(a_tuple)
Type type(a_tuple) into Python; what is the object's type?

A dictionary is a container that holds key-value pairs. It is a data structure that is also referred to, in other programming languages, as an associative array.
In [6]:
mapping = {'tbsp': 'tablespoon', 'tsp': 'teaspoon'}
mapping['tbsp']
Out[6]:
Dictionaries are similar to lists in that we can store multiple things inside them. While we index lists with numeric indices, we index dictionaries with keys. A key is a unique identifier for a value in the dictionary.
Keys must be of a hashable data type: mutable containers like lists cannot be used as keys, but strings, numbers, and tuples can.
In [7]:
code_book = {1: 'Valid', 2: 'Invalid'}
code_book[1]
Out[7]:
In [9]:
{[1,2]: 'Valid/Invalid'}
Another way we can create dictionaries is using the dict()
function. Here, we give the dict()
function a list of key-value pairs, represented as tuples. The first element in the tuple is the key, the second element is the value.
In [37]:
dict([(1, 'Valid'), (2, 'Invalid')])
Out[37]:
Using for
loops with dictionaries is a little more complicated.
In [10]:
for key, value in code_book.items():
print(key, '->', value)
Alternatively:
In [11]:
for key in code_book.keys():
print(key, '->', code_book[key])
How can we change a value in our dictionary? Try to reassign one of the values in the code_book
dictionary.
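One possible approach, shown as a sketch (the new label here is made up):

```python
# Reassign the value stored under an existing key
code_book[2] = 'Needs review'
code_book
```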
You may have noticed that a dictionary sometimes returns its items in a different order than you expected. Dictionaries are not meant to be used as ordered collections; in older versions of Python they did not retain the order of their items at all, so always look values up by their keys rather than relying on position.
One of the chief reasons we program computers to do things for us is because computers are very good at tedious tasks (and humans are not). A block of Python code that does the same thing every time is best defined as a function. A function is a series of fixed Python statements, with or without input arguments, that is assigned a name so that we can easily call it over and over again.
We've already seen the built-in print()
and help()
functions. Now let's see how to write our own.
In [15]:
def pow10(exponent):
result = 10 ** exponent
return result
pow10(2)
Out[15]:
Key things to note about this example:

- We used the def command followed by the name of the function and any arguments it takes, written just like we would call the function (recall the print() and help() functions);
- We used the return command to indicate what the result of the function, or its return value, should be. If we don't return anything, the function's body is still executed, but we don't necessarily see any output when the function is called.

How can we make a more general version of the function pow10()?
In [4]:
def pow(exponent, base=10):
result = base ** exponent
return result
pow(2, 10)
Out[4]:
Here, the function pow()
can be used to calculate powers of any base, not just base 10. By default, it will calculate powers of 10, because the argument base
has a default argument of 10.
In [5]:
pow(2, base = 10)
Out[5]:
In [6]:
pow(base = 10, exponent = 2)
Out[6]:
pow()
is actually a function already built into Python, so we didn't need to write this function, but it demonstrates how you can create more flexible functions using default arguments.
To convert temperatures from Fahrenheit to Celsius, we first subtract $32$ and then multiply by $5/9$. Write a function that converts temperatures from Fahrenheit to Celsius.
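One possible solution, shown as a sketch (the function name here is ours):

```python
def fahr_to_celsius(temp_fahr):
    '''Convert a temperature from Fahrenheit to Celsius.'''
    return (temp_fahr - 32) * 5.0 / 9.0

fahr_to_celsius(212)  # Returns 100.0
```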
Another type of function in Python is the lambda function. This is a special, one-line function that is generally used for transforming inputs.
In [66]:
to_lowercase = lambda s: s.lower()
to_lowercase('Roger')
Out[66]:
In [82]:
names = ['Roger', 'Helen', 'Abdul', 'Lakshmi']
list(map(to_lowercase, names))
Out[82]:
Lambda functions are important for reasons we'll see later.
In general, when we have a data analysis task, there are specific steps that we undertake, in order:
We'll now see how Python plays a role in this workflow, starting from having "clean" or nearly clean data to begin analyzing.
We've seen how we can use spreadsheets to effectively organize data. You probably know how to do some basic analysis on tabular data using spreadsheet software programs like Microsoft Excel. Why, then, should we learn to use Python for data analysis?
For this lesson to run smoothly, let's make sure everyone is in the same directory. In the Jupyter Notebook file tree, navigate to your Desktop. We'll create a new Notebook here to use for the rest of the lesson.
For this lesson, we will be using the Portal Project Teaching Database, a subset of data from Ernst et al.'s (2009) long-term monitoring and experimental manipulation of a Chihuahuan desert ecosystem near Portal, Arizona, U.S.A. We are studying the species and weight of animals caught in plots in our study area. The dataset is stored as a comma-separated values (CSV) file: each row holds information for a single animal, and the columns represent:
| Column | Description |
|---|---|
| record_id | Unique ID for the observation |
| month | Month of observation |
| day | Day of observation |
| year | Year of observation |
| plot_id | ID of a particular plot |
| species_id | 2-letter code identifying the species |
| sex | Sex of animal ("M", "F") |
| hindfoot_length | Length of the hindfoot in millimeters |
| weight | Weight of the animal in grams |
A library in Python contains a set of tools (called functions) that perform tasks on our data.
Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench for use in a project.
Once a library is set up, it can be used to perform many tasks.
One of the libraries we'll be using in this lesson is the Python Data Analysis Library, or pandas
.
pandas
adds a number of things to base Python including more sophisticated data structures and tools to connect to other libraries like matplotlib
for producing data visualizations.
Python doesn't initially load all of the libraries that we have installed; we have to tell Python to import the libraries that we want to use in any particular session.
To import a library, we use the syntax: import libraryName
.
Because all of the tools in that library will be known by the name we import, we have the option to give it a shorter name.
We'll do that here with Pandas.
In [2]:
import pandas as pd
Each time we call a function that's in the Pandas library, we use the syntax libraryName.functionName
.
This ensures that Python can find the tool or function we're asking for.
Most Pandas users abbreviate the library name to pd
and because this is such a popular convention, you can often find help on the internet by searching for a particular function in the pd
namespace.
For instance, I'll look up support for the pd.read_csv()
function on the Google search engine.
Tabular data, like those stored in CSV, tab-delimited, or fixed-width formats, can be read into Python using Pandas and stored in a Pandas DataFrame
.
If you've used the R programming language before, you'll be familiar with DataFrames.
A DataFrame is a 2-dimensional, tabular data structure that can store different types of data across multiple columns.
It is structurally identical to a spreadsheet in this way or, as we'll see, a table in a SQL database.
We can read in our survey data as a DataFrame using the pd.read_csv()
function.
In [ ]:
pd.read_csv('ecology-surveys.csv')
When we run this command, Pandas finds the CSV file, opens it, and reads in the contents, line-by-line, as rows in a new DataFrame, which it then prints out. We can see that:
The first column is the index of the DataFrame. The index is used to identify the position of the data, but it is not an actual column of the DataFrame.
We can see from this output that Pandas read the file properly, but it hasn't saved the DataFrame to memory. We need to assign the DataFrame to a variable.
In [115]:
surveys = pd.read_csv('ecology-surveys.csv')
Note that when we do variable assignment, Python does not display anything on the screen. The output of the pd.read_csv() function, our table, instead of being displayed on the screen, is now captured in the variable named surveys.
In [5]:
type(surveys)
Out[5]:
One of the first things we might do after importing any kind of data in any environment is to make sure that our data--the individual fields or columns--are represented in the right way, by the right data types.
In [6]:
surveys.dtypes
Out[6]:
Here, int64
refers to integer type; it cannot store decimal numbers. float64
stores decimal numbers with 64-bit precision. object
refers to character strings, or text.
Try executing each code sample below and see what is returned.
surveys.columns
surveys.head()
surveys.head(15)
surveys.tail()
surveys.shape
Take note of the output of surveys.shape
; what format does it return?
Finally, what is the difference between the code samples that end in parentheses and those that do not?
Each of the code samples above has us calling some attribute or method on the surveys DataFrame.
If we think of a person, an attribute is something that belongs to that person or describes that person, like hair color or number of siblings. A method is something that person does, like bake a pie or go for a run.
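For example, among the code samples above, shape is an attribute that describes the DataFrame, while head() is a method that asks the DataFrame to do something:

```python
surveys.shape    # An attribute: a (rows, columns) tuple describing the DataFrame
surveys.head()   # A method: an action that returns the first five rows
```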
Let's perform some summary statistics to further verify that the data we imported look okay. First, let's remind ourselves what is in our data and what values we might use to group the data by in calculating summary statistics.
In [23]:
# Look at the column names
surveys.columns.values
Out[23]:
Let's get a list of all the species.
The pd.unique()
function tells us all of the unique values in the species_id
column.
In [11]:
pd.unique(surveys['species_id'])
Out[11]:
Note the bracket notation we've used here. This is how we access a column by name in a Pandas DataFrame. It's just like how we accessed the value in a Dictionary by its key name.
1. Create a list of the unique plot IDs found in the surveys data and call it plot_names. How many unique plots are there in the data? How many unique species are in the data?
2. What is the difference between len(plot_names) and plot_names.shape?
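One possible sketch of the first step (the variable name plot_names comes from the challenge):

```python
plot_names = pd.unique(surveys['plot_id'])
len(plot_names)     # Number of unique plots, as an integer
plot_names.shape    # The same information, returned as a tuple
```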
We often want to calculate summary statistics grouped by subsets or attributes within fields of our data. For example, we might want to calculate the average weight of all individuals in each plot.
We can calculate basic statistics for all records in a single column using the syntax below.
In [24]:
surveys['weight'].describe()
Out[24]:
Let's break that last command down.

- First, we selected the weight column of the surveys table using square brackets.
- Then, we called the describe() method on that column.

We can also extract a specific statistic.
In [26]:
surveys['weight'].mean()
Out[26]:
In [27]:
surveys['weight'].count()
Out[27]:
If we want to summarize by one or more variables, for example, sex, we can use Pandas' groupby() method. Once we've created a grouped DataFrame, we can quickly calculate summary statistics by a group of our choice.
In [29]:
# Group data by sex
grouped_by_sex = surveys.groupby('sex')
# Summary statistics for all numeric columns, grouped by sex
grouped_by_sex.describe()
Out[29]:
In [30]:
grouped_by_sex.mean()
Out[30]:
Let's count the number of samples for each species.
We can do this in a few ways, but we'll use groupby()
combined with a count()
method.
In [42]:
species_counts = surveys.groupby('species_id')['record_id'].count()
species_counts
Out[42]:
Again, let's break this down:
Again, let's break this down:

- We grouped the surveys DataFrame by the unique values in one of its columns: species_id.
- We then asked for the record_id column in the output from the last part.
- Finally, we called the count() method on this column. Because the data are grouped by species_id, we get a count within each unique value of species_id, not a count of all the rows in the table.

1. In the example above, we asked for the record_id column. Try asking for a different column in the square brackets. Do you get a different result? Why or why not?
2. How can we get a count of just the records with species_id set to DO? Hint: You can build on the last command we executed; think about Dictionaries and key-value pairs.
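A sketch of one way to answer the second question, treating the grouped counts like a dictionary keyed by species_id:

```python
# species_counts is indexed by species_id, so we can look up a single species
species_counts['DO']
```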
We can perform quick mathematical transformations on the values of a column in a straightforward way. For instance, we might normalize the measured hindfoot lengths by their mean.
In [ ]:
surveys['hindfoot_length'] / surveys['hindfoot_length'].mean()
We can also visualize summary statistics using Pandas. First, we'll fire off an IPython "magic" function that will allow us to view plots inline in the Jupyter Notebook.
In [40]:
%matplotlib inline
In [52]:
# Create a quick bar chart
species_counts.plot(kind = 'bar')
Out[52]:
This plot is kind of cramped and hard to read; we can make it slightly bigger by specifying a new width and height (in inches).
In [53]:
species_counts.plot(figsize=(9, 6), kind = 'bar')
Out[53]:
Note: Some of the species have no weight measurements; they are entered as NaN
, which stands for "not a number" and refers to missing values.
Let's make a stacked bar plot, with the total number of individuals on the Y axis and the stacked variable being sex
. The plot should show total count by sex for each survey plot.
First, we'll calculate the total number of individuals in each unique pair of plot_id and sex.
In [81]:
plot_sex_count = surveys.groupby(['plot_id', 'sex'])['record_id'].count()
plot_sex_count
Out[81]:
Next, we use the unstack()
method on our grouped data to figure out the total count that each sex contributes to each plot.
Here, unstack()
simply rearranges the rows and columns into a more convenient format.
In [85]:
my_plot = plot_sex_count.unstack().plot(kind = 'bar', stacked = True)
my_plot.set_title('Total count by plot and sex')
my_plot.set_ylabel('Number of individuals')
my_plot.set_xlabel('Plot ID')
Out[85]:
Up to this point, we have learned:
Often, once we've pulled in data and done some basic data cleaning, we want to transform our data to make it more useful for our analysis. For instance, we might find that:
Some of these objectives are perhaps best handled in OpenRefine, as we saw. However, we can also transform our data easily enough with Python and Pandas.
In our surveys
data, the animal weights are in grams. What if our collaborator says she needs the weights in kilograms?
In [45]:
# Create a copy of our original data
surveys_converted = surveys.copy()
# Create a new column in this table
surveys_converted['weight_kg'] = surveys_converted.weight / 1000
surveys_converted.tail()
Out[45]:
Note that when we refer to a column as an attribute (using dot notation), we cannot use it to create a new column; the assignment below does not add a weight_lbs column to the DataFrame.
In [48]:
# Convert grams to pounds
surveys_converted.weight_lbs = surveys_converted.weight * 0.00220462
surveys_converted.tail()
Out[48]:
We can effect a more general transformation of a column's values. Let's say that, for whatever reason, we want the text strings in species_id to be lower-case, not upper-case.
In [ ]:
surveys.species_id.apply(lambda x: x.lower()).head()
What happened? Some of the values in species_id are missing; Pandas stores missing values as NaN, which is a floating-point number, not a string, so calling lower() on those entries fails. We need to explicitly convert each value to a string first.
In [63]:
surveys.species_id.apply(lambda x: str(x).lower()).head()
Out[63]:
The use of the apply()
function here is part of a general practice called function application. We can also apply a function to our entire DataFrame along one of its two axes. For instance, we can use this to count how many non-null values are in each of the columns.
In [70]:
surveys.weight.count()
Out[70]:
In [73]:
surveys.apply(lambda column: column.count(), axis = 0)
Out[73]:
We can also apply functions across the columns.
In [79]:
# Returns True where weight is not null, False where it is
surveys.apply(lambda row: pd.notnull(row.weight), axis = 1).head()
Out[79]:
Now, we'll explore how to access different parts of our data, including querying subsets of the data based on certain criteria.
We use square brackets, []
, to select a subset of a Python object. For example, we can select all of the data from a column named species_id
:
In [ ]:
surveys['species_id']
surveys.species_id
We can pass a list of column names, too, as an index to select columns in a specified order.
In [ ]:
surveys[['species_id', 'plot_id']]
surveys[['plot_id', 'species_id']]
Recall that in Python, we start counting from zero instead of one. This means that the first element in an object is located at position zero.
In [94]:
grades = [88, 72, 93, 94]
In [95]:
grades[2]
Out[95]:
In [96]:
grades[1:3]
Out[96]:
What does each of these lines of code return?

1. grades[0]
2. grades[len(grades)]
3. grades[4]

Why do (2) and (3) return errors?
Slicing using the []
operator selects a set of rows and/or columns from a DataFrame. To slice out a set of rows, we use the following syntax.
In [97]:
# Select the first three rows; rows 0,1,2 (but NOT 3)
surveys[0:3]
Out[97]:
Other examples:
In [98]:
# First three rows, starting index optional
surveys[:3]
Out[98]:
In [100]:
# Select the last element (last row)
surveys[-1:]
Out[100]:
We can also reassign values within subset of our DataFrame. Before we do that, let's make a copy of our DataFrame so as not to modify our original data.
In [119]:
surveys_copy = surveys
surveys_copy[0:3] = 0
surveys_copy.head()
Out[119]:
In [120]:
# Confirm that the original is unchanged
surveys.head()
Out[120]:
We thought that we were creating a copy of the surveys
DataFrame. However, when we assign a variable such as y = x
, this doesn't create a copy of x
; rather, it creates a new variable y
that refers to the same object that x
refers to. This means there is only one object (the DataFrame), and both x
and y
refer to it. Thus, when we modify surveys_copy
, the object that is modified is the same object surveys
points to.
In [122]:
surveys = pd.read_csv('ecology-surveys.csv')
surveys_copy = surveys.copy()
surveys_copy[0:3] = 0
surveys.head()
Out[122]:
We can select specific ranges of our data in both the row and column directions using either label-based or integer-based indexing.
To select a subset of rows and columns, we can use iloc, which selects by integer position (index location). For example, we can select the month, day, and year columns (the second, third, and fourth columns) in the first three rows, like this:
In [124]:
surveys.iloc[0:3,1:4]
Out[124]:
Note that the order of terms in the square brackets is rows first, then columns.
In [125]:
# Get the same columns but all the rows
surveys.iloc[:,1:4].head()
Out[125]:
Here, the colon character in place of the row index indicates we want "all" elements (rows).
We can select multiple, discontiguous rows by passing a list of the row indices.
In [129]:
# Select all columns for rows of index values 0, 10
surveys.iloc[[0, 10], :]
Out[129]:
If we want to select columns by column names, we need to use the loc()
method in place of iloc()
.
In [132]:
surveys.loc[0, ['species_id', 'plot_id', 'weight']]
Out[132]:
When we use loc()
, we can also index using integers, but with one essential difference: loc()
indexes on row or column labels, not the position along the table's rows or columns.
For example, compare the outputs:
In [143]:
surveys.loc[[0, 10, 50000], :]
Out[143]:
In [144]:
surveys.iloc[[0, 10, 50000], :]
What happens when you type:
surveys[0:3]
surveys[:5]
surveys[-1:]
To review...
To index by rows in Pandas:
In [168]:
surveys[0:3]
surveys.iloc[0:3]
surveys.iloc[0:3,:]
Out[168]:
To index by columns (and rows) in Pandas
In [174]:
surveys[['month', 'day', 'year']]
surveys.loc[0:3, ['month', 'day', 'year']]
surveys.iloc[0:3, 1:4]
Out[174]:
Most importantly, we can also select a subset of our data based on certain criteria. For example, we can select all of the rows of our data that match observations in the year 2002.
In [176]:
surveys[surveys.year == 2002].head()
Out[176]:
Or, we can select all rows that do not match observations from 2002.
In [177]:
surveys[surveys.year != 2002].head()
Out[177]:
We can combine criteria using logical operators.
In [182]:
surveys[(surveys.year >= 1980) & (surveys.year <= 1985)].count()
Out[182]:
1. Filter the surveys table to observations of female members of the DO species. How many are there? What is their average weight?
2. The isin() function can test whether each value in a column belongs to a set of values (Hint: ?surveys.year.isin). Use this function to filter the surveys DataFrame to those rows that match the three species: OL, OT, OX.
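Possible sketches for both parts of the challenge (variable names here are ours):

```python
# Observations of female members of the DO species
female_do = surveys[(surveys.species_id == 'DO') & (surveys.sex == 'F')]
len(female_do)            # How many are there?
female_do.weight.mean()   # Their average weight

# Rows matching any of the three species codes
surveys[surveys.species_id.isin(['OL', 'OT', 'OX'])].head()
```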
Earlier, we saw that our data contain some missing values, filled in with NaN or "not a number."
We'll next learn how to handle NaNs when they appear in our data using masks.
A mask can be useful to locate where a particular subset of values exists (or doesn't exist).
To start, we can use the Pandas function isnull()
to find places in our surveys
DataFrame where there are null or NaN values.
In [190]:
pd.isnull(surveys)
Out[190]:
We can use the any() function to ask if there are any null values in a given row. Here, the axis argument specifies the axis along which to look for True values (values that are null); axis=1 looks across the columns of each row, returning one result per row.
In [202]:
pd.isnull(surveys).any(axis=1).head()
Out[202]:
In [201]:
pd.isnull(surveys).any(axis=0)
Out[201]:
To select the rows where there are null values, we can use the mask as an index to subset our data, as follows.
In [205]:
surveys[pd.isnull(surveys).any(axis=1)].head()
Out[205]:
How can we select rows where there are NO null values? We can invert the mask using the tilde operator, ~
.
In [206]:
surveys[~pd.isnull(surveys).any(axis=1)].head()
Out[206]:
It is sometimes hard to know why data values are missing. Was it because of a data entry error? Was there some condition in the field that prevented observation of a particular variable?
Moreover, we need a way of representing missing data that is clear and unambiguous. If we encode missing values as zeroes, for instance, are we sure that zero won't be interpreted as a real value?
For text data, an empty string usually suffices for blank values. However, numeric data fields can't hold an empty string; they have to either hold some number or a special value called NaN
, which stands for "not a number."
In [207]:
len(surveys[pd.isnull(surveys.weight)])
Out[207]:
In [210]:
# The number of null weights equals the total number of rows
# minus the number of rows with a positive weight
len(surveys.weight) - len(surveys[surveys.weight > 0])
Out[210]:
We can replace all the NaN
values in our weight
column using the fillna()
method. Here, we use data management best practices, copying our original data so that we don't modify it.
In [213]:
df = surveys.copy()
df.weight = df.weight.fillna(0)
However, NaN
and zero will yield different results in analysis.
In [214]:
df.weight.mean()
Out[214]:
In [215]:
surveys.weight.mean()
Out[215]:
So, in general, it's better to leave NaN
values as NaN
and to subset the data to remove them, rather than re-coding them.
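For example, a minimal sketch of computing the mean weight after subsetting out the missing values:

```python
# Keep only the rows where weight is not null, then compute the mean
surveys[pd.notnull(surveys.weight)].weight.mean()
```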
At this point, we have learned:
In many situations, the data we want to use come in multiple files. We often need to combine these files into a single DataFrame in order to analyze the data.
Let's load a second table into our session.
In [216]:
surveys = pd.read_csv('ecology-surveys.csv')
species = pd.read_csv('ecology-species.csv')
We can use the concat
function in Pandas to append either columns or rows from one DataFrame to another. Let's create subsets of our data to see how this works...
In [220]:
# First 10 rows of surveys
surveys_first10 = surveys.head(10)
# Last 10 rows of surveys
surveys_last10 = surveys.tail(10)
surveys_last10
Out[220]:
In [221]:
# Reset the index so that we can concatenate properly
surveys_last10 = surveys_last10.reset_index(drop = True)
surveys_last10
Out[221]:
When we concatenate DataFrames, we need to specify the axis. With axis=0
, we are stacking DataFrames on top of one another (row-wise). Pandas will automatically detect that the column names are the same and stack them accordingly.
In [228]:
vertical_stack = pd.concat([surveys_first10, surveys_last10], axis = 0)
vertical_stack
Out[228]:
In [227]:
# In order to avoid duplication of index values
vertical_stack.reset_index()
Out[227]:
We can use the to_csv()
method to export a DataFrame to a CSV file.
In [229]:
vertical_stack.to_csv('out.csv')
Note: Use this webpage as a graphical reference for this segment.
When we concatenated our DataFrames, we simply added them together, stacking them either vertically or horizontally. Another way to combine DataFrames is to use columns in each dataset that contain common values (a common identifier). Combining DataFrames in this way is referred to as a join.
Joining data frames in this way is especially useful when one DataFrame serves as a lookup table, containing additional data that we want to include in the other. For example, the species
table we imported is a lookup table. It contains additional information about each species_id
.
To better understand how this works, let's create a subset of the surveys table.
In [242]:
surveys_sub = surveys.tail(20)
When we join two tables together, they must share a common key. Usually, it is a field that uniquely identifies the records in one of the two tables. In this example, it is the species_id
field that the tables share.
In [243]:
surveys_sub.columns
Out[243]:
In [232]:
species.columns
Out[232]:
The most common type of join is called an inner join. An inner join combines two DataFrames based on a shared key and returns a new DataFrame that contains only those rows that have matching values in both of the original DataFrames.
The Pandas function for performing joins is called merge()
; by default, it performs an inner join.
In [244]:
merged = pd.merge(left = surveys_sub, right = species, left_on = 'species_id', right_on = 'species_id')
merged.shape
Out[244]:
In [245]:
merged
Out[245]:
The result of an inner join of surveys_sub
and species
is a new DataFrame that contains the combined set of columns from surveys_sub
and species
.
Note that the result of this merge is only 19 rows, whereas surveys_sub
started with 20 rows.
If there are rows in surveys_sub
with a species_id
that is not found in the species
lookup table, then they are not included in the merged output. What if we want to add information from species
to surveys_sub
without losing any of the information from surveys
? A left join returns all of the rows from the left DataFrame while joining only whatever rows in the right data frame match.
In [246]:
# Because there is only one shared column, we can neglect giving any column names
merged_left = pd.merge(left = surveys_sub, right = species, how = 'left')
merged_left.shape
Out[246]:
In [248]:
merged_left
Out[248]:
Create a new data frame by joining the contents of surveys
and species
. Then, calculate and plot the distribution of taxa
by plot_id
.
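One possible sketch (variable names here are ours; this assumes the species table has a taxa column, as the challenge implies):

```python
# Join the two tables, then count records by plot and taxa
combined = pd.merge(left=surveys, right=species, how='left', on='species_id')
taxa_by_plot = combined.groupby(['plot_id', 'taxa'])['record_id'].count()
taxa_by_plot.unstack().plot(kind='bar', stacked=True)
```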
So far, we've used Python and the Pandas library to explore and manipulate individual datasets by hand, much like we would do in a spreadsheet. The advantage of using a programming environment like Python, though, comes from its ability to automate data processing through the use of loops and functions.
Let's remind ourselves of how loops work.
In [34]:
animals = ['LIONS', 'TIGERS', 'BEARS']
for creature in animals:
print(creature.lower())
A powerful Python idiom that we can use here is the list comprehension:
In [35]:
[creature.lower() for creature in animals]
Out[35]:
Similarly, a dictionary comprehension is another quick way to transform data in Python.
In [36]:
dict((creature, creature.lower()) for creature in animals)
Out[36]:
As one example of how we can automate data processing with for loops, let's imagine that we want to split our large surveys
table into separate files, one for each survey year. We'll start by making a new directory to hold the output files for each year.
We can use Python's os
library to manipulate files and folders on our computer's file system.
In [252]:
import os
os.mkdir('yearly_files')
The command os.mkdir
is equivalent to the mkdir
command in the Unix shell, which you may have seen if you use a GNU/Linux or Mac computer. To confirm that we just created the yearly_files
folder, we can ask Python to list the contents of our current directory.
In [253]:
os.listdir('.')
Out[253]:
In previous lessons, we saw how to use the Pandas library to load a table into memory as a DataFrame, how to select a subset of the data using some criteria, and how to write the DataFrame to a CSV file.
Let's now write a script that performs those three steps for just the year 2002.
In [254]:
import pandas as pd
# Load the data into a DataFrame
surveys = pd.read_csv('ecology-surveys.csv')
# Select only data for 2002
surveys2002 = surveys[surveys.year == 2002]
# Write the new DataFrame to a CSV file
surveys2002.to_csv('yearly_files/surveys_2002.csv')
One way we could create a file for every year would be to change this Python script for each year. This would be tedious, however, and tedium is what we're trying to avoid by using Python! Moreover, if we are changing our code frequently, we're likely to introduce an error at some point.
What we really want to do is to rewrite this script so that it is more general and can create all the files we need in one run.
Let's start by writing a loop that simply prints the names of the files we want to create; files denoted by each year.
In [255]:
# We can get the years in our data...
surveys.year
Out[255]:
But we want only the unique years... How do we do that again?
In [257]:
surveys.year.unique()
Out[257]:
Putting this into a loop, we can generate filenames for each year.
In [259]:
for year in surveys.year.unique():
filename = 'yearly_files/surveys_%s.csv' % str(year)
print(filename)
A few things to note about this code:

- The loop variable year takes each unique value from the year column of our DataFrame. Because year is an integer, we need to convert it to a string before we can make it part of a filename. This is what the built-in str() function does.
- We used the string placeholder %s to denote where in the text string we want to embed some other text. The % operator is then followed by the string we want to embed.

Now, we have everything we need to create these separate files by year.
In [261]:
surveys = pd.read_csv('ecology-surveys.csv')
for year in surveys.year.unique():
# Select data for the given year
surveys_in_year = surveys[surveys.year == year]
# Write the subdataset to a new file
filename = 'yearly_files/surveys_%s.csv' % str(year)
surveys_in_year.to_csv(filename)
os.listdir('yearly_files')
Out[261]:
1. Some of the yearly files contain rows with NaN values in one or more columns. Modify our for loop so that the entries with null values are not included in the yearly files.
2. Modify the for loop to use, e.g., range(1970, 1980) instead of the unique years in the data. What happens for years that have no data?
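A sketch of one way to approach the first item, dropping rows with any null values before writing each file:

```python
for year in surveys.year.unique():
    # Select data for the given year and drop rows that contain null values
    surveys_in_year = surveys[surveys.year == year].dropna()
    filename = 'yearly_files/surveys_%s.csv' % str(year)
    surveys_in_year.to_csv(filename)
```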
Suppose that separating large data files into individual yearly files is a task that we frequently have to perform. We could write a for loop like the one above every time we needed to do it, but that would be time-consuming and error-prone. A more elegant solution would be to create a reusable tool that performs this task with minimum input from the user.
To achieve this, we'll encapsulate the code we've written so far into a function. As we've seen, functions are reusable, self-contained pieces of code that are called with a single command. The first part of writing a function is envisioning how it will be used; how it will be called by the user.
In [262]:
def one_year_to_csv(all_data, the_year):
'''
Writes a CSV file for data from a given year.
Arguments:
all_data The DataFrame with multi-year data
the_year The year to write, an integer
'''
pass
Some things to note about this function:

- It takes two arguments, the_year and all_data.
- For now, the body of the function contains only pass, which is a special Python command that says, "do nothing."
In [263]:
?one_year_to_csv
Now, let's populate the body of this function with the code we wrote before, making the necessary changes.
In [279]:
def one_year_to_csv(all_data, the_year):
'''
Writes a CSV file for data from a given year.
Arguments:
all_data The DataFrame with multi-year data
the_year The year to write, an integer
'''
# Select data for the given year
data_in_year = all_data[all_data.year == the_year]
# Write the subdataset to a new file
filename = 'yearly_files/surveys_%s.csv' % str(the_year)
data_in_year.to_csv(filename)
In [280]:
one_year_to_csv(surveys, 2002)
os.listdir('yearly_files')
Out[280]:
This is a great start, but what we really want to do is create files for multiple years without having to request each year one at a time. Let's write another function that replicates the for
loop we wrote earlier.
In [286]:
def multiple_years_to_csv(all_data, start_year, end_year):
'''
Writes separate CSV files for each year of data.
Arguments:
all_data The DataFrame with multi-year data
start_year The first year of data we want
end_year The last year of data we want
'''
# Because range() generates a sequence up to but *not* including
# the last number, we add 1
for year in range(start_year, end_year + 1):
one_year_to_csv(all_data, year)
In [287]:
multiple_years_to_csv(surveys, 1977, 2002)
os.listdir('yearly_files')
Out[287]:
We can add optional arguments to help make this function even easier to use.
In [288]:
def multiple_years_to_csv(all_data, start_year=1997, end_year=2002):
'''
Writes separate CSV files for each year of data.
Arguments:
all_data The DataFrame with multi-year data
start_year The first year of data we want
end_year The last year of data we want
'''
# Because range() generates a sequence up to but *not* including
# the last number, we add 1
for year in range(start_year, end_year + 1):
one_year_to_csv(all_data, year)
But what if the years 1997 and 2002 are not in the data?
In [289]:
def multiple_years_to_csv(all_data, start_year=None, end_year=None):
'''
Writes separate CSV files for each year of data.
Arguments:
all_data The DataFrame with multi-year data
start_year The first year of data we want
end_year The last year of data we want
'''
if start_year is None:
start_year = min(all_data.year)
if end_year is None:
end_year = max(all_data.year)
# Because range() generates a sequence up to but *not* including
# the last number, we add 1
for year in range(start_year, end_year + 1):
one_year_to_csv(all_data, year)
In [291]:
multiple_years_to_csv(surveys)
os.listdir('yearly_files/')
Out[291]:
In the last example, we saw how we can build functions that call other functions we've built in the past. In this way, we can develop sophisticated data processing workflows piece-by-piece, based on code we already know works.
1. What type of object corresponds to a variable declared as None? (Hint: Create a variable set to None and use the function type().)
2. What happens if you call multiple_years_to_csv() with all_data and an end_year (that is, without providing a start_year)? Can you write the function call with only a value for end_year?
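A quick sketch for the first question:

```python
x = None
type(x)   # Returns <class 'NoneType'>
```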
There are many questions we ask about tabular data that we can answer with Pandas.
How many animals of each species were caught in each plot? We can create a cross-tabulation of the species_id
and plot_id
fields as follows.
In [83]:
pd.crosstab(surveys.species_id, surveys.plot_id)
Out[83]:
We might ask what the heaviest animals in our dataset are.
In [39]:
surveys.sort_values(by = 'weight', ascending = False).head(10)
Out[39]:
We might also ask for the last observation in our data. We could assume that the data in the table are in chronological order and simply look at the last row; however, there are many reasons why this may not be the case. We want to sort the data with date descending. We don't have a single date field but we can sort by multiple fields to achieve the same effect.
In [40]:
surveys.sort_values(by = ['year', 'month', 'day'], ascending = False).head()
Out[40]: