This lesson introduces Python as an environment for data analysis and visualization. The materials are based on the Data Carpentry Python for Ecologists lesson. However, the lesson focuses on general analysis and visualization of tabular data and is not specific to ecologists or ecological data. As Data Carpentry explains:
Data Carpentry’s aim is to teach researchers basic concepts, skills, and tools for working with data so that they can get more done in less time, and with less pain.
At the end of this lesson, you will be able to:
This lesson will not prepare you to use Python as a general purpose programming language; there are some parts of the language we won't have time to cover. However, at the end of this lesson, you will have a good grasp of Python syntax and be well-prepared to learn the rest of the language, if you desire to do so. Even without seeing all of the Python programming language, you will be prepared to analyze and visualize data in Python using pandas and matplotlib.
As with the Data Carpentry ecology lesson, this lesson is licensed for open use under the CC BY 4.0 license.
Python is a general purpose programming language that allows for the rapid development of scientific workflows. Python's main advantages are:
The only language that computers really understand is machine language, or binary: ones and zeros. Anything we tell computers to do has to be translated to binary for computers to execute.
Python is what we call an interpreted language. This means that computers can translate Python to machine code as they are reading it. This distinguishes Python from languages like C, C++, or Java, which have to be compiled to machine code before they are run. The details aren't important to us; what is important is that we can use Python in two ways:
For this lesson, we'll be using the Python interpreter that is embedded in Jupyter Notebook. Jupyter Notebook is a fancy, browser-based environment for literate programming, the combination of Python scripts with rich text for telling a story about the task you set out to do with Python. This is a powerful way to collect the code, the analysis, the context, and the results in a single place.
The Python interpreter we'll interact with in Jupyter Notebook is the same interpreter we could use from the command line. To launch Jupyter Notebook:
Type jupyter notebook at the command line; then press ENTER.

Let's try out the Python interpreter.
In [1]:
print('Hello, world!')
Alternatively, we could save that one line of Python code to a text file with a *.py
file extension and then execute that file.
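For instance, a minimal sketch of such a script (the filename hello.py is just an example):

```python
# hello.py -- save this single line to a file,
# then run it from a terminal with: python hello.py
print('Hello, world!')
```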
In [4]:
text = 'Data Carpentry' # A character string
number = 42 # An integer number
pi = 3.14159265 # A floating-point number or "float"
Here, we've assigned data to variables using the assignment operator or equal sign. The process of assignment takes a value and stores it under a name that we make up. This way, we can use that stored value again by calling its name.
In [5]:
number
Out[5]:
Note that to recover a variable's stored value, we simply type the name of the variable and hit Enter. (This only works in interactive mode; if we write a script and want it to print out a value, we have to use the print() function.)
Variable names can only include letters, the underscore, and numbers. However, variable names cannot start with numbers.
In [7]:
my_variable = 'some text'
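For example, here are names that are and are not allowed (these names are made up for illustration):

```python
# Valid names: letters, underscores, and digits (not in the first position)
sample_count = 3
_cache2 = 'ok'

# Invalid name: starts with a digit; uncommenting the next line raises a SyntaxError
# 2nd_sample = 5
```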
We can perform mathematical calculations in Python using the basic operators +, -, /, *, and %.
In [8]:
2 + 2
Out[8]:
In [9]:
6 * 7
Out[9]:
In [12]:
5 ** 2
Out[12]:
This is "13 modulo 5" and the result is the remainder, 3.
In [22]:
13 % 5
Out[22]:
We can also use comparison and logical operators. These operators return Boolean values; that is, they determine or describe whether something is True or False.
In [23]:
3 > 4
Out[23]:
In [26]:
5 == 5
Out[26]:
i.e., is it True and True?
In [24]:
True and True
Out[24]:
Chooses the "truth-y" value between the two:
In [25]:
True or False
Out[25]:
True and False, with the first letter capitalized, are special values in Python that mean just what they say.
Much of Python's expressive power and flexibility comes from the way it handles sequences. A sequence could be a sequence of characters in a text string or a sequence of numbers.
A list is Python's built-in data structure for handling general, ordered sequences. Each element can be accessed by its index. Note that, in Python, we start counting from zero, not from one.
In [27]:
numbers = [1, 2, 3]
numbers[0]
Out[27]:
The square brackets are used to slice a sequence by one or more indices. Above, we have asked for the first (the zeroth) element of the numbers
sequence.
A for
loop is a useful way of accessing the elements of a sequence one at a time:
In [28]:
for number in numbers:
print(number)
Indentation is very important in Python. Note that the second line in the above example is indented. This is Python's way of marking a block of code. It's standard to indent by 4 spaces.
To add elements to the end of a list, we can use the append() method:
In [29]:
numbers.append(4)
print(numbers)
Note that there is no output associated with the append()
method; the numbers
sequence is modified in place so we don't need to assign the result to a variable.
Methods are a way to interact with an object in Python. We can invoke a method using the dot, followed by the method name and a list of arguments in parentheses. To find out what methods are available for an object, we can use the built-in help()
function.
In [30]:
help(numbers)
We can also access a list of methods using dir
.
In [31]:
dir(numbers)
Out[31]:
A tuple is similar to a list in that it's an ordered sequence of elements. However, tuples cannot be changed once created; they are "immutable." Tuples are created by placing comma-separated values inside parentheses.
In [33]:
a_tuple = (1,2,3)
another_tuple = ('rabbit', 'mongoose', 'platypus')
still_a_tuple = (1,)
# Note that lists use square brackets
a_list = [1,2,3]
In [1]:
a_tuple = ('a', 'b', 'c', 'd')
a_list = ['a', 'b', 'c', 'd']
a_tuple[0] = 5
a_list[0] = 5
type(a_tuple)
Type type(a_tuple) into Python; what is the object's type?

A dictionary is a container that holds key-value pairs. It is a data structure that is also referred to, in other programming languages, as an associative array.
In [6]:
mapping = {'tbsp': 'tablespoon', 'tsp': 'teaspoon'}
mapping['tbsp']
Out[6]:
Dictionaries are similar to lists in that we can store multiple things inside them. While we index lists with numeric indices, we index dictionaries with keys. A key is a unique identifier for a value in the dictionary.
Keys must be of a hashable data type: mutable containers like lists cannot be used as keys, but strings, numbers, and tuples can.
In [7]:
code_book = {1: 'Valid', 2: 'Invalid'}
code_book[1]
Out[7]:
In [9]:
{[1,2]: 'Valid/Invalid'}
Another way we can create dictionaries is using the dict()
function. Here, we give the dict()
function a list of key-value pairs, represented as tuples. The first element in the tuple is the key, the second element is the value.
In [37]:
dict([(1, 'Valid'), (2, 'Invalid')])
Out[37]:
Using for
loops with dictionaries is a little more complicated.
In [10]:
for key, value in code_book.items():
print(key, '->', value)
Alternatively:
In [11]:
for key in code_book.keys():
print(key, '->', code_book[key])
How can we change a value in our dictionary? Try to reassign one of the values in the code_book
dictionary.
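One possible approach, shown as a sketch (the new label here is made up):

```python
# Reassign the value stored under an existing key
code_book[2] = 'Needs review'
code_book
```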
You may have noticed that a dictionary sometimes returns its items in a different order than you expected. Dictionaries are not meant to be used as ordered collections; in older versions of Python they did not retain the order of their items at all, so always look values up by their keys rather than relying on position.
One of the chief reasons we program computers to do things for us is because computers are very good at tedious tasks (and humans are not). A block of Python code that does the same thing every time is best defined as a function. A function is a series of fixed Python statements, with or without input arguments, that is assigned a name so that we can easily call it over and over again.
We've already seen the built-in print()
and help()
functions. Now let's see how to write our own.
In [15]:
def pow10(exponent):
result = 10 ** exponent
return result
pow10(2)
Out[15]:
Key things to note about this example:

- We used the def command followed by the name of the function and any arguments it takes, written just like we would call the function (recall the print() and help() functions);
- We used the return command to indicate what the result of the function, or its return value, should be. If we don't return anything, the function's body is still executed, but we don't necessarily see any output when the function is called.

How can we make a more general version of the function pow10()?
In [4]:
def pow(exponent, base=10):
result = base ** exponent
return result
pow(2, 10)
Out[4]:
Here, the function pow()
can be used to calculate powers of any base, not just base 10. By default, it will calculate powers of 10, because the argument base
has a default argument of 10.
In [5]:
pow(2, base = 10)
Out[5]:
In [6]:
pow(base = 10, exponent = 2)
Out[6]:
pow()
is actually a function already built into Python, so we didn't need to write this function, but it demonstrates how you can create more flexible functions using default arguments.
To convert temperatures from Fahrenheit to Celsius, we first subtract $32$ and then multiply by $5/9$. Write a function that converts temperatures from Fahrenheit to Celsius.
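One possible solution, shown as a sketch (the function name here is ours):

```python
def fahr_to_celsius(temp_fahr):
    '''Convert a temperature from Fahrenheit to Celsius.'''
    return (temp_fahr - 32) * 5.0 / 9.0

fahr_to_celsius(212)  # Returns 100.0
```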
Another type of function in Python is the lambda function. This is a special, one-line function that is generally used for transforming inputs.
In [66]:
to_lowercase = lambda s: s.lower()
to_lowercase('Roger')
Out[66]:
In [82]:
names = ['Roger', 'Helen', 'Abdul', 'Lakshmi']
list(map(to_lowercase, names))
Out[82]:
Lambda functions are important for reasons we'll see later.
In general, when we have a data analysis task, there are specific steps that we undertake, in order:
We'll now see how Python plays a role in this workflow, starting from having "clean" or nearly clean data to begin analyzing.
We've seen how we can use spreadsheets to effectively organize data. You probably know how to do some basic analysis on tabular data using spreadsheet software programs like Microsoft Excel. Why, then, should we learn to use Python for data analysis?
For this lesson to run smoothly, let's make sure everyone is in the same directory. In the Jupyter Notebook file tree, navigate to your Desktop. We'll create a new Notebook here to use for the rest of the lesson.
For this lesson, we will be using the Portal Project Teaching Database, a subset of data from Ernst et al.'s (2009) long-term monitoring and experimental manipulation of a Chihuahuan desert ecosystem near Portal, Arizona, U.S.A. We are studying the species and weight of animals caught in plots in our study area. The dataset is stored as a comma-separated values (CSV) file: each row holds information for a single animal, and the columns represent:
| Column | Description |
|---|---|
| record_id | Unique ID for the observation |
| month | Month of observation |
| day | Day of observation |
| year | Year of observation |
| plot_id | ID of a particular plot |
| species_id | 2-letter code identifying the species |
| sex | Sex of animal ("M", "F") |
| hindfoot_length | Length of the hindfoot in millimeters |
| weight | Weight of the animal in grams |
A library in Python contains a set of tools (called functions) that perform tasks on our data.
Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench for use in a project.
Once a library is set up, it can be used to perform many tasks.
One of the libraries we'll be using in this lesson is the Python Data Analysis Library, or pandas
.
pandas
adds a number of things to base Python including more sophisticated data structures and tools to connect to other libraries like matplotlib
for producing data visualizations.
Python doesn't initially load all of the libraries that we have installed; we have to tell Python to import the libraries that we want to use in any particular session.
To import a library, we use the syntax: import libraryName
.
Because all of the tools in that library will be known by the name we import, we have the option to give it a shorter name.
We'll do that here with Pandas.
In [2]:
import pandas as pd
Each time we call a function that's in the Pandas library, we use the syntax libraryName.functionName
.
This ensures that Python can find the tool or function we're asking for.
Most Pandas users abbreviate the library name to pd
and because this is such a popular convention, you can often find help on the internet by searching for a particular function in the pd
namespace.
For instance, I'll look up support for the pd.read_csv()
function on the Google search engine.
Tabular data, like those stored in CSV, tab-delimited, or fixed-width formats, can be read into Python using Pandas and stored in a Pandas DataFrame
.
If you've used the R programming language before, you'll be familiar with DataFrames.
A DataFrame is a 2-dimensional, tabular data structure that can store different types of data across multiple columns.
It is structurally identical to a spreadsheet in this way or, as we'll see, a table in a SQL database.
We can read in our survey data as a DataFrame using the pd.read_csv()
function.
In [ ]:
pd.read_csv('ecology-surveys.csv')
When we run this command, Pandas finds the CSV file, opens it, and reads in the contents, line-by-line, as rows in a new DataFrame, which it then prints out. We can see that:
The first column is the index of the DataFrame. The index is used to identify the position of the data, but it is not an actual column of the DataFrame.
We can see from this output that Pandas read the file properly, but it hasn't saved the DataFrame to memory. We need to assign the DataFrame to a variable.
In [115]:
surveys = pd.read_csv('ecology-surveys.csv')
Note that when we do variable assignment, Python does not display anything on the screen. The output of the pd.read_csv() function, our table, instead of being displayed on the screen, is now captured in the variable named surveys.
In [5]:
type(surveys)
Out[5]:
One of the first things we might do after importing any kind of data in any environment is to make sure that our data--the individual fields or columns--are represented in the right way, by the right data types.
In [6]:
surveys.dtypes
Out[6]:
Here, int64
refers to integer type; it cannot store decimal numbers. float64
stores decimal numbers with 64-bit precision. object
refers to character strings, or text.
Try executing each code sample below and see what is returned.
surveys.columns
surveys.head()
surveys.head(15)
surveys.tail()
surveys.shape
Take note of the output of surveys.shape
; what format does it return?
Finally, what is the difference between the code samples that end in parentheses and those that do not?
Each of the code samples above has us calling some attribute or method on the surveys DataFrame.
If we think of a person, an attribute is something that belongs to that person or describes that person, like hair color or number of siblings. A method is something that person does, like bake a pie or go for a run.
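For example, among the code samples above, shape is an attribute that describes the DataFrame, while head() is a method that asks the DataFrame to do something:

```python
surveys.shape    # An attribute: a (rows, columns) tuple describing the DataFrame
surveys.head()   # A method: an action that returns the first five rows
```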
Let's perform some summary statistics to further verify that the data we imported look okay. First, let's remind ourselves what is in our data and what values we might use to group the data by in calculating summary statistics.
In [23]:
# Look at the column names
surveys.columns.values
Out[23]:
Let's get a list of all the species.
The pd.unique()
function tells us all of the unique values in the species_id
column.
In [11]:
pd.unique(surveys['species_id'])
Out[11]:
Note the bracket notation we've used here. This is how we access a column by name in a Pandas DataFrame. It's just like how we accessed the value in a Dictionary by its key name.
1. Create a list of the unique plot IDs found in the surveys data and call it plot_names. How many unique plots are there in the data? How many unique species are in the data?
2. What is the difference between len(plot_names) and plot_names.shape?
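One possible sketch of the first step (the variable name plot_names comes from the challenge):

```python
plot_names = pd.unique(surveys['plot_id'])
len(plot_names)     # Number of unique plots, as an integer
plot_names.shape    # The same information, returned as a tuple
```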
We often want to calculate summary statistics grouped by subsets or attributes within fields of our data. For example, we might want to calculate the average weight of all individuals in each plot.
We can calculate basic statistics for all records in a single column using the syntax below.
In [24]:
surveys['weight'].describe()
Out[24]:
Let's break that last command down.

- First, we selected the weight column of the surveys table using square brackets.
- Then, we called the describe() method on that column.

We can also extract a specific statistic.
In [26]:
surveys['weight'].mean()
Out[26]:
In [27]:
surveys['weight'].count()
Out[27]:
If we want to summarize by one or more variables, for example, sex, we can use Pandas' groupby() method. Once we've created a grouped DataFrame, we can quickly calculate summary statistics by a group of our choice.
In [29]:
# Group data by sex
grouped_by_sex = surveys.groupby('sex')
# Summary statistics for all numeric columns, grouped by sex
grouped_by_sex.describe()
Out[29]:
In [30]:
grouped_by_sex.mean()
Out[30]:
Let's count the number of samples for each species.
We can do this in a few ways, but we'll use groupby()
combined with a count()
method.
In [42]:
species_counts = surveys.groupby('species_id')['record_id'].count()
species_counts
Out[42]:
Again, let's break this down:
Again, let's break this down:

- We grouped the surveys DataFrame by the unique values in one of its columns: species_id.
- We then asked for the record_id column in the output from the last part.
- Finally, we called the count() method on this column. Because the data are grouped by species_id, we get a count within each unique value of species_id, not a count of all the rows in the table.

1. In the example above, we asked for the record_id column. Try asking for a different column in the square brackets. Do you get a different result? Why or why not?
2. How can we get a count of just the records with species_id set to DO? Hint: You can build on the last command we executed; think about Dictionaries and key-value pairs.
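A sketch of one way to answer the second question, treating the grouped counts like a dictionary keyed by species_id:

```python
# species_counts is indexed by species_id, so we can look up a single species
species_counts['DO']
```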
We can perform quick mathematical transformations on the values of a column in a straightforward way. For instance, we might normalize the measured hindfoot lengths by their mean.
In [ ]:
surveys['hindfoot_length'] / surveys['hindfoot_length'].mean()
We can also visualize summary statistics using Pandas. First, we'll fire off an IPython "magic" function that will allow us to view plots inline in the Jupyter Notebook.
In [40]:
%matplotlib inline
In [52]:
# Create a quick bar chart
species_counts.plot(kind = 'bar')
Out[52]:
This plot is kind of cramped and hard to read; we can make it slightly bigger by specifying a new width and height (in inches).
In [53]:
species_counts.plot(figsize=(9, 6), kind = 'bar')
Out[53]:
Note: Some of the species have no weight measurements; they are entered as NaN
, which stands for "not a number" and refers to missing values.
Let's make a stacked bar plot, with the total number of individuals on the Y axis and the stacked variable being sex
. The plot should show total count by sex for each survey plot.
First, we'll calculate the total number of individuals in each unique pair of plot_id and sex.
In [81]:
plot_sex_count = surveys.groupby(['plot_id', 'sex'])['record_id'].count()
plot_sex_count
Out[81]:
Next, we use the unstack()
method on our grouped data to figure out the total count that each sex contributes to each plot.
Here, unstack()
simply rearranges the rows and columns into a more convenient format.
In [85]:
my_plot = plot_sex_count.unstack().plot(kind = 'bar', stacked = True)
my_plot.set_title('Total count by plot and sex')
my_plot.set_ylabel('Number of individuals')
my_plot.set_xlabel('Plot ID')
Out[85]:
Up to this point, we have learned:
Often, once we've pulled in data and done some basic data cleaning, we want to transform our data to make it more useful for our analysis. For instance, we might find that:
Some of these objectives are perhaps best handled in OpenRefine, as we saw. However, we can also transform our data easily enough with Python and Pandas.
In our surveys
data, the animal weights are in grams. What if our collaborator says she needs the weights in kilograms?
In [45]:
# Create a copy of our original data
surveys_converted = surveys.copy()
# Create a new column in this table
surveys_converted['weight_kg'] = surveys_converted.weight / 1000
surveys_converted.tail()
Out[45]:
Note that when we refer to a column as an attribute (using dot notation), we cannot use it to create a new column; the assignment below does not add a weight_lbs column to the DataFrame.
In [48]:
# Convert grams to pounds
surveys_converted.weight_lbs = surveys_converted.weight * 0.00220462
surveys_converted.tail()
Out[48]:
We can effect a more general transformation of a column's values. Let's say that, for whatever reason, we want the text strings in species_id to be lower-case, not upper-case.
In [ ]:
surveys.species_id.apply(lambda x: x.lower()).head()
What happened? Some of the values in species_id are missing; Pandas stores missing values as NaN, which is a floating-point number, not a string, so calling lower() on those entries fails. We need to explicitly convert each value to a string first.
In [63]:
surveys.species_id.apply(lambda x: str(x).lower()).head()
Out[63]:
The use of the apply()
function here is part of a general practice called function application. We can also apply a function to our entire DataFrame along one of its two axes. For instance, we can use this to count how many non-null values are in each of the columns.
In [70]:
surveys.weight.count()
Out[70]:
In [73]:
surveys.apply(lambda column: column.count(), axis = 0)
Out[73]:
We can also apply functions across the columns.
In [79]:
# Returns True where weight is not null, False where it is
surveys.apply(lambda row: pd.notnull(row.weight), axis = 1).head()
Out[79]:
Now, we'll explore how to access different parts of our data, including querying subsets of the data based on certain criteria.
We use square brackets, []
, to select a subset of a Python object. For example, we can select all of the data from a column named species_id
:
In [ ]:
surveys['species_id']
surveys.species_id
We can pass a list of column names, too, as an index to select columns in a specified order.
In [ ]:
surveys[['species_id', 'plot_id']]
surveys[['plot_id', 'species_id']]
Recall that in Python, we start counting from zero instead of one. This means that the first element in an object is located at position zero.
In [94]:
grades = [88, 72, 93, 94]
In [95]:
grades[2]
Out[95]:
In [96]:
grades[1:3]
Out[96]:
What does each of these lines of code return?

1. grades[0]
2. grades[len(grades)]
3. grades[4]

Why do (2) and (3) return errors?
Slicing using the []
operator selects a set of rows and/or columns from a DataFrame. To slice out a set of rows, we use the following syntax.
In [97]:
# Select the first three rows; rows 0,1,2 (but NOT 3)
surveys[0:3]
Out[97]:
Other examples:
In [98]:
# First three rows, starting index optional
surveys[:3]
Out[98]:
In [100]:
# Select the last element (last row)
surveys[-1:]
Out[100]:
We can also reassign values within subset of our DataFrame. Before we do that, let's make a copy of our DataFrame so as not to modify our original data.
In [119]:
surveys_copy = surveys
surveys_copy[0:3] = 0
surveys_copy.head()
Out[119]:
In [120]:
# Confirm that the original is unchanged
surveys.head()
Out[120]:
We thought that we were creating a copy of the surveys
DataFrame. However, when we assign a variable such as y = x
, this doesn't create a copy of x
; rather, it creates a new variable y
that refers to the same object that x
refers to. This means there is only one object (the DataFrame), and both x
and y
refer to it. Thus, when we modify surveys_copy
, the object that is modified is the same object surveys
points to.
In [122]:
surveys = pd.read_csv('ecology-surveys.csv')
surveys_copy = surveys.copy()
surveys_copy[0:3] = 0
surveys.head()
Out[122]:
We can select specific ranges of our data in both the row and column directions using either label-based or integer-based indexing.
To select a subset of rows and columns, we can use iloc, which selects by integer position (index location). For example, we can select the month, day, and year columns (the second, third, and fourth columns) in the first three rows, like this:
In [124]:
surveys.iloc[0:3,1:4]
Out[124]:
Note that the order of terms in the square brackets is rows first, then columns.
In [125]:
# Get the same columns but all the rows
surveys.iloc[:,1:4].head()
Out[125]:
Here, the colon character in place of the row index indicates we want "all" elements (rows).
We can select multiple, discontiguous rows by passing a list of the row indices.
In [129]:
# Select all columns for rows of index values 0, 10
surveys.iloc[[0, 10], :]
Out[129]:
If we want to select columns by column names, we need to use the loc()
method in place of iloc()
.
In [132]:
surveys.loc[0, ['species_id', 'plot_id', 'weight']]
Out[132]:
When we use loc()
, we can also index using integers, but with one essential difference: loc()
indexes on row or column labels, not the position along the table's rows or columns.
For example, compare the outputs:
In [143]:
surveys.loc[[0, 10, 50000], :]
Out[143]:
In [144]:
surveys.iloc[[0, 10, 50000], :]
What happens when you type:
surveys[0:3]
surveys[:5]
surveys[-1:]
To review...
To index by rows in Pandas:
In [168]:
surveys[0:3]
surveys.iloc[0:3]
surveys.iloc[0:3,:]
Out[168]:
To index by columns (and rows) in Pandas
In [174]:
surveys[['month', 'day', 'year']]
surveys.loc[0:3, ['month', 'day', 'year']]
surveys.iloc[0:3, 1:4]
Out[174]:
Most importantly, we can also select a subset of our data based on certain criteria. For example, we can select all of the rows of our data that match observations in the year 2002.
In [176]:
surveys[surveys.year == 2002].head()
Out[176]:
Or, we can select all rows that do not match observations from 2002.
In [177]:
surveys[surveys.year != 2002].head()
Out[177]:
We can combine criteria using logical operators.
In [182]:
surveys[(surveys.year >= 1980) & (surveys.year <= 1985)].count()
Out[182]:
1. Filter the surveys table to observations of female members of the DO species. How many are there? What is their average weight?
2. The isin() function can test whether each value in a column belongs to a set of values (Hint: ?surveys.year.isin). Use this function to filter the surveys DataFrame to those rows that match the three species: OL, OT, OX.
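Possible sketches for both parts of the challenge (variable names here are ours):

```python
# Observations of female members of the DO species
female_do = surveys[(surveys.species_id == 'DO') & (surveys.sex == 'F')]
len(female_do)            # How many are there?
female_do.weight.mean()   # Their average weight

# Rows matching any of the three species codes
surveys[surveys.species_id.isin(['OL', 'OT', 'OX'])].head()
```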
Earlier, we saw that our data contain some missing values, filled in with NaN or "not a number."
We'll next learn how to handle NaNs when they appear in our data using masks.
A mask can be useful to locate where a particular subset of values exists (or doesn't exist).
To start, we can use the Pandas function isnull()
to find places in our surveys
DataFrame where there are null or NaN values.
In [190]:
pd.isnull(surveys)
Out[190]:
We can use the any() function to ask if there are any null values in a given row. Here, the axis argument specifies the axis along which to look for True values (values that are null); axis=1 looks across the columns of each row, returning one result per row.
In [202]:
pd.isnull(surveys).any(axis=1).head()
Out[202]:
In [201]:
pd.isnull(surveys).any(axis=0)
Out[201]:
To select the rows where there are null values, we can use the mask as an index to subset our data, as follows.
In [205]:
surveys[pd.isnull(surveys).any(axis=1)].head()
Out[205]:
How can we select rows where there are NO null values? We can invert the mask using the tilde operator, ~
.
In [206]:
surveys[~pd.isnull(surveys).any(axis=1)].head()
Out[206]:
It is sometimes hard to know why data values are missing. Was it because of a data entry error? Was there some condition in the field that prevented observation of a particular variable?
Moreover, we need a way of representing missing data that is clear and unambiguous. If we encode missing values as zeroes, for instance, are we sure that zero won't be interpreted as a real value?
For text data, an empty string usually suffices for blank values. However, numeric data fields can't hold an empty string; they have to either hold some number or a special value called NaN
, which stands for "not a number."
In [207]:
len(surveys[pd.isnull(surveys.weight)])
Out[207]:
In [210]:
# The number of null weights equals the total number of rows
# minus the number of rows with a positive weight
len(surveys.weight) - len(surveys[surveys.weight > 0])
Out[210]:
We can replace all the NaN
values in our weight
column using the fillna()
method. Here, we use data management best practices, copying our original data so that we don't modify it.
In [213]:
df = surveys.copy()
df.weight = df.weight.fillna(0)
However, NaN
and zero will yield different results in analysis.
In [214]:
df.weight.mean()
Out[214]:
In [215]:
surveys.weight.mean()
Out[215]:
So, in general, it's better to leave NaN
values as NaN
and to subset the data to remove them, rather than re-coding them.
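For example, a minimal sketch of computing the mean weight after subsetting out the missing values:

```python
# Keep only the rows where weight is not null, then compute the mean
surveys[pd.notnull(surveys.weight)].weight.mean()
```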
At this point, we have learned:
In many situations, the data we want to use come in multiple files. We often need to combine these files into a single DataFrame in order to analyze the data.
Let's load a second table into our session.
In [216]:
surveys = pd.read_csv('ecology-surveys.csv')
species = pd.read_csv('ecology-species.csv')
We can use the concat
function in Pandas to append either columns or rows from one DataFrame to another. Let's create subsets of our data to see how this works...
In [220]:
# First 10 rows of surveys
surveys_first10 = surveys.head(10)
# Last 10 rows of surveys
surveys_last10 = surveys.tail(10)
surveys_last10
Out[220]:
In [221]:
# Reset the index so that we can concatenate properly
surveys_last10 = surveys_last10.reset_index(drop = True)
surveys_last10
Out[221]:
When we concatenate DataFrames, we need to specify the axis. With axis=0
, we are stacking DataFrames on top of one another (row-wise). Pandas will automatically detect that the column names are the same and stack them accordingly.
In [228]:
vertical_stack = pd.concat([surveys_first10, surveys_last10], axis = 0)
vertical_stack
Out[228]:
In [227]:
# In order to avoid duplication of index values
vertical_stack.reset_index()
Out[227]:
We can use the to_csv()
method to export a DataFrame to a CSV file.
In [229]:
vertical_stack.to_csv('out.csv')
Note: Use this webpage as a graphical reference for this segment.
When we concatenated our DataFrames, we simply added them together, stacking them either vertically or horizontally. Another way to combine DataFrames is to use columns in each dataset that contain common values (a common identifier). Combining DataFrames in this way is referred to as a join.
Joining data frames in this way is especially useful when one DataFrame serves as a lookup table, containing additional data that we want to include in the other. For example, the species
table we imported is a lookup table. It contains additional information about each species_id
.
To better understand how this works, let's create a subset of the surveys table.
In [242]:
surveys_sub = surveys.tail(20)
When we join two tables together, they must share a common key. Usually, it is a field that uniquely identifies the records in one of the two tables. In this example, it is the species_id
field that the tables share.
In [243]:
surveys_sub.columns
Out[243]:
In [232]:
species.columns
Out[232]:
The most common type of join is called an inner join. An inner join combines two DataFrames based on a shared key and returns a new DataFrame that contains only those rows that have matching values in both of the original DataFrames.
The Pandas function for performing joins is called merge()
; by default, it performs an inner join.
In [244]:
merged = pd.merge(left = surveys_sub, right = species, left_on = 'species_id', right_on = 'species_id')
merged.shape
Out[244]:
In [245]:
merged
Out[245]:
The result of an inner join of surveys_sub
and species
is a new DataFrame that contains the combined set of columns from surveys_sub
and species
.
Note that the result of this merge is only 19 rows, whereas surveys_sub
started with 20 rows.
If there are rows in surveys_sub
with a species_id
that is not found in the species
lookup table, then they are not included in the merged output. What if we want to add information from species
to surveys_sub
without losing any of the information from surveys
? A left join returns all of the rows from the left DataFrame while joining only whatever rows in the right data frame match.
In [246]:
# Because there is only one shared column, we can neglect giving any column names
merged_left = pd.merge(left = surveys_sub, right = species, how = 'left')
merged_left.shape
Out[246]:
In [248]:
merged_left
Out[248]:
Create a new data frame by joining the contents of surveys
and species
. Then, calculate and plot the distribution of taxa
by plot_id
.
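One possible sketch (variable names here are ours; this assumes the species table has a taxa column, as the challenge implies):

```python
# Join the two tables, then count records by plot and taxa
combined = pd.merge(left=surveys, right=species, how='left', on='species_id')
taxa_by_plot = combined.groupby(['plot_id', 'taxa'])['record_id'].count()
taxa_by_plot.unstack().plot(kind='bar', stacked=True)
```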
So far, we've used Python and the Pandas library to explore and manipulate individual datasets by hand, much like we would do in a spreadsheet. The advantage of using a programming environment like Python, though, comes from its ability to automate data processing through the use of loops and functions.
Let's remind ourselves of how loops work.
In [34]:
animals = ['LIONS', 'TIGERS', 'BEARS']
for creature in animals:
print(creature.lower())
A powerful Python idiom that we can use here is the list comprehension:
In [35]:
[creature.lower() for creature in animals]
Out[35]:
Similarly, a dictionary comprehension is another quick way to transform data in Python.
In [36]:
dict((creature, creature.lower()) for creature in animals)
Out[36]:
As one example of how we can automate data processing with for loops, let's imagine that we want to split our large surveys
table into separate files, one for each survey year. We'll start by making a new directory to hold the output files for each year.
We can use Python's os
library to manipulate files and folders on our computer's file system.
In [252]:
import os
os.mkdir('yearly_files')
The command os.mkdir
is equivalent to the mkdir
command in the Unix shell, which you may have seen if you use a GNU/Linux or Mac computer. To confirm that we just created the yearly_files
folder, we can ask Python to list the contents of our current directory.
In [253]:
os.listdir('.')
Out[253]:
In previous lessons, we saw how to use the Pandas library to load a table into memory as a DataFrame, how to select a subset of the data using some criteria, and how to write the DataFrame to a CSV file.
Let's now write a script that performs those three steps for just the year 2002.
In [254]:
import pandas as pd
# Load the data into a DataFrame
surveys = pd.read_csv('ecology-surveys.csv')
# Select only data for 2002
surveys2002 = surveys[surveys.year == 2002]
# Write the new DataFrame to a CSV file
surveys2002.to_csv('yearly_files/surveys_2002.csv')
One way we could create a file for every year would be to change this Python script for each year. This would be tedious, however, and tedium is what we're trying to avoid by using Python! Moreover, if we are changing our code frequently, we're likely to introduce an error at some point.
What we really want to do is to rewrite this script so that it is more general and can create all the files we need in one run.
Let's start by writing a loop that simply prints the names of the files we want to create; files denoted by each year.
In [255]:
# We can get the years in our data...
surveys.year
Out[255]:
But we want only the unique years... How do we do that again?
In [257]:
surveys.year.unique()
Out[257]:
Putting this into a loop, we can generate filenames for each year.
In [259]:
for year in surveys.year.unique():
filename = 'yearly_files/surveys_%s.csv' % str(year)
print(filename)
A few things to note about this code:

- The loop variable year takes each unique value from the year column of our DataFrame. Because year is an integer, we need to convert it to a string before we can make it part of a filename. This is what the built-in str() function does.
- We used the string placeholder %s to denote where in the text string we want to embed some other text. The % operator is then followed by the string we want to embed.

Now, we have everything we need to create these separate files by year.
In [261]:
surveys = pd.read_csv('ecology-surveys.csv')
for year in surveys.year.unique():
# Select data for the given year
surveys_in_year = surveys[surveys.year == year]
# Write the subdataset to a new file
filename = 'yearly_files/surveys_%s.csv' % str(year)
surveys_in_year.to_csv(filename)
os.listdir('yearly_files')
Out[261]:
1. Some of the yearly files contain rows with NaN values in one or more columns. Modify our for loop so that the entries with null values are not included in the yearly files.
2. Modify the for loop to use, e.g., range(1970, 1980) instead of the unique years in the data. What happens for years that have no data?
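A sketch of one way to approach the first item, dropping rows with any null values before writing each file:

```python
for year in surveys.year.unique():
    # Select data for the given year and drop rows that contain null values
    surveys_in_year = surveys[surveys.year == year].dropna()
    filename = 'yearly_files/surveys_%s.csv' % str(year)
    surveys_in_year.to_csv(filename)
```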
Suppose that separating large data files into individual yearly files is a task that we frequently have to perform. We could write a for loop like the one above every time we needed to do it, but that would be time-consuming and error-prone. A more elegant solution would be to create a reusable tool that performs this task with minimum input from the user.
To achieve this, we'll encapsulate the code we've written so far into a function. As we've seen, functions are reusable, self-contained pieces of code that are called with a single command. The first part of writing a function is envisioning how it will be used; how it will be called by the user.
In [262]:
def one_year_to_csv(all_data, the_year):
'''
Writes a CSV file for data from a given year.
Arguments:
all_data The DataFrame with multi-year data
the_year The year to write, an integer
'''
pass
Some things to note about this function:

- It takes two arguments, the_year and all_data.
- For now, the body of the function contains only pass, which is a special Python command that says, "do nothing."
In [263]:
?one_year_to_csv
Now, let's populate the body of this function with the code we wrote before, making the necessary changes.
In [279]:
def one_year_to_csv(all_data, the_year):
'''
Writes a CSV file for data from a given year.
Arguments:
all_data The DataFrame with multi-year data
the_year The year to write, an integer
'''
# Select data for the given year
data_in_year = all_data[all_data.year == the_year]
# Write the subdataset to a new file
filename = 'yearly_files/surveys_%s.csv' % str(the_year)
data_in_year.to_csv(filename)
In [280]:
one_year_to_csv(surveys, 2002)
os.listdir('yearly_files')
Out[280]:
This is a great start, but what we really want to do is create files for multiple years without having to request each year one at a time. Let's write another function that replicates the for
loop we wrote earlier.
In [286]:
def multiple_years_to_csv(all_data, start_year, end_year):
'''
Writes separate CSV files for each year of data.
Arguments:
all_data The DataFrame with multi-year data
start_year The first year of data we want
end_year The last year of data we want
'''
# Because range() generates a sequence up to but *not* including
# the last number, we add 1
for year in range(start_year, end_year + 1):
one_year_to_csv(all_data, year)
In [287]:
multiple_years_to_csv(surveys, 1977, 2002)
os.listdir('yearly_files')
Out[287]:
We can add optional arguments to help make this function even easier to use.
In [288]:
def multiple_years_to_csv(all_data, start_year=1997, end_year=2002):
'''
Writes separate CSV files for each year of data.
Arguments:
all_data The DataFrame with multi-year data
start_year The first year of data we want
end_year The last year of data we want
'''
# Because range() generates a sequence up to but *not* including
# the last number, we add 1
for year in range(start_year, end_year + 1):
one_year_to_csv(all_data, year)
But what if the years 1997 and 2002 are not in the data?
In [289]:
def multiple_years_to_csv(all_data, start_year=None, end_year=None):
'''
Writes separate CSV files for each year of data.
Arguments:
all_data The DataFrame with multi-year data
start_year The first year of data we want
end_year The last year of data we want
'''
if start_year is None:
start_year = min(all_data.year)
if end_year is None:
end_year = max(all_data.year)
# Because range() generates a sequence up to but *not* including
# the last number, we add 1
for year in range(start_year, end_year + 1):
one_year_to_csv(all_data, year)
In [291]:
multiple_years_to_csv(surveys)
os.listdir('yearly_files/')
Out[291]:
In the last example, we saw how we can build functions that call other functions we've built in the past. In this way, we can develop sophisticated data processing workflows piece-by-piece, based on code we already know works.
1. What type of object corresponds to a variable declared as None? (Hint: Create a variable set to None and use the function type().)
2. What happens if you call multiple_years_to_csv() with all_data and an end_year (that is, without providing a start_year)? Can you write the function call with only a value for end_year?
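A quick sketch for the first question:

```python
x = None
type(x)   # Returns <class 'NoneType'>
```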
There are many questions we ask about tabular data that we can answer with Pandas.
How many animals of each species were caught in each plot? We can create a cross-tabulation of the species_id
and plot_id
fields as follows.
In [83]:
pd.crosstab(surveys.species_id, surveys.plot_id)
Out[83]:
We might ask what the heaviest animals in our dataset are.
In [39]:
surveys.sort_values(by = 'weight', ascending = False).head(10)
Out[39]:
We might also ask for the last observation in our data. We could assume that the data in the table are in chronological order and simply look at the last row; however, there are many reasons why this may not be the case. We want to sort the data with date descending. We don't have a single date field but we can sort by multiple fields to achieve the same effect.
In [40]:
surveys.sort_values(by = ['year', 'month', 'day'], ascending = False).head()
Out[40]: