Overview

This lesson introduces Python as an environment for data analysis and visualization. The materials are based on the Data Carpentry Python for Ecologists lesson. However, the lesson focuses on general analysis and visualization of tabular data and is not specific to ecologists or ecological data. As Data Carpentry explains:

Data Carpentry’s aim is to teach researchers basic concepts, skills, and tools for working with data so that they can get more done in less time, and with less pain.

At the end of this lesson, you will be able to:

  • Read and write basic Python code;
  • Import and export tabular data with Python;
  • Subset and filter tabular data;
  • Understand different data types and data formats;
  • Understand pandas Data Frames and how they help organize tabular data;
  • Devise and interpret data processing workflows;
  • Automate your data cleaning and analysis with Python;
  • Visualize your data using matplotlib and pandas;
  • Connect to a SQLite database using Python.

This lesson will not prepare you to use Python as a general purpose programming language; there are some parts of the language we won't have time to cover. However, at the end of this lesson, you will have a good grasp of Python syntax and be well-prepared to learn the rest of the language, if you desire to do so. Even without seeing all of the Python programming language, you will be prepared to analyze and visualize data in Python using pandas and matplotlib.

License

As with the Data Carpentry ecology lesson, this lesson is licensed for open use under the CC BY 4.0 license.

Introduction to the Python Programming Language

Python is a general purpose programming language that allows for the rapid development of scientific workflows. Python's main advantages are:

  • It is open-source software, supported by the Python Software Foundation;
  • It is available on all platforms, including Windows, Mac OS X, and GNU/Linux;
  • It can be used to program any kind of task (it is a general purpose language);
  • It supports multiple programming paradigms (a fancy term computer scientists use to describe the different ways people like to design software);
  • Most importantly, it has a large and diverse community of users who share Python code they've already written to do a wide variety of things.

The Python Interpreter

The only language that computers really understand is machine language, or binary: ones and zeros. Anything we tell computers to do has to be translated to binary for computers to execute.

Python is what we call an interpreted language. This means that computers can translate Python to machine code as they are reading it. This distinguishes Python from languages like C, C++, or Java, which have to be compiled to machine code before they are run. The details aren't important to us; what is important is that we can use Python in two ways:

  • We can use the Python interpreter in interactive mode;
  • Or, we can execute Python code that is stored in a text file, called a script.

Jupyter Notebook

For this lesson, we'll be using the Python interpreter that is embedded in Jupyter Notebook. Jupyter Notebook is a fancy, browser-based environment for literate programming, the combination of Python scripts with rich text for telling a story about the task you set out to do with Python. This is a powerful way to collect the code, the analysis, the context, and the results in a single place.

The Python interpreter we'll interact with in Jupyter Notebook is the same interpreter we could use from the command line. To launch Jupyter Notebook:

  • In GNU/Linux or Mac OS X, launch the Terminal and type: jupyter notebook; then press ENTER.
  • In Windows, launch the Command Prompt and type jupyter notebook; then press ENTER.

Let's try out the Python interpreter.


In [1]:
print('Hello, world!')


Hello, world!

Alternatively, we could save that one line of Python code to a text file with a *.py file extension and then execute that file.
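For example, suppose we saved that line of code in a file called hello.py (a hypothetical file name):

print('Hello, world!')

We could then run the script from the Terminal or Command Prompt by typing python hello.py and pressing ENTER.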

Python Data Types

Strings, Integers, and Floats


In [4]:
text = 'Data Carpentry' # A character string
number = 42 # An integer number
pi = 3.14159265 # A floating-point number or "float"

Here, we've assigned data to variables using the assignment operator or equal sign. The process of assignment takes a value and stores it under a name that we make up. This way, we can use that stored value again by calling its name.


In [5]:
number


Out[5]:
42

Note that to recover a variable's stored value, we simply type the name of the variable and hit Enter. (This only works in interactive mode; if we wrote a script and want it to print out a value, we have to use the print() function.)

Variable names can only include letters, the underscore, and numbers. However, variable names cannot start with numbers.


In [7]:
my_variable = 'some text'
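For instance, a name that begins with a number is a syntax error; Python will refuse to run the line (a hypothetical example):

In [ ]:
# Not a valid variable name, because it starts with a number
1st_variable = 'some text'  # SyntaxError: invalid syntax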

Operators

We can perform mathematical calculations in Python using the basic operators +, -, /, *, ** (exponentiation), and % (modulo).


In [8]:
2 + 2


Out[8]:
4

In [9]:
6 * 7


Out[9]:
42

In [12]:
5 ** 2


Out[12]:
25

This is "13 modulo 5" and the result is the remainder, 3.


In [22]:
13 % 5


Out[22]:
3

We can also use comparison and logical operators. These operators return Boolean values; that is, they determine or describe whether something is True or False.


In [23]:
3 > 4


Out[23]:
False

In [26]:
5 == 5


Out[26]:
True

The and operator asks: are both values true?


In [24]:
True and True


Out[24]:
True

The or operator returns True if at least one of the two values is true:


In [25]:
True or False


Out[25]:
True

True and False, with the first letter capitalized, are special values in Python that mean just what they say.

Sequences

Much of Python's expressive power and flexibility comes from the way it handles sequences. A sequence could be a sequence of characters in a text string or a sequence of numbers.

A list is Python's built-in data structure for handling general, ordered sequences. Each element can be accessed by its index. Note that, in Python, we start counting from zero, not from one.


In [27]:
numbers = [1, 2, 3]
numbers[0]


Out[27]:
1

The square brackets are used to slice a sequence by one or more indices. Above, we have asked for the first (the zeroth) element of the numbers sequence.
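As an aside not covered in the original lesson, indices can also be negative, in which case Python counts from the end of the sequence:

In [ ]:
# Negative indices count backwards from the end; -1 is the last element
numbers[-1]  # -> 3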

A for loop is a useful way of accessing the elements of a sequence one at a time:


In [28]:
for number in numbers:
    print(number)


1
2
3

Indentation is very important in Python. Note that the second line in the above example is indented. This is Python's way of marking a block of code. It's standard to indent by 4 spaces.

To add elements to the end of a list, we can use the append() method:


In [29]:
numbers.append(4)
print(numbers)


[1, 2, 3, 4]

Note that there is no output associated with the append() method; the numbers sequence is modified in place so we don't need to assign the result to a variable.
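To see that append() really returns nothing, here is a small sketch (not part of the original lesson): if we assign its result to a variable, we get the special value None.

In [ ]:
# append() modifies the list in place and returns None
result = numbers.append(5)
print(result)   # -> None
print(numbers)  # -> [1, 2, 3, 4, 5]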

Methods are a way to interact with an object in Python. We can invoke a method using the dot, followed by the method name and a list of arguments in parentheses. To find out what methods are available for an object, we can use the built-in help() function.


In [30]:
help(numbers)


Help on list object:

class list(object)
 |  list() -> new empty list
 |  list(iterable) -> new list initialized from iterable's items
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __delitem__(self, key, /)
 |      Delete self[key].
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __iadd__(self, value, /)
 |      Implement self+=value.
 |  
 |  __imul__(self, value, /)
 |      Implement self*=value.
 |  
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __iter__(self, /)
 |      Implement iter(self).
 |  
 |  __le__(self, value, /)
 |      Return self<=value.
 |  
 |  __len__(self, /)
 |      Return len(self).
 |  
 |  __lt__(self, value, /)
 |      Return self<value.
 |  
 |  __mul__(self, value, /)
 |      Return self*value.
 |  
 |  __ne__(self, value, /)
 |      Return self!=value.
 |  
 |  __new__(*args, **kwargs) from builtins.type
 |      Create and return a new object.  See help(type) for accurate signature.
 |  
 |  __repr__(self, /)
 |      Return repr(self).
 |  
 |  __reversed__(...)
 |      L.__reversed__() -- return a reverse iterator over the list
 |  
 |  __rmul__(self, value, /)
 |      Return self*value.
 |  
 |  __setitem__(self, key, value, /)
 |      Set self[key] to value.
 |  
 |  __sizeof__(...)
 |      L.__sizeof__() -- size of L in memory, in bytes
 |  
 |  append(...)
 |      L.append(object) -> None -- append object to end
 |  
 |  clear(...)
 |      L.clear() -> None -- remove all items from L
 |  
 |  copy(...)
 |      L.copy() -> list -- a shallow copy of L
 |  
 |  count(...)
 |      L.count(value) -> integer -- return number of occurrences of value
 |  
 |  extend(...)
 |      L.extend(iterable) -> None -- extend list by appending elements from the iterable
 |  
 |  index(...)
 |      L.index(value, [start, [stop]]) -> integer -- return first index of value.
 |      Raises ValueError if the value is not present.
 |  
 |  insert(...)
 |      L.insert(index, object) -- insert object before index
 |  
 |  pop(...)
 |      L.pop([index]) -> item -- remove and return item at index (default last).
 |      Raises IndexError if list is empty or index is out of range.
 |  
 |  remove(...)
 |      L.remove(value) -> None -- remove first occurrence of value.
 |      Raises ValueError if the value is not present.
 |  
 |  reverse(...)
 |      L.reverse() -- reverse *IN PLACE*
 |  
 |  sort(...)
 |      L.sort(key=None, reverse=False) -> None -- stable sort *IN PLACE*
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __hash__ = None

We can also access a list of methods using dir().


In [31]:
dir(numbers)


Out[31]:
['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']

Tuples

A tuple is similar to a list in that it's an ordered sequence of elements. However, tuples cannot be changed once created; they are "immutable." Tuples are created by placing comma-separated values inside parentheses.


In [33]:
a_tuple = (1,2,3)
another_tuple = ('rabbit', 'mongoose', 'platypus')
still_a_tuple = (1,)

# Note that lists use square brackets
a_list = [1,2,3]

Challenge: Tuples and Lists


In [1]:
a_tuple = ('a', 'b', 'c', 'd')
a_list = ['a', 'b', 'c', 'd']
  1. What happens when you try to re-assign the first value in each of the following examples?
    a_tuple[0] = 5
    a_list[0] = 5
    
  2. Type type(a_tuple) into Python; what is the object's type?

Dictionaries

A dictionary is a container that holds key-value pairs. It is a data structure that is also referred to, in other programming languages, as an associative array.


In [6]:
mapping = {'tbsp': 'tablespoon', 'tsp': 'teaspoon'}
mapping['tbsp']


Out[6]:
'tablespoon'

Dictionaries are similar to lists in that we can store multiple things inside them. While we index lists with numeric indices, we index dictionaries with keys. A key is a unique identifier for a value in the dictionary.

Keys must be a particular kind of data type: they must be hashable. Strings, numbers, and tuples are all acceptable; mutable types like lists are not, as the example below shows.


In [7]:
code_book = {1: 'Valid', 2: 'Invalid'}
code_book[1]


Out[7]:
'Valid'

In [9]:
{[1,2]: 'Valid/Invalid'}


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-428f6b27df1e> in <module>()
----> 1 {[1,2]: 'Valid/Invalid'}

TypeError: unhashable type: 'list'

Another way we can create dictionaries is using the dict() function. Here, we give the dict() function a list of key-value pairs, represented as tuples. The first element in the tuple is the key, the second element is the value.


In [37]:
dict([(1, 'Valid'), (2, 'Invalid')])


Out[37]:
{1: 'Valid', 2: 'Invalid'}

Using for loops with dictionaries is a little more complicated.


In [10]:
for key, value in code_book.items():
    print(key, '->', value)


1 -> Valid
2 -> Invalid

Alternatively:


In [11]:
for key in code_book.keys():
    print(key, '->', code_book[key])


1 -> Valid
2 -> Invalid

Challenge: Reassignment in a Dictionary

How can we change a value in our dictionary? Try to reassign one of the values in the code_book dictionary.

You may have noticed that a dictionary sometimes returns its items in a different order than you expected. In the version of Python used here, dictionaries are intrinsically unordered; they do not retain the order of their items. (As of Python 3.7, dictionaries do preserve insertion order.)

Functions

One of the chief reasons we program computers to do things for us is because computers are very good at tedious tasks (and humans are not). A block of Python code that does the same thing every time is best defined as a function. A function is a named series of Python statements, with or without input arguments, that we can easily call over and over again.

We've already seen the built-in print() and help() functions. Now let's see how to write our own.


In [15]:
def pow10(exponent):
    result = 10 ** exponent
    return result

pow10(2)


Out[15]:
100

Key things to note about this example:

  • We define a function using the def command followed by the name of the function and any arguments it takes, written just like we would call the function (recall the print() and help() functions);
  • The body of the function is indented;
  • We use the return command to indicate what the result of the function, or its return value, should be. If we don't return anything, the function's body is still executed, but we don't necessarily see any output when the function is called (see the example below).
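For instance (a minimal sketch, not from the original lesson), a function that only prints something returns the special value None:

In [ ]:
# This function has no return statement, so its return value is None
def greet(name):
    print('Hello,', name)

result = greet('world')  # prints: Hello, world
print(result)            # -> None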

How can we make a more general version of the function pow10()?


In [4]:
def pow(exponent, base=10):
    result = base ** exponent
    return result

pow(2, 10)


Out[4]:
100

Here, the function pow() can be used to calculate powers of any base, not just base 10. By default, it will calculate powers of 10, because the argument base has a default argument of 10.


In [5]:
pow(2, base = 10)


Out[5]:
100

In [6]:
pow(base = 10, exponent = 2)


Out[6]:
100

pow() is actually a function already built into Python, so we didn't need to write this function, but it demonstrates how you can create more flexible functions using default arguments.

Challenge: Writing Your First Function

To convert from temperatures in Fahrenheit to Celsius, we first subtract $32$ and then multiply by $5/9$. Write a function that converts temperatures from Fahrenheit to Celsius.

Lambda Functions

Another type of function in Python is the lambda function. This is a special, one-line function that is generally used for transforming inputs.


In [66]:
to_lowercase = lambda s: s.lower()
to_lowercase('Roger')


Out[66]:
'roger'

In [82]:
names = ['Roger', 'Helen', 'Abdul', 'Lakshmi']
list(map(to_lowercase, names))


Out[82]:
['roger', 'helen', 'abdul', 'lakshmi']

Lambda functions are important for reasons we'll see later.

Data Analysis Workflows

In general, when we have a data analysis task, there are specific steps that we undertake, in order:

  1. Bringing data together in a common format that is both human-readable and machine-readable (e.g., through spreadsheets, before exporting as a CSV).
  2. Cleaning data, e.g., merging text labels that are slightly different (e.g., "Tree frogs," "tree frogs," and "treefrogs"), splitting and combining fields as needed (e.g., in OpenRefine or in Python).
  3. Archiving data, keeping the raw data separate from any cleaned data, for further use and analysis (e.g., using a relational database and SQL).
  4. Deriving a subset of cleaned data for use in a specific analysis, such as a certain date range, certain field site, or certain collection (e.g., in Python or SQL).
  5. Performing analysis to answer a question or make a decision, such as grouping and summarizing data, calculating a statistic, or generating a plot (e.g., in Python or SQL).

We'll now see how Python plays a role in this workflow, starting from having "clean" or nearly clean data to begin analyzing.

Managing Data in Python

We've seen how we can use spreadsheets to effectively organize data. You probably know how to do some basic analysis on tabular data using spreadsheet software programs like Microsoft Excel. Why, then, should we learn to use Python for data analysis?

  • Because computers are good at automating tedious tasks (and humans are not), we can use Python to instruct a computer to repeat a task over and over again. This saves us frustration and ensures that the task is performed consistently, without human error.
  • More specifically, once we have written the code to do some task with data the first time, we can apply it over and over again to subsequent datasets.
  • Ultimately, we can save the code and share it with other people if they want to perform the same analysis or understand what we did. The code is then a record of what we did to get our results. This is a key component of reproducibility in computer-aided science.

For this lesson to run smoothly, let's make sure everyone is in the same directory. In the Jupyter Notebook file tree, navigate to your Desktop. We'll create a new Notebook here to use for the rest of the lesson.

About the Data

For this lesson, we will be using the Portal Project Teaching Database, a subset of data from Ernest et al.'s (2009) long-term monitoring and experimental manipulation of a Chihuahuan desert ecosystem near Portal, Arizona, U.S.A. We are studying the species and weight of animals caught in plots in our study area. The dataset is stored as a comma-separated values (CSV) file: each row holds information for a single animal, and the columns represent:

Column           Description
record_id        Unique ID for the observation
month            Month of observation
day              Day of observation
year             Year of observation
plot_id          ID of a particular plot
species_id       2-letter code identifying the species
sex              Sex of animal ("M", "F")
hindfoot_length  Length of the hindfoot in millimeters
weight           Weight of the animal in grams

About Libraries

A library in Python contains a set of tools (called functions) that perform tasks on our data. Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench for use in a project. Once a library is set up, it can be used to perform many tasks. One of the libraries we'll be using in this lesson is the Python Data Analysis Library, or pandas. pandas adds a number of things to base Python, including more sophisticated data structures and tools to connect to other libraries, like matplotlib, for producing data visualizations.

Python doesn't automatically load all of the libraries that we have installed; we have to tell Python to import each library that we want to use in a particular session. To import a library, we use the syntax: import libraryName. Because we will refer to the library by this name every time we call one of its functions, we have the option of importing it under a shorter name. We'll do that here with Pandas.


In [2]:
import pandas as pd

Each time we call a function that's in the Pandas library, we use the syntax libraryName.functionName. This ensures that Python can find the tool or function we're asking for. Most Pandas users abbreviate the library name to pd and because this is such a popular convention, you can often find help on the internet by searching for a particular function in the pd namespace. For instance, I'll look up support for the pd.read_csv() function on the Google search engine.

Reading CSV Data in Pandas

Tabular data, like those stored in CSV, tab-delimited, or fixed-width formats, can be read into Python using Pandas and stored in a Pandas DataFrame. If you've used the R programming language before, you'll be familiar with DataFrames.

Data Frames

A DataFrame is a 2-dimensional, tabular data structure that can store different types of data across multiple columns. It is structurally identical to a spreadsheet in this way or, as we'll see, a table in a SQL database. We can read in our survey data as a DataFrame using the pd.read_csv() function.


In [ ]:
pd.read_csv('ecology-surveys.csv')

When we run this command, Pandas finds the CSV file, opens it, and reads in the contents, line-by-line, as rows in a new DataFrame, which it then prints out. We can see that:

  • There were 35,549 rows parsed.
  • Each has 9 columns.

The first column is the index of the DataFrame. The index is used to identify the position of the data, but it is not an actual column of the DataFrame.

We can see from this output that Pandas read the file properly, but it hasn't saved the DataFrame to memory. We need to assign the DataFrame to a variable.


In [115]:
surveys = pd.read_csv('ecology-surveys.csv')

Note that when we do variable assignment, Python does not display anything on the screen. The output of the pd.read_csv() function, our table, is not displayed on the screen; instead, it is "captured" in the variable named surveys.


In [5]:
type(surveys)


Out[5]:
pandas.core.frame.DataFrame

Manipulating Data in Python

One of the first things we might do after importing any kind of data in any environment is to make sure that our data (the individual fields or columns) are represented in the right way, by the right data types.


In [6]:
surveys.dtypes


Out[6]:
record_id            int64
month                int64
day                  int64
year                 int64
plot_id              int64
species_id          object
sex                 object
hindfoot_length    float64
weight             float64
dtype: object

Here, int64 refers to integer type; it cannot store decimal numbers. float64 stores decimal numbers with 64-bit precision. object refers to character strings, or text.
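If a column's data type isn't what we need, we can convert it. As an aside, the standard Pandas method for this is astype(); a quick sketch:

In [ ]:
# Convert the integer plot_id column to 64-bit floats;
# this returns a new Series and leaves the original column unchanged
surveys['plot_id'].astype('float64').head()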

Challenge: Viewing DataFrames in Python

Try executing each code sample below and see what is returned.

  • surveys.columns
  • surveys.head()
  • surveys.head(15)
  • surveys.tail()
  • surveys.shape

Take note of the output of surveys.shape; what format does it return?

Finally, what is the difference between the code samples that end in parentheses and those that do not?

Each of the code samples above has us calling some attribute or method on the surveys DataFrame.

  • Methods are functions that belong to an object in Python, like a DataFrame. Just like the functions we saw earlier, methods take zero or more arguments that go inside the parentheses, and even if we have no arguments to provide, we still have to use the parentheses to get the method to do its work.
  • Attributes are a more general concept; an attribute is anything that belongs to an object in Python, including methods. Attributes that are not methods, however, don't need to be called with parentheses.

If we think of a person, an attribute is something that belongs to that person or describes that person, like hair color or number of siblings. A method is something that person does, like bake a pie or go for a run.

Calculating Statistics in a Pandas DataFrame

Let's perform some summary statistics to further verify that the data we imported look okay. First, let's remind ourselves what is in our data and what values we might use to group the data by in calculating summary statistics.


In [23]:
# Look at the column names
surveys.columns.values


Out[23]:
array(['record_id', 'month', 'day', 'year', 'plot_id', 'species_id', 'sex',
       'hindfoot_length', 'weight'], dtype=object)

Let's get a list of all the species. The pd.unique() function tells us all of the unique values in the species_id column.


In [11]:
pd.unique(surveys['species_id'])


Out[11]:
array(['NL', 'DM', 'PF', 'PE', 'DS', 'PP', 'SH', 'OT', 'DO', 'OX', 'SS',
       'OL', 'RM', nan, 'SA', 'PM', 'AH', 'DX', 'AB', 'CB', 'CM', 'CQ',
       'RF', 'PC', 'PG', 'PH', 'PU', 'CV', 'UR', 'UP', 'ZL', 'UL', 'CS',
       'SC', 'BA', 'SF', 'RO', 'AS', 'SO', 'PI', 'ST', 'CU', 'SU', 'RX',
       'PB', 'PL', 'PX', 'CT', 'US'], dtype=object)

Note the bracket notation we've used here. This is how we access a column by name in a Pandas DataFrame. It's just like how we accessed the value in a Dictionary by its key name.

Challenge: Unique Levels for a Column

  1. Create a list of unique plot IDs found in the survey data; assign the list of unique IDs to a variable called plot_names. How many unique plots are there in the data? How many unique species are in the data?
  2. What is the difference between len(plot_names) and plot_names.shape?

Groups in Pandas

We often want to calculate summary statistics grouped by subsets or attributes within fields of our data. For example, we might want to calculate the average weight of all individuals in each plot.

We can calculate basic statistics for all records in a single column using the syntax below.


In [24]:
surveys['weight'].describe()


Out[24]:
count    32283.000000
mean        42.672428
std         36.631259
min          4.000000
25%         20.000000
50%         37.000000
75%         48.000000
max        280.000000
Name: weight, dtype: float64

Let's break that last command down.

  1. First, we asked for the weight column in the surveys table.
  2. Second, we called the describe() method on that column.

We can also extract a specific statistic.


In [26]:
surveys['weight'].mean()


Out[26]:
42.672428212991356

In [27]:
surveys['weight'].count()


Out[27]:
32283

If we want to summarize by one or more variables, for example, sex, we can use Pandas' groupby() method. Once we've created a grouped DataFrame, we can quickly calculate summary statistics by a group of our choice.


In [29]:
# Group data by sex
grouped_by_sex = surveys.groupby('sex')

# Summary statistics for all numeric columns, grouped by sex
grouped_by_sex.describe()


Out[29]:
                     day  hindfoot_length         month       plot_id     record_id        weight          year
sex
F   count   15690.000000     14894.000000  15690.000000  15690.000000  15690.000000  15303.000000  15690.000000
    mean       16.007138        28.836780      6.583047     11.440854  18036.412046     42.170555   1990.644997
    std         8.271144         9.463789      3.367350      6.870684  10423.089000     36.847958      7.598725
    min         1.000000         7.000000      1.000000      1.000000      3.000000      4.000000   1977.000000
    25%         9.000000        21.000000      4.000000      5.000000   8917.500000     20.000000   1984.000000
    50%        16.000000        27.000000      7.000000     12.000000  18075.500000     34.000000   1990.000000
    75%        23.000000        36.000000     10.000000     17.000000  27250.000000     46.000000   1997.000000
    max        31.000000        64.000000     12.000000     24.000000  35547.000000    274.000000   2002.000000
M   count   17348.000000     16476.000000  17348.000000  17348.000000  17348.000000  16879.000000  17348.000000
    mean       16.184286        29.709578      6.392668     11.098282  17754.835601     42.995379   1990.480401
    std         8.199274         9.629246      3.420806      6.728713  10132.203323     36.184981      7.403655
    min         1.000000         2.000000      1.000000      1.000000      1.000000      4.000000   1977.000000
    25%         9.000000        21.000000      3.000000      5.000000   8969.750000     20.000000   1984.000000
    50%        16.000000        34.000000      6.000000     11.000000  17727.500000     39.000000   1990.000000
    75%        23.000000        36.000000      9.000000     17.000000  26454.250000     49.000000   1997.000000
    max        31.000000        58.000000     12.000000     24.000000  35548.000000    280.000000   2002.000000

In [30]:
grouped_by_sex.mean()


Out[30]:
        record_id     month        day         year    plot_id  hindfoot_length     weight
sex
F    18036.412046  6.583047  16.007138  1990.644997  11.440854        28.836780  42.170555
M    17754.835601  6.392668  16.184286  1990.480401  11.098282        29.709578  42.995379

Quickly Creating Summary Counts in Pandas

Let's count the number of samples for each species. We can do this in a few ways, but we'll use groupby() combined with a count() method.


In [42]:
species_counts = surveys.groupby('species_id')['record_id'].count()
species_counts


Out[42]:
species_id
AB      303
AH      437
AS        2
BA       46
CB       50
CM       13
CQ       16
CS        1
CT        1
CU        1
CV        1
DM    10596
DO     3027
DS     2504
DX       40
NL     1252
OL     1006
OT     2249
OX       12
PB     2891
PC       39
PE     1299
PF     1597
PG        8
PH       32
PI        9
PL       36
PM      899
PP     3123
PU        5
PX        6
RF       75
RM     2609
RO        8
RX        2
SA       75
SC        1
SF       43
SH      147
SO       43
SS      248
ST        1
SU        5
UL        4
UP        8
UR       10
US        4
ZL        2
Name: record_id, dtype: int64

Again, let's break this down:

  1. We group the rows of the surveys DataFrame by the unique values in one of its columns: species_id.
  2. Second, we ask for the column record_id in the output from the last part.
  3. Third, we call the count() method on this column. Because the data are grouped by species_id, we get a count within each unique value of species_id, not a count of all the rows in the table.

Challenge: Understanding Grouped DataFrames

  1. In that last command, we asked for the record_id column. Try asking for a different column in the square brackets. Do you get a different result? Why or why not?
  2. How can we get a count of just the records with species_id set to DO? Hint: You can build on the last command we executed; think about Dictionaries and key-value pairs.

Basic Math on DataFrame Columns

We can perform quick mathematical transformations on the values of a column in a straightforward way. For instance, we might normalize the measured hindfoot lengths by their mean.


In [ ]:
surveys['hindfoot_length'] / surveys['hindfoot_length'].mean()

Basic Plotting with Pandas

We can also visualize summary statistics using Pandas. First, we'll fire off an IPython "magic" command that will allow us to view plots inline in the Jupyter Notebook.


In [40]:
%matplotlib inline

In [52]:
# Create a quick bar chart
species_counts.plot(kind = 'bar')


Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1101662fd0>

This plot is kind of cramped and hard to read; we can make it slightly bigger by specifying a new width and height (in inches).


In [53]:
species_counts.plot(figsize=(9, 6), kind = 'bar')


Out[53]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f11013b27f0>

Challenge: Plotting

  1. Create a plot of the average weight in each species.
  2. Create a plot of total males and total females across the entire dataset.

Note: Some of the species have no weight measurements; they are entered as NaN, which stands for "not a number" and refers to missing values.

Multiple Grouping in Pandas

Let's make a stacked bar plot, with the total number of individuals on the Y axis and the stacked variable being sex. The plot should show total count by sex for each survey plot.

First, we'll calculate the total number of individuals in each unique pair of plot_id and sex.


In [81]:
plot_sex_count = surveys.groupby(['plot_id', 'sex'])['record_id'].count()
plot_sex_count


Out[81]:
plot_id  sex
1        F       848
         M      1095
2        F       970
         M      1144
3        F       893
         M       840
4        F       872
         M      1030
5        F       527
         M       586
6        F       737
         M       764
7        F       335
         M       307
8        F       840
         M       983
9        F       853
         M      1004
10       F       141
         M       142
11       F       822
         M      1029
12       F      1060
         M      1200
13       F       628
         M       783
14       F       709
         M      1070
15       F       481
         M       409
16       F       222
         M       273
17       F       892
         M      1053
18       F       755
         M       626
19       F       522
         M       581
20       F       650
         M       613
21       F       611
         M       451
22       F       671
         M       673
23       F       165
         M       207
24       F       486
         M       485
Name: record_id, dtype: int64

Next, we use the unstack() method on our grouped data to figure out the total count that each sex contributes to each plot. Here, unstack() simply rearranges the rows and columns into a more convenient format.


In [85]:
my_plot = plot_sex_count.unstack().plot(kind = 'bar', stacked = True)
my_plot.set_title('Total count by plot and sex')
my_plot.set_ylabel('Number of individuals')
my_plot.set_xlabel('Plot ID')


Out[85]:
<matplotlib.text.Text at 0x7f1100bb5d68>

Transforming Data

Up to this point, we have learned:

  • The basic data types in Python;
  • How to read a CSV file into Python using pandas;
  • How tabular data in pandas are represented using a DataFrame;
  • How to check the data type of each column in a DataFrame;
  • How to summarize numerical data in a DataFrame;
  • How to group data according to the unique values of a column;
  • How to create basic plots of grouped, summarized data.

Often, once we've pulled in data and done some basic data cleaning, we want to transform our data to make it more useful for our analysis. For instance, we might find that:

  • The numeric values in one or more columns are not in the right units, e.g., we want pounds instead of kilograms.
  • A text column has keys written one way and we want them written another way.
  • We want to combine two or more columns; or, we want to split one column into multiple columns.

Some of these objectives are perhaps best handled in OpenRefine, as we saw. However, we can also transform our data easily enough with Python and Pandas.
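For instance, combining several columns into one can be as simple as string concatenation (a sketch, not part of the original lesson):

In [ ]:
# Combine the year, month, and day columns into a single text value per row
(surveys.year.astype(str) + '-' +
 surveys.month.astype(str) + '-' +
 surveys.day.astype(str)).head()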

Converting Units

In our surveys data, the animal weights are in grams. What if our collaborator says she needs the weights in kilograms?


In [45]:
# Create a copy of our original data
surveys_converted = surveys.copy()

# Create a new column in this table
surveys_converted['weight_kg'] = surveys_converted.weight / 1000
surveys_converted.tail()


Out[45]:
       record_id  month  day  year  plot_id species_id  sex  hindfoot_length  weight  species_factor  sex_factor  species_id_codes  weight_kg
X
35544      35545     12   31  2002       15         AH  NaN              NaN     NaN               1          -1                 1        NaN
35545      35546     12   31  2002       15         AH  NaN              NaN     NaN               1          -1                 1        NaN
35546      35547     12   31  2002       10         RM    F             15.0    14.0              32           0                32      0.014
35547      35548     12   31  2002        7         DO    M             36.0    51.0              12           1                12      0.051
35548      35549     12   31  2002        5        NaN  NaN              NaN     NaN              -1          -1                -1        NaN

Note that when we refer to a column as an attribute, it is read-only: the assignment below runs without raising an error, but no weight_lbs column is actually created, as the output shows. To add a new column, we must use bracket notation, as we did with weight_kg above.


In [48]:
# Convert grams to pounds
surveys_converted.weight_lbs = surveys_converted.weight * 0.00220462
surveys_converted.tail()


Out[48]:
       record_id  month  day  year  plot_id species_id  sex  hindfoot_length  weight  species_factor  sex_factor  species_id_codes  weight_kg
X
35544      35545     12   31  2002       15         AH  NaN              NaN     NaN               1          -1                 1        NaN
35545      35546     12   31  2002       15         AH  NaN              NaN     NaN               1          -1                 1        NaN
35546      35547     12   31  2002       10         RM    F             15.0    14.0              32           0                32      0.014
35547      35548     12   31  2002        7         DO    M             36.0    51.0              12           1                12      0.051
35548      35549     12   31  2002        5        NaN  NaN              NaN     NaN              -1          -1                -1        NaN

Transforming Values

We can effect a more general transformation of a column's values. Let's say that, for whatever reason, we want the text strings in species_id to be lower-case, not upper-case.


In [ ]:
surveys.species_id.apply(lambda x: x.lower()).head()

What happened? The species_id column contains missing values, which Pandas fills in with NaN, a floating-point value. A float has no lower() method, so our function fails on those entries. We need to explicitly convert each value to a string first; with str(), the transformation works for every row (missing values simply become the string 'nan').


In [63]:
surveys.species_id.apply(lambda x: str(x).lower()).head()


Out[63]:
X
0    nl
1    nl
2    dm
3    dm
4    dm
Name: species_id, dtype: object

The use of the apply() function here is part of a general practice called function application. We can also apply a function to our entire DataFrame along one of its two axes. For instance, we can use this to count how many non-null values are in each of the columns.


In [70]:
surveys.weight.count()


Out[70]:
32283

In [73]:
surveys.apply(lambda column: column.count(), axis = 0)


Out[73]:
record_id           35549
month               35549
day                 35549
year                35549
plot_id             35549
species_id          34786
sex                 33038
hindfoot_length     31438
weight              32283
species_factor      35549
sex_factor          35549
species_id_codes    35549
dtype: int64

We can also apply functions across the columns.


In [79]:
# Returns True or False depending on whether weight is null
surveys.apply(lambda row: pd.notnull(row.weight), axis = 1).head()


Out[79]:
X
0    False
1    False
2    False
3    False
4    False
dtype: bool

Indexing and Slicing Python DataFrames

Now, we'll explore how to access different parts of our data, including querying subsets of the data based on certain criteria.

Indexing and Slicing in Python

Selecting Data Using Labels (Column Headings)

We use square brackets, [], to select a subset of a Python object. For example, we can select all of the data from a column named species_id:


In [ ]:
surveys['species_id']
surveys.species_id

We can pass a list of column names, too, as an index to select columns in a specified order.


In [ ]:
surveys[['species_id', 'plot_id']]
surveys[['plot_id', 'species_id']]

Extracting a Range of Data with Slicing

Recall that in Python, we start counting from zero instead of one. This means that the first element in an object is located at position zero.


In [94]:
grades = [88, 72, 93, 94]


In [95]:
grades[2]


Out[95]:
93


In [96]:
grades[1:3]


Out[96]:
[72, 93]

Challenge: Slicing

What do each of these lines of code return?

  1.  grades[0]
  2.  grades[len(grades)]
  3.  grades[4]

Why do (2) and (3) return errors?

Slicing Subsets of Rows in Python

Slicing using the [] operator selects a set of rows and/or columns from a DataFrame. To slice out a set of rows, we use the following syntax.


In [97]:
# Select the first three rows; rows 0,1,2 (but NOT 3)
surveys[0:3]


Out[97]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN

Other examples:


In [98]:
# First three rows, starting index optional
surveys[:3]


Out[98]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN

In [100]:
# Select the last element (last row)
surveys[-1:]


Out[100]:
record_id month day year plot_id species_id sex hindfoot_length weight
35548 35549 12 31 2002 5 NaN NaN NaN NaN

We can also reassign values within a subset of our DataFrame. Before we do that, let's make a copy of our DataFrame so as not to modify our original data.


In [119]:
surveys_copy = surveys
surveys_copy[0:3] = 0
surveys_copy.head()


Out[119]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 0 0 0 0 0 0 0 0.0 0.0
1 0 0 0 0 0 0 0 0.0 0.0
2 0 0 0 0 0 0 0 0.0 0.0
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN

In [120]:
# Confirm that the original is unchanged
surveys.head()


Out[120]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 0 0 0 0 0 0 0 0.0 0.0
1 0 0 0 0 0 0 0 0.0 0.0
2 0 0 0 0 0 0 0 0.0 0.0
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN

Oops: Referencing versus Copying Objects in Python

We thought that we were creating a copy of the surveys DataFrame. However, when we assign a variable such as y = x, this doesn't create a copy of x; rather, it creates a new variable y that refers to the same object that x refers to. This means there is only one object (the DataFrame), and both x and y refer to it. Thus, when we modify surveys_copy, the object that is modified is the same object surveys points to.


In [122]:
surveys = pd.read_csv('ecology-surveys.csv')
surveys_copy = surveys.copy()
surveys_copy[0:3] = 0
surveys.head()


Out[122]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN

Slicing Subsets of Rows and Columns with Pandas

We can select specific ranges of our data in both the row and column directions using either label-based or integer-based indexing.

To select a subset of rows and columns, we can use the iloc() method (for "integer location"). For example, we can select the month, day, and year columns (column positions 1, 2, and 3) in the first three rows, like this:


In [124]:
surveys.iloc[0:3,1:4]


Out[124]:
month day year
0 7 16 1977
1 7 16 1977
2 7 16 1977

Note that the order of terms in the square brackets is rows first, then columns.


In [125]:
# Get the same columns but all the rows
surveys.iloc[:,1:4].head()


Out[125]:
month day year
0 7 16 1977
1 7 16 1977
2 7 16 1977
3 7 16 1977
4 7 16 1977

Here, the colon character in place of the row index indicates we want "all" elements (rows).

We can select multiple, discontiguous rows by passing a list of the row indices.


In [129]:
# Select all columns for rows of index values 0, 10
surveys.iloc[[0, 10], :]


Out[129]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
10 11 7 16 1977 5 DS F 53.0 NaN

If we want to select columns by column names, we need to use the loc() method in place of iloc().


In [132]:
surveys.loc[0, ['species_id', 'plot_id', 'weight']]


Out[132]:
species_id     NL
plot_id         2
weight        NaN
Name: 0, dtype: object

When we use loc(), we can also index using integers, but with one essential difference: loc() indexes on row or column labels, not the position along the table's rows or columns.

For example, compare the outputs:


In [143]:
surveys.loc[[0, 10, 50000], :]


Out[143]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 1.0 7.0 16.0 1977.0 2.0 NL M 32.0 NaN
10 11.0 7.0 16.0 1977.0 5.0 DS F 53.0 NaN
50000 NaN NaN NaN NaN NaN NaN NaN NaN NaN

In [144]:
surveys.iloc[[0, 10, 50000], :]


---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-144-46301c3f69c4> in <module>()
----> 1 surveys.iloc[[0, 10, 50000], :]

/usr/local/lib/python3.5/dist-packages/pandas/core/indexing.py in __getitem__(self, key)
   1308 
   1309         if type(key) is tuple:
-> 1310             return self._getitem_tuple(key)
   1311         else:
   1312             return self._getitem_axis(key, axis=0)

/usr/local/lib/python3.5/dist-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
   1558     def _getitem_tuple(self, tup):
   1559 
-> 1560         self._has_valid_tuple(tup)
   1561         try:
   1562             return self._getitem_lowerdim(tup)

/usr/local/lib/python3.5/dist-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
    149             if i >= self.obj.ndim:
    150                 raise IndexingError('Too many indexers')
--> 151             if not self._has_valid_type(k, i):
    152                 raise ValueError("Location based indexing can only have [%s] "
    153                                  "types" % self._valid_types)

/usr/local/lib/python3.5/dist-packages/pandas/core/indexing.py in _has_valid_type(self, key, axis)
   1528             return self._is_valid_integer(key, axis)
   1529         elif is_list_like_indexer(key):
-> 1530             return self._is_valid_list_like(key, axis)
   1531         return False
   1532 

/usr/local/lib/python3.5/dist-packages/pandas/core/indexing.py in _is_valid_list_like(self, key, axis)
   1552         l = len(ax)
   1553         if len(arr) and (arr.max() >= l or arr.min() < -l):
-> 1554             raise IndexError("positional indexers are out-of-bounds")
   1555 
   1556         return True

IndexError: positional indexers are out-of-bounds

Challenge: Slicing Rows and Columns

What happens when you type:

  1.  surveys[0:3]
  2.  surveys[:5]
  3.  surveys[-1:]

To review...

To index by rows in Pandas:


In [168]:
surveys[0:3]
surveys.iloc[0:3]
surveys.iloc[0:3,:]


Out[168]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN

To index by columns (and rows) in Pandas:


In [174]:
surveys[['month', 'day', 'year']]
surveys.loc[0:3, ['month', 'day', 'year']]
surveys.iloc[0:3, 1:4]


Out[174]:
month day year
0 7 16 1977
1 7 16 1977
2 7 16 1977

Subsetting Data Using Criteria

Most importantly, we can also select a subset of our data based on certain criteria. For example, we can select all of the rows of our data that match observations in the year 2002.


In [176]:
surveys[surveys.year == 2002].head()


Out[176]:
record_id month day year plot_id species_id sex hindfoot_length weight
33320 33321 1 12 2002 1 DM M 38.0 44.0
33321 33322 1 12 2002 1 DO M 37.0 58.0
33322 33323 1 12 2002 1 PB M 28.0 45.0
33323 33324 1 12 2002 1 AB NaN NaN NaN
33324 33325 1 12 2002 1 DO M 35.0 29.0

Or, we can select all rows that do not match observations from 2002.


In [177]:
surveys[surveys.year != 2002].head()


Out[177]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN

We can combine criteria using logical operators.


In [182]:
surveys[(surveys.year >= 1980) & (surveys.year <= 1985)].count()


Out[182]:
record_id          8957
month              8957
day                8957
year               8957
plot_id            8957
species_id         8775
sex                8366
hindfoot_length    7348
weight             8278
dtype: int64
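The & operator above means "and." Similarly, the | operator means "or"; a quick sketch along the same lines:

In [ ]:
# Observations from before 1980 or after 2000
surveys[(surveys.year < 1980) | (surveys.year > 2000)].head()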

Challenge: Filtering Data

  1. Filter the surveys table to observations of female members of the DO species. How many are there? What is their average weight?
  2. Look at the help documentation for the isin() function (Hint: ?surveys.year.isin). Use this function to filter the surveys DataFrame to those rows that match the three species: OL, OT, OX.

Using Masks

Earlier, we saw that our data contain some missing values, filled in with NaN or "not a number." We'll next learn how to handle NaNs when they appear in our data using masks. A mask can be useful to locate where a particular subset of values exists (or doesn't exist).

To start, we can use the Pandas function isnull() to find places in our surveys DataFrame where there are null or NaN values.


In [190]:
pd.isnull(surveys)


Out[190]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 False False False False False False False False True
1 False False False False False False False False True
2 False False False False False False False False True
3 False False False False False False False False True
4 False False False False False False False False True
5 False False False False False False False False True
6 False False False False False False False True True
7 False False False False False False False False True
8 False False False False False False False False True
9 False False False False False False False False True
10 False False False False False False False False True
11 False False False False False False False False True
12 False False False False False False False False True
13 False False False False False False True True True
14 False False False False False False False False True
15 False False False False False False False False True
16 False False False False False False False False True
17 False False False False False False False False True
18 False False False False False False True True True
19 False False False False False False False False True
20 False False False False False False False False True
21 False False False False False False False False True
22 False False False False False False False False True
23 False False False False False False False False True
24 False False False False False False False False True
25 False False False False False False False False True
26 False False False False False False False False True
27 False False False False False False False False True
28 False False False False False False False True True
29 False False False False False False False False True
... ... ... ... ... ... ... ... ... ...
35519 False False False False False False True False False
35520 False False False False False False False False False
35521 False False False False False False False False False
35522 False False False False False False False False False
35523 False False False False False False False False False
35524 False False False False False False False False False
35525 False False False False False False False False False
35526 False False False False False False False False False
35527 False False False False False False True True True
35528 False False False False False False False False False
35529 False False False False False False False False True
35530 False False False False False False False False True
35531 False False False False False False False False False
35532 False False False False False False False False False
35533 False False False False False False False False False
35534 False False False False False False False False False
35535 False False False False False False False False False
35536 False False False False False False False False False
35537 False False False False False False False False False
35538 False False False False False False False False False
35539 False False False False False False False False False
35540 False False False False False False False False False
35541 False False False False False False False False False
35542 False False False False False False False False False
35543 False False False False False False True True True
35544 False False False False False False True True True
35545 False False False False False False True True True
35546 False False False False False False False False False
35547 False False False False False False False False False
35548 False False False False False True True True True

35549 rows × 9 columns

We can use the any() method to ask if there are any null values in a given row. Here, the axis argument specifies the axis along which to look for True values (values that are null); with axis=1, Pandas looks across the columns of each row, returning one value per row.


In [202]:
pd.isnull(surveys).any(axis=1).head()


Out[202]:
0    True
1    True
2    True
3    True
4    True
dtype: bool

In [201]:
pd.isnull(surveys).any(axis=0)


Out[201]:
record_id          False
month              False
day                False
year               False
plot_id            False
species_id          True
sex                 True
hindfoot_length     True
weight              True
dtype: bool

To select the rows where there are null values, we can use the mask as an index to subset our data, as follows.


In [205]:
surveys[pd.isnull(surveys).any(axis=1)].head()


Out[205]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN

How can we select rows where there are NO null values? We can invert the mask using the tilde operator, ~.


In [206]:
surveys[~pd.isnull(surveys).any(axis=1)].head()


Out[206]:
record_id month day year plot_id species_id sex hindfoot_length weight
62 63 8 19 1977 3 DM M 35.0 40.0
63 64 8 19 1977 7 DM M 37.0 48.0
64 65 8 19 1977 4 DM F 34.0 29.0
65 66 8 19 1977 4 DM F 35.0 46.0
66 67 8 19 1977 7 DM M 35.0 36.0

Dealing with Missing Data

It is sometimes hard to know why data values are missing. Was it because of a data entry error? Was there some condition in the field that prevented observation of a particular variable?

Moreover, we need a way of representing missing data that is clear and unambiguous. If we encode missing values as zeroes, for instance, are we sure that zero won't be interpreted as a real value?

For text data, an empty string usually suffices for blank values. However, numeric data fields can't hold an empty string; they have to either hold some number or a special value called NaN, which stands for "not a number."


In [207]:
len(surveys[pd.isnull(surveys.weight)])


Out[207]:
3266

In [210]:
# The number of null weights equals the total number of rows minus the number of rows with a positive weight
len(surveys.weight) - len(surveys[surveys.weight > 0])


Out[210]:
3266

We can replace all the NaN values in our weight column using the fillna() method. Here, we use data management best practices, copying our original data so that we don't modify it.


In [213]:
df = surveys.copy()
df.weight = df.weight.fillna(0)

However, NaN and zero will yield different results in analysis.


In [214]:
df.weight.mean()


Out[214]:
38.751976145601844

In [215]:
surveys.weight.mean()


Out[215]:
42.672428212991356
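As an aside, fillna() can take any value. One common choice is the column's mean, which leaves the overall mean unchanged (a sketch, not part of the original lesson), though it still distorts other statistics, like the standard deviation:

In [ ]:
# Fill missing weights with the mean weight instead of zero
df2 = surveys.copy()
df2.weight = df2.weight.fillna(surveys.weight.mean())
df2.weight.mean()  # same as surveys.weight.mean()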

So, in general, it's better to leave NaN values as NaN and to subset the data to remove them, rather than re-coding them.

Indexing by Date and Time


In [84]:
#TODO
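This section was left unfinished in the source. As a sketch of what it might cover (assuming the surveys DataFrame loaded above), we can combine the year, month, and day columns into a DatetimeIndex and then select rows by date:

In [ ]:
# A sketch, not part of the original lesson: build datetimes from the
# year, month, and day columns and use them as the DataFrame's index
dates = pd.to_datetime(surveys[['year', 'month', 'day']])
surveys_by_date = surveys.set_index(dates)

# With a DatetimeIndex, loc can select rows by date strings;
# for example, all observations from July 1977
surveys_by_date.loc['1977-07'].head()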

Combining Multiple Datasets

At this point, we have learned:

  • About zero-based indexing in Python;
  • How to manipulate and extract data using column headings and index locations;
  • How to employ slicing to select subsets of a DataFrame;
  • How to re-assign values within a DataFrame;
  • How to create a copy of a DataFrame;
  • How to query subsets of a DataFrame that match certain criteria using logical operators;
  • How to interpret NaN values and best practices for missing data.

In many situations, the data we want to use come in multiple files. We often need to combine these files into a single DataFrame in order to analyze the data.

Let's load a second table into our session.


In [216]:
surveys = pd.read_csv('ecology-surveys.csv')
species = pd.read_csv('ecology-species.csv')

Concatenating DataFrames

We can use the pd.concat() function in Pandas to append either columns or rows from one DataFrame to another. Let's create subsets of our data to see how this works...


In [220]:
# First 10 rows of surveys
surveys_first10 = surveys.head(10)

# Last 10 rows of surveys
surveys_last10 = surveys.tail(10)
surveys_last10


Out[220]:
record_id month day year plot_id species_id sex hindfoot_length weight
35539 35540 12 31 2002 15 PB F 26.0 23.0
35540 35541 12 31 2002 15 PB F 24.0 31.0
35541 35542 12 31 2002 15 PB F 26.0 29.0
35542 35543 12 31 2002 15 PB F 27.0 34.0
35543 35544 12 31 2002 15 US NaN NaN NaN
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

In [221]:
# Reset the index so that we can concatenate properly
surveys_last10 = surveys_last10.reset_index(drop = True)
surveys_last10


Out[221]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 35540 12 31 2002 15 PB F 26.0 23.0
1 35541 12 31 2002 15 PB F 24.0 31.0
2 35542 12 31 2002 15 PB F 26.0 29.0
3 35543 12 31 2002 15 PB F 27.0 34.0
4 35544 12 31 2002 15 US NaN NaN NaN
5 35545 12 31 2002 15 AH NaN NaN NaN
6 35546 12 31 2002 15 AH NaN NaN NaN
7 35547 12 31 2002 10 RM F 15.0 14.0
8 35548 12 31 2002 7 DO M 36.0 51.0
9 35549 12 31 2002 5 NaN NaN NaN NaN

When we concatenate DataFrames, we need to specify the axis. With axis=0, we are stacking DataFrames on top of one another (row-wise). Pandas will automatically detect that the column names are the same and stack them accordingly.


In [228]:
vertical_stack = pd.concat([surveys_first10, surveys_last10], axis = 0)
vertical_stack


Out[228]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
5 6 7 16 1977 1 PF M 14.0 NaN
6 7 7 16 1977 2 PE F NaN NaN
7 8 7 16 1977 1 DM M 37.0 NaN
8 9 7 16 1977 1 DM F 34.0 NaN
9 10 7 16 1977 6 PF F 20.0 NaN
0 35540 12 31 2002 15 PB F 26.0 23.0
1 35541 12 31 2002 15 PB F 24.0 31.0
2 35542 12 31 2002 15 PB F 26.0 29.0
3 35543 12 31 2002 15 PB F 27.0 34.0
4 35544 12 31 2002 15 US NaN NaN NaN
5 35545 12 31 2002 15 AH NaN NaN NaN
6 35546 12 31 2002 15 AH NaN NaN NaN
7 35547 12 31 2002 10 RM F 15.0 14.0
8 35548 12 31 2002 7 DO M 36.0 51.0
9 35549 12 31 2002 5 NaN NaN NaN NaN

In [227]:
# To avoid duplicate index values; note that reset_index() returns a
# new DataFrame, so assign the result if you want to keep the change
vertical_stack.reset_index()


Out[227]:
index record_id month day year plot_id species_id sex hindfoot_length weight
0 0 1 7 16 1977 2 NL M 32.0 NaN
1 1 2 7 16 1977 3 NL M 33.0 NaN
2 2 3 7 16 1977 2 DM F 37.0 NaN
3 3 4 7 16 1977 7 DM M 36.0 NaN
4 4 5 7 16 1977 3 DM M 35.0 NaN
5 5 6 7 16 1977 1 PF M 14.0 NaN
6 6 7 7 16 1977 2 PE F NaN NaN
7 7 8 7 16 1977 1 DM M 37.0 NaN
8 8 9 7 16 1977 1 DM F 34.0 NaN
9 9 10 7 16 1977 6 PF F 20.0 NaN
10 0 35540 12 31 2002 15 PB F 26.0 23.0
11 1 35541 12 31 2002 15 PB F 24.0 31.0
12 2 35542 12 31 2002 15 PB F 26.0 29.0
13 3 35543 12 31 2002 15 PB F 27.0 34.0
14 4 35544 12 31 2002 15 US NaN NaN NaN
15 5 35545 12 31 2002 15 AH NaN NaN NaN
16 6 35546 12 31 2002 15 AH NaN NaN NaN
17 7 35547 12 31 2002 10 RM F 15.0 14.0
18 8 35548 12 31 2002 7 DO M 36.0 51.0
19 9 35549 12 31 2002 5 NaN NaN NaN NaN
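
pd.concat() can also place DataFrames side by side. As a sketch (this cell is not in the original lesson), with axis=1 the rows are aligned on the index and the columns of both DataFrames are kept:

# Stack the two subsets side by side (column-wise); rows are matched by
# index, which is 0 through 9 in both subsets after the reset above
horizontal_stack = pd.concat([surveys_first10, surveys_last10], axis = 1)
horizontal_stack.shape  # (10, 18): ten rows, the nine columns of each subset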

Writing Data to a CSV File

We can use the to_csv() method to export a DataFrame to a CSV file.


In [229]:
vertical_stack.to_csv('out.csv')
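
By default, to_csv() also writes the DataFrame's index as an unnamed first column. If we don't want the index in the output file, we can pass index=False (a common option, though not used in the original lesson):

# Write the CSV without the index column
vertical_stack.to_csv('out.csv', index = False)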

Joining DataFrames

Note: Use this webpage as a graphical reference for this segment.

When we concatenated our DataFrames, we simply added them together, stacking them either vertically or horizontally. Another way to combine DataFrames is to use columns in each dataset that contain common values (a common identifier). Combining DataFrames in this way is referred to as a join.

Joining data frames in this way is especially useful when one DataFrame serves as a lookup table, containing additional data that we want to include in the other. For example, the species table we imported is a lookup table. It contains additional information about each species_id.

To better understand how this works, let's create a subset of the surveys table.


In [242]:
surveys_sub = surveys.tail(20)

Identifying Join Keys

When we join two tables together, they must share a common key. Usually, it is a field that uniquely identifies the records in one of the two tables. In this example, it is the species_id field that the tables share.


In [243]:
surveys_sub.columns


Out[243]:
Index(['record_id', 'month', 'day', 'year', 'plot_id', 'species_id', 'sex',
       'hindfoot_length', 'weight'],
      dtype='object')

In [232]:
species.columns


Out[232]:
Index(['species_id', 'genus', 'species', 'taxa'], dtype='object')

Inner Joins

The most common type of join is called an inner join. An inner join combines two DataFrames based on a shared key and returns a new DataFrame that contains only those rows that have matching values in both of the original DataFrames.

The Pandas function for performing joins is called merge(); by default, it performs an inner join.


In [244]:
merged = pd.merge(left = surveys_sub, right = species, left_on = 'species_id', right_on = 'species_id')
merged.shape


Out[244]:
(19, 12)

In [245]:
merged


Out[245]:
record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa
0 35530 12 31 2002 13 OT F 20.0 NaN Onychomys torridus Rodent
1 35531 12 31 2002 13 PB F 27.0 NaN Chaetodipus baileyi Rodent
2 35538 12 31 2002 15 PB F 26.0 31.0 Chaetodipus baileyi Rodent
3 35540 12 31 2002 15 PB F 26.0 23.0 Chaetodipus baileyi Rodent
4 35541 12 31 2002 15 PB F 24.0 31.0 Chaetodipus baileyi Rodent
5 35542 12 31 2002 15 PB F 26.0 29.0 Chaetodipus baileyi Rodent
6 35543 12 31 2002 15 PB F 27.0 34.0 Chaetodipus baileyi Rodent
7 35532 12 31 2002 14 DM F 34.0 43.0 Dipodomys merriami Rodent
8 35533 12 31 2002 14 DM F 36.0 48.0 Dipodomys merriami Rodent
9 35534 12 31 2002 14 DM M 37.0 56.0 Dipodomys merriami Rodent
10 35535 12 31 2002 14 DM M 37.0 53.0 Dipodomys merriami Rodent
11 35536 12 31 2002 14 DM F 35.0 42.0 Dipodomys merriami Rodent
12 35537 12 31 2002 14 DM F 36.0 46.0 Dipodomys merriami Rodent
13 35539 12 31 2002 15 SF M 26.0 68.0 Sigmodon fulviventer Rodent
14 35544 12 31 2002 15 US NaN NaN NaN Sparrow sp. Bird
15 35545 12 31 2002 15 AH NaN NaN NaN Ammospermophilus harrisi Rodent
16 35546 12 31 2002 15 AH NaN NaN NaN Ammospermophilus harrisi Rodent
17 35547 12 31 2002 10 RM F 15.0 14.0 Reithrodontomys megalotis Rodent
18 35548 12 31 2002 7 DO M 36.0 51.0 Dipodomys ordii Rodent

The result of an inner join of surveys_sub and species is a new DataFrame that contains the combined set of columns from surveys_sub and species.

Note that the result of this merge is only 19 rows, whereas surveys_sub started with 20 rows.

Left Joins

If there are rows in surveys_sub with a species_id that is not found in the species lookup table, then they are not included in the merged output. What if we want to add information from species to surveys_sub without losing any of the rows of surveys_sub? A left join returns all of the rows from the left DataFrame, attaching values from the right DataFrame where the keys match and filling in NaN where they don't.


In [246]:
# Because there is only one shared column, we can omit the column names
merged_left = pd.merge(left = surveys_sub, right = species, how = 'left')
merged_left.shape


Out[246]:
(20, 12)

In [248]:
merged_left


Out[248]:
record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa
0 35530 12 31 2002 13 OT F 20.0 NaN Onychomys torridus Rodent
1 35531 12 31 2002 13 PB F 27.0 NaN Chaetodipus baileyi Rodent
2 35532 12 31 2002 14 DM F 34.0 43.0 Dipodomys merriami Rodent
3 35533 12 31 2002 14 DM F 36.0 48.0 Dipodomys merriami Rodent
4 35534 12 31 2002 14 DM M 37.0 56.0 Dipodomys merriami Rodent
5 35535 12 31 2002 14 DM M 37.0 53.0 Dipodomys merriami Rodent
6 35536 12 31 2002 14 DM F 35.0 42.0 Dipodomys merriami Rodent
7 35537 12 31 2002 14 DM F 36.0 46.0 Dipodomys merriami Rodent
8 35538 12 31 2002 15 PB F 26.0 31.0 Chaetodipus baileyi Rodent
9 35539 12 31 2002 15 SF M 26.0 68.0 Sigmodon fulviventer Rodent
10 35540 12 31 2002 15 PB F 26.0 23.0 Chaetodipus baileyi Rodent
11 35541 12 31 2002 15 PB F 24.0 31.0 Chaetodipus baileyi Rodent
12 35542 12 31 2002 15 PB F 26.0 29.0 Chaetodipus baileyi Rodent
13 35543 12 31 2002 15 PB F 27.0 34.0 Chaetodipus baileyi Rodent
14 35544 12 31 2002 15 US NaN NaN NaN Sparrow sp. Bird
15 35545 12 31 2002 15 AH NaN NaN NaN Ammospermophilus harrisi Rodent
16 35546 12 31 2002 15 AH NaN NaN NaN Ammospermophilus harrisi Rodent
17 35547 12 31 2002 10 RM F 15.0 14.0 Reithrodontomys megalotis Rodent
18 35548 12 31 2002 7 DO M 36.0 51.0 Dipodomys ordii Rodent
19 35549 12 31 2002 5 NaN NaN NaN NaN NaN NaN NaN

Other Joins

  • A right (outer) join is similar to a left join, except that it takes all rows from the right DataFrame and only the matching rows from the left.
  • A full (outer) join returns all rows from both DataFrames, matching rows on the key where possible and filling in NaN where there is no match. This type is rarely used. Both are available through the how argument of merge(), as sketched below.
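
A sketch (these cells are not run in the original lesson), using the same surveys_sub and species tables as above:

# All rows from species, only the matching rows from surveys_sub
merged_right = pd.merge(left = surveys_sub, right = species, how = 'right')

# All rows from both tables, with NaN where there is no match
merged_outer = pd.merge(left = surveys_sub, right = species, how = 'outer')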

Challenges: Joins

Create a new data frame by joining the contents of surveys and species. Then, calculate and plot the distribution of taxa by plot_id.

Automating Data Workflows with Python

So far, we've used Python and the Pandas library to explore and manipulate individual datasets by hand, much like we would do in a spreadsheet. The advantage of using a programming environment like Python, though, comes from its ability to automate data processing through the use of loops and functions.

Let's remind ourselves of how loops work.


In [34]:
animals = ['LIONS', 'TIGERS', 'BEARS']

for creature in animals:
    print(creature.lower())


lions
tigers
bears

A powerful Python idiom that we can use here is the list comprehension:


In [35]:
[creature.lower() for creature in animals]


Out[35]:
['lions', 'tigers', 'bears']

Similarly, a dictionary comprehension is another quick way to transform data in Python.


In [36]:
{creature: creature.lower() for creature in animals}


Out[36]:
{'BEARS': 'bears', 'LIONS': 'lions', 'TIGERS': 'tigers'}

Automating Data Processing with For Loops

As one example of how we can automate data processing with for loops, let's imagine that we want to split our large surveys table into separate files, one for each survey year. We'll start by making a new directory to hold the output files for each year.

We can use Python's os library to manipulate files and folders on our computer's file system.


In [252]:
import os
os.mkdir('yearly_files')

The command os.mkdir is equivalent to the mkdir command in the Unix shell, which you may have seen if you use a GNU/Linux or Mac computer. To confirm that we just created the yearly_files folder, we can ask Python to list the contents of our current directory.
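
One caveat (an aside, not in the original lesson): os.mkdir() raises an error if the folder already exists, so re-running the cell above will fail. os.makedirs() with exist_ok=True avoids this:

# Create the folder only if it doesn't already exist
os.makedirs('yearly_files', exist_ok = True)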


In [253]:
os.listdir('.')


Out[253]:
['setup.sh',
 'yearly_files',
 '.ipynb_checkpoints',
 'slicing-indexing.svg',
 'Python-Intro-Ecology.ipynb',
 'ecology-surveys.csv',
 'check_env.py',
 'slicing-slicing.svg',
 'virtualenv.sh',
 'ecology-species.csv']

In previous lessons, we saw how to use the Pandas library to load a table into memory as a DataFrame, how to select a subset of the data using some criteria, and how to write the DataFrame to a CSV file.

Let's now write a script that performs those three steps for just the year 2002.


In [254]:
import pandas as pd

# Load the data into a DataFrame
surveys = pd.read_csv('ecology-surveys.csv')

# Select only data for 2002
surveys2002 = surveys[surveys.year == 2002]

# Write the new DataFrame to a CSV file
surveys2002.to_csv('yearly_files/surveys_2002.csv')

One way we could create a file for every year would be to change this Python script for each year. This would be tedious, however, and tedium is what we're trying to avoid by using Python! Moreover, if we are changing our code frequently, we're likely to introduce an error at some point.

What we really want to do is to rewrite this script so that it is more general and can create all the files we need in one run.

Let's start by writing a loop that simply prints the names of the files we want to create: one file for each year.


In [255]:
# We can get the years in our data...
surveys.year


Out[255]:
0        1977
1        1977
2        1977
3        1977
4        1977
5        1977
6        1977
7        1977
8        1977
9        1977
10       1977
11       1977
12       1977
13       1977
14       1977
15       1977
16       1977
17       1977
18       1977
19       1977
20       1977
21       1977
22       1977
23       1977
24       1977
25       1977
26       1977
27       1977
28       1977
29       1977
         ... 
35519    2002
35520    2002
35521    2002
35522    2002
35523    2002
35524    2002
35525    2002
35526    2002
35527    2002
35528    2002
35529    2002
35530    2002
35531    2002
35532    2002
35533    2002
35534    2002
35535    2002
35536    2002
35537    2002
35538    2002
35539    2002
35540    2002
35541    2002
35542    2002
35543    2002
35544    2002
35545    2002
35546    2002
35547    2002
35548    2002
Name: year, dtype: int64

But we want only the unique years... How do we do that again?


In [257]:
surveys.year.unique()


Out[257]:
array([1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987,
       1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998,
       1999, 2000, 2001, 2002])

Putting this into a loop, we can generate filenames for each year.


In [259]:
for year in surveys.year.unique():
    filename = 'yearly_files/surveys_%s.csv' % str(year)
    print(filename)


yearly_files/surveys_1977.csv
yearly_files/surveys_1978.csv
yearly_files/surveys_1979.csv
yearly_files/surveys_1980.csv
yearly_files/surveys_1981.csv
yearly_files/surveys_1982.csv
yearly_files/surveys_1983.csv
yearly_files/surveys_1984.csv
yearly_files/surveys_1985.csv
yearly_files/surveys_1986.csv
yearly_files/surveys_1987.csv
yearly_files/surveys_1988.csv
yearly_files/surveys_1989.csv
yearly_files/surveys_1990.csv
yearly_files/surveys_1991.csv
yearly_files/surveys_1992.csv
yearly_files/surveys_1993.csv
yearly_files/surveys_1994.csv
yearly_files/surveys_1995.csv
yearly_files/surveys_1996.csv
yearly_files/surveys_1997.csv
yearly_files/surveys_1998.csv
yearly_files/surveys_1999.csv
yearly_files/surveys_2000.csv
yearly_files/surveys_2001.csv
yearly_files/surveys_2002.csv

A few things to note about this code:

  • We are looping over the years in the year column of our DataFrame. Because year is an integer, we convert it to a string with the built-in str() function before making it part of a filename. (Strictly speaking, %s formatting calls str() for us, but the explicit conversion makes the intent clear.)
  • Python allows us to embed values in text strings through string formatting. Here, we use a special formatting character, %s, to denote where in the string we want to embed the value. The % operator is then followed by the value we want to embed, as in the example below.
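
A minimal illustration (not a cell from the original notebook):

# %s converts the value to a string and splices it into the template
'yearly_files/surveys_%s.csv' % 1977  # 'yearly_files/surveys_1977.csv'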

Now, we have everything we need to create these separate files by year.


In [261]:
surveys = pd.read_csv('ecology-surveys.csv')

for year in surveys.year.unique():
    # Select data for the given year
    surveys_in_year = surveys[surveys.year == year]
    
    # Write the subdataset to a new file
    filename = 'yearly_files/surveys_%s.csv' % str(year)
    surveys_in_year.to_csv(filename)
    
os.listdir('yearly_files')


Out[261]:
['surveys_1978.csv',
 'surveys_1992.csv',
 'surveys_1982.csv',
 'surveys_1998.csv',
 'surveys_1994.csv',
 'surveys_2002.csv',
 'surveys_1980.csv',
 'surveys_1987.csv',
 'surveys_1986.csv',
 'surveys_1997.csv',
 'surveys_1984.csv',
 'surveys_1990.csv',
 'surveys_1996.csv',
 'surveys_1979.csv',
 'surveys_1983.csv',
 'surveys_1985.csv',
 'surveys_2001.csv',
 'surveys2002.csv',
 'surveys_2000.csv',
 'surveys_1993.csv',
 'surveys_1989.csv',
 'surveys_1991.csv',
 'surveys_1981.csv',
 'surveys_1988.csv',
 'surveys_1977.csv',
 'surveys_1999.csv',
 'surveys_1995.csv']

Challenge: Automation with For Loops

  1. Some of the surveys we saved are missing data; they have NaN values in one or more columns. Modify our for loop so that the entries with null values are not included in the yearly files.
  2. What happens if there are no data for a year in the sequence? You can generate a list of years for the for loop to use with, e.g., range(1970, 1980).
  3. Let's say you only want data for certain years. How would you modify your loop to generate a data file only for every fifth year, starting from 1977?

Building Reusable Code with Functions

Suppose that separating large data files into individual yearly files is a task that we frequently have to perform. We could write a for loop like the one above every time we needed to do it, but that would be time-consuming and error-prone. A more elegant solution is to create a reusable tool that performs this task with minimal input from the user.

To achieve this, we'll encapsulate the code we've written so far into a function. As we've seen, functions are reusable, self-contained pieces of code that are called with a single command. The first part of writing a function is envisioning how it will be used; how it will be called by the user.


In [262]:
def one_year_to_csv(all_data, the_year):
    '''
    Writes a CSV file for data from a given year.
    Arguments:
        all_data    The DataFrame with multi-year data
        the_year    The year to write, an integer
    '''
    pass

Some things to note about this function:

  • Our function takes two arguments, all_data and the_year.
  • We've included a docstring in our function. The docstring is any string that comes on the line immediately after the function definition. It can be a single-line string but it is most commonly a multi-line string, indicated by three quote symbols in order.
  • It doesn't do anything yet. We set up the function first so that we have an end goal in mind. Functions must have a body of code, so we used the keyword pass here, a special Python statement that says, "do nothing."

In [263]:
?one_year_to_csv

Now, let's populate the body of this function with the code we wrote before, making the necessary changes.


In [279]:
def one_year_to_csv(all_data, the_year):
    '''
    Writes a CSV file for data from a given year.
    Arguments:
        all_data    The DataFrame with multi-year data
        the_year    The year to write, an integer
    '''
    # Select data for the given year
    data_in_year = all_data[all_data.year == the_year]
    
    # Write the subdataset to a new file
    filename = 'yearly_files/surveys_%s.csv' % str(the_year)
    data_in_year.to_csv(filename)

In [280]:
one_year_to_csv(surveys, 2002)
os.listdir('yearly_files')


Out[280]:
['surveys_2002.csv']

This is a great start, but what we really want to do is create files for multiple years without having to request each year one at a time. Let's write another function that replicates the for loop we wrote earlier.


In [286]:
def multiple_years_to_csv(all_data, start_year, end_year):
    '''
    Writes separate CSV files for each year of data.
    Arguments:
        all_data      The DataFrame with multi-year data
        start_year    The first year of data we want
        end_year      The last year of data we want
    '''
    # Because range() generates a sequence up to but *not* including
    #     the last number, we add 1
    for year in range(start_year, end_year + 1):
        one_year_to_csv(all_data, year)

In [287]:
multiple_years_to_csv(surveys, 1977, 2002)
os.listdir('yearly_files')


Out[287]:
['surveys_1978.csv',
 'surveys_1992.csv',
 'surveys_1982.csv',
 'surveys_1998.csv',
 'surveys_1994.csv',
 'surveys_2002.csv',
 'surveys_1980.csv',
 'surveys_1987.csv',
 'surveys_1986.csv',
 'surveys_1997.csv',
 'surveys_1984.csv',
 'surveys_1990.csv',
 'surveys_1996.csv',
 'surveys_1979.csv',
 'surveys_1983.csv',
 'surveys_1985.csv',
 'surveys_2001.csv',
 'surveys_2000.csv',
 'surveys_1993.csv',
 'surveys_1989.csv',
 'surveys_1991.csv',
 'surveys_1981.csv',
 'surveys_1988.csv',
 'surveys_1977.csv',
 'surveys_1999.csv',
 'surveys_1995.csv']

We can add optional arguments to help make this function even easier to use.


In [288]:
def multiple_years_to_csv(all_data, start_year=1997, end_year=2002):
    '''
    Writes separate CSV files for each year of data.
    Arguments:
        all_data      The DataFrame with multi-year data
        start_year    The first year of data we want
        end_year      The last year of data we want
    '''
    # Because range() generates a sequence up to but *not* including
    #     the last number, we add 1
    for year in range(start_year, end_year + 1):
        one_year_to_csv(all_data, year)

But what if the years 1997 and 2002 are not in the data?


In [289]:
def multiple_years_to_csv(all_data, start_year=None, end_year=None):
    '''
    Writes separate CSV files for each year of data.
    Arguments:
        all_data      The DataFrame with multi-year data
        start_year    The first year of data we want
        end_year      The last year of data we want
    '''
    if start_year is None:
        start_year = min(all_data.year)
        
    if end_year is None:
        end_year = max(all_data.year)
    
    # Because range() generates a sequence up to but *not* including
    #     the last number, we add 1
    for year in range(start_year, end_year + 1):
        one_year_to_csv(all_data, year)

In [291]:
multiple_years_to_csv(surveys)
os.listdir('yearly_files/')


Out[291]:
['surveys_1978.csv',
 'surveys_1992.csv',
 'surveys_1982.csv',
 'surveys_1998.csv',
 'surveys_1994.csv',
 'surveys_2002.csv',
 'surveys_1980.csv',
 'surveys_1987.csv',
 'surveys_1986.csv',
 'surveys_1997.csv',
 'surveys_1984.csv',
 'surveys_1990.csv',
 'surveys_1996.csv',
 'surveys_1979.csv',
 'surveys_1983.csv',
 'surveys_1985.csv',
 'surveys_2001.csv',
 'surveys_2000.csv',
 'surveys_1993.csv',
 'surveys_1989.csv',
 'surveys_1991.csv',
 'surveys_1981.csv',
 'surveys_1988.csv',
 'surveys_1977.csv',
 'surveys_1999.csv',
 'surveys_1995.csv']

In the last example, we saw how we can build functions that call other functions we've built in the past. In this way, we can develop sophisticated data processing workflows piece-by-piece, based on code we already know works.

Challenge: Writing Reusable Functions

  1. What type of object corresponds to a variable declared as None? (Hint: Create a variable set to None and use the function type()).
  2. What happens if you only call multiple_years_to_csv() with all_data and an end_year (that is, without providing a start_year)? Can you write the function call with only a value for end_year?

Answering Questions with Data

There are many questions about tabular data that we can answer with Pandas.

Validating Assumptions about Data

How many animals of each species were caught in each plot? We can create a cross-tabulation of the species_id and plot_id fields as follows.


In [83]:
pd.crosstab(surveys.species_id, surveys.plot_id)


Out[83]:
plot_id 1 2 3 4 5 6 7 8 9 10 ... 15 16 17 18 19 20 21 22 23 24
species_id
AB 7 14 10 3 2 17 23 6 9 6 ... 12 9 3 9 11 37 8 12 26 16
AH 7 7 2 2 3 6 26 12 16 1 ... 90 10 34 15 9 35 19 4 23 17
AS 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
BA 1 1 19 0 4 0 0 0 0 0 ... 0 0 0 2 15 1 3 0 0 0
CB 0 0 1 1 1 1 4 2 1 1 ... 4 3 1 0 0 3 1 2 3 0
CM 0 1 3 0 0 0 0 1 1 2 ... 1 0 0 1 0 0 2 0 1 0
CQ 0 1 1 0 2 0 0 0 0 0 ... 0 2 0 0 1 4 0 1 4 0
CS 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
CT 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
CU 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
CV 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
DM 607 578 107 1075 443 222 14 569 977 9 ... 68 82 869 315 29 258 41 522 26 277
DO 488 313 83 62 98 224 7 488 205 3 ... 1 31 261 42 1 54 0 5 4 65
DS 234 137 18 277 70 107 3 212 286 4 ... 5 5 121 127 1 118 2 302 5 17
DX 1 4 0 3 1 0 0 3 6 0 ... 1 1 4 0 0 0 1 1 0 0
NL 54 201 97 6 28 17 12 8 23 2 ... 48 4 106 62 3 134 18 18 2 99
OL 51 68 63 57 75 46 0 55 70 14 ... 19 11 43 41 9 16 17 21 1 11
OT 102 194 154 76 109 150 13 97 86 8 ... 83 38 169 83 67 76 73 81 21 39
OX 0 2 1 0 1 0 0 0 0 0 ... 0 0 1 0 2 0 2 0 0 0
PB 112 171 337 97 0 245 22 104 21 1 ... 84 30 67 252 263 132 363 256 3 8
PC 1 0 2 1 0 1 1 1 2 3 ... 5 0 0 0 1 4 0 0 3 0
PE 13 146 66 1 34 69 148 11 19 7 ... 68 43 39 30 75 74 5 23 55 188
PF 31 21 129 136 40 110 1 56 37 0 ... 142 4 66 150 141 114 166 56 6 4
PG 0 0 0 0 0 1 0 0 0 0 ... 0 1 0 1 0 0 2 0 0 0
PH 3 0 2 0 1 0 0 0 0 0 ... 0 0 0 1 0 0 9 9 0 0
PI 0 0 4 1 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
PL 0 0 2 0 1 0 5 0 2 7 ... 3 2 0 0 1 0 0 0 4 8
PM 15 36 40 4 50 27 71 7 11 54 ... 53 58 22 61 113 32 33 4 56 73
PP 220 187 373 115 59 203 113 217 102 2 ... 201 39 111 107 182 137 123 43 5 23
PU 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 3 0 0 1 1
PX 0 0 1 0 0 0 2 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
RF 0 1 5 0 5 8 2 0 0 12 ... 1 1 0 2 2 10 2 0 3 5
RM 23 71 208 19 105 81 259 13 27 166 ... 117 166 82 120 197 116 202 17 189 160
RO 1 1 0 0 1 0 0 0 0 2 ... 0 0 0 0 1 0 0 0 2 0
RX 0 0 0 0 1 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
SA 2 3 5 0 1 0 0 1 1 10 ... 3 3 6 1 8 0 1 2 8 2
SC 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
SF 0 0 3 0 0 0 0 0 1 0 ... 2 0 0 3 7 0 7 1 1 0
SH 0 15 33 2 3 9 0 0 2 2 ... 5 0 15 6 8 1 23 0 0 4
SO 0 8 22 0 0 1 0 0 0 0 ... 0 0 0 1 0 6 0 0 0 3
SS 12 7 17 22 12 15 1 15 23 7 ... 6 6 16 5 2 6 11 5 0 5
ST 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
SU 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 2 1 1 0 0
UL 0 1 0 0 0 0 0 1 0 0 ... 0 0 1 0 0 0 0 0 0 0
UP 3 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 2 0
UR 1 1 0 0 0 1 1 1 0 1 ... 0 0 0 1 0 0 0 0 0 0
US 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
ZL 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 1 0

48 rows × 24 columns

Sorting on Values

We might ask what the heaviest animals in our dataset are.


In [39]:
surveys.sort_values(by = 'weight', ascending = False).head(10)


Out[39]:
record_id month day year plot_id species_id sex hindfoot_length weight species_factor sex_factor species_id_codes
X
33048 33049 11 17 2001 12 NL M 33.0 280.0 15 1 15
12870 12871 5 28 1987 2 NL M 32.0 278.0 15 1 15
15458 15459 1 11 1989 9 NL M 36.0 275.0 15 1 15
2132 2133 10 25 1979 2 NL F 33.0 274.0 15 0 15
12728 12729 4 26 1987 2 NL M 32.0 270.0 15 1 15
13113 13114 7 26 1987 2 NL M NaN 269.0 15 1 15
30174 30175 1 8 2000 2 NL M 34.0 265.0 15 1 15
4961 4962 11 22 1981 12 NL F NaN 264.0 15 0 15
13024 13025 7 1 1987 2 NL M 33.0 260.0 15 1 15
12601 12602 4 6 1987 2 NL M 34.0 260.0 15 1 15
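
As an aside (not part of the original lesson), Pandas also provides nlargest() as a more direct way to get the top rows by a column's value:

# Returns the 10 rows with the largest weight values
surveys.nlargest(10, 'weight')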

We might also ask for the most recent observation in our data. We could assume that the rows of the table are in chronological order and simply look at the last row; however, there are many reasons why this may not be the case. Instead, we can sort the data by date in descending order. We don't have a single date field, but we can sort on multiple fields to achieve the same effect.


In [40]:
surveys.sort_values(by = ['year', 'month', 'day'], ascending = False).head()


Out[40]:
record_id month day year plot_id species_id sex hindfoot_length weight species_factor sex_factor species_id_codes
X
35489 35490 12 31 2002 3 PE F 19.0 19.0 21 0 21
35490 35491 12 31 2002 3 PB M 26.0 46.0 19 1 19
35491 35492 12 31 2002 3 PB F 26.0 26.0 19 0 19
35492 35493 12 31 2002 3 PB F 26.0 34.0 19 0 19
35493 35494 12 31 2002 3 PB M 28.0 43.0 19 1 19