Overview

This lesson introduces Python as an environment for data analysis and visualization. The materials are based on the Data Carpentry Python for Ecologists lesson. However, the lesson focuses on general analysis and visualization of tabular data and is not specific to ecologists or ecological data. As Data Carpentry explains:

Data Carpentry’s aim is to teach researchers basic concepts, skills, and tools for working with data so that they can get more done in less time, and with less pain.

At the end of this lesson, you will be able to:

  • Read and write basic Python code;
  • Import and export tabular data with Python;
  • Subset and filter tabular data;
  • Understand different data types and data formats;
  • Understand pandas DataFrames and how they help organize tabular data;
  • Devise and interpret data processing workflows;
  • Automate your data cleaning and analysis with Python;
  • Visualize your data using matplotlib and pandas;
  • Connect to a SQLite database using Python.

This lesson will not prepare you to use Python as a general-purpose programming language; there are parts of the language we won't have time to cover. By the end of this lesson, however, you will have a good grasp of Python syntax and be well prepared to learn the rest of the language, if you desire to do so. Even without seeing all of the Python language, you will be prepared to analyze and visualize data in Python using pandas and matplotlib.

Introduction to the Python Programming Language

The Python Interpreter

Jupyter Notebook

For this lesson, we'll be using the Python interpreter that is embedded in Jupyter Notebook. Jupyter Notebook is a browser-based environment for literate programming: combining Python code with rich text that tells the story of the task you set out to accomplish. This is a powerful way to collect the code, the analysis, the context, and the results in a single place.

The Python interpreter we'll interact with in Jupyter Notebook is the same interpreter we could use from the command line. To launch Jupyter Notebook:

  • In GNU/Linux or Mac OS X, launch the Terminal and type: jupyter notebook; then press ENTER.
  • In Windows, launch the Command Prompt and type jupyter notebook; then press ENTER.

Python Data Types

Strings, Integers, and Floats

Operators

Sequences

Tuples

Challenge: Tuples and Lists


In [1]:
a_tuple = ('a', 'b', 'c', 'd')
a_list = ['a', 'b', 'c', 'd']
  1. What happens when you try to re-assign the first value in each of the following examples?
    a_tuple[0] = 5
    a_list[0] = 5
    
  2. Type type(a_tuple) into Python; what is the object's type?
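One way to explore this challenge is the sketch below, which attempts both reassignments and catches the resulting error:

```python
# Tuples are immutable; lists are mutable.
a_tuple = ('a', 'b', 'c', 'd')
a_list = ['a', 'b', 'c', 'd']

a_list[0] = 5      # works: lists support item assignment
print(a_list)      # [5, 'b', 'c', 'd']

try:
    a_tuple[0] = 5  # raises TypeError: tuples do not support item assignment
except TypeError as err:
    print(type(a_tuple), err)
```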

Dictionaries

Challenge: Reassignment in a Dictionary

How can we change a value in our dictionary? Try to reassign one of the values in the code_book dictionary.
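The contents of the lesson's code_book dictionary aren't reproduced here, so this sketch assumes a small stand-in mapping species codes to names:

```python
# Hypothetical stand-in for the lesson's code_book dictionary.
code_book = {'NL': 'Neotoma albigula', 'DM': 'Dipodomys merriami'}

# Reassignment uses the same square-bracket syntax as the initial assignment:
code_book['NL'] = 'white-throated woodrat'
print(code_book['NL'])
```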

Functions

Challenge: Writing Your First Function

To convert temperatures from Fahrenheit to Celsius, we first subtract $32$ and then multiply by $5/9$. Write a function that converts temperatures from Fahrenheit to Celsius.
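One possible solution (the function name is a suggestion, not prescribed by the lesson):

```python
def fahr_to_celsius(temp_fahr):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_fahr - 32) * 5 / 9

print(fahr_to_celsius(32))    # freezing point of water: 0.0
print(fahr_to_celsius(212))   # boiling point of water: 100.0
```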

Data Analysis Workflows

In general, when we have a data analysis task, there are specific steps that we undertake, in order:

  1. Bringing data together in a common format that is both human-readable and machine-readable (e.g., through spreadsheets, before exporting as a CSV).
  2. Cleaning data, e.g., merging text labels that are slightly different (e.g., "Tree frogs," "tree frogs," and "treefrogs"), splitting and combining fields as needed (e.g., in OpenRefine or in Python).
  3. Archiving data, keeping the raw data separate from any cleaned data, for further use and analysis (e.g., using a relational database and SQL).
  4. Deriving a subset of cleaned data for use in a specific analysis, such as a certain date range, certain field site, or certain collection (e.g., in Python or SQL).
  5. Performing analysis to answer a question or make a decision, such as grouping and summarizing data, calculating a statistic, or generating a plot (e.g., in Python or SQL).

Managing Data in Python

We've seen how we can use spreadsheets to effectively organize data. You probably know how to do some basic analysis on tabular data using spreadsheet software programs like Microsoft Excel. Why, then, should we learn to use Python for data analysis?

  • Because computers are good at automating tedious tasks (and humans are not), we can use Python to instruct a computer to repeat a task over and over again. This saves us frustration and ensures that the task is performed consistently, without human error.
  • More specifically, once we have written the code to do some task with data the first time, we can apply it over and over again to subsequent datasets.
  • Ultimately, we can save the code and share it with other people if they want to perform the same analysis or understand what we did. The code is then a record of what we did to get our results. This is a key component of reproducibility in computer-aided science.

About the Data

For this lesson, we will be using the Portal Project Teaching Database, a subset of data from Ernst et al.'s (2009) long-term monitoring and experimental manipulation of a Chihuahuan desert ecosystem near Portal, Arizona, U.S.A. We are studying the species and weight of animals caught in plots in our study area. The dataset is stored as a comma-separated values (CSV) file: each row holds information for a single animal, and the columns represent:

Column             Description
record_id          Unique ID for the observation
month              Month of observation
day                Day of observation
year               Year of observation
plot_id            ID of a particular plot
species_id         2-letter code identifying the species
sex                Sex of animal ("M", "F")
hindfoot_length    Length of the hindfoot in millimeters
weight             Weight of the animal in grams

About Libraries

Reading CSV Data in Pandas
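In the lesson the data live in a file (e.g., a path like surveys.csv); the sketch below uses a small in-memory sample to stand in for that file so it runs anywhere:

```python
import io
import pandas as pd

# A two-row stand-in for the surveys CSV file; in practice you would pass
# a file path such as 'surveys.csv' to pd.read_csv().
csv_text = io.StringIO(
    "record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight\n"
    "1,7,16,1977,2,NL,M,32.0,\n"
    "2,7,16,1977,3,NL,M,33.0,\n"
)
surveys = pd.read_csv(csv_text)
print(surveys.dtypes)  # pandas infers a data type for each column
```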

Data Frames

Manipulating Data in Python

Challenge: Viewing DataFrames in Python

Try executing each code sample below and see what is returned.

  • surveys.columns
  • surveys.head()
  • surveys.head(15)
  • surveys.tail()
  • surveys.shape

Take note of the output of surveys.shape; what format does it return?

Finally, what is the difference between the code samples that end in parentheses and those that do not?

Calculating Statistics in a Pandas DataFrame
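A minimal sketch of summary statistics, using hypothetical values in place of the full surveys data:

```python
import pandas as pd

# Hypothetical miniature of the surveys DataFrame.
surveys = pd.DataFrame({
    'species_id': ['NL', 'NL', 'DM'],
    'weight': [40.0, 48.0, 45.0],
})

print(surveys['weight'].describe())  # count, mean, std, min, quartiles, max
print(surveys['weight'].mean())
print(surveys['weight'].max())
```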

Challenge: Unique Levels for a Column

  1. Create a list of unique plot IDs found in the survey data; assign the list of unique IDs to a variable called plot_names. How many unique plots are there in the data? How many unique species are in the data?
  2. What is the difference between len(plot_names) and plot_names.shape?

Groups in Pandas

Quickly Creating Summary Counts in Pandas
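The grouped-count pattern can be sketched as follows, again with stand-in data:

```python
import pandas as pd

# Hypothetical miniature of the surveys DataFrame.
surveys = pd.DataFrame({
    'record_id': [1, 2, 3, 4],
    'species_id': ['NL', 'NL', 'DM', 'DO'],
    'weight': [40.0, 48.0, 45.0, 52.0],
})

# Group the rows by species, then count the records in each group:
species_counts = surveys.groupby('species_id')['record_id'].count()
print(species_counts)
```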

Challenge: Understanding Grouped DataFrames

  1. In that last command, we asked for the record_id column. Try asking for a different column in the square brackets. Do you get a different result? Why or why not?
  2. How can we get a count of just the records with species_id set to DO? Hint: You can build on the last command we executed; think about Dictionaries and key-value pairs.

Basic Math on DataFrame Columns
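Arithmetic on a DataFrame column is applied element-wise, as this sketch (with hypothetical weights) shows:

```python
import pandas as pd

# Hypothetical weights in grams.
surveys = pd.DataFrame({'weight': [40.0, 48.0, 45.0]})

# Dividing the column divides every value in it:
weight_kg = surveys['weight'] / 1000   # grams to kilograms
print(weight_kg)
```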

Basic Plotting with Pandas
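The group-summarize-plot pipeline can be sketched as below; the data are hypothetical, and the non-interactive Agg backend is selected so the example runs without a display:

```python
import matplotlib
matplotlib.use('Agg')  # render plots off-screen; not needed in Jupyter
import pandas as pd

# Hypothetical miniature of the surveys DataFrame.
surveys = pd.DataFrame({
    'species_id': ['NL', 'NL', 'DM'],
    'weight': [40.0, 48.0, 45.0],
})

# Group, summarize, then plot the summary as a bar chart:
mean_weight = surveys.groupby('species_id')['weight'].mean()
ax = mean_weight.plot(kind='bar', title='Mean weight by species')
ax.set_ylabel('Weight (g)')
```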

Challenge: Plotting

  1. Create a plot of the average weight in each species.
  2. Create a plot of total males and total females across the entire dataset.

Note: Some of the species have no weight measurements; they are entered as NaN, which stands for "not a number" and refers to missing values.

Multiple Grouping in Pandas

Transforming Data

Up to this point, we have learned:

  • The basic data types in Python;
  • How to read a CSV file into Python using pandas;
  • How tabular data in pandas are represented using a DataFrame;
  • How to check the data type of each column in a DataFrame;
  • How to summarize numerical data in a DataFrame;
  • How to group data according to the unique values of a column;
  • How to create basic plots of grouped, summarized data.

Converting Units

Transforming Values

Indexing and Slicing Python DataFrames

Indexing and Slicing in Python

Selecting Data Using Labels (Column Headings)

Extracting a Range of Data with Slicing

Recall that in Python, we start counting from zero instead of one. This means that the first element in an object is located at position zero.


In [94]:
grades = [88, 72, 93, 94]


In [95]:
grades[2]


Out[95]:
93


In [96]:
grades[1:3]


Out[96]:
[72, 93]

Challenge: Slicing

What do each of these lines of code return?

  1.  grades[0]
  2.  grades[len(grades)]
  3.  grades[4]

Why do (2) and (3) return errors?

Slicing Subsets of Rows in Python

Oops: Referencing versus Copying Objects in Python
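The distinction can be sketched as follows (hypothetical data): assigning a DataFrame to a new name creates a reference to the same object, while copy() creates an independent object.

```python
import pandas as pd

surveys = pd.DataFrame({'weight': [40.0, 48.0]})

ref = surveys              # a reference: both names point to the same object
true_copy = surveys.copy() # an independent copy of the data

surveys.loc[0, 'weight'] = 0.0
print(ref.loc[0, 'weight'])        # changed too: 0.0
print(true_copy.loc[0, 'weight'])  # unchanged: 40.0
```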

Slicing Subsets of Rows and Columns with Pandas

Challenge: Slicing Rows and Columns

What happens when you type:

  1.  surveys[0:3]
  2.  surveys[:5]
  3.  surveys[-1:]

To review...

To index by rows in Pandas:


In [168]:
surveys[0:3]
surveys.iloc[0:3]
surveys.iloc[0:3,:]


Out[168]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN

To index by columns (and rows) in Pandas


In [174]:
surveys[['month', 'day', 'year']]
surveys.loc[0:3, ['month', 'day', 'year']]
surveys.iloc[0:3, 1:4]


Out[174]:
month day year
0 7 16 1977
1 7 16 1977
2 7 16 1977

Subsetting Data Using Criteria
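A minimal sketch of criteria-based subsetting, assuming a small stand-in for the surveys data: a comparison on a column yields a boolean Series, which selects the matching rows.

```python
import pandas as pd

# Hypothetical miniature of the surveys DataFrame.
surveys = pd.DataFrame({
    'year': [1977, 1978, 1977],
    'weight': [40.0, 48.0, 45.0],
})

# One criterion:
print(surveys[surveys['year'] == 1977])

# Multiple criteria, combined with & (and) or | (or); each condition
# must be wrapped in parentheses:
print(surveys[(surveys['year'] == 1977) & (surveys['weight'] > 42)])
```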

Challenge: Filtering Data

  1. Filter the surveys table to observations of female members of the DO species. How many are there? What is their average weight?
  2. Look at the help documentation for the isin() function (Hint: ?surveys.year.isin). Use this function to filter the surveys DataFrame to those rows that match the three species: OL, OT, OX.

Using Masks

Dealing with Missing Data
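Common ways of handling NaN values can be sketched with a small hypothetical column:

```python
import numpy as np
import pandas as pd

surveys = pd.DataFrame({'weight': [40.0, np.nan, 45.0]})

print(surveys['weight'].isnull())   # boolean mask of missing values
print(surveys['weight'].mean())     # NaN is skipped by default: 42.5
print(surveys.dropna())             # drop rows containing any NaN
print(surveys['weight'].fillna(0))  # or replace NaN with a chosen value
```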

Combining Multiple Datasets

At this point, we have learned:

  • About zero-based indexing in Python;
  • How to manipulate and extract data using column headings and index locations;
  • How to employ slicing to select subsets of a DataFrame;
  • How to re-assign values within a DataFrame;
  • How to create a copy of a DataFrame;
  • How to query subsets of a DataFrame that match certain criteria using logical operators;
  • How to interpret NaN values and best practices for missing data.

Concatenating DataFrames
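Concatenation stacks the rows of one DataFrame on top of another; a sketch with hypothetical data:

```python
import pandas as pd

first = pd.DataFrame({'record_id': [1, 2], 'weight': [40.0, 48.0]})
second = pd.DataFrame({'record_id': [3], 'weight': [45.0]})

# ignore_index=True renumbers the combined rows 0..n-1 instead of
# keeping each frame's original index values.
combined = pd.concat([first, second], ignore_index=True)
print(combined)
```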

Writing Data to a CSV File
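A sketch of to_csv(); in the lesson you would pass an output path, but an in-memory buffer stands in for the file here so the example is self-contained:

```python
import io
import pandas as pd

surveys = pd.DataFrame({'record_id': [1, 2], 'weight': [40.0, 48.0]})

# index=False omits the row-index column from the output file.
buffer = io.StringIO()  # stands in for a path such as 'surveys_out.csv'
surveys.to_csv(buffer, index=False)
print(buffer.getvalue())
```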

Joining DataFrames

Note: Use this webpage as a graphical reference for this segment.

Identifying Join Keys

Inner Joins

Left Joins

Other Joins

  • A right (outer) join is similar to a left join except that it takes all rows from the right data frame and only matching rows from the left.
  • A full (outer) join returns all rows from both data frames, matching rows on the join key where possible and filling unmatched cells with NaN. This type is rarely used.
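The join types can be sketched with miniature stand-ins for surveys and species:

```python
import pandas as pd

surveys = pd.DataFrame({'record_id': [1, 2, 3],
                        'species_id': ['NL', 'DM', 'XX']})
species = pd.DataFrame({'species_id': ['NL', 'DM'],
                        'taxa': ['Rodent', 'Rodent']})

# Inner join: only rows whose species_id appears in both frames.
inner = pd.merge(surveys, species, on='species_id', how='inner')

# Left join: all surveys rows are kept; unmatched rows get NaN for taxa.
left = pd.merge(surveys, species, on='species_id', how='left')

print(len(inner))  # 2
print(len(left))   # 3
```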

Challenges: Joins

Create a new data frame by joining the contents of surveys and species. Then, calculate and plot the distribution of taxa by plot_id.

Automating Data Workflows with Python

Automating Data Processing with For Loops
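The pattern can be sketched as follows with hypothetical data; in the lesson each yearly subset is written to its own CSV file, while here the subsets are collected in a dictionary so the example is self-contained:

```python
import pandas as pd

surveys = pd.DataFrame({'year': [1977, 1977, 1978],
                        'weight': [40.0, 48.0, 45.0]})

# One subset per unique year in the data:
by_year = {}
for year in surveys['year'].unique():
    by_year[year] = surveys[surveys['year'] == year]
    # in the lesson: by_year[year].to_csv('surveys_%d.csv' % year)

print(sorted(by_year))
```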

Challenge: Automation with For Loops

  1. Some of the surveys we saved are missing data; they have NaN values in one or more columns. Modify our for loop so that the entries with null values are not included in the yearly files.
  2. What happens if there are no data for a year in the sequence? You can generate the list of years for the for loop to iterate over with, e.g., range(1970, 1980).
  3. Let's say you only want to look at data from a given multiple of years. How would you modify your loop in order to generate a data file for only every 5th year, starting from 1977?

Building Reusable Code with Functions

Challenge: Writing Reusable Functions

  1. What type of object corresponds to a variable declared as None? (Hint: Create a variable set to None and use the function type()).
  2. What happens if you only call multiple_years_to_csv() with all_data and an end_year (that is, without providing a start_year)? Can you write the function call with only a value for end_year?

Answering Questions with Data

Validating Assumptions about Data

Sorting on Values
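Sorting a DataFrame by column values can be sketched as below (hypothetical data):

```python
import pandas as pd

surveys = pd.DataFrame({'species_id': ['NL', 'DM', 'DO'],
                        'weight': [40.0, 48.0, 45.0]})

# Sort rows by one or more columns; ascending=False gives descending order.
heaviest_first = surveys.sort_values('weight', ascending=False)
print(heaviest_first)
```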