This lesson introduces Python as an environment for data analysis and visualization. The materials are based on the Data Carpentry Python for Ecologists lesson. However, the lesson focuses on general analysis and visualization of tabular data and is not specific to ecologists or ecological data. As Data Carpentry explains:
Data Carpentry’s aim is to teach researchers basic concepts, skills, and tools for working with data so that they can get more done in less time, and with less pain.
At the end of this lesson, you will be able to:
This lesson will not prepare you to use Python as a general-purpose programming language; there are some parts of the language we won't have time to cover. However, at the end of this lesson, you will have a good grasp of Python syntax and be well-prepared to learn the rest of the language, if you desire to do so. Even without seeing all of the Python programming language, you will be prepared to analyze and visualize data in Python using pandas and matplotlib.
For this lesson, we'll be using the Python interpreter that is embedded in Jupyter Notebook. Jupyter Notebook is a fancy, browser-based environment for literate programming: the combination of Python scripts with rich text for telling a story about the task you set out to do with Python. This is a powerful way to collect the code, the analysis, the context, and the results in a single place.
The Python interpreter we'll interact with in Jupyter Notebook is the same interpreter we could use from the command line. To launch Jupyter Notebook:
jupyter notebook
Type this command at the command line; then press ENTER.
In [1]:
a_tuple = ('a', 'b', 'c', 'd')
a_list = ['a', 'b', 'c', 'd']
a_tuple[0] = 5
a_list[0] = 5
Type type(a_tuple) into Python; what is the object's type?
How can we change a value in our dictionary? Try to reassign one of the values in the code_book dictionary.
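The difference between tuples, lists, and dictionaries can be checked with a few lines. This is a minimal sketch: the contents of code_book below are a hypothetical stand-in, since the dictionary's real keys and values are not shown here.

```python
a_tuple = ('a', 'b', 'c', 'd')
a_list = ['a', 'b', 'c', 'd']

# Lists are mutable, so item assignment works.
a_list[0] = 5

# Tuples are immutable, so the same assignment raises a TypeError.
try:
    a_tuple[0] = 5
    tuple_changed = True
except TypeError:
    tuple_changed = False

# Hypothetical dictionary; the keys and values are made up for illustration.
code_book = {'first': 'one', 'second': 'two'}

# Reassign the value stored under an existing key.
code_book['first'] = 'uno'

print(type(a_tuple), a_list, tuple_changed, code_book)
```

Catching the TypeError lets the rest of the cell keep running, which is convenient when comparing the two assignments side by side.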
To convert temperatures from Fahrenheit to Celsius, we first subtract $32$ and then multiply by $5/9$. Write a function that converts temperatures from Fahrenheit to Celsius.
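One possible solution is sketched below. The function name fahrenheit_to_celsius and its parameter name are choices made for this example, not names given by the lesson.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    # Subtract 32 first, then scale by 5/9.
    return (temp_f - 32) * 5 / 9


# Water freezes at 32 F and boils at 212 F.
freezing_c = fahrenheit_to_celsius(32)
boiling_c = fahrenheit_to_celsius(212)
print(freezing_c, boiling_c)
```

The parentheses matter: without them, Python would multiply 32 by 5/9 before subtracting.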
In general, when we have a data analysis task, there are specific steps that we undertake, in order:
We've seen how we can use spreadsheets to effectively organize data. You probably know how to do some basic analysis on tabular data using spreadsheet software programs like Microsoft Excel. Why, then, should we learn to use Python for data analysis?
For this lesson, we will be using the Portal Project Teaching Database, a subset of data from Ernest et al.'s (2009) long-term monitoring and experimental manipulation of a Chihuahuan desert ecosystem near Portal, Arizona, U.S.A. We are studying the species and weight of animals caught in plots in our study area. The dataset is stored as a comma-separated values (CSV) file: each row holds information for a single animal, and the columns represent:
Column | Description |
---|---|
record_id | Unique ID for the observation |
month | Month of observation |
day | Day of observation |
year | Year of observation |
plot_id | ID of a particular plot |
species_id | 2-letter code identifying the species |
sex | Sex of animal ("M","F") |
hindfoot_length | Length of the hindfoot in millimeters |
weight | Weight of the animal in grams |
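As a rough sketch of how a file with these columns can be loaded into pandas, the example below reads a tiny in-memory CSV. The three rows and the species codes AA and BB are made up for illustration and are not taken from the real dataset; with the actual file, a file path would be passed to pd.read_csv instead of the StringIO object.

```python
import io

import pandas as pd

# Made-up rows for illustration only; NOT real survey records.
csv_text = """record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
101,1,5,2000,1,AA,M,30,45
102,1,5,2000,2,BB,F,25,
103,2,9,2001,1,AA,F,31,50
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO.
surveys = pd.read_csv(io.StringIO(csv_text))

print(surveys.columns)   # attribute: the column labels
print(surveys.shape)     # attribute: a (rows, columns) tuple
print(surveys.head(2))   # method: the first two rows
```

The empty weight field in the second row is read as NaN, which is how pandas marks a missing value.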
Try executing each code sample below and see what is returned.
surveys.columns
surveys.head()
surveys.head(15)
surveys.tail()
surveys.shape
Take note of the output of surveys.shape; what format does it return?
Finally, what is the difference between the code samples that end in parentheses and those that do not?
plot_names. How many unique plots are there in the data? How many unique species are in the data?
len(plot_names) and plot_names.shape?
record_id column. Try asking for a different column in the square brackets. Do you get a different result? Why or why not?
species_id set to DO? Hint: You can build on the last command we executed; think about Dictionaries and key-value pairs.
Note: Some of the species have no weight measurements; they are entered as NaN, which stands for "not a number" and refers to missing values.
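To make the ideas of unique values and missing values concrete, here is a small sketch on a toy DataFrame. The rows and species codes are hypothetical; only the column names follow the table above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one weight is missing (NaN).
surveys = pd.DataFrame({
    'plot_id': [1, 2, 1, 3],
    'species_id': ['AA', 'BB', 'AA', 'CC'],
    'weight': [40.0, np.nan, 50.0, 12.0],
})

# pd.unique returns an array of the distinct values in a column.
plot_names = pd.unique(surveys['plot_id'])
n_plots = len(plot_names)          # an integer count
plots_shape = plot_names.shape     # a one-element tuple

# nunique counts the distinct values directly.
n_species = surveys['species_id'].nunique()

# By default, mean() skips NaN values rather than returning NaN.
mean_weight = surveys['weight'].mean()
n_missing = surveys['weight'].isna().sum()

print(n_plots, plots_shape, n_species, mean_weight, n_missing)
```

Because NaN values are skipped, the mean here is taken over three weights, not four.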
Up to this point, we have learned:
Recall that in Python, we start counting from zero instead of one. This means that the first element in an object is located at position zero.
In [94]:
grades = [88, 72, 93, 94]
In [95]:
grades[2]
Out[95]:
93
In [96]:
grades[1:3]
Out[96]:
[72, 93]
What does each of these lines of code return?
grades[0]
grades[len(grades)]
grades[4]
Why do (2) and (3) return errors?
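The sketch below works through zero-based indexing on the same grades list. It wraps the out-of-range lookup in try/except so the cell keeps running; the variable names other than grades are choices made for this example.

```python
grades = [88, 72, 93, 94]

first = grades[0]                  # position 0 is the first element
last = grades[len(grades) - 1]     # the last valid position is len(grades) - 1
middle = grades[1:3]               # positions 1 and 2; the stop index is excluded

# len(grades) is 4, but valid positions run from 0 to 3,
# so asking for position 4 raises an IndexError.
try:
    grades[len(grades)]
    message = ''
except IndexError as err:
    message = str(err)

print(first, last, middle, message)
```

Negative indices count from the end, so grades[-1] is another way to reach the last element.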
What happens when you type:
surveys[0:3]
surveys[:5]
surveys[-1:]
To review...
To index by rows in Pandas:
In [168]:
surveys[0:3]
surveys.iloc[0:3]
surveys.iloc[0:3,:]
Out[168]:
To index by columns (and rows) in Pandas:
In [174]:
surveys[['month', 'day', 'year']]
surveys.loc[0:3, ['month', 'day', 'year']]
surveys.iloc[0:3, 1:4]
Out[174]:
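One detail worth checking by hand is that loc and iloc treat the end of a slice differently: iloc works on integer positions and excludes the stop position, while loc works on labels and includes the stop label. A minimal sketch on a hypothetical five-row DataFrame (values made up, default integer index):

```python
import pandas as pd

# Hypothetical toy data with the default integer index 0..4.
surveys = pd.DataFrame({
    'record_id': [101, 102, 103, 104, 105],
    'month': [1, 1, 2, 2, 3],
    'day': [5, 5, 9, 9, 14],
    'year': [2000, 2000, 2000, 2000, 2000],
})

# Position-based: rows at positions 0, 1, 2 and columns at positions 1, 2, 3.
by_position = surveys.iloc[0:3, 1:4]

# Label-based: rows labeled 0 through 3 (inclusive) and the named columns.
by_label = surveys.loc[0:3, ['month', 'day', 'year']]

print(by_position.shape)
print(by_label.shape)
```

With the default index the labels and positions coincide, so the only visible difference is the extra row returned by loc.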
surveys table to observations of female members of the DO species. How many are there? What is their average weight?
isin() function (Hint: ?surveys.year.isin). Use this function to filter the surveys DataFrame to those rows that match the three species: OL, OT, OX.
At this point, we have learned:
Note: Use this webpage as a graphical reference for this segment.
Create a new data frame by joining the contents of surveys and species. Then, calculate and plot the distribution of taxa by plot_id.
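A join followed by a per-plot count could look roughly like the sketch below. The two toy tables are hypothetical, and it is an assumption of this example that species has species_id and taxa columns; the real tables may differ.

```python
import pandas as pd

# Hypothetical toy tables; values are made up for illustration.
surveys = pd.DataFrame({
    'record_id': [101, 102, 103, 104],
    'plot_id': [1, 1, 2, 2],
    'species_id': ['AA', 'BB', 'AA', 'CC'],
})
species = pd.DataFrame({
    'species_id': ['AA', 'BB', 'CC'],
    'taxa': ['Rodent', 'Bird', 'Rodent'],
})

# Left join: keep every survey row and attach the matching species info.
merged = pd.merge(surveys, species, on='species_id', how='left')

# Count records per (plot_id, taxa) pair, then pivot taxa into columns.
taxa_by_plot = (
    merged.groupby(['plot_id', 'taxa'])['record_id']
    .count()
    .unstack(fill_value=0)
)

print(taxa_by_plot)
```

With matplotlib installed, calling taxa_by_plot.plot(kind='bar', stacked=True) would draw one stacked bar per plot.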
NaN values in one or more columns. Modify our for loop so that the entries with null values are not included in the yearly files.
for loop to use with, e.g., range(1970, 1980).
None? (Hint: Create a variable set to None and use the function type()).
multiple_years_to_csv() with all_data and an end_year (that is, without providing a start_year)? Can you write the function call with only a value for end_year?
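The questions above touch on dropping null rows, default arguments, and keyword arguments. The sketch below shows how these could fit together. It is a hypothetical version of multiple_years_to_csv: the signature, the None defaults, the out_dir parameter, and the output file names are all assumptions of this example, and the toy data is made up.

```python
import os
import tempfile

import pandas as pd


def multiple_years_to_csv(all_data, start_year=None, end_year=None, out_dir='.'):
    """Write one CSV per year and return the paths written.

    Hypothetical sketch: the signature, defaults, and file names are assumptions.
    """
    # None marks "not provided"; fall back to the range found in the data.
    if start_year is None:
        start_year = int(all_data['year'].min())
    if end_year is None:
        end_year = int(all_data['year'].max())

    # Drop rows with NaN in any column so they stay out of the yearly files.
    complete = all_data.dropna()

    paths = []
    for year in range(start_year, end_year + 1):
        one_year = complete[complete['year'] == year]
        if one_year.empty:
            continue  # skip years with no complete records
        path = os.path.join(out_dir, 'surveys' + str(year) + '.csv')
        one_year.to_csv(path, index=False)
        paths.append(path)
    return paths


# Made-up data for illustration; the None weight becomes NaN.
all_data = pd.DataFrame({
    'year': [2000, 2000, 2001, 2002],
    'species_id': ['AA', 'BB', 'AA', 'CC'],
    'weight': [40.0, None, 50.0, 12.0],
})

with tempfile.TemporaryDirectory() as tmp:
    # Only end_year is given, by keyword; start_year falls back to its default.
    written = [os.path.basename(p)
               for p in multiple_years_to_csv(all_data, end_year=2001, out_dir=tmp)]

print(written)
print(type(None))
```

Passing end_year by keyword is what makes it possible to skip start_year; a positional call would assign the value to start_year instead.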