CASE - Biodiversity data - data cleaning and enrichment

DS Data manipulation, analysis and visualisation in Python
December, 2019

© 2016, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.style.use('seaborn-whitegrid')

Scenario:
You are interested in occurrence data for a number of species in Flanders. Unfortunately, the sources for this type of data are still scattered among different institutes. After a mailing campaign, you receive a number of files from different formats, in various data formats and styles...

You decide to be brave and script the interpretation and transformation, in order to provide reproducibility of your work. Moreover, similar analysis will be needed in the future with new data requests. You hope that future data requests will result in similar data formats from the individual partners. So, having a script will enhance the efficiency at that moment.

Besides from the technical differences in the data formats (csv, excel, shapefile, sqlite...), there are also differences in the naming of the content. For example, the coordinates, can be named x/y, decimalLatitude/decimalLongitude, lat/long... Luckely, you know of an international open data standard to describe occurrence data, i.e. Darwin Core (DwC). Instead of inventing your own data model, you decide to comply to this international standard. The latter will enhance communication and will also make your data compliant to other data services working with this kind of data.

In short, the DwC describes a flat table (cfr. CSV) with an agreed name convention on the header names and conventions on how certain data types need to be represented. Whereas the standard definitions are out of scope, an in depth description is given here. For this tutorial, we will focus on a few of the existing terms to learn some elements about data cleaning:

  • eventDate: ISO 6801 format of dates
  • scientificName: the accepted scientific name of the species
  • decimalLatitude/decimalLongitude: coordinates of the occurrence in WGS84 format
  • sex: either male or female to characterise the sex of the occurrence
  • occurrenceID: a identifier within the dataset to identify the individual records
  • datasetName: a static string defining the source of the data

Futhermore, additional information concering the taxonomy will be added using an external API service

Dataset to work on:

For this dataset, the data is provided in the following main data files:

  • surveys.csv the data with the surveys observed in the individual plots
  • species.csv the overview list of the species shortnames
  • plot_location.xlsx the overview of coordinates of the individual locations

The data originates from a study of a Chihuahuan desert ecosystem near Portal, Arizona

Survey-data

Reading in the data of the individual surveys:


In [2]:
survey_data = pd.read_csv("../data/surveys.csv")

In [3]:
survey_data.head()


Out[3]:
record_id month day year plot species sex_char wgt
0 1 7 16 1977 2 NaN M NaN
1 2 7 16 1977 3 NaN M NaN
2 3 7 16 1977 2 DM F NaN
3 4 7 16 1977 7 DM M NaN
4 5 7 16 1977 3 DM M NaN
EXERCISE: How many individual records (occurrences) does the survey data set contain?

In [4]:
len(survey_data)


Out[4]:
35549

Adding the data source information as static column

For convenience when this dataset will be combined with other datasets, we first add a column of static values, defining the datasetName of this particular data:


In [5]:
datasetname = "Ecological Archives E090-118-D1."

Adding this static value as a new column datasetName:

EXERCISE: Add a new column, 'datasetName', to the survey data set with datasetname as value for all of the records (static value for the entire data set)

In [6]:
survey_data["datasetName"] = datasetname

Cleaning the sex_char column into a DwC called sex column

EXERCISE: Get a list of the unique values for the column sex_char. __Tip__, to find the unique values, look for a function called `unique`...

In [7]:
survey_data["sex_char"].unique().tolist()


Out[7]:
['M', 'F', nan, 'R', 'P', 'Z']

So, apparently, more information is provided in this column, whereas according to the metadata information, the sex information should be either M (male) or F (female). We will create a column, named sex and convert the symbols to the corresponding sex, taking into account the following mapping of the values (see metadata for more details):

  • M -> male
  • F -> female
  • R -> male
  • P -> female
  • Z -> nan

At the same time, we will save the original information of the sex_char in a separate column, called verbatimSex, as a reference.

In summary, we have to:

  • create a new column verbatimSex, which is a copy of the current sex_char column
  • create a new column with the name sex
  • map the original values of the sex_char to the values male and female according to the listing above

Converting the name of the column header sex_char to verbatimSex with the rename function:


In [8]:
survey_data = survey_data.rename(columns={'sex_char': 'verbatimSex'})
EXERCISE: Express the mapping of the the values (e.g. M -> male) into a dictionary object called sex_dict __Tip__: (1) a NaN-value can be defined as `np.nan`, (2) a dictionary is a Python data structure, https://docs.python.org/3/tutorial/datastructures.html#dictionaries

In [9]:
sex_dict = {"M": "male",
            "F": "female",
            "R": "male",
            "P": "female",
            "Z": np.nan}
EXERCISE: Use the dictionary to replace the values in the `verbatimSex` column to the new values according to the `sex_dict` mapping dictionary and save the mapped values in a new column 'sex'. __Tip__: to replace values using a mapping dictionary, look for a function called `replace`...

In [10]:
survey_data['sex'] = survey_data['verbatimSex'].replace(sex_dict)

Checking the current frequency of values (this should result in the values male, female and nan):


In [11]:
survey_data["sex"].unique()


Out[11]:
array(['male', 'female', nan], dtype=object)

To check what the frequency of occurrences is for male/female of the categories, a bar chart is an possible representation:

EXERCISE: Make a horizontal bar chart comparing the number of male, female and unknown (NaN) records in the dataset __Tip__: check in the help of the Pandas plot function for the `kind` parameter

In [12]:
survey_data["sex"].value_counts(dropna=False).plot(kind="barh", color="#00007f")


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f481027fa90>
NOTE: The usage of `groupby` combined with a `count` of each group would be an option as well. However, the latter does not support to count the `NaN` values as well. The `value_counts` method does support this with the `dropna=False` argument.

Solving double entry field by decoupling

When checking the species unique information:


In [13]:
survey_data["species"].unique()


Out[13]:
array([nan, 'DM', 'PF', 'PE', 'DM and SH', 'DS', 'PP', 'SH', 'OT', 'DO',
       'OX', 'SS', 'OL', 'RM', 'SA', 'PM', 'AH', 'DX', 'AB', 'CB', 'CM',
       'CQ', 'RF', 'PC', 'PG', 'PH', 'PU', 'CV', 'UR', 'UP', 'ZL', 'UL',
       'CS', 'SC', 'BA', 'SF', 'RO', 'AS', 'SO', 'PI', 'ST', 'CU', 'SU',
       'RX', 'PB', 'PL', 'PX', 'CT', 'US'], dtype=object)

In [14]:
survey_data.head(10)


Out[14]:
record_id month day year plot species verbatimSex wgt datasetName sex
0 1 7 16 1977 2 NaN M NaN Ecological Archives E090-118-D1. male
1 2 7 16 1977 3 NaN M NaN Ecological Archives E090-118-D1. male
2 3 7 16 1977 2 DM F NaN Ecological Archives E090-118-D1. female
3 4 7 16 1977 7 DM M NaN Ecological Archives E090-118-D1. male
4 5 7 16 1977 3 DM M NaN Ecological Archives E090-118-D1. male
5 6 7 16 1977 1 PF M NaN Ecological Archives E090-118-D1. male
6 7 7 16 1977 2 PE F NaN Ecological Archives E090-118-D1. female
7 8 7 16 1977 1 DM M NaN Ecological Archives E090-118-D1. male
8 9 7 16 1977 1 DM and SH NaN NaN Ecological Archives E090-118-D1. NaN
9 10 7 16 1977 6 PF F NaN Ecological Archives E090-118-D1. female

There apparently exists a double entry: 'DM and SH', which basically defines two records and should be decoupled to two individual records (i.e. rows). Hence, we should be able to create a additional row based on this split. To do so, Pandas provides a dedicated function since version 0.25, called explode. Starting from a small subset example:


In [15]:
example = survey_data.loc[7:10, "species"]
example


Out[15]:
7            DM
8     DM and SH
9            PF
10           DS
Name: species, dtype: object

Using the split method on strings, we can split the string using a given character, in this case the word and:


In [16]:
example.str.split("and")


Out[16]:
7           [DM]
8     [DM ,  SH]
9           [PF]
10          [DS]
Name: species, dtype: object

The explode method will create a row for each element in the list:


In [17]:
example_split = example.str.split("and").explode()
example_split


Out[17]:
7      DM
8     DM 
8      SH
9      PF
10     DS
Name: species, dtype: object

Hence, the DM and SH are now enlisted in separate rows. Other rows remain unchanged. The only remaining issue is the spaces around the characters:


In [18]:
example_split.iloc[1], example_split.iloc[2]


Out[18]:
('DM ', ' SH')

Which we can solve again using the string method strip, removing the spaces before and after the characters:


In [19]:
example_split.str.strip()


Out[19]:
7     DM
8     DM
8     SH
9     PF
10    DS
Name: species, dtype: object

To make this reusable, let's create a dedicated function to combine these steps, called solve_double_field_entry:


In [20]:
def solve_double_field_entry(df, keyword="and", column="verbatimEventDate"):
    """split on keyword in column for an enumeration and create extra record
    
    Parameters
    ----------
    df: pd.DataFrame
        DataFrame with a double field entry in one or more values
    keyword: str
        word/character to split the double records on
    column: str
        column name to use for the decoupling of the records
    """
    df[column] = df[column].str.split(keyword)
    df = df.explode(column)
    df[column] = df[column].str.strip()  # remove white space around the words
    return df

The function takes a DataFrame as input, splits the record into separate rows and returns the updated DataFrame. We can use this function to get an update of the dataFrame, with the an additional row (occurrence) added by decoupling the specific field:

EXERCISE: Use the function solve_double_field_entry to create a dataFrame with an additional row, by decoupling the double entries. Save the result as a variable survey_data_decoupled.

In [21]:
survey_data_decoupled = solve_double_field_entry(survey_data.copy(), 
                                                 "and", 
                                                 column="species") # get help of the function by SHIFT + TAB
# REMARK: the copy() statement here (!) see pandas_03b_indexing.ipynb notebook `chained indexing section`

In [22]:
survey_data_decoupled["species"].unique()


Out[22]:
array([nan, 'DM', 'PF', 'PE', 'SH', 'DS', 'PP', 'OT', 'DO', 'OX', 'SS',
       'OL', 'RM', 'SA', 'PM', 'AH', 'DX', 'AB', 'CB', 'CM', 'CQ', 'RF',
       'PC', 'PG', 'PH', 'PU', 'CV', 'UR', 'UP', 'ZL', 'UL', 'CS', 'SC',
       'BA', 'SF', 'RO', 'AS', 'SO', 'PI', 'ST', 'CU', 'SU', 'RX', 'PB',
       'PL', 'PX', 'CT', 'US'], dtype=object)

In [23]:
survey_data_decoupled.head(11)


Out[23]:
record_id month day year plot species verbatimSex wgt datasetName sex
0 1 7 16 1977 2 NaN M NaN Ecological Archives E090-118-D1. male
1 2 7 16 1977 3 NaN M NaN Ecological Archives E090-118-D1. male
2 3 7 16 1977 2 DM F NaN Ecological Archives E090-118-D1. female
3 4 7 16 1977 7 DM M NaN Ecological Archives E090-118-D1. male
4 5 7 16 1977 3 DM M NaN Ecological Archives E090-118-D1. male
5 6 7 16 1977 1 PF M NaN Ecological Archives E090-118-D1. male
6 7 7 16 1977 2 PE F NaN Ecological Archives E090-118-D1. female
7 8 7 16 1977 1 DM M NaN Ecological Archives E090-118-D1. male
8 9 7 16 1977 1 DM NaN NaN Ecological Archives E090-118-D1. NaN
8 9 7 16 1977 1 SH NaN NaN Ecological Archives E090-118-D1. NaN
9 10 7 16 1977 6 PF F NaN Ecological Archives E090-118-D1. female

Create new occurrence identifier

The record_id is no longer a unique identifier after the decoupling of this dataset. We will make a new dataset-specific identifier, by adding a column called occurrenceID that takes a new counter as identifier. As a simply and straightforward approach, we will use a new counter for the whole dataset, starting with 1:


In [24]:
np.arange(1, len(survey_data_decoupled) + 1, 1)


Out[24]:
array([    1,     2,     3, ..., 35548, 35549, 35550])

Create a new column with header occurrenceID with the values 1 -> 35550 as field values:


In [25]:
survey_data_decoupled["occurrenceID"] = np.arange(1, len(survey_data_decoupled) + 1, 1)
Remark: A reset of the index to generate this column with `reset_index(drop=False)` would be technically perfectly valid. Still, we want the indices to start at 1 instead of 0 (and Python starts counting at 0!)

To overcome the confusion on having both a record_id and occurrenceID field, we will remove the record_id term:


In [26]:
survey_data_decoupled = survey_data_decoupled.drop(columns="record_id")

Hence, columns can be drop-ped out of a DataFrame


In [27]:
survey_data_decoupled.head(10)


Out[27]:
month day year plot species verbatimSex wgt datasetName sex occurrenceID
0 7 16 1977 2 NaN M NaN Ecological Archives E090-118-D1. male 1
1 7 16 1977 3 NaN M NaN Ecological Archives E090-118-D1. male 2
2 7 16 1977 2 DM F NaN Ecological Archives E090-118-D1. female 3
3 7 16 1977 7 DM M NaN Ecological Archives E090-118-D1. male 4
4 7 16 1977 3 DM M NaN Ecological Archives E090-118-D1. male 5
5 7 16 1977 1 PF M NaN Ecological Archives E090-118-D1. male 6
6 7 16 1977 2 PE F NaN Ecological Archives E090-118-D1. female 7
7 7 16 1977 1 DM M NaN Ecological Archives E090-118-D1. male 8
8 7 16 1977 1 DM NaN NaN Ecological Archives E090-118-D1. NaN 9
8 7 16 1977 1 SH NaN NaN Ecological Archives E090-118-D1. NaN 10

Converting the date values

In the survey-dataset we received, the month, day, and year columns are containing the information about the date, i.e. eventDate in DarwinCore terms. We want this data in a ISO format YYYY-MM-DD. A convenvient Pandas function is the usage of to_datatime, which provides multiple options to interpret dates. One of thes options is the automatic interpretation of some 'typical' columns, like year, month and day, when passing a DataFrame.


In [28]:
# pd.to_datetime(survey_data_decoupled[["year", "month", "day"]])  # uncomment the line and test this statement

This is not working, not all dates can be interpreted... We should get some more information on the reason of the errors. By using the option coerce, the problem makers will be labeled as a missing value NaT. We can count the number of dates that can not be interpreted:


In [29]:
sum(pd.to_datetime(survey_data_decoupled[["year", "month", "day"]], errors='coerce').isnull())


Out[29]:
136
EXERCISE: Make a subselection of survey_data_decoupled containing those records that can not correctly be interpreted as date values and save the resulting dataframe as variable trouble_makers

In [30]:
mask = pd.to_datetime(survey_data_decoupled[["year", "month", "day"]], errors='coerce').isnull()
trouble_makers = survey_data_decoupled[mask]

Checking some charactersitics of the trouble_makers:


In [31]:
trouble_makers.head()


Out[31]:
month day year plot species verbatimSex wgt datasetName sex occurrenceID
30649 4 31 2000 6 PP F 19.0 Ecological Archives E090-118-D1. female 30651
30650 4 31 2000 6 PB M 32.0 Ecological Archives E090-118-D1. male 30652
30651 4 31 2000 6 PB F 30.0 Ecological Archives E090-118-D1. female 30653
30652 4 31 2000 6 PP M 20.0 Ecological Archives E090-118-D1. male 30654
30653 4 31 2000 6 PP M 24.0 Ecological Archives E090-118-D1. male 30655

In [32]:
trouble_makers["day"].unique()


Out[32]:
array([31])

In [33]:
trouble_makers["month"].unique()


Out[33]:
array([4, 9])

In [34]:
trouble_makers["year"].unique()


Out[34]:
array([2000])

So, basically the problem is the presence of day 31 during the months April and September of the year 2000. At this moment, we would have to recheck the original data in order to know how the issue could be solved. Apparently, - for this specific case - there has been a data-entry problem in 2000, making the 31 days during this period should actually be 30. It would be optimal to correct this in the source dataset, but for the further exercise, it will be corrected here.

EXERCISE: Replace in the dataFrame survey_data_decoupled all of the troublemakers day values into the value 30

In [35]:
mask = pd.to_datetime(survey_data_decoupled[["year", "month", "day"]], errors='coerce').isnull()
survey_data_decoupled.loc[mask, "day"] = 30

Now, we do the parsing again to create a proper eventDate field, containing the dates:


In [36]:
survey_data_decoupled["eventDate"] = \
    pd.to_datetime(survey_data_decoupled[["year", "month", "day"]])

Just let's do a check the amount of data for each year:

EXERCISE: Create a horizontal bar chart with the number of records for each year

In [37]:
survey_data_decoupled.groupby("year").size().plot(kind='barh', color="#00007f", figsize=(10, 10))


Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f480f9c2310>

In [38]:
survey_data_decoupled.head()


Out[38]:
month day year plot species verbatimSex wgt datasetName sex occurrenceID eventDate
0 7 16 1977 2 NaN M NaN Ecological Archives E090-118-D1. male 1 1977-07-16
1 7 16 1977 3 NaN M NaN Ecological Archives E090-118-D1. male 2 1977-07-16
2 7 16 1977 2 DM F NaN Ecological Archives E090-118-D1. female 3 1977-07-16
3 7 16 1977 7 DM M NaN Ecological Archives E090-118-D1. male 4 1977-07-16
4 7 16 1977 3 DM M NaN Ecological Archives E090-118-D1. male 5 1977-07-16

Currently, the dates are stored in a python specific date format:


In [39]:
survey_data_decoupled["eventDate"].dtype


Out[39]:
dtype('<M8[ns]')

This is great, because it allows for many functionalities:


In [40]:
survey_data_decoupled.eventDate.dt #add a dot and press TAB to explore the date options it provides


Out[40]:
<pandas.core.indexes.accessors.DatetimeProperties object at 0x7f480f9a3890>
EXERCISE: Create a horizontal bar chart with the number of records for each year (cfr. supra), but without using the column `year`

In [41]:
survey_data_decoupled.groupby(survey_data_decoupled["eventDate"].dt.year).size().plot(kind='barh', color="#00007f", figsize=(10, 10))


Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f480f73ded0>

So, we actually do not need the day, month, year columns anymore and have other options available as well

EXERCISE: Create a bar chart with the number of records for each weekday

In [42]:
nrecords_by_weekday = survey_data_decoupled.groupby(survey_data_decoupled["eventDate"].dt.weekday).size()
ax = nrecords_by_weekday.plot(kind="barh", color="#00007f", figsize=(6, 6))
# I you want to represent the ticklabels as proper names, uncomment the following line:
#ticklabels = ax.set_yticklabels(["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])


When saving the information to a file (e.g. CSV-file), this data type will be automatically converted to a string representation. However, we could also decide to explicitly provide the string format the dates are stored (losing the date type functionalities), in order to have full control on the way these dates are formatted:


In [43]:
survey_data_decoupled["eventDate"] = survey_data_decoupled["eventDate"].dt.strftime('%Y-%m-%d')

In [44]:
survey_data_decoupled["eventDate"].head()


Out[44]:
0    1977-07-16
1    1977-07-16
2    1977-07-16
3    1977-07-16
4    1977-07-16
Name: eventDate, dtype: object

As we do not need the day, month, year columns anymore, we can drop them from the DataFrame:

EXERCISE: Remove the columns day, month and year from the `survey_data_decoupled` DataFrame: __Tip__: Remember the `drop` method?

In [45]:
survey_data_decoupled = survey_data_decoupled.drop(columns=["day", "month", "year"])

Add coordinates from the plot locations

Loading the coordinate data

The individual plots are only identified by a plot identification number. In order to provide sufficient information to external users, additional information about the coordinates should be added. The coordinates of the individual plots are saved in another file: plot_location.xlsx. We will use this information to further enrich our dataset and add the Darwin Core Terms decimalLongitude and decimalLatitude.

EXERCISE: Read in the excel file 'plot_location.xlsx' and store the data as the variable `plot_data`, with 3 columns: plot, xutm, yutm.

In [46]:
plot_data = pd.read_excel("../data/plot_location.xlsx", skiprows=3, index_col=0)

In [47]:
plot_data.head()


Out[47]:
plot xutm yutm
0 1 681222.131658 3.535262e+06
1 2 681302.799361 3.535268e+06
2 3 681375.294968 3.535270e+06
3 4 681450.837525 3.535271e+06
4 5 681526.983040 3.535281e+06

Transforming to other coordinate reference system

These coordinates are in meters, more specifically in UTM 12 N coordinate system. However, the agreed coordinate system for Darwin Core is the World Geodetic System 1984 (WGS84).

As this is not a GIS course, we will shortcut the discussion about different projection systems, but provide an example on how such a conversion from UTM12N to WGS84 can be performed with the projection toolkit pyproj and by relying on the existing EPSG codes (a registry originally setup by the association of oil & gas producers).

First, we define out two projection systems, using their corresponding EPSG codes:


In [48]:
import pyproj

In [49]:
utm12n = pyproj.Proj("+init=EPSG:32612")
wgs84 = pyproj.Proj("+init=EPSG:4326")

The reprojection can be done by the function transform of the projection toolkit, providing the coordinate systems and a set of x, y coordinates. For example, for a single coordinate, this can be applied as follows:


In [50]:
pyproj.transform(utm12n, wgs84, 681222.131658, 3.535262e+06)


Out[50]:
(-109.08282902317859, 31.938849883722508)

Instead of writing a for loop to do this for each of the coordinates in the list, we can apply this function to each of them:

EXERCISE: Apply the pyproj function transform to plot_data, using the columns xutm and yutm and save the resulting output in 2 new columns, called decimalLongitude and decimalLatitude:
  • Create a function transform_utm_to_wgs that takes a row of a DataFrame and returns a Series of two elements with the longitude and latitude.
  • Test this function on the first row of plot_data
  • Now apply this function on all rows (remember the axis parameter)
  • Assign the result of the previous step to decimalLongitude and decimalLatitude columns

In [51]:
def transform_utm_to_wgs(row):
    """
    Converts the x and y coordinates of this row into a Series of the
    longitude and latitude.
    
    """
    utm12n = pyproj.Proj("+init=EPSG:32612")
    wgs84 = pyproj.Proj("+init=EPSG:4326")
    
    return pd.Series(pyproj.transform(utm12n, wgs84, row['xutm'], row['yutm']))

In [52]:
transform_utm_to_wgs(plot_data.loc[0])


Out[52]:
0   -109.082829
1     31.938851
dtype: float64

In [53]:
plot_data.apply(transform_utm_to_wgs, axis=1)


Out[53]:
0 1
0 -109.082829 31.938851
1 -109.081975 31.938887
2 -109.081208 31.938896
3 -109.080409 31.938894
4 -109.079602 31.938970
5 -109.078836 31.939078
6 -109.082816 31.938113
7 -109.081680 31.937884
8 -109.080903 31.937859
9 -109.080091 31.938017
10 -109.079307 31.938056
11 -109.078519 31.938203
12 -109.082613 31.937028
13 -109.081827 31.937054
14 -109.081036 31.937059
15 -109.080244 31.937094
16 -109.079415 31.937117
17 -109.078633 31.937126
18 -109.077912 31.937438
19 -109.080191 31.936334
20 -109.079398 31.936448
21 -109.078602 31.936441
22 -109.077838 31.936763
23 -109.077736 31.938560

In [54]:
plot_data[["decimalLongitude" ,"decimalLatitude"]] = plot_data.apply(transform_utm_to_wgs, axis=1)

In [55]:
plot_data.head()


Out[55]:
plot xutm yutm decimalLongitude decimalLatitude
0 1 681222.131658 3.535262e+06 -109.082829 31.938851
1 2 681302.799361 3.535268e+06 -109.081975 31.938887
2 3 681375.294968 3.535270e+06 -109.081208 31.938896
3 4 681450.837525 3.535271e+06 -109.080409 31.938894
4 5 681526.983040 3.535281e+06 -109.079602 31.938970

The above function transform_utm_to_wgs you have created is a very specific function that knows the structure of the DataFrame you will apply it to (it assumes the 'xutm' and 'yutm' column names). We could also make a more generic function that just takes a X and Y coordinate and returns the Series of converted coordinates (transform_utm_to_wgs2(X, Y)).

To apply such a more generic function to the plot_data DataFrame, we can make use of the lambda construct, which lets you specify a function on one line as an argument:

plot_data.apply(lambda row : transform_utm_to_wgs2(row['xutm'], row['yutm']), axis=1)

If you have time, try to implement this function and test it as well.

(intermezzo) Checking the coordinates on a map

To check the transformation, let's put these on an interactive map. Leaflet is a famous service for this and in many programming languages wrappers do exist to simplify the usage. Folium is an extensive library providing multiple options. As we just want to do a quick checkup of the coordinates, we will rely on the package mplleaflet, which just converts a matplotlib image to a leaflet map:


In [56]:
import mplleaflet  # https://github.com/jwass/mplleaflet

In [57]:
fig, ax = plt.subplots(figsize=(5, 8))
plt.plot(plot_data['decimalLongitude'], plot_data['decimalLatitude'], 'rs')

mplleaflet.display(fig=fig) # zoom out to see where the measurement plot is located


/home/stijnvanhoey/miniconda3/envs/DS-python-data-analysis/lib/python3.7/site-packages/IPython/core/display.py:701: UserWarning: Consider using IPython.display.IFrame instead
  warnings.warn("Consider using IPython.display.IFrame instead")
Out[57]:

Join the coordinate information to the survey data set

All points are inside the desert region as we expected, so we can extend our survey dataset with this coordinate information. Making the combination of two data sets based on a common identifier is completely similar to the usage of JOIN operations in databases. In Pandas, this functionality is provided by pd.merge.

In practice, we have to add the columns decimalLongitude/decimalLatitude to the current dataset survey_data_decoupled, by using the plot identification number as key to join.

EXERCISE: Extract only the columns to join to our survey dataset: the plot identifiers, decimalLatitude and decimalLongitude into a new variable named plot_data_selection

In [58]:
plot_data_selection = plot_data[["plot", "decimalLongitude", "decimalLatitude"]]
EXERCISE: Based on the documentation of Pandas merge, add the coordinate information (plot_data_selection) to the survey data set and save the resulting DataFrame as survey_data_plots. __Tip__: documentation of [merge](http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-DataFrame-joining-merging)...

In [59]:
survey_data_plots = pd.merge(survey_data_decoupled, plot_data_selection, 
                             how="left", on="plot")

In [60]:
survey_data_plots.head()


Out[60]:
plot species verbatimSex wgt datasetName sex occurrenceID eventDate decimalLongitude decimalLatitude
0 2 NaN M NaN Ecological Archives E090-118-D1. male 1 1977-07-16 -109.081975 31.938887
1 3 NaN M NaN Ecological Archives E090-118-D1. male 2 1977-07-16 -109.081208 31.938896
2 2 DM F NaN Ecological Archives E090-118-D1. female 3 1977-07-16 -109.081975 31.938887
3 7 DM M NaN Ecological Archives E090-118-D1. male 4 1977-07-16 -109.082816 31.938113
4 3 DM M NaN Ecological Archives E090-118-D1. male 5 1977-07-16 -109.081208 31.938896

The plot locations need to be stored with the variable name verbatimLocality indicating th identifier as integer value of the plot:


In [61]:
survey_data_plots = survey_data_plots.rename(columns={'plot': 'verbatimLocality'})

Add species names to dataset

The column species only provides a short identifier in the survey overview. The name information is stored in a separate file species.csv. As we want our dataset to include this information, we will read in this data and add it to our survey dataset:

EXERCISE: Read in the 'species.csv' file and save the resulting DataFrame as variable species_data __Tip__: check the delimiter (`sep`) to define

In [62]:
species_data = pd.read_csv("../data/species.csv", sep=";")

In [63]:
species_data.head()


Out[63]:
species_id genus species taxa
0 AB Amphispiza bilineata Bird
1 AH Ammospermophilus harrisi Rodent-not censused
2 AS Ammodramus savannarum Bird
3 BA Baiomys taylori Rodent
4 CB Campylorhynchus brunneicapillus Bird

Fix a wrong acronym naming

When reviewing the metadata, you see that in the data-file the acronym NE is used to describe Neotoma albigula, whereas in the metadata description, the acronym NA is used.

EXERCISE: Convert the value of 'NE' to 'NA' by using boolean indexing for the `species_id` column

In [64]:
species_data.loc[species_data["species_id"] == "NE", "species_id"] = "NA"

(At the same time, you decide to cure this problem at the source and alert the data provider about this issue.)

Merging surveys and species

As we now prepared the two series, we can combine the data, using the merge operation. Take into account that our key-column is different for species_data and survey_data_plots, respectively species_id and species:

We want to add the data of the species to the survey data, in order to see the full species names:

EXERCISE: Merge the `survey_data_plots` data set with the `species_data` information in order to pass the species information to the survey data:

In [65]:
survey_data_species = pd.merge(survey_data_plots, species_data, how="left",  # LEFT OR INNER?
                                left_on="species", right_on="species_id")

In [66]:
len(survey_data_species)


Out[66]:
35550

The join is ok, but we are left with some redundant columns and wrong naming:


In [67]:
survey_data_species.head()


Out[67]:
verbatimLocality species_x verbatimSex wgt datasetName sex occurrenceID eventDate decimalLongitude decimalLatitude species_id genus species_y taxa
0 2 NaN M NaN Ecological Archives E090-118-D1. male 1 1977-07-16 -109.081975 31.938887 NaN NaN NaN NaN
1 3 NaN M NaN Ecological Archives E090-118-D1. male 2 1977-07-16 -109.081208 31.938896 NaN NaN NaN NaN
2 2 DM F NaN Ecological Archives E090-118-D1. female 3 1977-07-16 -109.081975 31.938887 DM Dipodomys merriami Rodent
3 7 DM M NaN Ecological Archives E090-118-D1. male 4 1977-07-16 -109.082816 31.938113 DM Dipodomys merriami Rodent
4 3 DM M NaN Ecological Archives E090-118-D1. male 5 1977-07-16 -109.081208 31.938896 DM Dipodomys merriami Rodent

We do not need the columns species_x and species_id column anymore, as we will use the scientific names from now on:


In [68]:
survey_data_species = survey_data_species.drop(["species_x", "species_id"], axis=1)

The column species_y could just be named species:


In [69]:
survey_data_species = survey_data_species.rename(columns={"species_y": "species"})

In [70]:
survey_data_species.head()


Out[70]:
verbatimLocality verbatimSex wgt datasetName sex occurrenceID eventDate decimalLongitude decimalLatitude genus species taxa
0 2 M NaN Ecological Archives E090-118-D1. male 1 1977-07-16 -109.081975 31.938887 NaN NaN NaN
1 3 M NaN Ecological Archives E090-118-D1. male 2 1977-07-16 -109.081208 31.938896 NaN NaN NaN
2 2 F NaN Ecological Archives E090-118-D1. female 3 1977-07-16 -109.081975 31.938887 Dipodomys merriami Rodent
3 7 M NaN Ecological Archives E090-118-D1. male 4 1977-07-16 -109.082816 31.938113 Dipodomys merriami Rodent
4 3 M NaN Ecological Archives E090-118-D1. male 5 1977-07-16 -109.081208 31.938896 Dipodomys merriami Rodent

In [71]:
len(survey_data_species)


Out[71]:
35550

Let's now save our cleaned-up to a csv file, so we can further analyze the data in a following notebook:


In [72]:
survey_data_species.to_csv("interim_survey_data_species.csv", index=False)

(OPTIONAL SECTION) Using a API service to match the scientific names

As the current species names are rather short and could eventually lead to confusion when shared with other users, retrieving additional information about the different species in our dataset would be useful to integrate our work with other research. An option is to match our names with an external service to request additional information about the different species.

One of these services is GBIF API. The service can most easily be illustrated with a small example:

In a new tabblad of the browser, go to the URL http://www.gbif.org/species/2475532, which corresponds to the page of Alcedo atthis (ijsvogel in dutch). One could for each of the species in the list we have do a search on the website of GBIF to find the corresponding page of the different species, from which more information can be extracted manually. However, this would take a lot of time...

Therefore, GBIF (and many other organisations!) provides a service to extract the same information on a machine-readable way, in order to automate these searches. As an example, let's search for the information of Alcedo atthis, using the GBIF API: Go to the URL: http://api.gbif.org/v1/species/match?name=Alcedo atthis and check the output. What we did is a machine-based search on the GBIF website for information about Alcedo atthis.

The same can be done using Python. The main library we need to this kind of automated searches is the requests package, which can be used to do request to any kind of API out there.


In [73]:
import requests

Example matching with Alcedo Atthis

For the example of Alcedo atthis:


In [74]:
species_name = 'Alcedo atthis'

In [75]:
base_string = 'http://api.gbif.org/v1/species/match?'
request_parameters = {'verbose': False, 'strict': True, 'name': species_name}
message = requests.get(base_string, params=request_parameters).json()
message


Out[75]:
{'usageKey': 2475532,
 'scientificName': 'Alcedo atthis (Linnaeus, 1758)',
 'canonicalName': 'Alcedo atthis',
 'rank': 'SPECIES',
 'status': 'ACCEPTED',
 'confidence': 99,
 'matchType': 'EXACT',
 'kingdom': 'Animalia',
 'phylum': 'Chordata',
 'order': 'Coraciiformes',
 'family': 'Alcedinidae',
 'genus': 'Alcedo',
 'species': 'Alcedo atthis',
 'kingdomKey': 1,
 'phylumKey': 44,
 'classKey': 212,
 'orderKey': 1447,
 'familyKey': 2984,
 'genusKey': 2475493,
 'speciesKey': 2475532,
 'synonym': False,
 'class': 'Aves'}

From which we get a dictionary containing more information about the taxonomy of the Alcedo atthis.

In the species data set available, the name to match is provided in the combination of two columns, so we have to combine those to in order to execute the name matching:


In [76]:
genus_name = "Callipepla"
species_name = "squamata"
name_to_match = '{} {}'.format(genus_name, species_name)
base_string = 'http://api.gbif.org/v1/species/match?'
request_parameters = {'strict': True, 'name': name_to_match} # use strict matching(!)
message = requests.get(base_string, params=request_parameters).json()
message


Out[76]:
{'usageKey': 5228075,
 'scientificName': 'Callipepla squamata (Vigors, 1830)',
 'canonicalName': 'Callipepla squamata',
 'rank': 'SPECIES',
 'status': 'ACCEPTED',
 'confidence': 99,
 'matchType': 'EXACT',
 'kingdom': 'Animalia',
 'phylum': 'Chordata',
 'order': 'Galliformes',
 'family': 'Odontophoridae',
 'genus': 'Callipepla',
 'species': 'Callipepla squamata',
 'kingdomKey': 1,
 'phylumKey': 44,
 'classKey': 212,
 'orderKey': 723,
 'familyKey': 9325,
 'genusKey': 2474205,
 'speciesKey': 5228075,
 'synonym': False,
 'class': 'Aves'}

To apply this on our species data set, we will have to do this request for each of the individual species/genus combination. As, this is a returning functionality, we will write a small function to do this:

Writing a custom matching function

EXERCISE: Write a function, called `name_match` that takes the `genus`, the `species` and the option to perform a strict matching or not as inputs, performs a matching with the GBIF name matching API and return the received message as a dictionary.

In [77]:
def name_match(genus_name, species_name, strict=True):
    """
    Perform a GBIF name matching using the species and genus names
    
    Parameters
    ----------
    genus_name: str
        name of the genus of the species
    species_name: str
        name of the species to request more information
    strict: boolean
        define if the mathing need to be performed with the strict 
        option (True) or not (False)
    
    Returns
    -------
    message: dict
        dictionary with the information returned by the GBIF matching service
    """
    name = '{} {}'.format(genus_name, species_name)
    base_string = 'http://api.gbif.org/v1/species/match?'
    request_parameters = {'strict': strict, 'name': name} # use strict matching(!)
    message = requests.get(base_string, params=request_parameters).json()
    return message
NOTE: For many of these API request handling, dedicated packages do exist, e.g. pygbif provides different functions to do requests to the GBIF API, basically wrapping the request possibilities. For any kind of service, just ask yourself: is the dedicated library providing sufficient additional advantage, or can I easily setup the request myself. (or sometimes: for which the documentation is the best...)

Many services do exist for a wide range of applications, e.g. scientific name matching, matching of addresses, downloading of data,...

Testing our custom matching function:


In [78]:
genus_name = "Callipepla"
species_name = "squamata"
name_match(genus_name, species_name, strict=True)


Out[78]:
{'usageKey': 5228075,
 'scientificName': 'Callipepla squamata (Vigors, 1830)',
 'canonicalName': 'Callipepla squamata',
 'rank': 'SPECIES',
 'status': 'ACCEPTED',
 'confidence': 99,
 'matchType': 'EXACT',
 'kingdom': 'Animalia',
 'phylum': 'Chordata',
 'order': 'Galliformes',
 'family': 'Odontophoridae',
 'genus': 'Callipepla',
 'species': 'Callipepla squamata',
 'kingdomKey': 1,
 'phylumKey': 44,
 'classKey': 212,
 'orderKey': 723,
 'familyKey': 9325,
 'genusKey': 2474205,
 'speciesKey': 5228075,
 'synonym': False,
 'class': 'Aves'}

However, the matching won't provide an answer for every search:


In [79]:
genus_name = "Lizard"
species_name = "sp."
name_match(genus_name, species_name, strict=True)


Out[79]:
{'confidence': 100, 'matchType': 'NONE', 'synonym': False}

Match each of the species names of the survey data set

Hence, in order to add this information to our survey DataFrame, we need to perform the following steps:

  1. extract the unique genus/species combinations in our dataset and combine them in single column
  2. match each of these names to the GBIF API service
  3. process the returned message:
    • if a match is found, add the information of the columns 'class', 'kingdom', 'order', 'phylum', 'scientificName', 'status' and 'usageKey'
    • if no match was found: nan-values
  4. Join the DataFrame of unique genus/species information with the enriched GBIF info to the survey_data_species data set
EXERCISE: Extract the unique combinations of genus and species in the `survey_data_species` using the function drop_duplicates(). Save the result as the variable unique_species

In [80]:
#%%timeit
unique_species = survey_data_species[["genus", "species"]].drop_duplicates().dropna()

In [81]:
len(unique_species)


Out[81]:
47
EXERCISE: Extract the unique combinations of genus and species in the `survey_data_species` using groupby. Save the result as the variable unique_species

In [82]:
#%%timeit
unique_species = \
    survey_data_species.groupby(["genus", "species"]).first().reset_index()[["genus", "species"]]

In [83]:
len(unique_species)


Out[83]:
47
EXERCISE: Combine the columns genus and species to a single column with the complete name, save it in a new column named 'name'

In [84]:
unique_species["name"] = unique_species["genus"] + " " + unique_species["species"] 
# an alternative approach worthwhile to know:
#unique_species["name"] = unique_species["genus"].str.cat(unique_species["species"], " ")

In [85]:
unique_species.head()


Out[85]:
genus species name
0 Ammodramus savannarum Ammodramus savannarum
1 Ammospermophilus harrisi Ammospermophilus harrisi
2 Amphispiza bilineata Amphispiza bilineata
3 Baiomys taylori Baiomys taylori
4 Calamospiza melanocorys Calamospiza melanocorys

To perform the matching for each of the combination, different options do exist.

Just to show the possibility of using for loops, the addition of the matched information will be done as such. First, we will store everything in one dictionary, where the keys of the dictionary are the index values of unique_species (in order to later merge them again) and the values are the entire messages (which are dictionaries aon itself). The format will look as following:

species_annotated = {O: {'canonicalName': 'Squamata', 'class': 'Reptilia', 'classKey': 358, ...}, 
                     1: {'canonicalName':...},
                     2:...}

In [86]:
species_annotated = {}
for key, row in unique_species.iterrows():
    species_annotated[key] = name_match(row["genus"], row["species"], strict=True)

In [87]:
species_annotated


Out[87]:
{0: {'usageKey': 2491123,
  'scientificName': 'Ammodramus savannarum (J.F.Gmelin, 1789)',
  'canonicalName': 'Ammodramus savannarum',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Passeriformes',
  'family': 'Emberizidae',
  'genus': 'Ammodramus',
  'species': 'Ammodramus savannarum',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 212,
  'orderKey': 729,
  'familyKey': 9608,
  'genusKey': 2491079,
  'speciesKey': 2491123,
  'synonym': False,
  'class': 'Aves'},
 1: {'usageKey': 2437568,
  'scientificName': 'Ammospermophilus harrisii (Audubon & Bachman, 1854)',
  'canonicalName': 'Ammospermophilus harrisii',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 96,
  'matchType': 'FUZZY',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Rodentia',
  'family': 'Sciuridae',
  'genus': 'Ammospermophilus',
  'species': 'Ammospermophilus harrisii',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 1459,
  'familyKey': 9456,
  'genusKey': 2437567,
  'speciesKey': 2437568,
  'synonym': False,
  'class': 'Mammalia'},
 2: {'usageKey': 2491757,
  'scientificName': 'Amphispiza bilineata (Cassin, 1850)',
  'canonicalName': 'Amphispiza bilineata',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Passeriformes',
  'family': 'Emberizidae',
  'genus': 'Amphispiza',
  'species': 'Amphispiza bilineata',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 212,
  'orderKey': 729,
  'familyKey': 9608,
  'genusKey': 2491750,
  'speciesKey': 2491757,
  'synonym': False,
  'class': 'Aves'},
 3: {'usageKey': 2438866,
  'scientificName': 'Baiomys taylori (Thomas, 1887)',
  'canonicalName': 'Baiomys taylori',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Rodentia',
  'family': 'Cricetidae',
  'genus': 'Baiomys',
  'species': 'Baiomys taylori',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 1459,
  'familyKey': 3240723,
  'genusKey': 2438864,
  'speciesKey': 2438866,
  'synonym': False,
  'class': 'Mammalia'},
 4: {'usageKey': 2491893,
  'scientificName': 'Calamospiza melanocorys Stejneger, 1885',
  'canonicalName': 'Calamospiza melanocorys',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Passeriformes',
  'family': 'Emberizidae',
  'genus': 'Calamospiza',
  'species': 'Calamospiza melanocorys',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 212,
  'orderKey': 729,
  'familyKey': 9608,
  'genusKey': 2491892,
  'speciesKey': 2491893,
  'synonym': False,
  'class': 'Aves'},
 5: {'usageKey': 5228075,
  'scientificName': 'Callipepla squamata (Vigors, 1830)',
  'canonicalName': 'Callipepla squamata',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Galliformes',
  'family': 'Odontophoridae',
  'genus': 'Callipepla',
  'species': 'Callipepla squamata',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 212,
  'orderKey': 723,
  'familyKey': 9325,
  'genusKey': 2474205,
  'speciesKey': 5228075,
  'synonym': False,
  'class': 'Aves'},
 6: {'usageKey': 5231474,
  'scientificName': 'Campylorhynchus brunneicapillus (Lafresnaye, 1835)',
  'canonicalName': 'Campylorhynchus brunneicapillus',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Passeriformes',
  'family': 'Troglodytidae',
  'genus': 'Campylorhynchus',
  'species': 'Campylorhynchus brunneicapillus',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 212,
  'orderKey': 729,
  'familyKey': 9355,
  'genusKey': 2493929,
  'speciesKey': 5231474,
  'synonym': False,
  'class': 'Aves'},
 7: {'usageKey': 2439581,
  'scientificName': 'Chaetodipus baileyi (Merriam, 1894)',
  'canonicalName': 'Chaetodipus baileyi',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Rodentia',
  'family': 'Heteromyidae',
  'genus': 'Chaetodipus',
  'species': 'Chaetodipus baileyi',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 1459,
  'familyKey': 5504,
  'genusKey': 2439576,
  'speciesKey': 2439581,
  'synonym': False,
  'class': 'Mammalia'},
 8: {'usageKey': 2439589,
  'scientificName': 'Chaetodipus intermedius (Merriam, 1889)',
  'canonicalName': 'Chaetodipus intermedius',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Rodentia',
  'family': 'Heteromyidae',
  'genus': 'Chaetodipus',
  'species': 'Chaetodipus intermedius',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 1459,
  'familyKey': 5504,
  'genusKey': 2439576,
  'speciesKey': 2439589,
  'synonym': False,
  'class': 'Mammalia'},
 9: {'usageKey': 2439591,
  'scientificName': 'Chaetodipus penicillatus (Woodhouse, 1852)',
  'canonicalName': 'Chaetodipus penicillatus',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Rodentia',
  'family': 'Heteromyidae',
  'genus': 'Chaetodipus',
  'species': 'Chaetodipus penicillatus',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 1459,
  'familyKey': 5504,
  'genusKey': 2439576,
  'speciesKey': 2439591,
  'synonym': False,
  'class': 'Mammalia'},
 10: {'confidence': 99,
  'note': 'No match because of too little confidence',
  'matchType': 'NONE',
  'synonym': False},
 11: {'usageKey': 8071886,
  'acceptedUsageKey': 6158624,
  'scientificName': 'Cnemidophorus tigris Grismer, 1999',
  'canonicalName': 'Cnemidophorus tigris',
  'rank': 'SPECIES',
  'status': 'SYNONYM',
  'confidence': 97,
  'note': '2 synonym homonyms',
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Squamata',
  'family': 'Teiidae',
  'genus': 'Aspidoscelis',
  'species': 'Aspidoscelis tigris',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 358,
  'orderKey': 715,
  'familyKey': 5019,
  'genusKey': 2472006,
  'speciesKey': 8913064,
  'synonym': True,
  'class': 'Reptilia'},
 12: {'usageKey': 5227544,
  'acceptedUsageKey': 2472455,
  'scientificName': 'Cnemidophorus uniparens Wright & Lowe, 1965',
  'canonicalName': 'Cnemidophorus uniparens',
  'rank': 'SPECIES',
  'status': 'SYNONYM',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Squamata',
  'family': 'Teiidae',
  'genus': 'Aspidoscelis',
  'species': 'Aspidoscelis uniparens',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 358,
  'orderKey': 715,
  'familyKey': 5019,
  'genusKey': 2472006,
  'speciesKey': 2472455,
  'synonym': True,
  'class': 'Reptilia'},
 13: {'confidence': 100, 'matchType': 'NONE', 'synonym': False},
 14: {'usageKey': 8945077,
  'scientificName': 'Crotalus viridis Rafinesque, 1818',
  'canonicalName': 'Crotalus viridis',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 98,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Squamata',
  'family': 'Viperidae',
  'genus': 'Crotalus',
  'species': 'Crotalus viridis',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 358,
  'orderKey': 715,
  'familyKey': 5024,
  'genusKey': 2444354,
  'speciesKey': 8945077,
  'synonym': False,
  'class': 'Reptilia'},
 15: {'usageKey': 2439521,
  'scientificName': 'Dipodomys merriami Mearns, 1890',
  'canonicalName': 'Dipodomys merriami',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Rodentia',
  'family': 'Heteromyidae',
  'genus': 'Dipodomys',
  'species': 'Dipodomys merriami',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 1459,
  'familyKey': 5504,
  'genusKey': 2439509,
  'speciesKey': 2439521,
  'synonym': False,
  'class': 'Mammalia'},
 16: {'usageKey': 2439541,
  'scientificName': 'Dipodomys ordii Woodhouse, 1853',
  'canonicalName': 'Dipodomys ordii',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Rodentia',
  'family': 'Heteromyidae',
  'genus': 'Dipodomys',
  'species': 'Dipodomys ordii',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 1459,
  'familyKey': 5504,
  'genusKey': 2439509,
  'speciesKey': 2439541,
  'synonym': False,
  'class': 'Mammalia'},
 17: {'confidence': 99,
  'note': 'No match because of too little confidence',
  'matchType': 'NONE',
  'synonym': False},
 18: {'usageKey': 2439531,
  'scientificName': 'Dipodomys spectabilis Merriam, 1890',
  'canonicalName': 'Dipodomys spectabilis',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Rodentia',
  'family': 'Heteromyidae',
  'genus': 'Dipodomys',
  'species': 'Dipodomys spectabilis',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 1459,
  'familyKey': 5504,
  'genusKey': 2439509,
  'speciesKey': 2439531,
  'synonym': False,
  'class': 'Mammalia'},
 19: {'confidence': 100, 'matchType': 'NONE', 'synonym': False},
 20: {'usageKey': 2438517,
  'scientificName': 'Onychomys leucogaster (Wied-Neuwied, 1841)',
  'canonicalName': 'Onychomys leucogaster',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Rodentia',
  'family': 'Cricetidae',
  'genus': 'Onychomys',
  'species': 'Onychomys leucogaster',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 1459,
  'familyKey': 3240723,
  'genusKey': 2438514,
  'speciesKey': 2438517,
  'synonym': False,
  'class': 'Mammalia'},
 21: {'confidence': 99,
  'note': 'No match because of too little confidence',
  'matchType': 'NONE',
  'synonym': False},
 22: {'usageKey': 2438516,
  'scientificName': 'Onychomys torridus (Coues, 1874)',
  'canonicalName': 'Onychomys torridus',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Rodentia',
  'family': 'Cricetidae',
  'genus': 'Onychomys',
  'species': 'Onychomys torridus',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 1459,
  'familyKey': 3240723,
  'genusKey': 2438514,
  'speciesKey': 2438516,
  'synonym': False,
  'class': 'Mammalia'},
 23: {'usageKey': 2439566,
  'scientificName': 'Perognathus flavus Baird, 1855',
  'canonicalName': 'Perognathus flavus',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Rodentia',
  'family': 'Heteromyidae',
  'genus': 'Perognathus',
  'species': 'Perognathus flavus',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 1459,
  'familyKey': 5504,
  'genusKey': 2439552,
  'speciesKey': 2439566,
  'synonym': False,
  'class': 'Mammalia'},
 24: {'usageKey': 2439584,
  'acceptedUsageKey': 2439583,
  'scientificName': 'Perognathus hispidus Baird, 1858',
  'canonicalName': 'Perognathus hispidus',
  'rank': 'SPECIES',
  'status': 'SYNONYM',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Rodentia',
  'family': 'Heteromyidae',
  'genus': 'Chaetodipus',
  'species': 'Chaetodipus hispidus',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 1459,
  'familyKey': 5504,
  'genusKey': 2439576,
  'speciesKey': 2439583,
  'synonym': True,
  'class': 'Mammalia'},
 25: {'usageKey': 2437981,
  'scientificName': 'Peromyscus eremicus (Baird, 1857)',
  'canonicalName': 'Peromyscus eremicus',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Rodentia',
  'family': 'Cricetidae',
  'genus': 'Peromyscus',
  'species': 'Peromyscus eremicus',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 1459,
  'familyKey': 3240723,
  'genusKey': 2437961,
  'speciesKey': 2437981,
  'synonym': False,
  'class': 'Mammalia'},
 26: {'usageKey': 2438019,
  'scientificName': 'Peromyscus leucopus (Rafinesque, 1818)',
  'canonicalName': 'Peromyscus leucopus',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Rodentia',
  'family': 'Cricetidae',
  'genus': 'Peromyscus',
  'species': 'Peromyscus leucopus',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 1459,
  'familyKey': 3240723,
  'genusKey': 2437961,
  'speciesKey': 2438019,
  'synonym': False,
  'class': 'Mammalia'},
 27: {'usageKey': 2437967,
  'scientificName': 'Peromyscus maniculatus (Wagner, 1845)',
  'canonicalName': 'Peromyscus maniculatus',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Rodentia',
  'family': 'Cricetidae',
  'genus': 'Peromyscus',
  'species': 'Peromyscus maniculatus',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 1459,
  'familyKey': 3240723,
  'genusKey': 2437961,
  'speciesKey': 2437967,
  'synonym': False,
  'class': 'Mammalia'},
 28: {'usageKey': 2491276,
  'scientificName': 'Pipilo chlorurus (Audubon, 1839)',
  'canonicalName': 'Pipilo chlorurus',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Passeriformes',
  'family': 'Emberizidae',
  'genus': 'Pipilo',
  'species': 'Pipilo chlorurus',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 212,
  'orderKey': 729,
  'familyKey': 9608,
  'genusKey': 2491204,
  'speciesKey': 2491276,
  'synonym': False,
  'class': 'Aves'},
 29: {'usageKey': 2491244,
  'acceptedUsageKey': 7341622,
  'scientificName': 'Pipilo fuscus Swainson, 1827',
  'canonicalName': 'Pipilo fuscus',
  'rank': 'SPECIES',
  'status': 'SYNONYM',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Passeriformes',
  'family': 'Emberizidae',
  'genus': 'Melozone',
  'species': 'Melozone fusca',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 212,
  'orderKey': 729,
  'familyKey': 9608,
  'genusKey': 2491170,
  'speciesKey': 7341622,
  'synonym': True,
  'class': 'Aves'},
 30: {'confidence': 99,
  'note': 'No match because of too little confidence',
  'matchType': 'NONE',
  'synonym': False},
 31: {'usageKey': 2491728,
  'scientificName': 'Pooecetes gramineus (J.F.Gmelin, 1789)',
  'canonicalName': 'Pooecetes gramineus',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Passeriformes',
  'family': 'Emberizidae',
  'genus': 'Pooecetes',
  'species': 'Pooecetes gramineus',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 212,
  'orderKey': 729,
  'familyKey': 9608,
  'genusKey': 2491727,
  'speciesKey': 2491728,
  'synonym': False,
  'class': 'Aves'},
 32: {'usageKey': 2437864,
  'scientificName': 'Reithrodontomys fulvescens J.A.Allen, 1894',
  'canonicalName': 'Reithrodontomys fulvescens',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Rodentia',
  'family': 'Cricetidae',
  'genus': 'Reithrodontomys',
  'species': 'Reithrodontomys fulvescens',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 1459,
  'familyKey': 3240723,
  'genusKey': 2437854,
  'speciesKey': 2437864,
  'synonym': False,
  'class': 'Mammalia'},
 33: {'usageKey': 2437874,
  'scientificName': 'Reithrodontomys megalotis (Baird, 1857)',
  'canonicalName': 'Reithrodontomys megalotis',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Rodentia',
  'family': 'Cricetidae',
  'genus': 'Reithrodontomys',
  'species': 'Reithrodontomys megalotis',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 1459,
  'familyKey': 3240723,
  'genusKey': 2437854,
  'speciesKey': 2437874,
  'synonym': False,
  'class': 'Mammalia'},
 34: {'usageKey': 2437866,
  'scientificName': 'Reithrodontomys montanus (Baird, 1855)',
  'canonicalName': 'Reithrodontomys montanus',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Rodentia',
  'family': 'Cricetidae',
  'genus': 'Reithrodontomys',
  'species': 'Reithrodontomys montanus',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 1459,
  'familyKey': 3240723,
  'genusKey': 2437854,
  'speciesKey': 2437866,
  'synonym': False,
  'class': 'Mammalia'},
 35: {'confidence': 99,
  'note': 'No match because of too little confidence',
  'matchType': 'NONE',
  'synonym': False},
 36: {'confidence': 100, 'matchType': 'NONE', 'synonym': False},
 37: {'usageKey': 2451192,
  'scientificName': 'Sceloporus clarkii Baird & Girard, 1852',
  'canonicalName': 'Sceloporus clarkii',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 96,
  'matchType': 'FUZZY',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Squamata',
  'family': 'Phrynosomatidae',
  'genus': 'Sceloporus',
  'species': 'Sceloporus clarkii',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 358,
  'orderKey': 715,
  'familyKey': 5016,
  'genusKey': 2451143,
  'speciesKey': 2451192,
  'synonym': False,
  'class': 'Reptilia'},
 38: {'usageKey': 2451347,
  'scientificName': 'Sceloporus undulatus (Bosc & Daudin, 1801)',
  'canonicalName': 'Sceloporus undulatus',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Squamata',
  'family': 'Phrynosomatidae',
  'genus': 'Sceloporus',
  'species': 'Sceloporus undulatus',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 358,
  'orderKey': 715,
  'familyKey': 5016,
  'genusKey': 2451143,
  'speciesKey': 2451347,
  'synonym': False,
  'class': 'Reptilia'},
 39: {'usageKey': 2438153,
  'scientificName': 'Sigmodon fulviventer J.A.Allen, 1889',
  'canonicalName': 'Sigmodon fulviventer',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Rodentia',
  'family': 'Cricetidae',
  'genus': 'Sigmodon',
  'species': 'Sigmodon fulviventer',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 1459,
  'familyKey': 3240723,
  'genusKey': 2438146,
  'speciesKey': 2438153,
  'synonym': False,
  'class': 'Mammalia'},
 40: {'usageKey': 2438147,
  'scientificName': 'Sigmodon hispidus Say & Ord, 1825',
  'canonicalName': 'Sigmodon hispidus',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Rodentia',
  'family': 'Cricetidae',
  'genus': 'Sigmodon',
  'species': 'Sigmodon hispidus',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 1459,
  'familyKey': 3240723,
  'genusKey': 2438146,
  'speciesKey': 2438147,
  'synonym': False,
  'class': 'Mammalia'},
 41: {'usageKey': 2438156,
  'scientificName': 'Sigmodon ochrognathus Bailey, 1902',
  'canonicalName': 'Sigmodon ochrognathus',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Rodentia',
  'family': 'Cricetidae',
  'genus': 'Sigmodon',
  'species': 'Sigmodon ochrognathus',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 1459,
  'familyKey': 3240723,
  'genusKey': 2438146,
  'speciesKey': 2438156,
  'synonym': False,
  'class': 'Mammalia'},
 42: {'confidence': 100, 'matchType': 'NONE', 'synonym': False},
 43: {'usageKey': 2437300,
  'acceptedUsageKey': 8299308,
  'scientificName': 'Spermophilus spilosoma Bennett, 1833',
  'canonicalName': 'Spermophilus spilosoma',
  'rank': 'SPECIES',
  'status': 'SYNONYM',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Rodentia',
  'family': 'Sciuridae',
  'genus': 'Xerospermophilus',
  'species': 'Xerospermophilus spilosoma',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 1459,
  'familyKey': 9456,
  'genusKey': 7838133,
  'speciesKey': 8299308,
  'synonym': True,
  'class': 'Mammalia'},
 44: {'usageKey': 2437325,
  'acceptedUsageKey': 8032606,
  'scientificName': 'Spermophilus tereticaudus Baird, 1858',
  'canonicalName': 'Spermophilus tereticaudus',
  'rank': 'SPECIES',
  'status': 'SYNONYM',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Rodentia',
  'family': 'Sciuridae',
  'genus': 'Xerospermophilus',
  'species': 'Xerospermophilus tereticaudus',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 1459,
  'familyKey': 9456,
  'genusKey': 7838133,
  'speciesKey': 8032606,
  'synonym': True,
  'class': 'Mammalia'},
 45: {'usageKey': 2436910,
  'scientificName': 'Sylvilagus audubonii (Baird, 1858)',
  'canonicalName': 'Sylvilagus audubonii',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Lagomorpha',
  'family': 'Leporidae',
  'genus': 'Sylvilagus',
  'species': 'Sylvilagus audubonii',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 359,
  'orderKey': 785,
  'familyKey': 9379,
  'genusKey': 2436839,
  'speciesKey': 2436910,
  'synonym': False,
  'class': 'Mammalia'},
 46: {'usageKey': 5231132,
  'scientificName': 'Zonotrichia leucophrys (J.R.Forster, 1772)',
  'canonicalName': 'Zonotrichia leucophrys',
  'rank': 'SPECIES',
  'status': 'ACCEPTED',
  'confidence': 99,
  'matchType': 'EXACT',
  'kingdom': 'Animalia',
  'phylum': 'Chordata',
  'order': 'Passeriformes',
  'family': 'Emberizidae',
  'genus': 'Zonotrichia',
  'species': 'Zonotrichia leucophrys',
  'kingdomKey': 1,
  'phylumKey': 44,
  'classKey': 212,
  'orderKey': 729,
  'familyKey': 9608,
  'genusKey': 6173226,
  'speciesKey': 5231132,
  'synonym': False,
  'class': 'Aves'}}

We can now transform this to a pandas DataFrame:

EXERCISE: Convert the dictionary species_annotated into a pandas DataFrame with the row index the key-values corresponding to unique_species and the column headers the output columns of the API response. Save the result as the variable df_species_annotated. __Tip__: `transpose` can be used to flip rows and columns

In [88]:
df_species_annotated = pd.DataFrame(species_annotated).transpose()

In [89]:
df_species_annotated.head()


Out[89]:
usageKey scientificName canonicalName rank status confidence matchType kingdom phylum order ... phylumKey classKey orderKey familyKey genusKey speciesKey synonym class note acceptedUsageKey
0 2491123 Ammodramus savannarum (J.F.Gmelin, 1789) Ammodramus savannarum SPECIES ACCEPTED 99 EXACT Animalia Chordata Passeriformes ... 44 212 729 9608 2491079 2491123 False Aves NaN NaN
1 2437568 Ammospermophilus harrisii (Audubon & Bachman, ... Ammospermophilus harrisii SPECIES ACCEPTED 96 FUZZY Animalia Chordata Rodentia ... 44 359 1459 9456 2437567 2437568 False Mammalia NaN NaN
2 2491757 Amphispiza bilineata (Cassin, 1850) Amphispiza bilineata SPECIES ACCEPTED 99 EXACT Animalia Chordata Passeriformes ... 44 212 729 9608 2491750 2491757 False Aves NaN NaN
3 2438866 Baiomys taylori (Thomas, 1887) Baiomys taylori SPECIES ACCEPTED 99 EXACT Animalia Chordata Rodentia ... 44 359 1459 3240723 2438864 2438866 False Mammalia NaN NaN
4 2491893 Calamospiza melanocorys Stejneger, 1885 Calamospiza melanocorys SPECIES ACCEPTED 99 EXACT Animalia Chordata Passeriformes ... 44 212 729 9608 2491892 2491893 False Aves NaN NaN

5 rows × 24 columns

Select relevant information and add this to the survey data

EXERCISE: Subselect the columns 'class', 'kingdom', 'order', 'phylum', 'scientificName', 'status' and 'usageKey' from the DataFrame `df_species_annotated`. Save it as the variable df_species_annotated_subset

In [90]:
df_species_annotated_subset = df_species_annotated[['class', 'kingdom', 'order', 'phylum', 
                                                    'scientificName', 'status', 'usageKey']]

In [91]:
df_species_annotated_subset.head()


Out[91]:
class kingdom order phylum scientificName status usageKey
0 Aves Animalia Passeriformes Chordata Ammodramus savannarum (J.F.Gmelin, 1789) ACCEPTED 2491123
1 Mammalia Animalia Rodentia Chordata Ammospermophilus harrisii (Audubon & Bachman, ... ACCEPTED 2437568
2 Aves Animalia Passeriformes Chordata Amphispiza bilineata (Cassin, 1850) ACCEPTED 2491757
3 Mammalia Animalia Rodentia Chordata Baiomys taylori (Thomas, 1887) ACCEPTED 2438866
4 Aves Animalia Passeriformes Chordata Calamospiza melanocorys Stejneger, 1885 ACCEPTED 2491893
EXERCISE: Join the df_species_annotated_subset information to the `unique_species` overview of species. Save the result as variable unique_species_annotated:

In [92]:
unique_species_annotated = pd.merge(unique_species, df_species_annotated_subset, 
                                    left_index=True, right_index=True)
EXERCISE: Add the `unique_species_annotated` data to the `survey_data_species` data set, using both the genus and species column as keys. Save the result as the variable survey_data_completed.

In [93]:
survey_data_completed = pd.merge(survey_data_species, unique_species_annotated, 
                                 how='left', on= ["genus", "species"])

In [94]:
len(survey_data_completed)


Out[94]:
35550

In [95]:
survey_data_completed.head()


Out[95]:
verbatimLocality verbatimSex wgt datasetName sex occurrenceID eventDate decimalLongitude decimalLatitude genus species taxa name class kingdom order phylum scientificName status usageKey
0 2 M NaN Ecological Archives E090-118-D1. male 1 1977-07-16 -109.081975 31.938887 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 3 M NaN Ecological Archives E090-118-D1. male 2 1977-07-16 -109.081208 31.938896 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2 F NaN Ecological Archives E090-118-D1. female 3 1977-07-16 -109.081975 31.938887 Dipodomys merriami Rodent Dipodomys merriami Mammalia Animalia Rodentia Chordata Dipodomys merriami Mearns, 1890 ACCEPTED 2439521
3 7 M NaN Ecological Archives E090-118-D1. male 4 1977-07-16 -109.082816 31.938113 Dipodomys merriami Rodent Dipodomys merriami Mammalia Animalia Rodentia Chordata Dipodomys merriami Mearns, 1890 ACCEPTED 2439521
4 3 M NaN Ecological Archives E090-118-D1. male 5 1977-07-16 -109.081208 31.938896 Dipodomys merriami Rodent Dipodomys merriami Mammalia Animalia Rodentia Chordata Dipodomys merriami Mearns, 1890 ACCEPTED 2439521

Congratulations! You did a great cleaning job, save your result:


In [96]:
survey_data_completed.to_csv("../data/survey_data_completed.csv", index=False)

Acknowledgements

  • species.csv and survey.csv are used from the data carpentry workshop This data is from the paper S. K. Morgan Ernest, Thomas J. Valone, and James H. Brown. 2009. Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA. Ecology 90:1708. http://esapubs.org/archive/ecol/E090/118/
  • The plot_location.xlsx is a dummy created location file purely created for this exercise, using the plots location on google maps
  • GBIF API