CASE - Biodiversity data - data cleaning and enrichment
DS Data manipulation, analysis and visualisation in Python
December, 2019. © 2016, Joris Van den Bossche and Stijn Van Hoey (jorisvandenbossche@gmail.com, stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
Scenario:
You are interested in occurrence data for a number of species in Flanders. Unfortunately, the sources for this type of data are still scattered among different institutes. After a mailing campaign, you receive a number of files from the different institutes, in various data formats and styles...
You decide to be brave and script the interpretation and transformation, in order to make your work reproducible. Moreover, similar analyses will be needed in the future for new data requests, and you hope that future requests will result in similar data formats from the individual partners. Having a script will then make that work much more efficient.
Besides the technical differences in the data formats (`csv`, `excel`, `shapefile`, `sqlite`, ...), there are also differences in the naming of the content. For example, the coordinates can be named `x`/`y`, `decimalLatitude`/`decimalLongitude`, `lat`/`long`... Luckily, you know of an international open data standard to describe occurrence data, i.e. Darwin Core (DwC). Instead of inventing your own data model, you decide to comply with this international standard. This will enhance communication and will also make your data compatible with other data services working with this kind of data.
In short, DwC describes a flat table (cf. a CSV file) with an agreed naming convention for the headers and conventions on how certain data types need to be represented. While the standard definitions are out of scope for this tutorial, an in-depth description is given here. For this tutorial, we will focus on a few of the existing terms to learn some elements about data cleaning:
- `eventDate`: date in ISO 8601 format
- `scientificName`: the accepted scientific name of the species
- `decimalLatitude`/`decimalLongitude`: coordinates of the occurrence in WGS84 format
- `sex`: either `male` or `female`, to characterise the sex of the occurrence
- `occurrenceID`: an identifier within the dataset to identify the individual records
- `datasetName`: a static string defining the source of the data

Furthermore, additional information concerning the taxonomy will be added using an external API service.
Dataset to work on:
For this case, the data is provided in the following main data files:
- `surveys.csv`: the data with the surveys observed in the individual plots
- `species.csv`: the overview list of the species short names
- `plot_location.xlsx`: the overview of the coordinates of the individual plot locations

The data originates from a study of a Chihuahuan desert ecosystem near Portal, Arizona.
Reading in the data of the individual surveys:
In [2]:
survey_data = pd.read_csv("../data/surveys.csv")
In [3]:
survey_data.head()
Out[3]:
In [4]:
len(survey_data)
Out[4]:
For convenience when this dataset is later combined with other datasets, we first add a column of static values, defining the `datasetName` of this particular data:
In [5]:
datasetname = "Ecological Archives E090-118-D1."
Add this static value as a new column `datasetName`, using `datasetname` as the value for all of the records (a static value for the entire data set).
In [6]:
survey_data["datasetName"] = datasetname
Check the different values that occur in the column `sex_char`.
__Tip__: to find the unique values, look for a function called `unique`...
In [7]:
survey_data["sex_char"].unique().tolist()
Out[7]:
So, apparently more information is provided in this column, whereas according to the metadata the sex information should be either `M` (male) or `F` (female). We will create a new column, named `sex`, and convert the symbols to the corresponding sex, taking into account the following mapping of the values (see the metadata for more details):
- `M` -> `male`
- `F` -> `female`
- `R` -> `male`
- `P` -> `female`
- `Z` -> `nan`

At the same time, we will save the original information of the `sex_char` column in a separate column, called `verbatimSex`, as a reference.
In summary, we have to:
- create a new column `verbatimSex`, which is a copy of the current `sex_char` column
- create a new column `sex`
- convert the values of `sex_char` to `male` and `female` according to the mapping above

Converting the name of the column header `sex_char` to `verbatimSex` with the `rename` function:
In [8]:
survey_data = survey_data.rename(columns={'sex_char': 'verbatimSex'})
In [9]:
sex_dict = {"M": "male",
"F": "female",
"R": "male",
"P": "female",
"Z": np.nan}
In [10]:
survey_data['sex'] = survey_data['verbatimSex'].replace(sex_dict)
Checking the current frequency of values (this should result in the values `male`, `female` and `nan`):
In [11]:
survey_data["sex"].unique()
Out[11]:
To check the frequency of occurrences for the male/female categories, a bar chart is a possible representation:
In [12]:
survey_data["sex"].value_counts(dropna=False).plot(kind="barh", color="#00007f")
Out[12]:
When checking the unique values of the `species` column:
In [13]:
survey_data["species"].unique()
Out[13]:
In [14]:
survey_data.head(10)
Out[14]:
There apparently exists a double entry, `'DM and SH'`, which actually defines two records and should be decoupled into two individual records (i.e. rows). Hence, we should be able to create an additional row based on this split. To do so, Pandas provides a dedicated function since version 0.25, called `explode`. Starting from a small subset example:
In [15]:
example = survey_data.loc[7:10, "species"]
example
Out[15]:
Using the `split` method on strings, we can split the string on a given character or word, in this case the word `and`:
In [16]:
example.str.split("and")
Out[16]:
The `explode` method will create a row for each element in the list:
In [17]:
example_split = example.str.split("and").explode()
example_split
Out[17]:
Hence, `DM` and `SH` are now listed in separate rows. Other rows remain unchanged. The only remaining issue is the whitespace around the values:
In [18]:
example_split.iloc[1], example_split.iloc[2]
Out[18]:
Which we can solve again using the string method `strip`, removing the spaces before and after the characters:
In [19]:
example_split.str.strip()
Out[19]:
To make this reusable, let's create a dedicated function combining these steps, called `solve_double_field_entry`:
In [20]:
def solve_double_field_entry(df, keyword="and", column="verbatimEventDate"):
    """Split on keyword in column for an enumeration and create an extra record

    Parameters
    ----------
    df: pd.DataFrame
        DataFrame with a double field entry in one or more values
    keyword: str
        word/character to split the double records on
    column: str
        column name to use for the decoupling of the records

    Returns
    -------
    df: pd.DataFrame
        DataFrame with the double entries decoupled into individual rows
    """
    df[column] = df[column].str.split(keyword)
    df = df.explode(column)
    df[column] = df[column].str.strip()  # remove white space around the words
    return df
The function takes a DataFrame as input, splits the double entries into separate rows and returns the updated DataFrame. We can use this function to get an updated DataFrame, with an additional row (occurrence) added by decoupling the specific field:

Use `solve_double_field_entry` to create a DataFrame with an additional row by decoupling the double entries. Save the result as the variable `survey_data_decoupled`.
In [21]:
survey_data_decoupled = solve_double_field_entry(survey_data.copy(),
"and",
column="species") # get help of the function by SHIFT + TAB
# REMARK: note the copy() statement here (!); see the pandas_03b_indexing.ipynb notebook, `chained indexing` section
In [22]:
survey_data_decoupled["species"].unique()
Out[22]:
In [23]:
survey_data_decoupled.head(11)
Out[23]:
The `record_id` is no longer a unique identifier after the decoupling of this dataset. We will make a new dataset-specific identifier by adding a column called `occurrenceID` that uses a new counter as identifier. As a simple and straightforward approach, we will use a counter for the whole dataset, starting with 1:
In [24]:
np.arange(1, len(survey_data_decoupled) + 1, 1)
Out[24]:
Create a new column with header `occurrenceID`, with the values 1 -> 35550 as field values:
In [25]:
survey_data_decoupled["occurrenceID"] = np.arange(1, len(survey_data_decoupled) + 1, 1)
To avoid the confusion of having both a `record_id` and an `occurrenceID` field, we will remove the `record_id` column:
In [26]:
survey_data_decoupled = survey_data_decoupled.drop(columns="record_id")
Hence, columns can be `drop`-ped out of a DataFrame.
In [27]:
survey_data_decoupled.head(10)
Out[27]:
In the survey dataset we received, the `month`, `day`, and `year` columns contain the information about the date, i.e. `eventDate` in Darwin Core terms. We want this data in the ISO format `YYYY-MM-DD`. A convenient Pandas function for this is `to_datetime`, which provides multiple options to interpret dates. One of these options is the automatic interpretation of some 'typical' columns, like `year`, `month` and `day`, when passing a DataFrame.
In [28]:
# pd.to_datetime(survey_data_decoupled[["year", "month", "day"]]) # uncomment the line and test this statement
This does not work: not all dates can be interpreted... We should get some more information on the reason for the errors. By using the option `errors='coerce'`, the problematic entries will be labeled as missing values (`NaT`). We can count the number of dates that cannot be interpreted:
In [29]:
sum(pd.to_datetime(survey_data_decoupled[["year", "month", "day"]], errors='coerce').isnull())
Out[29]:
Select the subset of `survey_data_decoupled` containing those records that cannot correctly be interpreted as date values, and save the resulting DataFrame as the variable `trouble_makers`.
In [30]:
mask = pd.to_datetime(survey_data_decoupled[["year", "month", "day"]], errors='coerce').isnull()
trouble_makers = survey_data_decoupled[mask]
Checking some characteristics of the `trouble_makers`:
In [31]:
trouble_makers.head()
Out[31]:
In [32]:
trouble_makers["day"].unique()
Out[32]:
In [33]:
trouble_makers["month"].unique()
Out[33]:
In [34]:
trouble_makers["year"].unique()
Out[34]:
So, basically the problem is the presence of day `31` during the months April and September of the year 2000. At this point, we would have to recheck the original data in order to know how the issue should be solved. Apparently, for this specific case, there has been a data-entry problem in 2000, and the `31` values during this period should actually be `30`. It would be best to correct this in the source dataset, but for this exercise we will correct it here.
Set, in `survey_data_decoupled`, the `day` values of all of the troublemakers to the value `30`:
In [35]:
mask = pd.to_datetime(survey_data_decoupled[["year", "month", "day"]], errors='coerce').isnull()
survey_data_decoupled.loc[mask, "day"] = 30
Now, we do the parsing again to create a proper `eventDate` field, containing the dates:
In [36]:
survey_data_decoupled["eventDate"] = \
pd.to_datetime(survey_data_decoupled[["year", "month", "day"]])
Let's do a quick check of the amount of data for each year:
In [37]:
survey_data_decoupled.groupby("year").size().plot(kind='barh', color="#00007f", figsize=(10, 10))
Out[37]:
In [38]:
survey_data_decoupled.head()
Out[38]:
Currently, the dates are stored as a dedicated datetime data type:
In [39]:
survey_data_decoupled["eventDate"].dtype
Out[39]:
This is great, because it allows for many functionalities:
In [40]:
survey_data_decoupled.eventDate.dt #add a dot and press TAB to explore the date options it provides
Out[40]:
In [41]:
survey_data_decoupled.groupby(survey_data_decoupled["eventDate"].dt.year).size().plot(kind='barh', color="#00007f", figsize=(10, 10))
Out[41]:
So, we actually do not need the `day`, `month` and `year` columns anymore, and we have other options available as well:
In [42]:
nrecords_by_weekday = survey_data_decoupled.groupby(survey_data_decoupled["eventDate"].dt.weekday).size()
ax = nrecords_by_weekday.plot(kind="barh", color="#00007f", figsize=(6, 6))
# If you want to represent the ticklabels as proper names, uncomment the following line:
#ticklabels = ax.set_yticklabels(["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])
When saving the information to a file (e.g. a CSV file), this data type will automatically be converted to a string representation. However, we could also decide to explicitly define the string format in which the dates are stored (losing the datetime functionalities), in order to have full control over the way these dates are formatted:
In [43]:
survey_data_decoupled["eventDate"] = survey_data_decoupled["eventDate"].dt.strftime('%Y-%m-%d')
In [44]:
survey_data_decoupled["eventDate"].head()
Out[44]:
As we do not need the `day`, `month` and `year` columns anymore, we can drop them from the DataFrame:
In [45]:
survey_data_decoupled = survey_data_decoupled.drop(columns=["day", "month", "year"])
The individual plots are only identified by a `plot` identification number. In order to provide sufficient information to external users, additional information about the coordinates should be added. The coordinates of the individual plots are saved in another file: `plot_location.xlsx`. We will use this information to further enrich our dataset and add the Darwin Core terms `decimalLongitude` and `decimalLatitude`.
In [46]:
plot_data = pd.read_excel("../data/plot_location.xlsx", skiprows=3, index_col=0)
In [47]:
plot_data.head()
Out[47]:
These coordinates are in meters, more specifically in the UTM 12N coordinate system. However, the agreed coordinate system for Darwin Core is the World Geodetic System 1984 (WGS84).
As this is not a GIS course, we will shortcut the discussion about different projection systems and just provide an example of how such a conversion from UTM 12N to WGS84 can be performed with the projection toolkit `pyproj`, relying on the existing EPSG codes (a registry originally set up by the association of oil & gas producers).
First, we define our two projection systems, using their corresponding EPSG codes:
In [48]:
import pyproj
In [49]:
utm12n = pyproj.Proj("+init=EPSG:32612")
wgs84 = pyproj.Proj("+init=EPSG:4326")
The reprojection can be done with the function `transform` of the projection toolkit, providing the two coordinate systems and a set of x, y coordinates. For example, for a single coordinate, this can be applied as follows:
In [50]:
pyproj.transform(utm12n, wgs84, 681222.131658, 3.535262e+06)
Out[50]:
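Note: depending on your pyproj version, this may raise a deprecation warning; in pyproj 2.x and later, the `Transformer` interface is the recommended replacement for `+init=EPSG:...` and `pyproj.transform`. A roughly equivalent sketch, only needed if you work with a recent pyproj:

```python
# Sketch for newer pyproj releases (2.x+). always_xy=True keeps the
# (easting, northing) input and (longitude, latitude) output order.
transformer = pyproj.Transformer.from_crs("EPSG:32612", "EPSG:4326", always_xy=True)
transformer.transform(681222.131658, 3.535262e+06)
```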
Instead of writing a `for` loop to do this for each of the coordinates in the list, we can `apply` this function to each of them:

Apply the function `transform` to `plot_data`, using the columns `xutm` and `yutm`, and save the resulting output in 2 new columns, called `decimalLongitude` and `decimalLatitude`:
- Write a function `transform_utm_to_wgs` that takes a row of a DataFrame and returns a Series of two elements with the longitude and latitude.
- Test the function on the first row of `plot_data`.
- Apply the function to all rows (remember the `axis` parameter).
- Assign the result to the new `decimalLongitude` and `decimalLatitude` columns.
In [51]:
def transform_utm_to_wgs(row):
    """
    Converts the x and y coordinates of this row into a Series of the
    longitude and latitude.
    """
    utm12n = pyproj.Proj("+init=EPSG:32612")
    wgs84 = pyproj.Proj("+init=EPSG:4326")
    return pd.Series(pyproj.transform(utm12n, wgs84, row['xutm'], row['yutm']))
In [52]:
transform_utm_to_wgs(plot_data.loc[0])
Out[52]:
In [53]:
plot_data.apply(transform_utm_to_wgs, axis=1)
Out[53]:
In [54]:
plot_data[["decimalLongitude", "decimalLatitude"]] = plot_data.apply(transform_utm_to_wgs, axis=1)
In [55]:
plot_data.head()
Out[55]:
The function `transform_utm_to_wgs` you created above is a very specific function that knows the structure of the DataFrame you will apply it to (it assumes the 'xutm' and 'yutm' column names). We could also write a more generic function that just takes an X and Y coordinate and returns the Series of converted coordinates (`transform_utm_to_wgs2(X, Y)`).
To apply such a more generic function to the `plot_data` DataFrame, we can make use of the `lambda` construct, which lets you specify a function on one line as an argument:
`plot_data.apply(lambda row: transform_utm_to_wgs2(row['xutm'], row['yutm']), axis=1)`
If you have time, try to implement this function and test it as well; a possible sketch is given below.
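In case you want to check your attempt, one possible sketch of such a generic function, reusing the same projection definitions as above:

```python
def transform_utm_to_wgs2(x_utm, y_utm):
    """Convert a single UTM 12N (x, y) pair to a WGS84 (longitude, latitude) Series."""
    utm12n = pyproj.Proj("+init=EPSG:32612")
    wgs84 = pyproj.Proj("+init=EPSG:4326")
    return pd.Series(pyproj.transform(utm12n, wgs84, x_utm, y_utm))

# applied row by row with a lambda, as suggested above:
plot_data.apply(lambda row: transform_utm_to_wgs2(row['xutm'], row['yutm']), axis=1)
```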
To check the transformation, let's put these points on an interactive map. Leaflet is a well-known library for this, and wrappers exist in many programming languages to simplify its usage. Folium is an extensive Python library providing multiple options. As we just want to do a quick check of the coordinates, we will rely on the package mplleaflet, which simply converts a matplotlib figure to a leaflet map:
In [56]:
import mplleaflet # https://github.com/jwass/mplleaflet
In [57]:
fig, ax = plt.subplots(figsize=(5, 8))
plt.plot(plot_data['decimalLongitude'], plot_data['decimalLatitude'], 'rs')
mplleaflet.display(fig=fig) # zoom out to see where the measurement plot is located
Out[57]:
All points are inside the desert region, as we expected, so we can extend our survey dataset with this coordinate information. Combining two data sets based on a common identifier is completely similar to the use of `JOIN` operations in databases. In Pandas, this functionality is provided by `pd.merge`.
In practice, we have to add the columns `decimalLongitude`/`decimalLatitude` to the current dataset `survey_data_decoupled`, using the plot identification number as the key to join on.
Select the columns `plot`, `decimalLongitude` and `decimalLatitude` from `plot_data` and save the result as the variable `plot_data_selection`:
In [58]:
plot_data_selection = plot_data[["plot", "decimalLongitude", "decimalLatitude"]]
Using `merge`, add the coordinate information (`plot_data_selection`) to the survey data set and save the resulting DataFrame as `survey_data_plots`.
__Tip__: documentation of [merge](http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-DataFrame-joining-merging)...
In [59]:
survey_data_plots = pd.merge(survey_data_decoupled, plot_data_selection,
how="left", on="plot")
In [60]:
survey_data_plots.head()
Out[60]:
The plot locations need to be stored with the variable name `verbatimLocality`, indicating the identifier of the plot as an integer value:
In [61]:
survey_data_plots = survey_data_plots.rename(columns={'plot': 'verbatimLocality'})
The column `species` only provides a short identifier in the survey overview. The full name information is stored in a separate file: `species.csv`. As we want our dataset to include this information, we will read in this data and add it to our survey dataset:

Read in the `species.csv` file and save the result as the variable `species_data`.
__Tip__: check which delimiter (`sep`) to use...
In [62]:
species_data = pd.read_csv("../data/species.csv", sep=";")
In [63]:
species_data.head()
Out[63]:
When reviewing the metadata, you see that in the data file the acronym `NE` is used to describe Neotoma albigula, whereas in the metadata description the acronym `NA` is used.
In [64]:
species_data.loc[species_data["species_id"] == "NE", "species_id"] = "NA"
(At the same time, you decide to cure this problem at the source and alert the data provider about this issue.)
Now that we have prepared the two datasets, we can combine the data using the `merge` operation. Take into account that the key column is named differently in `species_data` and `survey_data_plots`, respectively `species_id` and `species`:
We want to add the data of the species to the survey data, in order to see the full species names:
In [65]:
survey_data_species = pd.merge(survey_data_plots, species_data, how="left", # LEFT OR INNER?
left_on="species", right_on="species_id")
In [66]:
len(survey_data_species)
Out[66]:
The join is OK, but we are left with some redundant columns and confusing names:
In [67]:
survey_data_species.head()
Out[67]:
We do not need the `species_x` and `species_id` columns anymore, as we will use the scientific names from now on:
In [68]:
survey_data_species = survey_data_species.drop(["species_x", "species_id"], axis=1)
The column `species_y` can just be renamed to `species`:
In [69]:
survey_data_species = survey_data_species.rename(columns={"species_y": "species"})
In [70]:
survey_data_species.head()
Out[70]:
In [71]:
len(survey_data_species)
Out[71]:
Let's now save our cleaned-up data to a CSV file, so we can further analyze the data in a following notebook:
In [72]:
survey_data_species.to_csv("interim_survey_data_species.csv", index=False)
As the current species names are rather short and could lead to confusion when shared with other users, retrieving additional information about the different species in our dataset would be useful to integrate our work with other research. An option is to match our names against an external service to request additional information about the different species.
One of these services is the GBIF API. The service can most easily be illustrated with a small example:
In a new tab of your browser, go to the URL http://www.gbif.org/species/2475532, which corresponds to the page of Alcedo atthis (ijsvogel in Dutch). For each of the species in our list, we could do a search on the GBIF website to find the corresponding species page and extract more information manually. However, this would take a lot of time...
Therefore, GBIF (and many other organisations!) provides a service to extract the same information in a machine-readable way, in order to automate these searches. As an example, let's search for the information on Alcedo atthis using the GBIF API: go to the URL http://api.gbif.org/v1/species/match?name=Alcedo atthis and check the output. What we did is a machine-based search on the GBIF website for information about Alcedo atthis.
The same can be done using Python. The main library we need for this kind of automated search is the `requests` package, which can be used to send requests to any kind of API out there.
In [73]:
import requests
For the example of Alcedo atthis:
In [74]:
species_name = 'Alcedo atthis'
In [75]:
base_string = 'http://api.gbif.org/v1/species/match?'
request_parameters = {'verbose': False, 'strict': True, 'name': species_name}
message = requests.get(base_string, params=request_parameters).json()
message
Out[75]:
From this we get a dictionary containing more information about the taxonomy of Alcedo atthis.
In our species data set, the name to match is spread over two columns (genus and species), so we have to combine those two in order to execute the name matching:
In [76]:
genus_name = "Callipepla"
species_name = "squamata"
name_to_match = '{} {}'.format(genus_name, species_name)
base_string = 'http://api.gbif.org/v1/species/match?'
request_parameters = {'strict': True, 'name': name_to_match} # use strict matching(!)
message = requests.get(base_string, params=request_parameters).json()
message
Out[76]:
To apply this to our species data set, we will have to perform this request for each individual genus/species combination. As this is recurring functionality, we will write a small function to do it:
In [77]:
def name_match(genus_name, species_name, strict=True):
    """
    Perform a GBIF name matching using the genus and species names

    Parameters
    ----------
    genus_name: str
        name of the genus of the species
    species_name: str
        name of the species to request more information about
    strict: boolean
        define whether the matching needs to be performed with the strict
        option (True) or not (False)

    Returns
    -------
    message: dict
        dictionary with the information returned by the GBIF matching service
    """
    name = '{} {}'.format(genus_name, species_name)
    base_string = 'http://api.gbif.org/v1/species/match?'
    request_parameters = {'strict': strict, 'name': name}  # use strict matching(!)
    message = requests.get(base_string, params=request_parameters).json()
    return message
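Side note: when calling a web service from a script, a request can occasionally hang or fail. A slightly more defensive variant of the same function could, for example, add a timeout and check the HTTP status (the timeout value here is an arbitrary choice):

```python
def name_match_safe(genus_name, species_name, strict=True, timeout=5):
    """Like name_match, but fail fast on a non-responding service or HTTP error."""
    name = '{} {}'.format(genus_name, species_name)
    base_string = 'http://api.gbif.org/v1/species/match?'
    request_parameters = {'strict': strict, 'name': name}
    response = requests.get(base_string, params=request_parameters, timeout=timeout)
    response.raise_for_status()  # raise an error for a non-2xx HTTP status
    return response.json()
```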
Testing our custom matching function:
In [78]:
genus_name = "Callipepla"
species_name = "squamata"
name_match(genus_name, species_name, strict=True)
Out[78]:
However, the matching won't provide an answer for every search:
In [79]:
genus_name = "Lizard"
species_name = "sp."
name_match(genus_name, species_name, strict=True)
Out[79]:
Hence, in order to add this information to our survey DataFrame, we need to perform the matching for each of the unique genus/species combinations that occur in the `survey_data_species` data set.

Extract the unique genus/species combinations, using `drop_duplicates()`. Save the result as the variable `unique_species`.
In [80]:
#%%timeit
unique_species = survey_data_species[["genus", "species"]].drop_duplicates().dropna()
In [81]:
len(unique_species)
Out[81]:
As an alternative, extract the unique genus/species combinations using `groupby`. Save the result as the variable `unique_species`.
In [82]:
#%%timeit
unique_species = \
survey_data_species.groupby(["genus", "species"]).first().reset_index()[["genus", "species"]]
In [83]:
len(unique_species)
Out[83]:
In [84]:
unique_species["name"] = unique_species["genus"] + " " + unique_species["species"]
# an alternative approach worthwhile to know:
#unique_species["name"] = unique_species["genus"].str.cat(unique_species["species"], " ")
In [85]:
unique_species.head()
Out[85]:
To perform the matching for each of the combinations, different options exist.
Just to show the possibility of using `for` loops, we will add the matched information that way. First, we will store everything in one dictionary, where the keys of the dictionary are the index values of `unique_species` (in order to merge them again later) and the values are the entire messages (which are dictionaries themselves). The format will look as follows:

species_annotated = {0: {'canonicalName': 'Squamata', 'class': 'Reptilia', 'classKey': 358, ...},
                     1: {'canonicalName': ...},
                     2: ...}
In [86]:
species_annotated = {}
for key, row in unique_species.iterrows():
    species_annotated[key] = name_match(row["genus"], row["species"], strict=True)
In [87]:
species_annotated
Out[87]:
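As mentioned, the `for` loop is just one of the options; a sketch of the same lookup using `apply` (relying on the `name_match` function defined above) could look like this:

```python
# Alternative sketch: row-wise apply instead of an explicit for loop.
# The result is a Series of response dictionaries with the same index as unique_species.
matched = unique_species.apply(
    lambda row: name_match(row["genus"], row["species"], strict=True), axis=1)

# which could then be turned into a DataFrame directly:
df_matched = pd.DataFrame(matched.tolist(), index=matched.index)
```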
We can now transform this to a pandas DataFrame:

Convert `species_annotated` into a pandas DataFrame, with as row index the key values corresponding to `unique_species` and as column headers the output fields of the API response. Save the result as the variable `df_species_annotated`.
__Tip__: `transpose` can be used to flip rows and columns
In [88]:
df_species_annotated = pd.DataFrame(species_annotated).transpose()
In [89]:
df_species_annotated.head()
Out[89]:
Select from `df_species_annotated` the subset of columns of interest: `class`, `kingdom`, `order`, `phylum`, `scientificName`, `status` and `usageKey`. Save the result as the variable `df_species_annotated_subset`:
In [90]:
df_species_annotated_subset = df_species_annotated[['class', 'kingdom', 'order', 'phylum',
'scientificName', 'status', 'usageKey']]
In [91]:
df_species_annotated_subset.head()
Out[91]:
Join this annotation subset with the `unique_species` overview (based on their shared index) and save the result as the variable `unique_species_annotated`:
In [92]:
unique_species_annotated = pd.merge(unique_species, df_species_annotated_subset,
left_index=True, right_index=True)
Finally, add the annotated species information to the full survey data set and save the result as the variable `survey_data_completed`.
In [93]:
survey_data_completed = pd.merge(survey_data_species, unique_species_annotated,
how='left', on= ["genus", "species"])
In [94]:
len(survey_data_completed)
Out[94]:
In [95]:
survey_data_completed.head()
Out[95]:
Congratulations! You did a great cleaning job. Save your result:
In [96]:
survey_data_completed.to_csv("../data/survey_data_completed.csv", index=False)
The files `species.csv` and `surveys.csv` are used from the Data Carpentry workshop. This data is from the paper: S. K. Morgan Ernest, Thomas J. Valone, and James H. Brown. 2009. Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA. Ecology 90:1708. http://esapubs.org/archive/ecol/E090/118/
The file `plot_location.xlsx` is a dummy location file, created purely for this exercise using the plot locations on Google Maps.