CASE - Biodiversity data - data cleaning and enrichment
DS Data manipulation, analysis and visualisation in Python
December, 2019. © 2016, Joris Van den Bossche and Stijn Van Hoey (jorisvandenbossche@gmail.com, stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
Scenario:
You are interested in occurrence data for a number of species in Flanders. Unfortunately, the sources for this type of data are still scattered among different institutes. After a mailing campaign, you receive a number of files from the different institutes, in various data formats and styles...
You decide to be brave and script the interpretation and transformation, in order to make your work reproducible. Moreover, similar analyses will be needed in the future for new data requests, and you hope that future requests will result in similar data formats from the individual partners. Having a script will then make that work much more efficient.
Besides the technical differences in the data formats (`csv`, `excel`, `shapefile`, `sqlite`, ...), there are also differences in the naming of the content. For example, the coordinates can be named `x`/`y`, `decimalLatitude`/`decimalLongitude`, `lat`/`long`... Luckily, you know of an international open data standard to describe occurrence data, i.e. Darwin Core (DwC). Instead of inventing your own data model, you decide to comply with this international standard. This will enhance communication and will also make your data compatible with other data services working with this kind of data.
In short, DwC describes a flat table (cf. a CSV file) with an agreed naming convention for the headers and conventions on how certain data types need to be represented. While the standard definitions are out of scope for this tutorial, an in-depth description is given here. For this tutorial, we will focus on a few of the existing terms to learn some elements about data cleaning:
- `eventDate`: date in ISO 8601 format
- `scientificName`: the accepted scientific name of the species
- `decimalLatitude`/`decimalLongitude`: coordinates of the occurrence in WGS84 format
- `sex`: either `male` or `female`, to characterise the sex of the occurrence
- `occurrenceID`: an identifier within the dataset to identify the individual records
- `datasetName`: a static string defining the source of the data

Furthermore, additional information concerning the taxonomy will be added using an external API service.
Dataset to work on:
For this case, the data is provided in the following main data files:
- `surveys.csv`: the data with the surveys observed in the individual plots
- `species.csv`: the overview list of the species short names
- `plot_location.xlsx`: the overview of the coordinates of the individual plot locations

The data originates from a study of a Chihuahuan desert ecosystem near Portal, Arizona.
Reading in the data of the individual surveys:
In [2]:
survey_data = pd.read_csv("../data/surveys.csv")
In [3]:
survey_data.head()
Out[3]:
In [4]:
len(survey_data)
Out[4]:
For convenience when this dataset is later combined with other datasets, we first add a column of static values, defining the `datasetName` of this particular data:
In [5]:
datasetname = "Ecological Archives E090-118-D1."
Add this static value as a new column `datasetName`, using `datasetname` as the value for all of the records (a static value for the entire data set).
In [6]:
survey_data["datasetName"] = datasetname
Check the different values that occur in the column `sex_char`.
__Tip__: to find the unique values, look for a function called `unique`...
In [7]:
survey_data["sex_char"].unique().tolist()
Out[7]:
So, apparently more information is provided in this column, whereas according to the metadata the sex information should be either `M` (male) or `F` (female). We will create a new column, named `sex`, and convert the symbols to the corresponding sex, taking into account the following mapping of the values (see the metadata for more details):
- `M` -> `male`
- `F` -> `female`
- `R` -> `male`
- `P` -> `female`
- `Z` -> `nan`

At the same time, we will save the original information of the `sex_char` column in a separate column, called `verbatimSex`, as a reference.
In summary, we have to:
- create a new column `verbatimSex`, which is a copy of the current `sex_char` column
- create a new column `sex`
- convert the values of `sex_char` to `male` and `female` according to the mapping above

Converting the name of the column header `sex_char` to `verbatimSex` with the `rename` function:
In [8]:
survey_data = survey_data.rename(columns={'sex_char': 'verbatimSex'})
In [9]:
sex_dict = {"M": "male",
"F": "female",
"R": "male",
"P": "female",
"Z": np.nan}
In [10]:
survey_data['sex'] = survey_data['verbatimSex'].replace(sex_dict)
Checking the current frequency of values (this should result in the values `male`, `female` and `nan`):
In [11]:
survey_data["sex"].unique()
Out[11]:
To check the frequency of occurrences for the male/female categories, a bar chart is a possible representation:
In [12]:
survey_data["sex"].value_counts(dropna=False).plot(kind="barh", color="#00007f")
Out[12]:
When checking the unique values of the `species` column:
In [13]:
survey_data["species"].unique()
Out[13]:
In [14]:
survey_data.head(10)
Out[14]:
There apparently exists a double entry, `'DM and SH'`, which actually defines two records and should be decoupled into two individual records (i.e. rows). Hence, we should be able to create an additional row based on this split. To do so, Pandas provides a dedicated function since version 0.25, called `explode`. Starting from a small subset example:
In [15]:
example = survey_data.loc[7:10, "species"]
example
Out[15]:
Using the `split` method on strings, we can split the string on a given character or word, in this case the word `and`:
In [16]:
example.str.split("and")
Out[16]:
The `explode` method will create a row for each element in the list:
In [17]:
example_split = example.str.split("and").explode()
example_split
Out[17]:
Hence, `DM` and `SH` are now listed in separate rows. Other rows remain unchanged. The only remaining issue is the whitespace around the values:
In [18]:
example_split.iloc[1], example_split.iloc[2]
Out[18]:
Which we can solve again using the string method `strip`, removing the spaces before and after the characters:
In [19]:
example_split.str.strip()
Out[19]:
To make this reusable, let's create a dedicated function combining these steps, called `solve_double_field_entry`:
In [20]:
def solve_double_field_entry(df, keyword="and", column="verbatimEventDate"):
    """Split on keyword in column for an enumeration and create an extra record

    Parameters
    ----------
    df: pd.DataFrame
        DataFrame with a double field entry in one or more values
    keyword: str
        word/character to split the double records on
    column: str
        column name to use for the decoupling of the records

    Returns
    -------
    df: pd.DataFrame
        DataFrame with the double entries decoupled into individual rows
    """
    df[column] = df[column].str.split(keyword)
    df = df.explode(column)
    df[column] = df[column].str.strip()  # remove white space around the words
    return df
The function takes a DataFrame as input, splits the double entries into separate rows and returns the updated DataFrame. We can use this function to get an updated DataFrame, with an additional row (occurrence) added by decoupling the specific field:

Use `solve_double_field_entry` to create a DataFrame with an additional row by decoupling the double entries. Save the result as the variable `survey_data_decoupled`.
In [21]:
survey_data_decoupled = solve_double_field_entry(survey_data.copy(),
"and",
column="species") # get help of the function by SHIFT + TAB
# REMARK: note the copy() statement here (!); see the pandas_03b_indexing.ipynb notebook, `chained indexing` section
In [22]:
survey_data_decoupled["species"].unique()
Out[22]:
In [23]:
survey_data_decoupled.head(11)
Out[23]:
The `record_id` is no longer a unique identifier after the decoupling of this dataset. We will make a new dataset-specific identifier by adding a column called `occurrenceID` that uses a new counter as identifier. As a simple and straightforward approach, we will use a counter for the whole dataset, starting with 1:
In [24]:
np.arange(1, len(survey_data_decoupled) + 1, 1)
Out[24]:
Create a new column with header `occurrenceID`, with the values 1 -> 35550 as field values:
In [25]:
survey_data_decoupled["occurrenceID"] = np.arange(1, len(survey_data_decoupled) + 1, 1)
To avoid the confusion of having both a `record_id` and an `occurrenceID` field, we will remove the `record_id` column:
In [26]:
survey_data_decoupled = survey_data_decoupled.drop(columns="record_id")
Hence, columns can be `drop`-ped out of a DataFrame.
In [27]:
survey_data_decoupled.head(10)
Out[27]:
In the survey dataset we received, the `month`, `day`, and `year` columns contain the information about the date, i.e. `eventDate` in Darwin Core terms. We want this data in the ISO format `YYYY-MM-DD`. A convenient Pandas function for this is `to_datetime`, which provides multiple options to interpret dates. One of these options is the automatic interpretation of some 'typical' columns, like `year`, `month` and `day`, when passing a DataFrame.
In [28]:
# pd.to_datetime(survey_data_decoupled[["year", "month", "day"]]) # uncomment the line and test this statement
This does not work: not all dates can be interpreted... We should get some more information on the reason for the errors. By using the option `errors='coerce'`, the problematic entries will be labeled as missing values (`NaT`). We can count the number of dates that cannot be interpreted:
In [29]:
sum(pd.to_datetime(survey_data_decoupled[["year", "month", "day"]], errors='coerce').isnull())
Out[29]:
Select the subset of `survey_data_decoupled` containing those records that cannot correctly be interpreted as date values, and save the resulting DataFrame as the variable `trouble_makers`.
In [30]:
mask = pd.to_datetime(survey_data_decoupled[["year", "month", "day"]], errors='coerce').isnull()
trouble_makers = survey_data_decoupled[mask]
Checking some characteristics of the `trouble_makers`:
In [31]:
trouble_makers.head()
Out[31]:
In [32]:
trouble_makers["day"].unique()
Out[32]:
In [33]:
trouble_makers["month"].unique()
Out[33]:
In [34]:
trouble_makers["year"].unique()
Out[34]:
So, basically the problem is the presence of day `31` during the months April and September of the year 2000. At this point, we would have to recheck the original data in order to know how the issue should be solved. Apparently, for this specific case, there has been a data-entry problem in 2000, and the `31` values during this period should actually be `30`. It would be best to correct this in the source dataset, but for this exercise we will correct it here.
Set, in `survey_data_decoupled`, the `day` values of all of the troublemakers to the value `30`:
In [35]:
mask = pd.to_datetime(survey_data_decoupled[["year", "month", "day"]], errors='coerce').isnull()
survey_data_decoupled.loc[mask, "day"] = 30
Now, we do the parsing again to create a proper `eventDate` field, containing the dates:
In [36]:
survey_data_decoupled["eventDate"] = \
pd.to_datetime(survey_data_decoupled[["year", "month", "day"]])
Let's do a quick check of the amount of data for each year:
In [37]:
survey_data_decoupled.groupby("year").size().plot(kind='barh', color="#00007f", figsize=(10, 10))
Out[37]:
In [38]:
survey_data_decoupled.head()
Out[38]:
Currently, the dates are stored as a dedicated datetime data type:
In [39]:
survey_data_decoupled["eventDate"].dtype
Out[39]:
This is great, because it allows for many functionalities:
In [40]:
survey_data_decoupled.eventDate.dt #add a dot and press TAB to explore the date options it provides
Out[40]:
In [41]:
survey_data_decoupled.groupby(survey_data_decoupled["eventDate"].dt.year).size().plot(kind='barh', color="#00007f", figsize=(10, 10))
Out[41]:
So, we actually do not need the `day`, `month` and `year` columns anymore, and we have other options available as well:
In [42]:
nrecords_by_weekday = survey_data_decoupled.groupby(survey_data_decoupled["eventDate"].dt.weekday).size()
ax = nrecords_by_weekday.plot(kind="barh", color="#00007f", figsize=(6, 6))
# If you want to represent the ticklabels as proper names, uncomment the following line:
#ticklabels = ax.set_yticklabels(["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])
When saving the information to a file (e.g. a CSV file), this data type will automatically be converted to a string representation. However, we could also decide to explicitly define the string format in which the dates are stored (losing the datetime functionalities), in order to have full control over the way these dates are formatted:
In [43]:
survey_data_decoupled["eventDate"] = survey_data_decoupled["eventDate"].dt.strftime('%Y-%m-%d')
In [44]:
survey_data_decoupled["eventDate"].head()
Out[44]:
As we do not need the `day`, `month` and `year` columns anymore, we can drop them from the DataFrame:
In [45]:
survey_data_decoupled = survey_data_decoupled.drop(columns=["day", "month", "year"])
The individual plots are only identified by a `plot` identification number. In order to provide sufficient information to external users, additional information about the coordinates should be added. The coordinates of the individual plots are saved in another file: `plot_location.xlsx`. We will use this information to further enrich our dataset and add the Darwin Core terms `decimalLongitude` and `decimalLatitude`.
In [46]:
plot_data = pd.read_excel("../data/plot_location.xlsx", skiprows=3, index_col=0)
In [47]:
plot_data.head()
Out[47]:
These coordinates are in meters, more specifically in the UTM 12N coordinate system. However, the agreed coordinate system for Darwin Core is the World Geodetic System 1984 (WGS84).
As this is not a GIS course, we will shortcut the discussion about different projection systems and just provide an example of how such a conversion from UTM 12N to WGS84 can be performed with the projection toolkit `pyproj`, relying on the existing EPSG codes (a registry originally set up by the association of oil & gas producers).
First, we define our two projection systems, using their corresponding EPSG codes:
In [48]:
import pyproj
In [49]:
utm12n = pyproj.Proj("+init=EPSG:32612")
wgs84 = pyproj.Proj("+init=EPSG:4326")
The reprojection can be done with the function `transform` of the projection toolkit, providing the two coordinate systems and a set of x, y coordinates. For example, for a single coordinate, this can be applied as follows:
In [50]:
pyproj.transform(utm12n, wgs84, 681222.131658, 3.535262e+06)
Out[50]:
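Note: depending on your pyproj version, this may raise a deprecation warning; in pyproj 2.x and later, the `Transformer` interface is the recommended replacement for `+init=EPSG:...` and `pyproj.transform`. A roughly equivalent sketch, only needed if you work with a recent pyproj:

```python
# Sketch for newer pyproj releases (2.x+). always_xy=True keeps the
# (easting, northing) input and (longitude, latitude) output order.
transformer = pyproj.Transformer.from_crs("EPSG:32612", "EPSG:4326", always_xy=True)
transformer.transform(681222.131658, 3.535262e+06)
```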
Instead of writing a `for` loop to do this for each of the coordinates in the list, we can `apply` this function to each of them:

Apply the function `transform` to `plot_data`, using the columns `xutm` and `yutm`, and save the resulting output in 2 new columns, called `decimalLongitude` and `decimalLatitude`:
- Write a function `transform_utm_to_wgs` that takes a row of a DataFrame and returns a Series of two elements with the longitude and latitude.
- Test the function on the first row of `plot_data`.
- Apply the function to all rows (remember the `axis` parameter).
- Assign the result to the new `decimalLongitude` and `decimalLatitude` columns.
In [51]:
def transform_utm_to_wgs(row):
    """
    Converts the x and y coordinates of this row into a Series of the
    longitude and latitude.
    """
    utm12n = pyproj.Proj("+init=EPSG:32612")
    wgs84 = pyproj.Proj("+init=EPSG:4326")
    return pd.Series(pyproj.transform(utm12n, wgs84, row['xutm'], row['yutm']))
In [52]:
transform_utm_to_wgs(plot_data.loc[0])
Out[52]:
In [53]:
plot_data.apply(transform_utm_to_wgs, axis=1)
Out[53]:
In [54]:
plot_data[["decimalLongitude", "decimalLatitude"]] = plot_data.apply(transform_utm_to_wgs, axis=1)
In [55]:
plot_data.head()
Out[55]:
The function `transform_utm_to_wgs` you created above is a very specific function that knows the structure of the DataFrame you will apply it to (it assumes the 'xutm' and 'yutm' column names). We could also write a more generic function that just takes an X and Y coordinate and returns the Series of converted coordinates (`transform_utm_to_wgs2(X, Y)`).
To apply such a more generic function to the `plot_data` DataFrame, we can make use of the `lambda` construct, which lets you specify a function on one line as an argument:
`plot_data.apply(lambda row: transform_utm_to_wgs2(row['xutm'], row['yutm']), axis=1)`
If you have time, try to implement this function and test it as well; a possible sketch is given below.
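In case you want to check your attempt, one possible sketch of such a generic function, reusing the same projection definitions as above:

```python
def transform_utm_to_wgs2(x_utm, y_utm):
    """Convert a single UTM 12N (x, y) pair to a WGS84 (longitude, latitude) Series."""
    utm12n = pyproj.Proj("+init=EPSG:32612")
    wgs84 = pyproj.Proj("+init=EPSG:4326")
    return pd.Series(pyproj.transform(utm12n, wgs84, x_utm, y_utm))

# applied row by row with a lambda, as suggested above:
plot_data.apply(lambda row: transform_utm_to_wgs2(row['xutm'], row['yutm']), axis=1)
```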
To check the transformation, let's put these points on an interactive map. Leaflet is a well-known library for this, and wrappers exist in many programming languages to simplify its usage. Folium is an extensive Python library providing multiple options. As we just want to do a quick check of the coordinates, we will rely on the package mplleaflet, which simply converts a matplotlib figure to a leaflet map:
In [56]:
import mplleaflet # https://github.com/jwass/mplleaflet
In [57]:
fig, ax = plt.subplots(figsize=(5, 8))
plt.plot(plot_data['decimalLongitude'], plot_data['decimalLatitude'], 'rs')
mplleaflet.display(fig=fig) # zoom out to see where the measurement plot is located
Out[57]:
All points are inside the desert region, as we expected, so we can extend our survey dataset with this coordinate information. Combining two data sets based on a common identifier is completely similar to the use of `JOIN` operations in databases. In Pandas, this functionality is provided by `pd.merge`.
In practice, we have to add the columns `decimalLongitude`/`decimalLatitude` to the current dataset `survey_data_decoupled`, using the plot identification number as the key to join on.
Select the columns `plot`, `decimalLongitude` and `decimalLatitude` from `plot_data` and save the result as the variable `plot_data_selection`:
In [58]:
plot_data_selection = plot_data[["plot", "decimalLongitude", "decimalLatitude"]]
Using `merge`, add the coordinate information (`plot_data_selection`) to the survey data set and save the resulting DataFrame as `survey_data_plots`.
__Tip__: documentation of [merge](http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-DataFrame-joining-merging)...
In [59]:
survey_data_plots = pd.merge(survey_data_decoupled, plot_data_selection,
how="left", on="plot")
In [60]:
survey_data_plots.head()
Out[60]:
The plot locations need to be stored with the variable name `verbatimLocality`, indicating the identifier of the plot as an integer value:
In [61]:
survey_data_plots = survey_data_plots.rename(columns={'plot': 'verbatimLocality'})
The column `species` only provides a short identifier in the survey overview. The full name information is stored in a separate file: `species.csv`. As we want our dataset to include this information, we will read in this data and add it to our survey dataset:

Read in the `species.csv` file and save the result as the variable `species_data`.
__Tip__: check which delimiter (`sep`) to use...
In [62]:
species_data = pd.read_csv("../data/species.csv", sep=";")
In [63]:
species_data.head()
Out[63]:
When reviewing the metadata, you see that in the data file the acronym `NE` is used to describe Neotoma albigula, whereas in the metadata description the acronym `NA` is used.
In [64]:
species_data.loc[species_data["species_id"] == "NE", "species_id"] = "NA"
(At the same time, you decide to cure this problem at the source and alert the data provider about this issue.)
Now that we have prepared the two datasets, we can combine the data using the `merge` operation. Take into account that the key column is named differently in `species_data` and `survey_data_plots`, respectively `species_id` and `species`:
We want to add the data of the species to the survey data, in order to see the full species names:
In [65]:
survey_data_species = pd.merge(survey_data_plots, species_data, how="left", # LEFT OR INNER?
left_on="species", right_on="species_id")
In [66]:
len(survey_data_species)
Out[66]:
The join is OK, but we are left with some redundant columns and confusing names:
In [67]:
survey_data_species.head()
Out[67]:
We do not need the `species_x` and `species_id` columns anymore, as we will use the scientific names from now on:
In [68]:
survey_data_species = survey_data_species.drop(["species_x", "species_id"], axis=1)
The column `species_y` can just be renamed to `species`:
In [69]:
survey_data_species = survey_data_species.rename(columns={"species_y": "species"})
In [70]:
survey_data_species.head()
Out[70]:
In [71]:
len(survey_data_species)
Out[71]:
Let's now save our cleaned-up data to a CSV file, so we can further analyze the data in a following notebook:
In [72]:
survey_data_species.to_csv("interim_survey_data_species.csv", index=False)
As the current species names are rather short and could lead to confusion when shared with other users, retrieving additional information about the different species in our dataset would be useful to integrate our work with other research. An option is to match our names against an external service to request additional information about the different species.
One of these services is the GBIF API. The service can most easily be illustrated with a small example:
In a new tab of your browser, go to the URL http://www.gbif.org/species/2475532, which corresponds to the page of Alcedo atthis (ijsvogel in Dutch). For each of the species in our list, we could do a search on the GBIF website to find the corresponding species page and extract more information manually. However, this would take a lot of time...
Therefore, GBIF (and many other organisations!) provides a service to extract the same information in a machine-readable way, in order to automate these searches. As an example, let's search for the information on Alcedo atthis using the GBIF API: go to the URL http://api.gbif.org/v1/species/match?name=Alcedo atthis and check the output. What we did is a machine-based search on the GBIF website for information about Alcedo atthis.
The same can be done using Python. The main library we need for this kind of automated search is the `requests` package, which can be used to send requests to any kind of API out there.
In [73]:
import requests
For the example of Alcedo atthis:
In [74]:
species_name = 'Alcedo atthis'
In [75]:
base_string = 'http://api.gbif.org/v1/species/match?'
request_parameters = {'verbose': False, 'strict': True, 'name': species_name}
message = requests.get(base_string, params=request_parameters).json()
message
Out[75]:
From this we get a dictionary containing more information about the taxonomy of Alcedo atthis.
In our species data set, the name to match is spread over two columns (genus and species), so we have to combine those two in order to execute the name matching:
In [76]:
genus_name = "Callipepla"
species_name = "squamata"
name_to_match = '{} {}'.format(genus_name, species_name)
base_string = 'http://api.gbif.org/v1/species/match?'
request_parameters = {'strict': True, 'name': name_to_match} # use strict matching(!)
message = requests.get(base_string, params=request_parameters).json()
message
Out[76]:
To apply this to our species data set, we will have to perform this request for each individual genus/species combination. As this is recurring functionality, we will write a small function to do it:
In [77]:
def name_match(genus_name, species_name, strict=True):
    """
    Perform a GBIF name matching using the genus and species names

    Parameters
    ----------
    genus_name: str
        name of the genus of the species
    species_name: str
        name of the species to request more information about
    strict: boolean
        define whether the matching needs to be performed with the strict
        option (True) or not (False)

    Returns
    -------
    message: dict
        dictionary with the information returned by the GBIF matching service
    """
    name = '{} {}'.format(genus_name, species_name)
    base_string = 'http://api.gbif.org/v1/species/match?'
    request_parameters = {'strict': strict, 'name': name}  # use strict matching(!)
    message = requests.get(base_string, params=request_parameters).json()
    return message
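Side note: when calling a web service from a script, a request can occasionally hang or fail. A slightly more defensive variant of the same function could, for example, add a timeout and check the HTTP status (the timeout value here is an arbitrary choice):

```python
def name_match_safe(genus_name, species_name, strict=True, timeout=5):
    """Like name_match, but fail fast on a non-responding service or HTTP error."""
    name = '{} {}'.format(genus_name, species_name)
    base_string = 'http://api.gbif.org/v1/species/match?'
    request_parameters = {'strict': strict, 'name': name}
    response = requests.get(base_string, params=request_parameters, timeout=timeout)
    response.raise_for_status()  # raise an error for a non-2xx HTTP status
    return response.json()
```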
Testing our custom matching function:
In [78]:
genus_name = "Callipepla"
species_name = "squamata"
name_match(genus_name, species_name, strict=True)
Out[78]:
However, the matching won't provide an answer for every search:
In [79]:
genus_name = "Lizard"
species_name = "sp."
name_match(genus_name, species_name, strict=True)
Out[79]:
Hence, in order to add this information to our survey DataFrame, we need to perform the matching for each of the unique genus/species combinations that occur in the `survey_data_species` data set.

Extract the unique genus/species combinations, using `drop_duplicates()`. Save the result as the variable `unique_species`.
In [80]:
#%%timeit
unique_species = survey_data_species[["genus", "species"]].drop_duplicates().dropna()
In [81]:
len(unique_species)
Out[81]:
As an alternative, extract the unique genus/species combinations using `groupby`. Save the result as the variable `unique_species`.
In [82]:
#%%timeit
unique_species = \
survey_data_species.groupby(["genus", "species"]).first().reset_index()[["genus", "species"]]
In [83]:
len(unique_species)
Out[83]:
In [84]:
unique_species["name"] = unique_species["genus"] + " " + unique_species["species"]
# an alternative approach worthwhile to know:
#unique_species["name"] = unique_species["genus"].str.cat(unique_species["species"], " ")
In [85]:
unique_species.head()
Out[85]:
To perform the matching for each of the combinations, different options exist.
Just to show the possibility of using `for` loops, we will add the matched information that way. First, we will store everything in one dictionary, where the keys of the dictionary are the index values of `unique_species` (in order to merge them again later) and the values are the entire messages (which are dictionaries themselves). The format will look as follows:

species_annotated = {0: {'canonicalName': 'Squamata', 'class': 'Reptilia', 'classKey': 358, ...},
                     1: {'canonicalName': ...},
                     2: ...}
In [86]:
species_annotated = {}
for key, row in unique_species.iterrows():
    species_annotated[key] = name_match(row["genus"], row["species"], strict=True)
In [87]:
species_annotated
Out[87]:
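As mentioned, the `for` loop is just one of the options; a sketch of the same lookup using `apply` (relying on the `name_match` function defined above) could look like this:

```python
# Alternative sketch: row-wise apply instead of an explicit for loop.
# The result is a Series of response dictionaries with the same index as unique_species.
matched = unique_species.apply(
    lambda row: name_match(row["genus"], row["species"], strict=True), axis=1)

# which could then be turned into a DataFrame directly:
df_matched = pd.DataFrame(matched.tolist(), index=matched.index)
```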
We can now transform this to a pandas DataFrame:

Convert `species_annotated` into a pandas DataFrame, with as row index the key values corresponding to `unique_species` and as column headers the output fields of the API response. Save the result as the variable `df_species_annotated`.
__Tip__: `transpose` can be used to flip rows and columns
In [88]:
df_species_annotated = pd.DataFrame(species_annotated).transpose()
In [89]:
df_species_annotated.head()
Out[89]:
Select from `df_species_annotated` the subset of columns of interest: `class`, `kingdom`, `order`, `phylum`, `scientificName`, `status` and `usageKey`. Save the result as the variable `df_species_annotated_subset`:
In [90]:
df_species_annotated_subset = df_species_annotated[['class', 'kingdom', 'order', 'phylum',
'scientificName', 'status', 'usageKey']]
In [91]:
df_species_annotated_subset.head()
Out[91]:
Join this annotation subset with the `unique_species` overview (based on their shared index) and save the result as the variable `unique_species_annotated`:
In [92]:
unique_species_annotated = pd.merge(unique_species, df_species_annotated_subset,
left_index=True, right_index=True)
Finally, add the annotated species information to the full survey data set and save the result as the variable `survey_data_completed`.
In [93]:
survey_data_completed = pd.merge(survey_data_species, unique_species_annotated,
how='left', on= ["genus", "species"])
In [94]:
len(survey_data_completed)
Out[94]:
In [95]:
survey_data_completed.head()
Out[95]:
Congratulations! You did a great cleaning job. Save your result:
In [96]:
survey_data_completed.to_csv("../data/survey_data_completed.csv", index=False)
The files `species.csv` and `surveys.csv` are used from the Data Carpentry workshop. This data is from the paper: S. K. Morgan Ernest, Thomas J. Valone, and James H. Brown. 2009. Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA. Ecology 90:1708. http://esapubs.org/archive/ecol/E090/118/
The file `plot_location.xlsx` is a dummy location file, created purely for this exercise using the plot locations on Google Maps.