CASE - Biodiversity data - analysis
DS Data manipulation, analysis and visualisation in Python
December, 2019© 2016, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons
In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')
survey_data_processed
(if you did not complete the previous notebook, a version of the csv file is available in the `../data` folder).
In [2]:
survey_data_processed = pd.read_csv("../data/survey_data_completed.csv",
parse_dates=['eventDate'], index_col="occurrenceID")
In [3]:
survey_data_processed.head()
Out[3]:
In [4]:
survey_data_processed.info()
In [5]:
survey_data_processed['species'].isnull().sum()
Out[5]:
In [6]:
survey_data_processed.duplicated().sum()
Out[6]:
In [7]:
survey_data_processed[survey_data_processed.duplicated(keep=False)].sort_values(["eventDate", "verbatimLocality"]).head(10)
Out[7]:
survey_data_unique
__Tip__: Next to finding `duplicated` values, Pandas has a function to `drop duplicates`...
In [8]:
survey_data_unique = survey_data_processed.drop_duplicates()
In [9]:
len(survey_data_unique)
Out[9]:
In [10]:
len(survey_data_unique.dropna())
Out[10]:
not_identified
__Tip__: next to `isnull`, also `notnull` exists...
In [11]:
mask = survey_data_unique['species'].isnull() & survey_data_unique['sex'].notnull()
not_identified = survey_data_unique[mask]
In [12]:
not_identified.head()
Out[12]:
survey_data
. Make sure survey_data
is a copy of the original DataFrame. This is the DataFrame we will use in the further analyses.
In [13]:
survey_data = survey_data_unique.dropna(subset=['species']).copy()
In [14]:
survey_data.resample('A', on='eventDate').size().plot()
Out[14]:
To evaluate the intensity or number of occurrences during different time spans, a heatmap is an interesting representation. We can actually use the plotnine library as well to make heatmaps, as it provides the geom_tile
geometry. Loading the library:
In [15]:
import plotnine as pn
heatmap_prep_plotnine
, based on the survey_data
DataFrame with a column for the years, a column for the months a column with the counts (called `count`).
__Tip__: You have to count for each year/month combination. Also `reset_index` could be useful.
In [16]:
heatmap_prep_plotnine = survey_data.groupby([survey_data['eventDate'].dt.year,
survey_data['eventDate'].dt.month]).size()
heatmap_prep_plotnine.index.names = ["year", "month"]
heatmap_prep_plotnine = heatmap_prep_plotnine.reset_index(name='count')
In [17]:
heatmap_prep_plotnine.head()
Out[17]:
heatmap_prep_plotnine
, make a heatmap using the plotnine package.
__Tip__: When in trouble, check [this section of the documentation](http://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_tile.html#Annotated-Heatmap)
In [18]:
(pn.ggplot(heatmap_prep_plotnine, pn.aes(x="factor(month)", y="year", fill="count"))
+ pn.geom_tile()
+ pn.scale_fill_cmap("Reds")
+ pn.scale_y_reverse()
+ pn.theme(
axis_ticks=pn.element_blank(),
panel_background=pn.element_rect(fill='white'))
)
Out[18]:
Remark that we started from a tidy
data format (also called long format).
The heatmap functionality is also provided by the plotting library seaborn (check the docs!). Based on the documentation, seaborn uses the short format with in the row index the years, in the column the months and the counts for each of these year/month combinations as values.
Let's reformat the heatmap_prep_plotnine
data to be useable for the seaborn heatmap function:
heatmap_prep_sns
, based on the heatmap_prep_plotnine
DataFrame with in the row index the years, in the column the months and as values of the table, the counts for each of these year/month combinations.
__Tip__: The `pandas_07_reshaping_data.ipynb` notebook provides all you need to know
In [19]:
heatmap_prep_sns = heatmap_prep_plotnine.pivot_table(index='year', columns='month', values='count')
heatmap_prep_sns
variable.
In [20]:
fig, ax = plt.subplots(figsize=(10, 8))
ax = sns.heatmap(heatmap_prep_sns, cmap='Reds')
heatmap_prep_sns
DataFrame, return to the long format of the table with the columns `year`, `month` and `count` and call the resulting variable heatmap_tidy
.
__Tip__: The `pandas_07_reshaping_data.ipynb` notebook provides all you need to know, but a `reset_index` could be useful as well
In [21]:
heatmap_tidy = heatmap_prep_sns.reset_index().melt(id_vars=["year"], value_name="count")
heatmap_tidy.head()
Out[21]:
The name of the observed species consists of two parts: the 'genus' and 'species' columns. For the further analyses, we want the combined name. This is already available as the 'name' column if you completed the previous notebook, otherwise you can add this again in the following exercise.
In [22]:
survey_data['name'] = survey_data['genus'] + ' ' + survey_data['species']
In [23]:
survey_data.groupby("name").size().nlargest(8)
Out[23]:
In [24]:
survey_data['name'].value_counts()[:8]
Out[24]:
In [25]:
species_per_plot = survey_data.reset_index().pivot_table(index="name",
columns="verbatimLocality",
values="occurrenceID",
aggfunc='count')
# alternative ways to calculate this
#species_per_plot = survey_data.groupby(['name', 'plot_id']).size().unstack(level=-1)
#species_per_plot = pd.crosstab(survey_data['name'], survey_data['plot_id'])
In [26]:
fig, ax = plt.subplots(figsize=(8,8))
sns.heatmap(species_per_plot, ax=ax, cmap='Reds')
Out[26]:
In [27]:
n_species_per_plot = survey_data.groupby(["verbatimLocality"])["name"].nunique()
fig, ax = plt.subplots(figsize=(6, 6))
n_species_per_plot.plot(kind="barh", ax=ax, color="lightblue")
ax.set_ylabel("plot number")
# Alternative option:
# inspired on the pivot table we already had:
# species_per_plot = survey_data.reset_index().pivot_table(
# index="name", columns="verbatimLocality", values="occurrenceID", aggfunc='count')
# n_species_per_plot = species_per_plot.count()
Out[27]:
In [28]:
n_plots_per_species = survey_data.groupby(["name"])["verbatimLocality"].nunique().sort_values()
fig, ax = plt.subplots(figsize=(8, 8))
n_plots_per_species.plot(kind="barh", ax=ax, color='0.4')
# Alternatives
# species_per_plot2 = survey_data.reset_index().pivot_table(index="verbatimLocality",
# columns="name",
# values="occurrenceID",
# aggfunc='count')
# nplots_per_species = species_per_plot2.count().sort_values(ascending=False)
# or
# species_per_plot.count(axis=1).sort_values(ascending=False).plot(kind='bar')
Out[28]:
n_plot_sex
.
__Tip__: Release the power of `unstack`...
In [31]:
subselection_sex = survey_data.dropna(subset=["sex"])
#subselection_sex = survey_data[survey_data["sex"].notnull()]
In [32]:
n_plot_sex = subselection_sex.groupby(["sex", "verbatimLocality"]).size().unstack(level=0)
n_plot_sex.head()
Out[32]:
As such, we can use the variable n_plot_sex
to plot the result:
In [33]:
n_plot_sex.plot(kind='bar', figsize=(12, 6), rot=0)
Out[33]:
subselection_sex
.
__Tip__: When in trouble, check these [docs](http://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_col.html#Two-Variable-Bar-Plot).
In [34]:
(pn.ggplot(subselection_sex, pn.aes(x="verbatimLocality", fill="sex"))
+ pn.geom_bar(position='dodge')
+ pn.scale_x_discrete(breaks=np.arange(1, 25, 1), limits=np.arange(1, 25, 1))
)
Out[34]:
In [35]:
survey_data["taxa"].unique()
Out[35]:
In [36]:
survey_data['taxa'].value_counts()
#survey_data.groupby('taxa').size()
Out[36]:
In [37]:
non_rodent_species = survey_data[survey_data['taxa'].isin(['Rabbit', 'Bird', 'Reptile'])]
In [38]:
len(non_rodent_species)
Out[38]:
r_species
.
__Tip__: Remember the `.str.` construction to provide all kind of string functionalities?
In [39]:
r_species = survey_data[survey_data['taxa'].str.lower().str.startswith('ro')]
In [40]:
len(r_species)
Out[40]:
non_bird_species
.
In [41]:
non_bird_species = survey_data[survey_data['taxa'] != 'Bird']
In [42]:
len(non_bird_species)
Out[42]:
In this section, all plots can be made with the embedded Pandas plot function, unless specificly asked
In [43]:
merriami = survey_data[survey_data["name"] == "Dipodomys merriami"]
In [44]:
fig, ax = plt.subplots()
merriami.groupby(merriami['eventDate'].dt.year).size().plot(ax=ax)
ax.set_xlabel("")
ax.set_ylabel("number of occurrences")
Out[44]:
In [45]:
merriami = survey_data[survey_data["species"] == "merriami"]
fig, ax = plt.subplots(2, 1, figsize=(14, 8))
merriami.groupby(merriami['eventDate']).size().plot(ax=ax[0], style="-") # top graph
merriami.resample("D", on="eventDate").size().plot(ax=ax[1], style="-") # lower graph
Out[45]:
In [46]:
subsetspecies = survey_data[survey_data["name"].isin(['Dipodomys merriami', 'Dipodomys ordii',
'Reithrodontomys megalotis', 'Chaetodipus baileyi'])]
In [47]:
month_evolution = subsetspecies.groupby("name").resample('M', on='eventDate').size()
In [48]:
species_evolution = month_evolution.unstack(level=0)
axs = species_evolution.plot(subplots=True, figsize=(14, 8), sharey=True)
In [49]:
subsetspecies = survey_data[survey_data["name"].isin(['Dipodomys merriami', 'Dipodomys ordii',
'Reithrodontomys megalotis', 'Chaetodipus baileyi'])]
month_evolution = subsetspecies.groupby("name").resample('M', on='eventDate').size()
In [50]:
(pn.ggplot(month_evolution.reset_index(name='count'),
pn.aes(x='eventDate', y='count', color='name'))
+ pn.geom_line()
+ pn.facet_wrap('name', nrow=4)
+ pn.theme_light()
)
Out[50]:
In [51]:
year_evolution = survey_data.groupby("taxa").resample('A', on='eventDate').size()
species_evolution = year_evolution.unstack(level=0)
axs = species_evolution.plot(subplots=True, figsize=(16, 8), sharey=False)
count_weekday_years
In [52]:
count_weekday_years = survey_data.groupby([survey_data["eventDate"].dt.year, survey_data["eventDate"].dt.dayofweek]).size().unstack()
In [53]:
# Alternative
#years = survey_data["eventDate"].dt.year.rename('year')
#dayofweaks = survey_data["eventDate"].dt.dayofweek.rename('dayofweak')
#count_weekday_years = pd.crosstab(index=years, columns=dayofweaks)
In [54]:
count_weekday_years.head()
Out[54]:
In [55]:
count_weekday_years.plot()
Out[55]:
In [56]:
fig, ax = plt.subplots()
count_weekday_years.median(axis=0).plot(kind='barh', ax=ax, color='#66b266')
xticks = ax.set_yticklabels(['Monday', 'Tuesday', 'Wednesday', "Thursday", "Friday", "Saturday", "Sunday"])
Nice work!