CASE - Biodiversity data - analysis

DS Data manipulation, analysis and visualisation in Python
December, 2019

© 2016, Joris Van den Bossche and Stijn Van Hoey (jorisvandenbossche@gmail.com, stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons



In [1]:

    
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('seaborn-whitegrid')

Reading in the enriched survey data set

EXERCISE:

Read in the 'survey_data_completed.csv' file and save the resulting DataFrame as variable survey_data_processed (if you did not complete the previous notebook, a version of the csv file is available in the `../data` folder).
Interpret the 'eventDate' column directly as a datetime and make sure the 'occurrenceID' column is used as the index of the resulting DataFrame (both can be done at once when reading the csv file, using parameters of the `read_csv` function).
Inspect the resulting frame (remember `.head()` and `.info()`) and check that the 'eventDate' column indeed has a datetime data type.



In [2]:

    
survey_data_processed = pd.read_csv("../data/survey_data_completed.csv", 
                                    parse_dates=['eventDate'], index_col="occurrenceID")
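
A minimal sketch of what the two `read_csv` parameters do, using a made-up three-row CSV held in memory instead of the real survey file (the rows are invented, only the column names are borrowed from the data set):

```python
import io
import pandas as pd

# Hypothetical stand-in for survey_data_completed.csv
csv_text = """occurrenceID,eventDate,verbatimLocality
1,1977-07-16,2
2,1977-07-16,3
3,1977-07-17,2
"""

demo = pd.read_csv(io.StringIO(csv_text),
                   parse_dates=["eventDate"],   # convert the text to datetime64 values
                   index_col="occurrenceID")    # use this column as the row index

print(demo.dtypes)
print(demo.index.name)
```

Without `parse_dates`, the 'eventDate' column would be read as plain strings and the `.dt` accessor used later in this notebook would not be available.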



In [3]:

    
survey_data_processed.head()









    Out[3]:







  
    
      
	verbatimLocality	verbatimSex	wgt	datasetName	sex	eventDate	decimalLongitude	decimalLatitude	genus	species	taxa	name	class	kingdom	order	phylum	scientificName	status	usageKey
occurrenceID
1	2	M	NaN	Ecological Archives E090-118-D1.	male	1977-07-16	-109.081975	31.938887	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	3	M	NaN	Ecological Archives E090-118-D1.	male	1977-07-16	-109.081208	31.938896	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	2	F	NaN	Ecological Archives E090-118-D1.	female	1977-07-16	-109.081975	31.938887	Dipodomys	merriami	Rodent	Dipodomys merriami	Mammalia	Animalia	Rodentia	Chordata	Dipodomys merriami Mearns, 1890	ACCEPTED	2439521.0
4	7	M	NaN	Ecological Archives E090-118-D1.	male	1977-07-16	-109.082816	31.938113	Dipodomys	merriami	Rodent	Dipodomys merriami	Mammalia	Animalia	Rodentia	Chordata	Dipodomys merriami Mearns, 1890	ACCEPTED	2439521.0
5	3	M	NaN	Ecological Archives E090-118-D1.	male	1977-07-16	-109.081208	31.938896	Dipodomys	merriami	Rodent	Dipodomys merriami	Mammalia	Animalia	Rodentia	Chordata	Dipodomys merriami Mearns, 1890	ACCEPTED	2439521.0



In [4]:

    
survey_data_processed.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 35550 entries, 1 to 35550
Data columns (total 19 columns):
verbatimLocality    35550 non-null int64
verbatimSex         33042 non-null object
wgt                 32283 non-null float64
datasetName         35550 non-null object
sex                 33041 non-null object
eventDate           35550 non-null datetime64[ns]
decimalLongitude    35550 non-null float64
decimalLatitude     35550 non-null float64
genus               33535 non-null object
species             33535 non-null object
taxa                33535 non-null object
name                33535 non-null object
class               33448 non-null object
kingdom             33448 non-null object
order               33448 non-null object
phylum              33448 non-null object
scientificName      33448 non-null object
status              33448 non-null object
usageKey            33448 non-null float64
dtypes: datetime64[ns](1), float64(4), int64(1), object(13)
memory usage: 5.4+ MB

Tackle missing values (NaN) and duplicate values

EXERCISE: How many records are in the data set without information on the 'species' name?



In [5]:

    
survey_data_processed['species'].isnull().sum()









    Out[5]:





2015

EXERCISE: How many duplicate records are present in the dataset? _Tip_: Pandas has a function to find `duplicated` values...



In [6]:

    
survey_data_processed.duplicated().sum()









    Out[6]:





1577

EXERCISE: Extract a list of all duplicates, sort on the columns `eventDate` and `verbatimLocality` and show the first 10 records. _Tip_: Check the documentation of `duplicated`.



In [7]:

    
survey_data_processed[survey_data_processed.duplicated(keep=False)].sort_values(["eventDate", "verbatimLocality"]).head(10)









    Out[7]:







  
    
      
	verbatimLocality	verbatimSex	wgt	datasetName	sex	eventDate	decimalLongitude	decimalLatitude	genus	species	taxa	name	class	kingdom	order	phylum	scientificName	status	usageKey
occurrenceID
5	3	M	NaN	Ecological Archives E090-118-D1.	male	1977-07-16	-109.081208	31.938896	Dipodomys	merriami	Rodent	Dipodomys merriami	Mammalia	Animalia	Rodentia	Chordata	Dipodomys merriami Mearns, 1890	ACCEPTED	2439521.0
14	3	M	NaN	Ecological Archives E090-118-D1.	male	1977-07-16	-109.081208	31.938896	Dipodomys	merriami	Rodent	Dipodomys merriami	Mammalia	Animalia	Rodentia	Chordata	Dipodomys merriami Mearns, 1890	ACCEPTED	2439521.0
4	7	M	NaN	Ecological Archives E090-118-D1.	male	1977-07-16	-109.082816	31.938113	Dipodomys	merriami	Rodent	Dipodomys merriami	Mammalia	Animalia	Rodentia	Chordata	Dipodomys merriami Mearns, 1890	ACCEPTED	2439521.0
13	7	M	NaN	Ecological Archives E090-118-D1.	male	1977-07-16	-109.082816	31.938113	Dipodomys	merriami	Rodent	Dipodomys merriami	Mammalia	Animalia	Rodentia	Chordata	Dipodomys merriami Mearns, 1890	ACCEPTED	2439521.0
34	11	F	NaN	Ecological Archives E090-118-D1.	female	1977-07-17	-109.079307	31.938056	Dipodomys	merriami	Rodent	Dipodomys merriami	Mammalia	Animalia	Rodentia	Chordata	Dipodomys merriami Mearns, 1890	ACCEPTED	2439521.0
38	11	F	NaN	Ecological Archives E090-118-D1.	female	1977-07-17	-109.079307	31.938056	Dipodomys	merriami	Rodent	Dipodomys merriami	Mammalia	Animalia	Rodentia	Chordata	Dipodomys merriami Mearns, 1890	ACCEPTED	2439521.0
40	11	F	NaN	Ecological Archives E090-118-D1.	female	1977-07-17	-109.079307	31.938056	Dipodomys	merriami	Rodent	Dipodomys merriami	Mammalia	Animalia	Rodentia	Chordata	Dipodomys merriami Mearns, 1890	ACCEPTED	2439521.0
27	15	M	NaN	Ecological Archives E090-118-D1.	male	1977-07-17	-109.081036	31.937059	Dipodomys	merriami	Rodent	Dipodomys merriami	Mammalia	Animalia	Rodentia	Chordata	Dipodomys merriami Mearns, 1890	ACCEPTED	2439521.0
28	15	M	NaN	Ecological Archives E090-118-D1.	male	1977-07-17	-109.081036	31.937059	Dipodomys	merriami	Rodent	Dipodomys merriami	Mammalia	Animalia	Rodentia	Chordata	Dipodomys merriami Mearns, 1890	ACCEPTED	2439521.0
45	18	F	NaN	Ecological Archives E090-118-D1.	female	1977-07-18	-109.078633	31.937126	Dipodomys	merriami	Rodent	Dipodomys merriami	Mammalia	Animalia	Rodentia	Chordata	Dipodomys merriami Mearns, 1890	ACCEPTED	2439521.0
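
The `keep` argument decides which members of a group of identical rows get flagged. A small sketch on a made-up frame (the values and index labels are hypothetical) shows the difference between the default and `keep=False`:

```python
import pandas as pd

# Hypothetical records: rows 5, 14 and 20 have identical column values
demo = pd.DataFrame({"plot": [3, 7, 3, 3], "sex": ["M", "F", "M", "M"]},
                    index=[5, 4, 14, 20])

later_copies = demo.duplicated()           # default keep='first': the first occurrence is not flagged
all_copies = demo.duplicated(keep=False)   # every member of a duplicate group is flagged

print(later_copies.sum(), all_copies.sum())
print(len(demo.drop_duplicates()))
```

Note that only the column values are compared; the index (here the record identifier) plays no role, which is why records with different identifiers can still count as duplicates.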

EXERCISE: Exclude the duplicate values from the survey data set and save the result as survey_data_unique __Tip__: Next to finding `duplicated` values, Pandas has a function to `drop duplicates`...



In [8]:

    
survey_data_unique = survey_data_processed.drop_duplicates()



In [9]:

    
len(survey_data_unique)









    Out[9]:





33973

EXERCISE: For how many records (rows) do we have all the information available (i.e. no NaN values in any of the columns)? __Tip__: Just counting the NaN (null) values won't work; maybe `dropna` can help you?



In [10]:

    
len(survey_data_unique.dropna())









    Out[10]:





29777

EXERCISE: Select the subset of records that have no species name but do have information on the sex, and store the result as variable not_identified. __Tip__: next to `isnull`, `notnull` also exists...



In [11]:

    
mask = survey_data_unique['species'].isnull() & survey_data_unique['sex'].notnull()
not_identified = survey_data_unique[mask]



In [12]:

    
not_identified.head()









    Out[12]:







  
    
      
	verbatimLocality	verbatimSex	wgt	datasetName	sex	eventDate	decimalLongitude	decimalLatitude	genus	species	taxa	name	class	kingdom	order	phylum	scientificName	status	usageKey
occurrenceID
1	2	M	NaN	Ecological Archives E090-118-D1.	male	1977-07-16	-109.081975	31.938887	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	3	M	NaN	Ecological Archives E090-118-D1.	male	1977-07-16	-109.081208	31.938896	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
23	15	F	NaN	Ecological Archives E090-118-D1.	female	1977-07-17	-109.081036	31.937059	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
39	17	M	NaN	Ecological Archives E090-118-D1.	male	1977-07-17	-109.079415	31.937117	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
73	2	M	NaN	Ecological Archives E090-118-D1.	male	1977-08-19	-109.081975	31.938887	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

EXERCISE: Select only those records that do have species information and save them as the variable survey_data. Make sure survey_data is a copy of the original DataFrame. This is the DataFrame we will use in the further analyses.



In [13]:

    
survey_data = survey_data_unique.dropna(subset=['species']).copy()
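
The different ways of handling missing values used in this section can be compared side by side on a tiny frame. This is only a sketch with invented values; the `.copy()` matters because a new column is added to the subset afterwards:

```python
import numpy as np
import pandas as pd

# Hypothetical records with gaps in both columns
demo = pd.DataFrame({"species": ["merriami", np.nan, "ordii", np.nan],
                     "sex": ["male", "female", np.nan, np.nan]})

complete = demo.dropna()                                # rows without any NaN at all
with_species = demo.dropna(subset=["species"]).copy()   # only 'species' has to be present
unidentified = demo[demo["species"].isnull() & demo["sex"].notnull()]

# Adding a column to the independent copy leaves `demo` untouched
with_species["name"] = "Dipodomys " + with_species["species"]

print(len(complete), len(with_species), len(unidentified))
```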

NOTE: For biodiversity studies, absence values (knowing that something is not present) are useful as well to normalize the observations, but this is out of scope for these exercises.

Observations over time

EXERCISE: Make a plot visualizing the evolution of the number of observations for each of the individual years (i.e. annual counts). __Tip__: The `pandas_04_time_series_data.ipynb` notebook introduced a powerful command to resample a time series.



In [14]:

    
survey_data.resample('A', on='eventDate').size().plot()









    Out[14]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fbfec81ead0>
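
The annual counts can also be obtained by grouping on the year of the date, which avoids the frequency alias altogether (recent pandas versions spell the year-end alias `'YE'` instead of `'A'`). A sketch on three invented dates:

```python
import pandas as pd

# Hypothetical observation dates; note that 1978 has no records
demo = pd.DataFrame({"eventDate": pd.to_datetime(
    ["1977-07-16", "1977-08-19", "1979-01-08"])})

per_year = demo.groupby(demo["eventDate"].dt.year).size()
print(per_year)
```

Unlike `resample`, this approach only returns years that actually occur in the data, so a year without observations is absent rather than counted as zero.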

To evaluate the intensity or number of occurrences during different time spans, a heatmap is an interesting representation. We can actually use the plotnine library as well to make heatmaps, as it provides the geom_tile geometry. Loading the library:



In [15]:

    
import plotnine as pn

EXERCISE: Create a table, called heatmap_prep_plotnine, based on the survey_data DataFrame with a column for the years, a column for the months and a column with the counts (called `count`). __Tip__: You have to count for each year/month combination. Also `reset_index` could be useful.



In [16]:

    
heatmap_prep_plotnine = survey_data.groupby([survey_data['eventDate'].dt.year, 
                                             survey_data['eventDate'].dt.month]).size()
heatmap_prep_plotnine.index.names = ["year", "month"]
heatmap_prep_plotnine = heatmap_prep_plotnine.reset_index(name='count')



In [17]:

    
heatmap_prep_plotnine.head()
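
The three steps (count per year/month, rename the index levels, move them to columns) can be traced on a handful of invented dates. This is a sketch of the same pattern, not the real data:

```python
import pandas as pd

# Hypothetical observation dates
dates = pd.to_datetime(["1977-07-16", "1977-07-17", "1977-08-19", "1978-01-08"])
demo = pd.DataFrame({"eventDate": dates})

counts = demo.groupby([demo["eventDate"].dt.year,
                       demo["eventDate"].dt.month]).size()
counts.index.names = ["year", "month"]     # both levels would be called 'eventDate' otherwise
tidy = counts.reset_index(name="count")    # index levels become regular columns
print(tidy)
```

Renaming the index levels first is what makes `reset_index` work here: two levels with the same name cannot both be turned into columns.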

EXERCISE: Based on heatmap_prep_plotnine, make a heatmap using the plotnine package. __Tip__: When in trouble, check [this section of the documentation](http://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_tile.html#Annotated-Heatmap)



In [18]:

    
(pn.ggplot(heatmap_prep_plotnine, pn.aes(x="factor(month)", y="year", fill="count"))
    + pn.geom_tile()
    + pn.scale_fill_cmap("Reds")
    + pn.scale_y_reverse()
    + pn.theme( 
     axis_ticks=pn.element_blank(),
     panel_background=pn.element_rect(fill='white'))
)









    












    Out[18]:





<ggplot: (8778890668909)>

Note that we started from a tidy data format (also called long format).

The heatmap functionality is also provided by the plotting library seaborn (check the docs!). Based on the documentation, seaborn expects the wide format: the years in the row index, the months in the columns and the counts for each of these year/month combinations as values.

Let's reformat the heatmap_prep_plotnine data to be useable for the seaborn heatmap function:

EXERCISE: Create a table, called heatmap_prep_sns, based on the heatmap_prep_plotnine DataFrame with in the row index the years, in the column the months and as values of the table, the counts for each of these year/month combinations. __Tip__: The `pandas_07_reshaping_data.ipynb` notebook provides all you need to know



In [19]:

    
heatmap_prep_sns = heatmap_prep_plotnine.pivot_table(index='year', columns='month', values='count')
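
A small sketch of the long-to-wide step on invented counts. `pivot_table` aggregates with the mean by default, which is harmless here because every year/month pair occurs only once; combinations that are missing from the input end up as NaN:

```python
import pandas as pd

# Hypothetical tidy table; the pair (1978, 8) is missing on purpose
tidy = pd.DataFrame({"year": [1977, 1977, 1978],
                     "month": [7, 8, 7],
                     "count": [2, 1, 5]})

wide = tidy.pivot_table(index="year", columns="month", values="count")
print(wide)
```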

EXERCISE: Using the seaborn documentation make a heatmap starting from the heatmap_prep_sns variable.



In [20]:

    
fig, ax = plt.subplots(figsize=(10, 8))
ax = sns.heatmap(heatmap_prep_sns, cmap='Reds')

EXERCISE: Based on the heatmap_prep_sns DataFrame, return to the long format of the table with the columns `year`, `month` and `count` and call the resulting variable heatmap_tidy. __Tip__: The `pandas_07_reshaping_data.ipynb` notebook provides all you need to know, but a `reset_index` could be useful as well



In [21]:

    
heatmap_tidy = heatmap_prep_sns.reset_index().melt(id_vars=["year"], value_name="count")
heatmap_tidy.head()
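
The way back from wide to long can be sketched on a hand-built wide table (values invented). Because the columns of the wide table carry the name 'month', `melt` uses that name for the new column without an explicit `var_name`:

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: years in the index, months in the columns
wide = pd.DataFrame({7: [2.0, 5.0], 8: [1.0, np.nan]},
                    index=pd.Index([1977, 1978], name="year"))
wide.columns.name = "month"

tidy_again = wide.reset_index().melt(id_vars=["year"], value_name="count")
print(tidy_again)
```

The NaN cells of the wide table survive the round trip as rows with a missing count, so the long table can be longer than the one the pivot started from.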

Species abundance for each of the plots

The name of the observed species consists of two parts: the 'genus' and 'species' columns. For the further analyses, we want the combined name. This is already available as the 'name' column if you completed the previous notebook, otherwise you can add this again in the following exercise.

EXERCISE: Make a new column 'name' that combines the 'genus' and 'species' columns (with a space in between). __Tip__: You are aware you can add strings in Python, 'a' + 'b' = 'ab'?



In [22]:

    
survey_data['name'] = survey_data['genus'] + ' ' + survey_data['species']

EXERCISE: Which 8 species have been observed most often? __Tip__: Pandas provides a function to combine sorting and showing the first n records, see [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.nlargest.html)...



In [23]:

    
survey_data.groupby("name").size().nlargest(8)









    Out[23]:





name
Dipodomys merriami           10025
Dipodomys ordii               2966
Chaetodipus penicillatus      2928
Chaetodipus baileyi           2696
Reithrodontomys megalotis     2485
Dipodomys spectabilis         2481
Onychomys torridus            2220
Perognathus flavus            1475
dtype: int64



In [24]:

    
survey_data['name'].value_counts()[:8]









    Out[24]:





Dipodomys merriami           10025
Dipodomys ordii               2966
Chaetodipus penicillatus      2928
Chaetodipus baileyi           2696
Reithrodontomys megalotis     2485
Dipodomys spectabilis         2481
Onychomys torridus            2220
Perognathus flavus            1475
Name: name, dtype: int64
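
Both approaches give the same ranking, as a sketch on a short invented list of names shows. `value_counts` already sorts in descending order, so taking the first entries is enough:

```python
import pandas as pd

# Hypothetical species names, one entry per observation
names = pd.Series(["merriami", "ordii", "merriami", "baileyi", "merriami", "ordii"])

top_by_groupby = names.groupby(names).size().nlargest(2)
top_by_value_counts = names.value_counts().head(2)

print(top_by_groupby)
print(top_by_value_counts)
```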

EXERCISE: How many records are available of each of the species in each of the plots (called `verbatimLocality`)? How would you visualize this information with seaborn?



In [25]:

    
species_per_plot = survey_data.reset_index().pivot_table(index="name", 
                                                         columns="verbatimLocality", 
                                                         values="occurrenceID", 
                                                         aggfunc='count')

# alternative ways to calculate this
#species_per_plot = survey_data.groupby(['name', 'verbatimLocality']).size().unstack(level=-1)
#species_per_plot = pd.crosstab(survey_data['name'], survey_data['verbatimLocality'])
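
The three routes can be compared on a tiny invented frame. They agree on the counts, but differ for species/plot combinations that never occur: `pivot_table` and `unstack` leave NaN there, while `crosstab` fills in 0:

```python
import pandas as pd

# Hypothetical observations; species 'B b' was never seen in plot 2
demo = pd.DataFrame({"name": ["A a", "A a", "B b", "A a"],
                     "verbatimLocality": [1, 1, 1, 2]},
                    index=pd.Index([10, 11, 12, 13], name="occurrenceID"))

via_pivot = demo.reset_index().pivot_table(index="name", columns="verbatimLocality",
                                           values="occurrenceID", aggfunc="count")
via_groupby = demo.groupby(["name", "verbatimLocality"]).size().unstack(level=-1)
via_crosstab = pd.crosstab(demo["name"], demo["verbatimLocality"])

print(via_pivot)
print(via_crosstab)
```

For a heatmap the difference is visible: NaN cells are left blank, whereas zeros are drawn in the lightest colour.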



In [26]:

    
fig, ax = plt.subplots(figsize=(8,8))
sns.heatmap(species_per_plot, ax=ax, cmap='Reds')









    Out[26]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fbfea62b210>

EXERCISE: What is the number of different species in each of the plots? Make a bar chart, using the Pandas `plot` function, showing for each plot the diversity of species. Define a matplotlib figure and ax to make the plot, and change the y-label to 'plot number'. __Tip__: next to `unique`, Pandas also provides a function `nunique`...



In [27]:

    
n_species_per_plot = survey_data.groupby(["verbatimLocality"])["name"].nunique()

fig, ax = plt.subplots(figsize=(6, 6))
n_species_per_plot.plot(kind="barh", ax=ax, color="lightblue")
ax.set_ylabel("plot number")

# Alternative option:
# inspired on the pivot table we already had:
# species_per_plot = survey_data.reset_index().pivot_table(
#     index="name", columns="verbatimLocality", values="occurrenceID", aggfunc='count')
# n_species_per_plot = species_per_plot.count()









    Out[27]:





Text(0, 0.5, 'plot number')
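
The difference between counting records and counting distinct species per plot is easy to see on a few invented rows (a sketch, not the survey data):

```python
import pandas as pd

# Hypothetical observations: plot 1 has three records of two species
demo = pd.DataFrame({"verbatimLocality": [1, 1, 1, 2, 2],
                     "name": ["A a", "A a", "B b", "A a", "A a"]})

n_species = demo.groupby("verbatimLocality")["name"].nunique()   # distinct names per plot
n_records = demo.groupby("verbatimLocality")["name"].size()      # all records per plot

print(n_species)
print(n_records)
```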

EXERCISE: In how many plots has each species been observed? Make a horizontal bar chart using the Pandas `plot` function, showing for each species the number of plots in which it occurs, with the species names sorted by that number of plots.



In [28]:

    
n_plots_per_species = survey_data.groupby(["name"])["verbatimLocality"].nunique().sort_values()

fig, ax = plt.subplots(figsize=(8, 8))
n_plots_per_species.plot(kind="barh", ax=ax, color='0.4')

# Alternatives
# species_per_plot2 = survey_data.reset_index().pivot_table(index="verbatimLocality",
#                                                           columns="name",
#                                                           values="occurrenceID",
#                                                           aggfunc='count')
# n_plots_per_species = species_per_plot2.count().sort_values(ascending=False)
# or
# species_per_plot.count(axis=1).sort_values(ascending=False).plot(kind='bar')









    Out[28]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fbfea4c9d90>

EXERCISE: First, exclude the NaN-values from the `sex` column and save the result as a new variable called `subselection_sex`. Based on this variable `subselection_sex`, calculate the number of males and females present in each of the plots. Save the result (with the verbatimLocality as index and sex as column names) as a variable n_plot_sex. __Tip__: Release the power of `unstack`...



In [31]:

    
subselection_sex = survey_data.dropna(subset=["sex"])
#subselection_sex = survey_data[survey_data["sex"].notnull()]



In [32]:

    
n_plot_sex = subselection_sex.groupby(["sex", "verbatimLocality"]).size().unstack(level=0)
n_plot_sex.head()









    Out[32]:







  
    
sex	female	male
verbatimLocality
1	792	1027
2	838	1018
3	810	742
4	825	972
5	495	553
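
What `unstack(level=0)` does to the grouped counts can be followed on a few invented rows: the first index level ('sex') moves to the columns, while the remaining level stays as the row index:

```python
import pandas as pd

# Hypothetical observations
demo = pd.DataFrame({"sex": ["male", "female", "male", "male", "female"],
                     "verbatimLocality": [1, 1, 1, 2, 2]})

counts = demo.groupby(["sex", "verbatimLocality"]).size()   # Series with a (sex, plot) MultiIndex
table = counts.unstack(level=0)                              # level 0 ('sex') becomes the columns
print(table)
```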

As such, we can use the variable n_plot_sex to plot the result:



In [33]:

    
n_plot_sex.plot(kind='bar', figsize=(12, 6), rot=0)









    Out[33]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fbfe88a3c10>

EXERCISE: Create the previous plot with the plotnine library, directly from the variable subselection_sex. __Tip__: When in trouble, check these [docs](http://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_col.html#Two-Variable-Bar-Plot).



In [34]:

    
(pn.ggplot(subselection_sex, pn.aes(x="verbatimLocality", fill="sex"))
     + pn.geom_bar(position='dodge')
     + pn.scale_x_discrete(breaks=np.arange(1, 25, 1), limits=np.arange(1, 25, 1))
)









    












    Out[34]:





<ggplot: (8778888519333)>

Select subsets according to taxa of species



In [35]:

    
survey_data["taxa"].unique()









    Out[35]:





array(['Rodent', 'Rodent-not censused', 'Rabbit', 'Bird', 'Reptile'],
      dtype=object)



In [36]:

    
survey_data['taxa'].value_counts()
#survey_data.groupby('taxa').size()









    Out[36]:





Rodent                 30939
Rodent-not censused      595
Bird                     354
Rabbit                    59
Reptile                   14
Name: taxa, dtype: int64

EXERCISE: Select the records for which the `taxa` is equal to 'Rabbit', 'Bird' or 'Reptile'. Call the resulting variable `non_rodent_species`. __Tip__: You do not have to combine three different conditions, as Pandas has a function to check if something is in a certain list of values



In [37]:

    
non_rodent_species = survey_data[survey_data['taxa'].isin(['Rabbit', 'Bird', 'Reptile'])]



In [38]:

    
len(non_rodent_species)









    Out[38]:





427
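
`isin` is shorthand for a chain of equality tests combined with `|`. A sketch on an invented Series confirms that both masks are identical:

```python
import pandas as pd

# Hypothetical taxa values
taxa = pd.Series(["Rodent", "Bird", "Rabbit", "Rodent-not censused", "Reptile"])

mask_isin = taxa.isin(["Rabbit", "Bird", "Reptile"])
mask_or = (taxa == "Rabbit") | (taxa == "Bird") | (taxa == "Reptile")

print(mask_isin.equals(mask_or))
```

The result also matches the `value_counts` output above: 354 + 59 + 14 = 427 records.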

EXERCISE: Select the records for which the `taxa` starts with 'ro' (make sure it does not matter if a capital character is used in the 'taxa' name). Call the resulting variable r_species. __Tip__: Remember the `.str.` construction to provide all kinds of string functionality?



In [39]:

    
r_species = survey_data[survey_data['taxa'].str.lower().str.startswith('ro')]



In [40]:

    
len(r_species)









    Out[40]:





31534
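
A sketch of the case-insensitive prefix test on invented values, including a missing one. The survey_data frame has no missing 'taxa' values at this point, so the `na=False` argument is only an extra safeguard shown here as an assumption about messier data:

```python
import pandas as pd

# Hypothetical taxa values with mixed capitalisation and one missing entry
taxa = pd.Series(["Rodent", "rodent-not censused", "Bird", "ROBIN", None])

# Lower-casing first makes the test independent of capitalisation;
# na=False treats missing values as "no match" so the mask is purely boolean
starts_ro = taxa.str.lower().str.startswith("ro", na=False)

print(taxa[starts_ro])
```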

EXERCISE: Select the records that are not Birds. Call the resulting variable non_bird_species.



In [41]:

    
non_bird_species = survey_data[survey_data['taxa'] != 'Bird']



In [42]:

    
len(non_bird_species)









    Out[42]:





31607

(OPTIONAL SECTION) Evolution of species during monitoring period

In this section, all plots can be made with the built-in Pandas plot function, unless specifically asked otherwise.

EXERCISE: Using the Pandas `plot` function, plot the yearly number of records for `Dipodomys merriami` over time.



In [43]:

    
merriami = survey_data[survey_data["name"] == "Dipodomys merriami"]



In [44]:

    
fig, ax = plt.subplots()
merriami.groupby(merriami['eventDate'].dt.year).size().plot(ax=ax)
ax.set_xlabel("")
ax.set_ylabel("number of occurrences")









    Out[44]:





Text(0, 0.5, 'number of occurrences')

NOTE: Compare the following two graphs. What is different? Which one would you use?



In [45]:

    
merriami = survey_data[survey_data["species"] == "merriami"]
fig, ax = plt.subplots(2, 1, figsize=(14, 8))
merriami.groupby(merriami['eventDate']).size().plot(ax=ax[0], style="-") # top graph
merriami.resample("D", on="eventDate").size().plot(ax=ax[1], style="-") # lower graph









    Out[45]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fbfe84f5190>
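
The difference between the two graphs comes from how days without records are treated, which a sketch on three invented dates makes explicit:

```python
import pandas as pd

# Hypothetical observation dates with a two-day gap
demo = pd.DataFrame({"eventDate": pd.to_datetime(
    ["1977-07-16", "1977-07-16", "1977-07-19"])})

by_group = demo.groupby("eventDate").size()               # only dates that occur in the data
by_resample = demo.resample("D", on="eventDate").size()   # regular daily grid, empty days counted as 0

print(by_group)
print(by_resample)
```

With `groupby`, a line plot connects consecutive observation days directly, suggesting records in between; with `resample`, the line drops to zero on days without observations.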

EXERCISE: Plot, for the species 'Dipodomys merriami', 'Dipodomys ordii', 'Reithrodontomys megalotis' and 'Chaetodipus baileyi', the monthly number of records as a function of time for the whole monitoring period. Plot each of the individual species in a separate subplot and provide them all with the same y-axis scale. __Tip__: have a look at the documentation of the pandas plot function.



In [46]:

    
subsetspecies = survey_data[survey_data["name"].isin(['Dipodomys merriami', 'Dipodomys ordii',
                                                      'Reithrodontomys megalotis', 'Chaetodipus baileyi'])]



In [47]:

    
month_evolution = subsetspecies.groupby("name").resample('M', on='eventDate').size()



In [48]:

    
species_evolution = month_evolution.unstack(level=0)
axs = species_evolution.plot(subplots=True, figsize=(14, 8), sharey=True)

EXERCISE: Reproduce the previous plot using the plotnine package.



In [49]:

    
subsetspecies = survey_data[survey_data["name"].isin(['Dipodomys merriami', 'Dipodomys ordii',
                                                      'Reithrodontomys megalotis', 'Chaetodipus baileyi'])]
month_evolution = subsetspecies.groupby("name").resample('M', on='eventDate').size()



In [50]:

    
(pn.ggplot(month_evolution.reset_index(name='count'), 
           pn.aes(x='eventDate', y='count', color='name'))
    + pn.geom_line()
    + pn.facet_wrap('name', nrow=4)
    + pn.theme_light()
)









    












    Out[50]:





<ggplot: (8778888171901)>

EXERCISE: Evaluate the yearly number of occurrences for each of the 'taxa' as a function of time.



In [51]:

    
year_evolution = survey_data.groupby("taxa").resample('A', on='eventDate').size()
species_evolution = year_evolution.unstack(level=0)
axs = species_evolution.plot(subplots=True, figsize=(16, 8), sharey=False)

EXERCISE: Calculate the number of occurrences for each weekday, grouped by each year of the monitoring campaign, without using the `pivot` functionality. Call the variable count_weekday_years



In [52]:

    
count_weekday_years = survey_data.groupby([survey_data["eventDate"].dt.year, survey_data["eventDate"].dt.dayofweek]).size().unstack()



In [53]:

    
# Alternative
#years = survey_data["eventDate"].dt.year.rename('year')
#dayofweeks = survey_data["eventDate"].dt.dayofweek.rename('dayofweek')
#count_weekday_years = pd.crosstab(index=years, columns=dayofweeks)



In [54]:

    
count_weekday_years.head()



In [55]:

    
count_weekday_years.plot()









    Out[55]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fbfe3e99050>

EXERCISE: Based on the variable `count_weekday_years`, calculate for each weekday the median number of records based on the yearly count values. Modify the labels of the plot to indicate the actual days of the week (instead of numbers).



In [56]:

    
fig, ax = plt.subplots()
count_weekday_years.median(axis=0).plot(kind='barh', ax=ax, color='#66b266')
yticklabels = ax.set_yticklabels(['Monday', 'Tuesday', 'Wednesday', "Thursday", "Friday", "Saturday", "Sunday"])
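
An alternative sketch, on invented yearly counts, that renames the index of the medians instead of overwriting the tick labels afterwards. The day names then travel with the data, so they cannot get out of step with the bars if a weekday is missing or the order changes:

```python
import numpy as np
import pandas as pd

# Hypothetical yearly counts: rows are years, columns are weekdays (0 = Monday ... 6 = Sunday)
counts = pd.DataFrame([[10, 4, np.nan, 1, 2, 30, 50],
                       [20, 6, 3, 1, 4, 10, 70],
                       [30, 8, 5, 4, 6, 20, 60]],
                      index=[1977, 1978, 1979], columns=range(7))

day_names = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

median_per_day = counts.median(axis=0)    # NaN values are skipped by default
median_per_day.index = [day_names[i] for i in median_per_day.index]
print(median_per_day)
```

Calling `median_per_day.plot(kind='barh')` on this Series would label the bars with the day names directly.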

Nice work!

Output of `heatmap_tidy.head()` (In [21]):

	year	month	count
0	1977	1	NaN
1	1978	1	63.0
2	1979	1	55.0
3	1980	1	124.0
4	1981	1	157.0

Output of `count_weekday_years.head()` (In [54]):

eventDate	0	1	2	3	4	5	6
eventDate
1977	89.0	35.0	NaN	NaN	19.0	106.0	160.0
1978	121.0	14.0	16.0	71.0	148.0	277.0	275.0
1979	39.0	79.0	121.0	58.0	NaN	165.0	161.0
1980	234.0	162.0	94.0	62.0	81.0	160.0	489.0
1981	446.0	63.0	87.0	126.0	74.0	83.0	427.0
