iNaturalist SDM Command and Control

Daniel Phillips, Chris Howard, Phillip Johnson, Jacob Smith, Michael Reid

Basic Overview

This application provides a full data pipeline for retrieving and cleaning iNaturalist butterfly data and running it through Jeff Oliver's species distribution model (SDM). The SDM produces rasters and image files that help scientists visualize how different butterfly species are distributed across the country in a given month of the year.

System Requirements:

Python 3.6+, R, Git, Anaconda (install as administrator), Bash


Getting started:

  1. Clone the required projects

    • Get Jeff Oliver's SDM with git clone https://github.com/jcoliver/ebutterfly-sdm.git
    • Get this program with git clone https://github.com/ckhoward/ebutterfly-sdm.git
  2. Get your data

    • If downloading GBIF observations, unzip the downloaded file and move observations.csv into the directory ebutterfly-sdm/data/ (warning: many input files will be generated here)
    • If using updater.py and observation_getter.py..
  3. Organize your files

    • Place inat_request.ipynb into the directory ebutterfly-sdm/ (this notebook will act as command and control)
    • Place organize.py into the directory ebutterfly-sdm/data/
    • Move taxon_list.txt from ebutterfly-sdm/data/gbif/ to ebutterfly-sdm/data/inaturalist/

How to use:

Run organize.py from the command line with python organize.py. This script:

  • Cleans observations.csv by removing extraneous and/or missing data (for instance, IDs not listed in Jeff's taxon_list file);
  • Creates a user-friendly file, data_for_sdm.csv, containing all observations with fields for taxonId, year, month, latitude, and longitude. A text copy is also created in case that format is preferred. This makes it easier for users to sift through the data of interest and spot any glaring issues;
  • Creates 13 text files for every species listed in data_for_sdm.csv (one for each month, plus one for all months) to be used as input for the SDM, named in the format [taxonid-month-iNaturalist].

This process prepares you to run the SDM by generating the required per-species, per-month input text files.
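The per-species file naming that organize.py produces can be sketched as follows. This is a hypothetical helper for illustration (sdm_input_filenames is not part of organize.py); it mirrors the [taxonid-month-iNaturalist] pattern described above.

```python
def sdm_input_filenames(taxon_id):
    """Return the 13 SDM input filenames for one species:
    one per month ("01".."12") plus one covering all months."""
    months = ["{:02d}".format(m) for m in range(1, 13)] + ["all"]
    return ["{}-{}-iNaturalist.txt".format(taxon_id, month) for month in months]

print(sdm_input_filenames(47226)[0])   # 47226-01-iNaturalist.txt
print(sdm_input_filenames(47226)[-1])  # 47226-all-iNaturalist.txt
```

These are exactly the filenames the bash loop below passes to run-sdm.R.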

Running the R Scripts through Bash

Input:

This code iterates over every species in the ids array; for each species and each month (plus an "all" entry covering all observations), the run-sdm.R script reads the corresponding input file and creates the associated image and raster files.

Notice that the IDs are hardcoded. We know this is not ideal and are working on a fix. In the meantime, we have created a large list of space-separated IDs that can be plugged in as needed. The list can be found here. Additionally, organize.py contains a commented-out function that takes taxon_list.txt and produces output that can be copied and pasted into the bash ids array in the notebook. This gives you the flexibility to choose which specific IDs you want to observe.


In [ ]:
%%bash
start=$(date +%s.%N)

ids=("47226")
months=("01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "12" "all")
for id in "${ids[@]}"; do
    for month in "${months[@]}"; do
        Rscript --vanilla scripts/run-sdm.R data/inaturalist/$id-$month-iNaturalist.txt $id-$month output/
    done
done
end=$(date +%s.%N)    
runtime=$(python -c "print(${end} - ${start})")

echo "Runtime was $runtime"

Output distribution images:

Warnings

Data integrity:

From iNaturalist:

"Our export system is a bit flawed, though: if you alter the filters to get more than 10,000 records it may start omitting data, so the above techniques would probably be better. Plus, if you use GBIF, your [users] can benefit from even more records from museums and such."

Anyone using iNaturalist data should consider using our API access code, Jeff's, or downloading the DwC-A archive that iNaturalist prepares for GBIF.

Some research-grade observations lack latitude and longitude entries. Our script handles this by omitting those records.
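That cleaning step amounts to the following pandas sketch. The column names here are assumptions; match them to the headers in your observations.csv.

```python
import pandas as pd

def drop_missing_coords(df):
    """Drop observation rows that lack a latitude or longitude value
    (column names assumed; adjust to your observations.csv headers)."""
    return df.dropna(subset=["latitude", "longitude"])
```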

Runtimes:

Approximate run time for organize.py to clean and organize/write the files needed for the SDM: ~5 minutes

Approximate run time for the SDM/R script to fully process 5 taxon IDs and produce raster images: ~3 minutes

Estimated run time for the SDM/R script to fully process all 760-780 taxon IDs and produce raster images: ~6.5 hours

Space requirement:

When running the SDM on 760 species, the output produced is approximately 16 GB in total. Consider your available disk space before running the SDM on large numbers of species.
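A rough back-of-the-envelope estimate from those figures: ~16 GB for 760 species works out to about 21 MB of output per species, so you can extrapolate linearly for smaller runs.

```python
def estimated_output_gb(n_species, gb_per_species=16 / 760):
    """Linear disk-usage extrapolation from the ~16 GB observed
    for 760 species (a rough estimate, not a guarantee)."""
    return n_species * gb_per_species

print(round(estimated_output_gb(100), 1))  # ~2.1 GB for 100 species
```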

Map Errors:

After running the SDM, you will likely notice maps with yellow bands, or kernel output flagging issues with the minimums, maximums, and correlations. This happens when the input has either no data or too few observations. Inputs with more than 12 observations generate correct output.

Other functionality:

These codeblocks give you the flexibility to call the iNaturalist API and fetch data that way.


In [4]:
import requests
import json
import pandas as pd
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated
from IPython.display import display

def get_taxa_id(species_name):
    '''
    This function returns the taxon_id when given the species name.
    
    Parameters:
    species_name: a string object representing a species name, e.g. "Danaus plexippus"
    
    Returns: ids, a list object containing integer id's for the species
    '''
    
    base_url = "https://api.inaturalist.org/v1/taxa/autocomplete?q="

    
    request = requests.get(base_url + "%20".join(species_name.split()))
    data = request.json()

    ids = []
    for i in data['results']:
        ids.append(i['id'])

    return ids


def get_observation(id_no, month, year):
    '''
    This function returns observation data when given taxon_id, month, and year.
    
    Parameters: 
    id_no: an integer representing species taxon_id
    month: an integer (1-12) representing the month of interest
    year: an integer representing year of interest
    
    Returns: observational data for taxon_id for specified month and year.
    '''
    #Url builder, for the request
    base_url = "https://api.inaturalist.org/v1/observations?"
    end_url = "&order=desc&order_by=created_at"
    url = base_url + 'taxon_id=' + str(id_no) + '&month=' + str(month) + '&year=' + str(year) + end_url

    request = requests.get(url)
    data = request.json()    

    return data
    
        
def get_count_one_month(id_no_lst, month, year):
    '''
    This function counts the number of observations of a taxon_id, for each month of a given year.
    
    Parameters:
    id_no_lst: a Python list object containing IDs 
    month: an integer object (1-12) representing the month you want the count for
    year: an integer object for the year of interest
    
    Returns: count, an integer of how many observations are given for some id, for some month of a given year.
    '''
    count = 0
    for i in id_no_lst:
        count += int(get_observation(i, month, year)['total_results']) #total_results key associates w/ ea. set of obs data
    return count
    
    

species = [
    'Danaus plexippus',
    'Hyles lineata',
    'Zerene cesonia',
    'Papilio multicaudata',
    'Agraulis vanillae',
    'Papilio cresphontes',
    'Strymon melinus',
    'Vanessa cardui',
    'Hylephila phyleus',
    'Danaus gilippus'
]

months = [
    'January',
    'February',
    'March',
    'April',
    'May',
    'June',
    'July',
    'August',
    'September',
    'October',
    'November',
    'December'
]

def main():
    #print('running')
    species_to_id = {}
    frames = []
    
    
    #Get a dictionary of the taxa -> lst(ids)
    for i in species:
        species_to_id[i] = get_taxa_id(i)
    
    #Map integers 1-12 to 'January' through 'December'
    month_map = dict(zip(range(1,13), months))
    
    species_dict_out = {}
    year = 2016
    
    #print(species_to_id)
    
    #Create a dictionary for each species
    for spec in species_to_id:
        species_dict_out[spec] = {}
        
        #Map each species' months to their corresponding count of species observations for that month (and year)
        for mon in month_map:
            species_dict_out[spec][month_map[mon]] = get_count_one_month(species_to_id[spec], mon, year)
        
        #print(species_dict_out[spec])
        
        #Collect this species' month -> count dictionary for tabulation
        frames.append(species_dict_out[spec])
            
        
    #Flatten the per-species month counts into a DataFrame (uncomment to display)
    #result = json_normalize(frames)
    #display(result)
    
    
%time main()


Wall time: 4min 49s

In [3]:
taxon_dict = {}
def get_organized(filename):
    '''Read taxon_list.txt and return a dict mapping taxon ID -> species name.'''
    with open(filename, 'r') as f:
        for line in f:
            parts = line.split()
            if parts:
                taxon_dict[parts[0]] = ' '.join(parts[1:])
    return taxon_dict

taxa = get_organized('taxon_list.txt')
#print(taxon_dict)

In [ ]: