This application provides a full data pipeline for fetching and cleaning iNaturalist butterfly data and running it through Jeff Oliver's species distribution model (SDM). The model produces rasters and image files that help scientists visualize how butterfly species are distributed across the country in each month of the year.
Python 3.6, R, Git, Anaconda (install as administrator), Bash
Clone the required projects
git clone https://github.com/jcoliver/ebutterfly-sdm.git
git clone https://github.com/ckhoward/ebutterfly-sdm.git
Get your data
Organize your files
Run organize.py with python organize.py from the command line. This script:
This process prepares you to run the SDM model by generating the text files shown below.
The code below iterates over every species ID in the ids array. For each ID and each month (plus "all", which covers all observations), it calls the run-sdm.R script, which creates the associated images and raster files.
Notice that the IDs are hardcoded. We know this is not ideal and are working on a fix. In the meantime, we have created a large list of space-separated IDs that can be plugged in as necessary. The list can be found here. Additionally, organize.py contains a commented-out function that takes taxon_list.txt and produces output that can be copied and pasted into the bash ids array in the notebook. This gives you the flexibility to choose which specific IDs you want to model.
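As a rough illustration of that kind of helper, the sketch below turns taxon lines into a string that can be pasted into the bash ids array. It is not the commented-out function from organize.py: the name format_bash_ids is hypothetical, and the file layout (one taxon per line, ID first, then the name) is an assumption based on how taxon_list.txt is parsed later in this notebook.

```python
def format_bash_ids(lines):
    """Build a bash array literal such as ("47226" "48662") from taxon lines.

    Assumes each non-blank line starts with the taxon ID, followed by the name.
    """
    ids = []
    for line in lines:
        parts = line.split()
        if parts:  # skip blank lines
            ids.append(parts[0])
    return "(" + " ".join('"{}"'.format(i) for i in ids) + ")"


if __name__ == "__main__":
    # Made-up sample lines; a real run would pass an open taxon_list.txt instead
    sample = ["48662 Danaus plexippus", "47226 Placeholder name"]
    print("ids=" + format_bash_ids(sample))
```

Because the function accepts any iterable of lines, an open file handle for taxon_list.txt can be passed in directly.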
In [ ]:
%%bash
# Record the start time so the total runtime can be reported
start=$(date +%s.%N)
ids=("47226")
months=("01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "12" "all")
# Run the SDM once per species ID and month ("all" uses every observation)
for id in "${ids[@]}"; do
    for month in "${months[@]}"; do
        Rscript --vanilla scripts/run-sdm.R "data/inaturalist/$id-$month-iNaturalist.txt" "$id-$month" output/
    done
done
end=$(date +%s.%N)
runtime=$(python -c "print(${end} - ${start})")
echo "Runtime was $runtime seconds"
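The file naming convention used by the loop can also be sketched in Python, which may help when checking that every input file exists before starting a long run. The function names here are hypothetical; the path pattern mirrors the bash cell, and the default data directory is taken from it.

```python
import os

# "01" through "12", plus "all" for every observation
MONTHS = ["{:02d}".format(m) for m in range(1, 13)] + ["all"]


def expected_inputs(ids, data_dir="data/inaturalist"):
    """Return the input paths the bash loop passes to run-sdm.R, in loop order."""
    return [
        "{}/{}-{}-iNaturalist.txt".format(data_dir, taxon_id, month)
        for taxon_id in ids
        for month in MONTHS
    ]


def missing_inputs(ids, data_dir="data/inaturalist"):
    """Return the expected input paths that do not exist on disk."""
    return [p for p in expected_inputs(ids, data_dir) if not os.path.isfile(p)]


if __name__ == "__main__":
    for path in missing_inputs(["47226"]):
        print("missing:", path)
```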
From iNaturalist:
"Our export system is a bit flawed, though: if you alter the filters to get more than 10,000 records it may start omitting data, so the above techniques would probably be better. Plus, if you use GBIF, your [users] can benefit from even more records from museums and such."
Anyone using iNaturalist data should consider using our API access code or Jeff's, or downloading the Darwin Core Archive (DwC-A) that iNaturalist produces for GBIF.
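For the API route, one way to collect larger result sets is to request them page by page. The sketch below is a minimal illustration rather than our actual access code. It assumes the v1 observations endpoint accepts page and per_page parameters and returns a results list, so check the iNaturalist API documentation before relying on it. The names fetch_page and fetch_all are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://api.inaturalist.org/v1/observations"


def fetch_page(params):
    """Fetch one page of observations and return the decoded JSON."""
    with urlopen(API_URL + "?" + urlencode(params)) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_all(taxon_id, per_page=200, max_pages=50, fetch=fetch_page):
    """Collect observations page by page until a page comes back short."""
    records = []
    for page in range(1, max_pages + 1):
        # page and per_page are assumed parameter names (see the note above)
        data = fetch({"taxon_id": taxon_id, "page": page, "per_page": per_page})
        results = data.get("results", [])
        records.extend(results)
        if len(results) < per_page:  # last page reached
            break
    return records
```

The fetch argument exists so the paging logic can be exercised with a stub instead of live network calls.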
Some research-grade observations had no latitude or longitude entries. Our script handles this by omitting those records.
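The filtering step can be pictured roughly as follows. This is a sketch, not the code in organize.py: the function name and the latitude and longitude keys are assumptions about how the records are represented.

```python
def drop_missing_coordinates(records, lat_key="latitude", lon_key="longitude"):
    """Keep only records whose latitude and longitude are both present."""
    kept = []
    for record in records:
        lat = record.get(lat_key)
        lon = record.get(lon_key)
        # Treat missing keys, None, and empty strings as "no coordinate"
        if lat in (None, "") or lon in (None, ""):
            continue
        kept.append(record)
    return kept


if __name__ == "__main__":
    # Made-up rows for illustration only
    rows = [
        {"latitude": "32.2", "longitude": "-110.9"},
        {"latitude": "", "longitude": "-110.9"},
    ]
    print(drop_missing_coordinates(rows))
```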
Approximate run time for organize.py to clean the data and write the files needed for the SDM: ~5 minutes
Approximate run time for the SDM R script to run 5 taxon IDs and produce raster images: ~3 minutes
Estimated run time for the SDM R script to run all 760-780 taxon IDs and produce raster images: ~6.5 hours
When the SDM is run on 760 species, the output totals approximately 16 GB. Check your available disk space before running the SDM on a large number of species.
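The figures above allow a back-of-the-envelope estimate for other species counts. The sketch below assumes run time and output size scale linearly with the number of species, which has not been measured, and it uses 770 as the midpoint of the 760-780 range. The function name estimate_run is hypothetical.

```python
def estimate_run(n_species, hours_for_770=6.5, gb_for_760=16.0):
    """Linearly scale the reported run time and output size to n_species.

    Linear scaling is an assumption, not a measured property of the SDM.
    """
    hours = n_species * hours_for_770 / 770
    gigabytes = n_species * gb_for_760 / 760
    return hours, gigabytes


if __name__ == "__main__":
    hours, gigabytes = estimate_run(100)
    print("about {:.1f} hours and {:.1f} GB".format(hours, gigabytes))
```

Treat the result as an order-of-magnitude guide only, since the small 5-ID run reported above implies a somewhat different per-species rate.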
After running the SDM, you will likely notice maps with yellow bands, or kernel output reporting issues with the mins, maxes, and correlations. These occur when the input has no data or too few observations. If the input has more than 12 observations, the generated outputs will be correct.
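A quick pre-check along these lines can flag inputs that fall below that threshold before the SDM is run. This is a sketch under assumptions: it supposes each input file holds one observation per line after a single header row, and the function names are hypothetical.

```python
def count_observations(lines, has_header=True):
    """Count non-blank data lines, optionally skipping one header row."""
    rows = [line for line in lines if line.strip()]
    if has_header and rows:
        rows = rows[1:]
    return len(rows)


def has_enough_observations(lines, minimum=13, has_header=True):
    """Return True when there are more than 12 observations (the default)."""
    return count_observations(lines, has_header) >= minimum


if __name__ == "__main__":
    # Made-up content standing in for an input file
    sample = ["header"] + ["observation"] * 12
    print(has_enough_observations(sample))
```

An open file handle can be passed as lines, so the same check works on the files under data/inaturalist/.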
The code blocks below let you call the iNaturalist API and fetch data directly.
In [4]:
import requests
import json
import pandas as pd
from pandas.io.json import json_normalize
from IPython.display import display
def get_taxa_id(species_name):
    '''
    This function returns the taxon IDs when given the species name.
    Parameters:
        species_name: a string representing a species name, e.g. "Danaus plexippus"
    Returns: ids, a list of integer IDs for the species
    '''
    base_url = "http://api.inaturalist.org/v1/taxa/autocomplete?q="
    # Join the words of the name with URL-encoded spaces
    request = requests.get(base_url + "%20".join(species_name.split()))
    data = request.json()
    # Collect the ID of every taxon the autocomplete endpoint returns
    ids = []
    for i in data['results']:
        ids.append(i['id'])
    return ids

def get_observation(id_no, month, year):
    '''
    This function returns observation data when given taxon_id, month, and year.
    Parameters:
        id_no: an integer representing the species taxon_id
        month: an integer (1-12) representing the month of interest
        year: an integer representing the year of interest
    Returns: observation data for taxon_id for the specified month and year.
    '''
    #Url builder, for the request
    base_url = "http://api.inaturalist.org/v1/observations?"
    end_url = "&order=desc&order_by=created_at"
    url = base_url + 'taxon_id=' + str(id_no) + '&month=' + str(month) + '&year=' + str(year) + end_url
    request = requests.get(url)
    data = request.json()
    return data

def get_count_one_month(id_no_lst, month, year):
    '''
    This function counts the observations for a list of taxon IDs in one month of a given year.
    Parameters:
        id_no_lst: a Python list containing IDs
        month: an integer (1-12) representing the month you want the count for
        year: an integer for the year of interest
    Returns: count, an integer giving the total observations across the IDs for that month and year.
    '''
    count = 0
    for i in id_no_lst:
        count += int(get_observation(i, month, year)['total_results']) #total_results key associates w/ ea. set of obs data
    return count

species = [
    'Danaus plexippus',
    'Hyles lineata',
    'Zerene cesonia',
    'Papilio multicaudata',
    'Agraulis vanillae',
    'Papilio cresphontes',
    'Strymon melinus',
    'Vanessa cardui',
    'Hylephila phyleus',
    'Danaus gilippus'
]
months = [
    'January',
    'February',
    'March',
    'April',
    'May',
    'June',
    'July',
    'August',
    'September',
    'October',
    'November',
    'December'
]

def main():
    #print('running')
    species_to_id = {}
    frames = []
    #Get a dictionary of the taxa -> lst(ids), e.g. {'Danaus plexippus': [48662, 235550]}
    for i in species:
        species_to_id[i] = get_taxa_id(i)
    #Map integers 1-12 to 'January' through 'December'
    month_map = dict(zip(range(1,13), months))
    species_dict_out = {}
    year = 2016
    #print(species_to_id)
    #Create a dictionary for each species
    for spec in species_to_id:
        species_dict_out[spec] = {}
        #Map each species' months to their corresponding count of species observations for that month (and year)
        for mon in month_map:
            species_dict_out[spec][month_map[mon]] = get_count_one_month(species_to_id[spec], mon, year)
        #print(species_dict_out[spec])
        #Collect the month -> count dictionary for each species
        frames.append(species_dict_out[spec])
    #Makes the JSON Species->ID_List structures
    #result = json_normalize(frames)
    #display(result)

%time main()
In [3]:
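taxon_dict = {}

def get_organized(path):
    '''Read a taxon list and map each line's first field to the rest of the line.'''
    # Use a context manager so the file is closed after reading
    with open(path, 'r') as file:
        for line in file:
            fields = line.split()
            if not fields:  # skip blank lines
                continue
            taxon_dict[fields[0]] = ' '.join(fields[1:])
            #print(taxon_dict[fields[0]])
    return taxon_dict

hello = get_organized('taxon_list.txt')
#print(taxon_dict)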