Working with GBIF presence point records


In [1]:
from iSDM.species import GBIFSpecies

In [2]:
my_species = GBIFSpecies(name_species="Etheostoma_blennioides")

In [3]:
my_species.name_species


Out[3]:
'Etheostoma_blennioides'

Just some logging/plotting setup so that output appears in this notebook; nothing to worry about.


In [4]:
%matplotlib inline
import logging
root = logging.getLogger()
root.addHandler(logging.StreamHandler())

1. Find and download all matching species data from GBIF. At this point no data cleaning is done yet.

Show only the first 5 observation rows (head()).


In [5]:
my_species.find_species_occurrences().head()


Loading species ... 
Number of occurrences: 7226 
True
Loaded species: ['Etheostoma blennioides'] 
Out[5]:
accessRights associatedOccurrences associatedReferences associatedSequences basisOfRecord bibliographicCitation catalogNumber class classKey collectionCode ... type typeStatus verbatimCoordinateSystem verbatimDepth verbatimElevation verbatimEventDate verbatimLocality vernacularName waterBody year
0 Open Access, http://creativecommons.org/public... NaN NaN NaN PRESERVED_SPECIMEN Etheostoma blennioides (YPM ICH 028456) YPM ICH 028456 Actinopterygii 204 VZ ... PhysicalObject NaN NaN NaN NaN NaN NaN perches; perch-like fishes; ray-finned fishes;... NaN 2015.0
1 NaN NaN NaN NaN HUMAN_OBSERVATION NaN 1937841 Actinopterygii 204 Observations ... NaN NaN NaN NaN NaN Thu Sep 10 2015 14:51:49 GMT-0400 (EDT) 3827–4235 Fobes Rd, Rock Creek, OH, US NaN NaN 2015.0
2 NaN NaN NaN NaN HUMAN_OBSERVATION NaN 623289 Actinopterygii 204 Observations ... NaN NaN NaN NaN NaN 2014-04-13 Beaver Creek NaN NaN 2014.0
3 Open Access, http://creativecommons.org/public... NaN Det. by: Thomas J. Near NaN PRESERVED_SPECIMEN Etheostoma blennioides (YPM ICH 026964) YPM ICH 026964 Actinopterygii 204 VZ ... PhysicalObject NaN NaN NaN NaN NaN NaN perches; perch-like fishes; ray-finned fishes;... NaN 2014.0
4 Open Access, http://creativecommons.org/public... NaN Det. by: William Freedburg, Thomas J. Near NaN PRESERVED_SPECIMEN Etheostoma blennioides (YPM ICH 027023) YPM ICH 027023 Actinopterygii 204 VZ ... PhysicalObject NaN NaN NaN NaN NaN NaN perches; perch-like fishes; ray-finned fishes;... NaN 2014.0

5 rows × 138 columns

The taxon key is derived from the GBIF data. It is a sort of unique ID per species.


In [6]:
my_species.ID # taxonkey derived from GBIF. It's a sort of unique ID per species


Out[6]:
2382397

Data is serialized and saved in a file.

Default location: current working directory. Default filename: the species name followed by its GBIF ID.


In [8]:
my_species.save_data()


Saved data: /home/daniela/git/iSDM/notebooks/Etheostoma_blennioides2382397.pkl 

In [9]:
my_species.source.name


Out[9]:
'GBIF'

Let's get a general idea of where the species is distributed on the map


In [10]:
my_species.plot_species_occurrence()


The map is always zoomed to the extent of the species occurrences. Notice that the lower right corner also has one red point.

2. Or just load existing data into a Species object. Let's use the file we saved before.


In [11]:
data = my_species.load_data("./Etheostoma_blennioides2382397.pkl") # or just load existing data into Species object


Loading data from: ./Etheostoma_blennioides2382397.pkl
Succesfully loaded previously saved data.

In [12]:
data.columns # all the columns available per observation


Out[12]:
Index(['accessRights', 'associatedOccurrences', 'associatedReferences',
       'associatedSequences', 'basisOfRecord', 'bibliographicCitation',
       'catalogNumber', 'class', 'classKey', 'collectionCode',
       ...
       'type', 'typeStatus', 'verbatimCoordinateSystem', 'verbatimDepth',
       'verbatimElevation', 'verbatimEventDate', 'verbatimLocality',
       'vernacularName', 'waterBody', 'year'],
      dtype='object', length=138)

3. Examples of simple (meta-)data exploration

Show all unique values of the 'country' column


In [13]:
data['country'].unique().tolist()


Out[13]:
['United States', nan, 'Canada', 'Namibia', 'India']
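Before deciding what to do with the less common values in this list, it can help to see how many records each one accounts for. A minimal sketch on a made-up stand-in frame (the `toy` values are hypothetical and are not the GBIF records loaded above):

```python
import pandas as pd

# Hypothetical stand-in for the 'country' column; not the real GBIF data.
toy = pd.DataFrame(
    {"country": ["United States", "Canada", None, "United States", "Namibia"]}
)

# dropna=False keeps missing values as their own entry in the tally.
country_counts = toy["country"].value_counts(dropna=False)
print(country_counts)
```

The same call on data['country'] would show whether the unusual countries are single stray records or larger groups.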

In [14]:
data.shape # there are 7226 observations, 138 parameters per observation


Out[14]:
(7226, 138)

In [15]:
data['vernacularName'].unique().tolist() # self-explanatory


Out[15]:
['perches; perch-like fishes; ray-finned fishes; vertebrates; chordates; animals',
 nan,
 'Greenside Darter',
 'GREENSIDE DARTER',
 'greenside darter']
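Three of these values differ only in capitalisation. If the name were needed for grouping, one possible approach is to normalise the case first. A small sketch, using a hand-made Series in place of the real column:

```python
import pandas as pd

# Hand-made stand-in for data['vernacularName']; values copied from the output above.
names = pd.Series(["Greenside Darter", "GREENSIDE DARTER", "greenside darter", None])

# Lower-case and trim whitespace; missing values stay missing.
normalized = names.str.lower().str.strip()

unique_names = normalized.dropna().unique().tolist()
print(unique_names)
```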

How about latitude/longitude? Does the data need cleaning?

head() or tail() is only used to limit the tabular output in this notebook. The "data" structure contains it all.


In [16]:
data['decimalLatitude'].tail(10)


Out[16]:
7216    43.03333
7217    34.84256
7218         NaN
7219    43.03333
7220         NaN
7221         NaN
7222         NaN
7223         NaN
7224    41.38022
7225         NaN
Name: decimalLatitude, dtype: float64

Hmm, so some values are 'NaN', which means not available.

We can fill them with a default value, or drop the records where latitude/longitude is not available. Let's drop them.


In [17]:
import numpy as np
data_cleaned = data.dropna(subset = ['decimalLatitude', 'decimalLongitude']) # drop records where data not available

In [18]:
data_cleaned.shape # fewer occurrence records now: 5223


Out[18]:
(5223, 138)
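To make the effect of the subset argument concrete, here is a minimal sketch on a toy frame (the coordinates and years are made up for illustration). Only the listed columns are checked, so a missing value elsewhere does not cause a row to be dropped:

```python
import numpy as np
import pandas as pd

# Made-up toy records; only the column names match the GBIF data.
toy = pd.DataFrame({
    "decimalLatitude": [43.03333, np.nan, 41.38022, np.nan],
    "decimalLongitude": [-80.9, -83.5, np.nan, np.nan],
    "year": [np.nan, 2014, 2015, 1964],
})

# A row is dropped if ANY of the subset columns is missing.
toy_cleaned = toy.dropna(subset=["decimalLatitude", "decimalLongitude"])
print(toy_cleaned)
```

Row 0 survives even though its year is missing, because year is not part of the subset.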

In [19]:
data_cleaned['basisOfRecord'].unique()


Out[19]:
array(['PRESERVED_SPECIMEN', 'HUMAN_OBSERVATION', 'UNKNOWN'], dtype=object)

In [20]:
# .size counts cells (rows x columns), not records: divide the result by the
# 138 columns to get the number of records with no decimalLatitude and decimalLongitude
data[data['decimalLatitude'].isnull() & data['decimalLongitude'].isnull()].size


Out[20]:
276414
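Note that .size on a DataFrame returns rows times columns, so 276414 is a cell count: 276414 / 138 = 2003 records, which matches 7226 - 5223. A small sketch of the difference between .size and len(), on a made-up toy frame:

```python
import numpy as np
import pandas as pd

# Made-up toy records with three columns.
toy = pd.DataFrame({
    "decimalLatitude": [43.0, np.nan, np.nan],
    "decimalLongitude": [-80.0, np.nan, np.nan],
    "year": [2015, 1964, 1950],
})

missing = toy[toy["decimalLatitude"].isnull() & toy["decimalLongitude"].isnull()]

n_cells = missing.size      # rows * columns
n_records = len(missing)    # rows only
print(n_cells, n_records)
```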

How many of those have neither 'locality' nor 'verbatimLocality'? 27, apparently.


In [21]:
data[data['decimalLatitude'].isnull() & 
     data['decimalLongitude'].isnull() & 
     data['locality'].isnull() & 
     data['verbatimLocality'].isnull()]


Out[21]:
accessRights associatedOccurrences associatedReferences associatedSequences basisOfRecord bibliographicCitation catalogNumber class classKey collectionCode ... type typeStatus verbatimCoordinateSystem verbatimDepth verbatimElevation verbatimEventDate verbatimLocality vernacularName waterBody year
1277 http://fieldmuseum.org/about/copyright-informa... NaN NaN NaN PRESERVED_SPECIMEN NaN 112647 Actinopterygii 204 Fishes ... PhysicalObject NaN NaN NaN NaN NaN NaN NaN NaN 1999.0
4342 NaN NaN NaN NaN UNKNOWN NaN 56-5116 Actinopterygii 204 ON-CDC ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1972.0
4729 NaN NULL NaN NaN PRESERVED_SPECIMEN NaN GCRL 17588 Actinopterygii 204 Occurrence ... NaN NaN NULL NULL NULL NULL NaN NaN NaN 1968.0
5207 NaN NULL NaN NaN PRESERVED_SPECIMEN NaN GCRL 17580 Actinopterygii 204 Occurrence ... NaN NaN NULL NULL NULL NULL NaN NaN NaN 1964.0
5220 NaN NULL NaN NaN PRESERVED_SPECIMEN NaN GCRL 17585 Actinopterygii 204 Occurrence ... NaN NaN NULL NULL NULL NULL NaN NaN NaN 1964.0
5236 NaN NULL NaN NaN PRESERVED_SPECIMEN NaN GCRL 17592 Actinopterygii 204 Occurrence ... NaN NaN NULL NULL NULL NULL NaN NaN NaN 1964.0
5243 NaN NULL NaN NaN PRESERVED_SPECIMEN NaN GCRL 17598 Actinopterygii 204 Occurrence ... NaN NaN NULL NULL NULL NULL NaN NaN NaN 1964.0
5250 NaN NULL NaN NaN PRESERVED_SPECIMEN NaN GCRL 17583 Actinopterygii 204 Occurrence ... NaN NaN NULL NULL NULL NULL NaN NaN NaN 1964.0
5253 NaN NULL NaN NaN PRESERVED_SPECIMEN NaN GCRL 17578 Actinopterygii 204 Occurrence ... NaN NaN NULL NULL NULL NULL NaN NaN NaN 1964.0
5275 NaN NULL NaN NaN PRESERVED_SPECIMEN NaN GCRL 17582 Actinopterygii 204 Occurrence ... NaN NaN NULL NULL NULL NULL NaN NaN NaN 1964.0
5276 NaN NULL NaN NaN PRESERVED_SPECIMEN NaN GCRL 17591 Actinopterygii 204 Occurrence ... NaN NaN NULL NULL NULL NULL NaN NaN NaN 1964.0
5277 NaN NULL NaN NaN PRESERVED_SPECIMEN NaN GCRL 17584 Actinopterygii 204 Occurrence ... NaN NaN NULL NULL NULL NULL NaN NaN NaN 1964.0
6038 not-for-profit use only NaN NaN NaN PRESERVED_SPECIMEN NaN 21736 Actinopterygii 204 Fishes ... PhysicalObject NaN NaN NaN NaN 19500000 NaN NaN NaN 1950.0
6057 not-for-profit use only NaN NaN NaN PRESERVED_SPECIMEN NaN 22581 Actinopterygii 204 Fishes ... PhysicalObject NaN NaN NaN NaN 19500700 NaN NaN NaN 1950.0
6624 not-for-profit use only NaN NaN NaN PRESERVED_SPECIMEN NaN 7092 Actinopterygii 204 Fishes ... PhysicalObject NaN NaN NaN NaN 19380625 NaN NaN NaN 1938.0
6989 NaN NaN NaN NaN PRESERVED_SPECIMEN NaN A-2469 Actinopterygii 204 IC ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1880.0
7045 NaN NaN NaN NaN PRESERVED_SPECIMEN NaN SU 5326 Actinopterygii 204 Occurrence ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7064 NaN NaN NaN NaN PRESERVED_SPECIMEN NaN 5326 Actinopterygii 204 SU (ICH) ... PhysicalObject NaN NaN NaN NaN NaN NaN NaN NaN NaN
7085 not-for-profit use only NaN NaN NaN PRESERVED_SPECIMEN NaN 8498 Actinopterygii 204 Fishes ... PhysicalObject NaN NaN NaN NaN NaN NaN NaN NaN NaN
7087 http://fieldmuseum.org/about/copyright-informa... NaN NaN NaN PRESERVED_SPECIMEN NaN 112640 Actinopterygii 204 Fishes ... PhysicalObject NaN NaN NaN NaN NaN NaN NaN NaN NaN
7093 NaN NaN NaN NaN PRESERVED_SPECIMEN Etheostoma blennioides UAMZ F2503 F2503 Actinopterygii 204 UAMZ ... PhysicalObject NaN NaN NaN NaN unknown NaN NaN NaN NaN
7096 NaN NaN NaN NaN PRESERVED_SPECIMEN NaN SU 715 Actinopterygii 204 Occurrence ... NaN NaN NaN NaN NaN NaN NaN NaN Eel river NaN
7110 NaN NaN NaN NaN PRESERVED_SPECIMEN NaN SU 692 Actinopterygii 204 Occurrence ... NaN NaN NaN NaN NaN NaN NaN NaN Roaring river NaN
7117 http://fieldmuseum.org/about/copyright-informa... NaN NaN NaN PRESERVED_SPECIMEN NaN 112677 Actinopterygii 204 Fishes ... PhysicalObject NaN NaN NaN NaN NaN NaN NaN NaN NaN
7129 NaN NaN NaN NaN PRESERVED_SPECIMEN NaN 715 Actinopterygii 204 SU (ICH) ... PhysicalObject NaN NaN NaN NaN NaN NaN NaN Eel river NaN
7132 not-for-profit use only NaN NaN NaN PRESERVED_SPECIMEN NaN 9050 Actinopterygii 204 Vertebrate Paleontology ... PhysicalObject NaN NaN NaN NaN NaN NaN NaN NaN NaN
7155 NaN NaN NaN NaN PRESERVED_SPECIMEN NaN 692 Actinopterygii 204 SU (ICH) ... PhysicalObject NaN NaN NaN NaN NaN NaN NaN Roaring river NaN

27 rows × 138 columns


In [22]:
data_cleaned[['dateIdentified', 'day', 'month', 'year']].head()


Out[22]:
   dateIdentified                day   month  year
0  NaN                           23.0  5.0    2015.0
1  2015-09-11T23:37:54.000+0000  10.0  9.0    2015.0
2  2014-04-14T00:24:18.000+0000  13.0  4.0    2014.0
3  NaN                           13.0  5.0    2014.0
4  NaN                           14.0  5.0    2014.0

It seems that not all records have a 'dateIdentified', but the 'day', 'month' and 'year' fields are there for many (all?) records. TODO: what about 'verbatimEventDate'?
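If a single date per record is needed, the 'day', 'month' and 'year' fields can be combined. A hedged sketch on a toy frame (the first two rows reuse values from the output above, the third is made up to show a missing year); rows with an incomplete date are simply left out here, which is one possible choice among several:

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2015.0, 2014.0, None],
    "month": [5.0, 4.0, 6.0],
    "day": [23.0, 13.0, 1.0],
})

# Keep only rows where all three parts are present, then cast floats to int.
complete = toy[["year", "month", "day"]].dropna().astype(int)

# pd.to_datetime can assemble dates from columns named year, month and day.
event_date = pd.to_datetime(complete)
print(event_date)
```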

Select only observation records newer than 2010.

Say that only the decimalLatitude, decimalLongitude, rightsHolder and datasetName columns are interesting for our selection.


In [23]:
data_selected = data_cleaned[data_cleaned['year']>2010][['decimalLatitude','decimalLongitude', 'rightsHolder', 'datasetName']]

More filtering: select only those with a non-null datasetName.


In [24]:
data_selected[~data_selected.datasetName.isnull()].head(10)


Out[24]:
    decimalLatitude  decimalLongitude  rightsHolder                               datasetName
1   41.79664         -80.97289         Robert L Curtis                            iNaturalist research-grade observations
2   37.97240         -83.56716         Brian Wulker                               iNaturalist research-grade observations
10  35.07780         -83.97430         North Carolina Museum of Natural Sciences  NCSM Fishes Collection
13  35.17770         -83.88780         North Carolina Museum of Natural Sciences  NCSM Fishes Collection
20  35.16030         -83.92020         North Carolina Museum of Natural Sciences  NCSM Fishes Collection
39  36.40790         -81.40160         North Carolina Museum of Natural Sciences  NCSM Fishes Collection
48  36.41300         -81.40710         North Carolina Museum of Natural Sciences  NCSM Fishes Collection
51  36.55790         -81.21670         North Carolina Museum of Natural Sciences  NCSM Fishes Collection
52  36.54960         -81.00230         North Carolina Museum of Natural Sciences  NCSM Fishes Collection
54  36.38760         -91.53010         NaN                                        Auburn University Museum Fish Collection
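The year filter, the non-null filter and the column selection can also be written as a single .loc step. This is only an alternative sketch on a toy frame, not how the cells above were run; note that a missing year compares as False and is therefore excluded:

```python
import numpy as np
import pandas as pd

# Toy records; the years and the pairing of values are made up for illustration.
toy = pd.DataFrame({
    "year": [2015.0, 2009.0, 2014.0, np.nan],
    "decimalLatitude": [41.79664, 35.0778, 36.3876, 37.9724],
    "decimalLongitude": [-80.97289, -83.9743, -91.5301, -83.56716],
    "datasetName": [
        "iNaturalist research-grade observations",
        "NCSM Fishes Collection",
        None,
        "NCSM Fishes Collection",
    ],
})

recent = toy["year"] > 2010          # NaN > 2010 evaluates to False
named = toy["datasetName"].notnull()

selected = toy.loc[recent & named, ["decimalLatitude", "decimalLongitude", "datasetName"]]
print(selected)
```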

If you are happy with this filtering, and you want to save the species data:


In [25]:
my_species.set_data(data_selected) # update the object "my_species" to contain the filtered data

In [26]:
my_species.save_data(file_name="updated_dataset.pkl")


Saved data: /home/daniela/git/iSDM/notebooks/updated_dataset.pkl 

Plot our filtered selection


In [27]:
my_species.plot_species_occurrence()



In [28]:
my_species.get_data().shape # there are 119 records now


Out[28]:
(119, 4)

4. Load data from downloaded csv file (from GBIF website, not API; differs a bit)


In [29]:
csv_data = my_species.load_csv('../data/GBIF.csv')


Loading data from: ../data/GBIF.csv
Succesfully loaded previously CSV data.
Updated species ID: 2382397 

In [30]:
csv_data.head() # let's peek into the data


Out[30]:
gbifid datasetkey occurrenceid kingdom phylum class order family genus species ... recordnumber identifiedby rights rightsholder recordedby typestatus establishmentmeans lastinterpreted mediatype issue
0 1224542608 71e6db8e-f762-11e1-a439-00145eb45e9a urn:catalog:OMNH:FISH:85718 Animalia Chordata Actinopterygii Perciformes Percidae Etheostoma Etheostoma blennioides ... NaN Dr. Aaron Geheber NaN Sam Noble Oklahoma Museum of Natural History Aaron Geheber NaN NaN 2015-12-23T21:01Z NaN GEODETIC_DATUM_ASSUMED_WGS84
1 17598896 83a8c0da-f762-11e1-a439-00145eb45e9a NaN Animalia Chordata Actinopterygii Perciformes Percidae Etheostoma Etheostoma blennioides ... NaN Baldwin, M.E. NaN NaN Baldwin, M.E.; Bowlby, J.N. NaN NaN 2014-06-04T23:44Z NaN GEODETIC_DATUM_ASSUMED_WGS84
2 17598905 83a8c0da-f762-11e1-a439-00145eb45e9a NaN Animalia Chordata Actinopterygii Perciformes Percidae Etheostoma Etheostoma blennioides ... NaN Baldwin, Mary Elizabeth NaN NaN Baldwin, Mary Elizabeth; Casbourn, Hugh R. NaN NaN 2014-06-04T23:44Z NaN GEODETIC_DATUM_ASSUMED_WGS84
3 198193430 961f602a-f762-11e1-a439-00145eb45e9a NaN Animalia Chordata Actinopterygii Perciformes Percidae Etheostoma Etheostoma blennioides ... NaN NaN NaN NaN R.D. Suttkus, Eaton & Donahue NaN NaN 2014-06-05T03:09Z NaN GEODETIC_DATUM_ASSUMED_WGS84
4 198193618 961f602a-f762-11e1-a439-00145eb45e9a NaN Animalia Chordata Actinopterygii Perciformes Percidae Etheostoma Etheostoma blennioides ... NaN NaN NaN NaN R.D. Suttkus, J.S. Ramsey & M.D. Dahlberg NaN NaN 2014-06-05T03:09Z NaN TAXON_MATCH_HIGHERRANK;GEODETIC_DATUM_ASSUMED_...

5 rows × 42 columns


In [31]:
csv_data['specieskey'].unique()


Out[31]:
array([2382397])

In [32]:
my_species.save_data() # by default the species name plus this 'speciesKey' is used as the filename. An alternative name can be provided


Saved data: /home/daniela/git/iSDM/notebooks/Etheostoma_blennioides2382397.pkl 

In [33]:
csv_data.columns.size # the csv data has, for some reason, far fewer columns


Out[33]:
42

In [34]:
data.columns.size # data from using GBIF API directly


Out[34]:
138

Which columns are in 'data', but not in 'csv_data'?


In [35]:
list(set(data.columns.tolist()) - set(csv_data.columns.tolist())) # hmm, 'decimalLatitude' vs 'decimallatitude'


Out[35]:
['georeferencedBy',
 'scientificName',
 'country',
 'identifiers',
 'language',
 'taxonRemarks',
 'type',
 'datasetID',
 'key',
 'higherClassification',
 'eventDate',
 'scientificNameID',
 'associatedReferences',
 'genusKey',
 'rightsHolder',
 'issues',
 'license',
 'habitat',
 'collectionID',
 'basisOfRecord',
 'familyKey',
 'countryCode',
 'ownerInstitutionCode',
 'fieldNotes',
 'orderKey',
 'decimalLatitude',
 'elevationAccuracy',
 'fieldNumber',
 'associatedSequences',
 'extensions',
 'continent',
 'otherCatalogNumbers',
 'locationRemarks',
 'georeferenceSources',
 'georeferenceProtocol',
 'verbatimLocality',
 'stateProvince',
 'source',
 'phylumKey',
 'bibliographicCitation',
 'municipality',
 'georeferenceRemarks',
 'locationAccordingTo',
 'created',
 'geodeticDatum',
 'relations',
 'references',
 'identificationID',
 'waterBody',
 'identificationRemarks',
 'verbatimEventDate',
 'modified',
 'endDayOfYear',
 'identificationQualifier',
 'http://unknown.org/organismID',
 'depthAccuracy',
 'http://unknown.org/occurrenceDetails',
 'media',
 'lastParsed',
 'occurrenceRemarks',
 'nomenclaturalCode',
 'classKey',
 'lifeStage',
 'parentNameUsage',
 'specificEpithet',
 'dynamicProperties',
 'occurrenceID',
 'gbifID',
 'recordedBy',
 'kingdomKey',
 'lastCrawled',
 'preparations',
 'identificationVerificationStatus',
 'occurrenceStatus',
 'taxonKey',
 'higherGeography',
 'footprintWKT',
 'verbatimCoordinateSystem',
 'georeferencedDate',
 'verbatimElevation',
 'facts',
 'identifier',
 'eventID',
 'dateIdentified',
 'typeStatus',
 'infraspecificEpithet',
 'taxonID',
 'individualCount',
 'institutionID',
 'catalogNumber',
 'establishmentMeans',
 'lastInterpreted',
 'recordNumber',
 'publishingCountry',
 'samplingProtocol',
 'organismID',
 'locationID',
 'eventRemarks',
 'collectionCode',
 'disposition',
 'verbatimDepth',
 'coordinateAccuracyInMeters',
 'decimalLongitude',
 'informationWithheld',
 'coordinateAccuracy',
 'datasetName',
 'protocol',
 'county',
 'startDayOfYear',
 'previousIdentifications',
 'taxonRank',
 'datasetKey',
 'publishingOrgKey',
 'georeferenceVerificationStatus',
 'vernacularName',
 'island',
 'accessRights',
 'associatedOccurrences',
 'identifiedBy',
 'eventTime',
 'islandGroup',
 'speciesKey',
 'institutionCode',
 'genericName']

Which columns are in 'csv_data' but not in 'data'?


In [36]:
list(set(csv_data.columns.tolist()) - set(data.columns.tolist()))


Out[36]:
['infraspecificepithet',
 'eventdate',
 'taxonrank',
 'depthaccuracy',
 'datasetkey',
 'countrycode',
 'elevationaccuracy',
 'specieskey',
 'scientificname',
 'decimallongitude',
 'rightsholder',
 'gbifid',
 'establishmentmeans',
 'recordedby',
 'occurrenceid',
 'basisofrecord',
 'lastinterpreted',
 'taxonkey',
 'recordnumber',
 'catalognumber',
 'identifiedby',
 'collectioncode',
 'institutioncode',
 'issue',
 'typestatus',
 'publishingorgkey',
 'mediatype',
 'decimallatitude']
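Many of the differences in the two lists above are only a matter of capitalisation ('decimalLatitude' vs 'decimallatitude'). Comparing lower-cased names separates those from columns that are genuinely missing on one side. A minimal sketch using a handful of column names taken from the two outputs:

```python
# A few column names copied from the outputs above, not the full lists.
api_columns = ["decimalLatitude", "decimalLongitude", "speciesKey", "issues", "country"]
csv_columns = ["decimallatitude", "decimallongitude", "specieskey", "issue", "mediatype"]

# Lower-case both sides so that pure capitalisation differences disappear.
api_lower = {name.lower() for name in api_columns}
csv_lower = {name.lower() for name in csv_columns}

only_in_api = sorted(api_lower - csv_lower)
only_in_csv = sorted(csv_lower - api_lower)
shared = sorted(api_lower & csv_lower)

print(only_in_api, only_in_csv, shared)
```

With the real frames, data.columns and csv_data.columns could be passed in place of the two short lists.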
