Graph of iDigBio Specimens over Time

This notebook introduces the basics of loading and analyzing iDigBio data on the GUODA infrastructure hosted by the ACIS Lab and iDigBio. This service is documented in the GUODA Jupyter service wiki on Github.

As an example, we will create a graph that shows how many specimens in iDigBio were collected during each the past 200 years. The data for this graph is a random sample of 100,000 records from an export of the iDigBio. The field used to determine the year is the interpreted datecollected field that iDigBio populates based on Darwin Core terms like dwc:year and dwc:eventDate.

If you are interested in the capabilities of GUODA, you may want to scroll to the end of this notebook to view the final graph. If you are interested in doing this work yourself, please keep reading.

Set up

In this document, narrative describing what the code is intended to do and observations about the results is written in Markdown cells. Markdown is a simple wiki-style language that can be used for formating text. Comments about the actual code are written inside the code cells themselves and prefixed with "#" so the are not run.

The code for this document is written in Python and uses the Apache Spark data analytics framework. The code written in this Jupyter notebook is actually run on servers located at the ACIS lab. All of the needed libraries and Spark configuration are already done and there is nothing to install.

It is customary to import libraries used and set configuration options at the top of scripts. In the next cell, we import and set up the Python packages needed by this notebook.


In [1]:
# The Python Spark (pyspark) libraries include functions designed to be run on columns of data
# stored in Spark data frames. They need to be imported in order to use them. Here we
# are going to use 
from pyspark.sql.functions import year

# The matplotlib package is used for graphing. The next line tells Jupyter that when a
# graphing function is used, it should draw the graph here inline in the notebook.
import matplotlib.pyplot as plt
%matplotlib inline

Loading the data set

Data in GUODA is stored on a clustered file system called HDFS. The Jupyter notebooks are all configured to read and write to HDFS automatically so all file paths are in the HDFS system.

You can read more about working with files and how to see what data sets are availible on the Jupyter service wiki.

This line will load the contents of the file that contains a 100,000 record sub-set of iDigBio into a Spark data frame. Then we can look at how many records are in the data frame to confirm that we are working with the 100k subset.


In [2]:
df = sqlContext.read.load("/guoda/data/idigbio-20190612T171757.parquet")

In [3]:
df.count()


Out[3]:
104661524

Examining the data

Now that the data is in memory, let's look at some of the methods availible to examine it before we move on to summarizing it. This will let you see how data is represented both in Spark and Python as well as what kind of data is availible in the iDigBio data frames.

Data frame structure

First we can look at the columns in the data frame. This is all of iDigBio so there are a lot of them.

Also printed by Python is the data type for each column and if a column contains a nested structure (like the "data" structure which has the raw data originally sent to iDigBio) then it is indented.


In [4]:
df.printSchema()


root
 |-- barcodevalue: string (nullable = true)
 |-- basisofrecord: string (nullable = true)
 |-- bed: string (nullable = true)
 |-- canonicalname: string (nullable = true)
 |-- catalognumber: string (nullable = true)
 |-- class: string (nullable = true)
 |-- collectioncode: string (nullable = true)
 |-- collectionid: string (nullable = true)
 |-- collectionname: string (nullable = true)
 |-- collector: string (nullable = true)
 |-- commonname: string (nullable = true)
 |-- continent: string (nullable = true)
 |-- coordinateuncertainty: float (nullable = true)
 |-- country: string (nullable = true)
 |-- countrycode: string (nullable = true)
 |-- county: string (nullable = true)
 |-- data: struct (nullable = true)
 |    |-- coreid: string (nullable = true)
 |    |-- dc:rights: string (nullable = true)
 |    |-- dcterms:accessRights: string (nullable = true)
 |    |-- dcterms:bibliographicCitation: string (nullable = true)
 |    |-- dcterms:language: string (nullable = true)
 |    |-- dcterms:license: string (nullable = true)
 |    |-- dcterms:modified: string (nullable = true)
 |    |-- dcterms:references: string (nullable = true)
 |    |-- dcterms:rights: string (nullable = true)
 |    |-- dcterms:rightsHolder: string (nullable = true)
 |    |-- dcterms:source: string (nullable = true)
 |    |-- dcterms:type: string (nullable = true)
 |    |-- dwc:VerbatimEventDate: string (nullable = true)
 |    |-- dwc:acceptedNameUsage: string (nullable = true)
 |    |-- dwc:accessRights: string (nullable = true)
 |    |-- dwc:associatedMedia: string (nullable = true)
 |    |-- dwc:associatedOccurrences: string (nullable = true)
 |    |-- dwc:associatedReferences: string (nullable = true)
 |    |-- dwc:associatedSequences: string (nullable = true)
 |    |-- dwc:associatedTaxa: string (nullable = true)
 |    |-- dwc:basisOfRecord: string (nullable = true)
 |    |-- dwc:bed: string (nullable = true)
 |    |-- dwc:behavior: string (nullable = true)
 |    |-- dwc:catalogNumber: string (nullable = true)
 |    |-- dwc:class: string (nullable = true)
 |    |-- dwc:collectionCode: string (nullable = true)
 |    |-- dwc:collectionID: string (nullable = true)
 |    |-- dwc:continent: string (nullable = true)
 |    |-- dwc:coordinatePrecision: string (nullable = true)
 |    |-- dwc:coordinateUncertaintyInMeters: string (nullable = true)
 |    |-- dwc:country: string (nullable = true)
 |    |-- dwc:countryCode: string (nullable = true)
 |    |-- dwc:county: string (nullable = true)
 |    |-- dwc:dataGeneralizations: string (nullable = true)
 |    |-- dwc:datasetID: string (nullable = true)
 |    |-- dwc:datasetName: string (nullable = true)
 |    |-- dwc:dateIdentified: string (nullable = true)
 |    |-- dwc:day: string (nullable = true)
 |    |-- dwc:decimalLatitude: string (nullable = true)
 |    |-- dwc:decimalLongitude: string (nullable = true)
 |    |-- dwc:disposition: string (nullable = true)
 |    |-- dwc:dynamicProperties: string (nullable = true)
 |    |-- dwc:earliestAgeOrLowestStage: string (nullable = true)
 |    |-- dwc:earliestEonOrLowestEonothem: string (nullable = true)
 |    |-- dwc:earliestEpochOrLowestSeries: string (nullable = true)
 |    |-- dwc:earliestEraOrLowestErathem: string (nullable = true)
 |    |-- dwc:earliestPeriodOrLowestSystem: string (nullable = true)
 |    |-- dwc:endDayOfYear: string (nullable = true)
 |    |-- dwc:establishmentMeans: string (nullable = true)
 |    |-- dwc:eventDate: string (nullable = true)
 |    |-- dwc:eventID: string (nullable = true)
 |    |-- dwc:eventRemarks: string (nullable = true)
 |    |-- dwc:eventTime: string (nullable = true)
 |    |-- dwc:family: string (nullable = true)
 |    |-- dwc:fieldNotes: string (nullable = true)
 |    |-- dwc:fieldNumber: string (nullable = true)
 |    |-- dwc:footprintSRS: string (nullable = true)
 |    |-- dwc:footprintSpatialFit: string (nullable = true)
 |    |-- dwc:footprintWKT: string (nullable = true)
 |    |-- dwc:formation: string (nullable = true)
 |    |-- dwc:genus: string (nullable = true)
 |    |-- dwc:geodeticDatum: string (nullable = true)
 |    |-- dwc:geologicalContextID: string (nullable = true)
 |    |-- dwc:georeferenceProtocol: string (nullable = true)
 |    |-- dwc:georeferenceRemarks: string (nullable = true)
 |    |-- dwc:georeferenceSources: string (nullable = true)
 |    |-- dwc:georeferenceVerificationStatus: string (nullable = true)
 |    |-- dwc:georeferencedBy: string (nullable = true)
 |    |-- dwc:georeferencedDate: string (nullable = true)
 |    |-- dwc:group: string (nullable = true)
 |    |-- dwc:habitat: string (nullable = true)
 |    |-- dwc:higherClassification: string (nullable = true)
 |    |-- dwc:higherGeography: string (nullable = true)
 |    |-- dwc:higherGeographyID: string (nullable = true)
 |    |-- dwc:highestBiostratigraphicZone: string (nullable = true)
 |    |-- dwc:identificationID: string (nullable = true)
 |    |-- dwc:identificationQualifier: string (nullable = true)
 |    |-- dwc:identificationReferences: string (nullable = true)
 |    |-- dwc:identificationRemarks: string (nullable = true)
 |    |-- dwc:identificationVerificationStatus: string (nullable = true)
 |    |-- dwc:identifiedBy: string (nullable = true)
 |    |-- dwc:individualCount: string (nullable = true)
 |    |-- dwc:informationWithheld: string (nullable = true)
 |    |-- dwc:infraspecificEpithet: string (nullable = true)
 |    |-- dwc:institutionCode: string (nullable = true)
 |    |-- dwc:institutionID: string (nullable = true)
 |    |-- dwc:island: string (nullable = true)
 |    |-- dwc:islandGroup: string (nullable = true)
 |    |-- dwc:kingdom: string (nullable = true)
 |    |-- dwc:latestAgeOrHighestStage: string (nullable = true)
 |    |-- dwc:latestEonOrHighestEonothem: string (nullable = true)
 |    |-- dwc:latestEpochOrHighestSeries: string (nullable = true)
 |    |-- dwc:latestEraOrHighestErathem: string (nullable = true)
 |    |-- dwc:latestPeriodOrHighestSystem: string (nullable = true)
 |    |-- dwc:lifeStage: string (nullable = true)
 |    |-- dwc:lithostratigraphicTerms: string (nullable = true)
 |    |-- dwc:locality: string (nullable = true)
 |    |-- dwc:locationAccordingTo: string (nullable = true)
 |    |-- dwc:locationID: string (nullable = true)
 |    |-- dwc:locationRemarks: string (nullable = true)
 |    |-- dwc:lowestBiostratigraphicZone: string (nullable = true)
 |    |-- dwc:materialSampleID: string (nullable = true)
 |    |-- dwc:maximumDepthInMeters: string (nullable = true)
 |    |-- dwc:maximumElevationInMeters: string (nullable = true)
 |    |-- dwc:member: string (nullable = true)
 |    |-- dwc:minimumDepthInMeters: string (nullable = true)
 |    |-- dwc:minimumElevationInMeters: string (nullable = true)
 |    |-- dwc:month: string (nullable = true)
 |    |-- dwc:municipality: string (nullable = true)
 |    |-- dwc:nameAccordingTo: string (nullable = true)
 |    |-- dwc:nameAccordingToID: string (nullable = true)
 |    |-- dwc:namePublishedIn: string (nullable = true)
 |    |-- dwc:namePublishedInID: string (nullable = true)
 |    |-- dwc:namePublishedInYear: string (nullable = true)
 |    |-- dwc:nomenclaturalCode: string (nullable = true)
 |    |-- dwc:nomenclaturalStatus: string (nullable = true)
 |    |-- dwc:occurrenceDetails: string (nullable = true)
 |    |-- dwc:occurrenceID: string (nullable = true)
 |    |-- dwc:occurrenceRemarks: string (nullable = true)
 |    |-- dwc:occurrenceStatus: string (nullable = true)
 |    |-- dwc:order: string (nullable = true)
 |    |-- dwc:organismID: string (nullable = true)
 |    |-- dwc:organismName: string (nullable = true)
 |    |-- dwc:organismQuantity: string (nullable = true)
 |    |-- dwc:organismQuantityType: string (nullable = true)
 |    |-- dwc:originalNameUsage: string (nullable = true)
 |    |-- dwc:otherCatalogNumbers: string (nullable = true)
 |    |-- dwc:ownerInstitutionCode: string (nullable = true)
 |    |-- dwc:parentNameUsage: string (nullable = true)
 |    |-- dwc:phylum: string (nullable = true)
 |    |-- dwc:pointRadiusSpatialFit: string (nullable = true)
 |    |-- dwc:preparations: string (nullable = true)
 |    |-- dwc:previousIdentifications: string (nullable = true)
 |    |-- dwc:recordNumber: string (nullable = true)
 |    |-- dwc:recordedBy: string (nullable = true)
 |    |-- dwc:reproductiveCondition: string (nullable = true)
 |    |-- dwc:rights: string (nullable = true)
 |    |-- dwc:rightsHolder: string (nullable = true)
 |    |-- dwc:samplingEffort: string (nullable = true)
 |    |-- dwc:samplingProtocol: string (nullable = true)
 |    |-- dwc:scientificName: string (nullable = true)
 |    |-- dwc:scientificNameAuthorship: string (nullable = true)
 |    |-- dwc:scientificNameID: string (nullable = true)
 |    |-- dwc:sex: string (nullable = true)
 |    |-- dwc:specificEpithet: string (nullable = true)
 |    |-- dwc:startDayOfYear: string (nullable = true)
 |    |-- dwc:stateProvince: string (nullable = true)
 |    |-- dwc:subgenus: string (nullable = true)
 |    |-- dwc:taxonID: string (nullable = true)
 |    |-- dwc:taxonRank: string (nullable = true)
 |    |-- dwc:taxonRemarks: string (nullable = true)
 |    |-- dwc:taxonomicStatus: string (nullable = true)
 |    |-- dwc:typeStatus: string (nullable = true)
 |    |-- dwc:verbatimCoordinateSystem: string (nullable = true)
 |    |-- dwc:verbatimCoordinates: string (nullable = true)
 |    |-- dwc:verbatimDepth: string (nullable = true)
 |    |-- dwc:verbatimElevation: string (nullable = true)
 |    |-- dwc:verbatimEventDate: string (nullable = true)
 |    |-- dwc:verbatimLatitude: string (nullable = true)
 |    |-- dwc:verbatimLocality: string (nullable = true)
 |    |-- dwc:verbatimLongitude: string (nullable = true)
 |    |-- dwc:verbatimSRS: string (nullable = true)
 |    |-- dwc:verbatimTaxonRank: string (nullable = true)
 |    |-- dwc:vernacularName: string (nullable = true)
 |    |-- dwc:waterBody: string (nullable = true)
 |    |-- dwc:year: string (nullable = true)
 |    |-- fcc:datePicked: string (nullable = true)
 |    |-- fcc:pickedBy: string (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- idigbio:preservative: string (nullable = true)
 |    |-- idigbio:recordId: string (nullable = true)
 |    |-- idigbio:subfamily: string (nullable = true)
 |    |-- idigbio:substrate: string (nullable = true)
 |    |-- idigbio:superfamily: string (nullable = true)
 |    |-- symbiota:recordEnteredBy: string (nullable = true)
 |    |-- symbiota:verbatimScientificName: string (nullable = true)
 |-- datecollected: timestamp (nullable = true)
 |-- datemodified: timestamp (nullable = true)
 |-- dqs: float (nullable = true)
 |-- earliestageorloweststage: string (nullable = true)
 |-- earliesteonorlowesteonothem: string (nullable = true)
 |-- earliestepochorlowestseries: string (nullable = true)
 |-- earliesteraorlowesterathem: string (nullable = true)
 |-- earliestperiodorlowestsystem: string (nullable = true)
 |-- etag: string (nullable = true)
 |-- eventdate: string (nullable = true)
 |-- family: string (nullable = true)
 |-- fieldnumber: string (nullable = true)
 |-- formation: string (nullable = true)
 |-- genus: string (nullable = true)
 |-- geologicalcontextid: string (nullable = true)
 |-- geopoint: struct (nullable = true)
 |    |-- lat: double (nullable = true)
 |    |-- lon: double (nullable = true)
 |-- group: string (nullable = true)
 |-- hasImage: boolean (nullable = true)
 |-- hasMedia: boolean (nullable = true)
 |-- highertaxon: string (nullable = true)
 |-- highestbiostratigraphiczone: string (nullable = true)
 |-- individualcount: float (nullable = true)
 |-- infraspecificepithet: string (nullable = true)
 |-- institutioncode: string (nullable = true)
 |-- institutionid: string (nullable = true)
 |-- institutionname: string (nullable = true)
 |-- kingdom: string (nullable = true)
 |-- latestageorhigheststage: string (nullable = true)
 |-- latesteonorhighesteonothem: string (nullable = true)
 |-- latestepochorhighestseries: string (nullable = true)
 |-- latesteraorhighesterathem: string (nullable = true)
 |-- latestperiodorhighestsystem: string (nullable = true)
 |-- lithostratigraphicterms: string (nullable = true)
 |-- locality: string (nullable = true)
 |-- lowestbiostratigraphiczone: string (nullable = true)
 |-- maxdepth: float (nullable = true)
 |-- maxelevation: float (nullable = true)
 |-- member: string (nullable = true)
 |-- mindepth: float (nullable = true)
 |-- minelevation: float (nullable = true)
 |-- municipality: string (nullable = true)
 |-- occurrenceid: string (nullable = true)
 |-- order: string (nullable = true)
 |-- phylum: string (nullable = true)
 |-- recordnumber: string (nullable = true)
 |-- recordset: string (nullable = true)
 |-- scientificname: string (nullable = true)
 |-- specificepithet: string (nullable = true)
 |-- startdayofyear: integer (nullable = true)
 |-- stateprovince: string (nullable = true)
 |-- taxonid: string (nullable = true)
 |-- taxonomicstatus: string (nullable = true)
 |-- taxonrank: string (nullable = true)
 |-- typestatus: string (nullable = true)
 |-- uuid: string (nullable = true)
 |-- verbatimeventdate: string (nullable = true)
 |-- verbatimlocality: string (nullable = true)
 |-- version: integer (nullable = true)
 |-- waterbody: string (nullable = true)

Next we can look at the first row of data. The (1) after head tells Python how many rows to print. Since this is all iDigBio data, the rows are pretty big so we'll only show one.


In [5]:
df.head(1)


Out[5]:
[Row(barcodevalue=None, basisofrecord='preservedspecimen', bed=None, canonicalname='bombus ashtoni', catalognumber='cuic_ent 00035073', class='insecta', collectioncode=None, collectionid=None, collectionname=None, collector='b. taraday', commonname=None, continent='north america', coordinateuncertainty=None, country='united states', countrycode='usa', county='tompkins', data=Row(coreid=None, dc:rights=None, dcterms:accessRights=None, dcterms:bibliographicCitation=None, dcterms:language=None, dcterms:license=None, dcterms:modified='2014-03-18', dcterms:references='Digital Bee Collections Network, 2014 (and updates). Version: 2016-03-08. National Science Foundation grant DBI 0956388', dcterms:rights=None, dcterms:rightsHolder=None, dcterms:source=None, dcterms:type=None, dwc:VerbatimEventDate=None, dwc:acceptedNameUsage=None, dwc:accessRights=None, dwc:associatedMedia=None, dwc:associatedOccurrences=None, dwc:associatedReferences=None, dwc:associatedSequences=None, dwc:associatedTaxa=None, dwc:basisOfRecord='PreservedSpecimen', dwc:bed=None, dwc:behavior=None, dwc:catalogNumber='CUIC_ENT 00035073', dwc:class=None, dwc:collectionCode=None, dwc:collectionID=None, dwc:continent=None, dwc:coordinatePrecision=None, dwc:coordinateUncertaintyInMeters=None, dwc:country='UNITED STATES', dwc:countryCode=None, dwc:county='Tompkins', dwc:dataGeneralizations=None, dwc:datasetID='urn:uuid:13674fa4-8611-11e4-8259-0026552be7ea', dwc:datasetName='Collaborative databasing of North American bee collections within a global informatics network project', dwc:dateIdentified=None, dwc:day=None, dwc:decimalLatitude='42.44923', dwc:decimalLongitude='-76.48226', dwc:disposition=None, dwc:dynamicProperties=None, dwc:earliestAgeOrLowestStage=None, dwc:earliestEonOrLowestEonothem=None, dwc:earliestEpochOrLowestSeries=None, dwc:earliestEraOrLowestErathem=None, dwc:earliestPeriodOrLowestSystem=None, dwc:endDayOfYear=None, dwc:establishmentMeans=None, dwc:eventDate='1980-06-20', dwc:eventID=None, dwc:eventRemarks=None, dwc:eventTime=None, dwc:family='Apidae', dwc:fieldNotes=None, dwc:fieldNumber=None, dwc:footprintSRS=None, dwc:footprintSpatialFit=None, dwc:footprintWKT=None, dwc:formation=None, dwc:genus='Bombus', dwc:geodeticDatum=None, dwc:geologicalContextID=None, dwc:georeferenceProtocol=None, dwc:georeferenceRemarks=None, dwc:georeferenceSources=None, dwc:georeferenceVerificationStatus=None, dwc:georeferencedBy=None, dwc:georeferencedDate=None, dwc:group=None, dwc:habitat=None, dwc:higherClassification='Animalia;Arthropoda;Apidae;Apinae;Bombini', dwc:higherGeography=None, dwc:higherGeographyID=None, dwc:highestBiostratigraphicZone=None, dwc:identificationID=None, dwc:identificationQualifier=None, dwc:identificationReferences=None, dwc:identificationRemarks=None, dwc:identificationVerificationStatus=None, dwc:identifiedBy='G. C. Eickwort', dwc:individualCount='1', dwc:informationWithheld=None, dwc:infraspecificEpithet=None, dwc:institutionCode='CUIC', dwc:institutionID=None, dwc:island=None, dwc:islandGroup=None, dwc:kingdom=None, dwc:latestAgeOrHighestStage=None, dwc:latestEonOrHighestEonothem=None, dwc:latestEpochOrHighestSeries=None, dwc:latestEraOrHighestErathem=None, dwc:latestPeriodOrHighestSystem=None, dwc:lifeStage='Adult', dwc:lithostratigraphicTerms=None, dwc:locality='Ithaca, Cornell Campus', dwc:locationAccordingTo='Label', dwc:locationID=None, dwc:locationRemarks=None, dwc:lowestBiostratigraphicZone=None, dwc:materialSampleID=None, dwc:maximumDepthInMeters=None, dwc:maximumElevationInMeters=None, dwc:member=None, dwc:minimumDepthInMeters=None, dwc:minimumElevationInMeters=None, dwc:month=None, dwc:municipality=None, dwc:nameAccordingTo=None, dwc:nameAccordingToID=None, dwc:namePublishedIn=None, dwc:namePublishedInID=None, dwc:namePublishedInYear=None, dwc:nomenclaturalCode=None, dwc:nomenclaturalStatus=None, dwc:occurrenceDetails=None, dwc:occurrenceID='urn:uuid:9372ac72-aeab-11e3-8259-0026552be7ea', dwc:occurrenceRemarks=None, dwc:occurrenceStatus=None, dwc:order=None, dwc:organismID=None, dwc:organismName=None, dwc:organismQuantity=None, dwc:organismQuantityType=None, dwc:originalNameUsage=None, dwc:otherCatalogNumbers=None, dwc:ownerInstitutionCode='Cornell University Insect Collection', dwc:parentNameUsage=None, dwc:phylum=None, dwc:pointRadiusSpatialFit=None, dwc:preparations=None, dwc:previousIdentifications=None, dwc:recordNumber=None, dwc:recordedBy='B. Taraday', dwc:reproductiveCondition=None, dwc:rights=None, dwc:rightsHolder=None, dwc:samplingEffort=None, dwc:samplingProtocol='Netting', dwc:scientificName='Bombus ashtoni', dwc:scientificNameAuthorship='(Cresson,1864)', dwc:scientificNameID=None, dwc:sex='Female', dwc:specificEpithet='ashtoni', dwc:startDayOfYear=None, dwc:stateProvince='New York', dwc:subgenus=None, dwc:taxonID=None, dwc:taxonRank=None, dwc:taxonRemarks=None, dwc:taxonomicStatus=None, dwc:typeStatus='None', dwc:verbatimCoordinateSystem=None, dwc:verbatimCoordinates=None, dwc:verbatimDepth=None, dwc:verbatimElevation='265 m', dwc:verbatimEventDate=None, dwc:verbatimLatitude=None, dwc:verbatimLocality=None, dwc:verbatimLongitude=None, dwc:verbatimSRS=None, dwc:verbatimTaxonRank=None, dwc:vernacularName=None, dwc:waterBody=None, dwc:year='1980', fcc:datePicked=None, fcc:pickedBy=None, id='urn:uuid:9372ac72-aeab-11e3-8259-0026552be7ea_RID', idigbio:preservative=None, idigbio:recordId=None, idigbio:subfamily=None, idigbio:substrate=None, idigbio:superfamily=None, symbiota:recordEnteredBy=None, symbiota:verbatimScientificName=None), datecollected=datetime.datetime(1980, 6, 19, 20, 0), datemodified=datetime.datetime(2016, 3, 9, 4, 17, 27, 393000), dqs=0.1617647111415863, earliestageorloweststage=None, earliesteonorlowesteonothem=None, earliestepochorlowestseries=None, earliesteraorlowesterathem=None, earliestperiodorlowestsystem=None, etag='b3985d7148b9f1a5243dc28ade2187c8a6df9c76', eventdate='1980-06-20', family='apidae', fieldnumber=None, formation=None, genus='bombus', geologicalcontextid=None, geopoint=Row(lat=42.44923, lon=-76.48226), group=None, hasImage=False, hasMedia=False, highertaxon='animalia;arthropoda;apidae;apinae;bombini', highestbiostratigraphiczone=None, individualcount=1.0, infraspecificepithet=None, institutioncode='cuic', institutionid=None, institutionname=None, kingdom='animalia', latestageorhigheststage=None, latesteonorhighesteonothem=None, latestepochorhighestseries=None, latesteraorhighesterathem=None, latestperiodorhighestsystem=None, lithostratigraphicterms=None, locality='ithaca, cornell campus', lowestbiostratigraphiczone=None, maxdepth=None, maxelevation=None, member=None, mindepth=None, minelevation=None, municipality=None, occurrenceid='urn:uuid:9372ac72-aeab-11e3-8259-0026552be7ea', order='hymenoptera', phylum='arthropoda', recordnumber=None, recordset='8919571f-205a-4aed-b9f2-96ccd0108e4c', scientificname='bombus ashtoni', specificepithet='ashtoni', startdayofyear=172, stateprovince='new york', taxonid='1340457', taxonomicstatus='accepted', taxonrank='species', typestatus='none', uuid='43c20745-590c-45f5-9e5c-ddc5153aa573', verbatimeventdate=None, verbatimlocality=None, version=None, waterbody=None)]

Summarizing the data

That's certainly more data than we need to make the graph. Since there is one row in the data frame for each specimen record, what we need to do is group the records by the year they were collected and then count the number of records in each group and associate that with the year. The data frame we want to have as a result should have two columns, one for year and one for the count of the records collected in that year.

This is a common chain of operations often refered to as select, group by, and count which comes from the SQL syntax for doing this operation.

Working with the year complicated by the fact that iDigBio has a datecollected field and not a yearcollected field. While we are often provided a year in the raw data, we assemble and convert all the date information from the Darwin Core fields into a date-type object and store that as datecollected. Because this object is a date type we can sort it and search for ranges. (Consider what would happen if we tried that with raw data strings like "2004-01-14" and "March 15, 2015".)

We need to extract the year part of datecollected and we need to convert it to a number so we can sort on it.


In [6]:
# The outer "(" and ")" surround the chain of Python method calls to allow them to 
# span lines. This is a common convention and makes the data processing pipeline
# easy to read and modify.
#
# The persist() function tells Spark to store the data frame in memory so it can be
# accessed repeatedly without having to be reloaded.
year_summary = (df
                .groupBy(year("datecollected").cast("integer").alias("yearcollected"))
                .count()
                .orderBy("yearcollected")
                .persist()
                )

Let's take a look at this new data frame using some of the commands from above:


In [7]:
year_summary.count()


Out[7]:
1390

In [8]:
year_summary.printSchema()


root
 |-- yearcollected: integer (nullable = true)
 |-- count: long (nullable = false)


In [9]:
year_summary.head(10)


Out[9]:
[Row(yearcollected=None, count=27146522),
 Row(yearcollected=100, count=3),
 Row(yearcollected=102, count=1),
 Row(yearcollected=103, count=2),
 Row(yearcollected=104, count=2),
 Row(yearcollected=105, count=5),
 Row(yearcollected=106, count=9),
 Row(yearcollected=107, count=3),
 Row(yearcollected=108, count=3),
 Row(yearcollected=109, count=4)]

Now that our data is both much smaller and mostly numeric, we can use the describe() method to quickly make summary statistics. This method returns a data frame so we have to use show() to actually print the whole contents of the data frame.


In [10]:
year_summary.describe().show()


+-------+----------------+-----------------+
|summary|   yearcollected|            count|
+-------+----------------+-----------------+
|  count|            1389|             1390|
|   mean|2316.51403887689|75296.06043165468|
| stddev| 2122.4294635612|752125.4364303295|
|    min|             100|                1|
|    max|            9999|         27146522|
+-------+----------------+-----------------+

Spark data frames, Pandas data frames, and filtering

The term "data frame" is a concept for how data is arranged. Different programming languages and even libraries in a single programming language have different implimentations of this idea.

We have been working with a Spark data frame. Now we want to do some graphing and the Python graphing libraries know how to work with a Pandas data frame. Fortunately this is such a common conversion that there is a built-in method to do it.

One thing to be aware of is that Pandas data frames are not stored on our computation cluster like the Spark data frames are. This means they need to be small and you should not do too much computation on them. Our year_summary data frame is only 2 columns and about 220 rows this isn't a problem.

While converting to a Pandas data frame, we will also reduce the years to the range 1817 - 2017. From the output of describe() we could see that there were some years that didn't make sense.


In [11]:
pandas_year_summary = (year_summary
                       .filter(year_summary.yearcollected >= 1817)
                       .filter(year_summary.yearcollected <= 2017)
                       .orderBy("yearcollected")
                       .toPandas()
                       )

In [12]:
pandas_year_summary.head()


Out[12]:
yearcollected count
0 1817 3447
1 1818 11556
2 1819 3748
3 1820 4009
4 1821 4754

(Notice that the display of the first rows looks different from when we ran head() on the Spark data frame? That's because we're looking at the display generated by the Pandas library instead of the Spark library.)

Making a graph

The number of specimens collected in a year is discrete data so a bar graph is one appropriate way to display them.


In [13]:
plt.bar(pandas_year_summary["yearcollected"], 
        pandas_year_summary["count"],
        edgecolor='none', width=1.0
       )
plt.title("Specimens in iDigBio by Collection Year and Continent")
plt.ylabel("Number of Specimen Records")
plt.xlabel("Year")


Out[13]:
<matplotlib.text.Text at 0x7f794c076c88>