Hello! My name is Nicole Donnelly.
I am going to present a workshop on data collection, specifically using Python to collect data from APIs and to build a basic web scraper.
To that end...
I know a lot of people, especially here, probably love data. But I really love data.
Data is my happy place.
Some people nurture plants, or animals, or children.
I want to nurture data.
I want it to become the best data it can be. I want it to help people. I want it to be happy.
In my past life as a consultant, I spent a lot of time with data. I worked in computer forensics and electronic discovery. I collected data, inventoried it, organized it, gave it context, analyzed it, reconstructed it, reported on it, and put it in useful formats for people to review and use. I was frequently asked to help inventory and organize data for other projects because I am good at it and I enjoy it.
I decided to make a career change to data science. I completed the professional certificate in data science at Georgetown, became the TA, and joined the faculty. I am also on the faculty at District Data Labs. I completed the Data Science Immersive at General Assembly. And now I work for the city in the Office of the Chief Technology Officer where I am detailed to the Office of Unified Communication to work on their data analysis and projects.
I am going to assume people can navigate from the command line, have Python installed, and have some experience with Python.
I am also going to assume people have not tried using APIs.
I created a notebook so we can go through some of the Python and you can see what it is doing.
I have also created some scripts you can run when you are ready. Feel free to modify them to suit your needs. Or, if you have some experience with all of this, feel free to play around with those instead of the notebook.
If you are new to all of this, including Python, I have included some resources at the end of the slides.
Everywhere.
Copy/paste is possible because of APIs.
Common API examples:
“RSS is an XML-based vocabulary for distributing Web content in opt-in feeds. Feeds allow the user to have new content delivered to a computer or mobile device as soon as it is published.”
Source: http://searchwindevelopment.techtarget.com/definition/RSS
RSS has been expanded as an acronym in multiple ways: Rich Site Summary, RDF (Resource Description Framework) Site Summary, and Really Simple Syndication. It is an XML-based content distribution format commonly used for blog and news data. Typically, it is used to read blogs and news, but we can also use it to transfer and collect data. RSS retrieval can also be operationalized to create a corpus for analysis.
Side note: if you are looking for an open source project to get involved with and are interested in operationalizing RSS, check out Baleen.
Context: I will focus on RSS and REST APIs with JSON.
Read more about API types here
http://dvd.netflix.com/NewReleasesRSS
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" >
<channel >
<title>New Releases This Week</title>
<ttl>10080</ttl>
<link>http://dvd.netflix.com</link>
<description>New movies at Netflix this week</description>
<language>en-us</language>
<cf:treatAs xmlns:cf="http://www.microsoft.com/schemas/rss/core/2005">list</cf:treatAs>
<atom:link href="http://dvd.netflix.com/NewReleasesRSS" rel="self" type="application/rss+xml"/>
<item>
<title>Bakery in Brooklyn</title>
<link>https://dvd.netflix.com/Movie/Bakery-in-Brooklyn/80152426</link>
<guid isPermaLink="true">https://dvd.netflix.com/Movie/Bakery-in-Brooklyn/80152426</guid>
<description><a href="https://dvd.netflix.com/Movie/Bakery-in-Brooklyn/80152426"><img src="//secure.netflix.com/us/boxshots/small/80152426.jpg"/></a><br>Vivien and Chloe have just inherited their Aunt's bakery, a boulangerie that has been a cornerstone of the neighborhood for years. Chloe wants a new image and product, while Vivien wants to make sure nothing changes. Their clash of ideas leads to a peculiar solution, they split the shop in half. But Vivien and Chloe will have to learn to overcome their differences in order to save the bakery and everything that truly matters in their lives.</description>
</item>
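A feed like the one above can be read with Python's standard library. This is a minimal sketch, assuming a trimmed, well-formed copy of the feed held in a string. The names `SAMPLE_FEED` and `parse_items` are illustrative and not part of the workshop scripts; a real run would download the feed first.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the feed shown above (assumption:
# in practice the text would come from an HTTP request to the feed URL).
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>New Releases This Week</title>
    <link>http://dvd.netflix.com</link>
    <item>
      <title>Bakery in Brooklyn</title>
      <link>https://dvd.netflix.com/Movie/Bakery-in-Brooklyn/80152426</link>
    </item>
  </channel>
</rss>"""


def parse_items(feed_text):
    """Return one dict per <item>, holding its title and link."""
    root = ET.fromstring(feed_text)
    items = []
    # iter("item") walks the whole tree, so it finds items under <channel>.
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
        })
    return items


items = parse_items(SAMPLE_FEED)
for entry in items:
    print(entry["title"], "->", entry["link"])
```

Each `<item>` becomes a plain dictionary, which is easy to append to a list or write out when building a corpus.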
http://dev.markitondemand.com/Api/Quote/json?symbol=AAPL
{"Data":{"Status":"SUCCESS","Name":"Apple Inc","Symbol":"AAPL","LastPrice":117.12,"Change":-0.349999999999994,"ChangePercent":-0.297948412360598,"Timestamp":"Wed Oct 19 00:00:00 UTC-04:00 2016","MarketCap":631094444160,"Volume":20034594,"ChangeYTD":105.26,"ChangePercentYTD":11.2673380201406,"High":117.76,"Low":113.8,"Open":117.25}}
It is as easy as constructing the correct URL!
REST was heavily influenced by HTTP, so REST APIs are almost always implemented over HTTP.
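As a rough illustration of constructing the URL, the sketch below assembles the quote URL from a base address and a dictionary of query parameters. The helper name `build_url` is hypothetical; `urlencode` comes from the standard library and takes care of escaping.

```python
from urllib.parse import urlencode

# Base address taken from the quote example above.
BASE_URL = "http://dev.markitondemand.com/Api/Quote/json"


def build_url(base, **params):
    """Join a base URL and query parameters into one request URL."""
    return base + "?" + urlencode(params)


quote_url = build_url(BASE_URL, symbol="AAPL")
print(quote_url)
```

Building the query string from a dictionary, rather than by hand, avoids mistakes when a value contains spaces or other characters that need escaping.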
http://api.dp.la/v2/items?api_key=0123456789&q=goats+AND+cats
{"count":29,
"start":0,
"limit":10,
"docs":[{"@context":"http://dp.la/api/items/context","isShownAt":"http://cdm16795.contentdm.oclc.org/cdm/ref/collection/divtour/id/88","dataProvider":"Missouri State Archives through Missouri Digital Heritage","@type":"ore:Aggregation","provider":{"@id":"http://dp.la/api/contributor/missouri-hub","name":"Missouri Hub"},"hasView":{"@id":"http://cdm16795.contentdm.oclc.org/cdm/ref/collection/divtour/id/88"},"object":"http://data.mohistory.org/files/thumbnails/cdm16795_contentdm_oclc_org568ad334407e0.jpg","ingestionSequence":12,"id":"9e05f398ca95f9bbfd733e6d3493fd74","ingestDate":"2016-10-11T13:21:48.399681Z","_rev":"7-6bee4d18708d1d16efceeea1e061b316","aggregatedCHO":"#sourceResource","_id":"missouri--urn:data.mohistory.org:mdh_all:oai:cdm16795.contentdm.oclc.org:divtour/88","sourceResource":{"title":["Alabama Big Cats Safari Adventure"],"description":["Children bottle feeding goats"],"subject":[{"name":"Transparencies, Slides"},{"name":"Tourist Destination"}],"rights":["Copyright is in the public domain. 
Items reproduced for publication should carry the credit line: Courtesy of the Missouri State Archives."],"relation":["Division of Tourism Photograph Collection"],"language":[{"iso639_3":"eng","name":"English"}],"format":"Image","collection":{"id":"594a2b3666ab0c55245f6640555554cd","description":"","title":"Mdh_divtour","@id":"http://dp.la/api/collections/594a2b3666ab0c55245f6640555554cd"},"stateLocatedIn":[{"name":"Missouri"}],"@id":"http://dp.la/api/items/9e05f398ca95f9bbfd733e6d3493fd74#sourceResource","identifier":["001_070","http://cdm16795.contentdm.oclc.org/cdm/ref/collection/divtour/id/88"],"creator":["GD"]},"admin":{"validation_message":null,"sourceResource":{"title":"Alabama Big Cats Safari Adventure"},"valid_after_enrich":true},"ingestType":"item","@id":"http://dp.la/api/items/9e05f398ca95f9bbfd733e6d3493fd74","originalRecord":{"id":"urn:data.mohistory.org:mdh_all:oai:cdm16795.contentdm.oclc.org:divtour/88","provider":{"@id":"http://dp.la/api/contributor/missouri-hub","name":"Missouri Hub"},"collection":{"id":"594a2b3666ab0c55245f6640555554cd","description":"","title":"mdh_divtour","@id":"http://dp.la/api/collections/594a2b3666ab0c55245f6640555554cd"},"header":{"expirationdatetime":"2016-10-08T17:04:17Z","datestamp":"2016-10-04T13:19:05Z","identifier":"urn:data.mohistory.org:mdh_all:oai:cdm16795.contentdm.oclc.org:divtour/88","setSpec":"mdh_divtour"},"metadata":{"mods":{"accessCondition":"Copyright is in the public domain. 
Items reproduced for publication should carry the credit line: Courtesy of the Missouri State Archives.","location":{"url":[{"#text":"http://cdm16795.contentdm.oclc.org/cdm/ref/collection/divtour/id/88","access":"object in context"},{"#text":"http://data.mohistory.org/files/thumbnails/cdm16795_contentdm_oclc_org568ad334407e0.jpg","access":"preview"}]},"subject":[{"topic":"Transparencies, Slides"},{"topic":"Tourist Destination"}],"name":{"namePart":"GD","role":{"roleTerm":"creator"}},"relatedItem":{"titleInfo":{"title":"Division of Tourism Photograph Collection"}},"physicalDescription":{"note":"Image"},"xmlns":"http://www.loc.gov/mods/v3","language":{"languageTerm":"eng"},"titleInfo":{"title":"Alabama Big Cats Safari Adventure"},"identifier":["001_070","http://cdm16795.contentdm.oclc.org/cdm/ref/collection/divtour/id/88"],"note":["Children bottle feeding goats",{"#text":"Missouri State Archives through Missouri Digital Heritage","type":"ownership"}]}}},"score":4.534843}, ...
"facets":[]}
Sometimes you will need an API key.
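When a key is required, it usually travels as one more query parameter. Here is a minimal sketch, assuming the key is stored in an environment variable so it stays out of your scripts. The variable name `DPLA_API_KEY` and the helper `dpla_search_url` are made up for illustration, and `0123456789` is the placeholder key from the example URL, not a real one.

```python
import os
from urllib.parse import urlencode

# Items endpoint taken from the example URL above.
DPLA_ITEMS_URL = "http://api.dp.la/v2/items"


def dpla_search_url(query, api_key=None):
    """Build a search URL, reading the key from the environment if not given."""
    # DPLA_API_KEY is a hypothetical variable name chosen for this sketch.
    key = api_key or os.environ.get("DPLA_API_KEY")
    if not key:
        raise ValueError("No API key: pass api_key or set DPLA_API_KEY")
    return DPLA_ITEMS_URL + "?" + urlencode({"api_key": key, "q": query})


search_url = dpla_search_url("goats AND cats", api_key="0123456789")
print(search_url)
```

Note that `urlencode` turns the spaces in the query into `+`, which reproduces the `q=goats+AND+cats` form shown in the example.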
http://dev.markitondemand.com/Api/Quote/json?symbol=AAPL
{"Data":{"Status":"SUCCESS","Name":"Apple Inc","Symbol":"AAPL","LastPrice":117.06,"Change":-0.0600000000000023,"ChangePercent":-0.0512295081967233,"Timestamp":"Thu Oct 20 00:00:00 UTC-04:00 2016","MarketCap":630771137580,"Volume":24059570,"ChangeYTD":105.26,"ChangePercentYTD":11.2103363100893,"High":117.38,"Low":116.33,"Open":116.86}}
http://dev.markitondemand.com/Api/Quote/xml?symbol=AAPL
<QuoteApiModel>
<Data>
<Status>SUCCESS</Status>
<Name>Apple Inc</Name>
<Symbol>AAPL</Symbol>
<LastPrice>117.06</LastPrice>
<Change>-0.06</Change>
<ChangePercent>-0.0512295082</ChangePercent>
<Timestamp>Thu Oct 20 00:00:00 UTC-04:00 2016</Timestamp>
<MarketCap>630771137580</MarketCap>
<Volume>24059570</Volume>
<ChangeYTD>105.26</ChangeYTD>
<ChangePercentYTD>11.2103363101</ChangePercentYTD>
<High>117.38</High>
<Low>116.33</Low>
<Open>116.86</Open>
</Data>
</QuoteApiModel>
APIs return serialized data.
JSON stands for "JavaScript Object Notation", and has become a widely used standard for serializing native data structures for transmission. It is lightweight, easy to read, and quick to parse. It is easy to use in Python with the json library.
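To make that concrete, here is a small sketch that parses a shortened copy of the quote response with the json library. The string `raw` is a trimmed stand-in for what the API returned; in a real script it would be the body of the HTTP response.

```python
import json

# Shortened copy of the quote response shown earlier.
raw = ('{"Data":{"Status":"SUCCESS","Name":"Apple Inc","Symbol":"AAPL",'
       '"LastPrice":117.06,"Volume":24059570}}')

# json.loads turns the text into ordinary Python dicts, strings, and numbers.
quote = json.loads(raw)["Data"]
print(quote["Name"], quote["LastPrice"])

# json.dumps goes the other way, e.g. before saving the record to a file.
serialized = json.dumps(quote, sort_keys=True)
print(serialized)
```

The numbers arrive as Python `float` and `int` values, so no manual type conversion is needed.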
XML stands for "eXtensible Markup Language", and is the granddaddy of serialized data formats (like HTML, it descends from SGML). XML is fat, ugly, and cumbersome to parse. However, it remains a major format due to its legacy usage across the web. Most people favor using a JSON API, if available. There are XML libraries for Python as well: lxml, xml.etree.ElementTree, xml.sax, and xml.dom.minidom.
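For comparison, the same quote can be pulled out of the XML response with the standard library's `xml.etree.ElementTree`. This is a sketch using a shortened copy of the response; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML quote response shown earlier.
raw_xml = """<QuoteApiModel>
  <Data>
    <Status>SUCCESS</Status>
    <Name>Apple Inc</Name>
    <Symbol>AAPL</Symbol>
    <LastPrice>117.06</LastPrice>
  </Data>
</QuoteApiModel>"""

root = ET.fromstring(raw_xml)
data = root.find("Data")

# Every child element becomes a key; every value is still a string.
quote_from_xml = {child.tag: child.text for child in data}

# Unlike JSON, numeric fields must be converted by hand.
last_price = float(quote_from_xml["LastPrice"])
print(quote_from_xml["Name"], last_price)
```

The extra conversion step is one reason many people find JSON more convenient: XML carries everything as text, so the script has to know which fields are numbers.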
There are a lot of online resources for using Git and GitHub. Here is a good one to start with: An Intro to Git and GitHub for Beginners (Tutorial).
There are a lot of Python references available. I used Learn Python the Hard Way and Automate the Boring Stuff with Python when I was first learning. I like the challenges on HackerRank to practice. The discussion on each challenge is great for seeing how other people approach the problem and for asking questions. The free beginner lessons on DataQuest are also great if you are just starting out or want more practice.
I created this presentation using Jupyter Notebook and reveal.js. It is being hosted as a GitHub Project Page. I found a few resources out there for how to do this such as Presentation slides with Jupyter Notebook and Deploy reveal.js slideshow on github-pages.