Data Science

APIs, JSON, NoSQL Databases

Alessandro Gagliardi
Sr. Data Scientist, Glassdoor.com

Last Time:

  • #### Amazon Web Services
  • #### StarCluster
  • #### IPython.parallel

Questions?

Agenda

  1. APIs & JSON Lecture
  2. NoSQL Databases
  3. Lab: Twitter & MongoDB

A. APIs & JSON

Speaking broadly:

An application programming interface (API) specifies how some software components should interact with each other.

More specifically:

A web API is a programmatic interface to a defined request-response message system, typically expressed in JSON or XML, which is exposed via the web—most commonly by means of an HTTP-based web server.

from Wikipedia

Web APIs allow people to interact with the structures of an application to:

  • get
  • put
  • delete
  • update data

Best practices for web APIs are to use RESTful principles.

REST = REpresentational State Transfer

REST vs. SQL

GET ( ~ SELECT)
POST ( ~ UPDATE)
PUT ( ~ INSERT)
DELETE ( ~ DELETE)

RESTful web API HTTP methods

ResourceGETPUTPOSTDELETE

Collection URI, such as http://example.com/resourcesList the URIs and perhaps other details of the collection's members.Replace the entire collection with another collection.Create a new entry in the collection. The new entry's URI is assigned automatically and is usually returned by the operation.Delete the entire collection.

Element URI, such as http://example.com/resources/item17Retrieve a representation of the addressed member of the collection, expressed in an appropriate Internet media type.Replace the addressed member of the collection, or if it doesn't exist, create it.Not generally used. Treat the addressed member as a collection in its own right and create a new entry in it.Delete the addressed member of the collection. </TABLE>

From http://en.wikipedia.org/wiki/Representational_state_transfer

HTTP requests can be handled easily using Python's requests library.

First we will load our credentials which we keep in a YAML file for safe keeping.


In [1]:
import yaml
credentials = yaml.load(open('/Users/alessandro.gagliardi/api_cred.yml'))

Then we pass those credentials in to a GET request using the requests library. In this case, I am querying my own user data from Github:


In [3]:
import requests
r = requests.get('https://api.github.com/user', 
                 auth=(credentials['USER'], credentials['PASS']))

Requests gives us an object from which we can read its content.


In [4]:
r.content


Out[4]:
'{"login":"eklypse","id":896607,"avatar_url":"https://gravatar.com/avatar/42c577edc388cc9d1050927da89d47cc?d=https%3A%2F%2Fidenticons.github.com%2F8ecc0615aa5020371b8eb9e4538960e6.png&r=x","gravatar_id":"42c577edc388cc9d1050927da89d47cc","url":"https://api.github.com/users/eklypse","html_url":"https://github.com/eklypse","followers_url":"https://api.github.com/users/eklypse/followers","following_url":"https://api.github.com/users/eklypse/following{/other_user}","gists_url":"https://api.github.com/users/eklypse/gists{/gist_id}","starred_url":"https://api.github.com/users/eklypse/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/eklypse/subscriptions","organizations_url":"https://api.github.com/users/eklypse/orgs","repos_url":"https://api.github.com/users/eklypse/repos","events_url":"https://api.github.com/users/eklypse/events{/privacy}","received_events_url":"https://api.github.com/users/eklypse/received_events","type":"User","site_admin":false,"name":"Alessandro D. Gagliardi","company":"Glassdoor.com","blog":"twitter.com/MadDataScience","location":"San Francisco","email":null,"hireable":false,"bio":null,"public_repos":6,"public_gists":1,"followers":3,"following":21,"created_at":"2011-07-05T20:17:04Z","updated_at":"2014-03-05T23:50:05Z","private_gists":1,"total_private_repos":2,"owned_private_repos":0,"disk_usage":50178,"collaborators":0,"plan":{"name":"free","space":307200,"collaborators":0,"private_repos":0}}'

One of the reasons we like JSON is that it is easy to transform into a Python dict object using the json library:


In [5]:
import json
user = json.loads(r.content)
user


Out[5]:
{u'avatar_url': u'https://gravatar.com/avatar/42c577edc388cc9d1050927da89d47cc?d=https%3A%2F%2Fidenticons.github.com%2F8ecc0615aa5020371b8eb9e4538960e6.png&r=x',
 u'bio': None,
 u'blog': u'twitter.com/MadDataScience',
 u'collaborators': 0,
 u'company': u'Glassdoor.com',
 u'created_at': u'2011-07-05T20:17:04Z',
 u'disk_usage': 50178,
 u'email': None,
 u'events_url': u'https://api.github.com/users/eklypse/events{/privacy}',
 u'followers': 3,
 u'followers_url': u'https://api.github.com/users/eklypse/followers',
 u'following': 21,
 u'following_url': u'https://api.github.com/users/eklypse/following{/other_user}',
 u'gists_url': u'https://api.github.com/users/eklypse/gists{/gist_id}',
 u'gravatar_id': u'42c577edc388cc9d1050927da89d47cc',
 u'hireable': False,
 u'html_url': u'https://github.com/eklypse',
 u'id': 896607,
 u'location': u'San Francisco',
 u'login': u'eklypse',
 u'name': u'Alessandro D. Gagliardi',
 u'organizations_url': u'https://api.github.com/users/eklypse/orgs',
 u'owned_private_repos': 0,
 u'plan': {u'collaborators': 0,
  u'name': u'free',
  u'private_repos': 0,
  u'space': 307200},
 u'private_gists': 1,
 u'public_gists': 1,
 u'public_repos': 6,
 u'received_events_url': u'https://api.github.com/users/eklypse/received_events',
 u'repos_url': u'https://api.github.com/users/eklypse/repos',
 u'site_admin': False,
 u'starred_url': u'https://api.github.com/users/eklypse/starred{/owner}{/repo}',
 u'subscriptions_url': u'https://api.github.com/users/eklypse/subscriptions',
 u'total_private_repos': 2,
 u'type': u'User',
 u'updated_at': u'2014-03-05T23:50:05Z',
 u'url': u'https://api.github.com/users/eklypse'}

In [6]:
print user.keys()


[u'disk_usage', u'private_gists', u'public_repos', u'site_admin', u'subscriptions_url', u'gravatar_id', u'hireable', u'id', u'followers_url', u'following_url', u'collaborators', u'total_private_repos', u'blog', u'followers', u'location', u'type', u'email', u'bio', u'gists_url', u'owned_private_repos', u'company', u'events_url', u'html_url', u'updated_at', u'plan', u'received_events_url', u'starred_url', u'public_gists', u'name', u'organizations_url', u'url', u'created_at', u'avatar_url', u'repos_url', u'following', u'login']

We can access values in this dict directly (such as my hireable status) and even render the url of my avatar:


In [7]:
from IPython.display import HTML
print "Hireable: {}".format(user.get('hireable'))
HTML('<img src={} />'.format(user.get('avatar_url')))


Hireable: False
Out[7]:

Twitter API

Twitter has no less than 10 python libraries. We'll be using Python Twitter Tools because it's what's used in Mining the Social Web.

Some services (like Twitter) have released Python libraries of their own to make using their API even easier.


In [9]:
import twitter

auth = twitter.oauth.OAuth(credentials['OAUTH_TOKEN'], 
                           credentials['OAUTH_TOKEN_SECRET'],
                           credentials['CONSUMER_KEY'],
                           credentials['CONSUMER_SECRET'])

twitter_api = twitter.Twitter(auth=auth)

print twitter_api


<twitter.api.Twitter object at 0x1061fd590>

Using a library like this, we don't even need to specify the URL (that's handled internally).

Using a library like this, it's easy to do something like search for tweets mentioning #bigdata

The results are transformed into a Python object (which in this case is a thin wrapper around a dict)


In [10]:
bigdata = twitter_api.search.tweets(q='#bigdata', count=5)
type(bigdata)


Out[10]:
twitter.api.WrappedTwitterResponse

In [11]:
for status in bigdata['statuses']:
    print status.get('text')


Meet @BrotherOffice #OmniJoin at #CloudExpo NY! #Cloud #Storage #BigData #IoT #API #CloudComputing #InternetOfThings http://t.co/ZC4k0V88bl
RT @chrswng: watch @anildash @katecrawford @alondra @stevenhodas @shaminasingh on #bigdata at @nyulaw @data_society @WhiteHouse http://t.co…
RT @katecrawford: Ten points to Alessandro Acquisti for starting his talk with this: http://t.co/ngEMOu4rrt #BigData #Measure
Gamification comes to clinicians http://t.co/KsPk9mEiBk #gaming #bigdata #analytics
RT @CraigMilroy: #RBS chief #analytics officer calls for ‘UK financial services #cloud’ - http://t.co/yo6khUrBmq #bigdata

NoSQL

NoSQL databases are a new trend in databases

The name NoSQL refers to the lack of a relational structure between stored objects. Data are semi-structured.

Most importantly they attempt to minimize the need for JOIN operations, or solve other data needs

This is good for engineers but bad for data scientists.

Still, NoSQL databases have their uses.

What makes a NoSQL database?

  • Doesn't use SQL as query language
    • usually more primitive query langauge
    • sometimes key/value only
  • BASE rather than ACID
    • that is, sacrifices consistency for availability
  • Schemaless
    • that is, data need not conform to a predefined schema (i.e. semi-structured)

BASE vs ACID

  • ACID
    • Atomicity
    • Consistency
    • Isolation
    • Durability
  • BASE
    • Basically Available
    • Soft-state
    • Eventual consistency

CAP

  • Consistency
    • all nodes always give the same answer
  • Availability
    • nodes always answer queries and accept updates
  • Partition-tolerance
    • system continues working even if one or more nodes go down

CAP Theorem: Pick two

Eventual consistency

  • A key property of non-ACID systems
  • Means
    • if no further changes made,
    • eventually all nodes will be consistent
  • In itself eventual consistency is a very weak guarantee
    • when is "eventually"?
    • in practice it means the system can be inconsetent at any time
  • Stronger guarantees are sometimes made
    • with prediction and measuring, actual behavior can be quantified
    • in practice, systems often appear strongly consistent

NoSQL Examples

  • Memcached
  • Apache HBase
  • Cassandra
  • MongoDB
  • CouchDB
  • DynamoDB

Memcached was:

  • Developed by LiveJournal
  • Distributed key-value store (like a Python Dict)
  • Supports two very fast operations: get and set

Memcached is best used for storing application configuration settings, and essential ••caching•• those settings.

Cassandra was:

  • Developed by Facebook
  • Messages application and Inbox Search
  • Key-Value (ish)
    • Supports query by key or key range
  • Very fast writing speeds
  • Useful for record keeping, logging

Mongo was:

  • Developed by 10Gen (now MongoDB, Inc)
  • Document and Collection Structure
  • BSON (JSON-like) Storage system
  • Aggregation Framework

Mongo

What is MongoDB?

MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling.

Document Database

A record in MongoDB is a document, which is a data structure composed of field and value pairs. MongoDB documents are similar to JSON objects. The values of fields may include other documents, arrays, and arrays of documents.

A MongoDB document.

The advantages of using documents are:

  • Documents (i.e. objects) correspond to native data types in many programming language.
  • Embedded documents and arrays reduce need for expensive joins.
  • Dynamic schema supports fluent polymorphism.

Notice how similar this looks to a Python dictionary.

Let's get started:


In [ ]:
%%bash
mkdir -p data/db
mongod --dbpath data/db

Let's connect to our Mongo database management server (DBMS):


In [19]:
from pymongo import MongoClient
c = MongoClient()

Let's create a database called 'twitter':


In [20]:
db = c.twitter
Yes, it's that easy.

Now let's insert the results we pulled from Twitter and store them as a document in a collection named tweets:


In [21]:
collection = db.tweets

In [24]:
bigdata = twitter_api.search.tweets(q='#bigdata', count=10)
collection.insert(bigdata.get('statuses'))


Out[24]:
[ObjectId('532776be04295427bbc049c8'),
 ObjectId('532776be04295427bbc049c9'),
 ObjectId('532776be04295427bbc049ca'),
 ObjectId('532776be04295427bbc049cb'),
 ObjectId('532776be04295427bbc049cc'),
 ObjectId('532776be04295427bbc049cd'),
 ObjectId('532776be04295427bbc049ce'),
 ObjectId('532776be04295427bbc049cf'),
 ObjectId('532776be04295427bbc049d0'),
 ObjectId('532776be04295427bbc049d1')]

Notice that MongoDB returns with something called an ObjectId for each document we insert.

ObjectId is a 12-byte BSON type, constructed using:

  • a 4-byte value representing the seconds since the Unix epoch,
  • a 3-byte machine identifier,
  • a 2-byte process id, and
  • a 3-byte counter, starting with a random value.

In MongoDB, documents stored in a collection require a unique _id field that acts as a primary key. Because ObjectIds are small, most likely unique, and fast to generate, MongoDB uses ObjectIds as the default value for the _id field if the _id field is not specified.

Now let's test the contents of our database:


In [25]:
c.database_names()


Out[25]:
[u'local', u'test_database', u'twitter']

In [26]:
c.twitter.collection_names()


Out[26]:
[u'system.indexes', u'tweets']

In [27]:
c.twitter.tweets.find_one()


Out[27]:
{u'_id': ObjectId('52e5ee2062964a1140834b6e'),
 u'contributors': None,
 u'coordinates': None,
 u'created_at': u'Sun Jan 26 15:11:20 +0000 2014',
 u'entities': {u'hashtags': [{u'indices': [21, 51],
    u'text': u'MentionSomeoneImportantForYou'}],
  u'symbols': [],
  u'urls': [],
  u'user_mentions': [{u'id': 357831966,
    u'id_str': u'357831966',
    u'indices': [3, 19],
    u'name': u'Shanelly Mancera\u221e',
    u'screen_name': u'ShanellyMancera'}]},
 u'favorite_count': 0,
 u'favorited': False,
 u'geo': None,
 u'id': 427458742427193345L,
 u'id_str': u'427458742427193345',
 u'in_reply_to_screen_name': None,
 u'in_reply_to_status_id': None,
 u'in_reply_to_status_id_str': None,
 u'in_reply_to_user_id': None,
 u'in_reply_to_user_id_str': None,
 u'lang': u'de',
 u'metadata': {u'iso_language_code': u'de', u'result_type': u'recent'},
 u'place': None,
 u'retweet_count': 2,
 u'retweeted': False,
 u'retweeted_status': {u'contributors': None,
  u'coordinates': None,
  u'created_at': u'Thu Mar 28 01:40:23 +0000 2013',
  u'entities': {u'hashtags': [{u'indices': [0, 30],
     u'text': u'MentionSomeoneImportantForYou'}],
   u'symbols': [],
   u'urls': [],
   u'user_mentions': []},
  u'favorite_count': 1,
  u'favorited': False,
  u'geo': None,
  u'id': 317088747084726272L,
  u'id_str': u'317088747084726272',
  u'in_reply_to_screen_name': None,
  u'in_reply_to_status_id': None,
  u'in_reply_to_status_id_str': None,
  u'in_reply_to_user_id': None,
  u'in_reply_to_user_id_str': None,
  u'lang': u'de',
  u'metadata': {u'iso_language_code': u'de', u'result_type': u'recent'},
  u'place': None,
  u'retweet_count': 2,
  u'retweeted': False,
  u'source': u'<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry\xae</a>',
  u'text': u'#MentionSomeoneImportantForYou. @MaiiraAgamez*.*',
  u'truncated': False,
  u'user': {u'contributors_enabled': False,
   u'created_at': u'Thu Aug 18 23:54:04 +0000 2011',
   u'default_profile': False,
   u'default_profile_image': False,
   u'description': u'I find your lips so kissable\u2661.',
   u'entities': {u'description': {u'urls': []}},
   u'favourites_count': 123,
   u'follow_request_sent': False,
   u'followers_count': 107,
   u'following': False,
   u'friends_count': 66,
   u'geo_enabled': False,
   u'id': 357831966,
   u'id_str': u'357831966',
   u'is_translator': False,
   u'lang': u'es',
   u'listed_count': 2,
   u'location': u'Colombia ',
   u'name': u'Shanelly Mancera\u221e',
   u'notifications': False,
   u'profile_background_color': u'FFFFFF',
   u'profile_background_image_url': u'http://a0.twimg.com/profile_background_images/378800000116854775/2a48a1dc6b3d112b1ec0d8d79b524183.png',
   u'profile_background_image_url_https': u'https://si0.twimg.com/profile_background_images/378800000116854775/2a48a1dc6b3d112b1ec0d8d79b524183.png',
   u'profile_background_tile': True,
   u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/357831966/1389731941',
   u'profile_image_url': u'http://pbs.twimg.com/profile_images/413890256362954753/HruWzpvE_normal.jpeg',
   u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/413890256362954753/HruWzpvE_normal.jpeg',
   u'profile_link_color': u'7925F7',
   u'profile_sidebar_border_color': u'FFFFFF',
   u'profile_sidebar_fill_color': u'DDEEF6',
   u'profile_text_color': u'333333',
   u'profile_use_background_image': True,
   u'protected': False,
   u'screen_name': u'ShanellyMancera',
   u'statuses_count': 6765,
   u'time_zone': None,
   u'url': None,
   u'utc_offset': None,
   u'verified': False}},
 u'source': u'web',
 u'text': u'RT @ShanellyMancera: #MentionSomeoneImportantForYou. @MaiiraAgamez*.*',
 u'truncated': False,
 u'user': {u'contributors_enabled': False,
  u'created_at': u'Sat Jun 18 18:35:09 +0000 2011',
  u'default_profile': False,
  u'default_profile_image': False,
  u'description': u'29-10-13 MEJOR DIA DE MI VIDA LO VI A EL MEJOR IDOLO',
  u'entities': {u'description': {u'urls': []}},
  u'favourites_count': 342,
  u'follow_request_sent': False,
  u'followers_count': 464,
  u'following': False,
  u'friends_count': 349,
  u'geo_enabled': True,
  u'id': 319778294,
  u'id_str': u'319778294',
  u'is_translator': False,
  u'lang': u'es',
  u'listed_count': 1,
  u'location': u'',
  u'name': u'HERE FOR JUSTIN',
  u'notifications': False,
  u'profile_background_color': u'FF6699',
  u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme11/bg.gif',
  u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme11/bg.gif',
  u'profile_background_tile': True,
  u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/319778294/1386618466',
  u'profile_image_url': u'http://pbs.twimg.com/profile_images/427388457049407489/NyDqsaxn_normal.jpeg',
  u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/427388457049407489/NyDqsaxn_normal.jpeg',
  u'profile_link_color': u'B40B43',
  u'profile_sidebar_border_color': u'CC3366',
  u'profile_sidebar_fill_color': u'E5507E',
  u'profile_text_color': u'362720',
  u'profile_use_background_image': True,
  u'protected': False,
  u'screen_name': u'Mayra__S',
  u'statuses_count': 22149,
  u'time_zone': u'Central Time (US & Canada)',
  u'url': None,
  u'utc_offset': -21600,
  u'verified': False}}

Notice the _id included in the document along with the values we already saw before.

Now that we have our data in MongoDB, we can use some of it's search functionality. For example:


In [59]:
popular_tweets = collection.find({'retweet_count': {"$gte": 3}})
popular_tweets.count()


Out[59]:
6

Using ObjectIds for the _id field provides the following additional benefits:

  • you can access the insertion time of the ObjectId, using the .generation_time property in pymongo.
  • sorting on an _id field that stores ObjectId values is roughly equivalent to sorting by insertion time.

Lab

Basic:

  1. Twitter
  2. Mongo
  3. Twitter + Mongo
    • insert search_results['statuses'] from Exercise 5 into a new Mongo database named tweets
    • find_one where retweet_count is greater than or equal to "$gte": 3

Advanced:

  1. Finish: http://tinyurl.com/GAtwitterlab
  2. Use the Aggregation Framework in Mongo: http://docs.mongodb.org/manual/aggregation/

Next week:

Monday: Structured Data

  • Relational Theory and SQL
  • Data Frames

Wednesday: Exploring Data

  • Visualization and Summary Statistics
  • R & ggplot2