Speaking broadly:
An application programming interface (API) specifies how some software components should interact with each other.
More specifically:
A web API is a programmatic interface to a defined request-response message system, typically expressed in JSON or XML, which is exposed via the web—most commonly by means of an HTTP-based web server.
from Wikipedia
Web APIs allow people to interact with the structures of an application to:
Best practices for web APIs are to use RESTful principles.
REST = REpresentational State Transfer
REST vs. SQL
GET ( ~ SELECT)
POST ( ~ INSERT)
PUT ( ~ UPDATE)
DELETE ( ~ DELETE)
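As a rough illustration of how the four verbs act on a collection of resources, here is a minimal sketch. The NoteStore class and its handle method are hypothetical names, an in-memory dict stands in for the application's storage, and the sketch assumes the common convention in which POST creates a new resource and PUT replaces an existing one.

```python
import itertools


class NoteStore(object):
    """Hypothetical in-memory stand-in for a resource collection such as /notes."""

    def __init__(self):
        self._rows = {}
        self._ids = itertools.count(1)

    def handle(self, method, note_id=None, body=None):
        """Return a (status_code, payload) pair for one request."""
        if method == "GET":       # read an existing resource
            if note_id in self._rows:
                return 200, self._rows[note_id]
            return 404, None
        if method == "POST":      # create a new resource, server picks the id
            new_id = next(self._ids)
            self._rows[new_id] = body
            return 201, {"id": new_id}
        if method == "PUT":       # replace an existing resource (this sketch does not create)
            if note_id not in self._rows:
                return 404, None
            self._rows[note_id] = body
            return 200, body
        if method == "DELETE":    # remove a resource
            if note_id in self._rows:
                del self._rows[note_id]
                return 204, None
            return 404, None
        return 405, None          # method not allowed


store = NoteStore()
created_status, created = store.handle("POST", body={"text": "hello"})
read_status, read_body = store.handle("GET", created["id"])
update_status, _ = store.handle("PUT", created["id"], {"text": "hello again"})
delete_status, _ = store.handle("DELETE", created["id"])
missing_status, _ = store.handle("GET", created["id"])
```

After the DELETE, a second GET for the same id reports 404, which mirrors how a SELECT would find nothing after the row is deleted.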
The requests library
First, we will load our credentials, which we keep in a YAML file for safekeeping.
In [1]:
import yaml
# safe_load refuses arbitrary YAML tags; the with-block closes the file
with open('/Users/alessandro.gagliardi/api_cred.yml') as f:
    credentials = yaml.safe_load(f)
Then we pass those credentials to a GET request using the requests library. In this case, I am querying my own user data from GitHub:
In [3]:
import requests
r = requests.get('https://api.github.com/user',
                 auth=(credentials['USER'], credentials['PASS']))
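To make the auth argument less magical, here is a minimal sketch of what passing a (user, password) tuple amounts to under HTTP Basic authentication: the two values are joined with a colon, base64-encoded, and sent in an Authorization header. The helper name basic_auth_header and the sample credentials are hypothetical, and nothing is sent over the network.

```python
import base64


def basic_auth_header(user, password):
    """Build the Authorization header used by HTTP Basic authentication."""
    pair = "{}:{}".format(user, password).encode("utf-8")
    token = base64.b64encode(pair).decode("ascii")
    return {"Authorization": "Basic " + token}


# Placeholder credentials, for illustration only.
headers = basic_auth_header("user", "pass")
print(headers["Authorization"])  # Basic dXNlcjpwYXNz
```

Base64 is an encoding, not encryption, so credentials sent this way are only as private as the connection carrying them.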
Requests gives us an object from which we can read its content.
In [4]:
r.content
Out[4]:
In [5]:
import json
user = json.loads(r.content)
user
Out[5]:
In [6]:
print(user.keys())
We can access values in this dict directly (such as my hireable status) and even render the image at my avatar URL:
In [7]:
from IPython.display import HTML
print("Hireable: {}".format(user.get('hireable')))
HTML('<img src="{}" />'.format(user.get('avatar_url')))
Out[7]:
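Since the cell outputs are not reproduced here, the following sketch shows the same parse-then-look-up pattern on a small made-up payload. The field values are invented for illustration and are not real GitHub output; only the key names hireable and avatar_url come from the text above.

```python
import json

# A made-up response body, shaped loosely like a user record.
raw = b'{"login": "example-user", "hireable": true, "avatar_url": "https://example.com/avatar.png"}'

# The body arrives as bytes, so decode it before parsing.
sample_user = json.loads(raw.decode("utf-8"))

hireable = sample_user.get("hireable")      # JSON true becomes Python True
company = sample_user.get("company")        # missing key: .get returns None
sorted_keys = sorted(sample_user.keys())
```

Using .get rather than square brackets means a missing field yields None instead of raising KeyError, which is convenient when an API omits optional fields.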
Twitter has no fewer than 10 Python libraries. We'll be using Python Twitter Tools because it's what's used in Mining the Social Web.
In [9]:
import twitter
auth = twitter.oauth.OAuth(credentials['OAUTH_TOKEN'],
                           credentials['OAUTH_TOKEN_SECRET'],
                           credentials['CONSUMER_KEY'],
                           credentials['CONSUMER_SECRET'])
twitter_api = twitter.Twitter(auth=auth)
print(twitter_api)
Using a library like this, it's easy to do something like search for tweets mentioning #bigdata.
The results are transformed into a Python object (which in this case is a thin wrapper around a dict).
In [10]:
bigdata = twitter_api.search.tweets(q='#bigdata', count=5)
type(bigdata)
Out[10]:
In [11]:
for status in bigdata['statuses']:
    print(status.get('text'))
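To illustrate what "a thin wrapper around a dict" can mean, here is a minimal sketch. ApiResponse is a hypothetical class, not the library's actual type, and the sample statuses and header value are invented; the point is only that a dict subclass can carry extra metadata while still supporting ordinary dict access.

```python
class ApiResponse(dict):
    """Hypothetical thin wrapper: a dict that also remembers response headers."""

    def __init__(self, data, headers=None):
        super(ApiResponse, self).__init__(data)
        self.headers = headers or {}


# Invented sample data shaped like a search result.
response = ApiResponse(
    {"statuses": [{"text": "first"}, {"text": "second"}, {}]},
    headers={"x-rate-limit-remaining": "179"},
)

# Ordinary dict access still works on the wrapper.
texts = [status.get("text") for status in response["statuses"]]
```

The third status has no text field, so .get returns None for it rather than raising an error.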
NoSQL databases are a new trend in databases
The name NoSQL refers to the lack of a relational structure between stored objects. Data are semi-structured.
Most importantly, they attempt to minimize the need for JOIN operations, or to solve other data needs.
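A small sketch may help show what avoiding a JOIN looks like in practice. The users and tweets below are invented sample data. In the relational layout, answering "who wrote what" requires matching rows across two tables; in the document layout, each tweet embeds its author, so one lookup suffices.

```python
# Relational layout: two "tables" linked by user_id.
users = [{"user_id": 1, "name": "ada"}, {"user_id": 2, "name": "grace"}]
tweets = [
    {"user_id": 1, "text": "hello"},
    {"user_id": 2, "text": "hi"},
    {"user_id": 1, "text": "again"},
]

# The JOIN: look up each tweet's author in the users table.
names_by_id = {u["user_id"]: u["name"] for u in users}
joined = [(names_by_id[t["user_id"]], t["text"]) for t in tweets]

# Document layout: the author is embedded in every tweet document.
documents = [
    {"text": t["text"], "user": {"name": names_by_id[t["user_id"]]}}
    for t in tweets
]
embedded = [(d["user"]["name"], d["text"]) for d in documents]
```

Both layouts yield the same pairs, but the document layout stores the name "ada" twice. That duplication is the price paid for skipping the join.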
This is good for engineers but bad for data scientists.
Still, NoSQL databases have their uses.
Memcached was:
Memcached is best used for storing application configuration settings, essentially caching those settings.
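The caching idea can be sketched without a Memcached server at all. In the example below a plain dict stands in for the cache, and load_settings_from_disk is a hypothetical slow loader; a real deployment would replace the dict with calls to a Memcached client.

```python
cache = {}        # stand-in for the key-value cache
load_calls = []   # records how often the slow path runs


def load_settings_from_disk(name):
    """Hypothetical slow loader; here it just fabricates a settings dict."""
    load_calls.append(name)
    return {"name": name, "debug": False}


def get_settings(name):
    """Return cached settings, loading them only on a cache miss."""
    if name not in cache:
        cache[name] = load_settings_from_disk(name)
    return cache[name]


first = get_settings("webapp")    # miss: loads and stores
second = get_settings("webapp")   # hit: served from the cache
```

The second call never touches the loader, which is the whole benefit when the slow path is a disk read or a database query.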
Cassandra was:
Mongo was:
A record in MongoDB is a document, which is a data structure composed of field and value pairs. MongoDB documents are similar to JSON objects. The values of fields may include other documents, arrays, and arrays of documents.
A MongoDB document.
The advantages of using documents are:
Notice how similar this looks to a Python dictionary.
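To make the comparison concrete, here is a sketch of a document written as a Python dictionary. The field names and values are invented for illustration; the structure shows the three cases mentioned above: a nested document, an array, and an array of documents.

```python
import json

# An invented tweet-like document.
document = {
    "text": "Learning about #bigdata",
    "retweet_count": 4,
    "user": {"screen_name": "example"},           # nested document
    "hashtags": ["bigdata"],                      # array
    "mentions": [{"screen_name": "a"},            # array of documents
                 {"screen_name": "b"}],
}

# Nested values are reached by chaining lookups.
author = document["user"]["screen_name"]
mentioned = [m["screen_name"] for m in document["mentions"]]

# The same structure survives a round trip through JSON text.
round_tripped = json.loads(json.dumps(document))
```

The JSON round trip works here because every value is a string, number, list, or dict; types with no JSON equivalent would need special handling.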
In [ ]:
%%bash
mkdir -p data/db
mongod --dbpath data/db
In [19]:
from pymongo import MongoClient
c = MongoClient()
In [20]:
db = c.twitter
In [21]:
collection = db.tweets
In [24]:
bigdata = twitter_api.search.tweets(q='#bigdata', count=10)
# insert_many replaces the older insert(); .inserted_ids lists the new ObjectIds
collection.insert_many(bigdata.get('statuses')).inserted_ids
Out[24]:
Notice that MongoDB returns something called an ObjectId for each document we insert.
ObjectId is a 12-byte BSON type, constructed using:
In MongoDB, documents stored in a collection require a unique _id field that acts as a primary key. Because ObjectIds are small, most likely unique, and fast to generate, MongoDB uses ObjectIds as the default value for the _id field if the _id field is not specified.
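One property of ObjectIds can be demonstrated with the standard library alone: the first four of the twelve bytes hold the creation time as big-endian seconds since the Unix epoch. The sketch below decodes that prefix from a hex string. The helper name objectid_seconds and the two sample ids are arbitrary illustrations, not ids from this database.

```python
import binascii
import datetime
import struct


def objectid_seconds(hex_id):
    """Return the creation time (seconds since the epoch) encoded in an ObjectId."""
    raw = binascii.unhexlify(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes")
    (seconds,) = struct.unpack(">I", raw[:4])   # big-endian unsigned 32-bit int
    return seconds


# Arbitrary sample ids; the second has a later timestamp prefix.
earlier_id = "507f1f77bcf86cd799439011"
later_id = "507f1f78bcf86cd799439011"

seconds = objectid_seconds(earlier_id)
created = datetime.datetime.fromtimestamp(seconds, datetime.timezone.utc)

# Because the timestamp comes first, sorting ids also sorts by creation time.
ordered = sorted([later_id, earlier_id])
```

Because the timestamp leads the id, ordering by id and ordering by creation time agree to within one-second resolution.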
In [25]:
c.list_database_names()
Out[25]:
In [26]:
c.twitter.list_collection_names()
Out[26]:
In [27]:
c.twitter.tweets.find_one()
Out[27]:
Notice the _id included in the document along with the values we already saw before.
Now that we have our data in MongoDB, we can use some of its search functionality. For example:
In [59]:
query = {'retweet_count': {"$gte": 3}}
popular_tweets = collection.find(query)
collection.count_documents(query)
Out[59]:
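For readers new to query operators, the sketch below imitates what a $gte filter selects, using plain Python on a list of dicts. The matches function is a hypothetical illustration that handles only equality and $gte; it is not how MongoDB evaluates queries, and the sample tweets are invented.

```python
def matches(document, query):
    """Return True if the document satisfies every field condition in the query."""
    for field, condition in query.items():
        value = document.get(field)
        if isinstance(condition, dict) and "$gte" in condition:
            # Operator form: the field must exist and be >= the threshold.
            if value is None or not value >= condition["$gte"]:
                return False
        elif value != condition:
            # Plain form: the field must equal the given value.
            return False
    return True


# Invented sample data; the last tweet has no retweet_count at all.
sample_tweets = [
    {"text": "a", "retweet_count": 0},
    {"text": "b", "retweet_count": 3},
    {"text": "c", "retweet_count": 7},
    {"text": "d"},
]

popular = [t for t in sample_tweets if matches(t, {"retweet_count": {"$gte": 3}})]
popular_texts = [t["text"] for t in popular]
```

The threshold is inclusive, so the tweet with exactly three retweets is selected, while the one lacking the field is not.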
Using ObjectIds for the _id field provides the following additional benefits:
The creation time of an ObjectId can be read through the .generation_time property in pymongo.
Sorting on an _id field that stores ObjectId values is roughly equivalent to sorting by insertion time.
pip install pymongo
Insert search_results['statuses'] from Exercise 5 into a new Mongo database named tweets
Retrieve one document with find_one
Find the tweets where retweet_count is greater than or equal to 3 ("$gte": 3)