
Introduction to MongoDB with PyMongo and NOAA Data

This notebook provides a basic walkthrough of how to use MongoDB and is based on a tutorial originally by Alberto Negron.

What is MongoDB?

MongoDB is a cross-platform, document-oriented NoSQL database. Rather than using the traditional table-based relational structure, MongoDB stores JSON-like documents (in a binary format called BSON) with dynamic schemas, making data integration easier and faster for certain types of applications.

Features

Some of the features include:

Document orientation: Instead of taking a business subject and breaking it up into multiple relational structures, MongoDB can store the business subject in a minimal number of documents.

Ad hoc queries: MongoDB supports field queries, range queries, and regular-expression searches. Queries can return specific fields of documents and can also include user-defined JavaScript functions.

Indexing: Any field in a MongoDB document can be indexed, including fields within arrays and embedded documents. Primary and secondary indices are available.

Aggregation: Aggregation operators can be strung together to form a pipeline, analogous to Unix pipes.

When it makes sense to use MongoDB

Metadata records are frequently stored as JSON, and almost anything you get from an API will be JSON. For example, check out the metadata records for the National Oceanic and Atmospheric Administration (NOAA).

MongoDB is a great tool to use with JSON data because it stores structured data as JSON-like documents, using dynamic rather than predefined schemas.

In MongoDB, an element of data is called a document, and documents are stored in collections. One collection may have any number of documents. Collections are a bit like tables in a relational database, and documents are like records. But there is one big difference: every record in a table has the same fields (with, usually, differing values) in the same order, while each document in a collection can have completely different fields from the other documents.

Documents are Python dictionaries that can have strings as keys and can contain various primitive types (int, float, str, datetime) as well as other documents (Python dicts) and arrays (Python lists).
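
For instance, a single document might mix primitive types, an embedded document, and an array (an illustrative sketch, not a real NOAA record):


In [ ]:
doc = {
    "title": "Sea Surface Temperature",     # str
    "year": 2016,                           # int
    "keywords": ["oceans", "temperature"],  # array -> Python list
    "distribution": {"format": "netCDF"},   # embedded document -> Python dict
}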

Getting started

First we need to import json and pymongo.

Note that the pprint module provides a capability to "pretty-print" arbitrary Python data structures in a form that can be used as input to the interpreter. This is particularly helpful with JSON. You can read more about pprint in the Python documentation.


In [ ]:
import json
import pymongo
from pprint import pprint

Connect

Just as with the relational database example with sqlite, we need to begin by setting up a connection. With MongoDB, we will be using pymongo, though MongoDB also comes with a console shell that uses JavaScript.

To make our connection, we will use the PyMongo method MongoClient:


In [ ]:
conn = pymongo.MongoClient()

Create and access a database

MongoDB creates databases and collections automatically for you if they don't already exist. A single instance of MongoDB can support multiple independent databases. When working with PyMongo, we access databases using attribute-style access, just as we did with sqlite:


In [ ]:
db = conn.mydb

In [ ]:
conn.list_database_names()

Collections

A collection is a group of documents stored in MongoDB, and can be thought of as roughly the equivalent of a table in a relational database. Getting a collection in PyMongo works the same as getting a database:


In [ ]:
collection = db.my_collection

In [ ]:
db.list_collection_names()

Insert data

To insert some data into MongoDB, all we need to do is create a dict and call insert_one on the collection object:


In [ ]:
doc = {"class":"xbus-502","date":"03-05-2016","instructor":"bengfort","classroom":"C222","roster_count":"25"}
collection.insert_one(doc)

You can put anything in:


In [ ]:
doc = {"class":"xbus-502","date":"03-05-2016","teaching_assistant":"bilbro", "sauce": "awesome"}
collection.insert_one(doc)
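
If you have several documents ready at once, insert_many accepts a list and inserts them in a single call (a quick sketch with made-up class records):


In [ ]:
docs = [
    {"class": "xbus-502", "date": "03-12-2016"},
    {"class": "xbus-502", "date": "03-19-2016"},
]
collection.insert_many(docs)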

A practical example

At my job I have been working on a project to help make Commerce datasets easier to find. One of the barriers to searching for records is when the keywords return either too many or too few results. It can also be a problem if the keywords are too technical for lay users.

One solution is to use topic modeling to extract latent themes from the metadata records and then probabilistically assign each record a more sensible set of keywords based on its proximity (via k-means) to the topics.

In order to get started, first I had to gather up a bunch of JSON metadata records and store them for analysis and modeling. Here's what I did:

import requests

NOAA_URL = "https://data.noaa.gov/data.json"

def load_data(URL):
    """
    Loads the data from URL and returns data in JSON format.
    """
    r = requests.get(URL)
    data = r.json()
    return data

noaa = load_data(NOAA_URL)

But...this kinda takes a long time, so I've created a file for you that contains a small chunk of the records to use for today's workshop.


In [ ]:
with open("data_sample.json") as data_file:    
    noaa = json.load(data_file)

In [ ]:
len(noaa)

Checking out the data

Now let's print out just one record to examine the structure.


In [ ]:
pprint(noaa[0])

Or say we wanted just the "description" field:


In [ ]:
pprint(noaa[0]['description'])

Define the database

We will want to enter these records into our database. But first, we'll define a specific database for the NOAA records:


In [ ]:
db = conn.earthwindfire

Define the collection

Next we define the collection where we'll insert the NOAA metadata records:


In [ ]:
records = db.records

Insert data

Then we loop through each record in the NOAA dataset and insert just the target information for each into the collection.


In [ ]:
# What data fields seem important to you? Add them below following the examples:

def insert(metadata):
    for dataset in metadata:
        data = {}
        data["title"] = dataset["title"]
        data["description"] = dataset["description"]
        data["keywords"] = dataset["keyword"]
        data["accessLevel"] = dataset["accessLevel"]
        data["lang"] = dataset["language"]
        # choose your own
        # choose your own
        # choose your own 
        # choose your own

        records.insert_one(data)

insert(noaa)

In [ ]:
# Check to make sure they're all in there
records.count_documents({})

Querying

Querying with .find_one( )

The find_one() method selects a single document from a collection and returns it (or None if there are no matches). It is useful when you know there is only one matching document, or when you are only interested in the first match.


In [ ]:
records.find_one()
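
find_one() also accepts a query document, so we can ask for the first record that matches a condition (here using the accessLevel field we inserted above):


In [ ]:
records.find_one({"accessLevel": "public"})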

Querying with .find( )

To get more than a single document as the result of a query we use the find() method. find() returns a Cursor instance, which allows us to iterate over all matching documents.

records.find()

For example, we can iterate over the first two documents in the records collection (there are a lot in the collection, and this is just an example):


In [ ]:
for rec in records.find().limit(2):
    pprint(rec)

Searching

MongoDB queries are represented as JSON-like structures, just like documents. To build a query, you just need to specify a dictionary with the properties you want the results to match. For example, let's say we were just interested in publicly available satellite data from NESDIS.

This query will match all documents in the records collection whose keywords array contains "NESDIS".


In [ ]:
records.find({"keywords": "NESDIS"}).count()

1117 is probably more than we want to print out in a Jupyter Notebook...

We can further narrow our search by adding additional fields. Note that a Python dict can't hold the "keywords" key twice, so to require multiple values in the same array field we use MongoDB's $all operator:


In [ ]:
records.find({"keywords": "NESDIS","keywords": "Russia","accessLevel":"public"}).count()

Since there are only two, let's check them out:


In [ ]:
for r in records.find({"keywords": {"$all": ["NESDIS", "Russia"]}, "accessLevel": "public"}):
    pprint(r)

If you already know SQL...

The following table provides an overview of common SQL aggregation terms, functions, and concepts and the corresponding MongoDB aggregation operators:

SQL Terms, Functions, and Concepts    MongoDB Aggregation Operators
WHERE                                 $match
GROUP BY                              $group
HAVING                                $match
SELECT                                $project
ORDER BY                              $sort
LIMIT                                 $limit
SUM()                                 $sum
COUNT()                               $sum
JOIN                                  $lookup

But...thanks to MongoDB's nested data structures, we can also do a lot of things we can't do in a relational database.

Length

Let's look for some entries that have way too many keywords:


In [ ]:
cursor = db.records.find({"$where": "this.keywords.length > 100"}).limit(2);
for rec in cursor:
    pprint(rec)
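
If you'd rather not run JavaScript on the server, the same length check can be written with the aggregation $size operator via $expr (a sketch; $expr requires MongoDB 3.6 or later):


In [ ]:
cursor = db.records.find({"$expr": {"$gt": [{"$size": "$keywords"}, 100]}}).limit(2)
for rec in cursor:
    pprint(rec)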

Full text search with a text index

One of the things that makes MongoDB special is that it enables us to create search indexes. Indexes provide high performance read operations for frequently used queries.

In particular, a text index will enable us to search for string content in a collection. Keep in mind that a collection can have at most one text index.

We will create a text index on the description field so that we can search inside our NOAA records text:


In [ ]:
db.records.create_index([('description', 'text')])
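
To confirm the index was built, index_information() lists every index on the collection along with its specification:


In [ ]:
pprint(db.records.index_information())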

To test our newly created text index on the description field, we will search documents using the $text operator. Let's start by looking for all the documents that have the word 'precipitation' in their description field.


In [ ]:
cursor = db.records.find({'$text': {'$search': 'precipitation'}})
for rec in cursor:
    print(rec)

In [ ]:
db.records.count_documents({'$text': {'$search': 'fire'}})
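
Text searches also compute a relevance score, which we can project and sort on with the $meta operator (a sketch that keeps just the top three matches):


In [ ]:
cursor = db.records.find(
    {'$text': {'$search': 'fire'}},
    {'title': 1, '_id': 0, 'score': {'$meta': 'textScore'}}
).sort([('score', {'$meta': 'textScore'})]).limit(3)
for rec in cursor:
    pprint(rec)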

Recall that a collection can have at most one text index, so if we want to create a new one, we must first drop the existing index:


In [ ]:
db.records.drop_index("description_text")

We can also create a wildcard text index for scenarios where we want every text field in the records to be searchable. In such scenarios you can index all the string fields of your document using the $** wildcard specifier.

Creating the index goes something like this:


In [ ]:
db.records.create_index([("$**","text")])

In [ ]:
cursor = db.records.find({'$text': {'$search': "Russia"}})
for rec in cursor:
    pprint(rec)

Projections

Projections allow you to pass along only the specified fields of the matching documents. The specified fields can be existing fields from the input documents or newly computed fields.

For example, let's redo our full-text Russia search, but project just the titles of the records:


In [ ]:
cursor = db.records.find({'$text': {'$search': "Russia"}}, {"title": 1,"_id":0 })
for rec in cursor:
    print(rec)

Limit

.limit() passes the first n documents unmodified to the caller, where n is the specified limit. For each input document, this method outputs either one document (for the first n documents) or zero documents (after the first n documents).


In [ ]:
cursor = db.records.find({'$text': {'$search': "Russia"}}, {"title": 1,"_id":0 }).limit(2)
for rec in cursor:
    print(rec)
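
.skip() is the natural companion to .limit(): together they let you page through a result set. For example, to fetch the next two titles after the first two (a sketch):


In [ ]:
cursor = db.records.find({'$text': {'$search': "Russia"}}, {"title": 1, "_id": 0}).skip(2).limit(2)
for rec in cursor:
    print(rec)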

Aggregate

MongoDB can perform aggregation operations with .aggregate(), such as grouping by a specified key and evaluating a total or a count for each distinct group.

Use the $group stage to group by a specified key using the _id field. $group accesses fields by the field path, which is the field name prefixed by a dollar sign.

For example, we can use $group to count the NOAA records by language:


In [ ]:
cursor = db.records.aggregate(
    [
        {"$group": {"_id": "$lang", "count": {"$sum": 1}}}
    ]
)
for document in cursor:
    pprint(document)
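
Pipeline stages can be chained, so we can order those language counts by appending a $sort stage (a small sketch):


In [ ]:
cursor = db.records.aggregate(
    [
        {"$group": {"_id": "$lang", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}}
    ]
)
for document in cursor:
    pprint(document)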

Or we can combine $match and $group to aggregate the titles of just the public access records that match the word 'Russia':


In [ ]:
cursor = db.records.aggregate(
    [
        {"$match": {'$text': {'$search': "Russia"}, "accessLevel": "public"}},
        {"$group": {"_id": "$title"}}
    ]
)

for document in cursor:
    pprint(document)

The aggregation pipeline

The aggregation pipeline allows MongoDB to provide native aggregation capabilities that correspond to many common data aggregation operations in SQL. Here's where you put the pieces together to get results that you can begin to analyze and use for machine learning.

Here's an example of an aggregation pipeline:


In [1]:
from IPython.display import Image
Image(filename='images/mongodb_pipeline.png', width=600, height=300)


Out[1]:
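
To make the diagram concrete, here is a sketch of a multi-stage pipeline over our records collection (it assumes the fields we inserted earlier): unwind the keywords array, count the records per keyword, sort descending, and keep the top five:


In [ ]:
cursor = db.records.aggregate(
    [
        {"$match": {"accessLevel": "public"}},
        {"$unwind": "$keywords"},
        {"$group": {"_id": "$keywords", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
        {"$limit": 5}
    ]
)
for document in cursor:
    pprint(document)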

Removing data

It's easy (almost too easy) to delete documents, collections, and databases in MongoDB. Before we get rid of anything, let's determine what collections we have in our database:


In [ ]:
conn.earthwindfire.list_collection_names()
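
Individual documents can also be removed without dropping anything: delete_one and delete_many take the same kind of query document as find (a sketch with an illustrative filter; run it before dropping the collection below):


In [ ]:
result = db.records.delete_many({"accessLevel": {"$ne": "public"}})
result.deleted_count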

Now let's delete our records collection and check again to see what collections are in our database:


In [ ]:
conn.earthwindfire.drop_collection("records")
conn.earthwindfire.list_collection_names()

We can also just drop a database. First let's determine what databases we have:


In [ ]:
conn.list_database_names()

Now let's remove the earthwindfire database:


In [ ]:
conn.drop_database("earthwindfire")
conn.list_database_names()

Nice work!

Miscellaneous

Statistics

The dbStats command returns statistics that reflect the usage of a single database:


In [ ]:
db = conn.mydb
collection = db.my_collection
db.command({'dbstats': 1})

collStats returns a variety of storage statistics for a given collection. Let's try it out on my_collection:


In [ ]:
db.command({'collstats': 'my_collection', 'verbose': True})