This notebook provides a basic walkthrough of how to use MongoDB and is based on a tutorial originally by Alberto Negron.
MongoDB is a cross-platform, document-oriented NoSQL database. Rather than the traditional table-based relational structure, MongoDB stores JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making data integration easier and faster for certain types of applications.
Some of the features include:
- **Document orientation** Instead of taking a business subject and breaking it up into multiple relational structures, MongoDB can store the business subject in a minimal number of documents.
- **Ad hoc queries** MongoDB supports field, range, and regular-expression queries. Queries can return specific fields of documents and can also include user-defined JavaScript functions.
- **Indexing** Any field in a MongoDB document can be indexed, including fields within arrays and embedded documents. Primary and secondary indices are available.
- **Aggregation** Aggregation operators can be strung together to form a pipeline, analogous to Unix pipes.
Metadata records are frequently stored as JSON, and almost anything you get from an API will be JSON. For example, check out the metadata records for the National Oceanic and Atmospheric Administration.
MongoDB is a great tool to use with JSON data because it stores structured data as JSON-like documents, using dynamic rather than predefined schemas.
In MongoDB, an element of data is called a document, and documents are stored in collections. One collection may have any number of documents. Collections are a bit like tables in a relational database, and documents are like records. But there is one big difference: every record in a table has the same fields (with, usually, differing values) in the same order, while each document in a collection can have completely different fields from the other documents.
Documents are Python dictionaries that can have strings as keys and can contain various primitive types (int, float,unicode, datetime) as well as other documents (Python dicts) and arrays (Python lists).
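For example, a document might look like the Python dict below (purely illustrative; the values are made up and nothing is stored yet):
In [ ]:
from datetime import datetime

# A sample document: primitive types, an embedded document (dict), and an array (list)
course = {
    "class": "xbus-502",                   # string
    "roster_count": 25,                    # int
    "start": datetime(2016, 3, 5, 9, 0),   # datetime
    "instructor": {                        # embedded document
        "name": "bengfort",
        "role": "instructor",
    },
    "topics": ["mongodb", "nosql", "json"] # array
}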
First we need to import `json` and `pymongo`. Note that the `pprint` module provides a capability to "pretty-print" arbitrary Python data structures in a form which can be used as input to the interpreter. This is particularly helpful with JSON. You can read more about `pprint` here.
In [ ]:
import json
import pymongo
from pprint import pprint
Just as with the relational database example with `sqlite`, we need to begin by setting up a connection. With MongoDB, we will be using `pymongo`, though MongoDB also comes with a console API that uses JavaScript.
To make our connection, we will use the PyMongo method `MongoClient`:
In [ ]:
conn=pymongo.MongoClient()
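By default, MongoClient connects to a MongoDB server running on localhost at port 27017. If your server is somewhere else, you can pass the host and port (or a full connection URI) explicitly; the values below are just placeholders for the defaults:
In [ ]:
# These are equivalent to the default connection above
conn = pymongo.MongoClient("localhost", 27017)
conn = pymongo.MongoClient("mongodb://localhost:27017/")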
MongoDB creates databases and collections automatically for you if they don't already exist. A single instance of MongoDB can support multiple independent databases. When working with PyMongo, we access databases using attribute-style access, just like we did with `sqlite`:
In [ ]:
db = conn.mydb
In [ ]:
conn.database_names()
In [ ]:
collection = db.my_collection
In [ ]:
db.collection_names()
In [ ]:
doc = {"class":"xbus-502","date":"03-05-2016","instructor":"bengfort","classroom":"C222","roster_count":"25"}
collection.insert_one(doc)
You can put anything in:
In [ ]:
doc = {"class":"xbus-502","date":"03-05-2016","teaching_assistant":"bilbro", "sauce": "awesome"}
collection.insert_one(doc)
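Note that insert_one() returns an InsertOneResult, and MongoDB adds an _id field automatically if the document doesn't already have one. As a quick sketch (the document here is just a throwaway example):
In [ ]:
result = collection.insert_one({"class": "xbus-502", "note": "scratch document"})
# The automatically generated ObjectId for the new document
result.inserted_id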
At my job, I have been working on a project to help make Commerce datasets easier to find. One of the barriers to searching for records is when the keywords return either too many or too few results. It can also be a problem if the keywords are too technical for lay users.
One solution is to use topic modeling to extract latent themes from the metadata records and then probabilistically assign each record a more intuitive set of keywords based on its proximity (via k-means) to those topics.
In order to get started, first I had to gather up a bunch of JSON metadata records and store them for analysis and modeling. Here's what I did:
import requests

NOAA_URL = "https://data.noaa.gov/data.json"

def load_data(URL):
    """
    Loads the data from URL and returns data in JSON format.
    """
    r = requests.get(URL)
    data = r.json()
    return data

noaa = load_data(NOAA_URL)
But...this kinda takes a long time, so I've created a file for you that contains a small chunk of the records to use for today's workshop.
In [ ]:
with open("data_sample.json") as data_file:
noaa = json.load(data_file)
In [ ]:
len(noaa)
In [ ]:
pprint(noaa[0])
Or say we wanted just the "description" field:
In [ ]:
pprint(noaa[0]['description'])
In [ ]:
db = conn.earthwindfire
In [ ]:
records = db.records
In [ ]:
# What data fields seem important to you? Add them below following the examples:
def insert(metadata):
    for dataset in metadata:
        data = {}
        data["title"] = dataset["title"]
        data["description"] = dataset["description"]
        data["keywords"] = dataset["keyword"]
        data["accessLevel"] = dataset["accessLevel"]
        data["lang"] = dataset["language"]
        # choose your own
        # choose your own
        # choose your own
        # choose your own
        records.insert_one(data)

insert(noaa)
In [ ]:
# Check to make sure they're all in there
records.count()
In [ ]:
records.find_one()
.find()
To get more than a single document as the result of a query, we use the `find()` method. `find()` returns a Cursor instance, which allows us to iterate over all matching documents:
records.find()
For example, we can iterate over the first 2 documents (there are a lot in the collection and this is just an example) in the records collection:
In [ ]:
for rec in records.find()[:2]:
pprint(rec)
MongoDB queries are represented as JSON-like structures, just like documents. To build a query, you just need to specify a dictionary with the properties you want the results to match. For example, let's say we were just interested in publicly available satellite data from NESDIS.
This query will match all documents in the records collection whose keywords include "NESDIS".
In [ ]:
records.find({"keywords": "NESDIS"}).count()
1117 is probably more than we want to print out in a Jupyter Notebook...
We can further narrow our search by adding additional fields:
In [ ]:
# $all requires that the keywords array contain both values
records.find({"keywords": {"$all": ["NESDIS", "Russia"]}, "accessLevel": "public"}).count()
Since there are only a couple of matches, let's check them out:
In [ ]:
for r in records.find({"keywords": {"$all": ["NESDIS", "Russia"]}, "accessLevel": "public"}):
    pprint(r)
The following table provides an overview of common SQL aggregation terms, functions, and concepts and the corresponding MongoDB aggregation operators:
SQL Terms, Functions, and Concepts | MongoDB Aggregation Operators |
---|---|
WHERE | \$match |
GROUP BY | \$group |
HAVING | \$match |
SELECT | \$project |
ORDER BY | \$sort |
LIMIT | \$limit |
SUM() | \$sum |
COUNT() | \$sum |
JOIN | \$lookup |
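To make the correspondence concrete, here is a sketch of how a simple SQL aggregate query might translate into a pipeline (using the accessLevel field we stored earlier; we'll cover .aggregate() in more detail below):
In [ ]:
# SQL:  SELECT accessLevel, COUNT(*) AS count FROM records GROUP BY accessLevel
cursor = db.records.aggregate([
    {"$group": {"_id": "$accessLevel", "count": {"$sum": 1}}}
])
for document in cursor:
    pprint(document)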
We can also filter using the `$where` operator, which evaluates a JavaScript expression against each document. For example, let's look at two records that have more than 100 keywords:
In [ ]:
cursor = db.records.find({"$where": "this.keywords.length > 100"}).limit(2)
for rec in cursor:
pprint(rec)
One of the things that makes MongoDB special is that it enables us to create search indexes. Indexes provide high performance read operations for frequently used queries.
In particular, a text index will enable us to search for string content in a collection. Keep in mind that a collection can have at most one text index.
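Before we get to text search, note that any field can be indexed with create_index. As a sketch (assuming, just for illustration, that we frequently query on accessLevel), a regular secondary index would look like this:
In [ ]:
# An ascending secondary index on a single field; not required for the rest of the walkthrough
db.records.create_index([("accessLevel", pymongo.ASCENDING)])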
We will create a text index on the description field so that we can search inside our NOAA records text:
In [ ]:
db.records.create_index([('description', 'text')])
To test our newly created text index on the description field, we will search documents using the `$text` operator. Let's start by looking for all the documents that have the word 'precipitation' in their description field.
In [ ]:
cursor = db.records.find({'$text': {'$search': 'precipitation'}})
for rec in cursor:
    print(rec)
In [ ]:
cursor = db.records.find({'$text': {'$search': 'fire'}})
cursor.count()
If we want to create a new text index, we first need to drop the existing one:
In [ ]:
db.records.drop_index("description_text")
We can also create a wildcard text index for scenarios where we want any text field in the records to be searchable. In such scenarios, you can index all the string fields of your document using the `$**` wildcard specifier. Creating the index looks like this:
In [ ]:
db.records.create_index([("$**","text")])
In [ ]:
cursor = db.records.find({'$text': {'$search': "Russia"}})
for rec in cursor:
pprint(rec)
Projections allow you to pass along the documents with only the specified fields to the next stage in the pipeline. The specified fields can be existing fields from the input documents or newly computed fields.
For example, let's redo our full-text Russia search, but project just the titles of the records:
In [ ]:
cursor = db.records.find({'$text': {'$search': "Russia"}}, {"title": 1,"_id":0 })
for rec in cursor:
    print(rec)
In [ ]:
cursor = db.records.find({'$text': {'$search': "Russia"}}, {"title": 1,"_id":0 }).limit(2)
for rec in cursor:
    print(rec)
MongoDB can perform aggregation operations with `.aggregate()`, such as grouping by a specified key and evaluating a total or a count for each distinct group.
Use the `$group` stage to group by a specified key via the _id field. `$group` accesses fields by the field path, which is the field name prefixed by a dollar sign.
For example, we can use `$group` to aggregate all the languages of the NOAA records:
In [ ]:
cursor = db.records.aggregate(
[
{"$group": {"_id": "$lang", "count": {"$sum": 1}}}
]
)
for document in cursor:
pprint(document)
Or we can combine `$match` and `$group` to aggregate the titles of just the public-access records that match the word 'Russia':
In [ ]:
cursor = db.records.aggregate(
[
{"$match": {'$text': {'$search': "Russia"}, "accessLevel": "public"}},
{"$group": {"_id": "$title"}}
]
)
for document in cursor:
pprint(document)
The aggregation pipeline allows MongoDB to provide native aggregation capabilities that correspond to many common data aggregation operations in SQL. Here's where you will put the pieces together to aggregate the data into results that you can begin to analyze and perform machine learning on.
Here's an example of an aggregation pipeline:
In [1]:
from IPython.display import Image
Image(filename='images/mongodb_pipeline.png', width=600, height=300)
Out[1]:
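As a sketch of what a multi-stage pipeline might look like in code (using the fields we stored earlier), here is one that filters to public records, counts them per language, and returns the largest groups first:
In [ ]:
# $match -> $group -> $sort -> $limit
cursor = db.records.aggregate([
    {"$match": {"accessLevel": "public"}},
    {"$group": {"_id": "$lang", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 5},
])
for document in cursor:
    pprint(document)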
Now let's clean up. First, let's check which collections are currently in our earthwindfire database:
In [ ]:
conn.earthwindfire.collection_names()
Now let's delete our records collection and check again to see what collections are in our database:
In [ ]:
conn.earthwindfire.drop_collection("records")
conn.earthwindfire.collection_names()
We can also just drop a database. First let's determine what databases we have:
In [ ]:
conn.database_names()
Now let's remove the earthwindfire database:
In [ ]:
conn.drop_database("earthwindfire")
conn.database_names()
Nice work!
The `dbstats` command returns statistics that reflect the use state of a single database:
In [ ]:
db = conn.mydb
collection = db.my_collection
db.command({'dbstats': 1})
`collStats` returns a variety of storage statistics for a given collection. Let's try it out on my_collection:
In [ ]:
db.command({'collstats': 'my_collection', 'verbose': True})