MongoDB Tutorial - Monday 22 February 2016

Introduction

In what follows, we assume that you have installed MongoDB according to the instructions online and started the MongoDB daemon with the following command.

mongod

Note that you might have to create a folder /data/db with appropriate access rights before the daemon starts successfully.

We also assume that you use Python 3 and have the pymongo driver installed and imported (see also Setup instructions at the end of the notebook).

Note To run the notebook yourself, install Jupyter, download the notebook, and open it with Jupyter.

Note This notebook might be updated later. Major updates will be listed at the bottom of it, but you can also check the updates on github.

"Hello World!" : Databases, Collections, Documents

Relational databases contain tables that contain records.

A MongoDB database contains collections that contain documents.


In [ ]:
# start a client
client = pm.MongoClient() 

# connect to a database
db = client.moderndb

# get a collection
coll = db.test_collection

Documents follow the JSON format and MongoDB stores them in a binary version of it (BSON).

Below you see examples of JSON documents.

JSON example 0

{}

JSON example 1

{
"name" : "Michael",
"age": 32,
"grades": [71, 85, 90, 34]
}

JSON example 2

{
  "first name": "Michael",
  "last name": "Mathioudakis",
  "age": 32,
  "grades": {
    "ModernDB": 69,
    "Data Mining": 71,
    "Machine Learning": 95
  },
  "graduated": true,
  "previous schools": ["NTUA", "UofT"]
}

In Python, JSON documents are represented as dictionaries. The examples above are therefore represented as follows.


In [ ]:
example_0 = {}

In [ ]:
example_1 = {"name": "Michael", "age": 32, "grades": [71, 85, 90, 34]}

In [ ]:
example_2 = {
    "first name": "Michael",
    "last name": "Mathioudakis",
    "age": 32,
    "grades": {
        "ModernDB": 69,
        "Data Mining": 71,
        "Machine Learning": 95
    },
    "graduated": True,
    "previous schools": ["NTUA", "UofT"]
}

Note that we can also use native Python objects, like the datetime object below, to specify values.


In [ ]:
import datetime
example_3 = {"name": "Modern Database Systems",
          "start": datetime.datetime(2016, 1, 12),
          "end": datetime.datetime(2016, 3, 26),
          "tags": ["rdbms", "mongodb", "spark"]}

Inserting and finding documents

Our collection coll is currently empty. Let's add one document to it.


In [ ]:
coll.insert_one(example_0)

If we call the collection's find() method, we get back a cursor.


In [ ]:
coll.find()

We can use the cursor to iterate over all documents in the collection.


In [ ]:
for doc in coll.find():
    print(doc)

Notice that the empty document we inserted is not really empty, but associated with an "_id" key, added by MongoDB.
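
Note also that pymongo writes the generated "_id" (an ObjectId) back into the Python dictionary we passed to insert_one(). A small check -- assuming the cell that inserted example_0 above has been run:


In [ ]:
# the driver added the generated _id to the dict we inserted
print(example_0)
print(type(example_0["_id"]))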

Let's try another one.


In [ ]:
coll.insert_one(example_1)

In [ ]:
for doc in coll.find():
    print(doc)
    print()

Notice how MongoDB added an "_id" for the new document, as well. Let's insert more documents.


In [ ]:
coll.insert_many([example_2, example_3])

In [ ]:
for doc in coll.find():
    print(doc)
    print()

Notice how the documents we insert do not follow a fixed schema?

Let us now find documents that match a condition -- let's say we want to find documents that have a field "name" with value "Michael".


In [ ]:
query_result = coll.find({"name": "Michael"})
for doc in query_result:
    print(doc)

Projecting fields

We can use find() not only to retrieve documents that match a condition, but also to project only those fields that we are interested in.

For example, to suppress the "_id" field from appearing in the results, we can provide a second argument to find(), as follows.


In [ ]:
query_result = coll.find({"name": "Michael"}, {"_id": 0})
for doc in query_result:
    print(doc)

What if we are interested in keeping only some of the remaining fields -- say, only "grades"?


In [ ]:
query_result = coll.find({"name": "Michael"}, {"_id": 0, "grades": 1})
for doc in query_result:
    print(doc)
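
Conversely, we can exclude specific fields and keep all the others -- for example, to hide "grades". (A small sketch; note that, apart from "_id", MongoDB does not let us mix inclusion and exclusion in the same projection.)


In [ ]:
# exclude "grades" and "_id"; all other fields are returned
query_result = coll.find({"name": "Michael"}, {"_id": 0, "grades": 0})
for doc in query_result:
    print(doc)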

Loading a larger dataset

Download the file primer-dataset.json and load it into MongoDB by running the command below.


In [ ]:
%%bash
mongoimport --db moderndb --collection restaurants --drop --file primer-dataset.json

Alternatively, you can import the dataset by running the same command in a terminal.

mongoimport --db moderndb --collection restaurants --drop --file primer-dataset.json

The dataset contains documents that look like the one below.

Restaurant Example

{
  "address": {
     "building": "1007",
     "coord": [ -73.856077, 40.848447 ],
     "street": "Morris Park Ave",
     "zipcode": "10462"
  },
  "borough": "Bronx",
  "cuisine": "Bakery",
  "grades": [
     { "date": { "$date": 1393804800000 }, "grade": "A", "score": 2 },
     { "date": { "$date": 1378857600000 }, "grade": "A", "score": 6 },
     { "date": { "$date": 1358985600000 }, "grade": "A", "score": 10 },
     { "date": { "$date": 1322006400000 }, "grade": "A", "score": 9 },
     { "date": { "$date": 1299715200000 }, "grade": "B", "score": 14 }
  ],
  "name": "Morris Park Bake Shop",
  "restaurant_id": "30075445"
}

In [ ]:
restaurants = db.restaurants # our new collection

In [ ]:
# how many restaurants?
restaurants.count()
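
Note In newer versions of pymongo, count() on a collection has been deprecated; if you are running a recent driver, count_documents() with an empty filter gives the same result.


In [ ]:
# equivalent call on newer pymongo versions
restaurants.count_documents({})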

Querying the Dataset


In [ ]:
# retrieve a cursor over all documents in the collection
cursor = restaurants.find()

In [ ]:
# define printing function
def print_my_docs(cursor, num):
    for i in range(num): # print only up to num next documents from cursor
        try:
            print(next(cursor))
            print()
        except:
            break

In [ ]:
# let's print a few documents
print_my_docs(cursor, 3)

In [ ]:
next(cursor) # get one more document

Specify equality conditions


In [ ]:
# top-level field
cursor = restaurants.find({"borough": "Manhattan"})

print_my_docs(cursor, 2)

In [ ]:
# nested field (in embedded document)
cursor = restaurants.find({"address.zipcode": "10075"})

print_my_docs(cursor, 2)

In [ ]:
# query by field in array
cursor = restaurants.find({"grades.grade": "B"})

In [ ]:
# print one document from the query result
next(cursor)['grades']
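
The query above matches a document if any element of the "grades" array has grade "B". If we want several conditions to hold for the same array element, we can use the \$elemMatch operator -- a small sketch:


In [ ]:
# at least one grade that is a "B" and has a score greater than 20
cursor = restaurants.find({"grades": {"$elemMatch": {"grade": "B", "score": {"$gt": 20}}}})
print_my_docs(cursor, 2)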

In [ ]:
# exact array match
cursor = restaurants.find({"address.coord": [-73.98513559999999, 40.7676919]})

In [ ]:
print_my_docs(cursor, 10)

Specify Range Conditions


In [ ]:
cursor = restaurants.find({"grades.score": {"$gt": 30}})

In [ ]:
cursor = restaurants.find({"grades.score": {"$lt": 10}})

In [ ]:
next(cursor)["grades"]

Multiple Conditions


In [ ]:
# logical AND
cursor = restaurants.find({"cuisine": "Italian", "address.zipcode": "10075"})

In [ ]:
next(cursor)

In [ ]:
# logical OR
cursor = restaurants.find({"$or": [{"cuisine": "Italian"},
                                          {"address.zipcode": "10075"}]})

In [ ]:
print_my_docs(cursor, 3)

In [ ]:
# logical AND, differently
cursor = restaurants.find({"$and": [{"cuisine": "Italian"},
                                          {"address.zipcode": "10075"}]})

In [ ]:
next(cursor)

Sorting


In [ ]:
cursor = restaurants.find()

# to sort, specify list of sorting criteria,
# each criterion given as a tuple
# (field_name, sort_order)
# here we have only one
sorted_cursor = cursor.sort([("borough", pm.ASCENDING)])

In [ ]:
print_my_docs(sorted_cursor, 2)

In [ ]:
another_sorted_cursor = restaurants.find().sort([("borough", pm.ASCENDING),
                                     ("address.zipcode", pm.DESCENDING)])

In [ ]:
print_my_docs(another_sorted_cursor, 3)
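
A sorted cursor can also be combined with limit(); for instance (a small sketch), the first five restaurants in the combined order above:


In [ ]:
# chain sort() and limit() on the same cursor
top_five = restaurants.find().sort([("borough", pm.ASCENDING),
                                    ("address.zipcode", pm.DESCENDING)]).limit(5)
print_my_docs(top_five, 5)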

Aggregation

Aggregation happens in stages: we specify a pipeline of stages, and each stage transforms the documents it receives before passing them on to the next one.


In [ ]:
# Group Documents by a Field and Calculate Count
cursor = restaurants.aggregate(
    [
        {"$group": {"_id": "$borough", "count": {"$sum": 1}}}
    ]
)

In [ ]:
print_my_docs(cursor, 10)

In [ ]:
# Filter and Group Documents
cursor = restaurants.aggregate(
    [
        {"$match": {"borough": "Queens", "cuisine": "Brazilian"}},
        {"$group": {"_id": "$address.zipcode", "count": {"$sum": 1}}}
    ]
)

In [ ]:
print_my_docs(cursor, 10)

In [ ]:
# Filter and Group and then Filter Again documents
cursor = restaurants.aggregate(
    [
        {"$match": {"borough": "Manhattan", "cuisine": "American "}},
        {"$group": {"_id": "$address.zipcode", "count": {"$sum": 1}}},
        {"$match": {"count": {"$gt": 1}}}
    ]
)

In [ ]:
print_my_docs(cursor, 10)

In [ ]:
# Filter and Group and then Filter Again and then Sort Documents
cursor = restaurants.aggregate(
    [
        {"$match": {"borough": "Manhattan", "cuisine": "American "}},
        {"$group": {"_id": "$address.zipcode",  "count": {"$sum": 1}}},
        {"$match": {"count": {"$gt": 1}}},
        {"$sort":  {"count": -1, "_id": -1}}
    ]
)

In [ ]:
print_my_docs(cursor, 10)

In [ ]:
# Same as above, but give the sort keys as a SON (ordered) document, so that their order is guaranteed
# Filter and Group and then Filter Again and then Sort Documents
cursor = restaurants.aggregate(
    [
        {"$match": {"borough": "Manhattan", "cuisine": "American "}},
        {"$group": {"_id": "$address.zipcode",  "count": {"$sum": 1}}},
        {"$match": {"count": {"$gt": 1}}},
        {"$sort":  son.SON([("count", -1), ("_id", 1)])} # order matters!!
    ]
)

In [ ]:
print_my_docs(cursor, 10)

In [ ]:
# what will this do?
cursor = restaurants.aggregate(
    [
        {"$group": {"_id": None, "count": {"$sum": 1}} }
    ]
)

In [ ]:
print_my_docs(cursor, 10)

In [ ]:
# projection
# what will this do?
cursor = restaurants.aggregate(
    [
        {"$group": {"_id": "$address.zipcode", "count": {"$sum": 1}}},
        {"$project": {"_id": 0, "count": 1}}
    ]
)

In [ ]:
print_my_docs(cursor, 10)

In [ ]:
# what will this do?
cursor = restaurants.aggregate(
    [
        {"$group": {"_id": {"cuisine": "$cuisine"}, "count": {"$sum": 1}}},
        {"$sort": {"count": -1}}
    ]
)

In [ ]:
print_my_docs(cursor, 5)

In [ ]:
# what will this do?
cursor = restaurants.aggregate(
    [
        {"$group": {"_id": {"zip": "$address.zipcode"}, "count": {"$sum": 1}}},
        {"$sort": {"count": -1}}
    ]
)

In [ ]:
print_my_docs(cursor, 5)

In [ ]:
# what will this do?
cursor = restaurants.aggregate(
    [
        {"$group": {"_id": {"cuisine": "$cuisine", "zip": "$address.zipcode"}, "count": {"$sum": 1}}},
        {"$sort": {"count": -1}}
    ]
)

In [ ]:
print_my_docs(cursor, 5)

Limiting the number of results


In [ ]:
# what will this do?

cursor = restaurants.aggregate(
    [
        {"$group": {"_id": {"cuisine": "$cuisine", "zip": "$address.zipcode"}, "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
        {"$limit": 10} # See comment under "In-class questions"
    ]
)

for doc in cursor:
    print(doc["_id"]["cuisine"], doc["_id"]["zip"], doc["count"])

Storing the result as a collection

We can use the \$out operator in a final stage to store the result of a query into a new collection. The following example selects restaurants from Manhattan and stores them in their own collection in the same database.


In [ ]:
restaurants.aggregate(
    [
        {"$match": {"borough": "Manhattan"}},
        {"$out": "manhattan"}
    ]
)
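
To verify the result (assuming the cell above has been executed), we can query the new collection like any other -- a small sketch:


In [ ]:
# $out created (or replaced) a regular collection named "manhattan"
manhattan = db.manhattan
print(manhattan.count())
manhattan.find_one()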

SQL to Aggregation

Here we explore the correspondence between SQL queries and the aggregation framework.

SQL query

SELECT COUNT(*) AS count
FROM restaurants

In [ ]:
cursor = restaurants.aggregate(
    [
        {"$group": {"_id": None, "count": {"$sum": 1}} }
    ]
)

SQL query

SELECT borough, cuisine, COUNT(*) as count
FROM restaurants
GROUP BY borough, cuisine

In [ ]:
cursor = restaurants.aggregate(
    [
        {"$group": {"_id": {"borough": "$borough", "cuisine": "$cuisine"}, "count": {"$sum": 1}}}
    ]
)

SQL query

SELECT borough, cuisine, COUNT(*) as count
FROM restaurants
GROUP BY borough, cuisine
HAVING COUNT(*) > 3

In [ ]:
cursor = restaurants.aggregate(
    [
        {"$group": {"_id": {"borough": "$borough", "cuisine": "$cuisine"}, "count": {"$sum": 1}}},
        {"$match": {"count": {"$gt": 3}}}
    ]
)

SQL Query

SELECT zipcode, cuisine, COUNT(*) as count
FROM restaurants
WHERE borough = "Manhattan"
GROUP BY zipcode, cuisine
HAVING COUNT(*) > 3

In [ ]:
cursor = restaurants.aggregate(
    [
        {"$match": {"borough": "Manhattan"}},
        {"$group": {"_id": {"zipcode": "$address.zipcode", "cuisine": "$cuisine"}, "count": {"$sum": 1}}},
        {"$match": {"count": {"$gt": 3}}}
    ]
)

In [ ]:
print_my_docs(cursor, 5)

SQL Query

SELECT zipcode, cuisine, COUNT(*) as count
FROM restaurants
WHERE borough = "Manhattan"
GROUP BY zipcode, cuisine
HAVING COUNT(*) > 3
ORDER BY count

In [ ]:
cursor = restaurants.aggregate(
    [
        {"$match": {"borough": "Manhattan"}},
        {"$group": {"_id": {"zipcode": "$address.zipcode", "cuisine": "$cuisine"}, "count": {"$sum": 1}}},
        {"$match": {"count": {"$gt": 3}}},
        {"$sort": {"count": 1}}
    ]
)

Using secondary memory (disk)


In [ ]:
cursor = restaurants.aggregate(
    [
        {"$match": {"borough": "Manhattan"}},
        {"$group": {"_id": {"zipcode": "$address.zipcode", "cuisine": "$cuisine"}, "count": {"$sum": 1}}},
        {"$match": {"count": {"$gt": 3}}},
        {"$sort": {"count": 1}}
    ],
    allowDiskUse = True # this can be useful when data does not fit in memory, e.g., to perform external sorting
)

Indexing

MongoDB automatically creates an index on the _id field when a collection is created. We can use create_index() to create an index on one or more fields of a collection.

Single-field index


In [ ]:
# note that the argument is a list of tuples
# [(<field>, <direction>), ...]
# here, we specify only one such tuple, for a single field
restaurants.create_index([("borough", pm.ASCENDING)])

The index is created only if it does not already exist.
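
We can inspect the indexes that currently exist on a collection with index_information() -- a small sketch:


In [ ]:
# dictionary mapping each index name to its description
# (it includes the default index on "_id")
restaurants.index_information()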

Compound index


In [ ]:
# compound index (more than one indexed field)
restaurants.create_index([
    ("cuisine", pm.ASCENDING),
    ("address.zipcode", pm.DESCENDING)
])

Deleting indexes


In [ ]:
restaurants.drop_index('borough_1') # drop this index

In [ ]:
restaurants.drop_index('cuisine_1_address.zipcode_-1') # drop that index

In [ ]:
restaurants.drop_indexes() # drop all indexes -- use this with care!

Multi-key index

An index on a field whose value is an array.


In [ ]:
restaurants.find_one()

In [ ]:
restaurants.create_index([("address.coord", 1)])

In [ ]:
restaurants.create_index([("grades.score", 1)])

In [ ]:
restaurants.create_index([("grades.grade", 1), ("grades.score", 1)])

The following will not work! A compound index cannot include more than one array-valued (multi-key) field -- and here both address.coord and grades are arrays.


In [ ]:
restaurants.create_index([("address.coord", 1), ("grades.score", 1)]) # NOPE!

Retrieving the execution plan

We can retrieve the execution plan for a find() query by calling the explain() function on the result cursor. We demonstrate this in the following example.


In [ ]:
restaurants.drop_indexes() # we drop all indexes first -- use this with care!
restaurants.create_index([("borough", pm.ASCENDING)]) # build an index on field "borough", in ascending order
my_cursor = restaurants.find({"borough": "Brooklyn"}) # submit query to find restaurants from specific borough
my_cursor.explain()["queryPlanner"]["winningPlan"] # ask mongodb to explain execution plan

As we see in this example, MongoDB makes use of an index (as indicated by the "IXSCAN" stage) -- in particular, the index ('borough_1') we just built -- to execute the query. What if we had not built this index?


In [ ]:
restaurants.drop_indexes() # we drop all indexes first -- use this with care!
my_cursor = restaurants.find({"borough": "Brooklyn"}) # submit query to find restaurants from specific borough
my_cursor.explain()["queryPlanner"]["winningPlan"] # ask mongodb to explain execution plan

In that case, MongoDB simply performs a scan over the collection (as indicated by keyword "COLLSCAN").

Joins

Until very recently, MongoDB did not support joins. It was up to the user to implement a join if needed.

For example, the following double for-loop emulates a simple nested-loops join.


In [ ]:
for a in restaurants.find({"borough": "Manhattan"}).limit(7):
    for b in restaurants.find({"borough": "Bronx"}).limit(5):
        if a["cuisine"] == b["cuisine"]:
            print(a["cuisine"], a["address"]["zipcode"], b["address"]["zipcode"])

Joins with \$lookup

This is a new aggregation stage that implements left outer equi-joins.

"A left outer equi-join produces a result set that contains data for all documents from the left table (collection) together with data from the right table (collection) for documents where there is a match with documents from the left table (collection)."


In [ ]:
# create first collection
orders_docs = [
    {"_id": 1, "item": "abc", "price": 12, "quantity": 2},
    {"_id": 2, "item": "jkl", "price": 20, "quantity": 1},
    {"_id": 3}
]
orders = db.orders
orders.drop()
orders.insert_many(orders_docs)

In [ ]:
# create second collection
inventory_docs = [
    {"_id": 1, "item": "abc", "description": "product 1", "instock": 120},
    {"_id": 2, "item": "def", "description": "product 2", "instock": 80},
    {"_id": 3, "item": "ijk", "description": "product 3", "instock": 60},
    {"_id": 4, "item": "jkl", "description": "product 4", "instock": 70},
    {"_id": 5, "item": None, "description": "Incomplete"},
    {"_id": 6}
]
inventory = db.inventory
inventory.drop()
inventory.insert_many(inventory_docs)

In [ ]:
result = orders.aggregate([ # "orders" is the outer collection
    {
      "$lookup":
        {
          "from": "inventory", # the inner collection 
          "localField": "item", # the join field of the outer collection
          "foreignField": "item", # the join field of the outer collection
          "as": "inventory_docs" # name of field with array of joined inner docs
        }
   }
])

In [ ]:
print_my_docs(result, 10)

In-class questions

Question: How do we query for documents with an array field, all the elements of which satisfy a condition?

Two approaches (if you can think of a different approach, please let me know):

  • Use the \$not operator: form a query expressing that "there is no element in the array that does not satisfy the condition".
  • In aggregation, combine an \$unwind stage with a \$group stage.

To provide an example, let's say we want to find restaurants with 'A' grades only. Below we show how we can use each of the aforementioned approaches.

First approach: using $not


In [ ]:
# using the $not operator
# "find restaurants that contain no grades that are not equal to A"
cursor = restaurants.find({"grades.grade": {"$exists": True}, "grades": {"$not": {"$elemMatch": {"grade": {"$ne": "A"}}}}})

In [ ]:
print_my_docs(cursor, 3)

Note on the semantics of the \$not operator

The operator selects documents that do not match the specified condition on the specified field. These documents include ones that do not contain the field.

To demonstrate this, consider the following simple example of a collection.


In [ ]:
# simple example of a collection
mycoll = db.mycoll
mycoll.drop()

# insert five documents
mycoll.insert_one({"grades": [7, 7]})
mycoll.insert_one({"grades": [7, 3]})
mycoll.insert_one({"grades": [3, 3]})
mycoll.insert_one({"grades": []})
mycoll.insert_one({})

The result of the following query contains documents that do not contain the "grades" field.


In [ ]:
# find documents that have no "grades" elements that are not equal to 7
mycursor = mycoll.find({"grades": {"$not": {"$elemMatch": {"$ne": 7}}}})
print_my_docs(mycursor, 10)

We can remove such documents from the result as a post-processing step. (Exercise: how?)

Second approach: aggregation pipeline


In [ ]:
# using aggregation
mycursor = restaurants.aggregate(
    [
        # unwind the grades array
        {"$unwind": "$grades"}, #now each document contains one "grades" value
        
        # group by document "_id" and count:
        # (i) the total number of documents in each group as `count`
        # -- this is the same as the number of elements in the original array
        # (ii) the number of documents that satisfy the condition (grade = "A") as `num_satisfied`
        {"$group": {"_id": "$_id", "count": {"$sum": 1}, "num_satisfied": {"$sum": {"$cond": [{"$eq": ["$grades.grade", "A"]}, 1, 0]}}}},
        
        # create a field (named `same_count`) that is 1 if (count = num_satisfied) and 0 otherwise
        {"$project": {"_id": 1, "same_count": {"$cond": [{"$eq": ["$count", "$num_satisfied"]} , 1, 0]}}},
        
        # keep only the document ids for which (same_count = 1)
        {"$match": {"same_count": 1}}
    ]
)

print_my_docs(mycursor, 5)

Question: Does MongoDB optimize the stages of an aggregation pipeline?

The question was asked in relation to the "limit" query we saw above ("Limiting the number of results").

Indeed, MongoDB does optimize the execution of the aggregation pipeline, as explained here. In relation to the aforementioned query, see, in particular, the part on sort+limit coalescence.


Credits and references

We used and consulted material from:


Updates to the notebook

  • Compared to the in-class tutorial, we've added answers to two in-class questions.
  • [Feb 26] Added sub-sections projecting fields, storing the result as a collection, and retrieving the execution plan.

Setup

Run the following commands before the rest of the notebook.


In [ ]:
import pymongo as pm
client = pm.MongoClient()
client.drop_database("moderndb")

In [ ]:
import bson.son as son
