In what follows, we assume that you have installed MongoDB according to the instructions online and started the MongoDB daemon with the following command.
mongod
Note that you might have to create a folder /data/db
with appropriate access rights before the daemon starts successfully.
We also assume that you use Python 3 and have the pymongo driver installed and imported (see also Setup instructions at the end of the notebook).
Note To run the notebook yourself, install Jupyter, download the notebook, and open it with Jupyter.
Note This notebook might be updated later. Major updates will be listed at the bottom of it, but you can also check the updates on github.
In [ ]:
# start a client
client = pm.MongoClient()
# connect to a database
db = client.moderndb
# get a collection
coll = db.test_collection
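Note that MongoDB creates databases and collections lazily, on the first insert. As an optional sanity check (a small sketch, not part of the original setup), we can ask the server for its build information; this raises an exception if no mongod daemon is reachable.
In [ ]:
# optional: verify the connection by asking for the server version
client.server_info()["version"]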
Documents follow the JSON format and MongoDB stores them in a binary version of it (BSON).
Below you see examples of JSON documents.
JSON example 0
{}
JSON example 1
{
"name" : "Michael",
"age": 32,
"grades": [71, 85, 90, 34]
}
JSON example 2
{
"first name": "Michael",
"last name": "Mathioudakis",
"age": 32,
"grades": {
"ModernDB": 69,
"Data Mining": 71,
"Machine Learning": 95
},
"graduated": true,
"previous schools": ["NTUA", "UofT"]
}
In Python, JSON documents are represented as dictionaries. The examples from above are therefore represented as follows.
In [ ]:
example_0 = {}
In [ ]:
example_1 = {"name": "Michael", "age": 32, "grades": [71, 85, 90, 34]}
In [ ]:
example_2 = {
    "first name": "Michael",
    "last name": "Mathioudakis",
    "age": 32,
    "grades": {
        "ModernDB": 69,
        "Data Mining": 71,
        "Machine Learning": 95
    },
    "graduated": True,
    "previous schools": ["NTUA", "UofT"]
}
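Under the hood, PyMongo serializes these dictionaries to BSON before sending them to the server. As a small illustration (assuming a reasonably recent PyMongo, whose bundled bson package exposes encode/decode; older releases use bson.BSON.encode instead), we can do the conversion manually.
In [ ]:
import bson
encoded = bson.encode(example_1)  # a bytes object in BSON format
print(encoded)
print(bson.decode(encoded))       # back to a Python dictionary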
Note that we can also use native Python objects, like the datetime
object below, to specify values.
In [ ]:
import datetime
example_3 = {"name": "Modern Database Systems",
"start": datetime.datetime(2016, 1, 12),
"end": datetime.datetime(2016, 3, 26),
"tags": ["rdbms", "mongodb", "spark"]}
In [ ]:
coll.insert_one(example_0)
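insert_one() returns an InsertOneResult object, whose inserted_id attribute holds the "_id" of the document that was just inserted. A quick illustration (we insert a throwaway document and remove it again, so the collection is left unchanged):
In [ ]:
result = coll.insert_one({"throwaway": True})
print(result.acknowledged)   # True if the write was acknowledged by the server
print(result.inserted_id)    # the "_id" assigned to the new document
coll.delete_one({"_id": result.inserted_id})  # clean up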
If we call the collection's function find()
, we get back a cursor.
In [ ]:
coll.find()
We can use the cursor to iterate over all documents in the collection.
In [ ]:
for doc in coll.find():
    print(doc)
Notice that the empty document we inserted is not really empty, but associated with an "_id" key, added by MongoDB.
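If a document already contains an "_id" field, MongoDB keeps it instead of generating one (inserting the same "_id" twice raises a DuplicateKeyError). A small sketch with a made-up id value:
In [ ]:
coll.insert_one({"_id": "my-custom-id", "note": "document with a user-supplied id"})
print(coll.find_one({"_id": "my-custom-id"}))
coll.delete_one({"_id": "my-custom-id"})  # remove it again so the following examples are unaffected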
Let's try another one.
In [ ]:
coll.insert_one(example_1)
In [ ]:
for doc in coll.find():
    print(doc)
    print()
Notice how MongoDB added an "_id" for the new document, as well. Let's insert more documents.
In [ ]:
coll.insert_many([example_2, example_3])
In [ ]:
for doc in coll.find():
    print(doc)
    print()
Notice how the documents we inserted do not follow a fixed schema?
Let us now find documents that match a condition -- let's say we want to find documents that have a field "name" with value "Michael".
In [ ]:
query_result = coll.find({"name": "Michael"})
for doc in query_result:
    print(doc)
By default, query results include the "_id" field. We can suppress it (or select specific fields) by passing a projection document as the second argument to find().
In [ ]:
query_result = coll.find({"name": "Michael"}, {"_id": 0})
for doc in query_result:
    print(doc)
What if we're interested in keeping only some of the rest of the fields -- let's say, only "grades"?
In [ ]:
query_result = coll.find({"name": "Michael"}, {"_id": 0, "grades": 1})
for doc in query_result:
    print(doc)
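If we only need a single matching document rather than a cursor, we can use find_one(), which returns the first match (or None if there is no match). It accepts the same filter and projection arguments as find().
In [ ]:
coll.find_one({"name": "Michael"}, {"_id": 0, "grades": 1})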
Download file primer-dataset.json and load it into mongodb by running the command below.
In [ ]:
%%bash
mongoimport --db moderndb --collection restaurants --drop --file primer-dataset.json
Alternatively, you can import the dataset by running the same command in a terminal.
mongoimport --db moderndb --collection restaurants --drop --file primer-dataset.json
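If mongoimport is not available, a rough Python alternative (a sketch, assuming primer-dataset.json contains one JSON document per line) is to parse the file and call insert_many(). We use bson.json_util, which understands MongoDB's extended JSON, so values such as {"$date": ...} are converted to proper datetimes.
In [ ]:
from bson import json_util  # parses MongoDB extended JSON, e.g. {"$date": ...} -> datetime

restaurants = db.restaurants
restaurants.drop()  # start from an empty collection, like --drop above
with open("primer-dataset.json") as f:
    docs = [json_util.loads(line) for line in f if line.strip()]
restaurants.insert_many(docs)
print(len(docs), "documents inserted")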
The dataset contains documents that look like the one below.
Restaurant Example
{
"address": {
"building": "1007",
"coord": [ -73.856077, 40.848447 ],
"street": "Morris Park Ave",
"zipcode": "10462"
},
"borough": "Bronx",
"cuisine": "Bakery",
"grades": [
{ "date": { "$date": 1393804800000 }, "grade": "A", "score": 2 },
{ "date": { "$date": 1378857600000 }, "grade": "A", "score": 6 },
{ "date": { "$date": 1358985600000 }, "grade": "A", "score": 10 },
{ "date": { "$date": 1322006400000 }, "grade": "A", "score": 9 },
{ "date": { "$date": 1299715200000 }, "grade": "B", "score": 14 }
],
"name": "Morris Park Bake Shop",
"restaurant_id": "30075445"
}
In [ ]:
restaurants = db.restaurants # our new collection
In [ ]:
# how many restaurants?
restaurants.count()
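Note that Collection.count() is deprecated in newer PyMongo releases (and removed in PyMongo 4); the replacement is count_documents(), which takes a filter document.
In [ ]:
# same result with the newer API: count all documents matching the empty filter
restaurants.count_documents({})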
In [ ]:
# retrieve a cursor over all documents in the collection
cursor = restaurants.find()
In [ ]:
# define printing function
def print_my_docs(cursor, num):
    for i in range(num):  # print only up to num next documents from the cursor
        try:
            print(next(cursor))
            print()
        except StopIteration:  # the cursor is exhausted
            break
In [ ]:
# let's print a few documents
print_my_docs(cursor, 3)
In [ ]:
next(cursor) # get one more document
In [ ]:
# top-level field
cursor = restaurants.find({"borough": "Manhattan"})
print_my_docs(cursor, 2)
In [ ]:
# nested field (in embedded document)
cursor = restaurants.find({"address.zipcode": "10075"})
print_my_docs(cursor, 2)
In [ ]:
# query by field in array
cursor = restaurants.find({"grades.grade": "B"})
In [ ]:
# print one document from the query result
next(cursor)['grades']
In [ ]:
# exact array match
cursor = restaurants.find({"address.coord": [-73.98513559999999, 40.7676919]})
In [ ]:
print_my_docs(cursor, 10)
In [ ]:
cursor = restaurants.find({"grades.score": {"$gt": 30}})
In [ ]:
cursor = restaurants.find({"grades.score": {"$lt": 10}})
In [ ]:
next(cursor)["grades"]
In [ ]:
# logical AND
cursor = restaurants.find({"cuisine": "Italian", "address.zipcode": "10075"})
In [ ]:
next(cursor)
In [ ]:
# logical OR
cursor = restaurants.find({"$or": [{"cuisine": "Italian"},
{"address.zipcode": "10075"}]})
In [ ]:
print_my_docs(cursor, 3)
In [ ]:
# logical AND, differently
cursor = restaurants.find({"$and": [{"cuisine": "Italian"},
{"address.zipcode": "10075"}]})
In [ ]:
next(cursor)
In [ ]:
cursor = restaurants.find()
# to sort, specify list of sorting criteria,
# each criterion given as a tuple
# (field_name, sort_order)
# here we have only one
sorted_cursor = cursor.sort([("borough", pm.ASCENDING)])
In [ ]:
print_my_docs(sorted_cursor, 2)  # note: sort() modifies and returns the same cursor object, so cursor and sorted_cursor refer to the same cursor
In [ ]:
another_sorted_cursor = restaurants.find().sort([("borough", pm.ASCENDING),
("address.zipcode", pm.DESCENDING)])
In [ ]:
print_my_docs(another_sorted_cursor, 3)
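Cursors also support limit() and skip(), which cap and offset the number of returned documents; both return the cursor itself, so they can be chained with sort(). A small sketch:
In [ ]:
# names of three restaurants, in alphabetical order, skipping the very first one
for doc in restaurants.find({}, {"_id": 0, "name": 1}).sort([("name", pm.ASCENDING)]).skip(1).limit(3):
    print(doc)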
In [ ]:
# Group Documents by a Field and Calculate Count
cursor = restaurants.aggregate(
[
{"$group": {"_id": "$borough", "count": {"$sum": 1}}}
]
)
In [ ]:
print_my_docs(cursor, 10)
In [ ]:
# Filter and Group Documents
cursor = restaurants.aggregate(
[
{"$match": {"borough": "Queens", "cuisine": "Brazilian"}},
{"$group": {"_id": "$address.zipcode", "count": {"$sum": 1}}}
]
)
In [ ]:
print_my_docs(cursor, 10)
In [ ]:
# Filter and Group and then Filter Again documents
cursor = restaurants.aggregate(
[
{"$match": {"borough": "Manhattan", "cuisine": "American "}},
{"$group": {"_id": "$address.zipcode", "count": {"$sum": 1}}},
{"$match": {"count": {"$gt": 1}}}
]
)
In [ ]:
print_my_docs(cursor, 10)
In [ ]:
# Filter and Group and then Filter Again and then Sort Documents
cursor = restaurants.aggregate(
[
{"$match": {"borough": "Manhattan", "cuisine": "American "}},
{"$group": {"_id": "$address.zipcode", "count": {"$sum": 1}}},
{"$match": {"count": {"$gt": 1}}},
{"$sort": {"count": -1, "_id": -1}}
]
)
In [ ]:
print_my_docs(cursor, 10)
In [ ]:
# Same as above, but using bson.son.SON to guarantee the order of the sort keys
# Filter and Group and then Filter Again and then Sort Documents
cursor = restaurants.aggregate(
[
{"$match": {"borough": "Manhattan", "cuisine": "American "}},
{"$group": {"_id": "$address.zipcode", "count": {"$sum": 1}}},
{"$match": {"count": {"$gt": 1}}},
{"$sort": son.SON([("count", -1), ("_id", 1)])} # order matters!!
]
)
In [ ]:
print_my_docs(cursor, 10)
In [ ]:
# what will this do?
cursor = restaurants.aggregate(
[
{"$group": {"_id": None, "count": {"$sum": 1}} }
]
)
In [ ]:
print_my_docs(cursor, 10)
In [ ]:
# projection
# what will this do?
cursor = restaurants.aggregate(
[
{"$group": {"_id": "$address.zipcode", "count": {"$sum": 1}}},
{"$project": {"_id": 0, "count": 1}}
]
)
In [ ]:
print_my_docs(cursor, 10)
In [ ]:
# what will this do?
cursor = restaurants.aggregate(
[
{"$group": {"_id": {"cuisine": "$cuisine"}, "count": {"$sum": 1}}},
{"$sort": {"count": -1}}
]
)
In [ ]:
print_my_docs(cursor, 5)
In [ ]:
# what will this do?
cursor = restaurants.aggregate(
[
{"$group": {"_id": {"zip": "$address.zipcode"}, "count": {"$sum": 1}}},
{"$sort": {"count": -1}}
]
)
In [ ]:
print_my_docs(cursor, 5)
In [ ]:
# what will this do?
cursor = restaurants.aggregate(
[
{"$group": {"_id": {"cuisine": "$cuisine", "zip": "$address.zipcode"}, "count": {"$sum": 1}}},
{"$sort": {"count": -1}}
]
)
In [ ]:
print_my_docs(cursor, 5)
In [ ]:
# what will this do?
cursor = restaurants.aggregate(
[
{"$group": {"_id": {"cuisine": "$cuisine", "zip": "$address.zipcode"}, "count": {"$sum": 1}}},
{"$sort": {"count": -1}},
{"$limit": 10} # See comment under "In-class questions"
]
)
for doc in cursor:
    print(doc["_id"]["cuisine"], doc["_id"]["zip"], doc["count"])
We can use the $out operator in a final stage to store the result of a query into a new collection. The following example selects restaurants from Manhattan and stores them in their own collection in the same database.
In [ ]:
restaurants.aggregate(
[
{"$match": {"borough": "Manhattan"}},
{"$out": "manhattan"}
]
)
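We can verify that the new collection was created by peeking at it:
In [ ]:
# the "manhattan" collection now lives in the same database
db.manhattan.find_one()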
SQL query
SELECT COUNT(*) AS count
FROM restaurants
In [ ]:
cursor = restaurants.aggregate(
[
{"$group": {"_id": None, "count": {"$sum": 1}} }
]
)
SQL query
SELECT borough, cuisine, COUNT(*) as count
FROM restaurants
GROUP BY borough, cuisine
In [ ]:
cursor = restaurants.aggregate(
[
{"$group": {"_id": {"borough": "$borough", "cuisine": "$cuisine"}, "count": {"$sum": 1}}}
]
)
SQL query
SELECT borough, cuisine, COUNT(*) as count
FROM restaurants
GROUP BY borough, cuisine
HAVING COUNT(*) > 3
In [ ]:
cursor = restaurants.aggregate(
[
{"$group": {"_id": {"borough": "$borough", "cuisine": "$cuisine"}, "count": {"$sum": 1}}},
{"$match": {"count": {"$gt": 3}}}
]
)
SQL Query
SELECT zipcode, cuisine, COUNT(*) as count
FROM restaurants
WHERE borough = "Manhattan"
GROUP BY zipcode, cuisine
HAVING COUNT(*) > 3
In [ ]:
cursor = restaurants.aggregate(
[
{"$match": {"borough": "Manhattan"}},
{"$group": {"_id": {"zipcode": "$address.zipcode", "cuisine": "$cuisine"}, "count": {"$sum": 1}}},
{"$match": {"count": {"$gt": 3}}}
]
)
In [ ]:
print_my_docs(cursor, 5)
SQL Query
SELECT zipcode, cuisine, COUNT(*) as count
FROM restaurants
WHERE borough = "Manhattan"
GROUP BY zipcode, cuisine
HAVING COUNT(*) > 3
ORDER BY count
In [ ]:
cursor = restaurants.aggregate(
[
{"$match": {"borough": "Manhattan"}},
{"$group": {"_id": {"zipcode": "$address.zipcode", "cuisine": "$cuisine"}, "count": {"$sum": 1}}},
{"$match": {"count": {"$gt": 3}}},
{"$sort": {"count": 1}}
]
)
In [ ]:
cursor = restaurants.aggregate(
[
{"$match": {"borough": "Manhattan"}},
{"$group": {"_id": {"zipcode": "$address.zipcode", "cuisine": "$cuisine"}, "count": {"$sum": 1}}},
{"$match": {"count": {"$gt": 3}}},
{"$sort": {"count": 1}}
],
allowDiskUse = True # this can be useful when data does not fit in memory, e.g., to perform external sorting
)
In [ ]:
# note that the argument is a list of (field, direction) tuples
# [(<field>, <direction>), ...]
# here, we specify only one such tuple, for a single field
restaurants.create_index([("borough", pm.ASCENDING)])
The index is created only if it does not already exist.
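We can list the indexes that currently exist on the collection, together with their auto-generated names; these names (e.g. 'borough_1') are what we pass to drop_index() below.
In [ ]:
restaurants.index_information()  # dictionary mapping index names to their specifications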
In [ ]:
# compound index (more than one indexed fields)
restaurants.create_index([
("cuisine", pm.ASCENDING),
("address.zipcode", pm.DESCENDING)
])
In [ ]:
restaurants.drop_index('borough_1') # drop this index
In [ ]:
restaurants.drop_index('cuisine_1_address.zipcode_-1') # drop that index
In [ ]:
restaurants.drop_indexes() # drop all indexes (use with care!)
In [ ]:
restaurants.find_one()
In [ ]:
restaurants.create_index([("address.coord", 1)])
In [ ]:
restaurants.create_index([("grades.score", 1)])
In [ ]:
restaurants.create_index([("grades.grade", 1), ("grades.score", 1)])
The following will not work! A compound multikey index cannot contain more than one array-valued field per document (MongoDB cannot index "parallel arrays"); here, address.coord and grades are two different arrays.
In [ ]:
restaurants.create_index([("address.coord", 1), ("grades.score", 1)]) # NOPE!
We can retrieve the execution plan for a find() query by calling the explain() function on the result cursor. We demonstrate this in the following example.
In [ ]:
restaurants.drop_indexes() # we drop all indexes first -- use this with care!
restaurants.create_index([("borough", pm.ASCENDING)]) # build an index on field "borough", in ascending order
my_cursor = restaurants.find({"borough": "brooklyn"}) # submit query to find restaurants from specific borough
my_cursor.explain()["queryPlanner"]["winningPlan"] # ask mongodb to explain execution plan
As we see in this example, MongoDB makes use of an index (as indicated by the keyword "IXSCAN") -- in particular, the index ('borough_1') we just built. (Note that borough values in the dataset are capitalized, e.g. "Brooklyn", so this particular query matches no documents; the execution plan is chosen the same way regardless.) What if we had not built this index?
In [ ]:
restaurants.drop_indexes() # we drop all indexes first -- use this with care!
my_cursor = restaurants.find({"borough": "brooklyn"}) # submit query to find restaurants from specific borough
my_cursor.explain()["queryPlanner"]["winningPlan"] # ask mongodb to explain execution plan
In that case, MongoDB simply performs a scan over the collection (as indicated by keyword "COLLSCAN").
One way to combine two result sets on the client side is with nested loops over two cursors -- a "join" performed in the application rather than in the database.
In [ ]:
for a in restaurants.find({"borough": "Manhattan"}).limit(7):
    for b in restaurants.find({"borough": "Bronx"}).limit(5):
        if a["cuisine"] == b["cuisine"]:
            print(a["cuisine"], a["address"]["zipcode"], b["address"]["zipcode"])
$lookup is an aggregation stage (introduced in MongoDB 3.2) that implements left outer equi-joins.
"A left outer equi-join produces a result set that contains data for all documents from the left table (collection) together with data from the right table (collection) for documents where there is a match with documents from the left table (collection)."
In [ ]:
# create first collection
orders_docs = [{ "_id" : 1, "item" : "abc", "price" : 12, "quantity" : 2 },
{ "_id" : 2, "item" : "jkl", "price" : 20, "quantity" : 1 },
{ "_id" : 3 }]
orders = db.orders
orders.drop()
orders.insert_many(orders_docs)
In [ ]:
# create second collection
inventory_docs = [
{ "_id" : 1, "item" : "abc", "description": "product 1", "instock" : 120 },
{ "_id" : 2, "item" : "def", "description": "product 2", "instock" : 80 },
{ "_id" : 3, "item" : "ijk", "description": "product 3", "instock" : 60 },
{ "_id" : 4, "item" : "jkl", "description": "product 4", "instock" : 70 },
{ "_id" : 5, "item": None, "description": "Incomplete" },
{ "_id" : 6 }
]
inventory = db.inventory
inventory.drop()
inventory.insert_many(inventory_docs)
In [ ]:
result = orders.aggregate([ # "orders" is the outer collection
{
"$lookup":
{
"from": "inventory", # the inner collection
"localField": "item", # the join field of the outer collection
"foreignField": "item", # the join field of the outer collection
"as": "inventory_docs" # name of field with array of joined inner docs
}
}
])
In [ ]:
print_my_docs(result, 10)
Two approaches (if you can think of a different approach, please let me know): one uses the $not operator together with $elemMatch, the other uses the aggregation framework.
To provide an example, let's say we want to find restaurants with 'A' grades only. Below we show how we can use each of the aforementioned approaches.
In [ ]:
# using the $not operator
# "find restaurants that contain no grades that are not equal to A"
cursor = restaurants.find({"grades.grade": {"$exists": True}, "grades": {"$not": {"$elemMatch": {"grade": {"$ne": "A"}}}}})
In [ ]:
print_my_docs(cursor, 3)
In [ ]:
# simple example of a collection
mycoll = db.mycoll
mycoll.drop()
# insert five documents
mycoll.insert_one({"grades": [7, 7]})
mycoll.insert_one({"grades": [7, 3]})
mycoll.insert_one({"grades": [3, 3]})
mycoll.insert_one({"grades": []})
mycoll.insert_one({})
The result of the following query contains documents that do not contain the "grades" field.
In [ ]:
# find documents that have no "grades" elements that are not equal to 7
mycursor = mycoll.find({"grades": {"$not": {"$elemMatch": {"$ne": 7}}}})
print_my_docs(mycursor, 10)
We can remove such documents from the result as a post-processing step. (Exercise: how?)
In [ ]:
# using aggregation
mycursor = restaurants.aggregate(
[
# unwind the grades array
{"$unwind": "$grades"}, #now each document contains one "grades" value
# group by document "_id" and count:
# (i) the total number of documents in each group as `count`
# -- this is the same as the number of elements in the original array
# (ii) the number of documents that satisfy the condition (grade = "A") as `num_satisfied`
{"$group": {"_id": "$_id", "count": {"$sum": 1}, "num_satisfied": {"$sum": {"$cond": [{"$eq": ["$grades.grade", "A"]}, 1, 0]}}}},
# create a field (named `same_count`) that is 1 if (count = num_satisfied) and 0 otherwise
{"$project": {"_id": 1, "same_count": {"$cond": [{"$eq": ["$count", "$num_satisfied"]} , 1, 0]}}},
# keep only the document ids for which (same_count = 1)
{"$match": {"same_count": 1}}
]
)
print_my_docs(mycursor, 5)
The question was asked in relation to the "limit" query we saw above ("Limiting the number of results").
Indeed, MongoDB does optimize the execution of the aggregation pipeline, as explained here. In relation to the aforementioned query, see, in particular, the part on sort+limit coalescence.
We used and consulted material from:
In [ ]:
import pymongo as pm
client = pm.MongoClient()
client.drop_database("moderndb")  # start from a clean slate: remove any previous contents of the "moderndb" database
In [ ]:
import bson.son as son  # SON is an ordered dictionary, used above to control the order of sort keys