Intro to MongoDB

  • Makes sense for data science applications
  • widely used in big data
  • document database or associative array
    • python dictionary

Resources

Why MongoDB

  • flexible schema
  • oriented towards programmers
    • data stored mapped directly to data formats in popular programming languages
    • drivers for most popular programming languages
  • flexible deployments
    • single
    • cluster
    • many choices in between these 2
  • designed for big data
    • horizontally scalable
    • on commodity hardware
    • native support for map reduce
    • has aggregation framework

You can download MongoDB for your platform from the official MongoDB page. You can also read specific MongoDB installation instructions.

You do not need to install MongoDB on your machine for most of the exercises in this course, however for best learning outcome we recommend that you do it. It's fast and easy!

MongoDB has a lot of drivers and client libraries. The one we will be using in this course is PyMongo. See the official documentation for PyMongo installation instructions.


In [1]:
import pprint

def get_client():
    from pymongo import MongoClient
    return MongoClient('mongodb://localhost:27017/')
    
def get_db():    
    # 'examples' here is the database name. It will be created if it does not exist.
    db = get_client().examples
    return db

In [2]:
def add_city(db):
    db.cities.insert_one({"name" : "Chicago"})
    
def get_city(db):
    return db.cities.find_one()

In [3]:
db = get_db()
#add_city(db)

for city in db.cities.find():
    pprint.pprint(city)


{u'_id': ObjectId('57bce0acb80b361733b572c8'), u'name': u'Chicago'}
{u'_id': ObjectId('57bce235b80b361733b572c9'),
 u'isOld': True,
 u'name': u'London'}
{u'_id': ObjectId('57bce25ab80b361733b572ca'),
 u'isNew': True,
 u'name': u'New York'}

Flexible Schema

  • usually
    • some documents or entries will have fields that others do not
    • schema evolves with time
  • MongoDB's indexing system and query execution system take this into account - missing fields, arrays vs values

PyMongo

  • driver/client library for python
  • basic job is to maintain a connection to db and allow you to work with your data in a very natural way

  • We will start mongoDB and process is mongod
  • we will have a python app with pymongo module
  • pymongo talks with mongodb using a wire protocol in BSON format
  • MongoDB ensures that each entry will have an _id field. If we don't specify one it will insert one for us

Multiple queries


In [4]:
#db.cities.insert_one({"name": "London", "isOld": True})

In [5]:
#db.cities.insert_one({"name": "New York", "isNew": True})

In [6]:
def find_in_cities(query):
    for city in db.cities.find(query):
        pprint.pprint(city)

In [7]:
find_in_cities({"name": "London", "isOld": False})

In [8]:
find_in_cities({"name": "London", "isOld": True})


{u'_id': ObjectId('57bce235b80b361733b572c9'),
 u'isOld': True,
 u'name': u'London'}

projections


In [9]:
query = {"name": "London", "isOld": True}
projection = {"_id": 0, "name": 1}
for city in db.cities.find(query, projection):
    pprint.pprint(city)


{u'name': u'London'}

In [10]:
db.cities.count()


Out[10]:
3

In [11]:
list(city for city in db.cities.find())


Out[11]:
[{u'_id': ObjectId('57bce0acb80b361733b572c8'), u'name': u'Chicago'},
 {u'_id': ObjectId('57bce235b80b361733b572c9'),
  u'isOld': True,
  u'name': u'London'},
 {u'_id': ObjectId('57bce25ab80b361733b572ca'),
  u'isNew': True,
  u'name': u'New York'}]

Getting Data into MongoDB


In [12]:
from pymongo import MongoClient
import csv
import json
import io
import re
import pprint


field_map = {
    "name" : "name",
    "bodyStyle_label" : "bodyStyle",
    "assembly_label" : "assembly",
    "class_label" : "class",
    "designer_label" : "designer",
    "engine_label" : "engine",
    "length" : "length",
    "height" : "height",
    "width" : "width",
    "weight" : "weight",
    "wheelbase" : "wheelbase",
    "layout_label" : "layout",
    "manufacturer_label" : "manufacturer",
    "modelEndYear" : "modelEndYear",
    "modelStartYear" : "modelStartYear",
    "predecessorLabel" : "predecessorLabel",
    "productionStartYear" : "productionStartYear",
    "productionEndYear" : "productionEndYear",
    "transmission" : "transmission"
}
fields = field_map.keys()


def skip_lines(input_file, skip):
    for i in range(0, skip):
        next(input_file)

def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

def strip_automobile(v):
    return re.sub(r"\s*\(automobile\)\s*", " ", v)

def strip_city(v):
    return re.sub(r"\s*\(city\)\s*", " ", v)

def parse_array(v):
    if (v[0] == "{") and (v[-1] == "}"):
        v = v.lstrip("{")
        v = v.rstrip("}")
        v_array = v.split("|")
        v_array = [i.strip() for i in v_array]
        return v_array
    return v

def mm_to_meters(v):
    if v < 0.01:
        return v * 1000
    return v

def clean_dimension(d, field, v):
    if is_number(v):
        if field == "weight":
            d[field] = float(v) / 1000.0
        else:
            d[field] = mm_to_meters(float(v))
    
def clean_year(d, field, v):
    d[field] = v[0:4]

def parse_array2(v):
    if (v[0] == "{") and (v[-1] == "}"):
        v = v.lstrip("{")
        v = v.rstrip("}")
        v_array = v.split("|")
        v_array = [i.strip() for i in v_array]
        return (True, v_array)
    return (False, v)

def ensure_not_array(v):
    (is_array, v) = parse_array(v)
    if is_array:
        return v[0]
    return v

def ensure_array(v):
    (is_array, v) = parse_array2(v)
    if is_array:
        return v
    return [v]

def ensure_float(v):
    if is_number(v):
        return float(v)

def ensure_int(v):
    if is_number(v):
        return int(v)

def ensure_year_array(val):
    #print "val:", val
    vals = ensure_array(val)
    year_vals = []
    for v in vals:
        v = v[0:4]
        v = int(v)
        if v:
            year_vals.append(v)
    return year_vals

def empty_val(val):
    val = val.strip()
    return (val == "NULL") or (val == "")

def years(row, start_field, end_field):
    start_val = row[start_field]
    end_val = row[end_field]

    if empty_val(start_val) or empty_val(end_val):
        return []

    start_years = ensure_year_array(start_val)
    if start_years:
        start_years = sorted(start_years)
    end_years = ensure_year_array(end_val)
    if end_years:
        end_years = sorted(end_years)
    all_years = []
    if start_years and end_years:
        #print start_years
        #print end_years
        for i in range(0, min(len(start_years), len(end_years))):
            for y in range(start_years[i], end_years[i]+1):
                all_years.append(y)
    return all_years


def process_file_autos(input_file):
    input_data = csv.DictReader(open(input_file))
    autos = []
    skip_lines(input_data, 3)
    for row in input_data:
        auto = {}
        model_years = {}
        production_years = {}
        dimensions = {}
        for field, val in row.iteritems():
            if field not in fields or empty_val(val):
                continue
            if field in ["bodyStyle_label", "class_label", "layout_label"]:
                val = val.lower()
            val = strip_automobile(val)
            val = strip_city(val)
            val = val.strip()
            val = parse_array(val)
            if field in ["length", "width", "height", "weight", "wheelbase"]:
                clean_dimension(dimensions, field_map[field], val)
            elif field in ["modelStartYear", "modelEndYear"]:
                clean_year(model_years, field_map[field], val)
            elif field in ["productionStartYear", "productionEndYear"]:
                clean_year(production_years, field_map[field], val)
            else:
                auto[field_map[field]] = val
        if dimensions:
            auto['dimensions'] = dimensions
        auto['modelYears'] = years(row, 'modelStartYear', 'modelEndYear')
        auto['productionYears'] = years(row, 'productionStartYear', 'productionEndYear')
        autos.append(auto)
    return autos

In [13]:
def insert_autos(infile, db):
    data = process_file_autos(infile)
    # Add your code here. Insert the data in one command.
    db.autos.insert_many(data)

In [14]:
db = get_db()
#insert_autos('../autos.csv', db)
db.autos.find_one()


Out[14]:
{u'_id': ObjectId('57bd7f3eb80b36371c48fb3f'),
 u'assembly': [u'Hethel', u'Norfolk', u'United Kingdom'],
 u'bodyStyle': u'roadster',
 u'class': u'sports car',
 u'designer': u'Colin Chapman',
 u'layout': u'front-engine rear-wheel-drive layout',
 u'manufacturer': u'Lotus Cars',
 u'modelYears': [],
 u'name': [u'Lotus 7', u'Lotus Seven'],
 u'productionYears': [1957,
  1958,
  1959,
  1960,
  1961,
  1962,
  1963,
  1964,
  1965,
  1966,
  1967,
  1968,
  1969,
  1970,
  1971,
  1972],
 u'transmission': u'manual'}

In [15]:
db.autos.count()


Out[15]:
43

Using mongoimport

Documentation for mongoimport can be found here.

The following command is used

mongoimport -db dbname -c collectionname --file input-file.json mongoimport --help

If no hostname and credentials are supplied, mongoimport will try to connect to the default localhost:27017

Operators

  • same idea as in programming languages
  • same syntax as field names
  • distinguish operators from fields using $

Range Queries

  • Inequality operators
    • $gt
    • $lt
    • $gte
    • $lte
    • $ne

Using dot in the names we can query nested documents


In [16]:
list(city for city in db.autos.find())


Out[16]:
[{u'_id': ObjectId('57bd7f3eb80b36371c48fb3f'),
  u'assembly': [u'Hethel', u'Norfolk', u'United Kingdom'],
  u'bodyStyle': u'roadster',
  u'class': u'sports car',
  u'designer': u'Colin Chapman',
  u'layout': u'front-engine rear-wheel-drive layout',
  u'manufacturer': u'Lotus Cars',
  u'modelYears': [],
  u'name': [u'Lotus 7', u'Lotus Seven'],
  u'productionYears': [1957,
   1958,
   1959,
   1960,
   1961,
   1962,
   1963,
   1964,
   1965,
   1966,
   1967,
   1968,
   1969,
   1970,
   1971,
   1972],
  u'transmission': u'manual'},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb40'),
  u'class': u'luxury vehicle',
  u'dimensions': {u'height': 1.58,
   u'length': 5.6,
   u'weight': 2050.0,
   u'wheelbase': 3.25,
   u'width': 2.0},
  u'engine': [u'Chaika_(car)__1', u'Chaika_(car)__8-cylinder__1'],
  u'layout': u'front-engine rear-wheel-drive layout',
  u'manufacturer': u'GAZ',
  u'modelYears': [],
  u'name': u'M13 Chaika',
  u'productionYears': [1959,
   1960,
   1961,
   1962,
   1963,
   1964,
   1965,
   1966,
   1967,
   1968,
   1969,
   1970,
   1971,
   1972,
   1973,
   1974,
   1975,
   1976,
   1977,
   1978,
   1979,
   1980,
   1981],
  u'transmission': u'3-speed automatic'},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb41'),
  u'assembly': [u'Finland',
   u'Kazakhstan',
   u'Montevideo',
   u'Oskemen',
   u'Russia',
   u'Tolyatti',
   u'Ukraine',
   u'Uruguay',
   u'Uusikaupunki',
   u'Zaporizhia'],
  u'bodyStyle': [u'hatchback', u'sedan'],
  u'class': u'economy car',
  u'dimensions': {u'height': 1.335,
   u'length': 4.005,
   u'wheelbase': 2.46,
   u'width': 1.65},
  u'layout': u'front-engine front-wheel-drive layout',
  u'manufacturer': u'Lada',
  u'modelYears': [],
  u'name': [u'Bognor Sagona (ROU)',
   u'Lada Forma (Sedan)',
   u'Lada Sagona (Sedan)',
   u'Lada Samara',
   u'Lada Samara 2',
   u'Lada Sputnik',
   u'VAZ 2108/2109/21099',
   u'VAZ 2113/2114/2115'],
  u'productionYears': [2108]},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb42'),
  u'assembly': [u'Eisenach', u'Germany'],
  u'bodyStyle': u'roadster',
  u'class': u'sports car',
  u'designer': [u'Ernst Loof', u'Fritz Fiedler', u'Peter_Szymanowski'],
  u'dimensions': {u'height': 1.4,
   u'length': 3.9,
   u'weight': 830.0,
   u'wheelbase': 2.4,
   u'width': 1.55},
  u'engine': u'BMW_328__1',
  u'layout': u'front-engine rear-wheel-drive layout',
  u'manufacturer': u'BMW',
  u'modelYears': [],
  u'name': u'BMW 328',
  u'productionYears': [1936, 1937, 1938, 1939, 1940],
  u'transmission': u'4-speed manual'},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb43'),
  u'bodyStyle': u'fire apparatus',
  u'class': [u'commercial vehicle', u'emergency vehicle', u'military vehicle'],
  u'engine': u'Green_Goddess__1',
  u'layout': [u'automobile layout',
   u'four-wheel drive',
   u'longitudinal engine'],
  u'manufacturer': u'Bedford Vehicles',
  u'modelYears': [],
  u'name': [u'Bedford RLHZ Self Propelled Pump', u'Green Goddess'],
  u'productionYears': [1953, 1954, 1955, 1956],
  u'transmission': u'4-speed manual'},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb44'),
  u'assembly': [u'Germany', u'Leipzig', u'Slovakia'],
  u'class': u'mid-size car',
  u'layout': [u'four-wheel drive', u'front-engine design'],
  u'manufacturer': u'Porsche',
  u'modelYears': [],
  u'name': [u'Porsche 955/957/958', u'Porsche Cayenne'],
  u'productionYears': []},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb45'),
  u'assembly': u'Fremont California',
  u'bodyStyle': u'hatchback',
  u'class': u'compact car',
  u'dimensions': {u'wheelbase': 2.60096},
  u'layout': [u'four-wheel drive',
   u'front-engine design',
   u'front-wheel drive'],
  u'manufacturer': u'NUMMI',
  u'modelYears': [2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010],
  u'name': [u'Pontiac Vibe', u'Toyota Matrix', u'Toyota Voltz'],
  u'productionYears': [2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009]},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb46'),
  u'class': [u'full-size car', u'mid-size car'],
  u'manufacturer': u'Ford Motor Company',
  u'modelYears': [],
  u'name': u'Ford Taurus',
  u'productionYears': []},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb47'),
  u'assembly': [u'Ford Dagenham', u'Halewood'],
  u'class': u'subcompact car',
  u'layout': u'front-engine rear-wheel-drive layout',
  u'manufacturer': u'Ford of Britain',
  u'modelYears': [],
  u'name': u'Ford Anglia',
  u'productionYears': [1939,
   1940,
   1941,
   1942,
   1943,
   1944,
   1945,
   1946,
   1947,
   1948,
   1949,
   1950,
   1951,
   1952,
   1953,
   1954,
   1955,
   1956,
   1957,
   1958,
   1959,
   1960,
   1961,
   1962,
   1963,
   1964,
   1965,
   1966,
   1967]},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb48'),
  u'class': [u'sport compact', u'sports car'],
  u'manufacturer': u'Toyota',
  u'modelYears': [1971],
  u'name': u'test 680',
  u'productionYears': [1970,
   1971,
   1972,
   1973,
   1974,
   1975,
   1976,
   1977,
   1978,
   1979,
   1980,
   1981,
   1982,
   1983,
   1984,
   1985,
   1986,
   1987,
   1988,
   1989,
   1990,
   1991,
   1992,
   1993,
   1994,
   1995,
   1996,
   1997,
   1998,
   1999,
   2000,
   2001,
   2002,
   2003,
   2004,
   2005,
   2006]},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb49'),
  u'assembly': [u'Fourchambault', u'France'],
  u'bodyStyle': u'cabrio coach',
  u'class': u'microcar',
  u'dimensions': {u'height': 1.27,
   u'length': 2.85,
   u'weight': 375.0,
   u'wheelbase': 1.693,
   u'width': 1.1},
  u'engine': u'Vespa_400__1',
  u'layout': u'rear-engine rear-wheel-drive layout',
  u'manufacturer': u"ACMA (Ateliers de construction de motocycles et d'automobiles)",
  u'modelYears': [],
  u'name': u'Vespa 400',
  u'productionYears': [1957, 1958, 1959, 1960, 1961],
  u'transmission': u'3-speed manual'},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb4a'),
  u'class': u'full-size car',
  u'layout': u'front-engine rear-wheel-drive layout',
  u'manufacturer': u'Ford Motor Company',
  u'modelYears': [],
  u'name': u'Edsel',
  u'productionYears': []},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb4b'),
  u'dimensions': {u'wheelbase': 0.0845},
  u'engine': u'Reliant_Fox__1',
  u'manufacturer': u'Reliant',
  u'modelYears': [],
  u'name': u'Reliant Fox',
  u'productionYears': [1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990],
  u'transmission': u'4 speed manual'},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb4c'),
  u'class': u'family car',
  u'layout': u'front-engine rear-wheel-drive layout',
  u'manufacturer': [u'Ford Motor Company', u'Hyundai Motor Company'],
  u'modelYears': [],
  u'name': [u'Ford Consul Cortina', u'Ford Cortina'],
  u'productionYears': [1962,
   1963,
   1964,
   1965,
   1966,
   1967,
   1968,
   1969,
   1970,
   1971,
   1972,
   1973,
   1974,
   1975,
   1976,
   1977,
   1978,
   1979,
   1980,
   1981,
   1982]},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb4d'),
  u'class': u'muscle car',
  u'layout': u'front-engine rear-wheel-drive layout',
  u'manufacturer': [u'Holden', u'Pontiac'],
  u'modelYears': [],
  u'name': u'Pontiac GTO',
  u'productionYears': [1964]},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb4e'),
  u'assembly': [u'Cheshire', u'Ellesmere Port'],
  u'class': u'compact car',
  u'manufacturer': u'Vauxhall Motors',
  u'modelYears': [],
  u'name': u'Vauxhall Viva',
  u'productionYears': [1,
   2,
   3,
   4,
   5,
   6,
   7,
   8,
   9,
   10,
   11,
   12,
   13,
   14,
   15,
   16,
   17,
   18,
   19,
   20,
   21,
   22,
   23,
   24,
   25,
   26,
   27,
   28,
   29,
   30,
   31,
   32,
   33,
   34,
   35,
   36,
   37,
   38,
   39,
   40,
   41,
   42,
   43,
   44,
   45,
   46,
   47,
   48,
   49,
   50,
   51,
   52,
   53,
   54,
   55,
   56,
   57,
   58,
   59,
   60,
   61,
   62,
   63,
   64,
   65,
   66,
   67,
   68,
   69,
   70,
   71,
   72,
   73,
   74,
   75,
   76,
   77,
   78,
   79,
   80,
   81,
   82,
   83,
   84,
   85,
   86,
   87,
   88,
   89,
   90,
   91,
   92,
   93,
   94,
   95,
   96,
   97,
   98,
   99,
   100,
   101,
   102,
   103,
   104,
   105,
   106,
   107,
   108,
   109,
   110,
   111,
   112,
   113,
   114,
   115,
   116,
   117,
   118,
   119,
   120,
   121,
   122,
   123,
   124,
   125,
   126,
   127,
   128,
   129,
   130,
   131,
   132,
   133,
   134,
   135,
   136,
   137,
   138,
   139,
   140,
   141,
   142,
   143,
   144,
   145,
   146,
   147,
   148,
   149,
   150,
   151,
   152,
   153,
   154,
   155,
   156,
   157,
   158,
   159,
   160,
   161,
   162,
   163,
   164,
   165,
   166,
   167,
   168,
   169,
   170,
   171,
   172,
   173,
   174,
   175,
   176,
   177,
   178,
   179,
   180,
   181,
   182,
   183,
   184,
   185,
   186,
   187,
   188,
   189,
   190,
   191,
   192,
   193,
   194,
   195,
   196,
   197,
   198,
   199,
   200,
   201,
   202,
   203,
   204,
   205,
   206,
   207,
   208,
   209,
   210,
   211,
   212,
   213,
   214,
   215,
   216,
   217,
   218,
   219,
   220,
   221,
   222,
   223,
   224,
   225,
   226,
   227,
   228,
   229,
   230,
   231,
   232,
   233,
   234,
   235,
   236,
   237,
   238,
   239,
   240,
   241,
   242,
   243,
   244,
   245,
   246,
   247,
   248,
   249,
   250,
   251,
   252,
   253,
   254,
   255,
   256,
   257,
   258,
   259,
   260,
   261,
   262,
   263,
   264,
   265,
   266,
   267,
   268,
   269,
   270,
   271,
   272,
   273,
   274,
   275,
   276,
   277,
   278,
   279,
   280,
   281,
   282,
   283,
   284,
   285,
   286,
   287,
   288,
   289,
   290,
   291,
   292,
   293,
   294,
   295,
   296,
   297,
   298,
   299,
   300,
   301,
   302,
   303,
   304,
   305,
   306,
   307,
   308,
   309,
   310,
   311,
   312,
   313,
   314,
   315,
   316,
   317,
   318,
   319,
   320,
   321,
   322,
   323,
   324,
   325,
   326,
   327,
   328,
   329,
   330,
   331,
   332,
   333,
   334,
   335,
   336,
   337,
   338,
   339,
   340,
   341,
   342,
   343,
   344,
   345,
   346,
   347,
   348,
   349,
   350,
   351,
   352,
   353,
   354,
   355,
   356,
   357,
   358,
   359,
   360,
   361,
   362,
   363,
   364,
   365,
   366,
   367,
   368,
   369,
   370,
   371,
   372,
   373,
   374,
   375,
   376,
   377,
   378,
   379,
   380,
   381,
   382,
   383,
   384,
   385,
   386,
   387,
   388,
   389,
   390,
   391,
   392,
   393,
   394,
   395,
   396,
   397,
   398,
   399,
   400,
   401,
   402,
   403,
   404,
   405,
   406,
   407,
   408,
   409,
   410,
   411,
   412,
   413,
   414,
   415,
   416,
   417,
   418,
   419,
   420,
   421,
   422,
   423,
   424,
   425,
   426,
   427,
   428,
   429,
   430,
   431,
   432,
   433,
   434,
   435,
   436,
   437,
   438,
   439,
   440,
   441,
   442,
   443,
   444,
   445,
   446,
   447,
   448,
   449,
   450,
   451,
   452,
   453,
   454,
   455,
   456,
   457,
   458,
   459,
   460,
   461,
   462,
   463,
   464,
   465,
   466,
   467,
   468,
   469,
   470,
   471,
   472,
   473,
   474,
   475,
   476,
   477,
   478,
   479,
   480,
   481,
   482,
   483,
   484,
   485,
   486,
   487,
   488,
   489,
   490,
   491,
   492,
   493,
   494,
   495,
   496,
   497,
   498,
   499,
   500,
   501,
   502,
   503,
   504,
   505,
   506,
   507,
   508,
   509,
   510,
   511,
   512,
   513,
   514,
   515,
   516,
   517,
   518,
   519,
   520,
   521,
   522,
   523,
   524,
   525,
   526,
   527,
   528,
   529,
   530,
   531,
   532,
   533,
   534,
   535,
   536,
   537,
   538,
   539,
   540,
   541,
   542,
   543,
   544,
   545,
   546,
   547,
   548,
   549,
   550,
   551,
   552,
   553,
   554,
   555,
   556,
   557,
   558,
   559,
   560,
   561,
   562,
   563,
   564,
   565,
   566,
   567,
   568,
   569,
   570,
   571,
   572,
   573,
   574,
   575,
   576,
   577,
   578,
   579,
   580,
   581,
   582,
   583,
   584,
   585,
   586,
   587,
   588,
   589,
   590,
   591,
   592,
   593,
   594,
   595,
   596,
   597,
   598,
   599,
   600,
   601,
   602,
   603,
   604,
   605,
   606,
   607,
   608,
   609,
   610,
   611,
   612,
   613,
   614,
   615,
   616,
   617,
   618,
   619,
   620,
   621,
   622,
   623,
   624,
   625,
   626,
   627,
   628,
   629,
   630,
   631,
   632,
   633,
   634,
   635,
   636,
   637,
   638,
   639,
   640,
   641,
   642,
   643,
   644,
   645,
   646,
   647,
   648,
   649,
   650,
   651,
   652,
   653,
   654,
   655,
   656,
   657,
   658,
   659,
   660,
   661,
   662,
   663,
   664,
   665,
   666,
   667,
   668,
   669,
   670,
   671,
   672,
   673,
   674,
   675,
   676,
   677,
   678,
   679,
   680,
   681,
   682,
   683,
   684,
   685,
   686,
   687,
   688,
   689,
   690,
   691,
   692,
   693,
   694,
   695,
   696,
   697,
   698,
   699,
   700,
   701,
   702,
   703,
   704,
   705,
   706,
   707,
   708,
   709,
   710,
   711,
   712,
   713,
   714,
   715,
   716,
   717,
   718,
   719,
   720,
   721,
   722,
   723,
   724,
   725,
   726,
   727,
   728,
   729,
   730,
   731,
   732,
   733,
   734,
   735,
   736,
   737,
   738,
   739,
   740,
   741,
   742,
   743,
   744,
   745,
   746,
   747,
   748,
   749,
   750,
   751,
   752,
   753,
   754,
   755,
   756,
   757,
   758,
   759,
   760,
   761,
   762,
   763,
   764,
   765,
   766,
   767,
   768,
   769,
   770,
   771,
   772,
   773,
   774,
   775,
   776,
   777,
   778,
   779,
   780,
   781,
   782,
   783,
   784,
   785,
   786,
   787,
   788,
   789,
   790,
   791,
   792,
   793,
   794,
   795,
   796,
   797,
   798,
   799,
   800,
   801,
   802,
   803,
   804,
   805,
   806,
   807,
   808,
   809,
   810,
   811,
   812,
   813,
   814,
   815,
   816,
   817,
   818,
   819,
   820,
   821,
   822,
   823,
   824,
   825,
   826,
   827,
   828,
   829,
   830,
   831,
   832,
   833,
   834,
   835,
   836,
   837,
   838,
   839,
   840,
   841,
   842,
   843,
   844,
   845,
   846,
   847,
   848,
   849,
   850,
   851,
   852,
   853,
   854,
   855,
   856,
   857,
   858,
   859,
   860,
   861,
   862,
   863,
   864,
   865,
   866,
   867,
   868,
   869,
   870,
   871,
   872,
   873,
   874,
   875,
   876,
   877,
   878,
   879,
   880,
   881,
   882,
   883,
   884,
   885,
   886,
   887,
   888,
   889,
   890,
   891,
   892,
   893,
   894,
   895,
   896,
   897,
   898,
   899,
   900,
   901,
   902,
   903,
   904,
   905,
   906,
   907,
   908,
   909,
   910,
   911,
   912,
   913,
   914,
   915,
   916,
   917,
   918,
   919,
   920,
   921,
   922,
   923,
   924,
   925,
   926,
   927,
   928,
   929,
   930,
   931,
   932,
   933,
   934,
   935,
   936,
   937,
   938,
   939,
   940,
   941,
   942,
   943,
   944,
   945,
   946,
   947,
   948,
   949,
   950,
   951,
   952,
   953,
   954,
   955,
   956,
   957,
   958,
   959,
   960,
   961,
   962,
   963,
   964,
   965,
   966,
   967,
   968,
   969,
   970,
   971,
   972,
   973,
   974,
   975,
   976,
   977,
   978,
   979,
   980,
   981,
   982,
   983,
   984,
   985,
   986,
   987,
   988,
   989,
   990,
   991,
   992,
   993,
   994,
   995,
   996,
   997,
   998,
   999,
   1000,
   ...]},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb4f'),
  u'class': u'sports car',
  u'manufacturer': u'Reliant',
  u'modelYears': [],
  u'name': u'Reliant Scimitar',
  u'productionYears': [1964,
   1965,
   1966,
   1967,
   1968,
   1969,
   1970,
   1971,
   1972,
   1973,
   1974,
   1975,
   1976,
   1977,
   1978,
   1979,
   1980,
   1981,
   1982,
   1983,
   1984,
   1985,
   1986]},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb50'),
  u'class': u'compact car',
  u'manufacturer': u'Vauxhall Motors',
  u'modelYears': [],
  u'name': u'Vauxhall Astra',
  u'productionYears': []},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb51'),
  u'assembly': [u'Surrey', u'Thames Ditton'],
  u'bodyStyle': u'roadster',
  u'layout': u'front-engine rear-wheel-drive layout',
  u'manufacturer': u'AC Cars',
  u'modelYears': [],
  u'name': [u'AC Cobra', u'Shelby Cobra'],
  u'productionYears': [1961,
   1962,
   1963,
   1964,
   1965,
   1966,
   1967,
   1968,
   1969,
   1970,
   1971,
   1972,
   1973,
   1974,
   1975,
   1976,
   1977,
   1978,
   1979,
   1980,
   1981,
   1982,
   1983,
   1984,
   1985,
   1986,
   1987,
   1988,
   1989,
   1990,
   1991,
   1992,
   1993,
   1994,
   1995,
   1996,
   1997,
   1998,
   1999,
   2000,
   2001,
   2002,
   2003,
   2004]},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb52'),
  u'assembly': u'Born Netherlands',
  u'bodyStyle': [u'sedan', u'station wagon'],
  u'engine': [u'Volvo_66__1', u'Volvo_66__2'],
  u'layout': u'front-engine rear-wheel-drive layout',
  u'manufacturer': u'Volvo Cars',
  u'modelYears': [],
  u'name': u'Volvo 66',
  u'productionYears': [1975, 1976, 1977, 1978, 1979, 1980],
  u'transmission': u'Variomatic'},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb53'),
  u'bodyStyle': u'station wagon',
  u'dimensions': {u'height': 1.7,
   u'length': 4.4,
   u'wheelbase': 2.6,
   u'width': 1.6},
  u'engine': u'Volvo_Duett__1',
  u'layout': u'front-engine rear-wheel-drive layout',
  u'manufacturer': u'Volvo Cars',
  u'modelYears': [],
  u'name': u'Volvo Duett',
  u'productionYears': [1953,
   1954,
   1955,
   1956,
   1957,
   1958,
   1959,
   1960,
   1961,
   1962,
   1963,
   1964,
   1965,
   1966,
   1967,
   1968,
   1969],
  u'transmission': u'M40 (four speed manual)'},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb54'),
  u'class': u'compact car',
  u'layout': u'front-engine front-wheel-drive layout',
  u'manufacturer': u'Saab Automobile',
  u'modelYears': [],
  u'name': [u'Saab 9-1', u'Saab 9-2'],
  u'productionYears': []},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb55'),
  u'class': u'compact executive car',
  u'layout': u'front-engine front-wheel-drive layout',
  u'manufacturer': u'Saab Automobile',
  u'modelYears': [],
  u'name': u'Saab 900',
  u'productionYears': [1978,
   1979,
   1980,
   1981,
   1982,
   1983,
   1984,
   1985,
   1986,
   1987,
   1988,
   1989,
   1990,
   1991,
   1992,
   1993,
   1994,
   1995,
   1996,
   1997,
   1998]},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb56'),
  u'bodyStyle': u'roadster',
  u'modelYears': [],
  u'productionYears': [1908,
   1909,
   1910,
   1911,
   1912,
   1913,
   1914,
   1915,
   1916,
   1917,
   1918,
   1919,
   1920,
   1921,
   1922,
   1923]},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb57'),
  u'class': u'full-size car',
  u'manufacturer': [u'Chevrolet', u'General Motors'],
  u'modelYears': [1958],
  u'name': u'Chevrolet Impala',
  u'productionYears': [1994]},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb58'),
  u'assembly': [u'Canada', u'Southwold Ontario', u'St. Thomas Assembly'],
  u'bodyStyle': u'sedan',
  u'class': u'full-size ford',
  u'layout': [u'body-on-frame', u'front-engine rear-wheel-drive layout'],
  u'manufacturer': u'Ford Motor Company',
  u'modelYears': [1992,
   1993,
   1994,
   1995,
   1996,
   1997,
   1998,
   1999,
   2000,
   2001,
   2002,
   2003,
   2004,
   2005,
   2006,
   2007,
   2008,
   2009,
   2010,
   2011,
   2012],
  u'name': u'Ford Crown Victoria',
  u'productionYears': [1955]},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb59'),
  u'engine': u'BMW_Dixi__1',
  u'layout': u'front-engine rear-wheel-drive layout',
  u'manufacturer': u'BMW',
  u'modelYears': [],
  u'name': u'Dixi',
  u'productionYears': [1927, 1928, 1929]},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb5a'),
  u'class': u'off-road vehicle',
  u'manufacturer': u'Toyota',
  u'modelYears': [],
  u'name': u'Toyota Land Cruiser',
  u'productionYears': []},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb5b'),
  u'class': u'compact car',
  u'manufacturer': u'Volkswagen',
  u'modelYears': [],
  u'name': u'Volkswagen Jetta',
  u'productionYears': []},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb5c'),
  u'dimensions': {u'wheelbase': 2.6924},
  u'modelYears': [],
  u'name': u'E-M-F 30',
  u'productionYears': [],
  u'transmission': u'3-speed sliding gear manual'},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb5d'),
  u'class': u'personal luxury car',
  u'layout': u'front-engine rear-wheel-drive layout',
  u'manufacturer': u'Ford Motor Company',
  u'modelYears': [],
  u'name': u'Ford Thunderbird',
  u'productionYears': [1955]},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb5e'),
  u'assembly': [u'Cambridge Ontario', u'Canada'],
  u'bodyStyle': u'hatchback',
  u'class': u'compact car',
  u'layout': [u'four-wheel drive',
   u'front-engine design',
   u'front-wheel drive'],
  u'manufacturer': u'Toyota',
  u'modelYears': [],
  u'name': u'Toyota Matrix',
  u'productionYears': []},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb60'),
  u'assembly': [u'Detroit', u'St. Louis'],
  u'class': u'muscle car',
  u'layout': u'front-engine rear-wheel-drive layout',
  u'manufacturer': u'Chrysler',
  u'modelYears': [],
  u'name': u'Plymouth Road Runner',
  u'productionYears': [1968,
   1969,
   1970,
   1971,
   1972,
   1973,
   1974,
   1975,
   1976,
   1977,
   1978,
   1979,
   1980]},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb61'),
  u'bodyStyle': u'hatchback',
  u'class': u'compact car',
  u'manufacturer': u'Lancia',
  u'modelYears': [],
  u'name': u'Lancia Delta',
  u'productionYears': [1979]},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb62'),
  u'assembly': [u'Kenosha Wisconsin',
   u'Mexico',
   u'Mexico City',
   u'United States'],
  u'bodyStyle': [u'hatchback', u'station wagon'],
  u'class': u'compact car',
  u'designer': u'Richard A. Teague',
  u'dimensions': {u'height': 1.34112,
   u'length': 4.36372,
   u'weight': 1360.8,
   u'wheelbase': 2.54,
   u'width': 1.96342},
  u'engine': [u'AMC_Pacer__1',
   u'AMC_Pacer__2',
   u'AMC_Pacer__3',
   u'AMC_Pacer__4'],
  u'layout': u'front-engine rear-wheel-drive layout',
  u'manufacturer': u'American Motors',
  u'modelYears': [],
  u'name': u'AMC Pacer',
  u'productionYears': [1975, 1976, 1977, 1978, 1979, 1980],
  u'transmission': [u'3-speed automatic',
   u'3-speed manual',
   u'3-speed with overdrive',
   u'4-speed manual']},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb63'),
  u'assembly': [u'Canada', u'Oakville Ontario'],
  u'class': u'minivan',
  u'designer': u'Moray Callum',
  u'layout': u'front-engine front-wheel-drive layout',
  u'manufacturer': u'Ford Motor Company',
  u'modelYears': [],
  u'name': u'Ford Windstar',
  u'productionYears': [1994]},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb64'),
  u'bodyStyle': u'station wagon',
  u'class': u'station wagon',
  u'dimensions': {u'length': 5.01904, u'wheelbase': 2.9337, u'width': 1.92786},
  u'engine': [u'Ford_Parklane__1', u'Ford_Parklane__2'],
  u'modelYears': [],
  u'name': u'Ford Parklane',
  u'productionYears': [1956],
  u'transmission': [u'3-speed manual', u'Ford-O-Matic automatic']},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb65'),
  u'assembly': [u'Atlanta',
   u'Canada',
   u'Chicago',
   u'Georgia (U.S. state)',
   u'Illinois',
   u'Lorain Ohio',
   u'Oakville Ontario',
   u'Ohio',
   u'Ontario',
   u'United States'],
  u'class': u'mid-size car',
  u'layout': u'front-engine rear-wheel-drive layout',
  u'manufacturer': u'Ford Motor Company',
  u'modelYears': [],
  u'name': u'Ford Torino',
  u'productionYears': [1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976]},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb66'),
  u'class': [u'compact car', u'full-size car', u'mid-size car'],
  u'manufacturer': u'Nissan Motor Company',
  u'modelYears': [],
  u'productionYears': []},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb67'),
  u'assembly': [u'Fenton Missouri', u'Windsor Ontario'],
  u'class': u'minivan',
  u'layout': [u'front-engine four-wheel-drive layout',
   u'front-engine front-wheel-drive layout'],
  u'manufacturer': u'Chrysler',
  u'modelYears': [],
  u'name': u'Dodge Caravan/Dodge Grand Caravan',
  u'productionYears': []},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb68'),
  u'assembly': [u'Mexico', u'Silao'],
  u'bodyStyle': u'pickup truck',
  u'class': u'full-size car',
  u'designer': u'Marc_R._Asposito',
  u'dimensions': {u'wheelbase': 3.302},
  u'layout': [u'automobile layout',
   u'four-wheel drive',
   u'front-engine design'],
  u'manufacturer': u'General Motors',
  u'modelYears': [],
  u'name': [u'Cadillac Escalade EXT', u'Chevrolet Avalanche'],
  u'productionYears': [2001,
   2002,
   2003,
   2004,
   2005,
   2006,
   2007,
   2008,
   2009,
   2010,
   2011,
   2012,
   2013]},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb69'),
  u'bodyStyle': [u'sedan delivery', u'station wagon'],
  u'class': u'compact car',
  u'dimensions': {u'length': 4.3, u'wheelbase': 2.5},
  u'engine': [u'Saab_95__1', u'Saab_95__2', u'Saab_95__3'],
  u'layout': u'front-engine front-wheel-drive layout',
  u'manufacturer': [u'Saab Automobile', u'Valmet Automotive'],
  u'modelYears': [],
  u'name': u'Saab 95',
  u'productionYears': [1959,
   1960,
   1961,
   1962,
   1963,
   1964,
   1965,
   1966,
   1967,
   1968,
   1969,
   1970,
   1971,
   1972,
   1973,
   1974,
   1975,
   1976,
   1977,
   1978],
  u'transmission': u'4-speed manual'},
 {u'_id': ObjectId('57bd7f3eb80b36371c48fb6a'),
  u'bodyStyle': u'roadster',
  u'engine': u'Volvo_P1900__1',
  u'layout': u'front-engine rear-wheel-drive layout',
  u'manufacturer': u'Volvo Cars',
  u'modelYears': [],
  u'name': [u'Volvo P1900', u'Volvo Sport'],
  u'productionYears': [1956, 1957],
  u'transmission': u'3-speed manual'}]

In [17]:
query = {'dimensions.weight': {"$gt":  2000}}
db.autos.count(query)


Out[17]:
1

In [18]:
#using ASCII to query for string types
query = {'name': {"$gte": "H", "$lte": "Z"}}
print db.autos.count(query)
print db.autos.find_one(query)


23
{u'designer': u'Colin Chapman', u'assembly': [u'Hethel', u'Norfolk', u'United Kingdom'], u'name': [u'Lotus 7', u'Lotus Seven'], u'transmission': u'manual', u'modelYears': [], u'productionYears': [1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972], u'layout': u'front-engine rear-wheel-drive layout', u'_id': ObjectId('57bd7f3eb80b36371c48fb3f'), u'bodyStyle': u'roadster', u'class': u'sports car', u'manufacturer': u'Lotus Cars'}

These operators can also be used with datetime objects

$exists

Allows us to query and find the structure of the documents


In [19]:
def printCountAndOne(collection, query):
    print "***"
    print collection.count(query)
    pprint.pprint(collection.find_one(query))

In [20]:
print db.autos.count()

printCountAndOne(db.autos, {'assembly': {"$exists": 1}})

printCountAndOne(db.autos, {'assembly': {"$exists": 0}})


43
***
18
{u'_id': ObjectId('57bd7f3eb80b36371c48fb3f'),
 u'assembly': [u'Hethel', u'Norfolk', u'United Kingdom'],
 u'bodyStyle': u'roadster',
 u'class': u'sports car',
 u'designer': u'Colin Chapman',
 u'layout': u'front-engine rear-wheel-drive layout',
 u'manufacturer': u'Lotus Cars',
 u'modelYears': [],
 u'name': [u'Lotus 7', u'Lotus Seven'],
 u'productionYears': [1957,
                      1958,
                      1959,
                      1960,
                      1961,
                      1962,
                      1963,
                      1964,
                      1965,
                      1966,
                      1967,
                      1968,
                      1969,
                      1970,
                      1971,
                      1972],
 u'transmission': u'manual'}
***
25
{u'_id': ObjectId('57bd7f3eb80b36371c48fb40'),
 u'class': u'luxury vehicle',
 u'dimensions': {u'height': 1.58,
                 u'length': 5.6,
                 u'weight': 2050.0,
                 u'wheelbase': 3.25,
                 u'width': 2.0},
 u'engine': [u'Chaika_(car)__1', u'Chaika_(car)__8-cylinder__1'],
 u'layout': u'front-engine rear-wheel-drive layout',
 u'manufacturer': u'GAZ',
 u'modelYears': [],
 u'name': u'M13 Chaika',
 u'productionYears': [1959,
                      1960,
                      1961,
                      1962,
                      1963,
                      1964,
                      1965,
                      1966,
                      1967,
                      1968,
                      1969,
                      1970,
                      1971,
                      1972,
                      1973,
                      1974,
                      1975,
                      1976,
                      1977,
                      1978,
                      1979,
                      1980,
                      1981],
 u'transmission': u'3-speed automatic'}

In [21]:
printCountAndOne(db.autos, {"assembly": {"$regex": "Japan"}})


***
0
None

Querying Arrays Using Scalars

MongoDB queries inside arrays using scalars

$in

Containing at least one of the array values


In [22]:
printCountAndOne(db.autos, {"productionYears": {"$in": [1986,1987,1988, 1992]}})


***
5
{u'_id': ObjectId('57bd7f3eb80b36371c48fb48'),
 u'class': [u'sport compact', u'sports car'],
 u'manufacturer': u'Toyota',
 u'modelYears': [1971],
 u'name': u'test 680',
 u'productionYears': [1970,
                      1971,
                      1972,
                      1973,
                      1974,
                      1975,
                      1976,
                      1977,
                      1978,
                      1979,
                      1980,
                      1981,
                      1982,
                      1983,
                      1984,
                      1985,
                      1986,
                      1987,
                      1988,
                      1989,
                      1990,
                      1991,
                      1992,
                      1993,
                      1994,
                      1995,
                      1996,
                      1997,
                      1998,
                      1999,
                      2000,
                      2001,
                      2002,
                      2003,
                      2004,
                      2005,
                      2006]}

$all

Containing all of the values


In [23]:
printCountAndOne(db.autos, {"productionYears": {"$all": [1986,1987,1988, 1992]}})


***
3
{u'_id': ObjectId('57bd7f3eb80b36371c48fb48'),
 u'class': [u'sport compact', u'sports car'],
 u'manufacturer': u'Toyota',
 u'modelYears': [1971],
 u'name': u'test 680',
 u'productionYears': [1970,
                      1971,
                      1972,
                      1973,
                      1974,
                      1975,
                      1976,
                      1977,
                      1978,
                      1979,
                      1980,
                      1981,
                      1982,
                      1983,
                      1984,
                      1985,
                      1986,
                      1987,
                      1988,
                      1989,
                      1990,
                      1991,
                      1992,
                      1993,
                      1994,
                      1995,
                      1996,
                      1997,
                      1998,
                      1999,
                      2000,
                      2001,
                      2002,
                      2003,
                      2004,
                      2005,
                      2006]}

updates


In [24]:
from random import randint
new_name = "test {}".format(randint(0,999))
print "new name is {}".format(new_name)

auto = db.autos.find_one({"productionYears": {"$in": [1986,1987,1988, 1992]}})
auto['name'] = new_name
db.autos.save(auto)

db.autos.find_one({"name": {"$eq": new_name}})


new name is test 950
/home/aseem/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:7: DeprecationWarning: save is deprecated. Use insert_one or replace_one instead
Out[24]:
{u'_id': ObjectId('57bd7f3eb80b36371c48fb48'),
 u'class': [u'sport compact', u'sports car'],
 u'manufacturer': u'Toyota',
 u'modelYears': [1971],
 u'name': u'test 950',
 u'productionYears': [1970,
  1971,
  1972,
  1973,
  1974,
  1975,
  1976,
  1977,
  1978,
  1979,
  1980,
  1981,
  1982,
  1983,
  1984,
  1985,
  1986,
  1987,
  1988,
  1989,
  1990,
  1991,
  1992,
  1993,
  1994,
  1995,
  1996,
  1997,
  1998,
  1999,
  2000,
  2001,
  2002,
  2003,
  2004,
  2005,
  2006]}

set unset


In [25]:
#if exists then update else insert
db.autos.update_one({"assembly": {"$regex": "Japan"}}, {"$set": {"test": "japanese"}})

print db.autos.count({"test": "japanese"})

#if exists then remove field else ignore
db.autos.update_one({"test": "japanese"}, {"$unset": {"test": ""}})

print db.autos.count({"test": "japanese"})


0
0

NOTE Need to remember that the second argument must have operator $set or $unset. If that is not there then the entire document found by the first one would be replace by what is passed as the second argument.


In [26]:
db.autos.update_many({"assembly": {"$regex": "Japan"}}, {"$set": {"test": "japanese"}})

print db.autos.count({"test": "japanese"})


0

In [27]:
#db.autos.drop()
query = {"test": "japanese"}
#print db.autos.delete_many(query)

print db.autos.count()


43

PROBLEM SET STARTS HERE

In this problem set you work with another type of infobox data, audit it, clean it, come up with a data model, insert it into MongoDB and then run some queries against your database. The set contains data about Arachnid class animals.

Your task in this exercise is to parse the file, process only the fields that are listed in the FIELDS dictionary as keys, and return a list of dictionaries of cleaned values.

The following things should be done:

  • keys of the dictionary changed according to the mapping in FIELDS dictionary
  • trim out redundant description in parenthesis from the 'rdf-schema#label' field, like "(spider)"
  • if 'name' is "NULL" or contains non-alphanumeric characters, set it to the same value as 'label'.
  • if a value of a field is "NULL", convert it to None
  • if there is a value in 'synonym', it should be converted to an array (list) by stripping the "{}" characters and splitting the string on "|". Rest of the cleanup is up to you, e.g. removing "*" prefixes etc. If there is a singular synonym, the value should still be formatted in a list.
  • strip leading and ending whitespace from all fields, if there is any
  • the output structure should be as follows:

[ { 'label': 'Argiope', 'uri': 'http://dbpedia.org/resource/Argiope_(spider)', 'description': 'The genus Argiope includes rather large and spectacular spiders that often ...', 'name': 'Argiope', 'synonym': ["One", "Two"], 'classification': { 'family': 'Orb-weaver spider', 'class': 'Arachnid', 'phylum': 'Arthropod', 'order': 'Spider', 'kingdom': 'Animal', 'genus': None } }, { 'label': ... , }, ... ]


In [28]:
import codecs
import csv
import json
import pprint
import re

DATAFILE = 'arachnid.csv'
FIELDS ={'rdf-schema#label': 'label',
         'URI': 'uri',
         'rdf-schema#comment': 'description',
         'synonym': 'synonym',
         'name': 'name',
         'family_label': 'family',
         'class_label': 'class',
         'phylum_label': 'phylum',
         'order_label': 'order',
         'kingdom_label': 'kingdom',
         'genus_label': 'genus'}
         
         
def remove_brackets_part(string):
    pattern = re.compile(r"\([^)]*?\)")
    return pattern.sub("", string).strip()


def process_file(filename, fields):

    process_fields = fields.values()
    data = []
    with open(filename, "r") as f:
        reader = csv.DictReader(f)
        for i in range(3):
            l = reader.next()

        for line in reader:
            for key, value in fields.iteritems():
                if key != value:
                    line[value] = line[key]

                if value in line:
                    line[value] = line[value].strip()

            line['label'] = remove_brackets_part(line['label'])

            if ('name' not in line) or (line['name'] is None) or line['name'].isalnum():
                line['name'] = line['label']

            if 'synonym' in line:
                if line['synonym'] == 'NULL':
                    line['synonym'] = None
                else:
                    line['synonym'] = parse_array(line['synonym'])

            for key in line.keys():
                if (key not in line) or (line[key] == 'NULL'):
                    line[key] = None

            data.append({'label': line['label'],
                         'uri': line['uri'],
                         'description': line['description'],
                         'name': line['name'],
                         'synonym': line['synonym'],
                         'classification': {
                             'family': line['family'],
                             'class': line['class'],
                             'phylum': line['phylum'],
                             'order': line['order'],
                             'kingdom': line['kingdom'],
                             'genus': line['genus']
                         }
                         })
    return data


def parse_array(v):
    if (v[0] == "{") and (v[-1] == "}"):
        v = v.lstrip("{")
        v = v.rstrip("}")
        v_array = v.split("|")
        v_array = [i.strip() for i in v_array]
        return v_array
    return [v]

In [29]:
def test():
    data = process_file(DATAFILE, FIELDS)
    print "Your first entry:"
    pprint.pprint(data[0])
    first_entry = {
        "synonym": None, 
        "name": "Argiope", 
        "classification": {
            "kingdom": "Animal", 
            "family": "Orb-weaver spider", 
            "order": "Spider", 
            "phylum": "Arthropod", 
            "genus": None, 
            "class": "Arachnid"
        }, 
        "uri": "http://dbpedia.org/resource/Argiope_(spider)", 
        "label": "Argiope", 
        "description": "The genus Argiope includes rather large and spectacular spiders that often have a strikingly coloured abdomen. These spiders are distributed throughout the world. Most countries in tropical or temperate climates host one or more species that are similar in appearance. The etymology of the name is from a Greek name meaning silver-faced."
    }

    assert len(data) == 76
    assert data[0] == first_entry
    assert data[17]["name"] == "Ogdenia"
    assert data[48]["label"] == "Hydrachnidiae"
    assert data[14]["synonym"] == ["Cyrene Peckham & Peckham"]

In [30]:
test()


Your first entry:
{'classification': {'class': 'Arachnid',
                    'family': 'Orb-weaver spider',
                    'genus': None,
                    'kingdom': 'Animal',
                    'order': 'Spider',
                    'phylum': 'Arthropod'},
 'description': 'The genus Argiope includes rather large and spectacular spiders that often have a strikingly coloured abdomen. These spiders are distributed throughout the world. Most countries in tropical or temperate climates host one or more species that are similar in appearance. The etymology of the name is from a Greek name meaning silver-faced.',
 'label': 'Argiope',
 'name': 'Argiope',
 'synonym': None,
 'uri': 'http://dbpedia.org/resource/Argiope_(spider)'}

In [31]:
db = get_db()

with open('arachnid.json') as f:
    data = json.loads(f.read())
    db.arachnid.insert(data, db)


/home/aseem/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:5: DeprecationWarning: insert is deprecated. Use insert_one or insert_many instead.

For this exercise, the arachnid data is already in the database. You have been given the task of including 'binomialAuthority' information in the records. You will do this by processing the arachnid.csv to extract binomial authority data and then using this data to update the corresponding data base records.

The following things should be done in the function add_field:

  • process the csv file and extract 2 fields - 'rdf-schema#label' and 'binomialAuthority_label'
  • clean up the 'rdf-schema#label' the same way as in the first exercise, removing redundant "(spider)" suffixes
  • return a dictionary with the cleaned 'rdf-schema#label' field values as keys, and 'binomialAuthority_label' field values as values
  • if 'binomialAuthority_label' is "NULL" for a row in the csv, skip the item

The following should be done in the function update_db:

  • query the 'label' field in the database using rdf-schema#label keys from the data dictionary
  • update the documents by adding a new item under 'classification' with the key 'binomialAuthority' and the binomialAuthority_label value from the data dictionary as the value

For item {'Argiope': 'Jill Ward'} in the data dictionary, the resulting document structure should look like this:

{ 'label': 'Argiope', 'uri': 'http://dbpedia.org/resource/Argiope_(spider)', 'description': 'The genus Argiope includes rather large and spectacular spiders that often ...', 'name': 'Argiope', 'synonym': ["One", "Two"], 'classification': { 'binomialAuthority' : 'Jill Ward' 'family': 'Orb-weaver spider', 'class': 'Arachnid', 'phylum': 'Arthropod', 'order': 'Spider', 'kingdom': 'Animal', 'genus': None } }


In [32]:
DATAFILE = 'arachnid.csv'
FIELDS ={'rdf-schema#label': 'label',
         'binomialAuthority_label': 'binomialAuthority'}

def add_field(filename, fields):
    """
    Complete this function to set up a dictionary for adding binomialAuthority
    information to the database.
    """
    process_fields = fields.keys()
    data = {}
    with open(filename, "r") as f:
        reader = csv.DictReader(f)
        for i in range(3):
            l = reader.next()
        
        for line in reader:
        
        
            binomialAuthority_label = line['binomialAuthority_label']
        
            if binomialAuthority_label == 'NULL':
                continue
            
            label = remove_brackets_part(line['rdf-schema#label'])
            data[label] = binomialAuthority_label


    return data


def update_db(data, db):
    """
    Use the dictionary you generated from add_field to update the database.
    """
    for key, value in data.iteritems():
        pprint.pprint("{}:{}".format(key, value))
        db.arachnid.update_many({"label": {"$eq": key}}, {"$set": {"classification.binomialAuthority": value}})

In [33]:
def testFinal():
    # Please change only the add_field and update_db functions!
    # Changes done to this function will not be taken into account
    # when doing a Test Run or Submit, they are just for your own reference
    # and as an example for running this code locally!
    
    data = add_field(DATAFILE, FIELDS)
    
    from pymongo import MongoClient
    client = MongoClient("mongodb://localhost:27017")
    db = client.examples

    update_db(data, db)

    updated = db.arachnid.find_one({'label': 'Opisthoncana'})
    assert updated['classification']['binomialAuthority'] == 'Embrik Strand'
    pprint.pprint(data)

testFinal()


'Zealanapis australis:{1951 in science|Raymond Robert Forster}'
'Six-spotted fishing spider:Charles Athanase Walckenaer'
'Orvilleus:Arthur M. Chickering'
'Opisthoncana:Embrik Strand'
{'Opisthoncana': 'Embrik Strand',
 'Orvilleus': 'Arthur M. Chickering',
 'Six-spotted fishing spider': 'Charles Athanase Walckenaer',
 'Zealanapis australis': '{1951 in science|Raymond Robert Forster}'}