More on solving the problem with some code before I write the code.

This post is a continuation of an earlier one on solving a problem before writing the code.

Defining the problem.

Two of the things I need to work with are unfamiliar to me: MongoDB queries and the police data itself.

My ultimate goal is to build an interactive website with the police data.

I decided to first tackle the problem of learning how to query MongoDB.

In the process of looking at the police data, I discovered that it needed some cleaning up.

As a user of the police data, I do not expect to see entries with blank or impossible values.

Of course finding impossible or illogical data is not unusual. Stuff happens!

So the task at hand is to clean up the data and do some sanity checks.

Sanity checking the data.

First, import some Python standard-library modules that will be needed to get the job done.


In [2]:
import logging
from pprint import pprint
from itertools import chain
from datetime import datetime

The running instance of MongoDB was created using Docker files from dm-wyncode/docker-mongo-flpd-hackathon-data.

I wrote an idiomatic Python package so I could easily reuse code. The repository is dm-wyncode/flpd_helper.
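For readers following along without the helper package, here is a minimal sketch of how such a module might connect, assuming a local MongoDB on the default port; the database name is a guess, and the collection names match the log output below.

# A hypothetical stand-in for what flpd_helper sets up; see the real
# package at dm-wyncode/flpd_helper.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)  # assumes the Docker container publishes the default port
db = client["flpd"]                       # database name is an assumption
citations = db["citations"]               # raw CSV-derived documents
valid_citations = db["valid_citations"]   # cleaned documents will go here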

Import the objects needed to talk to a running instance of MongoDB.


In [3]:
from flpd_helper import (
    citations,  # a MongoDB collection created from CSV data
    valid_citations,  # a MongoDB collection where validated data will go
    db,  # a database instance
    log_collection_counts,  # function that prints collection counts
)
# importing from flpd_helper results in info being logged from that package


collection accidents_geojson already exists
collection citations_geojson already exists
collection valid_accidents_geojson already exists
collection valid_citations_geojson already exists
document count for accidents:  13472
document count for citations:  46456
document count for mapped_accidents:  13472
document count for mapped_citations:  45385
document count for valid_accidents:  13472
document count for valid_citations:  45385
document count for accidents_geojson:  0
document count for citations_geojson:  0
document count for valid_accidents_geojson:  0
document count for valid_citations_geojson:  0

I created constants that avoid repetition and make the aggregation pipelines easier to read.

You can see their values in this Python module.


In [4]:
from bson import SON
from flpd_helper.constants import *  # import almost all the constants.
from flpd_helper.constants import (  # names starting with an underscore are not imported by the * syntax
    _DATE_OCCURRED,
    _ID,
)
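For reference, the constants are just short names for MongoDB's operator strings. Judging from the expanded pipelines printed later in this post, they look roughly like this (a reconstruction, not the module's actual source):

# Reconstructed from the pprint output below; see the linked module for
# the real definitions.
GROUP, SORT, LIMIT = "$group", "$sort", "$limit"
SUM, ADD, SUBTRACT, MOD = "$sum", "$add", "$subtract", "$mod"
CONCAT, SUBSTR, COND = "$concat", "$substr", "$cond"
YEAR, MONTH, DAYOFMONTH = "$year", "$month", "$dayOfMonth"
LESS_THAN_OR_EQUAL, EQUALS = "$lte", "$eq"
COUNT = "count"
DASH, ZERO_STRING, EMPTY_STRING = "-", "0", ""
DATE_OCCURRED = "Date Occurred"
_ID = "_id"
_DATE_OCCURRED = "$Date Occurred"  # a field reference, hence the leading $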

Create a logger to record debugging info and display it in this notebook.


In [5]:
from logging_utility import get_logger
logger = logging.getLogger(__name__)
logger = get_logger(logger)
logger.debug("The logger is working.")


The logger is working.

Sanity check: Are there blank values in the citation data entries?

Create a pipeline, which is a list of dictionaries that contain MongoDB aggregation stages.

This one groups on the "Date Occurred" field, counts the documents for each date, sorts by count in descending order, and limits the output to 10 items.


In [6]:
pipeline = [
    {GROUP: {_ID: _DATE_OCCURRED, COUNT: {SUM: 1}}},
    {SORT: SON([(COUNT, -1), (_ID, -1), ])},
    {LIMIT: 10}
]

The expanded pipeline looks like this. Recall that I created constants rather than repeatedly typing the same quoted strings.


In [7]:
pprint(pipeline)


[{'$group': {'_id': '$Date Occurred', 'count': {'$sum': 1}}},
 {'$sort': SON([('count', -1), ('_id', -1)])},
 {'$limit': 10}]

Note that there are 1071 records where the date is blank.


In [8]:
list(citations.aggregate(pipeline))


Out[8]:
[{'_id': '', 'count': 1071},
 {'_id': '11/19/2015 12:00:00 AM', 'count': 203},
 {'_id': '01/28/2015 12:00:00 AM', 'count': 201},
 {'_id': '01/08/2015 12:00:00 AM', 'count': 194},
 {'_id': '02/11/2015 12:00:00 AM', 'count': 183},
 {'_id': '02/03/2015 12:00:00 AM', 'count': 179},
 {'_id': '01/21/2015 12:00:00 AM', 'count': 177},
 {'_id': '02/04/2015 12:00:00 AM', 'count': 173},
 {'_id': '01/07/2015 12:00:00 AM', 'count': 172},
 {'_id': '05/25/2016 12:00:00 AM', 'count': 170}]

Creating a new and better collection called "valid_citations".

Another problem I discovered with the raw data was that the "Date Occurred" field held a text string in USA-style date notation. While that notation may be idiomatically comfortable to Americans, putting the month before the day means the strings do not sort chronologically.
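A quick illustration of why those strings misbehave when sorted:

# US-style date strings sort lexicographically, not chronologically.
dates = [
    "11/19/2015 12:00:00 AM",
    "01/08/2015 12:00:00 AM",
    "02/11/2016 12:00:00 AM",
]
print(sorted(dates))
# ['01/08/2015 ...', '02/11/2016 ...', '11/19/2015 ...']
# February 2016 sorts before November 2015.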

I decided to go one step further and store datetime objects in the "Date Occurred" field.

The code to remove the blank entries and insert valid records with datetime objects is here in the 'load_valid_date_data' function.
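Here is a minimal sketch of what that function might do, assuming the date format visible in the query results above; the real implementation is in the repository.

# Hypothetical sketch of load_valid_date_data; not the actual source.
from datetime import datetime

US_DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"  # e.g. "11/19/2015 12:00:00 AM"

def load_valid_date_data(source, destination, date_field="Date Occurred"):
    """Copy documents with a parseable date into destination, as datetimes."""
    for document in source.find({date_field: {"$ne": ""}}):  # skip blanks
        try:
            document[date_field] = datetime.strptime(document[date_field], US_DATE_FORMAT)
        except ValueError:
            continue  # skip dates that do not parse
        document.pop("_id")  # let MongoDB assign a fresh _id
        destination.insert_one(document)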

Check that the valid_citations collection has documents.


In [9]:
log_collection_counts((valid_citations, ))


document count for valid_citations:  45385

Notice that the document count for the valid_citations collection is less than the document count for the citations collection because the invalid entries were removed.


In [10]:
log_collection_counts((citations, ))


document count for citations:  46456

This aggregation produces a citation count per date.

Here datetime.utcfromtimestamp(0) will be fed into the pipeline as a BSON Date representing the epoch. When you $subtract one BSON Date from another, the difference is returned in milliseconds. This lets you "round" a date down to its day by subtracting the $mod remainder, the millisecond difference modulo one day.

The same is true of $add, where "adding" a numeric value to a BSON Date results in a BSON Date.

Taken from datetime aggregation how-to.
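To make the rounding concrete, the same arithmetic in plain Python:

# "Round down to midnight" using milliseconds since the epoch,
# mirroring the $subtract / $mod / $add steps in the pipeline below.
from datetime import datetime, timedelta

EPOCH = datetime.utcfromtimestamp(0)
MS_PER_DAY = 1000 * 60 * 60 * 24

occurred = datetime(2015, 11, 19, 14, 30)                        # example timestamp
ms = (occurred - EPOCH) // timedelta(milliseconds=1)             # $subtract: ms since epoch
midnight = EPOCH + timedelta(milliseconds=ms - ms % MS_PER_DAY)  # $mod, then $add
print(midnight)  # 2015-11-19 00:00:00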


In [11]:
# give a citation count per calendar date (time of day rounded away)
pipeline = [
    { GROUP: {
        _ID: {
            ADD: [
               { SUBTRACT: [
                   { SUBTRACT: [ _DATE_OCCURRED, datetime.utcfromtimestamp(0) ] },
                   { MOD: [
                       { SUBTRACT: [ _DATE_OCCURRED, datetime.utcfromtimestamp(0) ] },
                       1000 * 60 * 60 * 24
                   ]}
               ]},
               datetime.utcfromtimestamp(0)
           ]
        },
        COUNT: { SUM: 1 }
    }},
    { SORT: { _ID: 1 } },
]
logger.info(pipeline)
limited_pipeline = pipeline[:]  # copy pipeline
limited_pipeline.append({LIMIT: 15})
list(valid_citations.aggregate(limited_pipeline))


[{'$group': {'count': {'$sum': 1}, '_id': {'$add': [{'$subtract': [{'$subtract': ['$Date Occurred', datetime.datetime(1970, 1, 1, 0, 0)]}, {'$mod': [{'$subtract': ['$Date Occurred', datetime.datetime(1970, 1, 1, 0, 0)]}, 86400000]}]}, datetime.datetime(1970, 1, 1, 0, 0)]}}}, {'$sort': {'_id': 1}}]
Out[11]:
[{'_id': datetime.datetime(2006, 2, 19, 0, 0), 'count': 2},
 {'_id': datetime.datetime(2006, 2, 23, 0, 0), 'count': 1},
 {'_id': datetime.datetime(2006, 7, 17, 0, 0), 'count': 1},
 {'_id': datetime.datetime(2014, 1, 8, 0, 0), 'count': 2},
 {'_id': datetime.datetime(2014, 1, 29, 0, 0), 'count': 1},
 {'_id': datetime.datetime(2014, 2, 3, 0, 0), 'count': 1},
 {'_id': datetime.datetime(2014, 2, 10, 0, 0), 'count': 1},
 {'_id': datetime.datetime(2014, 2, 17, 0, 0), 'count': 1},
 {'_id': datetime.datetime(2014, 3, 3, 0, 0), 'count': 1},
 {'_id': datetime.datetime(2014, 3, 4, 0, 0), 'count': 3},
 {'_id': datetime.datetime(2014, 3, 13, 0, 0), 'count': 1},
 {'_id': datetime.datetime(2014, 3, 14, 0, 0), 'count': 1},
 {'_id': datetime.datetime(2014, 6, 14, 0, 0), 'count': 1},
 {'_id': datetime.datetime(2014, 10, 11, 0, 0), 'count': 8},
 {'_id': datetime.datetime(2014, 10, 19, 0, 0), 'count': 2}]

This aggregation also produces a citation count per date, building a YYYY-MM-DD string from substrings rather than using datetime arithmetic.


In [12]:
# padding with zeros, number of citations by date
pipeline = [
    {GROUP: {
        _ID : { CONCAT: [  # join the year, month, day with a dash
            {SUBSTR: [{YEAR: _DATE_OCCURRED}, 0, 4 ]},
            DASH,
            { COND: [  # pad the month with a leading "0" if it is 9 or less
                        { LESS_THAN_OR_EQUAL: [ { MONTH: _DATE_OCCURRED }, 9 ] },
                        { CONCAT: [
                                        ZERO_STRING, { SUBSTR: [ { MONTH: _DATE_OCCURRED }, 0, 2 ] }
                                  ]},
                        { SUBSTR: [ { MONTH: _DATE_OCCURRED }, 0, 2 ] }
                    ]},
            DASH,
            { COND: [  # pad the day of month with a leading "0" if it is 9 or less
                        { LESS_THAN_OR_EQUAL: [ { DAYOFMONTH: _DATE_OCCURRED }, 9 ] },
                        { CONCAT: [
                                        ZERO_STRING, { SUBSTR: [ { DAYOFMONTH: _DATE_OCCURRED }, 0, 2 ] }
                                  ]},
                        { SUBSTR: [ { DAYOFMONTH: _DATE_OCCURRED }, 0, 2 ] }
                    ]},
        ]},
        COUNT: {SUM: 1 }  # count how many records for each YYYY-MM-DD entry
     }},
     {SORT: { _ID: 1 }}  # sort on YYYY-MM-DD
]
logger.info(pipeline)
limited_pipeline = pipeline[:]  # copy pipeline
limited_pipeline.append({LIMIT: 15})
list(valid_citations.aggregate(limited_pipeline))


[{'$group': {'count': {'$sum': 1}, '_id': {'$concat': [{'$substr': [{'$year': '$Date Occurred'}, 0, 4]}, '-', {'$cond': [{'$lte': [{'$month': '$Date Occurred'}, 9]}, {'$concat': ['0', {'$substr': [{'$month': '$Date Occurred'}, 0, 2]}]}, {'$substr': [{'$month': '$Date Occurred'}, 0, 2]}]}, '-', {'$cond': [{'$lte': [{'$dayOfMonth': '$Date Occurred'}, 9]}, {'$concat': ['0', {'$substr': [{'$dayOfMonth': '$Date Occurred'}, 0, 2]}]}, {'$substr': [{'$dayOfMonth': '$Date Occurred'}, 0, 2]}]}]}}}, {'$sort': {'_id': 1}}]
Out[12]:
[{'_id': '2006-02-19', 'count': 2},
 {'_id': '2006-02-23', 'count': 1},
 {'_id': '2006-07-17', 'count': 1},
 {'_id': '2014-01-08', 'count': 2},
 {'_id': '2014-01-29', 'count': 1},
 {'_id': '2014-02-03', 'count': 1},
 {'_id': '2014-02-10', 'count': 1},
 {'_id': '2014-02-17', 'count': 1},
 {'_id': '2014-03-03', 'count': 1},
 {'_id': '2014-03-04', 'count': 3},
 {'_id': '2014-03-13', 'count': 1},
 {'_id': '2014-03-14', 'count': 1},
 {'_id': '2014-06-14', 'count': 1},
 {'_id': '2014-10-11', 'count': 8},
 {'_id': '2014-10-19', 'count': 2}]
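As an aside, MongoDB 3.0 added the $dateToString operator, which does the zero-padding in one step. A sketch of the equivalent pipeline, written with plain strings rather than the constants (not part of the original notebook):

# $dateToString formats a BSON Date server-side, padding included.
pipeline_alternative = [
    {"$group": {
        "_id": {"$dateToString": {"format": "%Y-%m-%d", "date": "$Date Occurred"}},
        "count": {"$sum": 1},
    }},
    {"$sort": {"_id": 1}},
]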

Using Python to get the max and min.

TODO: Learn how to do it with MongoDB.

The max and min dates of the citation data in the valid_citations collection.


In [13]:
logger.info(min(date[_ID] for date in valid_citations.aggregate(pipeline)))
logger.info(max(date[_ID] for date in valid_citations.aggregate(pipeline)))


2006-02-19
2016-07-22
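One way to tackle that TODO, sketched here but untested against this data set: group every document into a single bucket and use the $min and $max accumulators.

# A single $group over the whole collection answers the question server-side.
pipeline = [
    {GROUP: {
        _ID: None,  # one group containing every document
        "earliest": {"$min": _DATE_OCCURRED},
        "latest": {"$max": _DATE_OCCURRED},
    }},
]
pprint(list(valid_citations.aggregate(pipeline)))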

How many empty dates are there?

Using Python to do the counting with a simple find.

Since I am still learning the aggregation framework, I wanted to test my answer against a method I am already confident in. I do not have canonical answers for this data set yet, so I have to make two educated guesses and compare them. I will treat this simple find as the gold standard against which to compare the more complex MongoDB aggregation.


In [14]:
from IPython.core.display import display, HTML
display(HTML('<span id="line-count"></span>'))


Number of documents excluded from the valid_citations collection.


In [15]:
# plain search
# Python is doing the counting
blanks = list(citations.find(
    {  # query filter
        DATE_OCCURRED: ''
    },
))
logger.info(len(blanks))


1071
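As a side note, the counting could also happen server-side so the blank documents never cross the wire; in current PyMongo that is count_documents (a sketch, not in the original notebook):

# Let MongoDB count the blanks instead of shipping them all to Python.
blank_count = citations.count_documents({DATE_OCCURRED: ""})
logger.info(blank_count)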

Using a MongoDB aggregate to do the counting.


In [16]:
# doing the count with MongoDB
pipeline = [
    { GROUP: {
        _ID: {
            EQUALS: [ _DATE_OCCURRED, EMPTY_STRING ],
        },
        COUNT: { SUM: 1 },
    }},
    { SORT: { _ID: 1 } },
]
logger.info(pipeline)
counts = list(citations.aggregate(pipeline))
pprint(counts)


[{'$group': {'count': {'$sum': 1}, '_id': {'$eq': ['$Date Occurred', '']}}}, {'$sort': {'_id': 1}}]
[{'_id': False, 'count': 45385}, {'_id': True, 'count': 1071}]

Testing my two educated guesses.


In [17]:
# Pull the two counts out of the aggregation result: first the count where
# _id is True (blank dates), then the count where _id is False.
all_counts = empties, not_empties = list(chain(*[
    [count[COUNT] for count in counts if count[_ID] is criteria]
    for criteria in (True, False)]))
pipeline_sum = sum(all_counts)
citations_total = citations.count()
delimiter = ': '
labels = (
    "empties count",
    "not empties count",
    'sum of empties and non-empties',
    'total count',
    'two methods equal',
)
data = (
    empties,
    not_empties,
    pipeline_sum,
    citations_total,
    citations_total == pipeline_sum,
)
data = zip(labels, (str(datum) for datum in data))
for datum in data:
    logger.info(delimiter.join(datum))


empties count: 1071
not empties count: 45385
sum of empties and non-empties: 46456
total count: 46456
two methods equal: True

In a future post…

I want to use a React component library called Griddle to display the data.