This post continues the theme of solving a problem before writing the code.
Two of the technologies I want to use are unfamiliar to me: MongoDB, which stores the data, and Griddle, the React component library I plan to use to display it.
My ultimate goal is to build an interactive website with the police data.
I decided to first tackle the problem of learning how to query MongoDB.
In the process of looking at the police data, I discovered that it needed some cleaning up.
As a user of the police data, I do not expect to see entries with blank or impossible values.
Of course finding impossible or illogical data is not unusual. Stuff happens!
So the task at hand is to clean up the data and do some sanity checks.
In [2]:
import logging
from pprint import pprint
from itertools import chain
from datetime import datetime
The running instance of MongoDB was created using Docker files from dm-wyncode/docker-mongo-flpd-hackathon-data.
I wrote an idiomatic Python package so I could easily reuse code. The repository is dm-wyncode/flpd_helper.
Import the objects needed to talk to a running instance of MongoDB.
In [3]:
from flpd_helper import (
citations, # a MongoDB collection created from CSV data
valid_citations, # a MongoDB collection where validated data will go
db, # a database instance
log_collection_counts, # function that prints collection counts
)
# importing from flpd_helper results in info being logged from that package
I created constants that avoid repetition and make the aggregation pipelines easier to read.
You can see their values in this Python module.
In [4]:
from bson import SON
from flpd_helper.constants import * # import almost all the constants.
from flpd_helper.constants import ( # names with a leading underscore are not imported by the * syntax
_DATE_OCCURRED,
_ID,
)
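To give a flavor without leaving the post, the module maps friendly names to MongoDB's operator strings, roughly like this. These values are my sketch, inferred from how the constants are used below; see the repository for the exact definitions.

# hypothetical values, inferred from usage later in this post
GROUP = "$group"
SORT = "$sort"
LIMIT = "$limit"
SUM = "$sum"
COUNT = "count"  # output field name, not an operator
EQUALS = "$eq"
DATE_OCCURRED = "Date Occurred"  # raw field name in the documents
_DATE_OCCURRED = "$Date Occurred"  # field-reference form used in pipelines
_ID = "_id"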
Create a logger for logging debugging info and displaying it in this notebook.
In [5]:
from logging_utility import get_logger
logger = logging.getLogger(__name__)
logger = get_logger(logger)
logger.debug("The logger is working.")
Create a pipeline, which is a list of dictionaries containing MongoDB query instructions.
This one groups on the field "Date Occurred", counts the documents in each group, sorts the results, and limits the output to 10 items.
In [6]:
pipeline = [
{GROUP: {_ID: _DATE_OCCURRED, COUNT: {SUM: 1}}},
{SORT: SON([(COUNT, -1), (_ID, -1), ])},
{LIMIT: 10}
]
The actual strings that make up the pipeline look like this. Recall that I created constants rather than repeatedly typing the same quoted strings.
In [7]:
pprint(pipeline)
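Assuming the constant values sketched earlier, the expanded pipeline looks roughly like this:

[{'$group': {'_id': '$Date Occurred', 'count': {'$sum': 1}}},
 {'$sort': SON([('count', -1), ('_id', -1)])},
 {'$limit': 10}]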
Note that there are 1071 records where the date is blank.
In [8]:
list(citations.aggregate(pipeline))
Out[8]:
Another problem I discovered with the raw data was that the "Date Occurred" field held text strings in USA-style date notation. While that notation may be idiomatically comfortable to USAmericans, putting the month before the day means the strings do not sort chronologically.
I decided to go one step further and insert datetime objects into the "Date Occurred" field.
The code to remove the blank entries and insert valid records with datetime objects is here, in the 'load_valid_date_data' function.
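The gist of that function, in a simplified sketch: parse each USA-style date string and keep only the documents that parse. The format string "%m/%d/%Y" is an assumption about the raw data (the real strings may also carry a time component); the actual implementation lives in the repository.

from datetime import datetime

def load_valid_date_data(source, destination, field="Date Occurred"):
    """Copy documents whose date string parses, storing a real datetime."""
    for document in source.find():
        raw = document.get(field, "")
        try:
            # assumed USA-style date strings; blanks raise ValueError
            document[field] = datetime.strptime(raw, "%m/%d/%Y")
        except ValueError:
            continue  # skip documents with blank or unparseable dates
        destination.insert_one(document)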
Check that the valid_citations collection has documents.
In [9]:
log_collection_counts((valid_citations, ))
Notice that the document count for the valid_citations collection is less than the document count for the citations collection because the invalid entries were removed.
In [10]:
log_collection_counts((citations, ))
Here datetime.utcfromtimestamp(0) will be fed into the pipeline as a BSON Date representing the "epoch". When you $subtract one BSON Date from another, the difference in milliseconds is returned. This allows you to "round" the date to the current day by subtracting the $mod result, which is the remainder of milliseconds since the start of that day.
The same is true of $add, where "adding" a BSON Date to a numeric value results in a BSON Date.
Taken from datetime aggregation how-to.
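The same rounding is easy to reproduce in plain Python, which makes the pipeline below easier to follow:

from datetime import datetime, timedelta

epoch = datetime.utcfromtimestamp(0)
ms_per_day = 1000 * 60 * 60 * 24

occurred = datetime(2015, 3, 5, 14, 30)  # example timestamp
elapsed_ms = int((occurred - epoch).total_seconds() * 1000)
midnight_ms = elapsed_ms - elapsed_ms % ms_per_day  # strip the intra-day remainder
rounded = epoch + timedelta(milliseconds=midnight_ms)
# rounded == datetime(2015, 3, 5): same day, time of day removed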
In [11]:
# count citations per day: each timestamp is rounded down to midnight
pipeline = [
{ GROUP: {
_ID: {
ADD: [
{ SUBTRACT: [
{ SUBTRACT: [ _DATE_OCCURRED, datetime.utcfromtimestamp(0) ] },
{ MOD: [
{ SUBTRACT: [ _DATE_OCCURRED, datetime.utcfromtimestamp(0) ] },
1000 * 60 * 60 * 24
]}
]},
datetime.utcfromtimestamp(0)
]
},
COUNT: { SUM: 1 }
}},
{ SORT: { _ID: 1 } },
]
logger.info(pipeline)
limited_pipeline = pipeline[:] # copy pipeline
limited_pipeline.append({LIMIT: 15})
list(valid_citations.aggregate(limited_pipeline))
Out[11]:
In [12]:
# padding with zeros, number of citations by date
pipeline = [
{GROUP: {
_ID : { CONCAT: [ # join the year, month, day with a dash
{SUBSTR: [{YEAR: _DATE_OCCURRED}, 0, 4 ]},
DASH,
{ COND: [ # pad the month with a leading "0" if it is a single digit
{ LESS_THAN_OR_EQUAL: [ { MONTH: _DATE_OCCURRED }, 9 ] },
{ CONCAT: [
ZERO_STRING, { SUBSTR: [ { MONTH: _DATE_OCCURRED }, 0, 2 ] }
]},
{ SUBSTR: [ { MONTH: _DATE_OCCURRED }, 0, 2 ] }
]},
DASH,
{ COND: [ # pad the day of month with a leading "0" if it is a single digit
{ LESS_THAN_OR_EQUAL: [ { DAYOFMONTH: _DATE_OCCURRED }, 9 ] },
{ CONCAT: [
ZERO_STRING, { SUBSTR: [ { DAYOFMONTH: _DATE_OCCURRED }, 0, 2 ] }
]},
{ SUBSTR: [ { DAYOFMONTH: _DATE_OCCURRED }, 0, 2 ] }
]},
]},
COUNT: {SUM: 1 } # count how many records for each YYYY-MM-DD entry
}},
{SORT: { _ID: 1 }} # sort on YYYY-MM-DD
]
logger.info(pipeline)
limited_pipeline = pipeline[:] # copy pipeline
limited_pipeline.append({LIMIT: 15})
list(valid_citations.aggregate(limited_pipeline))
Out[12]:
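Incidentally, all of that string surgery can be avoided on MongoDB 3.0 or newer, where $dateToString produces the YYYY-MM-DD key directly. A sketch using raw operator strings instead of my constants:

pipeline = [
    {"$group": {
        "_id": {"$dateToString": {"format": "%Y-%m-%d", "date": "$Date Occurred"}},
        "count": {"$sum": 1},
    }},
    {"$sort": {"_id": 1}},
]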
In [13]:
logger.info(min(date[_ID] for date in valid_citations.aggregate(pipeline)))
logger.info(max(date[_ID] for date in valid_citations.aggregate(pipeline)))
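As an aside, MongoDB can compute the same range itself with the $min and $max group accumulators; again a sketch with raw operator strings:

range_pipeline = [
    {"$group": {
        "_id": None,
        "first": {"$min": "$Date Occurred"},
        "last": {"$max": "$Date Occurred"},
    }},
]
logger.info(list(valid_citations.aggregate(range_pipeline)))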
I'll trust this simple method as the gold standard against which to compare a more complex MongoDB query. Because I am still learning the aggregation framework, I want to test its answer, and because I do not yet have canonical answers for this data set, the best I can do is make two educated guesses by independent methods and compare them.
In [14]:
from IPython.core.display import display, HTML
display(HTML('<span id="line-count"></span>'))
In [15]:
# plain search
# Python is doing the counting
blanks = list(citations.find(
{ # query filter
DATE_OCCURRED: ''
},
))
logger.info(len(blanks))
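Pulling every matching document into Python just to count it works, but the server can also return the count directly. Note that count_documents requires pymongo 3.7 or newer, so treat this as version-dependent:

blank_count = citations.count_documents({DATE_OCCURRED: ''})
logger.info(blank_count)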
In [16]:
# doing the count with MongoDB
pipeline = [
{ GROUP: {
_ID: {
EQUALS: [ _DATE_OCCURRED, EMPTY_STRING ],
},
COUNT: { SUM: 1 },
}},
{ SORT: { _ID: 1 } },
]
logger.info(pipeline)
counts = list(citations.aggregate(pipeline))
pprint(counts)
In [17]:
# split the aggregation output into the empty and non-empty counts
all_counts = empties, not_empties = list(chain(*[[count[COUNT] for count in counts
                                                  if count[_ID] == criteria]
                                                 for criteria in (True, False)]))
pipeline_sum = sum(all_counts)
citations_total = citations.count()
delimiter = ': '
labels = (
"empties count",
"not empties count",
'sum of empties and non-empties',
'total count',
'two methods equal',
)
data = (
empties,
not_empties,
pipeline_sum,
citations_total,
citations_total == pipeline_sum,
)
data = zip(labels, (str(datum) for datum in data))
for datum in data:
logger.info(delimiter.join(datum))
I want to use a React component library called Griddle to display the data.