This post further elaborates on solving a problem before writing the code.
My goal is to use a React component library called Griddle to display the data.
First, however, I have to sanity-check the data so that its display in Griddle makes sense.
Two of the data collections I want to work with are unfamiliar to me.
After coding and blogging my sanity check of the 'citations' data collection, I decided to further examine the 'accidents' data collection.
So the task at hand is to sanity-check the 'accidents' collection, building on those previous sanity checks.
In [1]:
import logging
from pprint import pprint
from itertools import chain
from datetime import datetime
The running instance of MongoDB was created using Docker files from dm-wyncode/docker-mongo-flpd-hackathon-data.
I wrote an idiomatic Python package so I could easily reuse code. The repository is dm-wyncode/flpd_helper.
Import the objects needed to talk to a running instance of MongoDB:
In [2]:
from bson import SON
from flpd_helper import (
    citations,
    accidents,  # a MongoDB collection created from CSV data
    valid_accidents,  # a MongoDB collection where validated data will go
    valid_citations,  # a MongoDB collection where validated data will go
    log_collection_counts,  # function that prints collection counts
)
# importing from flpd_helper results in info being logged from that package
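For context, here is a hypothetical sketch of how flpd_helper might expose these collections, assuming a local MongoDB instance on the default port; the actual wiring lives in the repository linked above.

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.flpd  # the database name here is my assumption
accidents = db.accidents
valid_accidents = db.valid_accidents
citations = db.citations
valid_citations = db.valid_citations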
I created constants that avoid repetition and make the aggregation pipelines easier to read.
You can see their values in this Python module.
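For orientation, here is a hypothetical sketch of what those constants probably look like. The aggregation operator names are standard MongoDB syntax, but the exact values below are my assumption; the real ones are in the linked module.

# hypothetical sketch of flpd_helper/constants.py
GROUP = '$group'
SORT = '$sort'
LIMIT = '$limit'
SUM = '$sum'
ADD = '$add'
SUBTRACT = '$subtract'
MOD = '$mod'
COUNT = 'count'  # output field name for counts
_ID = '_id'  # MongoDB document key
_DATETIME_OCCURRED = '$Date/Time Occurred'  # '$'-prefixed field reference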
In [3]:
from flpd_helper.constants import * # import almost all the constants.
from flpd_helper.constants import (  # names with a leading underscore are not imported by the * syntax
    _DATE_OCCURRED,
    _DATETIME_OCCURRED,
    _ID,
)
from flpd_helper import convert_date_string
Create a logger for logging debugging info and displaying it in this notebook.
In [4]:
from logging_utility import get_logger
logger = logging.getLogger(__name__)
logger = get_logger(logger)
logger.debug("The logger is working.")
In [5]:
sample = accidents.find_one()
logger.debug([key for key in sample.keys() if 'date' in key.lower()])
In [6]:
# define constant
DATETIME_OCCURRED = 'Date/Time Occurred'
In [7]:
sample = valid_citations.find_one()
logger.debug([key for key in sample.keys() if 'time' in key.lower()])
In [8]:
TIME_ENTERED = 'Time Entered'
logger.debug(sample[TIME_ENTERED])
In [9]:
logger.debug([item[TIME_ENTERED]
              for item in valid_citations.find({}, {TIME_ENTERED: True})
              if not item[TIME_ENTERED]
              ])
In [10]:
logger.debug(
    {len(str(item[TIME_ENTERED]))
     for item in valid_citations.find({}, {TIME_ENTERED: True})}
)
In [11]:
logger.debug(
    all([item[TIME_ENTERED] == 0
         for item in valid_citations.find({}, {TIME_ENTERED: True})
         if len(str(item[TIME_ENTERED])) == 1
         ])
)
In [12]:
logger.debug(
    [("length: {}".format(n),
      {item[TIME_ENTERED]
       for item in valid_citations.find({}, {TIME_ENTERED: True})
       if len(str(item[TIME_ENTERED])) == n})
     for n in range(1, 3)]
)
I could imagine a 3-digit entry representing a time like "1:13". I am not sure what to make of entries with 1 or 2 digits. An entry with 4 digits could certainly represent a time, e.g. "13:41".
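To make that concrete, here is a minimal sketch, my own illustration rather than project code, of how such integers would map to clock times if they are HMM/HHMM values whose leading zeros were dropped during the CSV import.

def to_clock_time(value):
    """Interpret an integer like 113 or 1341 as H:MM or HH:MM."""
    hours, minutes = divmod(value, 100)
    return "{:d}:{:02d}".format(hours, minutes)

to_clock_time(113)   # '1:13'
to_clock_time(1341)  # '13:41'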
In [13]:
running_total = 0
digit_ranges = (
    range(0, 10),
    {item[TIME_ENTERED]
     for item in valid_citations.find({}, {TIME_ENTERED: True})
     if len(str(item[TIME_ENTERED])) == 2},  # documents with a 'Time Entered' value of length 2
)
for digit_range in digit_ranges:
    for time_entered in digit_range:
        count = len(list(valid_citations.find({TIME_ENTERED: time_entered}, {TIME_ENTERED: True})))
        running_total += count
        logger.debug("There are {} records with a 'Time Entered' value of '{}' for a running total of {}".format(
            count, time_entered, running_total))
In [14]:
pipeline = [
{GROUP: {_ID: _DATETIME_OCCURRED, COUNT: {SUM: 1}}},
{SORT: SON([(COUNT, -1), (_ID, -1), ])},
]
With the constants expanded, the pipeline looks like this. Recall that I created constants rather than repeatedly typing the same quoted strings.
In [15]:
pprint(pipeline)
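Assuming the constant values sketched earlier, the pretty-printed pipeline would look something like this (an illustration; the notebook's actual output is omitted here):

[{'$group': {'_id': '$Date/Time Occurred', 'count': {'$sum': 1}}},
 {'$sort': SON([('count', -1), ('_id', -1)])}]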
In [16]:
logger.debug("All accidents field '{}' convert to datetime.datetime".format(DATETIME_OCCURRED))
logger.debug(all(
isinstance(convert_date_string(item[_ID]), datetime)
for item in accidents.aggregate(pipeline)
))
Another problem I discovered with the raw data was that the "Date/Time Occurred" field contained a text string in USA-style date notation. While this notation may be idiomatically comfortable to USAmericans, putting the month and day before the year means the strings do not sort chronologically.
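As an illustration, a conversion helper like convert_date_string could be as simple as the following; the format string is my assumption about the CSV's USA-style notation, and the real implementation is in the flpd_helper repository.

from datetime import datetime

def convert_date_string(date_string):
    # e.g. '06/15/2016 07:53' -> datetime(2016, 6, 15, 7, 53)
    return datetime.strptime(date_string, '%m/%d/%Y %H:%M')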
I decided to go one step further and insert datetime objects into the "Date/Time Occurred" field.
The code to remove the blank entries and insert valid records with datetime objects is here.
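In outline, I would expect it to do something like this hypothetical sketch (not the actual implementation), reusing a conversion helper like the one above:

def load_valid_date_data(collections, date_key):
    for source, destination in collections:
        for document in source.find():
            raw_date = document.get(date_key)
            if not raw_date:  # skip blank entries
                continue
            document[date_key] = convert_date_string(raw_date)  # store a real datetime
            document.pop('_id', None)  # let MongoDB assign a fresh _id in the destination
            destination.insert_one(document)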
In [17]:
# clear the valid_accidents collection
from flpd_helper import load_valid_date_data
valid_accidents.remove()
load_valid_date_data(collections=((accidents, valid_accidents,), ), date_key=DATETIME_OCCURRED)
Here datetime.datetime.utcfromtimestamp(0) will be fed into the pipeline as a BSON Date representing the "epoch". When you \$subtract one BSON Date from another, the difference in milliseconds is returned. This allows you to "round" a date down to its day by subtracting the \$mod result, which is the remainder in milliseconds past the start of that day.
The same is true of \$add, where "adding" a BSON Date to a numeric value results in a BSON Date.
Taken from datetime aggregation how-to.
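A quick client-side check of the arithmetic, in plain Python outside the aggregation pipeline and using a made-up timestamp, shows how subtracting the modulo remainder strips the time of day:

from datetime import datetime, timedelta

MS_PER_DAY = 1000 * 60 * 60 * 24
epoch = datetime.utcfromtimestamp(0)

occurred = datetime(2016, 6, 15, 7, 53)  # arbitrary example timestamp
ms_since_epoch = int((occurred - epoch).total_seconds() * 1000)
midnight = epoch + timedelta(milliseconds=ms_since_epoch - ms_since_epoch % MS_PER_DAY)
print(midnight)  # 2016-06-15 00:00:00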
In [18]:
# give a citation count based on date and time
pipeline = [
{ GROUP: {
_ID: {
ADD: [
{ SUBTRACT: [
{ SUBTRACT: [ _DATETIME_OCCURRED, datetime.utcfromtimestamp(0) ] },
{ MOD: [
{ SUBTRACT: [ _DATETIME_OCCURRED, datetime.utcfromtimestamp(0) ] },
1000 * 60 * 60 * 24
]}
]},
datetime.utcfromtimestamp(0)
]
},
COUNT: { SUM: 1 }
}},
{ SORT: { _ID: 1 } },
]
logger.info(pipeline)
limited_pipeline = pipeline[:] # copy pipeline
limited_pipeline.append({LIMIT: 15})
list(valid_accidents.aggregate(limited_pipeline))
The accident count per day in a sample of 15 looks reasonable.
In [19]:
logger.info(min(date[_ID] for date in valid_accidents.aggregate(pipeline)))
logger.info(max(date[_ID] for date in valid_accidents.aggregate(pipeline)))
I want to use a React component library called Griddle to display the data.