This post further elaborates on solving a problem before writing the code.
My goal is to use a React component library called Griddle to display the data.
First, however, I have to sanity-check the data so that its display in Griddle makes sense.
Two of the data collections I want to work with are unfamiliar to me.
After coding and blogging my sanity check of the 'citations' data collection, I decided to further examine the 'accidents' data collection.
So the task at hand is to sanity-check the 'accidents' collection, building on those previous sanity checks.
In [1]:
import logging
from pprint import pprint
from itertools import chain
from datetime import datetime
The running instance of MongoDB was created using Docker files from dm-wyncode/docker-mongo-flpd-hackathon-data.
I wrote an idiomatic Python package so I could easily reuse code. The repository is dm-wyncode/flpd_helper.
Import the objects needed to talk to a running instance of MongoDB:
In [2]:
from bson import SON
from flpd_helper import (
    citations,
    accidents,  # a MongoDB collection created from CSV data
    valid_accidents,  # a MongoDB collection where validated data will go
    valid_citations,  # a MongoDB collection where validated data will go
    log_collection_counts,  # function that prints collection counts
)
# importing from flpd_helper results in info being logged from that package
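For context, here is a hypothetical sketch of how flpd_helper might expose these collections, assuming a local MongoDB instance on the default port; the actual wiring lives in the repository linked above.

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.flpd  # the database name here is my assumption
accidents = db.accidents
valid_accidents = db.valid_accidents
citations = db.citations
valid_citations = db.valid_citations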
I created constants that avoid repetition and make the aggregation pipelines easier to read.
You can see their values in this Python module.
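For orientation, here is a hypothetical sketch of what those constants probably look like. The aggregation operator names are standard MongoDB syntax, but the exact values below are my assumption; the real ones are in the linked module.

# hypothetical sketch of flpd_helper/constants.py
GROUP = '$group'
SORT = '$sort'
LIMIT = '$limit'
SUM = '$sum'
ADD = '$add'
SUBTRACT = '$subtract'
MOD = '$mod'
COUNT = 'count'  # output field name for counts
_ID = '_id'  # MongoDB document key
_DATETIME_OCCURRED = '$Date/Time Occurred'  # '$'-prefixed field reference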
In [3]:
from flpd_helper.constants import * # import almost all the constants.
from flpd_helper.constants import (  # names with a leading underscore are not imported by the * syntax
    _DATE_OCCURRED,
    _DATETIME_OCCURRED,
    _ID,
)
from flpd_helper import convert_date_string
Create a logger for logging debugging info and displaying it in this notebook.
In [4]:
from logging_utility import get_logger
logger = logging.getLogger(__name__)
logger = get_logger(logger)
logger.debug("The logger is working.")
In [5]:
sample = accidents.find_one()
logger.debug([key for key in sample.keys() if 'date' in key.lower()])
In [6]:
# define constant
DATETIME_OCCURRED = 'Date/Time Occurred'
In [7]:
sample = valid_citations.find_one()
logger.debug([key for key in sample.keys() if 'time' in key.lower()])
In [8]:
TIME_ENTERED = 'Time Entered'
logger.debug(sample[TIME_ENTERED])
In [9]:
logger.debug([item[TIME_ENTERED]
              for item in valid_citations.find({}, {TIME_ENTERED: True})
              if not item[TIME_ENTERED]
              ])
In [10]:
logger.debug(
    {len(str(item[TIME_ENTERED]))
     for item in valid_citations.find({}, {TIME_ENTERED: True})}
)
In [11]:
logger.debug(
    all([item[TIME_ENTERED] == 0
         for item in valid_citations.find({}, {TIME_ENTERED: True})
         if len(str(item[TIME_ENTERED])) == 1
         ])
)
In [12]:
logger.debug(
    [("length: {}".format(n),
      {item[TIME_ENTERED]
       for item in valid_citations.find({}, {TIME_ENTERED: True})
       if len(str(item[TIME_ENTERED])) == n})
     for n in range(1, 3)]
)
I could imagine a 3-digit entry representing a time like "1:13". I am not sure what to make of entries with 1 or 2 digits. An entry with 4 digits could certainly represent a time, e.g. "13:41".
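To make that concrete, here is a minimal sketch, my own illustration rather than project code, of how such integers would map to clock times if they are HMM/HHMM values whose leading zeros were dropped during the CSV import.

def to_clock_time(value):
    """Interpret an integer like 113 or 1341 as H:MM or HH:MM."""
    hours, minutes = divmod(value, 100)
    return "{:d}:{:02d}".format(hours, minutes)

to_clock_time(113)   # '1:13'
to_clock_time(1341)  # '13:41'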
In [13]:
running_total = 0
digit_ranges = (
    range(0, 10),
    {item[TIME_ENTERED]
     for item in valid_citations.find({}, {TIME_ENTERED: True})
     if len(str(item[TIME_ENTERED])) == 2},  # documents with a 'Time Entered' value of length 2
)
for digit_range in digit_ranges:
    for time_entered in digit_range:
        count = len(list(valid_citations.find({TIME_ENTERED: time_entered}, {TIME_ENTERED: True})))
        running_total += count
        logger.debug("There are {} records with a 'Time Entered' value of '{}' for a running total of {}".format(
            count, time_entered, running_total))
In [14]:
pipeline = [
{GROUP: {_ID: _DATETIME_OCCURRED, COUNT: {SUM: 1}}},
{SORT: SON([(COUNT, -1), (_ID, -1), ])},
]
With the constants expanded, the pipeline looks like this. Recall that I created constants rather than repeatedly typing the same quoted strings.
In [15]:
pprint(pipeline)
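Assuming the constant values sketched earlier, the pretty-printed pipeline would look something like this (an illustration; the notebook's actual output is omitted here):

[{'$group': {'_id': '$Date/Time Occurred', 'count': {'$sum': 1}}},
 {'$sort': SON([('count', -1), ('_id', -1)])}]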
In [16]:
logger.debug("All accidents field '{}' convert to datetime.datetime".format(DATETIME_OCCURRED))
logger.debug(all(
isinstance(convert_date_string(item[_ID]), datetime)
for item in accidents.aggregate(pipeline)
))
Another problem I discovered with the raw data was that the "Date/Time Occurred" field contained a text string in USA-style date notation. While this notation may be idiomatically comfortable to USAmericans, putting the month and day before the year means the strings do not sort chronologically.
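As an illustration, a conversion helper like convert_date_string could be as simple as the following; the format string is my assumption about the CSV's USA-style notation, and the real implementation is in the flpd_helper repository.

from datetime import datetime

def convert_date_string(date_string):
    # e.g. '06/15/2016 07:53' -> datetime(2016, 6, 15, 7, 53)
    return datetime.strptime(date_string, '%m/%d/%Y %H:%M')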
I decided to go one step further and insert datetime objects into the "Date/Time Occurred" field.
The code to remove the blank entries and insert valid records with datetime objects is here.
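In outline, I would expect it to do something like this hypothetical sketch (not the actual implementation), reusing a conversion helper like the one above:

def load_valid_date_data(collections, date_key):
    for source, destination in collections:
        for document in source.find():
            raw_date = document.get(date_key)
            if not raw_date:  # skip blank entries
                continue
            document[date_key] = convert_date_string(raw_date)  # store a real datetime
            document.pop('_id', None)  # let MongoDB assign a fresh _id in the destination
            destination.insert_one(document)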
In [17]:
# clear the valid_accidents collection
from flpd_helper import load_valid_date_data
valid_accidents.remove()
load_valid_date_data(collections=((accidents, valid_accidents,), ), date_key=DATETIME_OCCURRED)
Here datetime.datetime.utcfromtimestamp(0) will be fed into the pipeline as a BSON Date representing the "epoch". When you \$subtract one BSON Date from another, the difference in milliseconds is returned. This allows you to "round" a date down to its day by subtracting the \$mod result, which is the remainder in milliseconds past the start of that day.
The same is true of \$add, where "adding" a BSON Date to a numeric value results in a BSON Date.
Taken from datetime aggregation how-to.
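A quick client-side check of the arithmetic, in plain Python outside the aggregation pipeline and using a made-up timestamp, shows how subtracting the modulo remainder strips the time of day:

from datetime import datetime, timedelta

MS_PER_DAY = 1000 * 60 * 60 * 24
epoch = datetime.utcfromtimestamp(0)

occurred = datetime(2016, 6, 15, 7, 53)  # arbitrary example timestamp
ms_since_epoch = int((occurred - epoch).total_seconds() * 1000)
midnight = epoch + timedelta(milliseconds=ms_since_epoch - ms_since_epoch % MS_PER_DAY)
print(midnight)  # 2016-06-15 00:00:00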
In [18]:
# give a citation count based on date and time
pipeline = [
{ GROUP: {
_ID: {
ADD: [
{ SUBTRACT: [
{ SUBTRACT: [ _DATETIME_OCCURRED, datetime.utcfromtimestamp(0) ] },
{ MOD: [
{ SUBTRACT: [ _DATETIME_OCCURRED, datetime.utcfromtimestamp(0) ] },
1000 * 60 * 60 * 24
]}
]},
datetime.utcfromtimestamp(0)
]
},
COUNT: { SUM: 1 }
}},
{ SORT: { _ID: 1 } },
]
logger.info(pipeline)
limited_pipeline = pipeline[:] # copy pipeline
limited_pipeline.append({LIMIT: 15})
list(valid_accidents.aggregate(limited_pipeline))
The accident count per day in a sample of 15 looks reasonable.
In [19]:
logger.info(min(date[_ID] for date in valid_accidents.aggregate(pipeline)))
logger.info(max(date[_ID] for date in valid_accidents.aggregate(pipeline)))
I want to use a React component library called Griddle to display the data.