Programmatic calculation of differing data types frequencies among the same fields of different documents in the same MongoDB collection.

…continued from part 03

The data is messy because the value data types are not uniform across the fields!


In [9]:
from IPython.core.display import display, HTML
from part_four_meta import giffy_html
display(HTML(giffy_html))




Immediate goal: Create new collections where document field values are the same data type for a given field in data stored in a MongoDB database.

End goal: display data using ReactJS compontent Griddle

Problem: The same field across documents in the data have disparate data types.

Example of the problem.

Use code from flpd_helper to get frequencies of data types across fields across documents for a collection.


In [10]:
# This cell is meta code whose purpose is to generate some data
# so that it can be displayed in an HTML table in this post
from flpd_helper.tools import data_types_differences, formatted_type_counts
from flpd_helper import validated_collection_names
from itertools import chain

FIRST_ITEM = 0
# data_types_differences can take over a minute to run
duplicates_map = data_types_differences()

In [11]:
formatted_type_counts_ = formatted_type_counts(duplicates_map)

In [12]:
# This cell is meta code whose purpose is to display HTML into this notebook 
# so that it will display in the Zippped Code blog.

# ipy_table is a 3rd party package
from ipy_table import make_table, set_column_style, render
import significance_for_differences

from IPython.lib import deepreload
deepreload.reload(significance_for_differences)

header = list(('Collection Name', 'Field', 'Data Types and Counts', ))
if header not in formatted_type_counts_:
    formatted_type_counts_.insert(0, header)
make_table(formatted_type_counts_)
set_column_style(0, width='100', bold=False,)
set_column_style(1, width='100', color='hsla(225, 80%, 94%, 1)')
set_column_style(2, width='100', bold=True,)
display(HTML(significance_for_differences.table_title))
render()


Some fields have different data types.

Jump to what is the significance of this is.

Out[12]:
Collection&nbspNameFieldData&nbspTypes&nbspand&nbspCounts
valid_accidentsDistrict&nbspCode'int'&nbsptype&nbspcount:&nbsp7149
'str'&nbsptype&nbspcount:&nbsp6323
valid_accidentsReport&nbspArea&nbspDescription'int'&nbsptype&nbspcount:&nbsp7163
'str'&nbsptype&nbspcount:&nbsp6309
valid_accidentsRoad&nbspType&nbspCode'int'&nbsptype&nbspcount:&nbsp13470
'str'&nbsptype&nbspcount:&nbsp2
valid_accidentsTime&nbspOccured'int'&nbsptype&nbspcount:&nbsp13470
'str'&nbsptype&nbspcount:&nbsp2
valid_accidentsIntersection'int'&nbsptype&nbspcount:&nbsp9
'str'&nbsptype&nbspcount:&nbsp13463
valid_accidentsTime&nbspCleared&nbspScene'int'&nbsptype&nbspcount:&nbsp13469
'str'&nbsptype&nbspcount:&nbsp3
valid_accidentsRoad&nbspCharacter'int'&nbsptype&nbspcount:&nbsp1
'str'&nbsptype&nbspcount:&nbsp13471
valid_accidentsIn/Near'int'&nbsptype&nbspcount:&nbsp1
'str'&nbsptype&nbspcount:&nbsp13471
valid_accidentsReport&nbspArea&nbspCode'str'&nbsptype&nbspcount:&nbsp13178
'float'&nbsptype&nbspcount:&nbsp294
valid_accidentsgeoy'str'&nbsptype&nbspcount:&nbsp6075
'float'&nbsptype&nbspcount:&nbsp7397
valid_accidentsTrafficway&nbspFlow&nbspCode'int'&nbsptype&nbspcount:&nbsp1
'str'&nbsptype&nbspcount:&nbsp13471
valid_accidentsRoad&nbspSurface&nbspCodes'int'&nbsptype&nbspcount:&nbsp1
'str'&nbsptype&nbspcount:&nbsp13471
valid_accidentsDevelopment&nbspType&nbspCode'int'&nbsptype&nbspcount:&nbsp1
'str'&nbsptype&nbspcount:&nbsp13471
valid_accidentsZone&nbspCode'int'&nbsptype&nbspcount:&nbsp7397
'str'&nbsptype&nbspcount:&nbsp6075
valid_accidentsLocation&nbspof&nbspImpact&nbspCode'int'&nbsptype&nbspcount:&nbsp1373
'str'&nbsptype&nbspcount:&nbsp12099
valid_accidentsLocality&nbspCode'int'&nbsptype&nbspcount:&nbsp1
'str'&nbsptype&nbspcount:&nbsp13471
valid_accidentsTime&nbspReported'int'&nbsptype&nbspcount:&nbsp13465
'str'&nbsptype&nbspcount:&nbsp7
valid_accidentsgeox'str'&nbsptype&nbspcount:&nbsp6075
'float'&nbsptype&nbspcount:&nbsp7397
valid_accidentsStreet'int'&nbsptype&nbspcount:&nbsp12
'str'&nbsptype&nbspcount:&nbsp13460
valid_citationsApartment/Floor'int'&nbsptype&nbspcount:&nbsp4524
'str'&nbsptype&nbspcount:&nbsp40861
valid_citationsDistrict&nbspCode'int'&nbsptype&nbspcount:&nbsp43344
'str'&nbsptype&nbspcount:&nbsp2041
valid_citationsCase&nbspNumber'Int64'&nbsptype&nbspcount:&nbsp15601
'int'&nbsptype&nbspcount:&nbsp3
'str'&nbsptype&nbspcount:&nbsp29781
valid_citationsCourt&nbspAddress'int'&nbsptype&nbspcount:&nbsp3
'str'&nbsptype&nbspcount:&nbsp45382
valid_citationsReporting&nbspArea&nbspCode'str'&nbsptype&nbspcount:&nbsp44410
'float'&nbsptype&nbspcount:&nbsp975
valid_citationsStreet&nbspNumber'int'&nbsptype&nbspcount:&nbsp42784
'str'&nbsptype&nbspcount:&nbsp2601
valid_citationsStreet&nbspName'int'&nbsptype&nbspcount:&nbsp1
'str'&nbsptype&nbspcount:&nbsp45384
valid_citationsTime&nbspEntered'int'&nbsptype&nbspcount:&nbsp45383
'str'&nbsptype&nbspcount:&nbsp2
valid_citationsage'int'&nbsptype&nbspcount:&nbsp45325
'str'&nbsptype&nbspcount:&nbsp60
valid_citationsgeoy'str'&nbsptype&nbspcount:&nbsp774
'float'&nbsptype&nbspcount:&nbsp44611
valid_citationsEmployee&nbspDivision&nbspCode'int'&nbsptype&nbspcount:&nbsp2
'str'&nbsptype&nbspcount:&nbsp45383
valid_citationsReporting&nbspArea&nbspDesc'int'&nbsptype&nbspcount:&nbsp43221
'str'&nbsptype&nbspcount:&nbsp2162
'float'&nbsptype&nbspcount:&nbsp2
valid_citationsZone&nbspCode'int'&nbsptype&nbspcount:&nbsp44721
'str'&nbsptype&nbspcount:&nbsp664
valid_citationstract'int'&nbsptype&nbspcount:&nbsp1
'str'&nbsptype&nbspcount:&nbsp45384
valid_citationsgeox'str'&nbsptype&nbspcount:&nbsp774
'float'&nbsptype&nbspcount:&nbsp44611

In [13]:
display(HTML(significance_for_differences.significance_anchor))


The mongoengine.Document class has attributes that are instances of mongoengine.Field classes. When the document is saved, values are validated. This means that the data types of the fields must be the same throughout.

So I wrote some code to dynamically calculate which mongoengine.Field class makes the most sense.

For some, I used the type with the highest frequency. The assumption of highest frequency for a type being the correct type was incorrect! For example, field 'District Code' has more integer types than strings. A string type is more appropriate though because the values of type string have letters in them and cannot be sanely cast to integers.

The fix.

I wrote a module that maps data types to mongoengine.Field classes. It even includes some custom functions that modify some of the values so that they may be cast into their appropriate data type.

For example, the cast_spaced_time function takes a value that is supposed to be a time, e.g. '1 30' and turns it into an integer: 130.


In [14]:
from flpd_helper.documents import cast_spaced_time

new_value = cast_spaced_time()('2 40')  # closure that it imitates a Class
print(new_value)


240

Recap of actions taken on data so far on the data.

Import csv data.

Result is a database named 'app' with collections of data taken from the CSV files.

Remove empty rows and created (creation of validated_accidents, validated_citations) collections.

Create new collections using extended mongoengine.Document classes.