In [ ]:
import pprint
import pandas as pd
import numpy as np
import csv
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from IPython.display import Image, display
In [ ]:
Image(filename='LaReunion.png', width=300, height=300)
In [ ]:
from openstreetmap import audit, shape
from openstreetmap.data_gouv_fr import postalcode, fantoir
In [ ]:
OSM_FILE = "data/Saint-Joseph.La-Reunion.osm"
In [ ]:
!openstreetmap/tags.py -o data/Saint-Joseph.La-Reunion.osm
A few tags will be ignored for the rest of the cleansing process, as their names contain problematic characters.
From the remaining tags, the auditing will focus on the following ones (to illustrate the different checks that could be put in place for cleansing): city, postal code, street name and street type, which will be checked against official data sources (see references in audit.py). Other tags will be included "as-is".
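The kind of character check performed by tags.py can be sketched as follows (a minimal sketch: the regular expression and the helper are assumptions, not necessarily what tags.py actually uses):
In [ ]:
import re
import xml.etree.cElementTree as ET

# Characters assumed to be problematic as MongoDB field names
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def problematic_tag_keys(osm_file):
    """Return the set of tag keys that contain problematic characters."""
    keys = set()
    for _, element in ET.iterparse(osm_file):
        if element.tag == "tag":
            k = element.attrib.get("k", "")
            if PROBLEMCHARS.search(k):
                keys.add(k)
        element.clear()  # keep memory usage low while streaming the file
    return keys

# Example: problematic_tag_keys(OSM_FILE)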
In [ ]:
# Official databases are filtered to limit the data to the area of interest, aka 974**
# All tag values not compliant with the defined rules are listed in the associated files under the audit folder
# The different files will be manually checked and updated for the next step of the cleansing process
!openstreetmap/audit.py -i -o data/Saint-Joseph.La-Reunion.osm -f data/FANTOIR1016 -a 974 -u audit
From the auditing, the following problems have been identified:
In [ ]:
!openstreetmap/shape.py -o data/Saint-Joseph.La-Reunion.osm -u update
In [ ]:
# Drop previous database if any
!mongo OpenStreetMap --eval "db.dropDatabase()"
In [ ]:
!mongoimport -d OpenStreetMap -c LaReunion --file data/Saint-Joseph.La-Reunion.osm.json
In [ ]:
from pymongo import MongoClient
from bson.son import SON
client = MongoClient()
def pretty(documents):
    for document in documents:
        pprint.pprint(document)
In [ ]:
db = client.OpenStreetMap
Let's peek at the sample document previously selected in our Python cleansing code.
In [ ]:
pretty(db.LaReunion.find({"id": "3480487005"}))
Check that we have the same count as the result of our Python cleansing code.
In [ ]:
db.LaReunion.count() == 801017
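The 801017 figure can also be recomputed from the import file itself (a sketch, assuming shape.py writes one JSON document per line, which is the format the plain mongoimport call above expects):
In [ ]:
# Count the JSON documents written by shape.py (assumed: one document per line)
with open("data/Saint-Joseph.La-Reunion.osm.json") as f:
    json_documents = sum(1 for line in f if line.strip())
json_documents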
Count the number of imported documents per XML node type.
In [ ]:
pipeline = [
    {"$unwind": "$type"},
    {"$group": {"_id": "$type", "count": {"$sum": 1}}},
    {"$sort": SON([("count", -1), ("_id", -1)])}
]
nodes = pd.DataFrame(list(db.LaReunion.aggregate(pipeline)))
nodes.columns = ["node", "count"]
nodes.head()
How many different users have contributed to this database?
In [ ]:
len(db.LaReunion.find().distinct("created.user"))
When did the contribution take place?
In [ ]:
pipeline = [
    {"$project": {"_id": False, "timestamp": "$created.timestamp", "user": "$created.user" } },
    {"$group": {"_id": "$timestamp", "users": { "$sum": 1 }}}
]
contributions = pd.DataFrame(list(db.LaReunion.aggregate(pipeline)))
contributions["_id"] = pd.to_datetime(contributions._id)
contributions.columns = ["date", "users"]
axes = contributions.set_index(['date']).plot(figsize=(12,6), title="Number of user contributions by date")
We have quite recent contributions for this dataset; let's identify the top 10 contributors.
In [ ]:
pipeline = [
    { "$project": { "_id": False, "user": "$created.user" } },
    { "$group": { "_id": "$user", "count": { "$sum": 1 } } },
    { "$sort": SON([("count", -1), ("_id", -1)]) },
    { "$limit": 10 }
]
pretty(list(db.LaReunion.aggregate(pipeline)))
We can consider having 1 top contributor followed by 5 high contributors... let's see the distribution of contributions across all users.
In [ ]:
pipeline = [
    { "$project": { "_id": False, "user": "$created.user" } },
    { "$group": { "_id": "$user", "count": { "$sum": 1 } } },
    { "$sort": SON([("count", -1), ("_id", -1)]) },
    { "$project": { "_id": "$count"} }
]
contributions = pd.DataFrame(list(db.LaReunion.aggregate(pipeline)))
contributions.describe().T
The above output clearly shows that we have outliers in terms of user contributions... let's see if MongoDB can provide the distribution of all users' contributions with predefined bins.
In [ ]:
pipeline = [
    { "$project": { "_id": False, "user": "$created.user" } },
    { "$group": { "_id": "$user", "count": { "$sum": 1 } } },
    {
        "$bucket": {
            "groupBy": "$count",
            "boundaries": [ 1, 100, 10000, 25000, 100000 ],
            "default": "TOP",
            "output": {
                "count": { "$sum": 1 },
                "users": { "$push": "$_id" }
            }
        }
    }
]
contributions = pd.DataFrame(list(db.LaReunion.aggregate(pipeline)))
contributions
The result confirms that most users have fewer than 100 contributions, and that the main contributors consist of 5 high contributors plus the TOP contributor.
First, focus on all documents with type='node' that have an 'address' subdocument.
In [ ]:
pipeline = [
    { "$match": { "type": "node", "address": { "$exists": True } } },
    { "$project": { "_id": False,
                    "city": "$address.city",
                    "housenumber": "$address.housenumber",
                    "postcode": "$address.postcode",
                    "street": "$address.street" } },
    #{ "$limit": 10 }
]
addresses = pd.DataFrame(list(db.LaReunion.aggregate(pipeline)))
addresses.shape
In [ ]:
addresses.describe()
There is some missing data for all selected attributes (with one NaN for postcode).
In [ ]:
from openstreetmap.data_gouv_fr import postalcode
codes = postalcode.PostalCode("data/laposte_hexasmal.csv").localityByPostcode()
addresses.loc[:, "city"] = addresses.postcode.dropna().apply(lambda x: codes[int(x)].title())
addresses[(addresses.city.isnull() == True) & (addresses.postcode.isnull() == False)].city.count()
In [ ]:
addresses.city.unique()
One remaining issue: a NaN value that cannot be updated without extra information.
In [ ]:
addresses.set_index("postcode")["city"].dropna().to_dict()
The above code addressed most of the remaining issues for the city and postcode fields that were not handled during the initial cleansing step. What about the street field? What issues remain there?
In [ ]:
# To compare with the official database, some French characters have to be translated to their ASCII equivalents
# (a pity for a French official database)
xtd = {ord(u'’'): u"'", ord(u'é'): u'e', ord(u'è'): u'e', ord(u'É'): u'E',}
def tr(x):
    return x.translate(xtd).upper()
ways_referential = fantoir.FANTOIR().ways("data/FANTOIR1016", "974")
addresses["CHECKED"] = addresses.street.dropna().apply(lambda x: tr(x) in ways_referential["FULL_NAME"].values)
In [ ]:
# Generate a step 2 mapping file for manual street name updates... the updated file is to be moved to the update folder
mapping = addresses[addresses.CHECKED == False]["street"].unique()
df = pd.DataFrame.from_dict({ "OLD": mapping, "NEW": mapping })
df.to_csv("audit/street_clean_step2.csv",
          encoding='utf-8',
          index=False,
          quoting=csv.QUOTE_ALL)
Let's load the updated file... and apply the mapping.
Note: not all streets will be updated (limited for illustration purposes only).
In [ ]:
# Map original (OLD) street names to their manually corrected (NEW) form
updated_mapping = pd.read_csv("update/street_clean_step2.csv", encoding='utf-8').set_index("OLD")["NEW"].to_dict()
addresses.loc[:, "street"] = addresses.street.apply(lambda x: x if x not in updated_mapping
                                                    else updated_mapping[x])
We have identified some rules to (partially) update the addresses, so why not update the MongoDB database...
Postal Code and City
In [ ]:
for n in db.LaReunion.find({ "type": "node",
                             "address": { "$exists": True },
                             "address.postcode": { "$exists": True } }):
    postcode = n["address"]["postcode"].replace(' ', '')
    db.LaReunion.update_one({ "_id": n["_id"] }, { "$set": { "address.postcode": postcode } }, upsert=False)
In [ ]:
for n in db.LaReunion.find({ "type": "node",
                             "address": { "$exists": True },
                             "address.postcode": { "$exists": True } }):
    city = codes[int(n["address"]["postcode"])]
    db.LaReunion.update_one({ "_id": n["_id"] }, { "$set": { "address.city": city } }, upsert=False)
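The street mapping from step 2 could be pushed back to the database in the same way (a sketch, assuming updated_mapping maps the original street names to their corrected form, as loaded above):
In [ ]:
# Apply the manually reviewed street mapping to the node documents
for n in db.LaReunion.find({ "type": "node", "address.street": { "$exists": True } }):
    street = n["address"]["street"]
    if street in updated_mapping and updated_mapping[street] != street:
        db.LaReunion.update_one({ "_id": n["_id"] },
                                { "$set": { "address.street": updated_mapping[street] } },
                                upsert=False)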
In [ ]:
pipeline = [
    { "$match": { "type": "node", "address": { "$exists": True }, "address.postcode": { "$exists": True } } },
    { "$project": { "_id": False,
                    "city": "$address.city",
                    "housenumber": "$address.housenumber",
                    "postcode": "$address.postcode",
                    "street": "$address.street" } },
    { "$limit": 10 }
]
pd.DataFrame(list(db.LaReunion.aggregate(pipeline)))
In [ ]:
pretty(list(db.LaReunion.find({
    "type": "node",
    "address": { "$exists": True },
    "address.city": { "$exists": False },
    "address.postcode": { "$exists": False }}).limit(1)))
Still some cleaning remains to be done... but it is not possible with the information we have in hand.
The dataset is far from complete and accurate. The different steps followed during the cleansing process have demonstrated that the dataset can be partially completed and made more accurate with some simple rules and correlation with external databases (for the French area, the official databases are publicly available or referenced from http://data.gouv.fr).
The cleansing process described above did not deliver a clean and fully accurate database, as a lot of specific rules would need to be put in place. The number of different contributors probably influenced the way street names were populated into the database. This is clearly an area where an automated validation process (run when changes/updates are integrated) could improve quality: such an automated process could rely (for the French area) on the official ways referential available from http://data.gouv.fr (references to such databases are listed in the source code generated for this project, which includes information for postcode, city and street name).
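Such an automated check could reuse the FANTOIR referential already loaded above (a sketch relying on the tr helper and ways_referential defined earlier):
In [ ]:
def is_known_way(street_name):
    # Accept a street name only if its normalized form exists in the FANTOIR referential
    return tr(street_name) in ways_referential["FULL_NAME"].values

# Example: an edit could be flagged for review when is_known_way(new_street) is False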
In addition to the referential databases, some standard regular expressions (for phone numbers, street names, postcodes, ...) could be defined (and customized by country) and applied when changes/updates are integrated, to reject or validate user inputs.
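As an illustration, a few such rules for this area could look like the following (a sketch; the patterns are assumptions and would need tuning per country):
In [ ]:
import re

# Assumed validation patterns for La Reunion (974) inputs
VALIDATORS = {
    "addr:postcode": re.compile(r"^974\d{2}$"),   # all postcodes of La Reunion start with 974
    "phone": re.compile(r"^(?:\+262|0)\d{9}$"),   # +262 or 0 followed by 9 digits
    "addr:street": re.compile(r"^\S(?:.*\S)?$"),  # non-empty, no leading/trailing whitespace
}

def validate(tag, value):
    """Return True when no rule is defined for the tag or the value matches the rule."""
    pattern = VALIDATORS.get(tag)
    return pattern is None or bool(pattern.match(value))

# Example: validate("addr:postcode", "97480") -> True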
Another improvement that could lead to better quality and standardization of the dataset is the use of a standard ontology. Possible ways to move in that direction:
1. http://wiki.openstreetmap.org/wiki/OSMonto: the main advantage is that it stays very close to the existing OpenStreetMap tags, but it will not easily connect to other sources of information without additional development.
2. Align to a well-defined RDF schema such as https://schema.org/Place or equivalent: the main advantage is the interoperability of the different databases, allowing cross-validation of the data.
For the second option, the changes and migration effort would be significant.
In [ ]: