In [ ]:
import pprint
import pandas as pd
import numpy as np
import csv
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from IPython.display import Image, display
In [ ]:
Image(filename='LaReunion.png', width=300, height=300)
In [ ]:
from openstreetmap import audit, shape
from openstreetmap.data_gouv_fr import postalcode, fantoir
In [ ]:
OSM_FILE = "data/Saint-Joseph.La-Reunion.osm"
In [ ]:
!openstreetmap/tags.py -o data/Saint-Joseph.La-Reunion.osm
A few tags will be ignored for the rest of the cleansing process, as their names contain problematic characters.
From the remaining tags, the auditing will focus on the following ones (to illustrate the different checks that could be put in place for cleansing): city, postal code, street name and street type, which will be checked against official data sources (see references in audit.py). Other tags will be included "as-is".
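The kind of character check performed by tags.py can be sketched as follows (a minimal sketch: the regular expression and the helper are assumptions, not necessarily what tags.py actually uses):
In [ ]:
import re
import xml.etree.cElementTree as ET

# Characters assumed to be problematic as MongoDB field names
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def problematic_tag_keys(osm_file):
    """Return the set of tag keys that contain problematic characters."""
    keys = set()
    for _, element in ET.iterparse(osm_file):
        if element.tag == "tag":
            k = element.attrib.get("k", "")
            if PROBLEMCHARS.search(k):
                keys.add(k)
        element.clear()  # keep memory usage low while streaming the file
    return keys

# Example: problematic_tag_keys(OSM_FILE)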
In [ ]:
# Official databases are filtered to limit the data to the area of interest, aka 974**
# All tag values not compliant with the defined rules are listed in the associated files under the audit folder
# The different files will be manually checked and updated for the next step of the cleansing process
!openstreetmap/audit.py -i -o data/Saint-Joseph.La-Reunion.osm -f data/FANTOIR1016 -a 974 -u audit
From the auditing, the following problems have been identified:
In [ ]:
!openstreetmap/shape.py -o data/Saint-Joseph.La-Reunion.osm -u update
In [ ]:
# Drop previous database if any
!mongo OpenStreetMap --eval "db.dropDatabase()"
In [ ]:
!mongoimport -d OpenStreetMap -c LaReunion --file data/Saint-Joseph.La-Reunion.osm.json
In [ ]:
from pymongo import MongoClient
from bson.son import SON
client = MongoClient()
def pretty(documents):
    for document in documents:
        pprint.pprint(document)
In [ ]:
db = client.OpenStreetMap
Let's peek at the sample document previously selected in our Python cleansing code.
In [ ]:
pretty(db.LaReunion.find({"id": "3480487005"}))
Check that we have the same count as the result of our Python cleansing code.
In [ ]:
db.LaReunion.count() == 801017
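The 801017 figure can also be recomputed from the import file itself (a sketch, assuming shape.py writes one JSON document per line, which is the format the plain mongoimport call above expects):
In [ ]:
# Count the JSON documents written by shape.py (assumed: one document per line)
with open("data/Saint-Joseph.La-Reunion.osm.json") as f:
    json_documents = sum(1 for line in f if line.strip())
json_documents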
Count the number of imported documents per XML node type.
In [ ]:
pipeline = [
    {"$unwind": "$type"},
    {"$group": {"_id": "$type", "count": {"$sum": 1}}},
    {"$sort": SON([("count", -1), ("_id", -1)])}
]
nodes = pd.DataFrame(list(db.LaReunion.aggregate(pipeline)))
nodes.columns = ["node", "count"]
nodes.head()
How many different users have contributed to this database?
In [ ]:
len(db.LaReunion.find().distinct("created.user"))
When did the contribution take place?
In [ ]:
pipeline = [
    {"$project": {"_id": False, "timestamp": "$created.timestamp", "user": "$created.user" } },
    {"$group": {"_id": "$timestamp", "users": { "$sum": 1 }}}
]
contributions = pd.DataFrame(list(db.LaReunion.aggregate(pipeline)))
contributions["_id"] = pd.to_datetime(contributions._id)
contributions.columns = ["date", "users"]
axes = contributions.set_index(['date']).plot(figsize=(12,6), title="Number of user contributions by date")
We have quite recent contributions for this dataset; let's identify the top 10 contributors.
In [ ]:
pipeline = [
    { "$project": { "_id": False, "user": "$created.user" } },
    { "$group": { "_id": "$user", "count": { "$sum": 1 } } },
    { "$sort": SON([("count", -1), ("_id", -1)]) },
    { "$limit": 10 }
]
pretty(list(db.LaReunion.aggregate(pipeline)))
We can consider having 1 top contributor followed by 5 high contributors... let's see the distribution of contributions across all users.
In [ ]:
pipeline = [
    { "$project": { "_id": False, "user": "$created.user" } },
    { "$group": { "_id": "$user", "count": { "$sum": 1 } } },
    { "$sort": SON([("count", -1), ("_id", -1)]) },
    { "$project": { "_id": "$count"} }
]
contributions = pd.DataFrame(list(db.LaReunion.aggregate(pipeline)))
contributions.describe().T
The above output clearly shows that we have outliers in terms of user contributions... let's see if MongoDB can provide the distribution of all users' contributions with predefined bins.
In [ ]:
pipeline = [
    { "$project": { "_id": False, "user": "$created.user" } },
    { "$group": { "_id": "$user", "count": { "$sum": 1 } } },
    {
        "$bucket": {
            "groupBy": "$count",
            "boundaries": [ 1, 100, 10000, 25000, 100000 ],
            "default": "TOP",
            "output": {
                "count": { "$sum": 1 },
                "users": { "$push": "$_id" }
            }
        }
    }
]
contributions = pd.DataFrame(list(db.LaReunion.aggregate(pipeline)))
contributions
The result confirms that most users have fewer than 100 contributions, and that the main contributors consist of 5 high contributors plus the TOP contributor.
First, focus on all documents with type='node' that have an 'address' subdocument.
In [ ]:
pipeline = [
    { "$match": { "type": "node", "address": { "$exists": True } } },
    { "$project": { "_id": False,
                    "city": "$address.city",
                    "housenumber": "$address.housenumber",
                    "postcode": "$address.postcode",
                    "street": "$address.street" } },
    #{ "$limit": 10 }
]
addresses = pd.DataFrame(list(db.LaReunion.aggregate(pipeline)))
addresses.shape
In [ ]:
addresses.describe()
There is some missing data for all selected attributes (with one NaN for postcode).
In [ ]:
from openstreetmap.data_gouv_fr import postalcode
codes = postalcode.PostalCode("data/laposte_hexasmal.csv").localityByPostcode()
addresses.loc[:, "city"] = addresses.postcode.dropna().apply(lambda x: codes[int(x)].title())
addresses[(addresses.city.isnull() == True) & (addresses.postcode.isnull() == False)].city.count()
In [ ]:
addresses.city.unique()
One remaining issue: a NaN value that cannot be updated without extra information.
In [ ]:
addresses.set_index("postcode")["city"].dropna().to_dict()
The above code addressed most of the remaining issues for the city and postcode fields that were not handled during the initial cleansing step. What about the street field? What issues remain there?
In [ ]:
# To compare with the official database, some French characters have to be translated to their ASCII equivalents
# (a pity for a French official database)
xtd = {ord(u'’'): u"'", ord(u'é'): u'e', ord(u'è'): u'e', ord(u'É'): u'E',}
def tr(x):
    return x.translate(xtd).upper()
ways_referential = fantoir.FANTOIR().ways("data/FANTOIR1016", "974")
addresses["CHECKED"] = addresses.street.dropna().apply(lambda x: tr(x) in ways_referential["FULL_NAME"].values)
In [ ]:
# Generate a step 2 mapping file for manual street name updates... the updated file is to be moved to the update folder
mapping = addresses[addresses.CHECKED == False]["street"].unique()
df = pd.DataFrame.from_dict({ "OLD": mapping, "NEW": mapping })
df.to_csv("audit/street_clean_step2.csv",
          encoding='utf-8',
          index=False,
          quoting=csv.QUOTE_ALL)
Let's load the updated file... and apply the mapping.
Note: not all streets will be updated (limited for illustration purposes only).
In [ ]:
# Map original (OLD) street names to their manually corrected (NEW) form
updated_mapping = pd.read_csv("update/street_clean_step2.csv", encoding='utf-8').set_index("OLD")["NEW"].to_dict()
addresses.loc[:, "street"] = addresses.street.apply(lambda x: x if x not in updated_mapping
                                                    else updated_mapping[x])
We have identified some rules to (partially) update the addresses, so why not update the MongoDB database...
Postal Code and City
In [ ]:
for n in db.LaReunion.find({ "type": "node",
                             "address": { "$exists": True },
                             "address.postcode": { "$exists": True } }):
    postcode = n["address"]["postcode"].replace(' ', '')
    db.LaReunion.update_one({ "_id": n["_id"] }, { "$set": { "address.postcode": postcode } }, upsert=False)
In [ ]:
for n in db.LaReunion.find({ "type": "node",
                             "address": { "$exists": True },
                             "address.postcode": { "$exists": True } }):
    city = codes[int(n["address"]["postcode"])]
    db.LaReunion.update_one({ "_id": n["_id"] }, { "$set": { "address.city": city } }, upsert=False)
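The street mapping from step 2 could be pushed back to the database in the same way (a sketch, assuming updated_mapping maps the original street names to their corrected form, as loaded above):
In [ ]:
# Apply the manually reviewed street mapping to the node documents
for n in db.LaReunion.find({ "type": "node", "address.street": { "$exists": True } }):
    street = n["address"]["street"]
    if street in updated_mapping and updated_mapping[street] != street:
        db.LaReunion.update_one({ "_id": n["_id"] },
                                { "$set": { "address.street": updated_mapping[street] } },
                                upsert=False)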
In [ ]:
pipeline = [
    { "$match": { "type": "node", "address": { "$exists": True }, "address.postcode": { "$exists": True } } },
    { "$project": { "_id": False,
                    "city": "$address.city",
                    "housenumber": "$address.housenumber",
                    "postcode": "$address.postcode",
                    "street": "$address.street" } },
    { "$limit": 10 }
]
pd.DataFrame(list(db.LaReunion.aggregate(pipeline)))
In [ ]:
pretty(list(db.LaReunion.find({
    "type": "node",
    "address": { "$exists": True },
    "address.city": { "$exists": False },
    "address.postcode": { "$exists": False }}).limit(1)))
Still some cleaning remains to be done... but it is not possible with the information we have in hand.
The dataset is far from complete and accurate. The different steps followed during the cleansing process have demonstrated that the dataset can be partially completed and made more accurate with some simple rules and correlation with external databases (for the French area, the official databases are publicly available or referenced from http://data.gouv.fr).
The cleansing process described above did not deliver a clean and fully accurate database, as a lot of specific rules would need to be put in place. The number of different contributors probably influenced the way street names were populated into the database. This is clearly an area where an automated validation process (run when changes/updates are integrated) could improve quality: such an automated process could rely (for the French area) on the official ways referential available from http://data.gouv.fr (references to such databases are listed in the source code generated for this project, which includes information for postcode, city and street name).
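Such an automated check could reuse the FANTOIR referential already loaded above (a sketch relying on the tr helper and ways_referential defined earlier):
In [ ]:
def is_known_way(street_name):
    # Accept a street name only if its normalized form exists in the FANTOIR referential
    return tr(street_name) in ways_referential["FULL_NAME"].values

# Example: an edit could be flagged for review when is_known_way(new_street) is False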
In addition to the referential databases, some standard regular expressions (for phone numbers, street names, postcodes, ...) could be defined (and customized by country) and applied when changes/updates are integrated, to reject or validate user inputs.
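As an illustration, a few such rules for this area could look like the following (a sketch; the patterns are assumptions and would need tuning per country):
In [ ]:
import re

# Assumed validation patterns for La Reunion (974) inputs
VALIDATORS = {
    "addr:postcode": re.compile(r"^974\d{2}$"),   # all postcodes of La Reunion start with 974
    "phone": re.compile(r"^(?:\+262|0)\d{9}$"),   # +262 or 0 followed by 9 digits
    "addr:street": re.compile(r"^\S(?:.*\S)?$"),  # non-empty, no leading/trailing whitespace
}

def validate(tag, value):
    """Return True when no rule is defined for the tag or the value matches the rule."""
    pattern = VALIDATORS.get(tag)
    return pattern is None or bool(pattern.match(value))

# Example: validate("addr:postcode", "97480") -> True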
Another improvement that could lead to better quality and standardization of the dataset is the use of a standard ontology. Possible ways to move in that direction:
1. http://wiki.openstreetmap.org/wiki/OSMonto: the main advantage is that it stays very close to the existing OpenStreetMap tags, but it will not easily connect to other sources of information without additional development.
2. Align to a well-defined RDF schema such as https://schema.org/Place or equivalent: the main advantage is the interoperability of the different databases, allowing cross-validation of the data.
For the second option, the changes and migration effort would be significant.
In [ ]: