Florina Georgescu - Airbus Operations SL
Github: https://github.com/inageorgescu
Map Area: Málaga, Spain
In [4]:
from IPython.display import Image
Image("Malaga_map.jpg")
Out[4]:
In [5]:
Image("workflow.jpg")
Out[5]:
In [6]:
#Import all necesary modules as follows:
#import flexible container object, designed to store hierarchical data structures in memory
import xml.etree.cElementTree as ET
#import function to supply missing values
from collections import defaultdict
#import Regular expression operations
import re
#import “pretty-print” arbitrary Python data structures in a form which can be used as input to the interpreter
import pprint
#import Codec registry and base classes
import codecs
#import JSON encoder and decoder
import json
import pymongo
import os
In [7]:
#Create a sample file in order to perform the tests and be able to visualize easily
OSM_FILE = "Malaga.osm" # OSM file for Malaga
SAMPLE_FILE = "sample.osm"
k = 10 # Parameter: take every k-th top level element
def get_element(osm_file, tags=('node', 'way', 'relation')):
"""Yield element if it is the right type of tag
Reference:
http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-file-generated-via-xml-etree-elementtree-in-python
"""
context = iter(ET.iterparse(osm_file, events=('start', 'end')))
_, root = next(context)
for event, elem in context:
if event == 'end' and elem.tag in tags:
yield elem
root.clear()
with open(SAMPLE_FILE, 'wb') as output:
output.write('<?xml version="1.0" encoding="UTF-8"?>\n')
output.write('<osm>\n ')
# Write every kth top level element
for i, element in enumerate(get_element(OSM_FILE)):
if i % k == 0:
output.write(ET.tostring(element, encoding='utf-8'))
output.write('</osm>')
The data that I will audit is a OSM file for a given map, in this case Málaga. OSM XML is list of instances of data (nodes, ways, and relations) that represent points on the map. Ways contain node references to form either a polyline or polygon on the map. Nodes and ways both contain children tag elements that represent key value pairs of descriptive information about a given node or way.
After using the a sample file of the Malaga area and running it against a provisional audit_atributes.py file, I noticed three main problems with the data, which I will tackle:
In Europe the street type is written after the name of the street and must have a standard naming. It is common to use abreviation for it therefore you can have same street type called different ways e.g. "Avenida" can be founf as "AV". "AVDA" etc. This problem was programmatically solved by function update_names() before creating the database.
There are streets without a street type. This probelm can be solved only if you have a list with street names and their type. It is not our case therefore I will only list ( with the function audit()) the streets that meet this condition in order to identify that there is a problem in the map.
There are some postal codes that might have not been corectly introduced because they do not belong to the valid postal codes in the area or they are wornlgly introduced. This problem was programmatically solved by function update_postal_codes() before creating the database.
In [8]:
#Parse file and count number of unique element types
def count_tags(filename):
tags = {}
for event, elem in ET.iterparse(filename):
if elem.tag in tags:
tags[elem.tag] += 1
else:
tags[elem.tag] = 1
return tags
view_tags = count_tags(OSM_FILE)
# pprint.pprint(view_tags)
In [9]:
from IPython.display import display
from audit_atributes import audit, update_names, audit_pc, update_postal_codes
OSM_FILE ='./Malaga.osm'
CREATED = [ "version", "changeset", "timestamp", "user", "uid"]
street_types = audit(OSM_FILE)
# display(street_types)
In [31]:
#Update street type
updates = update_names(OSM_FILE)
# display(updates)
In [11]:
updates_pc = update_postal_codes(OSM_FILE)
# display(updates_pc)
In [12]:
import os
print "The downloaded OSM file is {} MB".format(os.path.getsize('Malaga.osm')/1.0e6) # convert from bytes to megabytes
In [13]:
from pymongo import MongoClient
db_name = 'openstreetmap1'
# Connect to Mongo DB
client = MongoClient('localhost:27017')
# Database 'openstreetmap' will be created
db = client[db_name]
collection = db['Malaga']
There is a code for running directly from python the mongoimport ( see https://github.com/bestkao/data-wrangling-with-openstreetmap-and-mongodb) but as I do not have administrator right for my computer I had to manually instroduce the command In order to create the data base the for the openstreet execute in console cmd the following code: mongoimport -h 127.0.0.1:27017 --db openstreetmap1 --collection Malaga --file C:\Users\c41237\Desktop\Udacity\P3\P3_OpenStreetMap\Malaga.osm.json
In [14]:
# Number of documents
documents = collection.find().count()
display(documents)
In [15]:
#Number of unique users
users=len(collection.distinct('created.user'))
display(users)
In [16]:
#number of nodes & ways
nodes=collection.find({'type':'node'}).count()
ways=collection.find({'type':'way'}).count()
display(nodes)
display(ways)
In [17]:
#disply types and number of nodes & ways
node_way = collection.aggregate([
{"$group" : {"_id" : "$type", "count" : {"$sum" : 1}}}])
pprint.pprint(list(node_way))
In [18]:
#top3 contributors to the map
top3 = collection.aggregate([{ '$group' : {'_id' : '$created.user',
'count' : { '$sum' : 1}}},
{ '$sort' : {'count' : -1}},
{ '$limit' : 3 }])
display(list(top3))
In [19]:
#number of documents with street addresses
addresses=collection.find({'address.street': {'$exists': 1}}).count()
print(addresses)
In [20]:
#top3 postal codes
top10_pc = collection.aggregate([{ '$group' : {'_id' : '$address.postcode',
'count' : { '$sum' : 1}}},
{ '$sort' : {'count' : -1}},
{ '$limit' : 3 }])
display(list(top10_pc))
In [21]:
#list of postal codes limited to 3
postal_codes = collection.aggregate([
{"$match" : {"address.postcode" : {"$exists" : 1}}}, \
{"$group" : {"_id" : "$address.postcode", "count" : {"$sum" : 1}}}, \
{"$sort" : {"count" : -1}}])
# pprint.pprint(list(postal_codes))
In [22]:
#list top 10 street names in data base
streets = collection.aggregate([
{"$match" : {"address.street" : {"$exists" : 1}}}, \
{"$group" : {"_id" : "$address.street", "count" : {"$sum" : 1}}}, \
{"$sort" : {"count" : -1}},
{"$limit":10}])
pprint.pprint(list(streets))
In [23]:
#Top 10 amenities
top10_amenities = collection.aggregate([{"$match":{"amenity":{"$exists":1}}},
{"$group":{"_id":"$amenity","count":{"$sum":1}}},
{"$sort":{"count":-1}},
{"$limit":10}])
display(list(top10_amenities))
In [24]:
#Top 5 building types
type_buildings = collection.aggregate([
{'$match': {'building': {'$exists': 1}}},
{'$group': { '_id': '$building','count': {'$sum': 1}}},
{'$sort': {'count': -1}}, {'$limit': 5}
])
pprint.pprint(list(type_buildings))
In [25]:
#Top 3 cult buildings
religion_buildings = collection.aggregate([
{"$match" : {"amenity" : "place_of_worship"}}, \
{"$group" : {"_id" : {"religion" : "$religion", "denomination" : "$denomination"}, "count" : {"$sum" : 1}}}, \
{"$sort" : {"count" : -1}}, {'$limit': 3}])
pprint.pprint(list(religion_buildings))
In [26]:
#Top 10 leisures
leisures = collection.aggregate([{"$match" : {"leisure" : {"$exists" : 1}}}, \
{"$group" : {"_id" : "$leisure", "count" : {"$sum" : 1}}}, \
{"$sort" : {"count" : -1}}, \
{"$limit" : 10}])
pprint.pprint(list(leisures))
In [32]:
#Top 10 univeristies
universities = collection.aggregate([
{"$match" : {"amenity" : "university"}}, \
{"$group" : {"_id" : {"name" : "$name"}, "count" : {"$sum" : 1}}}, \
{"$sort" : {"count" : -1}},
{"$limit":10}
])
# pprint.pprint(list(universities))
In [33]:
#Top 10 cuisines
restaurant = collection.aggregate([
{"$match":{"cuisine":{"$exists":1},"amenity":"restaurant"}},
{"$group":{"_id":"$cuisine","count":{"$sum":1}}},
{"$sort":{"count":-1}}, {"$limit":10} ])
# pprint.pprint(list(restaurant))
The data base needs to be updated with the missing street types as there are a large number without it.
In [29]:
noaddresses=collection.find({'address.street': {'$exists': 0}}).count()
display(noaddresses)
This can be done either by having the information first hand, entering the data one by one or using other map data such as https://www.google.es/maps or https://www.viamichelin.es/. If you go to one of this websites and you tipe in the name ( e.g. Gustavo Pittaluga you will identify that is a street type "Calle". One other option it will be to use the official governamental data base from the cadastral reference that can be found at this webpage https://www1.sedecatastro.gob.es/OVCFrames.aspx?TIPO=CONSULTA and we have available at hand the help page: http://www.catastro.meh.es/ayuda/ayuda_rc.htm
•the data comes form an official source (governement), therefore reliable and up to date. Any modification can be identifyed right ayway.
•the other two websites belong to the most used maps ( google and viamihelin) and you can also get a visual of their location on the map, street view etc.
As it can be observed that the maps have to be cleaned for any imput done by hand and not chosen from a list, therefore we might encounter some problems for this ideea.
•high volume of information therefore proun of type errors
•street name can be the same in two or more cities and the author shouldpay attention when looking for the street to introduce the right city.
Anyhow the task can be completed in this way with reliable information and in the case that the info for the street tyoe is not introduced corectly we can allways run the audit routine above to clean the data.
In [30]:
Image("improvement_openstreetmap.jpg")
Out[30]: