OpenStreetMap is an open project: it is free, and anyone can use it and edit it as they like. OpenStreetMap is a direct competitor of Google Maps. How can OpenStreetMap compete with such a giant, you ask? It depends entirely on crowdsourcing. Lots of people around the world willingly update the map, most of them fixing the map of their own country.
OpenStreetMap is powerful, but it relies heavily on human input, and that strength is also its weakness. Wherever there is human input, there will be human error, so the data is very error prone.
The problems encountered in the map:
Take street names, for example. People like to abbreviate the type of the street: 'Street' becomes 'St.' or 'st.'. In Indonesia, 'Jalan' (English: 'Street') is likewise abbreviated as Jln, jln, Jl, or jl. This may not bother most people, but a data scientist or web developer expects street names to follow one generic format.
'Jalan Sudirman' -> Jalan <name> -> name = Sudirman
'Jln Sudirman' -> Jalan <name> -> ERROR!
We also have inconsistent phone numbers:
{u'_id': u'021 500505'},
{u'_id': u'021-720-0981209'},
{u'_id': u'62 21 3923810'},
{u'_id': u'+62 21 723 8227'},
{u'_id': u'(021) 7180317'},
{u'_id': u'081807217074'},
{u'_id': u'+62 857 4231 9136'},
{u'_id': u'+6221 80872985'},
{u'_id': u'+62 81222229386'}
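To make the phone problem concrete, here is a minimal sketch of one way to bring such numbers into a single '+62 ...' form. The helper name normalize_phone is made up for this illustration, and it assumes that a leading 0 stands for the country code 62 and that 21 is the Jakarta area code; it is a simplified variant, not the exact function used in the audit script.

```python
import re

def normalize_phone(number):
    """Hypothetical helper: rewrite an Indonesian phone number as '+62 ...'."""
    digits = re.sub(r'\D', '', number)       # keep digits only
    digits = re.sub(r'^0', '62', digits)     # leading 0 -> country code 62
    if digits.startswith('6221'):            # area code 21 gets its own group
        return '+62 21 ' + digits[4:]
    if digits.startswith('62'):
        return '+62 ' + digits[2:]
    return digits                            # unknown shape: return the bare digits

# A few of the values listed above
samples = ['021 500505', '62 21 3923810', '(021) 7180317', '+62 857 4231 9136']
normalized = [normalize_phone(s) for s in samples]
for before, after in zip(samples, normalized):
    print('%s => %s' % (before, after))
```

The first three samples end up in the same '+62 21 ...' shape even though they were typed in three different styles.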
This project aims to fix that: it expands abbreviated names so that they follow one general form. This not only benefits professionals who consume the data, it also gives everyone more consistently structured names. I chose the whole area of Jakarta, the capital of Indonesia. This dataset is huge, with over 250,000 examples. Jakarta is my hometown, and I want to help the community.
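As a small illustration of the idea, the sketch below expands an abbreviated street type at the start of a name. The PREFIXES table and the expand_prefix helper are assumptions made for this example only; the audit script further down uses its own mapping and logic.

```python
import re

# Illustrative abbreviations for 'Jalan' (street) and 'Gang' (alley)
PREFIXES = {'jl': 'Jalan', 'jln': 'Jalan', 'gg': 'Gang'}
prefix_re = re.compile(r'^(\w+)\.?\s+')

def expand_prefix(name):
    """Hypothetical helper: expand an abbreviated leading street type."""
    m = prefix_re.match(name)
    if m and m.group(1).lower() in PREFIXES:
        return PREFIXES[m.group(1).lower()] + ' ' + name[m.end():]
    return name  # already spelled out, or no known prefix

for raw in ['Jln Sudirman', 'jl. Thamrin', 'Jalan Sudirman']:
    print('%s => %s' % (raw, expand_prefix(raw)))
```

Only the first word is looked at, so letters inside the street name itself are never touched.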
In [75]:
OSMFILE = 'dataset/jakarta.osm'
To audit the OSM file, we first need an overview of the data. To get one, we count how often each tag appears in the file.
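A minimal sketch of such a tag count, assuming the file is streamed with ElementTree's iterparse. The count_tags name and the tiny inline document are made up for illustration; a real run would pass 'dataset/jakarta.osm' instead of the sample.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from io import BytesIO

def count_tags(source):
    """Count how often each element tag occurs while streaming the file."""
    counts = Counter()
    for _, elem in ET.iterparse(source):   # default event is 'end'
        counts[elem.tag] += 1
        elem.clear()                       # children were already counted
    return counts

# Tiny made-up document standing in for the real OSM file
sample = BytesIO(b"""<osm>
  <node id="1" lat="-6.2" lon="106.8"><tag k="name" v="Jl Sudirman"/></node>
  <node id="2" lat="-6.3" lon="106.9"/>
  <way id="3"><nd ref="1"/><nd ref="2"/><tag k="highway" v="primary"/></way>
</osm>""")
tag_counts = count_tags(sample)
print(dict(tag_counts))
```

Streaming keeps memory use low, which matters for a file the size of the Jakarta extract.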
In [109]:
%%writefile 02-codes/audit.py
import xml.etree.ElementTree as ET
from collections import defaultdict
import re
import pprint
from optparse import OptionParser

# Defined in the script as well, because test() needs it when the file is run directly.
OSMFILE = 'dataset/jakarta.osm'
# OSMFILE = "sample.osm"
# OSMFILE = "example_audit.osm"

# In Indonesia the street type comes first, then the name,
# so the regex is anchored at the start of the string instead of the end.
# street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
street_type_re = re.compile(r'^\b\S+\.?', re.IGNORECASE)

# expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road",
#             "Trail", "Parkway", "Commons"]
expected = ['Jalan', 'Gang', 'Street', 'Road']

# UPDATE THIS VARIABLE
# The mapping keys have to be tried in descending order of length.
# Language: English-Indonesian, {Street: Jalan}.
# Example: Sudirman Street -> Jalan Sudirman
mapping = {
    'jl.': 'Jalan',
    'JL.': 'Jalan',
    'Jl.': 'Jalan',
    'GG': 'Gang',
    'gg': 'Gang',
    'jl': 'Jalan',
    'JL': 'Jalan',
    'Jl': 'Jalan',
}
def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)
            # return True if the name needs to be updated
            return True
    return False
def is_street_name(elem):
    """
    addr:full is treated as a street name too, so it also gets fixed.
    """
    return (elem.attrib['k'] == "addr:street") or (elem.attrib['k'] == "addr:full")

def is_name_is_street(elem):
    """Some people put the street name in k=name.
    Those values should be fixed as well."""
    s = street_type_re.search(elem.attrib['v'])
    return (elem.attrib['k'] == "name") and s and s.group() in mapping
def audit(osmfile):
    street_types = defaultdict(set)
    # Parse straight from the path so no file handle is left open.
    tree = ET.parse(osmfile)
    listtree = list(tree.iter())
    for elem in listtree:
        if elem.tag == "node" or elem.tag == "way":
            n_add = None
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    if audit_street_type(street_types, tag.attrib['v']):
                        # Update the tag attribute
                        tag.attrib['v'] = update_name(tag.attrib['v'], mapping)
                elif is_name_is_street(tag):
                    tag.attrib['v'] = update_name(tag.attrib['v'], mapping)
                    n_add = tag.attrib['v']
                elif tag.attrib['k'] == 'phone':
                    tag.attrib['v'] = update_phone(tag.attrib['v'])
            if n_add:
                # The name tag held a street name, so add it as addr:street too.
                elem.append(ET.Element('tag', {'k': 'addr:street', 'v': n_add}))
    # Write the audited data to <name>_audit.osm
    tree.write(osmfile[:osmfile.find('.osm')] + '_audit.osm')
    return street_types
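def update_phone(number):
    """Bring all the inconsistent phone numbers into one format."""
    # Drop everything that is not a letter or a digit
    stripped = re.sub(r'[^A-Za-z0-9]+', '', number)
    # A leading 0 stands for the country code 62
    replace0to62 = re.sub(r'^0', '62', stripped)
    # Put a space after the area code 21
    separate_area_code = re.sub(r'^6221', '6221 ', replace0to62)
    # Write the country code as '+62 '
    tidy_country_code = re.sub(r'^62', '+62 ', separate_area_code)
    fixed = tidy_country_code
    return fixed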
def update_name(name, mapping):
    """
    Expand an abbreviated street type so that the names are uniform.
    The keys are tried longest first, so that 'Jl.' is matched before 'Jl'.
    """
    dict_map = sorted(mapping.keys(), key=len, reverse=True)
    for key in dict_map:
        # Match the abbreviation only as the leading word; a plain str.replace
        # would also rewrite letters inside the name (the 'gg' in 'Anggrek').
        if name == key or name.startswith(key + ' '):
            return mapping[key] + name[len(key):]
    # In Indonesia every kind of street is addressed as 'Jalan',
    # so a name without any prefix gets 'Jalan' prepended.
    return 'Jalan ' + name
def test():
    st_types = audit(OSMFILE)
    # pprint.pprint(dict(st_types))
    # assert len(st_types) == 3
    # for st_type, ways in st_types.iteritems():
    #     for name in ways:
    #         better_name = update_name(name, mapping)
    #         print name, "=>", better_name

if __name__ == '__main__':
    test()
    # parser = OptionParser()
    # parser.add_option('-d', '--data', dest='audited_data', help='osm data that want to be audited')
    # (opts,args) = parser.parse_args()
    # audit(opts.audited_data)
This saves the audited Jakarta OSM data to jakarta_audit.osm. Now let's prepare the audited file so it can be loaded into a MongoDB instance.
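Before the full script, here is a simplified sketch of the document shape we are aiming for. The to_document helper and the sample node, including its values, are made up for illustration; the script below handles more cases, such as the node references of ways.

```python
import xml.etree.ElementTree as ET

CREATED = ["version", "changeset", "timestamp", "user", "uid"]

def to_document(elem):
    """Simplified, hypothetical shaping of one element into a dict."""
    doc = {'type': elem.tag, 'created': {}}
    for key, value in elem.attrib.items():
        if key in CREATED:
            doc['created'][key] = value      # edit metadata goes under 'created'
        elif key not in ('lat', 'lon'):
            doc[key] = value
    if 'lat' in elem.attrib and 'lon' in elem.attrib:
        doc['pos'] = [float(elem.attrib['lat']), float(elem.attrib['lon'])]
    for tag in elem.iter('tag'):
        k, v = tag.attrib['k'], tag.attrib['v']
        if k.startswith('addr:'):
            parts = k.split(':')
            if len(parts) == 2:              # skip keys with a second ':'
                doc.setdefault('address', {})[parts[1]] = v
        else:
            doc[k] = v
    return doc

# Made-up node, not taken from the real dataset
xml = ('<node id="1" lat="-6.2" lon="106.8" user="alice" uid="7">'
       '<tag k="addr:street" v="Jalan Sudirman"/>'
       '<tag k="phone" v="+62 21 500505"/></node>')
doc = to_document(ET.fromstring(xml))
print(doc)
```

Grouping the addr:* keys under one 'address' field is what later lets us query on 'address.street'.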
In [104]:
# %load 02-codes/data.py
#!/usr/bin/env python
import xml.etree.ElementTree as ET
import pprint
import re
import codecs
import json
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
addresschars = re.compile(r'addr:(\w+)')
CREATED = [ "version", "changeset", "timestamp", "user", "uid"]
OSM_FILE = 'dataset/jakarta_audit.osm'
def shape_element(element):
    node = {}
    if element.tag == "node" or element.tag == "way":
        # Build the dictionary from the values in the element's attributes.
        node = {'created': {}, 'type': element.tag}
        for k in element.attrib:
            v = element.attrib[k]
            if k == 'lat' or k == 'lon':
                continue
            if k in CREATED:
                node['created'][k] = v
            else:
                node[k] = v
        try:
            node['pos'] = [float(element.attrib['lat']), float(element.attrib['lon'])]
        except KeyError:
            pass
        if 'address' not in node.keys():
            node['address'] = {}
        # Iterate over the child <tag> elements
        for stag in element.iter('tag'):
            k = stag.attrib['k']
            v = stag.attrib['v']
            # Keep keys of the form 'addr:<field>'; skip those with a second ':'
            if k.startswith('addr:'):
                if len(k.split(':')) == 2:
                    content = addresschars.search(k)
                    if content:
                        node['address'][content.group(1)] = v
            else:
                node[k] = v
        if not node['address']:
            node.pop('address', None)
        # Special case for ways: collect all the nd references
        if element.tag == "way":
            node['node_refs'] = []
            for nd in element.iter('nd'):
                node['node_refs'].append(nd.attrib['ref'])
        return node
    else:
        return None
def process_map(file_in, pretty=False):
    """
    Convert the OSM file into a JSON file that can be loaded into MongoDB.
    """
    file_out = "{0}.json".format(file_in)
    data = []
    with codecs.open(file_out, "w") as fo:
        for _, element in ET.iterparse(file_in):
            el = shape_element(element)
            if el:
                data.append(el)
                if pretty:
                    fo.write(json.dumps(el, indent=2) + "\n")
                else:
                    fo.write(json.dumps(el) + "\n")
    return data
def test():
    data = process_map(OSM_FILE)
    pprint.pprint(data[500])

if __name__ == "__main__":
    test()
The processed map has been saved to jakarta_audit.osm.json. Now that we have processed the audited map file into an array of JSON documents, let's put it into a MongoDB instance. First we load the script that shapes the map data.
In [8]:
from data import *
import pprint
In [105]:
data = process_map('dataset/jakarta_audit.osm')
In [92]:
import json
Okay, let's check that the data is what we expect.
In [7]:
pprint.pprint(data[0:2])
The data seems about right. Now that we have verified the data is ready, let's put it into MongoDB.
In [81]:
from pymongo import MongoClient
In [82]:
client = MongoClient('mongodb://localhost:27017')
db = client.examples
In [106]:
db.jktosm.remove()
In [107]:
[db.jktosm.insert(e) for e in data]
Okay, it seems that we have successfully inserted all of our data into the MongoDB instance. Let's test this.
In [13]:
pipeline = [
{'$limit' : 2}
]
pprint.pprint(db.jktosm.aggregate(pipeline)['result'])
You can also check the file sizes of the dataset.
In [1]:
!ls -lh dataset/jakarta*
In [17]:
pipeline = [
{'$match': {'address.street':{'$exists':1}}},
{'$limit' : 1}
]
result = db.jktosm.aggregate(pipeline)['result']
pprint.pprint(result)
We can also find the top 5 contributing users. Users are counted by the number of map elements they created, sorted in descending order.
In [45]:
pipeline = [
{'$match': {'created.user':{'$exists':1}}},
{'$group': {'_id':'$created.user',
'count':{'$sum':1}}},
{'$sort': {'count':-1}},
{'$limit' : 5}
]
result = db.jktosm.aggregate(pipeline)['result']
pprint.pprint(result)
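As a cross-check for the aggregation above, the same ranking can be sketched in plain Python over the list of shaped documents. The top_users name and the sample documents are made up for this illustration; on the real data you would pass the list returned by process_map.

```python
from collections import Counter

def top_users(documents, n=5):
    """Plain-Python counterpart of the $match/$group/$sort/$limit pipeline."""
    counts = Counter(
        doc['created']['user']
        for doc in documents
        if 'user' in doc.get('created', {})   # same role as the $match stage
    )
    return counts.most_common(n)              # sort by count, keep the top n

# Made-up documents in the shape produced by shape_element
docs = [
    {'created': {'user': 'alice'}},
    {'created': {'user': 'bob'}},
    {'created': {'user': 'alice'}},
    {'created': {}},
]
ranking = top_users(docs)
print(ranking)
```

If both approaches agree on the real dataset, we can be more confident that the import did not drop or duplicate documents.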
In [18]:
pipeline = [
{'$match': {'amenity':'restaurant',
'name':{'$exists':1},
'cuisine':{'$exists':1},
'phone':{'$exists':1}
}
},
{'$project':{'_id':'$name',
'cuisine':'$cuisine',
'contact':'$phone'}}
]
result = db.jktosm.aggregate(pipeline)['result']
pprint.pprint(result)
In [108]:
pipeline = [
{'$match': {
'phone':{'$exists':1}
}
},
{'$project':{'_id':'$phone'}},
{'$limit': 20}
]
result = db.jktosm.aggregate(pipeline)['result']
pprint.pprint(result)
Earlier, we found that the map not only contains most public places but also describes them. With this data, we could build a mobile app that acts as an assistant and points users to whatever they need.
The app could open with a simple text box that asks the user "What do you want?". If, for example, the user says "I want to eat Japanese food", the assistant finds restaurants serving Japanese cuisine within a 1 mile radius. If the user says "I want to eat chicken" (Indonesian: 'ayam'), the assistant finds restaurants with 'Ayam' in their name. The restaurants are shown on a map with the user's location at the exact center. When the user taps a restaurant, a pop-up describes it, for example its cuisine and contact number, and offers two options: get directions to the restaurant or call the contact number.
This works not only for restaurants but for other kinds of public places: watching a movie, going to a gym, finding a florist, and so on. There are a lot of possibilities, but they are not without risk.
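The radius search described above could be sketched as follows, assuming each document carries a 'pos' field of [lat, lon] as produced by our shaping script. The haversine_km and nearby helpers and the three sample places are hypothetical; a real app would more likely push this filter into the database.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0   # mean Earth radius
MILE_KM = 1.609344         # one mile in kilometres

def haversine_km(a, b):
    """Great-circle distance between two [lat, lon] points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

def nearby(places, user_pos, cuisine, radius_km=MILE_KM):
    """Names of places with the requested cuisine within radius_km of the user."""
    return [p['name'] for p in places
            if p.get('cuisine') == cuisine
            and haversine_km(user_pos, p['pos']) <= radius_km]

# Made-up places around central Jakarta
places = [
    {'name': 'Sushi A', 'cuisine': 'japanese', 'pos': [-6.2000, 106.8000]},
    {'name': 'Sushi B', 'cuisine': 'japanese', 'pos': [-6.3000, 106.8000]},
    {'name': 'Ayam C', 'cuisine': 'indonesian', 'pos': [-6.2005, 106.8005]},
]
found = nearby(places, [-6.2010, 106.8010], 'japanese')
print(found)
```

Only the first place matches: the second is the right cuisine but roughly 11 km away, and the third is close but serves a different cuisine.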
Since this is open data, anyone can edit it. What if someone enters false information and the contact number is wrong? What about a fake address? A customer could spend hours getting to the location (and in Jakarta, the most traffic-jammed city in the world, that is possible) only to find that it is not the place they were looking for. That would be fatal for our company, so cross-validating the data is a good idea.
I believe we should do some checking. OpenStreetMap also has a validation tool, Osmose. It flags the locations and data we are looking at with a warning color level. If a location is red, or even yellow, we shouldn't incorporate it into our data. It is reasonably safe to assume that green ones pass our validation check, but we also have to account for the fact that Osmose can produce false positives.
If we are targeting a big company that relies heavily on OpenStreetMap data, it would be good to have a team that collaborates on making OpenStreetMap better. Besides giving back to the community, the team would be responsible for cross-checking any updated location, either against another map or by calling the contact number to confirm that the owner of the location has really moved.
I actually submitted the changes that I made back to OSM. The changeset is here: http://osmhv.openstreetmap.de/changeset.jsp?id=26730562
But the changes were reverted, because I violated a rule stating that I can't change the map with machine-generated edits. If you look at the changeset, I made a lot of changes that the community would have benefited from.