OpenStreetMap is an open project, which means it is free and everyone can use and edit it as they like. OpenStreetMap is a direct competitor of Google Maps. How can OpenStreetMap compete with the giant, you ask? It depends completely on crowdsourcing. There are lots of people around the world willingly updating the map, most of them fixing the map of their own country.

OpenStreetMap is powerful, and it relies heavily on human input. But its strength is also its downfall: wherever there is human input, there will be human error, so the data is very error-prone.

Problems Encountered in the Map

The problems encountered in the map:

  • Street name abbreviations
  • Inconsistent phone numbers:
    {u'_id': u'021 500505'},
    {u'_id': u'021-720-0981209'},
    {u'_id': u'62 21 3923810'},
    {u'_id': u'+62 21 723 8227'},
    {u'_id': u'(021) 7180317'},
    {u'_id': u'081807217074'},
    {u'_id': u'+62 857 4231 9136'},
    {u'_id': u'+6221 80872985'},
    {u'_id': u'+62 81222229386'}

Take street names, for example. People like to abbreviate the type of the street: 'Street' becomes 'St.' or 'st.'. In Indonesia, 'Jalan' (English: 'Street') is likewise abbreviated as 'Jln', 'jln', 'jl', or 'Jl'. This may attract little attention from most people, but a data scientist or web developer expects street names to follow a generic format.

'Jalan Sudirman' -> Jalan <name> -> name = Sudirman
'Jln Sudirman' -> Jalan <name> -> ERROR!
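This parsing problem can be demonstrated with a small, self-contained regex sketch (mirroring the audit approach used later, where the street type is the first token of the name):

```python
import re

# Indonesian street names lead with the type ('Jalan Sudirman'),
# so the pattern is anchored at the start of the string.
street_type_re = re.compile(r'^\b\S+\.?', re.IGNORECASE)

print(street_type_re.search('Jalan Sudirman').group())  # -> Jalan
print(street_type_re.search('Jln Sudirman').group())    # -> Jln
```

Any first token that is not in an expected list ('Jalan', 'Gang', ...) gets flagged for correction.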

We also have the inconsistent phone numbers shown above.

This project aims to fix that: it expands abbreviated names so the data uses a more generalized format. Not only does this benefit professionals, it also gives us more consistently structured words. I chose the whole area of Jakarta, the capital of Indonesia. The dataset is huge, with over 250,000 examples. Jakarta is my hometown, and I want to help the community.

TODO

  • Uniformity of phone number

In [75]:
OSMFILE = 'dataset/jakarta.osm'

To audit the OSM file, we first need an overview of the data. To get that overview, we count the tag content of the data.
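A minimal sketch of such a tag count (count_tags is an illustrative helper, not one of the project scripts; it streams the file with iterparse so the 79 MB OSM file never has to fit in memory at once):

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from collections import Counter

def count_tags(filename):
    """Count how often each element tag occurs, streaming the file."""
    counts = Counter()
    for _, elem in ET.iterparse(filename):
        counts[elem.tag] += 1
        elem.clear()  # release parsed elements to keep memory bounded
    return counts

# Demonstrate on a tiny stand-in for dataset/jakarta.osm:
sample = '<osm><node id="1"/><node id="2"/><way id="3"><nd ref="1"/></way></osm>'
with tempfile.NamedTemporaryFile('w', suffix='.osm', delete=False) as f:
    f.write(sample)
tag_counts = count_tags(f.name)
os.remove(f.name)
print(tag_counts['node'], tag_counts['way'])  # -> 2 1
```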


In [109]:
%%writefile 02-codes/audit.py

import xml.etree.cElementTree as ET
from collections import defaultdict
import re
import pprint
from optparse import OptionParser

# OSMFILE = "sample.osm"
# OSMFILE = "example_audit.osm"
# In Indonesia the street type comes first, then the name, so the regex is anchored at the start.
#street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
street_type_re = re.compile(r'^\b\S+\.?', re.IGNORECASE)


# expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
#             "Trail", "Parkway", "Commons"]
expected = ['Jalan', 'Gang','Street', 'Road']
# UPDATE THIS VARIABLE
# The mapping keys must be tried longest-first (sorted by length, descending).
# Language note, English-Indonesian: {Street: Jalan},
# e.g. {Sudirman Street: Jalan Sudirman}
mapping = {

            'jl.':'Jalan',
            'JL.':'Jalan',
            'Jl.':'Jalan',
            'GG':'Gang',
            'gg': 'Gang',
            'jl' :'Jalan',
            'JL':'Jalan',
            'Jl':'Jalan',
        
        }


def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)
            # Return True if the name needs to be updated
            return True
    return False


def is_street_name(elem):
    """
    Perhaps addr:full should also be included and fixed.
    """
    return (elem.attrib['k'] == "addr:street") or (elem.attrib['k'] == "addr:full")

def is_name_is_street(elem):
    """Some people fill the name of the street in k=name.
    
    Should change this"""
    s = street_type_re.search(elem.attrib['v'])
    #print s
    return (elem.attrib['k'] == "name") and s and s.group() in mapping.keys()

def audit(osmfile):
    osm_file = open(osmfile, "r")
    street_types = defaultdict(set)
#     tree = ET.parse(osm_file, events=("start",))
    tree = ET.parse(osm_file)
    
    listtree = list(tree.iter())
    for elem in listtree:
        if elem.tag == "node" or elem.tag == "way":
            n_add = None
            
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    if audit_street_type(street_types, tag.attrib['v']):
                        # Update the tag attribute
                        tag.attrib['v'] = update_name(tag.attrib['v'],mapping)
                elif is_name_is_street(tag):
                    tag.attrib['v'] = update_name(tag.attrib['v'],mapping)
                    n_add = tag.attrib['v']
                elif tag.attrib['k'] == 'phone':
#                     print  tag.attrib['v']
                    tag.attrib['v'] = update_phone(tag.attrib['v'])
                    
                   
            if n_add:
                elem.append(ET.Element('tag',{'k':'addr:street', 'v':n_add}))

            
                
    # Write the audited tree back to a new file
    tree.write(osmfile[:osmfile.find('.osm')]+'_audit.osm')
    return street_types

def update_phone(number):
    """Uniform all the incosistent number"""
    
    stripped = re.sub('[^A-Za-z0-9]+', '', number)
    replace0to62 = re.sub('^0', '62',stripped)
    separate_area_code  = re.sub('^6221','6221 ',replace0to62)
    tidy_country_code = re.sub('^62', '+62 ', separate_area_code )
    fixed = tidy_country_code
    
    return fixed
        
    
    
def update_name(name, mapping):
    """
    Fix abbreviated names so that street names are uniform.

    The mapping keys are applied longest-first, to prevent a shorter key
    from matching before a longer one.
    """
    dict_map = sorted(mapping.keys(), key=len, reverse=True)
    for key in dict_map:
        
        if name.find(key) != -1:          
            name = name.replace(key,mapping[key])
            return name

# Essentially, in Indonesia, every type of street is specified as 'Jalan' (Street).
# So if the name does not have any recognized prefix, prepend 'Jalan'.
    return 'Jalan ' + name


def test():
    st_types = audit(OSMFILE)
#     pprint.pprint(dict(st_types))
    #assert len(st_types) == 3
    

#     for st_type, ways in st_types.iteritems():
#         for name in ways:
#             better_name = update_name(name, mapping)
#             print name, "=>", better_name


if __name__ == '__main__':
    test()
#     parser  = OptionParser()
#     parser.add_option('-d', '--data', dest='audited_data', help='osm data that want to be audited')
#     (opts,args) = parser.parse_args()
#     audit(opts.audited_data)


Overwriting 02-codes/audit.py

This saves the audited Jakarta OSM data to jakarta_audit.osm. Now let's prepare the audited file as input to the MongoDB instance.
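Before loading anything, we can sanity-check the cleaning logic on a few sample values. The following is a minimal, self-contained re-implementation of update_name and update_phone from audit.py, for illustration only:

```python
import re

mapping = {'jl.': 'Jalan', 'JL.': 'Jalan', 'Jl.': 'Jalan',
           'GG': 'Gang', 'gg': 'Gang',
           'jl': 'Jalan', 'JL': 'Jalan', 'Jl': 'Jalan'}

def update_name(name, mapping):
    # Try longer keys first so 'Jl.' wins over 'Jl'.
    for key in sorted(mapping, key=len, reverse=True):
        if key in name:
            return name.replace(key, mapping[key])
    # No recognized prefix: assume the street type was omitted entirely.
    return 'Jalan ' + name

def update_phone(number):
    stripped = re.sub('[^A-Za-z0-9]+', '', number)   # drop spaces, dashes, parens
    replaced = re.sub('^0', '62', stripped)          # 021... -> 6221...
    spaced = re.sub('^6221', '6221 ', replaced)      # split off the area code
    return re.sub('^62', '+62 ', spaced)             # normalize the country code

print(update_name('Jl Sudirman', mapping))  # -> Jalan Sudirman
print(update_phone('(021) 7180317'))        # -> +62 21 7180317
```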


In [104]:
# %load 02-codes/data.py
#!/usr/bin/env python
import xml.etree.ElementTree as ET
import pprint
import re
import codecs
import json


lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
addresschars = re.compile(r'addr:(\w+)')
CREATED = [ "version", "changeset", "timestamp", "user", "uid"]
OSM_FILE = 'dataset/jakarta_audit.osm'

def shape_element(element):
    #node = defaultdict(set)
    node = {}
    if element.tag == "node" or element.tag == "way" :
        # Create the dictionary based exactly on the element's attribute values.
        node = {'created':{}, 'type':element.tag}
        for k in element.attrib:
            try:
                v = element.attrib[k]
            except KeyError:
                continue
            if k == 'lat' or k == 'lon':
                continue
            if k in CREATED:
                node['created'][k] = v
            else:
                node[k] = v
        try:
            node['pos']=[float(element.attrib['lat']),float(element.attrib['lon'])]
        except KeyError:
            pass
        
        if 'address' not in node.keys():
            node['address'] = {}
        #Iterate the content of the tag
        for stag in element.iter('tag'):
            # Read the key and value of each sub-tag

            k = stag.attrib['k']
            v = stag.attrib['v']
            # Check that the key is prefixed with 'addr:' and has no further ':'
            if k.startswith('addr:'):
                if len(k.split(':')) == 2:
                    content = addresschars.search(k)
                    if content:
                        node['address'][content.group(1)] = v
            else:
                node[k]=v
        if not node['address']:
            node.pop('address',None)
        # Special case for way elements: collect all the nd refs
        if element.tag == "way":
            node['node_refs'] = []
            for nd in element.iter('nd'):
                node['node_refs'].append(nd.attrib['ref'])
#         if  'address' in node.keys():
#             pprint.pprint(node['address'])
        return node
    else:
        return None


def process_map(file_in, pretty = False):
    """
    Process the OSM file into a JSON file to be used as input to MongoDB.
    """
    file_out = "{0}.json".format(file_in)
    data = []
    with codecs.open(file_out, "w") as fo:
        for _, element in ET.iterparse(file_in):
            el = shape_element(element)
            if el:
                data.append(el)
                if pretty:
                    fo.write(json.dumps(el, indent=2)+"\n")
                else:
                    fo.write(json.dumps(el) + "\n")
    return data

def test():

    data = process_map(OSM_FILE)
    pprint.pprint(data[500])


if __name__ == "__main__":
    test()


{'created': {'changeset': '432205',
             'timestamp': '2008-12-17T06:20:11Z',
             'uid': '76518',
             'user': 'Firman Hadi',
             'version': '17'},
 'id': '94995170',
 'pos': [-6.2917819, 106.7859039],
 'type': 'node'}

The processed map has been saved to jakarta_audit.osm.json. Now that we have processed the audited map file into an array of JSON documents, let's put it into a MongoDB instance. First we load the script that inserts the map.


In [8]:
from data import *
import pprint

In [105]:
data = process_map('dataset/jakarta_audit.osm')

In [92]:
import json

Okay, let's test whether the data is what we expect.


In [7]:
pprint.pprint(data[0:2])


[{'created': {'changeset': '20029239',
              'timestamp': '2014-01-16T08:18:23Z',
              'uid': '646006',
              'user': 'Irfan Muhammad',
              'version': '13'},
  'id': '29938967',
  'pos': [-6.1803929, 106.8226699],
  'type': 'node'},
 {'created': {'changeset': '20029239',
              'timestamp': '2014-01-16T08:18:23Z',
              'uid': '646006',
              'user': 'Irfan Muhammad',
              'version': '28'},
  'id': '29938968',
  'pos': [-6.1803972, 106.8231199],
  'type': 'node'}]

The data seems about right. Now that we have verified the data is ready, let's put it into MongoDB.


In [81]:
from pymongo import MongoClient

In [82]:
client  = MongoClient('mongodb://localhost:27017')
db = client.examples

In [106]:
db.jktosm.remove()


Out[106]:
{u'n': 260760, u'ok': 1}

In [107]:
[db.jktosm.insert(e) for e in data]


---------------------------------------------------------------------------
InvalidDocument                           Traceback (most recent call last)
<ipython-input-107-4a988356e5a9> in <module>()
----> 1 [db.jktosm.insert(e) for e in data]

/Users/Jonathan/anaconda/lib/python2.7/site-packages/pymongo/collection.pyc in insert(self, doc_or_docs, manipulate, safe, check_keys, continue_on_error, **kwargs)
    407             results = message._do_batched_write_command(
    408                     self.database.name + ".$cmd", _INSERT, command,
--> 409                     gen(), check_keys, self.uuid_subtype, client)
    410             _check_write_command_response(results)
    411         else:

InvalidDocument: key 'building.source:roof' must not contain '.'

Not quite: the bulk insert stopped with an InvalidDocument error, because MongoDB does not allow '.' inside key names and one element carries the key 'building.source:roof'. Keys like this need to be sanitized before insertion. Let's test what did get inserted.
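One way around this (a sketch; sanitize_keys is a hypothetical helper, not part of the scripts above) is to replace the dots in key names before inserting:

```python
def sanitize_keys(doc):
    """Recursively replace '.' in key names, which MongoDB forbids."""
    if isinstance(doc, dict):
        return {k.replace('.', '_'): sanitize_keys(v) for k, v in doc.items()}
    if isinstance(doc, list):
        return [sanitize_keys(v) for v in doc]
    return doc

print(sanitize_keys({'building.source:roof': 'no'}))
# -> {'building_source:roof': 'no'}
```

With this in place, the insert loop would become `[db.jktosm.insert(sanitize_keys(e)) for e in data]`.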


In [13]:
pipeline = [
    {'$limit' : 2}
]
pprint.pprint(db.jktosm.aggregate(pipeline)['result'])


[{u'_id': ObjectId('56491b33ea2e5e09913ae0eb'),
  u'created': {u'changeset': u'20029239',
               u'timestamp': u'2014-01-16T08:18:23Z',
               u'uid': u'646006',
               u'user': u'Irfan Muhammad',
               u'version': u'13'},
  u'id': u'29938967',
  u'pos': [-6.1803929, 106.8226699],
  u'type': u'node'},
 {u'_id': ObjectId('56491b34ea2e5e09913ae0ec'),
  u'created': {u'changeset': u'20029239',
               u'timestamp': u'2014-01-16T08:18:23Z',
               u'uid': u'646006',
               u'user': u'Irfan Muhammad',
               u'version': u'28'},
  u'id': u'29938968',
  u'pos': [-6.1803972, 106.8231199],
  u'type': u'node'}]

Overview of the data

You can see the file sizes of the dataset.


In [1]:
!ls -lh dataset/jakarta*


-rwxrwxrwx@ 1 Jon  staff    79M Nov  5  2014 dataset/jakarta.osm
-rwxrwxrwx  1 Jon  staff    80M Nov 20  2014 dataset/jakarta_audit.osm
-rwxrwxrwx  1 Jon  staff    89M Nov 20  2014 dataset/jakarta_audit.osm.json

Show a document that has a street address


In [17]:
pipeline = [
            {'$match': {'address.street':{'$exists':1}}},
            {'$limit' : 1}
]
result  = db.jktosm.aggregate(pipeline)['result']
pprint.pprint(result)


[{u'_id': ObjectId('56491b4dea2e5e09913bab80'),
  u'address': {u'housename': u'Gandaria City',
               u'postcode': u'12240',
               u'street': u'Jalan Sultan Iskandar Muda Kebayoran Lama'},
  u'created': {u'changeset': u'7760855',
               u'timestamp': u'2011-04-04T04:16:03Z',
               u'uid': u'431638',
               u'user': u'esoedjasa',
               u'version': u'1'},
  u'id': u'1231819753',
  u'name': u'Gandaria City',
  u'pos': [-6.2446998, 106.7832904],
  u'shop': u'supermarket',
  u'type': u'node'}]

Show the top 5 contributing users

We can also find the top 5 contributing users. Users are counted by the number of points they created in the map, sorted in descending order.


In [45]:
pipeline = [
            {'$match': {'created.user':{'$exists':1}}},
            {'$group': {'_id':'$created.user',
                        'count':{'$sum':1}}},
            {'$sort': {'count':-1}},
            {'$limit' : 5}
]
result  = db.jktosm.aggregate(pipeline)['result']
pprint.pprint(result)


[{u'_id': u'Firman Hadi', u'count': 113770},
 {u'_id': u'dimdim02', u'count': 38860},
 {u'_id': u'riangga_miko', u'count': 36695},
 {u'_id': u'raniedwianugrah', u'count': 30388},
 {u'_id': u'Alex Rollin', u'count': 26496}]

Show the restaurants' names, the food they serve, and their contact numbers


In [18]:
pipeline = [
            {'$match': {'amenity':'restaurant',
                        'name':{'$exists':1},
                        'cuisine':{'$exists':1},
                        'phone':{'$exists':1}
                       }
            },
            {'$project':{'_id':'$name',
                         'cuisine':'$cuisine',
                         'contact':'$phone'}}
]
result  = db.jktosm.aggregate(pipeline)['result']
pprint.pprint(result)


[{u'_id': u'Warung Tekko',
  u'contact': u'+62 21 5263137',
  u'cuisine': u'Indonesian'},
 {u'_id': u'YaUdah bistro',
  u'contact': u'+62213140343',
  u'cuisine': u'german'},
 {u'_id': u'Wabito Ramen',
  u'contact': u'62 21 3923810',
  u'cuisine': u'japanese'},
 {u'_id': u'Goma ramen', u'contact': u'081807217074', u'cuisine': u'japanese'},
 {u'_id': u'Bluegrass',
  u'contact': u'+62 21 29941660',
  u'cuisine': u'american'}]

In [108]:
pipeline = [
            {'$match': {
                        
                        'phone':{'$exists':1}
                       }
            },
            {'$project':{'_id':'$phone'}},
            {'$limit': 20}
]
result  = db.jktosm.aggregate(pipeline)['result']
pprint.pprint(result)


[{u'_id': u'+62 21 5263137'},
 {u'_id': u'14045'},
 {u'_id': u'+62 21 78834966'},
 {u'_id': u'+62 21 500505'},
 {u'_id': u'+62 21 3140343'},
 {u'_id': u'+62 21 7200981209'},
 {u'_id': u'+62 21 30422222'},
 {u'_id': u'+62 21 57851819'},
 {u'_id': u'+62 21 3923810'},
 {u'_id': u'+62 21 7238227'},
 {u'_id': u'+62 21 6338288'},
 {u'_id': u'+62 21 7180317'},
 {u'_id': u'+62 81807217074'},
 {u'_id': u'+62 81380748996'},
 {u'_id': u'+62 85742319136'},
 {u'_id': u'+62 8787553090'},
 {u'_id': u'+62 21 80872985'}]

Other ideas about the dataset

Earlier, we found that not only can most public places be found in the map, it also shares descriptions of these places. Now that we have this data, we could create a mobile app that serves as an assistant, pointing users to what they need.

The mobile app could open with a simple text box asking the user "What do you want?". If, for example, the user says "I want to eat Japanese food", the assistant would find Japanese-cuisine restaurants within a one-mile radius. Or if the user asks "I want to eat chicken (Indonesian: ayam)", the assistant would find restaurants with 'Ayam' in their name. It would present these restaurants on the map, with the user's location at the exact center. If the user taps one restaurant, a pop-up would give its description, for example its kind of cuisine and contact number, and offer two options: directions to the restaurant, or a call to the contact number.
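As a rough sketch of how the "within one mile" lookup could be expressed in MongoDB (an assumption, not something the current data supports as-is: $near requires a 2dsphere index on a GeoJSON field with [longitude, latitude] order, while our documents store pos as [lat, lon], so the data would first need converting into a hypothetical loc field):

```python
user_pos = [106.8226699, -6.1803929]  # GeoJSON order: [longitude, latitude]

# Find Japanese restaurants within one mile (~1609 m) of the user.
query = {
    'amenity': 'restaurant',
    'cuisine': 'japanese',
    'loc': {
        '$near': {
            '$geometry': {'type': 'Point', 'coordinates': user_pos},
            '$maxDistance': 1609,  # metres
        }
    },
}
# db.jktosm.find(query) would then return the candidate restaurants.
```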

And not just restaurants, but other kinds of public places: watching a movie, going to the gym, finding a florist, and so on. There are a lot of possibilities, but this is not without risk.

Since this is open data, anyone can edit it. What if someone gives false information and the contact number is wrong? Or the address is fake? A consumer could take hours to reach the location (and in Jakarta, one of the most traffic-jammed cities in the world, that is entirely possible), only to find it is not the place they were looking for. That would be fatal to our company, so it is a good idea to cross-validate.

I believe we should do some checking. OpenStreetMap also has a validation tool, Osmose. With it we can see the location/data we are looking at with a warning color level: if the location is red, or even yellow, we should not incorporate it into our data, while it is reasonably safe to assume the green ones pass our validation check. But we also have to account for the fact that Osmose can produce false positives.

If we are targeting a big company that relies heavily on OpenStreetMap data, it is a good idea to have a team that collaborates on making OpenStreetMap better. Besides giving back to the community, the team would also be responsible for cross-checking any location update, either by using another map or by calling the contact number to confirm that the owner of the location really has moved.

I actually submitted the changes that I made back to OSM. The changeset is here: http://osmhv.openstreetmap.de/changeset.jsp?id=26730562

But the changes were reverted, because I violated a rule stating that the map must not be changed with automated code. If you look at the changes, I made a lot of edits that would benefit the community.