OpenStreetMap is an open project: it is free, and anyone can use it and edit it. It is a direct competitor to Google Maps. How can OpenStreetMap compete with such a giant? It depends entirely on crowdsourcing. Many people around the world willingly update the map, and most of them fix the map of their own country.

OpenStreetMap is powerful, and it relies heavily on human input. But that strength is also its weakness: wherever there is human input, there will be human error, so the data is error-prone.

Problems Encountered in the Map

Take street names, for example. People like to abbreviate the street type: 'Street' becomes 'St.' or 'st.'. In Indonesia, 'Jalan' (Indonesian for 'Street') is likewise abbreviated as 'Jln', 'jln', or 'jl'. Casual map users may barely notice this, but data scientists and web developers expect street names to follow one consistent format.

'Jalan Sudirman' -> Jalan <name> -> name = Sudirman
'Jln Sudirman' -> Jalan <name> -> ERROR!

This project aims to fix that: it expands abbreviated street types so that every street name uses one standard form. This benefits professionals, and it also gives everyone more consistently structured names.
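As a rough illustration of the idea, here is a minimal sketch of expanding an abbreviated street type. It is not the code in audit.py; the mapping `ABBREVIATIONS` and the function name `normalize_street` are hypothetical names chosen for this example, and the mapping covers only the abbreviations mentioned above.

```python
import re

# Hypothetical mapping (an assumption for this sketch): lowercase
# abbreviation -> full street type.
ABBREVIATIONS = {"jln": "Jalan", "jl": "Jalan"}

# First word of the street name, with an optional trailing dot.
FIRST_WORD = re.compile(r"^\s*([A-Za-z]+)\.?\s*")


def normalize_street(name):
    """Expand an abbreviated street type at the start of a street name."""
    match = FIRST_WORD.match(name)
    if not match:
        return name
    full = ABBREVIATIONS.get(match.group(1).lower())
    if full is None:
        # Already 'Jalan ...' or an unknown prefix: leave it untouched.
        return name
    return (full + " " + name[match.end():]).strip()


print(normalize_street("Jln Sudirman"))
print(normalize_street("jl. Sudirman"))
print(normalize_street("Jalan Sudirman"))
```

Only the first word is inspected, so a name that merely contains 'jl' somewhere in the middle is not changed.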

In this project, I want to show how to fix one type of error: the street address. I chose the whole of Jakarta, the capital of Indonesia. The dataset is huge, with over 250,000 examples. Jakarta is my hometown, and I want to help the community. I will also show how to load the audited data into a MongoDB instance, and we will use MongoDB's Aggregation Framework to get an overview and analysis of the data.


In [8]:
OSM_FILE = 'jakarta.osm'

In [11]:
%load mapparser.py

To audit the OSM file, we first need an overview of the data. To get one, we count how often each tag appears in the file.
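Counting tags can be sketched as follows. This is an illustration under assumptions, not the contents of mapparser.py: the function name `count_tags` and the tiny `SAMPLE` document are made up for the example, and it uses `xml.etree.ElementTree` so that it runs on both Python 2 and Python 3.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Count how often each element tag occurs in an OSM XML file or file object."""
    counts = Counter()
    # iterparse streams the file, so a large .osm file is never fully in memory.
    for _, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop the element's children once it has been counted
    return counts


# Hypothetical miniature OSM document, used only to exercise the function.
SAMPLE = b"""<osm>
  <node id="1" lat="-6.18" lon="106.82"/>
  <node id="2" lat="-6.19" lon="106.83"><tag k="name" v="Halte Sahari"/></node>
  <way id="3"><nd ref="1"/><nd ref="2"/></way>
</osm>"""

tag_counts = count_tags(io.BytesIO(SAMPLE))
print(dict(tag_counts))
```

For the real dataset, the same function could be called with the file name, for example `count_tags(OSM_FILE)`.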


In [14]:
%load users.py

In [2]:
import xml.etree.cElementTree as ET

In [1]:
%load audit.py

This saves the audited Jakarta OSM data to jakarta_audit.osm. Now let's prepare the audited file for input to the MongoDB instance.


In [ ]:
%load submission/data.py

The processed map has been saved to jakarta_audit.osm.json. Now that we have processed the audited map file into an array of JSON documents, let's put it into a MongoDB instance, using the map that we have audited. First, we load the script that processes the map.
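To give a feel for what such processing involves, here is a hedged sketch of turning one OSM element into a document with 'created', 'pos', and 'address' fields. It is a guess at the general approach, not the actual code in data.py; `shape_element`, `CREATED`, and the sample node are names and data assumed for this example.

```python
import xml.etree.ElementTree as ET

# Attributes grouped under the 'created' sub-document (assumed layout).
CREATED = ("version", "changeset", "timestamp", "user", "uid")


def shape_element(element):
    """Convert a <node> or <way> element into a dict; return None otherwise."""
    if element.tag not in ("node", "way"):
        return None
    doc = {"type": element.tag, "created": {}}
    for key, value in element.attrib.items():
        if key in CREATED:
            doc["created"][key] = value
        elif key not in ("lat", "lon"):
            doc[key] = value
    if "lat" in element.attrib and "lon" in element.attrib:
        doc["pos"] = [float(element.attrib["lat"]), float(element.attrib["lon"])]
    for tag in element.iter("tag"):
        key, value = tag.attrib["k"], tag.attrib["v"]
        if key.startswith("addr:"):
            # 'addr:street' becomes address.street
            doc.setdefault("address", {})[key[len("addr:"):]] = value
        else:
            doc[key] = value
    return doc


# Hypothetical sample node, used only to exercise the function.
node = ET.fromstring(
    '<node id="1278972435" lat="-6.1277779" lon="106.8464371" '
    'user="flierfy" uid="445671" version="2" changeset="11638099" '
    'timestamp="2012-05-18T22:26:16Z">'
    '<tag k="addr:street" v="Jalan Sahari"/>'
    '<tag k="highway" v="bus_stop"/>'
    '</node>'
)
doc = shape_element(node)
print(doc)
```

This sketch does no validation: a tag whose key is 'type' would overwrite the element type, and keys with a second colon are kept as-is under 'address'. A real script would need to decide how to handle both.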


In [3]:
from data import *
import pprint

In [39]:
data = process_map('jakarta_audit.osm')

Okay, let's check that the data looks the way we expect.


In [33]:
pprint.pprint(data[0:6])


[{'created': {'changeset': '20029239',
              'timestamp': '2014-01-16T08:18:23Z',
              'uid': '646006',
              'user': 'Irfan Muhammad',
              'version': '13'},
  'id': '29938967',
  'pos': [-6.1803929, 106.8226699],
  'type': 'node'},
 {'created': {'changeset': '20029239',
              'timestamp': '2014-01-16T08:18:23Z',
              'uid': '646006',
              'user': 'Irfan Muhammad',
              'version': '28'},
  'id': '29938968',
  'pos': [-6.1803972, 106.8231199],
  'type': 'node'},
 {'created': {'changeset': '20029239',
              'timestamp': '2014-01-16T08:18:23Z',
              'uid': '646006',
              'user': 'Irfan Muhammad',
              'version': '9'},
  'id': '29938969',
  'pos': [-6.1809102, 106.8230928],
  'type': 'node'},
 {'created': {'changeset': '20029239',
              'timestamp': '2014-01-16T08:18:23Z',
              'uid': '646006',
              'user': 'Irfan Muhammad',
              'version': '15'},
  'id': '29938970',
  'pos': [-6.1808689, 106.8226461],
  'type': 'node'},
 {'created': {'changeset': '20029239',
              'timestamp': '2014-01-16T08:18:23Z',
              'uid': '646006',
              'user': 'Irfan Muhammad',
              'version': '10'},
  'id': '29938971',
  'pos': [-6.1805893, 106.8225613],
  'type': 'node'},
 {'created': {'changeset': '20029239',
              'timestamp': '2014-01-16T08:18:23Z',
              'uid': '646006',
              'user': 'Irfan Muhammad',
              'version': '11'},
  'id': '29938972',
  'pos': [-6.1805659, 106.8232191],
  'type': 'node'}]

The data looks about right. Now that we have verified it is ready, let's put it into MongoDB.


In [4]:
from pymongo import MongoClient

In [5]:
client  = MongoClient('mongodb://localhost:27017')
db = client.examples

In [ ]:
# insert() also accepts a list of documents, so no loop is needed (PyMongo 2.x API)
db.jktosm.insert(data)

It seems that we have successfully inserted all of our data into the MongoDB instance. Let's test this.


In [6]:
pipeline = [
    {'$limit' : 6}
]
pprint.pprint(db.jktosm.aggregate(pipeline)['result'])


[{u'_id': ObjectId('546d9d818cbd2f060eb432f2'),
  u'created': {u'changeset': u'11134443',
               u'timestamp': u'2012-03-29T07:25:28Z',
               u'uid': u'642271',
               u'user': u'ragunan',
               u'version': u'1'},
  u'id': u'1695812051',
  u'pos': [-6.2949894, 106.8198961],
  u'type': u'node'},
 {u'_id': ObjectId('546d9d818cbd2f060eb432f3'),
  u'created': {u'changeset': u'11134443',
               u'timestamp': u'2012-03-29T07:25:28Z',
               u'uid': u'642271',
               u'user': u'ragunan',
               u'version': u'1'},
  u'id': u'1695812052',
  u'pos': [-6.2950642, 106.8199212],
  u'type': u'node'},
 {u'_id': ObjectId('546d9d818cbd2f060eb432f4'),
  u'created': {u'changeset': u'11134444',
               u'timestamp': u'2012-03-29T07:25:27Z',
               u'uid': u'642195',
               u'user': u'tebet_timur',
               u'version': u'1'},
  u'id': u'1695812053',
  u'pos': [-6.2300963, 106.855384],
  u'type': u'node'},
 {u'_id': ObjectId('546d9d818cbd2f060eb432f5'),
  u'created': {u'changeset': u'11134443',
               u'timestamp': u'2012-03-29T07:25:28Z',
               u'uid': u'642271',
               u'user': u'ragunan',
               u'version': u'1'},
  u'id': u'1695812054',
  u'pos': [-6.2950931, 106.8189926],
  u'type': u'node'},
 {u'_id': ObjectId('546d9d818cbd2f060eb432f6'),
  u'created': {u'changeset': u'11134444',
               u'timestamp': u'2012-03-29T07:25:28Z',
               u'uid': u'642195',
               u'user': u'tebet_timur',
               u'version': u'1'},
  u'id': u'1695812055',
  u'pos': [-6.2301173, 106.8553364],
  u'type': u'node'},
 {u'_id': ObjectId('546d9d818cbd2f060eb432f7'),
  u'created': {u'changeset': u'11134443',
               u'timestamp': u'2012-03-29T07:25:28Z',
               u'uid': u'642271',
               u'user': u'ragunan',
               u'version': u'1'},
  u'id': u'1695812056',
  u'pos': [-6.2950931, 106.8211174],
  u'type': u'node'}]

Show five documents that have a street address.


In [7]:
pipeline = [
            {'$match': {'address.street':{'$exists':1}}},
            {'$limit' : 5}
]
result  = db.jktosm.aggregate(pipeline)['result']
pprint.pprint(result)


[{u'_id': ObjectId('546d9d758cbd2f060eb3916d'),
  u'address': {u'housename': u'Pasar Festival',
               u'street': u'Jalan HR Rasuna Said'},
  u'building': u'yes',
  u'created': {u'changeset': u'16848088',
               u'timestamp': u'2013-07-06T12:21:11Z',
               u'uid': u'76518',
               u'user': u'Firman Hadi',
               u'version': u'2'},
  u'id': u'1394516071',
  u'leisure': u'sports_centre',
  u'name': u'Soemantri Brojonegoro',
  u'pos': [-6.2213611, 106.8329498],
  u'sport': u'basketball',
  u'type': u'node'},
 {u'_id': ObjectId('546d9d768cbd2f060eb39c64'),
  u'address': {u'city': u'Jakarta',
               u'country': u'ID',
               u'housename': u'Meruvian Camp - Cempaka Baru',
               u'housenumber': u'39',
               u'street': u'Jalan Swadaya 2 No. 39'},
  u'created': {u'changeset': u'9758314',
               u'timestamp': u'2011-11-06T18:44:31Z',
               u'uid': u'70696',
               u'user': u'xybot',
               u'version': u'2'},
  u'id': u'1493006911',
  u'pos': [-6.1700951, 106.8655072],
  u'type': u'node'},
 {u'_id': ObjectId('546d9d6d8cbd2f060eb32173'),
  u'address': {u'housename': u'Gandaria City',
               u'postcode': u'12240',
               u'street': u'Jalan Sultan Iskandar Muda Kebayoran Lama'},
  u'created': {u'changeset': u'7760855',
               u'timestamp': u'2011-04-04T04:16:03Z',
               u'uid': u'431638',
               u'user': u'esoedjasa',
               u'version': u'1'},
  u'id': u'1231819753',
  u'name': u'Gandaria City',
  u'pos': [-6.2446998, 106.7832904],
  u'shop': u'supermarket',
  u'type': u'node'},
 {u'_id': ObjectId('546d9d6d8cbd2f060eb323e9'),
  u'address': {u'street': u'Jalan Sahari'},
  u'created': {u'changeset': u'11638099',
               u'timestamp': u'2012-05-18T22:26:16Z',
               u'uid': u'445671',
               u'user': u'flierfy',
               u'version': u'2'},
  u'highway': u'bus_stop',
  u'id': u'1278972435',
  u'name': u'Halte Sahari',
  u'pos': [-6.1277779, 106.8464371],
  u'type': u'node'},
 {u'_id': ObjectId('546d9d758cbd2f060eb39153'),
  u'address': {u'housename': u'Pasar Festival',
               u'housenumber': u'Kav C.22 Unit GF 05-06',
               u'postcode': u'12960',
               u'street': u'Jalan HR Rasuna Said'},
  u'amenity': u'restaurant',
  u'created': {u'changeset': u'10024298',
               u'timestamp': u'2011-12-03T18:57:09Z',
               u'uid': u'92274',
               u'user': u'adjuva',
               u'version': u'5'},
  u'cuisine': u'Indonesian',
  u'id': u'1394496957',
  u'name': u'Warung Tekko',
  u'phone': u'+62 21 5263137',
  u'phone2': u'+62 21 5263278',
  u'pos': [-6.2216971, 106.8328855],
  u'type': u'node',
  u'website': u'www.facebook.com/warungtekko'}]

Show the top five contributing users.


In [45]:
pipeline = [
            {'$match': {'created.user':{'$exists':1}}},
            {'$group': {'_id':'$created.user',
                        'count':{'$sum':1}}},
            {'$sort': {'count':-1}},
            {'$limit' : 5}
]
result  = db.jktosm.aggregate(pipeline)['result']
pprint.pprint(result)


[{u'_id': u'Firman Hadi', u'count': 113770},
 {u'_id': u'dimdim02', u'count': 38860},
 {u'_id': u'riangga_miko', u'count': 36695},
 {u'_id': u'raniedwianugrah', u'count': 30388},
 {u'_id': u'Alex Rollin', u'count': 26496}]
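
For readers less familiar with the aggregation stages, the pipeline above can be mirrored in plain Python. This is only an illustration of what $match, $group, $sort, and $limit do together; `top_users` and the small `sample` list are hypothetical and are not part of the project scripts.

```python
from collections import Counter


def top_users(documents, limit=5):
    """Count documents per 'created.user' and return the most frequent users."""
    counts = Counter(
        doc["created"]["user"]                  # $group by created.user
        for doc in documents
        if "user" in doc.get("created", {})     # $match: the field must exist
    )
    # most_common sorts by count descending and truncates ($sort + $limit).
    return [{"_id": user, "count": n} for user, n in counts.most_common(limit)]


# Hypothetical miniature collection, used only to exercise the function.
sample = [
    {"created": {"user": "ragunan"}},
    {"created": {"user": "ragunan"}},
    {"created": {"user": "tebet_timur"}},
    {"created": {"user": "ragunan"}},
    {"created": {}},
]
print(top_users(sample, limit=2))
```

On the real collection the database does this work server-side, which avoids pulling every document into Python.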

Show each restaurant's name, the cuisine it serves, and its contact number.


In [9]:
pipeline = [
            {'$match': {'amenity':'restaurant',
                        'name':{'$exists':1}}},
            {'$project':{'_id':'$name',
                         'cuisine':'$cuisine',
                         'contact':'$phone'}}
]
result  = db.jktosm.aggregate(pipeline)['result']
pprint.pprint(result)


[{u'_id': u'Taman Hek'},
 {u'_id': u'3 House'},
 {u'_id': u'Jimbaran'},
 {u'_id': u'Death by Chocolate'},
 {u'_id': u"McDonald's"},
 {u'_id': u"Chef's Kitchen"},
 {u'_id': u'Planet Hollywood Jakarta', u'cuisine': u'american'},
 {u'_id': u'Soto kudus'},
 {u'_id': u'KFC Cikini', u'cuisine': u'chicken'},
 {u'_id': u'Mc Donald Cikini'},
 {u'_id': u'Pempek Cuko'},
 {u'_id': u'Warung Tekko',
  u'contact': u'+62 21 5263137',
  u'cuisine': u'Indonesian'},
 {u'_id': u'Kafe Betawi', u'cuisine': u'asian'},
 {u'_id': u'QBox Cafe', u'cuisine': u'asian'},
 {u'_id': u'Comics Cafe', u'cuisine': u'american'},
 {u'_id': u'Pizza Hut', u'cuisine': u'pizza'},
 {u'_id': u'Otel Lobby', u'cuisine': u'international'},
 {u'_id': u'Loewy', u'cuisine': u'french'},
 {u'_id': u'Food Court Passer Kuningan', u'cuisine': u'asian'},
 {u'_id': u'Pastis', u'cuisine': u'italian'},
 {u'_id': u'Pizza Hut', u'cuisine': u'pizza'},
 {u'_id': u'Dunkin donuts'},
 {u'_id': u'Warung Pasta'},
 {u'_id': u'Ayam Balphuss'},
 {u'_id': u'Riung Tenda'},
 {u'_id': u'Ayam Bakar Gilimanuk'},
 {u'_id': u'Ikan Bakar Banyuwangi'},
 {u'_id': u'Dim Sum Inc'},
 {u'_id': u'Heartz Chicken Buffet'},
 {u'_id': u'Ko he Noor'},
 {u'_id': u'de Resto'},
 {u'_id': u'Bakmi GM'},
 {u'_id': u'Caho Mung Qui Khach'},
 {u'_id': u'Dapur Melayu', u'cuisine': u'asian'},
 {u'_id': u'E Corner'},
 {u'_id': u'Ho Lung Sechan Cuisine', u'cuisine': u'asian'},
 {u'_id': u'Madam Kwok'},
 {u'_id': u'Mangotree Bistro'},
 {u'_id': u'Talaga'},
 {u'_id': u'Tgrill'},
 {u'_id': u'Usselsspring'},
 {u'_id': u'Eastern Promise'},
 {u'_id': u'Bubur Angke', u'cuisine': u'chinese'},
 {u'_id': u'Kembang Goela', u'cuisine': u'indonesia'},
 {u'_id': u'Kantin Mega Rasa', u'cuisine': u'indonesian'},
 {u'_id': u'Mbah Jingkrak Setiabudi', u'cuisine': u'indonesian'},
 {u'_id': u'Makan Babi'},
 {u'_id': u'3 house'},
 {u'_id': u'YaUdah bistro',
  u'contact': u'+62213140343',
  u'cuisine': u'german'},
 {u'_id': u'Mamink Daeng Tata', u'cuisine': u'regional'},
 {u'_id': u'Restoran Putri Duyung'},
 {u'_id': u'Le Bridge Restaurant'},
 {u'_id': u'Lanna Thai', u'cuisine': u'thai'},
 {u'_id': u'The Goods Diner'},
 {u'_id': u'Taco Local', u'cuisine': u'mexican'},
 {u'_id': u"Chili's", u'cuisine': u'american'},
 {u'_id': u'Hacienda ', u'cuisine': u'mexican'},
 {u'_id': u'Sederhana', u'cuisine': u'regional'},
 {u'_id': u'Dim Sum Restaurant', u'cuisine': u'international'},
 {u'_id': u'Warung Desa', u'cuisine': u'asian'},
 {u'_id': u'PEPeNERO'},
 {u'_id': u'Sakura Japanese Restaurant'},
 {u'_id': u'Pelangi Seafood ', u'cuisine': u'indonesian'},
 {u'_id': u'Restoran Kurnia Jaya'},
 {u'_id': u'Rumah Makan Padang Sederhana'},
 {u'_id': u'Wabito Ramen',
  u'contact': u'62 21 3923810',
  u'cuisine': u'japanese'},
 {u'_id': u'Rava House'},
 {u'_id': u'Musketeers'},
 {u'_id': u'Kebab Baba Rafi'},
 {u'_id': u'Bakul TUkul'},
 {u'_id': u'Nasi Bebek'},
 {u'_id': u'Ayam Panggang Rawamangun', u'cuisine': u'chicken'},
 {u'_id': u'Goma ramen', u'contact': u'081807217074', u'cuisine': u'japanese'},
 {u'_id': u'Takigawa', u'cuisine': u'Japanese'},
 {u'_id': u'Warung Pasta', u'cuisine': u'italian'},
 {u'_id': u'Rumah Solo'},
 {u'_id': u'Amigos', u'cuisine': u'mexican'},
 {u'_id': u'Amigos', u'cuisine': u'mexican'},
 {u'_id': u'Amigos', u'cuisine': u'mexican'},
 {u'_id': u'Koi'},
 {u'_id': u'sop janda', u'cuisine': u'regional'},
 {u'_id': u'Waroeng Kito', u'cuisine': u'chicken,_juice'},
 {u'_id': u'Sate Senayan'},
 {u'_id': u'Holy Cow'},
 {u'_id': u'Holy Cow'},
 {u'_id': u'MM Juice'},
 {u'_id': u'Abuba Steak'},
 {u'_id': u'Bubur Mangga Besar', u'cuisine': u'congee'},
 {u'_id': u'Pia Jakarta', u'cuisine': u'bakpia,hopia,pia'},
 {u'_id': u'Awen Seafood', u'cuisine': u'seafood'},
 {u'_id': u'Bluegrass',
  u'contact': u'+62 21 29941660',
  u'cuisine': u'american'},
 {u'_id': u'Warung Bang Hoody'},
 {u'_id': u'Bakmi Toko Tiga', u'cuisine': u'chinese'},
 {u'_id': u'Ayam Goreng Berkah Rachmat'},
 {u'_id': u'Ayam Goreng Suharti'},
 {u'_id': u'Bushido Restaurant'},
 {u'_id': u'Restoran Caping Gunung'},
 {u'_id': u'Bakso Lapangan tembak', u'cuisine': u'regional'},
 {u'_id': u'Baruna'},
 {u'_id': u'Pizza Hut Matraman', u'cuisine': u'pizza'},
 {u'_id': u'RM. Handayani'},
 {u'_id': u'kintamani'},
 {u'_id': u'sentral'},
 {u'_id': u'Kantin Umum', u'cuisine': u'variety_of_cuisines'},
 {u'_id': u'RM Raja Rasa', u'cuisine': u'regional'},
 {u'_id': u'RM Sederhana', u'cuisine': u'regional'},
 {u'_id': u'Sate Tomang', u'cuisine': u'regional'},
 {u'_id': u'warkop asep'},
 {u'_id': u'Iga Bakar Mas Giri', u'cuisine': u'regional'},
 {u'_id': u'Ayam Presto', u'cuisine': u'regional'},
 {u'_id': u'Masakan Rumah Ibu Endang', u'cuisine': u'regional'},
 {u'_id': u'Oenpao', u'cuisine': u'chinese'},
 {u'_id': u'rumah makan ibu ida'},
 {u'_id': u'fix me', u'cuisine': u'chinese'},
 {u'_id': u'Sederhana', u'cuisine': u'padang'},
 {u'_id': u'Saung Elbuston'},
 {u'_id': u'Rumah Makan Soto Betawi'},
 {u'_id': u'Warung Kopi'},
 {u'_id': u'Kampung Kandang'},
 {u'_id': u'La Codefin'},
 {u'_id': u'Kantin Prima Salemba'},
 {u'_id': u'DeJons Burger'},
 {u'_id': u'Bebek Kaleyo', u'cuisine': u'regional'},
 {u'_id': u'Q Smokehouse'},
 {u'_id': u'Kemang Food Fest'},
 {u'_id': u'Rumah Makan Padang', u'cuisine': u'international'},
 {u'_id': u'Bakmi Fajar', u'cuisine': u'regional'},
 {u'_id': u'Foof Court Pinang Ranti'},
 {u'_id': u'RM Sederhana'},
 {u'_id': u'warung soto'},
 {u'_id': u'Pizza Hut'},
 {u'_id': u'AYAM GORENG SUHARTI'},
 {u'_id': u'Ayam Goreng Ny. Suharti'}]

Conclusion