Data source: https://s3.amazonaws.com/metro-extracts.mapzen.com/jakarta_indonesia.osm.bz2
In [1]:
from osm_dataauditor import OSMDataAuditor
osm_data = OSMDataAuditor('jakarta_indonesia.osm')
In [2]:
# Basic element check
osm_data.count_element()
Out[2]:
OSM allows a very flexible tagging system, which gives users freedom but causes consistency problems. Below I count the occurrences of every tag in the document.
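The tag-key counting done by `get_tag_keys()` below can be sketched roughly as follows. This is an assumed implementation (the real `OSMDataAuditor` code is not shown here): walk the file incrementally with `iterparse` and count every `<tag k="...">` attribute.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def get_tag_keys(filename):
    """Count occurrences of each tag key in an OSM XML file."""
    counts = Counter()
    for _, elem in ET.iterparse(filename):
        if elem.tag == 'tag':
            counts[elem.attrib['k']] += 1
        elem.clear()  # keep memory bounded on a 449 MB file
    return sorted(counts.items())
```

Using `iterparse` rather than parsing the whole tree at once matters here, since loading the full 449 MB file into memory would be wasteful.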
In [3]:
# Check the tag key and element
tag_keys = osm_data.get_tag_keys()
Below I list the top 20 tag keys in descending order, as the full list would be too long to read.
In [4]:
sorted(tag_keys, key=lambda x: x[1], reverse=True)[:20]
Out[4]:
Looking through the data I see several things:
On the topic of duplicate tags, below I show several tags with similar names.
In [5]:
import re
# Name vs Nama (Nama is Indonesian for Name)
[item for item in tag_keys if re.match('(name|nama)$', item[0], re.I)]
Out[5]:
In [6]:
# Province vs propinsi (Propinsi is Indonesian for Province)
[item for item in tag_keys if re.match('(province|propinsi)$', item[0], re.I)]
Out[6]:
In [7]:
# Alamat vs address, (Alamat is Indonesian for address)
sorted([item for item in tag_keys if re.match('(addr|alamat)', item[0], re.I)], key=lambda x: x[1], reverse=True)
Out[7]:
We see that for addresses there are three related tags: 'ALAMAT', 'addr:street' and 'addr:full'. Both 'addr:street' and 'addr:full' are valid tags, so we cannot merge them. The OSM wiki implies that using 'addr:street' together with the other supporting fields is better than using 'addr:full', but our data contains more 'addr:full' than 'addr:street' entries (10038 vs 1441).
Indonesia has a fairly complex administrative subdivision. It is divided as follows:
Jakarta (a province) is divided into 4 cities: Jakarta Selatan, Jakarta Utara, Jakarta Barat and Jakarta Timur.
And then there are non-administrative divisions like RT and RW. While RT and RW are considered non-administrative subdivisions, they are widely used (the national ID card includes and requires this information).
I believe this is what leads users to simply put the whole address in 'addr:full', as it is much simpler. Separating an address into its components is often difficult, and the components do not actually match the OSM address tags (OSM has nothing like 'Kabupaten' or 'Kecamatan'). But as OSM warns, putting everything in 'addr:full' makes the data harder for software to parse.
There have been some efforts by the community to add the subdivisions to the data, but the results are not consistent. For instance, for Regency we have 'KAB_NAME', 'kab_name', 'Kabupaten', 'kab.', etc. And some way nodes use the 'admin_level' tag with 'kabupaten' as the value.
The problems with the address prefixes:
For the street-name prefix there are several variations: 'jl.', 'jln.', 'jl', 'jln'. Some use all upper case, some all lower case, and some are mixed.
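Normalizing these variants can be sketched with a single case-insensitive regular expression. This is a minimal sketch, not the auditor's actual implementation: it collapses the observed prefixes ('jl.', 'jln.', 'jl', 'jln', in any capitalization) into the full word 'Jalan'.

```python
import re

# Match an abbreviated street prefix at the start of the name:
# 'jl' or 'jln', optional dot, then whitespace (case-insensitive).
PREFIX_RE = re.compile(r'^j(?:ln|l)\.?\s+', re.IGNORECASE)

def normalize_street(name):
    """Replace abbreviated street prefixes with 'Jalan '."""
    return PREFIX_RE.sub('Jalan ', name.strip())
```

For example, `normalize_street('JL. Sudirman')` gives `'Jalan Sudirman'`, while a name that already starts with 'Jalan' is left unchanged.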
In [8]:
abbreviated_st = osm_data.audit_abbreviated_street_prefixes()
print "Total abbreviated street names:", len(abbreviated_st)
# Print the first 10 rows
abbreviated_st[:10]
Out[8]:
In [9]:
abbreviated_alley = osm_data.audit_abbreviated_alley_prefixes()
print "Total abbreviated alley names:", len(abbreviated_alley)
# Print the first 10 rows
abbreviated_alley[:10]
Out[9]:
To check the similarity between the reference street names and the ones in our MongoDB, I use Python's built-in difflib library. Its SequenceMatcher class compares two strings and returns a similarity ratio. We are interested in strings with a ratio above 0.65 and below 1. We also ignore prefixes and suffixes such as 'raya', a common suffix for street names in Indonesia.
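The comparison itself is simple; a minimal sketch with made-up street names (assuming prefixes and the 'raya' suffix have already been stripped):

```python
from difflib import SequenceMatcher

def similarity(reference, candidate):
    """Return difflib's similarity ratio for two names, case-folded."""
    return SequenceMatcher(None, reference.lower(), candidate.lower()).ratio()

score = similarity('Gatot Subroto', 'Gatot Soebroto')
print(round(score, 2))
# a score strictly between 0.65 and 1 is flagged for manual review
```

Identical strings score exactly 1.0, which is why we exclude 1 — those pairs need no review.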
In [10]:
closely_matched = osm_data.audit_address_similar_names()
closely_matched = sorted(list(closely_matched), key=lambda x: x[2], reverse=True)
print "Total similar items found: ", len(closely_matched)
# Display top 10 with score
[(reference, found, score) for reference, found, score in closely_matched[:10]]
Out[10]:
We still need to manually check and replace the street name, but it is a much simpler task.
In [12]:
city_names = osm_data.audit_city()
print "Different city names: ", len(city_names)
There are 44 unique city names; after looking at the result, many of them are invalid. Some list only 'Jakarta' as the city. What is surprising, though, is that cities from the surrounding areas are included as well, for instance Tangerang, Bekasi and Bogor.
Since it is common in Indonesia to reuse the same street name across multiple cities, we need to be conservative here. I will only update the city field if it already exists and its value is just 'Jakarta'.
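That conservative rule translates into a narrow MongoDB filter. Below is a hypothetical helper (not from the auditor code) that builds the filter/update pair; only documents whose city field exists and equals 'Jakarta' would match.

```python
def city_update_spec(node_id, new_city):
    """Build a (filter, update) pair that only touches documents
    whose address.city is present and exactly 'Jakarta'."""
    filt = {'id': node_id, 'address.city': 'Jakarta'}
    update = {'$set': {'address.city': new_city}}
    return filt, update
```

The pair would then be applied with something like `db.jakarta.update_one(*city_update_spec(node_id, city))`; documents with a different or missing city value are left alone.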
Unfortunately most of the addresses are in the addr:full tag instead of addr:street (4515 vs 461). Looking through the content of addr:full, we can see several variations:
RT and RW only, without a street name, for example:
Street name with RT but no RW, for example:
Street name with a house number, for example:
etc.
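A rough way to recognize some of these variants is pattern matching. The patterns below are hypothetical simplifications (the real data is messier), just to show the idea:

```python
import re

# Hypothetical patterns for the addr:full variants listed above.
RT_RW = re.compile(r'\bRT\.?\s*\d+\s*[/\s]\s*RW\.?\s*\d+', re.IGNORECASE)
HOUSE_NO = re.compile(r'\b(?:No\.?\s*)?\d+\s*$', re.IGNORECASE)

def classify_addr_full(addr):
    """Roughly bucket an addr:full string by its apparent structure."""
    if RT_RW.search(addr):
        return 'rt_rw'
    if HOUSE_NO.search(addr):
        return 'house_number'
    return 'other'
```

Even a rough classifier like this would help estimate how much of the addr:full data is recoverable into structured fields.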
I won't be cleaning addr:full further (other than the prefix fixes) in this project, but I think it needs to be addressed here.
This section contains basic statistics about the dataset and the MongoDB queries used to gather them.
File sizes:
jakarta_indonesia.osm ......... 449.2 MB
jakarta_indonesia.osm.json .... 524.3 MB
In [13]:
from pymongo import MongoClient
client = MongoClient()
db = client['osm_data_import']
In [14]:
# Number of document
db.jakarta.find().count()
Out[14]:
In [15]:
# Number of nodes
db.jakarta.find({'type': 'node'}).count()
Out[15]:
In [16]:
# Number of way
db.jakarta.find({'type': 'way'}).count()
Out[16]:
In [17]:
# Number of unique user
len(db.jakarta.distinct('created.user'))
Out[17]:
In [18]:
# Top 10 contributing user
list(db.jakarta.aggregate([{'$group': {'_id': '$created.user', 'count': {'$sum': 1}}}, {'$sort':{'count':-1}}, {'$limit':10}]))
Out[18]:
In [19]:
# Place of worship breakdown
list(db.jakarta.aggregate([
{"$match": {"amenity": "place_of_worship"}},
{"$group":{"_id":"$religion", "count":{"$sum":1}}},
{"$sort":{"count":-1}}
]))
Out[19]:
Jakarta experiences flooding every year. Citizens believe there is a cycle: a small flood every year and a big one every five years.
Looking at the tag list, I saw this:
('flood:overflow', 2619),
('flood:rain', 4859),
('flood:rob', 1049),
('flood:send', 3362),
('flood_cause:overflowing_river', 2),
('flood_depth', 5860),
('flood_duration', 5696),
('flood_latest', 5845),
('flood_prone', 21051),
('floodprone', 19)
This is great: we already have flooding information. But I imagine it is difficult to add this information manually.
Fortunately, Indonesians love Twitter, and they tweet every time a flood happens. Some users turn on geolocation, so we could use those tweets to add more flooding information to our data: use the Twitter API to fetch flood-related tweets, extract the geolocation (reverse-geocoding with the Google API if needed), and add or update the corresponding entries in the OSM data.
The data I obtained from OSM is far from perfect. For the purpose of this exercise, however, I have cleaned up the addresses.
I wish I could clean the full address ('address.full' key) a bit more, but it is free-form text, which makes it really painful to parse.
Capitalization is also a problem in the data set, but I could not find a reference list of street names. We could simply title-case the addresses, but that would not always be accurate: one of the addresses is 'kh mas mansyur', which we cannot simply capitalize to 'Kh Mas Mansyur', as the correct form is 'KH Mas Mansyur'.
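One workable compromise is naive title-casing plus a hand-maintained exception list for abbreviations whose correct form is not Title Case. A minimal sketch (the exception list here is hypothetical and would need to be curated):

```python
# Known abbreviations that must not be naively capitalized.
EXCEPTIONS = {'kh': 'KH'}

def capitalize_street(name):
    """Title-case a street name, preserving known abbreviations."""
    return ' '.join(EXCEPTIONS.get(w.lower(), w.capitalize())
                    for w in name.split())
```

With this, `capitalize_street('kh mas mansyur')` yields `'KH Mas Mansyur'`, while ordinary names are title-cased as usual. The obvious downside is that the exception list grows by hand and is never complete.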
The data set also includes surrounding cities such as Tangerang, Bekasi and Bogor, so a better name for the data set would be Greater Jakarta.