Data source: https://s3.amazonaws.com/metro-extracts.mapzen.com/jakarta_indonesia.osm.bz2
In [1]:
from osm_dataauditor import OSMDataAuditor
osm_data = OSMDataAuditor('jakarta_indonesia.osm')
In [2]:
# Basic element check
osm_data.count_element()
Out[2]:
OSM allows a very flexible tagging system, which gives users freedom but causes consistency problems. Below I count the occurrences of every tag in the document.
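The tag-key counting done by `get_tag_keys()` below can be sketched roughly as follows. This is an assumed implementation (the real `OSMDataAuditor` code is not shown here): walk the file incrementally with `iterparse` and count every `<tag k="...">` attribute.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def get_tag_keys(filename):
    """Count occurrences of each tag key in an OSM XML file."""
    counts = Counter()
    for _, elem in ET.iterparse(filename):
        if elem.tag == 'tag':
            counts[elem.attrib['k']] += 1
        elem.clear()  # keep memory bounded on a 449 MB file
    return sorted(counts.items())
```

Using `iterparse` rather than parsing the whole tree at once matters here, since loading the full 449 MB file into memory would be wasteful.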
In [3]:
# Check the tag key and element
tag_keys = osm_data.get_tag_keys()
Below I list the top 20 tag keys in descending order, as the full list would be too long to read.
In [4]:
sorted(tag_keys, key=lambda x: x[1], reverse=True)[:20]
Out[4]:
Looking through the data I see several things:
On the topic of duplicate tags, below I show several tags with similar names.
In [5]:
import re
# Name vs Nama (Nama is Indonesian for Name)
[item for item in tag_keys if re.match('(name|nama)$', item[0], re.I)]
Out[5]:
In [6]:
# Province vs propinsi (Propinsi is Indonesian for Province)
[item for item in tag_keys if re.match('(province|propinsi)$', item[0], re.I)]
Out[6]:
In [7]:
# Alamat vs address, (Alamat is Indonesian for address)
sorted([item for item in tag_keys if re.match('(addr|alamat)', item[0], re.I)], key=lambda x: x[1], reverse=True)
Out[7]:
We see that for addresses there are three related tags: 'ALAMAT', 'addr:street' and 'addr:full'. Both 'addr:street' and 'addr:full' are valid tags, so we cannot merge them. The OSM wiki implies that using 'addr:street' together with the other supporting fields is better than using 'addr:full', but our data contains more 'addr:full' than 'addr:street' entries (10038 vs 1441).
Indonesia has a fairly complex administrative subdivision. It is divided as follows:
Jakarta (a province) is divided into 4 cities: Jakarta Selatan, Jakarta Utara, Jakarta Barat and Jakarta Timur.
And then there are non-administrative divisions like RT and RW. While RT and RW are considered non-administrative subdivisions, they are widely used (the national ID card includes and requires this information).
I believe this is what leads users to simply put the whole address in 'addr:full', as it is much simpler. Separating an address into its components is often difficult, and the components do not actually match the OSM address tags (OSM has nothing like 'Kabupaten' or 'Kecamatan'). But as OSM warns, putting everything in 'addr:full' makes the data harder for software to parse.
There have been some efforts by the community to add the subdivisions to the data, but the results are not consistent. For instance, for Regency we have 'KAB_NAME', 'kab_name', 'Kabupaten', 'kab.', etc. And some way nodes use the 'admin_level' tag with 'kabupaten' as the value.
The problems with the address prefixes:
For the street-name prefix there are several variations: 'jl.', 'jln.', 'jl', 'jln'. Some use all upper case, some all lower case, and some are mixed.
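Normalizing these variants can be sketched with a single case-insensitive regular expression. This is a minimal sketch, not the auditor's actual implementation: it collapses the observed prefixes ('jl.', 'jln.', 'jl', 'jln', in any capitalization) into the full word 'Jalan'.

```python
import re

# Match an abbreviated street prefix at the start of the name:
# 'jl' or 'jln', optional dot, then whitespace (case-insensitive).
PREFIX_RE = re.compile(r'^j(?:ln|l)\.?\s+', re.IGNORECASE)

def normalize_street(name):
    """Replace abbreviated street prefixes with 'Jalan '."""
    return PREFIX_RE.sub('Jalan ', name.strip())
```

For example, `normalize_street('JL. Sudirman')` gives `'Jalan Sudirman'`, while a name that already starts with 'Jalan' is left unchanged.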
In [8]:
abbreviated_st = osm_data.audit_abbreviated_street_prefixes()
print "Total abbreviated street names:", len(abbreviated_st)
# Print the first 10 rows
abbreviated_st[:10]
Out[8]:
In [9]:
abbreviated_alley = osm_data.audit_abbreviated_alley_prefixes()
print "Total abbreviated alley names:", len(abbreviated_alley)
# Print the first 10 rows
abbreviated_alley[:10]
Out[9]:
To check the similarity between the reference street names and the ones in our MongoDB, I use Python's built-in difflib library. Its SequenceMatcher class compares two strings and returns a similarity ratio. We are interested in strings with a ratio above 0.65 and below 1. We also ignore prefixes and suffixes such as 'raya', a common suffix for street names in Indonesia.
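The comparison itself is simple; a minimal sketch with made-up street names (assuming prefixes and the 'raya' suffix have already been stripped):

```python
from difflib import SequenceMatcher

def similarity(reference, candidate):
    """Return difflib's similarity ratio for two names, case-folded."""
    return SequenceMatcher(None, reference.lower(), candidate.lower()).ratio()

score = similarity('Gatot Subroto', 'Gatot Soebroto')
print(round(score, 2))
# a score strictly between 0.65 and 1 is flagged for manual review
```

Identical strings score exactly 1.0, which is why we exclude 1 — those pairs need no review.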
In [10]:
closely_matched = osm_data.audit_address_similar_names()
closely_matched = sorted(list(closely_matched), key=lambda x: x[2], reverse=True)
print "Total similar items found: ", len(closely_matched)
# Display top 10 with score
[(reference, found, score) for reference, found, score in closely_matched[:10]]
Out[10]:
We still need to manually check and replace the street name, but it is a much simpler task.
In [12]:
city_names = osm_data.audit_city()
print "Different city names: ", len(city_names)
There are 44 unique city names; after looking at the result, many of them are invalid. Some list only 'Jakarta' as the city. What is surprising, though, is that cities from the surrounding areas are included as well, for instance Tangerang, Bekasi and Bogor.
Since it is common in Indonesia to reuse the same street name across multiple cities, we need to be conservative here. I will only update the city field if it already exists and its value is just 'Jakarta'.
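That conservative rule translates into a narrow MongoDB filter. Below is a hypothetical helper (not from the auditor code) that builds the filter/update pair; only documents whose city field exists and equals 'Jakarta' would match.

```python
def city_update_spec(node_id, new_city):
    """Build a (filter, update) pair that only touches documents
    whose address.city is present and exactly 'Jakarta'."""
    filt = {'id': node_id, 'address.city': 'Jakarta'}
    update = {'$set': {'address.city': new_city}}
    return filt, update
```

The pair would then be applied with something like `db.jakarta.update_one(*city_update_spec(node_id, city))`; documents with a different or missing city value are left alone.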
Unfortunately most of the addresses are in the addr:full tag instead of addr:street (4515 vs 461). Looking through the content of addr:full, we can see several variations:
RT and RW only, without a street name, for example:
Street name with RT but no RW, for example:
Street name with a house number, for example:
etc.
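A rough way to recognize some of these variants is pattern matching. The patterns below are hypothetical simplifications (the real data is messier), just to show the idea:

```python
import re

# Hypothetical patterns for the addr:full variants listed above.
RT_RW = re.compile(r'\bRT\.?\s*\d+\s*[/\s]\s*RW\.?\s*\d+', re.IGNORECASE)
HOUSE_NO = re.compile(r'\b(?:No\.?\s*)?\d+\s*$', re.IGNORECASE)

def classify_addr_full(addr):
    """Roughly bucket an addr:full string by its apparent structure."""
    if RT_RW.search(addr):
        return 'rt_rw'
    if HOUSE_NO.search(addr):
        return 'house_number'
    return 'other'
```

Even a rough classifier like this would help estimate how much of the addr:full data is recoverable into structured fields.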
I won't be cleaning addr:full further (other than the prefix fixes) in this project, but I think it needs to be addressed here.
This section contains basic statistics about the dataset and the MongoDB queries used to gather them.
File sizes:
jakarta_indonesia.osm ......... 449.2 MB
jakarta_indonesia.osm.json .... 524.3 MB
In [13]:
from pymongo import MongoClient
client = MongoClient()
db = client['osm_data_import']
In [14]:
# Number of document
db.jakarta.find().count()
Out[14]:
In [15]:
# Number of nodes
db.jakarta.find({'type': 'node'}).count()
Out[15]:
In [16]:
# Number of way
db.jakarta.find({'type': 'way'}).count()
Out[16]:
In [17]:
# Number of unique user
len(db.jakarta.distinct('created.user'))
Out[17]:
In [18]:
# Top 10 contributing user
list(db.jakarta.aggregate([{'$group': {'_id': '$created.user', 'count': {'$sum': 1}}}, {'$sort':{'count':-1}}, {'$limit':10}]))
Out[18]:
In [19]:
# Place of worship breakdown
list(db.jakarta.aggregate([
{"$match": {"amenity": "place_of_worship"}},
{"$group":{"_id":"$religion", "count":{"$sum":1}}},
{"$sort":{"count":-1}}
]))
Out[19]:
Jakarta experiences flooding every year. Citizens believe there is a cycle: a small flood every year and a big one every five years.
Looking at the tag list, I saw this:
('flood:overflow', 2619),
('flood:rain', 4859),
('flood:rob', 1049),
('flood:send', 3362),
('flood_cause:overflowing_river', 2),
('flood_depth', 5860),
('flood_duration', 5696),
('flood_latest', 5845),
('flood_prone', 21051),
('floodprone', 19)
This is great: we already have flooding information. But I imagine it is difficult to add this information manually.
Fortunately, Indonesians love Twitter, and they tweet every time a flood happens. Some users turn on geolocation, so we could use those tweets to add more flooding information to our data: use the Twitter API to fetch flood-related tweets, extract the geolocation (reverse-geocoding with the Google API if needed), and add or update the corresponding entries in the OSM data.
The data I obtained from OSM is far from perfect. For the purpose of this exercise, however, I have cleaned up the addresses.
I wish I could clean the full address ('address.full' key) a bit more, but it is free-form text, which makes it really painful to parse.
Capitalization is also a problem in the data set, but I could not find a reference list of street names. We could simply title-case the addresses, but that would not always be accurate: one of the addresses is 'kh mas mansyur', which we cannot simply capitalize to 'Kh Mas Mansyur', as the correct form is 'KH Mas Mansyur'.
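One workable compromise is naive title-casing plus a hand-maintained exception list for abbreviations whose correct form is not Title Case. A minimal sketch (the exception list here is hypothetical and would need to be curated):

```python
# Known abbreviations that must not be naively capitalized.
EXCEPTIONS = {'kh': 'KH'}

def capitalize_street(name):
    """Title-case a street name, preserving known abbreviations."""
    return ' '.join(EXCEPTIONS.get(w.lower(), w.capitalize())
                    for w in name.split())
```

With this, `capitalize_street('kh mas mansyur')` yields `'KH Mas Mansyur'`, while ordinary names are title-cased as usual. The obvious downside is that the exception list grows by hand and is never complete.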
The data set also includes surrounding cities such as Tangerang, Bekasi and Bogor, so a better name for the data set would be Greater Jakarta.