Number of unique users: 953
The schema of the data in MongoDB after inserting it, as reported by variety (https://github.com/variety/variety):
In [20]:
r'''
+----------------------------------------------------------------------------+
| key                    | types    | occurrences | percents                 |
| ---------------------- | -------- | ----------- | ------------------------ |
| _id | ObjectId | 4063611 | 100.00000000000000000000 |
| created | Object | 4063611 | 100.00000000000000000000 |
| created.changeset | String | 4063611 | 100.00000000000000000000 |
| created.timestamp | String | 4063611 | 100.00000000000000000000 |
| created.uid | String | 4063611 | 100.00000000000000000000 |
| created.user | String | 4063611 | 100.00000000000000000000 |
| created.version | String | 4063611 | 100.00000000000000000000 |
| id | String | 4063611 | 100.00000000000000000000 |
| type | String | 4063611 | 100.00000000000000000000 |
| pos | Array | 3374750 | 83.04805750353564519628 |
| node_refs | Array | 688861 | 16.95194249646435125101 |
| address | Object | 2733 | 0.06725545333940674553 |
| address.housenumber | String | 1759 | 0.04328662364581649380 |
| address.street | String | 1022 | 0.02515004511996842343 |
| address.city | String | 922 | 0.02268917964834724424 |
| address.postcode | String | 766 | 0.01885022951261821136 |
| address.interpolation | String | 533 | 0.01311641296374086926 |
| address.country | String | 388 | 0.00954815802989016429 |
| address.housename | String | 180 | 0.00442955784891811699 |
| address.state | String | 89 | 0.00219017026974284695 |
| address.full | String | 60 | 0.00147651928297270574 |
| address.inclusion | String | 28 | 0.00068904233205392934 |
| address.buildingnumber | String | 23 | 0.00056599905847287051 |
| address.suburb | String | 12 | 0.00029530385659454114 |
| address.place | String | 8 | 0.00019686923772969410 |
| address.locality | String | 3 | 0.00007382596414863528 |
| address.district | String | 2 | 0.00004921730943242352 |
| address.area | String | 1 | 0.00002460865471621176 |
| address.block_number | String | 1 | 0.00002460865471621176 |
| address.city_1 | String | 1 | 0.00002460865471621176 |
| address.province | String | 1 | 0.00002460865471621176 |
| address.street_1 | String | 1 | 0.00002460865471621176 |
| address.street_2 | String | 1 | 0.00002460865471621176 |
| address.street_3 | String | 1 | 0.00002460865471621176 |
| address.subdistrict | String | 1 | 0.00002460865471621176 |
| address.unit | String | 1 | 0.00002460865471621176 |
+----------------------------------------------------------------------------+'''
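For reference, a schema report like the one above can be generated by running variety against the imported collection. A sketch of the invocation, based on variety's README and assuming the examples database and osm collection used later in this notebook:
In [ ]:
!mongo examples --eval "var collection = 'osm'" variety.js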
I was stuck on getting started with this project for a while, so I will follow a train-of-thought approach to the code. All final thoughts are summarised above this heading.
Let me start by adding some general functions that I will use for SAX-style iterative parsing and for producing a sample file to work with.
In [2]:
from collections import defaultdict
import xml.etree.cElementTree as ET
import re

def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield each top-level element if it is the right type of tag

    Reference:
    http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-file-generated-via-xml-etree-elementtree-in-python
    """
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if tags is not None and elem.tag not in tags:
            continue
        if event == 'end':
            yield elem
            # clear the root element to keep memory usage flat on large files
            root.clear()
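A quick usage sketch of this generator: count the top-level elements a file yields (commented out like the other expensive calls in this notebook; the sample file name assumes the sampling step below has been run):
In [ ]:
#sum(1 for _ in get_element("sample_10.osm"))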
In [3]:
def take_sample(k, osm_file, sample_file):
    with open(sample_file, 'wb') as output:
        output.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        output.write('<osm>\n  ')
        # Write every kth top level element
        for i, element in enumerate(get_element(osm_file)):
            if i % k == 0:
                output.write(ET.tostring(element, encoding='utf-8'))
        output.write('</osm>')
In [3]:
#take_sample(10, "new-delhi_india.osm", "sample_10.osm")
Now that we have sample files, let me try to understand exactly what kind of data we have in our tags.
In [12]:
#OSM_FILE = "new-delhi_india.osm"
OSM_FILE = "sample_100.osm"
In [4]:
def get_tag_types():
    tag_types = set()
    for element in get_element(OSM_FILE, tags=None):
        tag_types.add(element.tag)
    return tag_types

#get_tag_types()
In [5]:
def tag_attributes(osm_file, tags):
    for element in get_element(osm_file, tags):
        print element.attrib
In [ ]:
#tag_attributes(OSM_FILE, ('node',))
In [ ]:
#tag_attributes(OSM_FILE, ('nd',))
In [ ]:
#tag_attributes(OSM_FILE, ('member',))
In [ ]:
#tag_attributes(OSM_FILE, ('tag',))
In [ ]:
#tag_attributes(OSM_FILE, ('relation',))
In [ ]:
#tag_attributes(OSM_FILE, ('way',))
Now that we have an idea of what kind of data the sample file contains, let us check whether the tag keys we have are well formed.
In [4]:
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problem_chars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

"""
Your task is to explore the data a bit more.
Before you process the data and add it into your database, you should check the
"k" value for each "<tag>" and see if there are any potential problems.

We have provided you with 3 regular expressions to check for certain patterns
in the tags. As we saw in the quiz earlier, we would like to change the data
model and expand the "addr:street" type of keys to a dictionary like this:
{"address": {"street": "Some value"}}
So, we have to see if we have such tags, and if we have any tags with
problematic characters.

Please complete the function 'key_type', such that we have a count of each of
four tag categories in a dictionary:
  "lower", for tags that contain only lowercase letters and are valid,
  "lower_colon", for otherwise valid tags with a colon in their names,
  "problemchars", for tags with problematic characters, and
  "other", for other tags that do not fall into the other three categories.
"""

def _key_type(element, keys):
    if element.tag == "tag":
        k = element.attrib['k']
        if problem_chars.search(k):
            print "problemchars {}".format(k)
            keys['problemchars'] += 1
        elif lower_colon.search(k):
            keys['lower_colon'] += 1
        elif lower.search(k):
            keys['lower'] += 1
        else:
            #print "other {}".format(k)
            keys['other'] += 1
    return keys

def keys_type():
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for element in get_element(OSM_FILE, ('tag',)):
        keys = _key_type(element, keys)
    return keys
In [ ]:
keys_type()
In [5]:
"""
Your task is to explore the data a bit more.
The first task is a fun one - find out how many unique users
have contributed to the map in this particular area!
The function process_map should return a set of unique user IDs ("uid")
"""
def unique_user_contributed(tags = ('node','relation',)):
users = set()
for element in get_element(OSM_FILE, tags):
users.add(element.attrib['user'])
return users
#len(unique_user_contributed())
In [11]:
CREATED = ["version", "changeset", "timestamp", "user", "uid"]

def ensure_key_value(_dict, key, val):
    if key not in _dict:
        _dict[key] = val
    return _dict[key]

STATE_MAPPING = {
    'delhi': 'DL',
    'uttar pradesh': 'UP',
    'u.p.': 'UP',
    'ncr': 'DL'
}

CITY_MAPPING = {
    'gurugram': 'Gurgaon',
    'gurgram': 'Gurgaon',
    'faridabad': 'Faridabad',
    'delh': 'Delhi',
    'new delhi': 'Delhi',
    'neew delhi': 'Delhi',
    'delhi': 'Delhi',
    'old delhi': 'Delhi',
    'noida': 'Noida',
    'greater noida': 'Noida',
    'ghaziabad': 'Ghaziabad',
    'bahadurgarh': 'Bahadurgarh',
    'meerut': 'Meerut'
}

CITY_TO_STATE = {
    'Gurgaon': 'HR',
    'Faridabad': 'HR',
    'Delhi': 'DL',
    'Noida': 'UP',
    'Ghaziabad': 'UP',
    'Bahadurgarh': 'HR',
    'Meerut': 'UP'
}

def fix_address_value(address_type, value):
    def if_lower_in_mapping_then_replace(value, mapping):
        if value.lower() in mapping:
            value = mapping[value.lower()]
        if value not in set(mapping.values()):
            #print "{} = {}".format(address_type, value)
            pass
        return value
    if address_type == 'state':
        value = if_lower_in_mapping_then_replace(value, STATE_MAPPING)
    elif address_type == 'city':
        value = if_lower_in_mapping_then_replace(value, CITY_MAPPING)
    return value

def ensure_address(element_map):
    # creates a default address (country defaulted to 'IN') if none exists yet
    if 'address' not in element_map:
        element_map['address'] = {
            'country': 'IN'
        }
    return element_map['address']

def map_city_to_states(address_map):
    if 'city' in address_map:
        city = address_map['city']
        if city in CITY_TO_STATE:
            address_map['state'] = CITY_TO_STATE[city]

def fix_address(element_map):
    """
    After the individual address fields have been processed,
    process the address as a whole.
    """
    address_map = ensure_address(element_map)
    map_city_to_states(address_map)

def process_tags(element, node):
    for tag in element.iter('tag'):
        key = tag.attrib['k']
        value = tag.attrib['v']
        if problem_chars.search(key):
            continue
        if key.startswith("addr:"):
            _parts = key.split(":")
            if len(_parts) > 2:
                continue
            obj = ensure_key_value(node, 'address', {})
            address_type = _parts[1]
            value = fix_address_value(address_type, value)
            obj[address_type] = value
        else:
            node[key] = value
    fix_address(node)

def shape_element(element):
    """
    Takes an element and shapes it to be ready for insertion into the database
    """
    node = {}
    if element.tag == "node" or element.tag == "way":
        node['type'] = element.tag
        process_tags(element, node)
        for nd in element.iter('nd'):
            obj = ensure_key_value(node, 'node_refs', [])
            obj.append(nd.attrib['ref'])
        for key, value in element.attrib.iteritems():
            if key in CREATED:
                ensure_key_value(node, 'created', {})
                node['created'][key] = value
            elif key == 'lat':
                ensure_key_value(node, 'pos', [0, 0])
                node['pos'][0] = float(value)
            elif key == 'lon':
                ensure_key_value(node, 'pos', [0, 0])
                node['pos'][1] = float(value)
            else:
                node[key] = value
        return node
    else:
        return None
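As a sanity check on the shaping logic, here is a minimal, entirely hypothetical node run through shape_element (all ids, coordinates, and tag values are made up for illustration):
In [ ]:
sample_xml = '''<node id="123" lat="28.6" lon="77.2" uid="1" user="someone"
                      version="2" changeset="9" timestamp="2017-01-01T00:00:00Z">
    <tag k="addr:city" v="new delhi"/>
    <tag k="addr:state" v="ncr"/>
</node>'''
shape_element(ET.fromstring(sample_xml))
# Expected result, modulo key order:
# {'type': 'node', 'id': '123', 'pos': [28.6, 77.2],
#  'created': {'version': '2', 'changeset': '9', 'uid': '1',
#              'user': 'someone', 'timestamp': '2017-01-01T00:00:00Z'},
#  'address': {'city': 'Delhi', 'state': 'DL'}}
Note how CITY_MAPPING and STATE_MAPPING normalise 'new delhi' and 'ncr' on the way in.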
In [13]:
for element in get_element(OSM_FILE):
    node = shape_element(element)
In [5]:
import pprint

def get_client():
    from pymongo import MongoClient
    return MongoClient('mongodb://localhost:27017/')

def get_collection():
    collection = get_client().examples.osm
    return collection
In [9]:
import codecs
import json

def process_map(file_in, pretty=False):
    """
    Saves the file as line-delimited JSON, ready for insertion
    into MongoDB using mongoimport
    """
    file_out = "{0}.json".format(file_in)
    with codecs.open(file_out, "w") as fo:
        for element in get_element(file_in):
            el = shape_element(element)
            if el:
                if pretty:
                    fo.write(json.dumps(el, indent=2) + "\n")
                else:
                    fo.write(json.dumps(el) + "\n")
In [10]:
process_map(OSM_FILE)
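process_map writes one JSON document per line, which is the default format mongoimport consumes. A sketch of the import step, assuming a local mongod and the examples.osm namespace used below:
In [ ]:
!mongoimport --db examples --collection osm --file sample_100.osm.json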
In [6]:
collection = get_collection()
collection.count()
Out[6]:
In [71]:
len(collection.distinct("created.user"))
Out[71]:
Not a lot of users seem to be contributing to India's map.
In [1]:
# some helper functions for running MongoDB queries

def aggregate_to_list(collection, query):
    result = collection.aggregate(query)
    return list(result)

def aggregate_and_show(collection, query, limit=True):
    _query = query[:]
    if limit:
        _query.append({"$limit": 5})
    # run the (possibly limited) copy of the query
    pprint.pprint(aggregate_to_list(collection, _query))

def aggregate(query):
    aggregate_and_show(collection, query, False)

def aggregate_distincts(field, limit=False):
    query = [
        {"$match": {field: {"$exists": 1}}},
        {"$group": {"_id": "$" + field,
                    "count": {"$sum": 1}}},
        {"$sort": {"count": -1}}
    ]
    if limit:
        query.append({"$limit": 10})
    aggregate(query)
In [73]:
def contribution_of_top(n):
    result = aggregate_to_list(collection, [
        {"$group": {"_id": "$created.user",
                    "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
        {"$limit": n},
        {"$group": {"_id": 1,
                    "count": {"$sum": "$count"}}}
    ])
    return result[0]['count']

def contributions_of(top):
    """
    Given a list of numbers, returns a dictionary mapping each n to the
    percentage of all documents contributed by the top n users.
    """
    result = {}
    for count in top:
        result[count] = float(contribution_of_top(count) * 100) / collection.count()
    return result
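To make the percentage computation concrete, here is a purely hypothetical worked example (the real counts come from the query output below):
In [ ]:
# Purely hypothetical arithmetic, just to illustrate contributions_of:
# if the top 5 users had created 2,000,000 of the 4,063,611 documents,
# contributions_of([5]) would return about {5: 49.2}
#float(2000000 * 100) / 4063611  # ~49.2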
In [75]:
pprint.pprint(contributions_of([1, 5, 15, 30, 50]))
Out of 953 total users
In [78]:
aggregate([
    {"$group": {"_id": "$created.user",
                "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 10}
])
In [103]:
# How many users made the minimum number of contributions (typically a single edit)?
aggregate([
    {"$group": {"_id": "$created.user",
                "count": {"$sum": 1}}},
    {"$group": {"_id": "$count",
                "num_users": {"$sum": 1}}},
    {"$sort": {"_id": 1}},
    {"$limit": 1}
])
In [58]:
collection.count({"type":"node"})
Out[58]:
In [60]:
collection.count({"type":"way"})
Out[60]:
In [82]:
collection.distinct("address.country")
Out[82]:
In [66]:
collection.distinct("address.state")
Out[66]:
Let's look at the cases where the state is given as "NCR". Those will need to be fixed on a case-by-case basis rather than through a simple mapping.
In [89]:
ncr_cases = list(collection.find({"address.state": "NCR"}))
In [94]:
ncr_cases
Out[94]:
In [90]:
len(ncr_cases)
Out[90]:
This is a small number of cases, so they were probably all entered by the same user.
In [93]:
set(element['created']['user'] for element in ncr_cases)
Out[93]:
My hunch was right. Looking at the data, all the cases are in New Delhi, so I can map them to Delhi. To clean this data I map 'ncr' to 'DL' in the STATE_MAPPING defined earlier.
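With the STATE_MAPPING defined earlier, this is already handled by fix_address_value:
In [ ]:
fix_address_value('state', 'NCR')   # -> 'DL'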
In [82]:
collection.distinct("address.city")
Out[82]:
I was hoping that, since this is a map of Delhi, the city would only ever be Delhi, Gurgaon, or Faridabad. The spelling and case might differ, but still only those three.
But this data needs to be cleaned: there are sector names, area names, state names, etc. that should not be there.
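The misspellings that do follow a pattern are already normalised by CITY_MAPPING; for example:
In [ ]:
fix_address_value('city', 'Neew Delhi')   # -> 'Delhi'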
In [83]:
collection.distinct("address.street")
Out[83]:
In [85]:
aggregate_distincts("address.country")
In [13]:
aggregate_distincts("address.state")
In [7]:
aggregate_distincts("address.city", True)
In [8]:
aggregate_distincts("address.street", True)
In [15]:
aggregate_to_list(collection, [
    {"$match": {"address.city": 'Hira Colony, Siraspur, Delhi'}}
])
Out[15]:
In [9]:
aggregate_distincts("amenity", True)
In [10]:
aggregate_distincts("landuse", True)
In [34]:
aggregate_distincts("place")