Data Wrangling Project

Map Area: New Delhi, India

Overview of the data

  • new-delhi_india.osm 710 MB
  • new-delhi_india.osm.json 816 MB
  • Number of records = 4063611
  • Number of unique users 953

  • The schema of the data found in MongoDB after inserting it, as reported by https://github.com/variety/variety


In [20]:
r'''<pre>
+----------------------------------------------------------------------------+
| key                    | types    | occurrences | percents                 |
| ---------------------- | -------- | ----------- | ------------------------ |
| _id                    | ObjectId |     4063611 | 100.00000000000000000000 |
| created                | Object   |     4063611 | 100.00000000000000000000 |
| created.changeset      | String   |     4063611 | 100.00000000000000000000 |
| created.timestamp      | String   |     4063611 | 100.00000000000000000000 |
| created.uid            | String   |     4063611 | 100.00000000000000000000 |
| created.user           | String   |     4063611 | 100.00000000000000000000 |
| created.version        | String   |     4063611 | 100.00000000000000000000 |
| id                     | String   |     4063611 | 100.00000000000000000000 |
| type                   | String   |     4063611 | 100.00000000000000000000 |
| pos                    | Array    |     3374750 |  83.04805750353564519628 |
| node_refs              | Array    |      688861 |  16.95194249646435125101 |
| address                | Object   |        2733 |   0.06725545333940674553 |
| address.housenumber    | String   |        1759 |   0.04328662364581649380 |
| address.street         | String   |        1022 |   0.02515004511996842343 |
| address.city           | String   |         922 |   0.02268917964834724424 |
| address.postcode       | String   |         766 |   0.01885022951261821136 |
| address.interpolation  | String   |         533 |   0.01311641296374086926 |
| address.country        | String   |         388 |   0.00954815802989016429 |
| address.housename      | String   |         180 |   0.00442955784891811699 |
| address.state          | String   |          89 |   0.00219017026974284695 |
| address.full           | String   |          60 |   0.00147651928297270574 |
| address.inclusion      | String   |          28 |   0.00068904233205392934 |
| address.buildingnumber | String   |          23 |   0.00056599905847287051 |
| address.suburb         | String   |          12 |   0.00029530385659454114 |
| address.place          | String   |           8 |   0.00019686923772969410 |
| address.locality       | String   |           3 |   0.00007382596414863528 |
| address.district       | String   |           2 |   0.00004921730943242352 |
| address.area           | String   |           1 |   0.00002460865471621176 |
| address.block_number   | String   |           1 |   0.00002460865471621176 |
| address.city_1         | String   |           1 |   0.00002460865471621176 |
| address.province       | String   |           1 |   0.00002460865471621176 |
| address.street_1       | String   |           1 |   0.00002460865471621176 |
| address.street_2       | String   |           1 |   0.00002460865471621176 |
| address.street_3       | String   |           1 |   0.00002460865471621176 |
| address.subdistrict    | String   |           1 |   0.00002460865471621176 |
| address.unit           | String   |           1 |   0.00002460865471621176 |
+----------------------------------------------------------------------------+'''

None

Other ideas about the datasets

Analysis and code starts here

I was stuck on getting started with this project for a while, so I will follow a train-of-thought approach in this project's code. All final thoughts are summarised before this heading.

Let me start by adding some general functions that I will use for SAX-style iteration over the XML and for making a sample file to work with.


In [2]:
from collections import defaultdict

import xml.etree.cElementTree as ET
import re

def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag

    Reference:
    http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-file-generated-via-xml-etree-elementtree-in-python
    """
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if tags is not None and elem.tag not in tags:
            continue
        if event == 'end':
            yield elem
            root.clear()
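
As a quick check of the streaming approach, here is a minimal sketch that runs the same iterparse pattern on a tiny in-memory document. The sample XML and the name `iter_top_level` are made up for illustration, and it imports `xml.etree.ElementTree` (rather than `cElementTree`) so that it also runs on Python 3.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical three-element OSM fragment, for illustration only.
SAMPLE = b"""<osm>
  <node id="1" lat="28.61" lon="77.20" user="a"/>
  <way id="2" user="b">
    <nd ref="1"/>
    <tag k="highway" v="residential"/>
  </way>
  <node id="3" lat="28.62" lon="77.21" user="a"/>
</osm>"""


def iter_top_level(source, tags=('node', 'way', 'relation')):
    """Yield each completed element whose tag is in `tags`."""
    context = iter(ET.iterparse(source, events=('start', 'end')))
    _, root = next(context)  # the first event is the start of <osm>
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()  # drop already-processed children to bound memory


seen = [(elem.tag, elem.get('id'))
        for elem in iter_top_level(io.BytesIO(SAMPLE))]
```

Only the top-level elements are yielded; the nested `nd` and `tag` children stay attached to their parent `way` until it has been consumed.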

In [3]:
def take_sample(k, osm_file, sample_file):
    with open(sample_file, 'wb') as output:
        output.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        output.write('<osm>\n  ')

        # Write every kth top level element
        for i, element in enumerate(get_element(osm_file)):
            if i % k == 0:
                # print "i is {}".format(i)
                output.write(ET.tostring(element, encoding='utf-8'))

        output.write('</osm>')

In [3]:
#take_sample(10, "new-delhi_india.osm", "sample_10.osm")

Now that we have sample files, let me try to understand exactly what kind of data we have in our tags.


In [12]:
#OSM_FILE = "new-delhi_india.osm"
OSM_FILE = "sample_100.osm"

In [4]:
def get_tag_types():
    tag_types = set()
    for element in get_element(OSM_FILE, tags=None):
        tag_types.add(element.tag)
    return tag_types

#get_tag_types()

In [5]:
def tag_attributes(osm_file, tags):
    for element in get_element(osm_file, tags):
        print element.attrib

In [ ]:
#tag_attributes(OSM_FILE, ('node',))

In [ ]:
#tag_attributes(OSM_FILE, ('nd',))

In [ ]:
#tag_attributes(OSM_FILE, ('member',))

In [ ]:
#tag_attributes(OSM_FILE, ('tag',))

In [ ]:
#tag_attributes(OSM_FILE, ('relation',))

In [ ]:
#tag_attributes(OSM_FILE, ('way',))

Now that we have an idea about what kind of data we have in our sample file, let us start by finding out whether the keys that we have are fine or not.


In [4]:
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problem_chars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

# Check the "k" value of each <tag> for potential problems before loading the
# data into the database. Keys are counted in four categories:
#   "lower":        only lowercase letters and underscores, valid as-is
#   "lower_colon":  otherwise valid keys with one colon, e.g. "addr:street",
#                   which is later expanded to {"address": {"street": "Some value"}}
#   "problemchars": keys containing problematic characters
#   "other":        keys that fall into none of the three categories above

def _key_type(element, keys):
    if element.tag == "tag":
        k = element.attrib['k']
        if problem_chars.search(k):
            print "problemchars {}".format(k)
            keys['problemchars'] += 1
        elif lower_colon.search(k):
            keys['lower_colon'] += 1
        elif lower.search(k):
            keys['lower'] += 1
        else:
            #print "other {}".format(k)
            keys['other'] += 1
        
    return keys

def keys_type():
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for element in get_element(OSM_FILE, ('tag',)):
        keys = _key_type(element, keys)
        
    return keys

In [ ]:
keys_type()
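
To make the four categories concrete, here is a small sketch that applies the same three regular expressions to a handful of keys. The list `SAMPLE_KEYS` and the helper name `classify_key` are assumptions for illustration; only `addr:housename:source` is a key that actually shows up later in this notebook.

```python
import re
from collections import Counter

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problem_chars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify_key(k):
    """Return the category name for a single tag key."""
    if problem_chars.search(k):
        return 'problemchars'
    if lower_colon.search(k):
        return 'lower_colon'
    if lower.search(k):
        return 'lower'
    return 'other'


# Hypothetical keys chosen to hit every category.
SAMPLE_KEYS = ['highway', 'addr:street', 'addr:housename:source',
               'name 1', 'FIXME']
counts = Counter(classify_key(k) for k in SAMPLE_KEYS)
```

Keys with two colons and keys with uppercase letters both end up in "other", which is why that bucket is worth printing and inspecting by hand.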

In [5]:
# Find out how many unique users have contributed to the map in this area.
# Returns the set of user names (the "user" attribute) found on the given tags.
# Note that the default tags do not include 'way'.
def unique_user_contributed(tags = ('node','relation',)):
    users = set()
    for element in get_element(OSM_FILE, tags):
        users.add(element.attrib['user'])
    return users
        
#len(unique_user_contributed())

In [11]:
CREATED = ["version", "changeset", "timestamp", "user", "uid"]

def ensure_key_value(_dict, key, val):
    if key not in _dict:
        _dict[key] = val
    return _dict[key]

STATE_MAPPING = {
    'delhi': 'DL',
    'uttar pradesh': 'UP',
    'u.p.': 'UP',
    'ncr': 'DL'
}

CITY_MAPPING = {
    'gurugram': 'Gurgaon',
    'gurgram': 'Gurgaon',
    'faridabad': 'Faridabad',
    'delh': 'Delhi',
    'new delhi': 'Delhi',
    'neew delhi': 'Delhi',
    'delhi': 'Delhi',
    'old delhi': 'Delhi',
    'noida': 'Noida',
    'greater noida': 'Noida',
    'ghaziabad': 'Ghaziabad',
    'bahadurgarh': 'Bahadurgarh',
    'meerut': 'Meerut'
}



CITY_TO_STATE = {
    'Gurgaon': 'HR',
    'Faridabad': 'HR',
    'Delhi': 'DL',
    'Noida': 'UP',
    'Ghaziabad': 'UP',
    'Bahadurgarh': 'HR',
    'Meerut': 'UP'
}


def fix_address_value(address_type, value):
    
    def if_lower_in_mapping_then_replace(value, mapping):
        if value.lower() in mapping:
            value = mapping[value.lower()]
        
        if value not in set(mapping.values()):
            #print "{} = {}".format(address_type, value)
            pass
        return value
    
    if address_type == 'state':
        value = if_lower_in_mapping_then_replace(value, STATE_MAPPING)
    elif address_type == 'city':
        value = if_lower_in_mapping_then_replace(value, CITY_MAPPING)
        
    return value


def ensure_address(element_map):
    if 'address' not in element_map:
        element_map['address'] = {
                'country': 'IN'
            }
    return element_map['address']


def map_city_to_states(address_map):
    if 'city' in address_map:
        city = address_map['city']
        if city in CITY_TO_STATE:
            address_map['state'] = CITY_TO_STATE[city]
            

def fix_address(element_map):
    """
    After we are done with general processing of individual address fields
    we process it as a whole
    """
    address_map = ensure_address(element_map)
    
    map_city_to_states(address_map)


def process_tags(element, node):
    for tag in element.iter('tag'):
        key = tag.attrib['k']
        value = tag.attrib['v']

        if problem_chars.search(key):
            continue

        if key.startswith("addr:"):
            _parts = key.split(":")
            if len(_parts) > 2:
                continue

            obj = ensure_key_value(node, 'address', {})

            address_type = _parts[1]
            value = fix_address_value(address_type, value)

            obj[address_type] = value
        else:
            node[key] = value

    fix_address(node)

def shape_element(element):
    """
    Takes an element and shapes it to be ready for insertion into the database
    """
    node = {}

    if element.tag == "node" or element.tag == "way":

        node['type'] = element.tag
        process_tags(element, node)
        
        for nd in element.iter('nd'):
            obj = ensure_key_value(node, 'node_refs', [])
            obj.append(nd.attrib['ref'])

        for key, value in element.attrib.iteritems():
            if key in CREATED:
                ensure_key_value(node, 'created', {})
                node['created'][key] = value
            elif key == 'lat':
                ensure_key_value(node, 'pos', [0, 0])
                node['pos'][0] = float(value)
            elif key == 'lon':
                ensure_key_value(node, 'pos', [0, 0])
                node['pos'][1] = float(value)
            else:
                node[key] = value

        return node
    else:
        return None

In [13]:
for element in get_element(OSM_FILE):
    node = shape_element(element)

In [5]:
import pprint

def get_client():
    from pymongo import MongoClient
    return MongoClient('mongodb://localhost:27017/')
    
def get_collection():    
    collection = get_client().examples.osm
    return collection

Let's load data


In [9]:
import codecs
import json

def process_map(file_in, pretty = False):
    """
    Saves file as a json ready for insertion into mongoDB using mongoimport
    """
    file_out = "{0}.json".format(file_in)
    #data = []
    with codecs.open(file_out, "w") as fo:
        for element in get_element(file_in):
            el = shape_element(element)
            if el:
                #data.append(el)
                if pretty:
                    fo.write(json.dumps(el, indent=2)+"\n")
                else:
                    fo.write(json.dumps(el) + "\n")
    #return data

In [10]:
process_map(OSM_FILE)


bigger address addr:housename:source
bigger address addr:housename:source
bigger address addr:housename:source
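
The file written above has one JSON document per line, which is the layout mongoimport expects. As an alternative to mongoimport, the file could also be loaded from Python in batches. The sketch below is only one way to do that: the helper name `iter_batches` and the batch size are my assumptions, and the actual insert is left as a comment because it needs a running MongoDB.

```python
import json


def iter_batches(path, batch_size=1000):
    """Yield lists of at most `batch_size` documents from a JSON-lines file."""
    batch = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final, possibly shorter, batch


# Usage sketch (not run here, assumes the collection from get_collection()):
# for batch in iter_batches("sample_100.osm.json"):
#     collection.insert_many(batch)
```

Batching keeps memory use flat for the 816 MB file, since only one batch of documents is held at a time.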

In [6]:
collection = get_collection()
collection.count()


Out[6]:
4063611

Number of unique users


In [71]:
len(collection.distinct("created.user"))


Out[71]:
953

Not many users seem to be contributing to the map of this area, given that it has over 4 million records.

Analysis start


In [1]:
# some helper functions for running mongo DB queries

def aggregate_to_list(collection, query):
    result = collection.aggregate(query)
    return list(r for r in result)

def aggregate_and_show(collection, query, limit = True):
    _query = query[:]
    if limit:
        _query.append({"$limit": 5})

    # Run the copy so that the optional $limit stage is actually applied
    pprint.pprint(aggregate_to_list(collection, _query))
    
def aggregate(query):
    aggregate_and_show(collection, query, False)
    
def aggregate_distincts(field, limit = False):
    query = [
            {"$match": {field: {"$exists": 1}}},
            {"$group": {"_id": "$" + field,
                        "count": {"$sum": 1}}},
            {"$sort": {"count": -1}}
        ]
    if limit:
        query.append({"$limit": 10})
    aggregate(query)

How much are people contributing


In [73]:
def contribution_of_top(n):
    result = aggregate_to_list(collection, [
            {"$group": {"_id": "$created.user",
                        "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
            {"$limit": n},
            {"$group": {"_id": 1,
                        "count": {"$sum": "$count"}}}
    ])
    
    return result[0]['count']

def contributions_of(top):
    """
    Given a list of numbers returns a dictionary of contributions of those number of user 
    """
    result = {}
    for count in top:
        result[count] = float(contribution_of_top(count) * 100)  / collection.count()
    return result

In [75]:
pprint.pprint(contributions_of([1, 5, 15, 30, 50]))


{1: 6.18341667054253,
 5: 20.902025316891798,
 15: 48.18089133039555,
 30: 74.26864923832522,
 50: 90.9610688621524}

Out of 953 total users

  • the top 5 users have contributed about 21%
  • the top 15 users have contributed about 48%
  • the top 30 users have contributed about 74%
  • the top 50 users have contributed about 91%
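
The percentages above come from a MongoDB pipeline, but the arithmetic behind them is simple: sort the per-user counts in descending order, add up the first n and divide by the total. Here is a pure-Python sketch of that calculation. The counts are made up and `top_n_share` is a hypothetical helper, not something used elsewhere in this notebook.

```python
def top_n_share(counts, n):
    """Percentage of all contributions made by the n most active users."""
    ordered = sorted(counts, reverse=True)
    return 100.0 * sum(ordered[:n]) / sum(ordered)


# Made-up per-user contribution counts that add up to 100.
HYPOTHETICAL_COUNTS = [50, 30, 10, 5, 3, 1, 1]
shares = {n: top_n_share(HYPOTHETICAL_COUNTS, n) for n in (1, 2, 5)}
# shares == {1: 50.0, 2: 80.0, 5: 98.0}
```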

Top contributing users


In [78]:
aggregate([
        {"$group": {"_id": "$created.user",
                    "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
        {"$limit": 10}
    ])


[{u'_id': u'Oberaffe', u'count': 251270},
 {u'_id': u'premkumar', u'count': 165252},
 {u'_id': u'saikumar', u'count': 160227},
 {u'_id': u'Naresh08', u'count': 137957},
 {u'_id': u'anushap', u'count': 134671},
 {u'_id': u'sdivya', u'count': 130198},
 {u'_id': u'anthony1', u'count': 126172},
 {u'_id': u'himabindhu', u'count': 124236},
 {u'_id': u'sathishshetty', u'count': 122461},
 {u'_id': u'Apreethi', u'count': 114575}]

Number of users making only 1 contribution


In [103]:
aggregate([
        {"$group": {"_id": "$created.user",
                    "count": {"$sum": 1}}},
        {"$group": {"_id": "$count", 
                    "num_users": {"$sum": 1}}},
        {"$sort": {"_id": 1}},
        {"$limit": 1}
    ])


[{u'_id': 1, u'num_users': 191}]
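
The two `$group` stages above first count records per user and then count users per record count. The same idea can be sketched in plain Python with two nested `Counter`s. The user names here are invented for illustration.

```python
from collections import Counter

# Invented list with one entry per record, holding the contributing user.
HYPOTHETICAL_USERS = ['asha', 'ravi', 'asha', 'meena', 'ravi', 'asha', 'john']

per_user = Counter(HYPOTHETICAL_USERS)         # records per user
users_per_count = Counter(per_user.values())   # users per record count
single = users_per_count[1]                    # users with exactly one record
```

In this toy data two users (meena and john) have a single record, so `single` is 2.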

Number of nodes and ways


In [58]:
collection.count({"type":"node"})


Out[58]:
3374750

In [60]:
collection.count({"type":"way"})


Out[60]:
688861

Looking at the other problems in the data

Country


In [82]:
collection.distinct("address.country")


Out[82]:
[u'IN']

Country is correct.

State

There may be more than one state in this data, because a map of "New Delhi" usually covers the "National Capital Region", which includes New Delhi and some adjoining cities in neighbouring states.


In [66]:
collection.distinct("address.state")


Out[66]:
[u'DL', u'UP']
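
The raw values seen before cleaning (the output above appears to be from a later run on cleaned data) show that:

  • Delhi and DL mean the same thing
  • UP, uttar pradesh and U.P. mean the same thing
  • NCR is not a state but a region encompassing many cities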

Let's look at the cases where the state is given as NCR. These may need to be fixed case by case rather than with a simple mapping.


In [89]:
ncr_cases = list(r for r in collection.find({"address.state": "NCR"}))

In [94]:
ncr_cases


Out[94]:
[{u'_id': ObjectId('57cd648ab366e2ec273eddfe'),
  u'address': {u'city': u'New Delhi',
   u'country': u'IN',
   u'housenumber': u'279',
   u'state': u'NCR',
   u'street': u'NIT'},
  u'created': {u'changeset': u'8017112',
   u'timestamp': u'2011-05-01T00:30:50Z',
   u'uid': u'5456',
   u'user': u'H_S_Rai',
   u'version': u'2'},
  u'id': u'1267734091',
  u'pos': [28.5676165, 77.1988599],
  u'type': u'node'},
 {u'_id': ObjectId('57cd648ab366e2ec273ede01'),
  u'address': {u'city': u'New Delhi',
   u'country': u'IN',
   u'housenumber': u'263',
   u'state': u'NCR',
   u'street': u'NIT'},
  u'created': {u'changeset': u'8017112',
   u'timestamp': u'2011-05-01T00:30:50Z',
   u'uid': u'5456',
   u'user': u'H_S_Rai',
   u'version': u'2'},
  u'id': u'1267734104',
  u'pos': [28.5689172, 77.1987795],
  u'type': u'node'},
 {u'_id': ObjectId('57cd648ab366e2ec273ede25'),
  u'address': {u'city': u'New Delhi',
   u'country': u'IN',
   u'housenumber': u'275',
   u'state': u'NCR',
   u'street': u'NIT'},
  u'created': {u'changeset': u'8017112',
   u'timestamp': u'2011-05-01T00:30:48Z',
   u'uid': u'5456',
   u'user': u'H_S_Rai',
   u'version': u'1'},
  u'id': u'1268299728',
  u'pos': [28.5679417, 77.1988398],
  u'type': u'node'},
 {u'_id': ObjectId('57cd648ab366e2ec273ede26'),
  u'address': {u'city': u'New Delhi',
   u'country': u'IN',
   u'housenumber': u'278',
   u'state': u'NCR',
   u'street': u'NIT'},
  u'created': {u'changeset': u'8017112',
   u'timestamp': u'2011-05-01T00:30:48Z',
   u'uid': u'5456',
   u'user': u'H_S_Rai',
   u'version': u'1'},
  u'id': u'1268299731',
  u'pos': [28.5676978, 77.1988548],
  u'type': u'node'},
 {u'_id': ObjectId('57cd648ab366e2ec273ede27'),
  u'address': {u'city': u'New Delhi',
   u'country': u'IN',
   u'housenumber': u'273',
   u'state': u'NCR',
   u'street': u'NIT'},
  u'created': {u'changeset': u'8017112',
   u'timestamp': u'2011-05-01T00:30:49Z',
   u'uid': u'5456',
   u'user': u'H_S_Rai',
   u'version': u'1'},
  u'id': u'1268299732',
  u'pos': [28.5681043, 77.1988297],
  u'type': u'node'},
 {u'_id': ObjectId('57cd648ab366e2ec273ede28'),
  u'address': {u'city': u'New Delhi',
   u'country': u'IN',
   u'housenumber': u'270',
   u'state': u'NCR',
   u'street': u'NIT'},
  u'created': {u'changeset': u'8017112',
   u'timestamp': u'2011-05-01T00:30:48Z',
   u'uid': u'5456',
   u'user': u'H_S_Rai',
   u'version': u'1'},
  u'id': u'1268299729',
  u'pos': [28.5683482, 77.1988147],
  u'type': u'node'},
 {u'_id': ObjectId('57cd648ab366e2ec273ede29'),
  u'address': {u'city': u'New Delhi',
   u'country': u'IN',
   u'housenumber': u'268',
   u'state': u'NCR',
   u'street': u'NIT'},
  u'created': {u'changeset': u'8017112',
   u'timestamp': u'2011-05-01T00:30:49Z',
   u'uid': u'5456',
   u'user': u'H_S_Rai',
   u'version': u'1'},
  u'id': u'1268299733',
  u'pos': [28.5685107, 77.1988046],
  u'type': u'node'},
 {u'_id': ObjectId('57cd648ab366e2ec273ede2a'),
  u'address': {u'city': u'New Delhi',
   u'country': u'IN',
   u'housenumber': u'276',
   u'state': u'NCR',
   u'street': u'NIT'},
  u'created': {u'changeset': u'8017112',
   u'timestamp': u'2011-05-01T00:30:49Z',
   u'uid': u'5456',
   u'user': u'H_S_Rai',
   u'version': u'1'},
  u'id': u'1268299734',
  u'pos': [28.5678604, 77.1988448],
  u'type': u'node'},
 {u'_id': ObjectId('57cd648ab366e2ec273ede2b'),
  u'address': {u'city': u'New Delhi',
   u'country': u'IN',
   u'housenumber': u'271',
   u'state': u'NCR',
   u'street': u'NIT'},
  u'created': {u'changeset': u'8017112',
   u'timestamp': u'2011-05-01T00:30:49Z',
   u'uid': u'5456',
   u'user': u'H_S_Rai',
   u'version': u'1'},
  u'id': u'1268299735',
  u'pos': [28.5682669, 77.1988197],
  u'type': u'node'},
 {u'_id': ObjectId('57cd648ab366e2ec273ede2c'),
  u'address': {u'city': u'New Delhi',
   u'country': u'IN',
   u'housenumber': u'265',
   u'state': u'NCR',
   u'street': u'NIT'},
  u'created': {u'changeset': u'8017112',
   u'timestamp': u'2011-05-01T00:30:48Z',
   u'uid': u'5456',
   u'user': u'H_S_Rai',
   u'version': u'1'},
  u'id': u'1268299730',
  u'pos': [28.5687546, 77.1987895],
  u'type': u'node'},
 {u'_id': ObjectId('57cd648ab366e2ec273ede2d'),
  u'address': {u'city': u'New Delhi',
   u'country': u'IN',
   u'housenumber': u'266',
   u'state': u'NCR',
   u'street': u'NIT'},
  u'created': {u'changeset': u'8017112',
   u'timestamp': u'2011-05-01T00:30:49Z',
   u'uid': u'5456',
   u'user': u'H_S_Rai',
   u'version': u'1'},
  u'id': u'1268299736',
  u'pos': [28.5686733, 77.1987946],
  u'type': u'node'},
 {u'_id': ObjectId('57cd648ab366e2ec273ede2e'),
  u'address': {u'city': u'New Delhi',
   u'country': u'IN',
   u'housenumber': u'264',
   u'state': u'NCR',
   u'street': u'NIT'},
  u'created': {u'changeset': u'8017112',
   u'timestamp': u'2011-05-01T00:30:49Z',
   u'uid': u'5456',
   u'user': u'H_S_Rai',
   u'version': u'1'},
  u'id': u'1268299739',
  u'pos': [28.5688359, 77.1987845],
  u'type': u'node'},
 {u'_id': ObjectId('57cd648ab366e2ec273ede2f'),
  u'address': {u'city': u'New Delhi',
   u'country': u'IN',
   u'housenumber': u'277',
   u'state': u'NCR',
   u'street': u'NIT'},
  u'created': {u'changeset': u'8017112',
   u'timestamp': u'2011-05-01T00:30:49Z',
   u'uid': u'5456',
   u'user': u'H_S_Rai',
   u'version': u'1'},
  u'id': u'1268299740',
  u'pos': [28.5677791, 77.1988498],
  u'type': u'node'},
 {u'_id': ObjectId('57cd648ab366e2ec273ede30'),
  u'address': {u'city': u'New Delhi',
   u'country': u'IN',
   u'housenumber': u'272',
   u'state': u'NCR',
   u'street': u'NIT'},
  u'created': {u'changeset': u'8017112',
   u'timestamp': u'2011-05-01T00:30:49Z',
   u'uid': u'5456',
   u'user': u'H_S_Rai',
   u'version': u'1'},
  u'id': u'1268299741',
  u'pos': [28.5681856, 77.1988247],
  u'type': u'node'},
 {u'_id': ObjectId('57cd648ab366e2ec273ede33'),
  u'address': {u'city': u'New Delhi',
   u'country': u'IN',
   u'housenumber': u'269',
   u'state': u'NCR',
   u'street': u'NIT'},
  u'created': {u'changeset': u'8017112',
   u'timestamp': u'2011-05-01T00:30:49Z',
   u'uid': u'5456',
   u'user': u'H_S_Rai',
   u'version': u'1'},
  u'id': u'1268299738',
  u'pos': [28.5684294, 77.1988096],
  u'type': u'node'},
 {u'_id': ObjectId('57cd648ab366e2ec273ede34'),
  u'address': {u'city': u'New Delhi',
   u'country': u'IN',
   u'housenumber': u'267',
   u'state': u'NCR',
   u'street': u'NIT'},
  u'created': {u'changeset': u'8017112',
   u'timestamp': u'2011-05-01T00:30:50Z',
   u'uid': u'5456',
   u'user': u'H_S_Rai',
   u'version': u'1'},
  u'id': u'1268299742',
  u'pos': [28.568592, 77.1987996],
  u'type': u'node'},
 {u'_id': ObjectId('57cd648ab366e2ec273ede35'),
  u'address': {u'city': u'New Delhi',
   u'country': u'IN',
   u'housenumber': u'145',
   u'state': u'NCR',
   u'street': u'NIT'},
  u'created': {u'changeset': u'8017125',
   u'timestamp': u'2011-05-01T00:35:53Z',
   u'uid': u'5456',
   u'user': u'H_S_Rai',
   u'version': u'1'},
  u'id': u'1268300229',
  u'pos': [28.5675465, 77.1960276],
  u'type': u'node'},
 {u'_id': ObjectId('57cd648ab366e2ec273ede37'),
  u'address': {u'city': u'New Delhi',
   u'country': u'IN',
   u'housenumber': u'274',
   u'state': u'NCR',
   u'street': u'NIT'},
  u'created': {u'changeset': u'8017112',
   u'timestamp': u'2011-05-01T00:30:49Z',
   u'uid': u'5456',
   u'user': u'H_S_Rai',
   u'version': u'1'},
  u'id': u'1268299737',
  u'pos': [28.568023, 77.1988348],
  u'type': u'node'},
 {u'_id': ObjectId('57cd648ab366e2ec273ede3b'),
  u'address': {u'city': u'New Delhi',
   u'country': u'IN',
   u'housenumber': u'118',
   u'state': u'NCR',
   u'street': u'NIT'},
  u'created': {u'changeset': u'8017132',
   u'timestamp': u'2011-05-01T00:38:22Z',
   u'uid': u'5456',
   u'user': u'H_S_Rai',
   u'version': u'1'},
  u'id': u'1268300460',
  u'pos': [28.5668484, 77.1990527],
  u'type': u'node'},
 {u'_id': ObjectId('57cd648ab366e2ec273ede3d'),
  u'address': {u'city': u'New Delhi',
   u'country': u'IN',
   u'housenumber': u'282',
   u'state': u'NCR',
   u'street': u'NIT'},
  u'created': {u'changeset': u'8017132',
   u'timestamp': u'2011-05-01T00:38:22Z',
   u'uid': u'5456',
   u'user': u'H_S_Rai',
   u'version': u'1'},
  u'id': u'1268300463',
  u'pos': [28.5668455, 77.1988184],
  u'type': u'node'}]

In [90]:
len(ncr_cases)


Out[90]:
20

This is a small number of cases, so they were probably all added by the same user.


In [93]:
set(element['created']['user'] for element in ncr_cases)


Out[93]:
{u'H_S_Rai'}

My guess was right. Looking at the data, all the cases are in New Delhi, so I can map NCR to DL for these.

So to clean this data I can map

  • [u'Delhi', u'DL', u'NCR'] => u'DL'
  • [u'UP', u'uttar pradesh', u'U.P.'] => u'UP'
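
The mapping above can be sketched as a case-insensitive lookup. This is a minimal version, assuming the raw values are the ones listed in the bullets; the helper name `normalise_state` is made up, and mapping NCR to DL is only safe here because all 20 NCR cases turned out to be in New Delhi.

```python
# Keys are lowercased raw values; NCR -> DL is specific to this dataset.
STATE_MAPPING = {
    'delhi': 'DL',
    'dl': 'DL',
    'ncr': 'DL',
    'up': 'UP',
    'uttar pradesh': 'UP',
    'u.p.': 'UP',
}


def normalise_state(value):
    """Return the two-letter code for a known state, else the value as-is."""
    return STATE_MAPPING.get(value.strip().lower(), value)


cleaned = [normalise_state(v) for v in ['Delhi', 'U.P.', 'NCR', 'Haryana']]
# cleaned == ['DL', 'UP', 'DL', 'Haryana']
```

Unknown values are returned unchanged so that they still show up in a later `distinct` query and can be reviewed by hand.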

City


In [82]:
collection.distinct("address.city")


Out[82]:
[u'Gurgaon',
 u'Delhi',
 u'Sohna Road',
 u'Ghaziabad',
 u'Noida',
 u'Shahbad Daulatpur, Delhi',
 u'Meerut',
 u'Pandav Nagar, New Delhi',
 u'Gurgaon, Haryana',
 u'Faridabad',
 u'Uttar Pradesh',
 u'Pitam Pura, New Delhi',
 u'Noida , Uttar Pradesh',
 u'Hira Colony, Siraspur, Delhi',
 u'Pratap Colony, Siraspur, Delhi',
 u'Sector - 10, Rohini,, Delhi',
 u'Sector-11, Rohini, Delhi',
 u'Sector - 12, Rohini, Delhi',
 u'Libaspur, Delhi',
 u'Alipur, Delhi',
 u'Siraspur, Delhi',
 u'Mukhrejee Nagar, Delhi',
 u'Naya Band, Khera',
 u'Sector - 5, Rohini, Delhi',
 u'West Karawal Nagar, New Delhi',
 u'Noida (U.P)',
 u'Muradnagar',
 u'Dwarka',
 u'ad',
 u'Indirapuram',
 u'Nueva Delhi',
 u'Bahadurgarh',
 u'Bijwasan',
 u'Khekra',
 u'Badli Industrial Area, Badli, Delhi',
 u'Rohini, Delhi',
 u'Sector - 15, Rohini, Delhi',
 u'Ghaziabazd',
 u'Austin',
 u'Sahibabad, Ghaziabad',
 u'Janakpuri',
 u'Sector- 10, Rohini, Delhi',
 u'Sector - 17, Rohini, Delhi',
 u'Sector - 28, Rohini, Delhi',
 u'Rohini Delhi',
 u'Sector - 11, Rohini, Delhi',
 u'Chanakyapuri, New Delhi',
 u'Gaziabad',
 u'Dadri',
 u'Sector 32 (Pi 1) Greater Noida',
 u'Ghaziabad, UP, India',
 u'Village- Barola, Sector- 49, Noida',
 u'Sohna-Gurgaon',
 u'Damdama']

I was expecting that, as this is a map of Delhi, the city would be one of Delhi, Gurgaon and Faridabad, perhaps with differences in spelling and case.

But this data needs to be cleaned. There are sector names, area names, state names, etc. which should not be there.
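
One possible way to clean values like these, sketched under my own assumptions: split the value on commas and look for the right-most part that is a known city. The mapping below is a small made-up excerpt, and `normalise_city` is a hypothetical helper rather than the function used for the load above. It deliberately leaves anything it does not recognise untouched.

```python
# Small illustrative excerpt; keys are lowercased city spellings.
CITY_MAPPING = {
    'delhi': 'Delhi',
    'new delhi': 'Delhi',
    'gurgaon': 'Gurgaon',
    'noida': 'Noida',
    'ghaziabad': 'Ghaziabad',
    'faridabad': 'Faridabad',
}


def normalise_city(value):
    """Return the right-most known city in a comma-separated value."""
    parts = [p.strip() for p in value.split(',') if p.strip()]
    for part in reversed(parts):
        city = CITY_MAPPING.get(part.lower())
        if city:
            return city
    return value  # unknown: keep as-is for manual review


raw = ['Sector - 10, Rohini,, Delhi', 'Gurgaon, Haryana',
       'Pandav Nagar, New Delhi', 'Ghaziabad, UP, India', 'Dwarka']
cleaned = [normalise_city(v) for v in raw]
# cleaned == ['Delhi', 'Gurgaon', 'Delhi', 'Ghaziabad', 'Dwarka']
```

Values such as 'Noida (U.P)' or 'Rohini Delhi' have no comma and would pass through unchanged, so they would still need their own rules.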


Street


In [83]:
collection.distinct("address.street")


Out[83]:
[u'Block A1',
 u'Old Delhi Gurgaon Road',
 u'Sujan Singh Park, Subramania Bharti Marg,Behind Khan Market',
 u'Abul Fazal Road',
 u'Aurangzeb Road',
 u'Bhavani Kunj, Vasant Kunj',
 u'Block B',
 u'South City 2',
 u'Sushant Lok',
 u'Block A',
 u'DDA Flats, Munirka',
 u'Sector 46',
 u'South City II',
 u'5a , Ansari Road',
 u'Sector 17C',
 u'Pamposh Road',
 u'h/no 1/55 sadar bazar delhi cantt 10',
 u'Main Bazaar',
 u'S1',
 u'Palam Vihar',
 u'SDF',
 u'Vinay Marg',
 u'Neeti Khand III',
 u'Sector 65',
 u'janakpuri',
 u'DCE College',
 u'NIT',
 u'Panchsheel Colony',
 u'swami Narayan Marg, Ashok vihar',
 u'Najafgarh Road (Next to the Tilak Nagar Police Station)',
 u'Ajmal Khan Road, Karol Bagh',
 u'Hudson Lines, Kingsway Camp',
 u'Janpath',
 u'Sham Nath Marg',
 u'Golf course',
 u'H33, Bali Nagar',
 u'Dwarka',
 u'chaudhary fateh singh marg',
 u'649-6th Floor, Tower A Spaze iTechPark',
 u'Arakashan Road',
 u'Lohia Nagar',
 u'Sushant Lok 2',
 u'Chander Nagar',
 u'Gali No. 1',
 u'Saraswati Vihar',
 u'Sector 16 NOIDA',
 u'UDSC',
 u'170, Phase-1 Udyog Vihar',
 u'I.P.Extension',
 u'rail vihar',
 u'G-7 Sector-16, Rohini,Delhi',
 u'Windmill Place',
 u'shastri nagar',
 u'Bahadur shah Zafar Marg',
 u'Dr. Bishamber das marg',
 u'gali no 5',
 u'gali number 9',
 u'W-1 Lane',
 u'Pankha Road',
 u'40,arunodaya appts vikaspuri',
 u'Mohan Nagar Link Road',
 u'Lodi Road',
 u'MVL Coral, Alwar Bypass Road,',
 u'sector 11, block- p',
 u'Sector 12',
 u'sector 11',
 u'Green Street Society',
 u'Ahuja Sons Shalwale, Karol Bagh',
 u'26 A, 2nd Floor Hauz Khas Village Road',
 u'moti',
 u'sant kabir daas',
 u'Shastri Nagar',
 u'DLF Phase 4',
 u'delhi',
 u'Safdarjung Enclave',
 u'Chuna Mandi, Paharganj',
 u'Sector-63',
 u'Greater Noida',
 u'Shipra Sun City Road',
 u'south city 1',
 u'sai nagar 4',
 u'Lodhi Estate',
 u'dilshad garden',
 u'DLF Phase 2',
 u'sector 120',
 u'Naoroji Nagar Market',
 u'Hauz Khas Village Road',
 u'Khan Market',
 u'South Extension',
 u'Gandhi Vihar',
 u'110/105',
 u'Sec-8',
 u'Lodhi Road',
 u'new friends colony',
 u'South End Road',
 u'Chattarpur Main Road',
 u'Greater Kailash 1',
 u'sector 37, Faridabad',
 u'Vasundhara Enclave',
 u'noida',
 u'Udyog Vihar Phase -1',
 u'shankar vihar',
 u'ansari road',
 u'Sector 27A',
 u'Karol Bagh',
 u'Vaishali Road',
 u'Carterpuri Village',
 u'Defence Colony Market',
 u'Sanjay Nagar',
 u'Ansal Majestic Tower, Vikaspuri',
 u'Noida',
 u'Complex Pitampura',
 u'Amrita Shergil Marg',
 u'Amrita Shergil Lane',
 u'AP Market, shop no.41, Maurya Enclave',
 u'Netaji Subhash Chandra Bose Road',
 u'Sector 43',
 u'Radial Road 6',
 u'B-Block, Sector 63',
 u'Sector-29, Near Vyapar Kendra',
 u'gurgaon',
 u'Vikas Marg',
 u'Ganga Shopping Complex',
 u'Sector-64,',
 u'Gali Chandi Wali',
 u'udhyog vihar, phase - 2, Gurgaon, Haryana',
 u'Under Hill Lane, Civil Lines',
 u'Ansari Nagar',
 u'Sadhbhawna Marg',
 u'prakash mohalla',
 u'TughlakabadNew Delhi-110044 ',
 u'Sector \u2013 39',
 u'HPR School Main Road, Hira Colony, Siraspur, Delhi',
 u'HPR School Main Road',
 u'Siraspur Khera Road',
 u'Pratap Gali',
 u'Lower Ground Floor, Sector 45',
 u'Sardhanand College',
 u'Sector 11 Dwarka',
 u'East Gorakh Park, Shahdara',
 u'Gali No 12',
 u'Building 5, DLF City II, Sector 25',
 u'Asaf Ali Road',
 u'Mathura Road',
 u'Wazirabad Road',
 u'Loni Road',
 u'Birbal Road',
 u'Central Market, Sector 50',
 u'Kherli Hafizpur, Noida (U.P)',
 u'Kherli Hafizpur, UP',
 u'Nithnri Road',
 u'Shiv Nagar',
 u'Khanpur',
 u'Sarita Vihar',
 u'MAMURA Road',
 u'Meera Bagh',
 u'Hanuman Market Harola',
 u'Kendriya Vihar',
 u'Mayur Vihar',
 u'Dadri Road',
 u'Shree Balaji Complex',
 u'Pritampura',
 u'Sector NO. 22, nr A- 126',
 u'Sector 10',
 u'Mint Market Nankpura, Nr Moti Bagh',
 u'Bhangel',
 u'Vaishali Express Green Society',
 u'Purani Tanki, Bhangel',
 u'Dadari Road',
 u'G.B.Nagar, Bharat Ghar',
 u'Accher',
 u'Khora colony',
 u'Mamura, Gali No 5',
 u'Jalebi Chowk',
 u'Sector 47',
 u'Sector 24',
 u'Chandni Chowk',
 u'Swaroop Park',
 u'Khari Baoli Road',
 u'GT Road',
 u'Ansal majestic tower, Vikaspuri',
 u'Deshbandhu Gupta Road',
 u'Sec - 19, Poket 3, Dwarka',
 u'Bhairon Marg',
 u'Jawaharlal Nehru University',
 u'Nyaya Marg',
 u'Sector-18, Rohini',
 u'Outer Ring Road',
 u'Building Materials Market, Ecotech-II, Udyog Vihar',
 u'Hauz Rani',
 u'Road Number 59',
 u'Lothian Road, Kashmere Gate',
 u'Indra',
 u'Indra Vihar, north delhi',
 u'Batra, north delhi',
 u'Sohna Road',
 u'Choudhary Hukum Chand Marg',
 u'Sector 44',
 u'Service Road',
 u'Naya Bans, Sector 15',
 u'147',
 u'Asif Ali Road',
 u'P.O. Hasanpur Tauru, Mewat',
 u'Rama Road, Industrial Area',
 u'Anand Lok',
 u'Gali 13',
 u'A Block',
 u'sdf',
 u'Madan Mohan Malviya Marg',
 u'Dr. Sushila Naiyar Marg',
 u'Builder Area',
 u'Jal Vayu Vihar, Plot-8, Pocket-4, Builder area',
 u'August Kranti Marg',
 u'Rao Tula Ram Marg',
 u'Prithviraj Marg',
 u'Lohia nagar market, Ghaziabad 201001, up',
 u'Block G, Patel Nagar 3, Ghaziabad 201001, UP',
 u'Nabi Karim Road',
 u'Have Khas, New Delhi,  Delhi 110016',
 u'B block, lohia nagar, Ghaziabad',
 u'Sector 18',
 u'Qutab Road',
 u'24',
 u'23',
 u'Sector 22',
 u'ambedkar road',
 u'Modern Industrial Estate, Part-A',
 u'lohia nagar, Ghaziabad - 201001',
 u'Pocket-C Sidhartha Extension',
 u'Kirti Nagar',
 u'Sector Phi-02',
 u'Deshbandu Gupta Road',
 u'Connaught Place Inner Circle',
 u'Golf Course Extension Road',
 u'Golf Course Road',
 u'G-5, Sector-16, Rohini, New Delhi-110089',
 u'Sector 23A',
 u'Sector-18',
 u'100 Feet Road',
 u'Surajkund Road',
 u'Saidulajab',
 u'Westend Marg',
 u'GK-1',
 u'b-1, green wood city, sector -45',
 u'Abhay Khand II, Indirapuram ',
 u'Sector 32',
 u'rezang la marg',
 u'U-13',
 u'Guru Ravidas Mg',
 u'Ring Road',
 u'Bada Bazaar',
 u'Bada Bazar',
 u'Chota Bazar',
 u'Ashok Vihar phase-2',
 u'Sansanwal Marg',
 u'Arya Samaj Rd',
 u'Aruna Asif Ali Marg',
 u'Maulana Azad Road',
 u'Lothian Road, Kashmere gate',
 u'Sector 58',
 u'Sector 52',
 u'Phase 1 Ashok Vihar Rd, Delhi',
 u'Ashok Vihar Road',
 u'SECTOR-11',
 u'Sector - 11,',
 u'Satsang Vihar Marg',
 u'Tibetan New Camp',
 u'Sector 6',
 u'DLF City II, Sector 25',
 u'HSIIDC Industrial Estate',
 u'Sector49',
 u'Press Enclave Marg',
 u'Sector 13',
 u'Rajinder Nagar',
 u'Sector 7',
 u'Sector 56',
 u'Adhyatmik Nagar',
 u'Aurobindo Marg',
 u'Kasturba Gandhi Marg',
 u'Tees January Marg',
 u'Prachin Shiv Mandir Road',
 u'Okhla Industrial Area Phase 3',
 u'2nd Cross Ave Road',
 u'Sector 29',
 u'Suite 100',
 u'Sarita Vihar Institutional Area',
 u'Faridabad NH-IV',
 u'Nelson Mandela Marg',
 u'G T Road',
 u'Nelson Mandela Road',
 u'Indraprastha Marg',
 u'Mahatma Gandhi Marg',
 u'Kaushambu',
 u'Sector 45',
 u'Old Faridabad-Jasana Road',
 u'Sector 10, Rohini Twin District Centre',
 u'Bakshi Marg',
 u'Okhla Phase 3',
 u'Mehrauli Gurgaon Road',
 u'Ring road',
 u'C Block, Janakpuri',
 u'DLF Cyber City Phase III',
 u'Plot 9, Jasola District Centre',
 u'Marg 22',
 u'Sector 16',
 u'F Block, Sanjay Nagar',
 u'Govindpuram',
 u'Ramakrishna Ashram Marg',
 u'Arya Samaj Road',
 u'Vaibhav Khand, Indirapuram',
 u'Prithviraj Road',
 u'Benito Juarez Road',
 u'Sector 82',
 u'Sector-58',
 u'Hailey Road',
 u'Modern Industrial Estate, Part-B',
 u'Rajpur Road',
 u'Gharoli Road, Mayur Vihar Phase - 3',
 u'Subroto Park',
 u'Lane K, RBI Staff Quarters, Sarojini Nagar',
 u'Lane E, Sarojini Nagar',
 u'Lawrence Road',
 u'Bahadur Shah Zafar Marg',
 u'Barakhamba Road',
 u'Mansingh Road',
 u'Sahakarita Marg',
 u'Africa Avenue Marg',
 u'Knowledge Park 3',
 u'Knowledge Park III',
 u'Panchgaon, Manesar',
 u'Bhagwan Mahavir Marg',
 u'Sector - 11',
 u'Sector - 12',
 u'HPR School Back Road',
 u'Siraspur Firni',
 u'Siraspur Samaypur Road',
 u'Hauz Khas',
 u'Shanti Path',
 u'Kherli Hafizpur, Noida (UP)',
 u'Village Chhalera & Sadarpur, Sadarpur, Sector 45, Noida,',
 u'Anarwali Masjid, Block J, New Delhi.',
 u'Hayat Nagar, Khoda Colony, Gaziabad',
 u'Arya Mandir Samaj, Railway Road',
 u'Mayur Vihar, Ganesh Temple, 92, Pocket D, Phase 2, Model Town',
 u'Vinod Nagar West, New Delhi, East Delhi',
 u'I-64, Laxmi Nagar, New Delhi',
 u'Sector 19',
 u'Kalesh Complex, Pandav Nagar',
 u'Rohini sector 13',
 u'Sector 12, Noida, Gautam Buddh Nagar',
 u'Mayur Vihar Phase III New Delhi',
 u'Sunder Vihar',
 u'I. P. Extension',
 u'Block B, Ashok Nagar Extension, New Ashok Nagar',
 u'Main Rd, Block B, Nanakpura, Shakarpur',
 u'New Ashok Nagar Rd, Block B, New Ashok Nagar',
 u'Sector 27',
 u'Tilak Nagar Round About, Ashok Nagar',
 u'Sector 5, Harola',
 u'Pitampura',
 u'Shivaji Road',
 u'Sector 50',
 u'Block B, Ashok Nagar Extension',
 u'New Krishna Park, Najafgarh Road',
 u'Street C, Munirka DDA Flats',
 u'Street E, Munirka DDA Flats',
 u'Abdul Gaffar Khan Marg',
 u'Sector - 3',
 u'Ecotech-II, Udyog Vihar',
 u'Vasundhara',
 u'Greater Noida Expressway',
 u'G C Narang Marg',
 u'Benito Juarez Marg',
 u'Pandara Road',
 u'lothian road, Kashmere gate',
 u'indra vihar, mukherji nagar, north delhi',
 u'Sector 30',
 u"Lawyer's Street, Green Park (Main)",
 u'L Block, Anand Vihar, Hari Nagar',
 u'Mehrauli Badarpur Road',
 u'Sector 16 A',
 u'Dadri Road, Surajpur',
 u'Sector-23',
 u'Gail Society',
 u'Sector 2',
 u'Greater Kailash 2',
 u'sai hotel',
 u'Sec 62',
 u'Sector-21A',
 u'sector 37',
 u'Sector 86 Faridabad',
 u'Lala Rewti Wali Gali',
 u'Purana Bazaar',
 u'Shiv Mandir Wali Gali',
 u'sunaro wali gali, surajpur',
 u'Purana Bazaar, Surajpur',
 u'Lakhnawali Road, Surajpur',
 u'barahi road, surajpur',
 u'Sector 5A, Chiranjeev Vihar',
 u'Chiranjeev Vihar',
 u'Hauz Khas Enclave',
 u'Basant lane',
 u'Sector 62',
 u'Brahm Colony',
 u'Chaudhary Dalip Singh Marg',
 u'Hauzkhas Enclave',
 u'Alaknanda Road',
 u'lohia nagar',
 u'P.O. Box Dhaula, Karanki Road',
 u'Near Hari Masjid, Jogabai Extn, Okhla',
 u'Near An-noor Masjid, JOgabai Extn',
 u'Jogabai Extn, Okhla',
 u'Sidhartha Extension',
 u'Vardhman Times Plaza, Plot 13, DDA Community Centre, Road 44, Pitampura',
 u'SCOPE Complex',
 u'Sector- CHI-4',
 u'Off Sohna-Gurgaon Road',
 u'Opposite Shastri Park, Near Zero Pusta',
 u'Sector-62',
 u'Dr. Mukherjee Nagar, Near Batra Cinema',
 u'Dr. Mukherjee Nagar']
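
The street values above spell the same sector in several ways ('Sector-18', 'SECTOR-11', 'Sector - 11,', 'Sector49', 'Sec 62', 'sector 37'). A minimal sketch of one way to collapse these into a single form is shown below. The function name `normalize_sector` and the target form 'Sector <n>' are assumptions made for illustration, not part of the original notebook. Values that carry extra address information (for example 'Sector-18, Rohini') are deliberately left unchanged.

```python
import re

# En dash, which appears in values such as u'Sector \u2013 39'.
EN_DASH = u'\u2013'

# Matches a value that is ONLY a sector reference: "Sector"/"Sec" in any case,
# optional separators (spaces, "-", ".", en dash), an optional "No.",
# the sector number, an optional single-letter suffix and a trailing comma.
SECTOR_RE = re.compile(
    r'^\s*sec(?:tor)?[\s\-.' + EN_DASH + r']*(?:no\.?\s*)?'
    r'(\d+)\s*([A-Za-z])?\s*,?\s*$',
    re.IGNORECASE | re.UNICODE,
)


def normalize_sector(street):
    """Return u'Sector <n>' for bare sector values; return anything else as-is."""
    match = SECTOR_RE.match(street)
    if match is None:
        return street
    number, suffix = match.groups()
    return u'Sector {0}{1}'.format(number, (suffix or u'').upper())


if __name__ == '__main__':
    for raw in [u'Sector-18', u'SECTOR-11', u'Sector - 11,', u'Sector49',
                u'Sec 62', u'Sector-18, Rohini']:
        print(repr(raw), '->', repr(normalize_sector(raw)))
```

Note that this sketch also merges 'Sector 16 A' into 'Sector 16A'; whether that is desirable depends on how the sectors are named locally.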

In [85]:
aggregate_distincts("address.country")


[{u'_id': u'IN', u'count': 4061266}]

In [13]:
aggregate_distincts("address.state")


[{u'_id': u'DL', u'count': 377},
 {u'_id': u'HR', u'count': 293},
 {u'_id': u'UP', u'count': 113}]

In [7]:
aggregate_distincts("address.city", True)


[{u'_id': u'Delhi', u'count': 377},
 {u'_id': u'Gurgaon', u'count': 274},
 {u'_id': u'Noida', u'count': 87},
 {u'_id': u'Ghaziabad', u'count': 21},
 {u'_id': u'Hira Colony, Siraspur, Delhi', u'count': 17},
 {u'_id': u'Faridabad', u'count': 16},
 {u'_id': u'Sector - 11, Rohini, Delhi', u'count': 15},
 {u'_id': u'Pandav Nagar, New Delhi', u'count': 14},
 {u'_id': u'Sohna Road', u'count': 12},
 {u'_id': u'Indirapuram', u'count': 7}]

In [8]:
aggregate_distincts("address.street", True)


[{u'_id': u'Sector 46', u'count': 142},
 {u'_id': u'Palam Vihar', u'count': 68},
 {u'_id': u'S1', u'count': 56},
 {u'_id': u'Block A1', u'count': 53},
 {u'_id': u'Sector 6', u'count': 44},
 {u'_id': u'NIT', u'count': 20},
 {u'_id': u'Bhavani Kunj, Vasant Kunj', u'count': 17},
 {u'_id': u'Sector 17C', u'count': 16},
 {u'_id': u'Barakhamba Road', u'count': 12},
 {u'_id': u'Jawaharlal Nehru University', u'count': 12}]

In [15]:
# Inspect the documents whose city field holds a full locality string
aggregate_to_list(collection, [
    {"$match": {"address.city": "Hira Colony, Siraspur, Delhi"}}
])


Out[15]:
[{u'_id': ObjectId('57d044e39ab88bb63256dc26'),
  u'address': {u'city': u'Hira Colony, Siraspur, Delhi',
   u'postcode': u'110042',
   u'street': u'HPR School Main Road, Hira Colony, Siraspur, Delhi'},
  u'created': {u'changeset': u'25004661',
   u'timestamp': u'2014-08-25T14:11:58Z',
   u'uid': u'2249730',
   u'user': u"zezo's frnd",
   u'version': u'4'},
  u'id': u'3008663907',
  u'pos': [28.7551761, 77.1344924],
  u'type': u'node'},
 {u'_id': ObjectId('57d044e39ab88bb63256dc28'),
  u'address': {u'city': u'Hira Colony, Siraspur, Delhi',
   u'housenumber': u'Near RWA Office, Hira Colony, Siraspur',
   u'postcode': u'110042',
   u'street': u'HPR School Main Road'},
  u'created': {u'changeset': u'25004654',
   u'timestamp': u'2014-08-25T14:11:26Z',
   u'uid': u'2249730',
   u'user': u"zezo's frnd",
   u'version': u'3'},
  u'id': u'3008661514',
  u'pos': [28.7553062, 77.1343328],
  u'type': u'node'},
 {u'_id': ObjectId('57d044e49ab88bb63256e541'),
  u'address': {u'city': u'Hira Colony, Siraspur, Delhi',
   u'postcode': u'110042',
   u'street': u'HPR School Main Road, Hira Colony, Siraspur, Delhi'},
  u'created': {u'changeset': u'25004639',
   u'timestamp': u'2014-08-25T14:10:54Z',
   u'uid': u'2249730',
   u'user': u"zezo's frnd",
   u'version': u'3'},
  u'id': u'3012222965',
  u'pos': [28.7552526, 77.1335128],
  u'type': u'node'},
 {u'_id': ObjectId('57d044e49ab88bb63256e548'),
  u'address': {u'city': u'Hira Colony, Siraspur, Delhi',
   u'postcode': u'110042'},
  u'created': {u'changeset': u'24700214',
   u'timestamp': u'2014-08-12T12:12:08Z',
   u'uid': u'2249730',
   u'user': u"zezo's frnd",
   u'version': u'1'},
  u'id': u'3012225252',
  u'pos': [28.7565713, 77.1325005],
  u'type': u'node'},
 {u'_id': ObjectId('57d045459ab88bb63284a351'),
  u'address': {u'city': u'Hira Colony, Siraspur, Delhi',
   u'postcode': u'110042',
   u'street': u'HPR School Main Road'},
  u'created': {u'changeset': u'25005498',
   u'timestamp': u'2014-08-25T14:45:14Z',
   u'uid': u'2249730',
   u'user': u"zezo's frnd",
   u'version': u'7'},
  u'id': u'297028433',
  u'node_refs': [u'3008677439',
   u'3041512633',
   u'3008677440',
   u'3008677441',
   u'3008677442',
   u'3008677443',
   u'3008677444',
   u'3008677445',
   u'3008677446',
   u'3014096551',
   u'3041490405',
   u'3041482946',
   u'3008677448',
   u'3041457749',
   u'3008677439'],
  u'type': u'way'},
 {u'_id': ObjectId('57d045459ab88bb63284a826'),
  u'address': {u'city': u'Hira Colony, Siraspur, Delhi',
   u'postcode': u'110042',
   u'street': u'HPR School Main Road'},
  u'created': {u'changeset': u'25005071',
   u'timestamp': u'2014-08-25T14:29:27Z',
   u'uid': u'2249730',
   u'user': u"zezo's frnd",
   u'version': u'1'},
  u'id': u'300042333',
  u'node_refs': [u'3041459869',
   u'3041459870',
   u'3041459871',
   u'3041459872',
   u'3041459869'],
  u'type': u'way'},
 {u'_id': ObjectId('57d045459ab88bb63284a827'),
  u'address': {u'city': u'Hira Colony, Siraspur, Delhi',
   u'postcode': u'110042',
   u'street': u'HPR School Main Road'},
  u'created': {u'changeset': u'25005096',
   u'timestamp': u'2014-08-25T14:30:21Z',
   u'uid': u'2249730',
   u'user': u"zezo's frnd",
   u'version': u'1'},
  u'id': u'300042476',
  u'node_refs': [u'3041467430',
   u'3041467431',
   u'3041467432',
   u'3041468633',
   u'3041467430'],
  u'type': u'way'},
 {u'_id': ObjectId('57d045459ab88bb63284a829'),
  u'address': {u'city': u'Hira Colony, Siraspur, Delhi',
   u'postcode': u'110042',
   u'street': u'HPR School Main Road'},
  u'created': {u'changeset': u'25005463',
   u'timestamp': u'2014-08-25T14:43:42Z',
   u'uid': u'2249730',
   u'user': u"zezo's frnd",
   u'version': u'1'},
  u'id': u'300047927',
  u'node_refs': [u'3041457747',
   u'3041509293',
   u'3041509294',
   u'3041509295',
   u'3041457747'],
  u'type': u'way'},
 {u'_id': ObjectId('57d045459ab88bb63284a82a'),
  u'address': {u'city': u'Hira Colony, Siraspur, Delhi',
   u'postcode': u'110042'},
  u'created': {u'changeset': u'25005540',
   u'timestamp': u'2014-08-25T14:47:04Z',
   u'uid': u'2249730',
   u'user': u"zezo's frnd",
   u'version': u'1'},
  u'id': u'300048194',
  u'node_refs': [u'3041511123',
   u'3041511124',
   u'3041457747',
   u'3041457746',
   u'3041511123'],
  u'type': u'way'},
 {u'_id': ObjectId('57d045459ab88bb63284a82d'),
  u'address': {u'city': u'Hira Colony, Siraspur, Delhi',
   u'postcode': u'110042',
   u'street': u'HPR School Back Road'},
  u'created': {u'changeset': u'25005869',
   u'timestamp': u'2014-08-25T15:00:53Z',
   u'uid': u'2249730',
   u'user': u"zezo's frnd",
   u'version': u'1'},
  u'id': u'300051299',
  u'node_refs': [u'3041541719',
   u'3041542317',
   u'3041542318',
   u'3041541720',
   u'3041541719'],
  u'type': u'way'},
 {u'_id': ObjectId('57d045459ab88bb63284a82e'),
  u'address': {u'city': u'Hira Colony, Siraspur, Delhi',
   u'postcode': u'110042',
   u'street': u'HPR School Back Road'},
  u'created': {u'changeset': u'25005869',
   u'timestamp': u'2014-08-25T15:00:53Z',
   u'uid': u'2249730',
   u'user': u"zezo's frnd",
   u'version': u'1'},
  u'id': u'300051300',
  u'node_refs': [u'3041542317',
   u'3041542319',
   u'3041542320',
   u'3041542318',
   u'3041542317'],
  u'type': u'way'},
 {u'_id': ObjectId('57d045459ab88bb63284a82f'),
  u'address': {u'city': u'Hira Colony, Siraspur, Delhi',
   u'postcode': u'110042',
   u'street': u'HPR School Back Road'},
  u'created': {u'changeset': u'25005888',
   u'timestamp': u'2014-08-25T15:01:48Z',
   u'uid': u'2249730',
   u'user': u"zezo's frnd",
   u'version': u'1'},
  u'id': u'300051343',
  u'node_refs': [u'3041542838',
   u'3041542839',
   u'3041542840',
   u'3041542841',
   u'3041542838'],
  u'type': u'way'},
 {u'_id': ObjectId('57d045459ab88bb63284a830'),
  u'address': {u'city': u'Hira Colony, Siraspur, Delhi',
   u'postcode': u'110042',
   u'street': u'Siraspur Firni'},
  u'created': {u'changeset': u'25005962',
   u'timestamp': u'2014-08-25T15:05:09Z',
   u'uid': u'2249730',
   u'user': u"zezo's frnd",
   u'version': u'1'},
  u'id': u'300051526',
  u'node_refs': [u'3041547958',
   u'3041547959',
   u'3041547960',
   u'3041547961',
   u'3041547962',
   u'3041547963',
   u'3041547964',
   u'3041547965',
   u'3041547966',
   u'3041547967',
   u'3041547968',
   u'3041547969',
   u'3041547970',
   u'3041547971',
   u'3041547972',
   u'3041547973',
   u'3041547958'],
  u'type': u'way'},
 {u'_id': ObjectId('57d045459ab88bb63284a831'),
  u'address': {u'city': u'Hira Colony, Siraspur, Delhi',
   u'postcode': u'110042',
   u'street': u'Siraspur Samaypur Road'},
  u'created': {u'changeset': u'25005937',
   u'timestamp': u'2014-08-25T15:04:08Z',
   u'uid': u'2249730',
   u'user': u"zezo's frnd",
   u'version': u'1'},
  u'id': u'300051465',
  u'node_refs': [u'3041544560',
   u'3041544561',
   u'3041544562',
   u'3041544563',
   u'3041544560'],
  u'type': u'way'},
 {u'_id': ObjectId('57d045459ab88bb63284a833'),
  u'address': {u'city': u'Hira Colony, Siraspur, Delhi',
   u'postcode': u'110042',
   u'street': u'HPR School Main Road'},
  u'created': {u'changeset': u'25005198',
   u'timestamp': u'2014-08-25T14:33:44Z',
   u'uid': u'2249730',
   u'user': u"zezo's frnd",
   u'version': u'2'},
  u'id': u'300042005',
  u'node_refs': [u'3041457746',
   u'3041457747',
   u'3041457748',
   u'3041457749',
   u'3041457746'],
  u'type': u'way'},
 {u'_id': ObjectId('57d045459ab88bb63284a834'),
  u'address': {u'city': u'Hira Colony, Siraspur, Delhi',
   u'postcode': u'110042'},
  u'created': {u'changeset': u'25006005',
   u'timestamp': u'2014-08-25T15:06:39Z',
   u'uid': u'2249730',
   u'user': u"zezo's frnd",
   u'version': u'1'},
  u'id': u'300051820',
  u'node_refs': [u'3041548507',
   u'3041548508',
   u'3041548509',
   u'3041548510',
   u'3041548507'],
  u'type': u'way'},
 {u'_id': ObjectId('57d045459ab88bb63284a835'),
  u'address': {u'city': u'Hira Colony, Siraspur, Delhi',
   u'postcode': u'110042',
   u'street': u'Siraspur Firni'},
  u'created': {u'changeset': u'25005990',
   u'timestamp': u'2014-08-25T15:06:09Z',
   u'uid': u'2249730',
   u'user': u"zezo's frnd",
   u'version': u'1'},
  u'id': u'300051621',
  u'node_refs': [u'3041550533',
   u'3041550534',
   u'3041550535',
   u'3041550536',
   u'3041550537',
   u'3041550538',
   u'3041550539',
   u'3041550540',
   u'3041550541',
   u'3041550542',
   u'3041550543',
   u'3041550544',
   u'3041550533'],
  u'type': u'way'}]
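
In the documents above, the city field holds 'Hira Colony, Siraspur, Delhi', and some street values repeat that same string after the road name. The sketch below shows one possible cleanup. It assumes that the last comma-separated component of the city field is the actual city, which holds for this example but not for every value in the data set (for instance 'Sohna Road'). The helper names `split_city` and `strip_city_suffix` are hypothetical.

```python
def split_city(value):
    """Split u'Locality, Area, City' into (locality, city).

    Assumption: the last comma-separated component is the city.
    Returns (None, city) when there is no locality part and
    (None, None) for an empty value.
    """
    parts = [part.strip() for part in value.split(u',') if part.strip()]
    if not parts:
        return None, None
    city = parts[-1]
    locality = u', '.join(parts[:-1]) or None
    return locality, city


def strip_city_suffix(street, city_field):
    """Drop a trailing u', <city_field>' from a street value, if present."""
    suffix = u', ' + city_field
    if street.endswith(suffix):
        return street[:-len(suffix)]
    return street


if __name__ == '__main__':
    raw_city = u'Hira Colony, Siraspur, Delhi'
    raw_street = u'HPR School Main Road, Hira Colony, Siraspur, Delhi'
    print(split_city(raw_city))
    print(strip_city_suffix(raw_street, raw_city))
```

Applied to the first document, this would leave the street as 'HPR School Main Road', matching the value the other documents in the same area already use.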

In [9]:
aggregate_distincts("amenity", True)


[{u'_id': u'school', u'count': 901},
 {u'_id': u'place_of_worship', u'count': 331},
 {u'_id': u'parking', u'count': 327},
 {u'_id': u'fuel', u'count': 212},
 {u'_id': u'hospital', u'count': 185},
 {u'_id': u'restaurant', u'count': 166},
 {u'_id': u'atm', u'count': 151},
 {u'_id': u'bank', u'count': 135},
 {u'_id': u'college', u'count': 128},
 {u'_id': u'fast_food', u'count': 106}]

In [10]:
aggregate_distincts("landuse", True)


[{u'_id': u'residential', u'count': 2093},
 {u'_id': u'commercial', u'count': 576},
 {u'_id': u'basin', u'count': 287},
 {u'_id': u'industrial', u'count': 278},
 {u'_id': u'grass', u'count': 259},
 {u'_id': u'retail', u'count': 193},
 {u'_id': u'military', u'count': 89},
 {u'_id': u'reservoir', u'count': 81},
 {u'_id': u'forest', u'count': 47},
 {u'_id': u'meadow', u'count': 32}]

In [34]:
aggregate_distincts("place")


[{u'_id': u'locality', u'count': 932},
 {u'_id': u'village', u'count': 306},
 {u'_id': u'suburb', u'count': 177},
 {u'_id': u'neighbourhood', u'count': 66},
 {u'_id': u'hamlet', u'count': 32},
 {u'_id': u'town', u'count': 20},
 {u'_id': u'city', u'count': 6},
 {u'_id': u'yes', u'count': 4},
 {u'_id': u'county', u'count': 4},
 {u'_id': u'Vasant_Kunj', u'count': 1},
 {u'_id': u'state', u'count': 1},
 {u'_id': u'islet', u'count': 1},
 {u'_id': u'Pharma_exporter,_delhi', u'count': 1}]
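
A few of the place values above ('yes', 'Vasant_Kunj', 'Pharma_exporter,_delhi') look like names or free text rather than place categories. The sketch below flags such values by comparing each one against a set of expected categories. The set `EXPECTED_PLACE_VALUES` is an assumption: a partial list of commonly documented OpenStreetMap place values, not an authoritative one. The function name `unexpected_places` is hypothetical.

```python
# Assumption: a partial list of commonly used OSM `place` values.
EXPECTED_PLACE_VALUES = frozenset([
    u'city', u'town', u'village', u'hamlet', u'suburb', u'neighbourhood',
    u'locality', u'county', u'state', u'islet', u'island',
])


def unexpected_places(rows):
    """Return the `_id` of every aggregate row whose value is not expected.

    `rows` has the same shape as the output above:
    a list of {u'_id': <place value>, u'count': <int>} dictionaries.
    """
    return [row[u'_id'] for row in rows
            if row[u'_id'] not in EXPECTED_PLACE_VALUES]


if __name__ == '__main__':
    sample = [
        {u'_id': u'locality', u'count': 932},
        {u'_id': u'yes', u'count': 4},
        {u'_id': u'Vasant_Kunj', u'count': 1},
        {u'_id': u'Pharma_exporter,_delhi', u'count': 1},
    ]
    print(unexpected_places(sample))
```

The flagged documents would still need manual review, since a value such as 'Vasant_Kunj' probably belongs in the name field rather than being discarded.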