Wrangling OpenStreetMap Data with MongoDB

by Duc Vu in fulfillment of Udacity’s Data Analyst Nanodegree, Project 3

OpenStreetMap is an open project that lets everyone use and create a free, editable map of the world.

1. Chosen Map Area

In this project, I choose to analyze data from Boston, Massachusetts. I want to show how to fix one common type of error, namely abbreviated street addresses. Beyond that, I will also show how to load the audited data into a MongoDB instance, and use MongoDB's Aggregation Framework to get an overview and analysis of the data.


In [81]:
from IPython.display import HTML
HTML('<iframe width="425" height="350" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" \
src="http://www.openstreetmap.org/export/embed.html?bbox=-71.442,42.1858,-70.6984,42.4918&amp;layer=mapnik"></iframe><br/>')


Out[81]:


In [2]:
filename = 'boston_massachusetts.osm'

I used the Overpass API to download the OpenStreetMap XML for the corresponding bounding box:
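
A minimal sketch of such a download, assuming the public Overpass API "map" endpoint and the requests library (the actual download was performed once, outside this notebook):


In [ ]:
import requests

# Hypothetical re-download of the Boston extract used in this notebook.
# The bbox order is min lon, min lat, max lon, max lat.
url = "http://overpass-api.de/api/map?bbox=-71.442,42.1858,-70.6984,42.4918"
response = requests.get(url, stream=True)
with open(filename, "wb") as f:
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        f.write(chunk)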

2. Auditing the Data

In this project, I will parse the downloaded OSM XML file with ElementTree and count the number of each type of element, since the XML file is too large to hold in memory all at once.


In [3]:
import xml.etree.cElementTree as ET
import pprint

def count_tags(filename):
    '''
    This function will return a dictionary with the tag name as the key 
    and the number of times this tag is encountered in the map as the value.
    '''
    tags = {}
    for event, elem in ET.iterparse(filename):
        if elem.tag in tags:
            tags[elem.tag] += 1
        else:
            tags[elem.tag] = 1
        elem.clear() # discard the element to keep memory usage low

    return tags
tags = count_tags(filename)
pprint.pprint(tags)


{'bounds': 1,
 'member': 9586,
 'nd': 2242045,
 'node': 1886391,
 'osm': 1,
 'relation': 1186,
 'tag': 846441,
 'way': 294246}

Before processing the data and adding it into MongoDB, I should check the "k" value of each "tag" element to see whether it would be a valid key in MongoDB, and whether there are any other potential problems.

I have built 3 regular expressions to check for certain patterns in the tag keys. The goal is to change the data model and expand "addr:street"-type keys into a dictionary like this: {"address": {"street": "Some value"}}

Here are the three regular expressions: lower, lower_colon, and problemchars.

  • lower: matches strings containing only lower case characters
  • lower_colon: matches strings containing lower case characters and a single colon within the string
  • problemchars: matches characters that cannot be used within keys in MongoDB

So, we have to see how many tags of each kind we have, and whether any tags have problematic characters.


In [4]:
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def key_type(element, keys):
    '''
    This function counts how many tag keys fall into each pattern category.
    Args:
        element(Element): an element parsed from the map file.
        keys(dictionary): running counts for each pattern category.
    '''
    if element.tag == "tag":

        if lower.search(element.attrib['k']):
            keys["lower"] += 1
        elif lower_colon.search(element.attrib['k']):
            keys["lower_colon"] += 1
        elif problemchars.search(element.attrib['k']):
            keys["problemchars"] +=1
        else:
            keys["other"] +=1
        

    return keys


def process_map(filename):
    '''
    This function will return a dictionary with each pattern category as the key
    and the number of tag keys matching that category as the value.
    Args:
        filename(osm): openstreetmap file.
    '''
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)

    return keys


keys = process_map(filename)
pprint.pprint(keys)


{'lower': 749756, 'lower_colon': 56532, 'other': 40149, 'problemchars': 4}

Now I will redefine process_map to build a set of unique user IDs found within the XML. I will then output the length of this set, representing the number of unique users making edits in the chosen map area.


In [5]:
def process_map(filename):
    '''
    This function will return a set of unique user IDs ("uid") 
    making edits in the chosen map area (i.e Boston area).
    Args:
        filename(osm): openstreetmap file.
    '''
    users = set()
    for _, element in ET.iterparse(filename):
        try:
            users.add(element.attrib['uid'])
        except KeyError: # this element has no 'uid' attribute
            continue
    return users

users = process_map(filename)
print len(users)


1016

3. Problems Encountered in the Map

3.1 Street name

The majority of this project is devoted to auditing and cleaning street names in the OSM XML file. I will find street types that are unexpected or abbreviated relative to the expected list, record the changes needed in a 'mapping' variable, and replace the abbreviations with their full text form.


In [6]:
from collections import defaultdict

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)


expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons"]

In [7]:
def audit_street_type(street_types, street_name, rex):
    '''
    This function takes the dictionary of street types, a street name to audit,
    and a compiled regex to match against that name; any matched street type
    not in the global 'expected' list is recorded.
    Args:
        street_types(dictionary): maps each unexpected street type to a set of street names.
        street_name(string): a street name to audit.
        rex(regex): a compiled regular expression to match against the street_name.
    '''
    m = rex.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)

Let's define an audit function that examines whichever tag elements match the is_street_name filter (here, tag elements where k="addr:street"). The audit function also takes a regex to match against each street name.


In [8]:
def audit(osmfile, rex):
    '''
    This function returns a dictionary mapping each unexpected street type
    to the set of full street names in which it occurs.
    Args:
        osmfile(osm): openstreetmap file.
        rex(regex): a compiled regular expression to match against the street_name.
    '''
    osm_file = open(osmfile, "r")
    street_types = defaultdict(set)
    for event, elem in ET.iterparse(osm_file, events=("start",)):

        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type(street_types, tag.attrib['v'],rex)

    return street_types

The function is_street_name determines if an element contains an attribute k="addr:street". I will use is_street_name as the tag filter when I call the audit function to audit street names.


In [9]:
def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")

Now print the output of audit


In [10]:
st_types = audit(filename, rex = street_type_re)
pprint.pprint(dict(st_types))


{'1100': set(['First Street, Suite 1100']),
 '1702': set(['Franklin Street, Suite 1702']),
 '303': set(['First Street, Suite 303']),
 '6': set(['South Station, near Track 6']),
 '846028': set(['PO Box 846028']),
 'Artery': set(['Southern Artery']),
 'Ave': set(['360 Huntington Ave',
             '738 Commonwealth Ave',
             'Blue Hill Ave',
             'Boston Ave',
             'College Ave',
             'Commonwealth Ave',
             'Concord Ave',
             'Francesca Ave',
             'Harrison Ave',
             'Highland Ave',
             'Huntington Ave',
             'Josephine Ave',
             'Lexington Ave',
             'Massachusetts Ave',
             'Morrison Ave',
             'Mystic Ave',
             'Sagamore Ave',
             'Somerville Ave',
             "St. Paul's Ave",
             'Washington Ave',
             'Western Ave',
             'Willow Ave']),
 'Ave.': set(['Brighton Ave.',
              'Huntington Ave.',
              'Massachusetts Ave.',
              'Somerville Ave.',
              'Spaulding Ave.']),
 'Boylston': set(['Boylston']),
 'Broadway': set(['Broadway', 'West Broadway']),
 'Brook': set(['Furnace Brook']),
 'Cambrdige': set(['Cambrdige']),
 'Center': set(['Cambridge Center']),
 'Circle': set(['Edgewood Circle']),
 'Corner': set(['Webster Street, Coolidge Corner']),
 'Ct': set(['Kelley Ct']),
 'Elm': set(['Elm']),
 'Federal': set(['Federal']),
 'Fellsway': set(['Fellsway']),
 'Fenway': set(['Fenway']),
 'Floor': set(['Boylston Street, 5th Floor']),
 'HIghway': set(['American Legion HIghway']),
 'Hall': set(['Faneuil Hall']),
 'Hampshire': set(['Hampshire']),
 'Highway': set(['American Legion Highway',
                 'Cummins Highway',
                 "Monsignor O'Brien Highway",
                 'Providence Highway',
                 'Santilli Highway']),
 'Holland': set(['Holland']),
 'Hwy': set(["Monsignor O'Brien Hwy"]),
 'LEVEL': set(['LOMASNEY WAY, ROOF LEVEL']),
 'Lafayette': set(['Avenue De Lafayette']),
 'Mall': set(['Cummington Mall']),
 'Newbury': set(['Newbury']),
 'Park': set(['Austin Park',
              'Batterymarch Park',
              'Canal Park',
              'Exeter Park',
              'Giles Park',
              'Malden Street Park',
              'Monument Park']),
 'Pkwy': set(['Birmingham Pkwy']),
 'Pl': set(['Longfellow Pl']),
 'Rd': set(['Abby Rd',
            'Aberdeen Rd',
            'Bristol Rd',
            'Goodnough Rd',
            'Oakland Rd',
            'Soldiers Field Rd',
            'Squanto Rd']),
 'Row': set(['Assembly Row', 'East India Row', 'Professors Row']),
 'ST': set(['Newton ST']),
 'South': set(['Charles Street South']),
 'Sq.': set(['1 Kendall Sq.']),
 'St': set(['1629 Cambridge St',
            '644 Beacon St',
            'Adams St',
            'Antwerp St',
            'Arsenal St',
            'Athol St',
            'Bagnal St',
            'Beacon St',
            'Brentwood St',
            'Broad St',
            'Cambridge St',
            'Congress St',
            'Court St',
            'Cummington St',
            'Dane St',
            'Duval St',
            'E 4th St',
            'Elm St',
            'Everett St',
            'George St',
            'Grove St',
            'Hampshire St',
            'Holton St',
            'Kirkland St',
            'Leighton St',
            'Litchfield St',
            'Lothrop St',
            'Mackin St',
            'Maverick St',
            'Medford St',
            'Merrill St',
            'N Beacon St',
            'Newbury St',
            'Norfolk St',
            'Portsmouth St',
            'Richardson St',
            'Salem St',
            'Sea St',
            'South Waverly St',
            'Stewart St',
            'Summer St',
            'Ware St',
            'Waverly St',
            'Winter St']),
 'St,': set(['Walnut St,']),
 'St.': set(['Albion St.',
             'Banks St.',
             'Boylston St.',
             'Brookline St.',
             'Centre St.',
             'Elm St.',
             'Main St.',
             'Marshall St.',
             'Maverick St.',
             'Pearl St.',
             'Prospect St.',
             "Saint Mary's St.",
             'Stuart St.',
             'Tremont St.']),
 'Street.': set(['Hancock Street.']),
 'Terrace': set(['Alberta Terrace', 'Norfolk Terrace', 'Westbourne Terrace']),
 'Way': set(['Artisan Way',
             'Courthouse Way',
             'David G Mugar Way',
             'Davidson Way',
             'Evans Way',
             'Harry Agganis Way',
             'Ross Way',
             'Yawkey Way']),
 'Wharf': set(['Long Wharf', 'Rowes Wharf']),
 'Windsor': set(['Windsor']),
 'Winsor': set(['Winsor']),
 'ave': set(['Massachusetts ave']),
 'floor': set(['First Street, 18th floor', 'Sidney Street, 2nd floor']),
 'place': set(['argus place']),
 'rd.': set(['Corey rd.']),
 'st': set(['Church st']),
 'street': set(['Boston street'])}

From the results of the audit, I will create a dictionary to map abbreviations to their full, clean representations.


In [11]:
mapping = { "ave" : "Avenue",
            "Ave" : "Avenue",
            "Ave.": "Avenue",
            "Ct" : "Court",
            "HIghway": "Highway",
            "Hwy": "Highway",
            "LEVEL": "Level",
            "Pkwy": "Parkway",
            "Pl": "Place",
            "rd." : "Road",
            "Rd" : "Road",
            "Rd." : "Road",
            "Sq." : "Square", 
            "st": "Street",
            "St": "Street",
            "St.": "Street",
            "St,": "Street", 
            "ST": "Street",
            "Street." : "Street",               
            }

The audit gives me a list of abbreviated street types (as well as some unexpected but clean street types, cardinal directions, and highway numbers). So I need to build an update_name function to replace the abbreviated street types.


In [12]:
def update_name(name, mapping, rex):
    '''
    This function takes a street name and replaces its abbreviated street type
    (or cardinal direction) with the full form given in the mapping.
    Args:
        name(string): street name to update.
        mapping(dictionary): maps abbreviations to their full forms.
        rex(regex): a compiled regular expression to match against the street name.
    '''
    m = rex.search(name)
    if m:
        street_type = m.group()
        if street_type in mapping: # skip matches we have no mapping for
            name = re.sub(rex, mapping[street_type], name) # re.sub(pattern, replacement, string)
    return name

Let's see how this update_name works.


In [13]:
for st_type, ways in st_types.iteritems():
    if st_type in mapping:
        for name in ways:
            better_name = update_name(name, mapping, rex = street_type_re)
            print name, "=>", better_name


Walnut St, => Walnut Street
Maverick St. => Maverick Street
Pearl St. => Pearl Street
Banks St. => Banks Street
Tremont St. => Tremont Street
Centre St. => Centre Street
Marshall St. => Marshall Street
Prospect St. => Prospect Street
Main St. => Main Street
Albion St. => Albion Street
Saint Mary's St. => Saint Mary's Street
Boylston St. => Boylston Street
Stuart St. => Stuart Street
Elm St. => Elm Street
Brookline St. => Brookline Street
Oakland Rd => Oakland Road
Abby Rd => Abby Road
Bristol Rd => Bristol Road
Squanto Rd => Squanto Road
Goodnough Rd => Goodnough Road
Soldiers Field Rd => Soldiers Field Road
Aberdeen Rd => Aberdeen Road
Massachusetts ave => Massachusetts Avenue
Newton ST => Newton Street
Longfellow Pl => Longfellow Place
Hancock Street. => Hancock Street
Monsignor O'Brien Hwy => Monsignor O'Brien Highway
Corey rd. => Corey Road
Brighton Ave. => Brighton Avenue
Spaulding Ave. => Spaulding Avenue
Massachusetts Ave. => Massachusetts Avenue
Somerville Ave. => Somerville Avenue
Huntington Ave. => Huntington Avenue
LOMASNEY WAY, ROOF LEVEL => LOMASNEY WAY, ROOF Level
Salem St => Salem Street
Brentwood St => Brentwood Street
Medford St => Medford Street
Athol St => Athol Street
Everett St => Everett Street
South Waverly St => South Waverly Street
Litchfield St => Litchfield Street
Hampshire St => Hampshire Street
George St => George Street
Winter St => Winter Street
Broad St => Broad Street
Cambridge St => Cambridge Street
Arsenal St => Arsenal Street
Merrill St => Merrill Street
Maverick St => Maverick Street
Antwerp St => Antwerp Street
Beacon St => Beacon Street
1629 Cambridge St => 1629 Cambridge Street
E 4th St => E 4th Street
Elm St => Elm Street
Congress St => Congress Street
Lothrop St => Lothrop Street
Stewart St => Stewart Street
Dane St => Dane Street
Norfolk St => Norfolk Street
Bagnal St => Bagnal Street
Cummington St => Cummington Street
Holton St => Holton Street
Mackin St => Mackin Street
Waverly St => Waverly Street
Court St => Court Street
Summer St => Summer Street
Duval St => Duval Street
Kirkland St => Kirkland Street
Adams St => Adams Street
644 Beacon St => 644 Beacon Street
N Beacon St => N Beacon Street
Grove St => Grove Street
Leighton St => Leighton Street
Richardson St => Richardson Street
Newbury St => Newbury Street
Sea St => Sea Street
Ware St => Ware Street
Portsmouth St => Portsmouth Street
Massachusetts Ave => Massachusetts Avenue
Highland Ave => Highland Avenue
Lexington Ave => Lexington Avenue
Huntington Ave => Huntington Avenue
Francesca Ave => Francesca Avenue
Willow Ave => Willow Avenue
360 Huntington Ave => 360 Huntington Avenue
Harrison Ave => Harrison Avenue
Somerville Ave => Somerville Avenue
Mystic Ave => Mystic Avenue
Blue Hill Ave => Blue Hill Avenue
Washington Ave => Washington Avenue
Morrison Ave => Morrison Avenue
Boston Ave => Boston Avenue
738 Commonwealth Ave => 738 Commonwealth Avenue
Josephine Ave => Josephine Avenue
Sagamore Ave => Sagamore Avenue
Commonwealth Ave => Commonwealth Avenue
St. Paul's Ave => St. Paul's Avenue
Concord Ave => Concord Avenue
Western Ave => Western Avenue
College Ave => College Avenue
Birmingham Pkwy => Birmingham Parkway
Church st => Church Street
1 Kendall Sq. => 1 Kendall Square
American Legion HIghway => American Legion Highway
Kelley Ct => Kelley Court

It seems that all the abbreviated street types were updated as expected.

3.2 Cardinal direction

But I can see there is still an issue: cardinal directions (North, South, East, and West) appear to be universally abbreviated. Therefore, I will build a cardinal directions mapping and apply updates for the cardinal direction as well as the street type.


In [14]:
cardinal_dir_re = re.compile(r'^[NSEW]\b\.?', re.IGNORECASE)

Here is the result of auditing the cardinal directions with this new regex 'cardinal_dir_re':


In [15]:
dir_st_types = audit(filename, rex = cardinal_dir_re)
pprint.pprint(dict(dir_st_types))


{'E': set(['E 4th St', 'E Elm Avenue']), 'N': set(['N Beacon St'])}

I will create a dictionary to map abbreviations (N, S, E and W) to their full representations of cardinal directions.


In [16]:
cardinal_directions_mapping = \
    {
        "E" : "East",
        "N" : "North",
        "S" : "South",
        "W" : "West"
    }

Let's apply update_name with this mapping and check that the abbreviated cardinal directions are replaced:


In [17]:
for st_type, ways in dir_st_types.iteritems():
    if st_type in cardinal_directions_mapping:
        for name in ways:
            better_name = update_name(name, cardinal_directions_mapping, rex = cardinal_dir_re)
            print name, "=>", better_name


E Elm Avenue => East Elm Avenue
E 4th St => East 4th St
N Beacon St => North Beacon St
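
Note that the street types in these names are still abbreviated, since each pass applies only one mapping. A small sketch chaining the two passes (relying on the guard in update_name to skip street types that have no mapping):


In [ ]:
# Sketch: apply the cardinal direction fix, then the street type fix.
def update_full_name(name):
    name = update_name(name, cardinal_directions_mapping, rex = cardinal_dir_re)
    name = update_name(name, mapping, rex = street_type_re)
    return name

print update_full_name("E 4th St") # => East 4th Street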

3.3 Postal codes

Let's examine the postal codes. We can see that there are still some invalid postal codes, so we also need to clean them. I will use a regular expression to identify invalid postal codes and return standardized results: for example, postal codes like 'MA 02131-4931' and '02131-2460' should both be mapped to '02131'.


In [26]:
badZipCode = ["MA", "Mass Ave"]


# 5 digits in a row at the end of the string ($), with an optional (?)
# dash-plus-4-digits suffix; group(1) captures the 5 digit part we keep.
zip_code_re = re.compile(r"(\d{5})(-\d{4})?$")

# find the zipcodes
def get_postcode(element):
    if element.attrib['k'] == "addr:postcode":
        return element.attrib['v']


# update zipcodes
def update_postal(postcode, rex):
    '''
    This function takes a string with a zip code as an argument
    and replaces a malformed zip code with the fixed zip code.
    Args:
        postcode(string): zip code to update.
        rex(regex): a compiled regular expression to match against the zip code.
    '''
    if postcode is not None:
        zip_code = rex.search(postcode)
        if zip_code:
            postcode = zip_code.group(1)
    return postcode

In [27]:
def audit(osmfile):
    '''
    This function returns a dictionary with each zip code as the key
    and the number of times that zip code occurs in the osm file as the value.
    Args:
        osmfile(osm): openstreetmap file.
    '''
    osm_file = open(osmfile, "r")
    data_dict = defaultdict(int)
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if get_postcode(tag):
                    postcode = get_postcode(tag)
                    data_dict[postcode] += 1 
    return data_dict

In [28]:
zip_code_types = audit(filename)
pprint.pprint(dict(zip_code_types))


{' 02472': 1,
 '01125': 1,
 '01238': 1,
 '01240': 1,
 '01250': 1,
 '01821': 1,
 '01944': 1,
 '02026': 1,
 '02026-5036': 1,
 '02043': 3,
 '02108': 11,
 '02109': 8,
 '02110': 7,
 '02110-1301': 1,
 '02111': 16,
 '02113': 1,
 '02114': 58,
 '02114-3203': 1,
 '02115': 9,
 '02116': 41,
 '02118': 5,
 '02119': 1,
 '02120': 3,
 '02121': 2,
 '02122': 6,
 '02124': 8,
 '02125': 5,
 '02126': 7,
 '02127': 11,
 '02128': 22,
 '02129': 2,
 '02130': 43,
 '02130-4803': 1,
 '02131': 8,
 '02131-3025': 2,
 '02131-4931': 1,
 '02132': 17,
 '02132-1239': 1,
 '02132-3226': 1,
 '02134': 48,
 '02134-1305': 9,
 '02134-1306': 2,
 '02134-1307': 29,
 '02134-1311': 4,
 '02134-1312': 2,
 '02134-1313': 4,
 '02134-1316': 3,
 '02134-1317': 4,
 '02134-1318': 2,
 '02134-1319': 5,
 '02134-1321': 4,
 '02134-1322': 4,
 '02134-1327': 1,
 '02134-1409': 4,
 '02134-1420': 9,
 '02134-1433': 11,
 '02134-1442': 5,
 '02135': 249,
 '02136': 6,
 '02136-2460': 1,
 '02138': 49,
 '02138-1901': 1,
 '02138-2701': 8,
 '02138-2706': 3,
 '02138-2724': 1,
 '02138-2735': 1,
 '02138-2736': 2,
 '02138-2742': 1,
 '02138-2762': 1,
 '02138-2763': 1,
 '02138-2801': 4,
 '02138-2901': 4,
 '02138-2903': 8,
 '02138-2933': 3,
 '02138-3003': 1,
 '02138-3824': 1,
 '02139': 227,
 '02140': 14,
 '02140-1340': 1,
 '02140-2215': 1,
 '02141': 16,
 '02142': 31,
 '02143': 50,
 '02144': 89,
 '02145': 18,
 '02148': 1,
 '02149': 15,
 '02150': 8,
 '02151': 4,
 '02152': 2,
 '02155': 22,
 '02159': 1,
 '02169': 52,
 '02170': 3,
 '02171': 5,
 '02174': 1,
 '02184': 3,
 '02186': 8,
 '02205': 1,
 '02210': 26,
 '02215': 59,
 '02228': 1,
 '02284-6028': 1,
 '02445': 11,
 '02445-5841': 1,
 '02446': 36,
 '02458': 4,
 '02459': 5,
 '02467': 23,
 '02472': 29,
 '02474': 21,
 '02474-8735': 1,
 '02476': 8,
 '02478': 9,
 'MA': 6,
 'MA 02116': 4,
 'MA 02186': 1,
 'Mass Ave': 1}

In [29]:
for raw_zip_code in zip_code_types:
    if raw_zip_code not in badZipCode:
        better_zip_code = update_postal(raw_zip_code, rex = zip_code_re)
        print raw_zip_code, "=>", better_zip_code


02186 => 02186
02184 => 02184
02134-1327 => 02134
02130 => 02130
02134-1322 => 02134
02134-1321 => 02134
02138-1901 => 02138
02132-3226 => 02132
01821 => 01821
02134-1433 => 02134
02108 => 02108
02026 => 02026
02476 => 02476
02474 => 02474
02472 => 02472
02139 => 02139
02134-1319 => 02134
02478 => 02478
02136-2460 => 02136
02131-3025 => 02131
02136 => 02136
02140-1340 => 02140
02134 => 02134
02205 => 02205
02132 => 02132
02131 => 02131
 02472 => 02472
02110-1301 => 02110
02138 => 02138
02138-2903 => 02138
02138-2901 => 02138
02134-1442 => 02134
01250 => 01250
02132-1239 => 02132
02446 => 02446
02445 => 02445
02138-2742 => 02138
02120 => 02120
02121 => 02121
02210 => 02210
02124 => 02124
02125 => 02125
02126 => 02126
02215 => 02215
02128 => 02128
02129 => 02129
02474-8735 => 02474
01240 => 01240
02114-3203 => 02114
02458 => 02458
02459 => 02459
MA 02116 => 02116
01125 => 01125
02109 => 02109
02228 => 02228
MA 02186 => 02186
02155 => 02155
02151 => 02151
02150 => 02150
02152 => 02152
02159 => 02159
02026-5036 => 02026
01238 => 01238
02138-2724 => 02138
02445-5841 => 02445
02138-2933 => 02138
02144 => 02144
02145 => 02145
02142 => 02142
02143 => 02143
02140 => 02140
02141 => 02141
02148 => 02148
02149 => 02149
02138-2735 => 02138
02138-2736 => 02138
02131-4931 => 02131
02138-3824 => 02138
02134-1316 => 02134
02134-1317 => 02134
02134-1312 => 02134
02134-1313 => 02134
02134-1311 => 02134
02134-1318 => 02134
01944 => 01944
02171 => 02171
02134-1409 => 02134
02174 => 02174
02170 => 02170
02122 => 02122
02138-2706 => 02138
02138-2701 => 02138
02140-2215 => 02140
02134-1420 => 02134
02127 => 02127
02114 => 02114
02134-1305 => 02134
02134-1307 => 02134
02134-1306 => 02134
02043 => 02043
02169 => 02169
02130-4803 => 02130
02138-2801 => 02138
02284-6028 => 02284
02119 => 02119
02118 => 02118
02138-3003 => 02138
02111 => 02111
02110 => 02110
02113 => 02113
02115 => 02115
02135 => 02135
02116 => 02116
02467 => 02467
02138-2762 => 02138
02138-2763 => 02138


3.4 Nodes and ways with street addresses

Then I will count the total number of nodes and ways that contain a tag child with k="addr:street".


In [30]:
osm_file = open(filename, "r")
address_count = 0

for event, elem in ET.iterparse(osm_file, events=("start",)):
    if elem.tag == "node" or elem.tag == "way":
        for tag in elem.iter("tag"): 
            if is_street_name(tag):
                address_count += 1

address_count


Out[30]:
2976

4. Preparing for MongoDB

Before importing the XML data into MongoDB, I have to transform the shape of the data into a list of dictionaries, written out as JSON documents structured like this:

{
  "id": "2406124091",
  "type": "node",
  "visible": "true",
  "created": {
    "version": "2",
    "changeset": "17206049",
    "timestamp": "2013-08-03T16:43:42Z",
    "user": "linuxUser16",
    "uid": "1219059"
  },
  "pos": [41.9757030, -87.6921867],
  "address": {
    "housenumber": "5157",
    "postcode": "60625",
    "street": "North Lincoln Ave"
  },
  "amenity": "restaurant",
  "cuisine": "mexican",
  "name": "La Cabana De Don Luis",
  "phone": "1 (773)-271-5176"
}

Here are the rules:

  • process only 2 types of top level tags: "node" and "way"
  • all attributes of "node" and "way" should be turned into regular key/value pairs, except:
    • attributes in the CREATED array should be added under a key "created"
    • attributes for latitude and longitude should be added to a "pos" array, for use in geospatial indexing; make sure the values inside the "pos" array are floats and not strings
  • if a second level tag "k" value contains problematic characters, it should be ignored
  • if a second level tag "k" value starts with "addr:", it should be added to a dictionary "address", so that addr:housenumber and addr:street tags become: {... "address": {"housenumber": "5158", "street": "North Lincoln Avenue"}, "amenity": "pharmacy", ...}
  • if a second level tag "k" value does not start with "addr:" but contains ":", it can be processed the same as any other tag
  • if there is a second ":" that separates the type/direction of a street (e.g. "addr:street:prefix"), the tag should be ignored
  • for "way" specifically, the <nd ref="..."/> children should be collected into an array: "node_refs": ["305896090", "1719825889"]

In [31]:
CREATED = [ "version", "changeset", "timestamp", "user", "uid"]

def shape_element(element):
    '''
    This function shapes a single "node" or "way" element into a dictionary
    following the rules above, and returns None for any other element.
    Args:
        element(Element): a top level element parsed from the map file.
    '''
    node = {}

    if element.tag == "node" or element.tag == "way":
        node["type"] = element.tag

        # top level attributes
        for key in element.attrib:
            if key in CREATED:
                if "created" not in node:
                    node["created"] = {}
                node["created"][key] = element.attrib[key]
            elif key in ["lat", "lon"]:
                if "pos" not in node:
                    node["pos"] = [None, None]
                if key == "lat":
                    node["pos"][0] = float(element.attrib[key])
                else:
                    node["pos"][1] = float(element.attrib[key])
            else:
                node[key] = element.attrib[key]

        # second level "tag" children
        for tag in element.iter("tag"):
            tag_key = tag.attrib["k"]   # key
            tag_value = tag.attrib["v"] # value
            if problemchars.search(tag_key):
                continue # ignore keys with problematic characters
            if tag_key.startswith("addr:"):
                sub_addr = tag_key[len("addr:"):]
                if ":" not in sub_addr: # ignore keys with a second colon, e.g. "addr:street:prefix"
                    if "address" not in node:
                        node["address"] = {}
                    node["address"][sub_addr] = tag_value
            else: # all other tags, with or without a colon
                node[tag_key] = tag_value

        # "nd" children of ways
        for nd in element.iter("nd"):
            if "node_refs" not in node:
                node["node_refs"] = []
            node["node_refs"].append(nd.attrib["ref"])

        return node
    else:
        return None

It's time to parse the XML, shape the elements, and write the results to a JSON file.


In [ ]:
import codecs
import json

def process_map(file_in, pretty = False):
    # Shape each element and stream the resulting documents to a JSON file
    file_out = "{0}.json".format(file_in)
    data = []
    with codecs.open(file_out, "w") as fo:
        for _, element in ET.iterparse(file_in):
            el = shape_element(element)
            if el:
                data.append(el)
                if pretty:
                    fo.write(json.dumps(el, indent=2)+"\n")
                else:
                    fo.write(json.dumps(el) + "\n")
    return data

process_map(filename)

5. Data Overview

Check the size of XML and JSON files


In [33]:
import os
print "The downloaded XML file is {} MB".format(os.path.getsize(filename)/1.0e6) # convert from bytes to megabytes


The downloaded XML file is 421.230253 MB

In [34]:
print "The json file is {} MB".format(os.path.getsize(filename + ".json")/1.0e6) # convert from bytes to megabytes


The json file is 484.237902 MB

Execute mongod to run MongoDB

Use the subprocess module to run shell commands.


In [35]:
import signal
import subprocess

# os.setsid() is passed as the preexec_fn argument so it runs after
# the fork() and before exec(), starting mongod in its own process group.
pro = subprocess.Popen("mongod", preexec_fn = os.setsid)

Connect to the database with pymongo


In [36]:
from pymongo import MongoClient

db_name = "osm"

client = MongoClient('localhost:27017')
db = client[db_name]

When we have to import a large amount of data, mongoimport is recommended.

First build a mongoimport command, then use subprocess.call to execute it.


In [37]:
# Build mongoimport command
collection = filename[:filename.find(".")]
#print collection
working_directory = "/Users/ducvu/Documents/ud032-master/final_project/"

json_file = filename + ".json"
#print json_file
mongoimport_cmd = "mongoimport --db " + db_name + \
                  " --collection " + collection + \
                  " --file " + working_directory + json_file
#print mongoimport_cmd 

# Before importing, drop collection if it exists
if collection in db.collection_names():
    print "dropping collection"
    db[collection].drop()

# Execute the command
print "Executing: " + mongoimport_cmd
subprocess.call(mongoimport_cmd.split())


dropping collection
Executing: mongoimport --db osm --collection boston_massachusetts --file /Users/ducvu/Documents/ud032-master/final_project/boston_massachusetts.osm.json
Out[37]:
0

Get the collection from the database


In [38]:
boston_db = db[collection]

Number of Documents


In [39]:
boston_db.find().count()


Out[39]:
2180637

Number of Unique Users


In [40]:
len(boston_db.distinct('created.user'))


Out[40]:
1001

Number of Nodes and Ways


In [41]:
node_way = boston_db.aggregate([
        {"$group" : {"_id" : "$type", "count" : {"$sum" : 1}}}])

pprint.pprint(list(node_way))


[{u'_id': u'multipolygon', u'count': 1},
 {u'_id': u'video', u'count': 1},
 {u'_id': u'chain_link', u'count': 1},
 {u'_id': u'way', u'count': 294189},
 {u'_id': u'Collaborative Program', u'count': 6},
 {u'_id': u'charter', u'count': 1},
 {u'_id': u'special', u'count': 1},
 {u'_id': u'Approved Special Education', u'count': 12},
 {u'_id': u'Academic', u'count': 34},
 {u'_id': u'Special-Law', u'count': 3},
 {u'_id': u'civil', u'count': 1},
 {u'_id': u'node', u'count': 1885954},
 {u'_id': u'Special', u'count': 49},
 {u'_id': u'Charter', u'count': 17},
 {u'_id': u'Public', u'count': 182},
 {u'_id': u'broad_leaved', u'count': 4},
 {u'_id': u'School', u'count': 77},
 {u'_id': u'private', u'count': 3},
 {u'_id': u'County', u'count': 3},
 {u'_id': u'Private', u'count': 87},
 {u'_id': u'State', u'count': 2},
 {u'_id': u'Special-Medical', u'count': 7},
 {u'_id': u'Special-Institutional', u'count': 2}]

Values other than "node" and "way" appear here because some elements carry a second level tag with k="type", which overwrites the top level "type" field set by shape_element; only the "node" and "way" counts are meaningful.

Number of Nodes


In [42]:
boston_db.find({"type":"node"}).count()


Out[42]:
1885954

Number of Ways


In [43]:
boston_db.find({"type":"way"}).count()


Out[43]:
294189

Top Contributing User


In [44]:
top_user = boston_db.aggregate([
    {"$match":{"type":"node"}},
    {"$group":{"_id":"$created.user","count":{"$sum":1}}},
    {"$sort":{"count":-1}},
    {"$limit":1}
])

#print(list(top_user))
pprint.pprint(list(top_user))


[{u'_id': u'crschmidt', u'count': 1229402}]

Number of users having only 1 post


In [45]:
one_post_users = boston_db.aggregate([
    {"$group":{"_id":"$created.user","count":{"$sum":1}}},
    {"$group":{"_id":{"postcount":"$count"},"num_users":{"$sum":1}}},
    {"$project":{"_id":0,"postcount":"$_id.postcount","num_users":1}},
    {"$sort":{"postcount":1}},
    {"$limit":1}
])

pprint.pprint(list(one_post_users))


[{u'num_users': 244, u'postcount': 1}]

Number of Documents Containing a Street Address


In [46]:
boston_db.find({"address.street" : {"$exists" : 1}}).count()


Out[46]:
3026

Zip codes


In [47]:
zipcodes = boston_db.aggregate([
        {"$match" : {"address.postcode" : {"$exists" : 1}}}, \
        {"$group" : {"_id" : "$address.postcode", "count" : {"$sum" : 1}}}, \
        {"$sort" : {"count" : -1}}])

#for document in zipcodes:
#    print(document)
    
pprint.pprint(list(zipcodes))


[{u'_id': u'02135', u'count': 250},
 {u'_id': u'02139', u'count': 230},
 {u'_id': u'02144', u'count': 92},
 {u'_id': u'02215', u'count': 61},
 {u'_id': u'02114', u'count': 59},
 {u'_id': u'02143', u'count': 52},
 {u'_id': u'02169', u'count': 52},
 {u'_id': u'02134', u'count': 49},
 {u'_id': u'02138', u'count': 49},
 {u'_id': u'02130', u'count': 44},
 {u'_id': u'02116', u'count': 41},
 {u'_id': u'02446', u'count': 38},
 {u'_id': u'02142', u'count': 31},
 {u'_id': u'02472', u'count': 30},
 {u'_id': u'02134-1307', u'count': 29},
 {u'_id': u'02210', u'count': 28},
 {u'_id': u'02467', u'count': 26},
 {u'_id': u'02155', u'count': 23},
 {u'_id': u'02474', u'count': 22},
 {u'_id': u'02128', u'count': 22},
 {u'_id': u'02145', u'count': 20},
 {u'_id': u'02132', u'count': 17},
 {u'_id': u'02141', u'count': 16},
 {u'_id': u'02111', u'count': 16},
 {u'_id': u'02149', u'count': 15},
 {u'_id': u'02140', u'count': 14},
 {u'_id': u'02108', u'count': 11},
 {u'_id': u'02134-1433', u'count': 11},
 {u'_id': u'02127', u'count': 11},
 {u'_id': u'02445', u'count': 11},
 {u'_id': u'02115', u'count': 9},
 {u'_id': u'02134-1420', u'count': 9},
 {u'_id': u'02478', u'count': 9},
 {u'_id': u'02134-1305', u'count': 9},
 {u'_id': u'02150', u'count': 8},
 {u'_id': u'02109', u'count': 8},
 {u'_id': u'02124', u'count': 8},
 {u'_id': u'02476', u'count': 8},
 {u'_id': u'02138-2903', u'count': 8},
 {u'_id': u'02131', u'count': 8},
 {u'_id': u'02186', u'count': 8},
 {u'_id': u'02138-2701', u'count': 8},
 {u'_id': u'02126', u'count': 7},
 {u'_id': u'02110', u'count': 7},
 {u'_id': u'02122', u'count': 6},
 {u'_id': u'02459', u'count': 6},
 {u'_id': u'02134-1322', u'count': 6},
 {u'_id': u'02118', u'count': 6},
 {u'_id': u'02136', u'count': 6},
 {u'_id': u'MA', u'count': 6},
 {u'_id': u'02134-1319', u'count': 5},
 {u'_id': u'02171', u'count': 5},
 {u'_id': u'02125', u'count': 5},
 {u'_id': u'02134-1442', u'count': 5},
 {u'_id': u'02138-2901', u'count': 4},
 {u'_id': u'02134-1311', u'count': 4},
 {u'_id': u'02138-2801', u'count': 4},
 {u'_id': u'02134-1321', u'count': 4},
 {u'_id': u'02458', u'count': 4},
 {u'_id': u'02134-1313', u'count': 4},
 {u'_id': u'02134-1409', u'count': 4},
 {u'_id': u'MA 02116', u'count': 4},
 {u'_id': u'02134-1317', u'count': 4},
 {u'_id': u'02151', u'count': 4},
 {u'_id': u'02138-2933', u'count': 3},
 {u'_id': u'02134-1316', u'count': 3},
 {u'_id': u'02138-2706', u'count': 3},
 {u'_id': u'02170', u'count': 3},
 {u'_id': u'02120', u'count': 3},
 {u'_id': u'02184', u'count': 3},
 {u'_id': u'02043', u'count': 3},
 {u'_id': u'02138-2736', u'count': 2},
 {u'_id': u'02134-1306', u'count': 2},
 {u'_id': u'02152', u'count': 2},
 {u'_id': u'02134-1318', u'count': 2},
 {u'_id': u'02129', u'count': 2},
 {u'_id': u'02131-3025', u'count': 2},
 {u'_id': u'02121', u'count': 2},
 {u'_id': u'02119', u'count': 2},
 {u'_id': u'02134-1312', u'count': 2},
 {u'_id': u'01250', u'count': 1},
 {u'_id': u'02138-2735', u'count': 1},
 {u'_id': u'02138-2763', u'count': 1},
 {u'_id': u'02138-2762', u'count': 1},
 {u'_id': u'02174', u'count': 1},
 {u'_id': u'01821', u'count': 1},
 {u'_id': u'02134-1327', u'count': 1},
 {u'_id': u'02130-4803', u'count': 1},
 {u'_id': u'02131-4931', u'count': 1},
 {u'_id': u'01125', u'count': 1},
 {u'_id': u'02140-2215', u'count': 1},
 {u'_id': u'02474-8735', u'count': 1},
 {u'_id': u'02138-2742', u'count': 1},
 {u'_id': u'02138-3003', u'count': 1},
 {u'_id': u'02445-5841', u'count': 1},
 {u'_id': u'02138-3824', u'count': 1},
 {u'_id': u'02138-1901', u'count': 1},
 {u'_id': u'02284-6028', u'count': 1},
 {u'_id': u'02140-1340', u'count': 1},
 {u'_id': u'01238', u'count': 1},
 {u'_id': u'02110-1301', u'count': 1},
 {u'_id': u'02132-3226', u'count': 1},
 {u'_id': u'01240', u'count': 1},
 {u'_id': u'02026-5036', u'count': 1},
 {u'_id': u'01944', u'count': 1},
 {u'_id': u'MA 02186', u'count': 1},
 {u'_id': u'02113', u'count': 1},
 {u'_id': u'Mass Ave', u'count': 1},
 {u'_id': u'02205', u'count': 1},
 {u'_id': u' 02472', u'count': 1},
 {u'_id': u'02136-2460', u'count': 1},
 {u'_id': u'02132-1239', u'count': 1},
 {u'_id': u'02159', u'count': 1},
 {u'_id': u'02114-3203', u'count': 1},
 {u'_id': u'02228', u'count': 1},
 {u'_id': u'02138-2724', u'count': 1},
 {u'_id': u'02148', u'count': 1},
 {u'_id': u'02026', u'count': 1}]

Top 5 Most Common Cities


In [48]:
cities = boston_db.aggregate([{"$match" : {"address.city" : {"$exists" : 1}}}, \
                           {"$group" : {"_id" : "$address.city", "count" : {"$sum" : 1}}}, \
                           {"$sort" : {"count" : -1}}, \
                           {"$limit" : 5}])

#for city in cities :
#    print city
    
pprint.pprint(list(cities))


[{u'_id': u'Boston', u'count': 594},
 {u'_id': u'Malden', u'count': 413},
 {u'_id': u'Cambridge', u'count': 323},
 {u'_id': u'Somerville', u'count': 153},
 {u'_id': u'Quincy', u'count': 51}]

Top 10 Amenities


In [49]:
amenities = boston_db.aggregate([
        {"$match" : {"amenity" : {"$exists" : 1}}}, \
        {"$group" : {"_id" : "$amenity", "count" : {"$sum" : 1}}}, \
        {"$sort" : {"count" : -1}}, \
        {"$limit" : 10}])

#for document in amenities:
#    print document

pprint.pprint(list(amenities))


[{u'_id': u'parking', u'count': 1192},
 {u'_id': u'bench', u'count': 946},
 {u'_id': u'school', u'count': 774},
 {u'_id': u'restaurant', u'count': 532},
 {u'_id': u'parking_space', u'count': 444},
 {u'_id': u'place_of_worship', u'count': 407},
 {u'_id': u'library', u'count': 342},
 {u'_id': u'bicycle_parking', u'count': 238},
 {u'_id': u'cafe', u'count': 199},
 {u'_id': u'fast_food', u'count': 169}]

In [50]:
amenities = boston_db.aggregate([
    {"$match":{"amenity":{"$exists":1},"type":"node"}},
    {"$group":{"_id":"$amenity","count":{"$sum":1}}},
    {"$sort":{"count":-1}},
    {"$limit":10}
])

pprint.pprint(list(amenities))


[{u'_id': u'bench', u'count': 938},
 {u'_id': u'restaurant', u'count': 493},
 {u'_id': u'school', u'count': 370},
 {u'_id': u'place_of_worship', u'count': 317},
 {u'_id': u'bicycle_parking', u'count': 238},
 {u'_id': u'cafe', u'count': 182},
 {u'_id': u'fast_food', u'count': 154},
 {u'_id': u'library', u'count': 141},
 {u'_id': u'bicycle_rental', u'count': 141},
 {u'_id': u'fire_station', u'count': 96}]

Most common building types


In [51]:
type_buildings = boston_db.aggregate([
    {'$match': {'building': {'$exists': 1}}}, 
    {'$group': { '_id': '$building','count': {'$sum': 1}}},
    {'$sort': {'count': -1}}, {'$limit': 20}
])

pprint.pprint(list(type_buildings))


[{u'_id': u'yes', u'count': 248378},
 {u'_id': u'garage', u'count': 673},
 {u'_id': u'house', u'count': 629},
 {u'_id': u'apartments', u'count': 422},
 {u'_id': u'university', u'count': 295},
 {u'_id': u'shed', u'count': 131},
 {u'_id': u'roof', u'count': 109},
 {u'_id': u'dormitory', u'count': 98},
 {u'_id': u'school', u'count': 88},
 {u'_id': u'residential', u'count': 76},
 {u'_id': u'commercial', u'count': 69},
 {u'_id': u'retail', u'count': 53},
 {u'_id': u'entrance', u'count': 37},
 {u'_id': u'storage_tank', u'count': 30},
 {u'_id': u'church', u'count': 27},
 {u'_id': u'home', u'count': 26},
 {u'_id': u'office', u'count': 20},
 {u'_id': u'industrial', u'count': 16},
 {u'_id': u'hospital', u'count': 7},
 {u'_id': u'hotel', u'count': 5}]

Top Religions with Denominations


In [52]:
religions = boston_db.aggregate([
        {"$match" : {"amenity" : "place_of_worship"}}, \
        {"$group" : {"_id" : {"religion" : "$religion", "denomination" : "$denomination"}, "count" : {"$sum" : 1}}}, \
        {"$sort" : {"count" : -1}}])

#for document in religions:
#    print document

pprint.pprint(list(religions))


[{u'_id': {u'religion': u'christian'}, u'count': 189},
 {u'_id': {u'denomination': u'baptist', u'religion': u'christian'},
  u'count': 53},
 {u'_id': {u'denomination': u'methodist', u'religion': u'christian'},
  u'count': 22},
 {u'_id': {u'denomination': u'catholic', u'religion': u'christian'},
  u'count': 22},
 {u'_id': {}, u'count': 16},
 {u'_id': {u'denomination': u'roman_catholic', u'religion': u'christian'},
  u'count': 10},
 {u'_id': {u'denomination': u'presbyterian', u'religion': u'christian'},
  u'count': 10},
 {u'_id': {u'denomination': u'lutheran', u'religion': u'christian'},
  u'count': 9},
 {u'_id': {u'religion': u'jewish'}, u'count': 9},
 {u'_id': {u'denomination': u'pentecostal', u'religion': u'christian'},
  u'count': 8},
 {u'_id': {u'denomination': u'episcopal', u'religion': u'christian'},
  u'count': 8},
 {u'_id': {u'religion': u'unitarian'}, u'count': 6},
 {u'_id': {u'denomination': u'orthodox', u'religion': u'christian'},
  u'count': 4},
 {u'_id': {u'religion': u'unitarian_universalist'}, u'count': 3},
 {u'_id': {u'denomination': u'mormon', u'religion': u'christian'},
  u'count': 3},
 {u'_id': {u'denomination': u'anglican', u'religion': u'christian'},
  u'count': 2},
 {u'_id': {u'denomination': u'zen', u'religion': u'buddhist'}, u'count': 2},
 {u'_id': {u'denomination': u'congregational', u'religion': u'christian'},
  u'count': 2},
 {u'_id': {u'denomination': u'reform', u'religion': u'jewish'}, u'count': 2},
 {u'_id': {u'denomination': u'reformed', u'religion': u'jewish'}, u'count': 2},
 {u'_id': {u'denomination': u'jehovahs_witness', u'religion': u'christian'},
  u'count': 2},
 {u'_id': {u'denomination': u'evangelical', u'religion': u'christian'},
  u'count': 2},
 {u'_id': {u'denomination': u'non-denominational', u'religion': u'christian'},
  u'count': 1},
 {u'_id': {u'denomination': u'seventh_day_adventist',
           u'religion': u'christian'},
  u'count': 1},
 {u'_id': {u'denomination': u'greek_orthodox', u'religion': u'christian'},
  u'count': 1},
 {u'_id': {u'denomination': u'conservative', u'religion': u'jewish'},
  u'count': 1},
 {u'_id': {u'denomination': u'union church of christ',
           u'religion': u'christian'},
  u'count': 1},
 {u'_id': {u'denomination': u'Congregational', u'religion': u'christian'},
  u'count': 1},
 {u'_id': {u'denomination': u'sunni', u'religion': u'muslim'}, u'count': 1},
 {u'_id': {u'denomination': u'UUA', u'religion': u'christian'}, u'count': 1},
 {u'_id': {u'denomination': u'united_church_of_christ',
           u'religion': u'christian'},
  u'count': 1},
 {u'_id': {u'denomination': u'salvation_army', u'religion': u'christian'},
  u'count': 1},
 {u'_id': {u'denomination': u'christ_scientist', u'religion': u'christian'},
  u'count': 1},
 {u'_id': {u'denomination': u'quaker', u'religion': u'christian'},
  u'count': 1},
 {u'_id': {u'denomination': u'swedenborgian', u'religion': u'christian'},
  u'count': 1},
 {u'_id': {u'denomination': u'hasidic', u'religion': u'jewish'}, u'count': 1},
 {u'_id': {u'denomination': u'roman_catholic'}, u'count': 1},
 {u'_id': {u'denomination': u'greek_catholic', u'religion': u'christian'},
  u'count': 1},
 {u'_id': {u'religion': u'buddhist'}, u'count': 1},
 {u'_id': {u'religion': u'muslim'}, u'count': 1},
 {u'_id': {u'denomination': u'orthodox', u'religion': u'jewish'}, u'count': 1},
 {u'_id': {u'denomination': u'non-denominational', u'religion': u'jewish'},
  u'count': 1},
 {u'_id': {u'denomination': u'unitarian', u'religion': u'christian'},
  u'count': 1}]

Top 10 Leisures


In [53]:
leisures = boston_db.aggregate([{"$match" : {"leisure" : {"$exists" : 1}}}, \
                           {"$group" : {"_id" : "$leisure", "count" : {"$sum" : 1}}}, \
                           {"$sort" : {"count" : -1}}, \
                           {"$limit" : 10}])

#for document in leisures:
#    print document

pprint.pprint(list(leisures))


[{u'_id': u'park', u'count': 750},
 {u'_id': u'pitch', u'count': 545},
 {u'_id': u'recreation_ground', u'count': 350},
 {u'_id': u'playground', u'count': 334},
 {u'_id': u'nature_reserve', u'count': 105},
 {u'_id': u'garden', u'count': 35},
 {u'_id': u'sports_centre', u'count': 33},
 {u'_id': u'picnic_table', u'count': 33},
 {u'_id': u'swimming_pool', u'count': 28},
 {u'_id': u'common', u'count': 19}]

Top 15 Universities


In [54]:
universities = boston_db.aggregate([
        {"$match" : {"amenity" : "university"}}, \
        {"$group" : {"_id" : {"name" : "$name"}, "count" : {"$sum" : 1}}}, \
        {"$sort" : {"count" : -1}},
        {"$limit":15}
    ])

pprint.pprint(list(universities))


[{u'_id': {u'name': u'Boston University'}, u'count': 41},
 {u'_id': {u'name': u'Massachusetts Institute of Technology'}, u'count': 10},
 {u'_id': {u'name': u'Suffolk University'}, u'count': 8},
 {u'_id': {u'name': u'Harvard University'}, u'count': 6},
 {u'_id': {u'name': None}, u'count': 4},
 {u'_id': {u'name': u'University of Massachusetts Boston'}, u'count': 3},
 {u'_id': {u'name': u'Boston University Medical Campus'}, u'count': 3},
 {u'_id': {u'name': u'University Hall'}, u'count': 2},
 {u'_id': {u'name': u'Harvard Medical School'}, u'count': 2},
 {u'_id': {u'name': u'Benjamin Franklin Institute of Technology'},
  u'count': 2},
 {u'_id': {u'name': u'Littauer Center'}, u'count': 2},
 {u'_id': {u'name': u'Northeastern University'}, u'count': 2},
 {u'_id': {u'name': u'Boston College'}, u'count': 2},
 {u'_id': {u'name': u'Radcliffe Gym'}, u'count': 1},
 {u'_id': {u'name': u'Agassiz House'}, u'count': 1}]

Top 10 Schools


In [55]:
schools = boston_db.aggregate([
        {"$match" : {"amenity" : "school"}}, \
        {"$group" : {"_id" : {"name" : "$name"}, "count" : {"$sum" : 1}}}, \
        {"$sort" : {"count" : -1}},
        {"$limit":10}
    ])

pprint.pprint(list(schools))


[{u'_id': {u'name': None}, u'count': 13},
 {u'_id': {u'name': u'Milton Academy'}, u'count': 4},
 {u'_id': {u'name': u'Lincoln School'}, u'count': 3},
 {u'_id': {u'name': u'Boston Community Leadership Academy'}, u'count': 3},
 {u'_id': {u'name': u'New Mission High School'}, u'count': 3},
 {u'_id': {u'name': u'Dexter School'}, u'count': 3},
 {u'_id': {u'name': u'Boston Middle School Academy'}, u'count': 2},
 {u'_id': {u'name': u'Clark Avenue School'}, u'count': 2},
 {u'_id': {u'name': u'Kennedy Day School'}, u'count': 2},
 {u'_id': {u'name': u'Phillips School'}, u'count': 2}]

Top Prisons


In [56]:
prisons = boston_db.aggregate([
        {"$match" : {"amenity" : "prison"}}, \
        {"$group" : {"_id" : {"name" : "$name"}, "count" : {"$sum" : 1}}}, \
        {"$sort" : {"count" : -1}}])

pprint.pprint(list(prisons))


[{u'_id': {u'name': u'Norfolk County Jail'}, u'count': 1},
 {u'_id': {u'name': u'Middlesex County Jail (Cambridge)'}, u'count': 1},
 {u'_id': {u'name': u'Suffolk County House of Correction'}, u'count': 1},
 {u'_id': {u'name': u'Suffolk County Jail'}, u'count': 1},
 {u'_id': {u'name': u'Boston Pre-Release Center'}, u'count': 1},
 {u'_id': {u'name': u'Lemuel Shattuck Hospital Correctional Unit'},
  u'count': 1}]

Top 10 Hospitals


In [57]:
hospitals = boston_db.aggregate([
        {"$match" : {"amenity" : "hospital"}}, \
        {"$group" : {"_id" : {"name" : "$name"}, "count" : {"$sum" : 1}}}, \
        {"$sort" : {"count" : -1}},
        {"$limit":10}
    ])


pprint.pprint(list(hospitals))


[{u'_id': {u'name': u'Carney Hospital'}, u'count': 3},
 {u'_id': {u'name': None}, u'count': 2},
 {u'_id': {u'name': u"St. Elizabeth's Medical Center"}, u'count': 2},
 {u'_id': {u'name': u'Arbour Hospital'}, u'count': 2},
 {u'_id': {u'name': u'Bournewood Hospital'}, u'count': 2},
 {u'_id': {u'name': u'Arbour-Hri Hospital'}, u'count': 2},
 {u'_id': {u'name': u'Cambridge Health Alliance-Whidden Memorial Hospital'},
  u'count': 2},
 {u'_id': {u'name': u'Faulkner Hospital'}, u'count': 2},
 {u'_id': {u'name': u'Steward Satellite Emergency Facility - Quincy'},
  u'count': 2},
 {u'_id': {u'name': u'Central Street Health Center'}, u'count': 2}]

Top 10 Fast Food Cuisines


In [58]:
fast_food = boston_db.aggregate([
    {"$match":{"cuisine":{"$exists":1},"amenity":"fast_food"}},
    {"$group":{"_id":"$cuisine","count":{"$sum":1}}},
    {"$sort":{"count":-1}},
    {"$limit":10}
])

pprint.pprint(list(fast_food))


Top 10 Gas Station Brands


In [59]:
gas_station_brands = boston_db.aggregate([
    {"$match":{"brand":{"$exists":1},"amenity":"fuel"}},
    {"$group":{"_id":"$brand","count":{"$sum":1}}},
    {"$sort":{"count":-1}},
    {"$limit":10}
])

pprint.pprint(list(gas_station_brands))


[{u'_id': u'Gulf', u'count': 4},
 {u'_id': u'Shell', u'count': 3},
 {u'_id': u'Hess', u'count': 3},
 {u'_id': u'Super Petroleum', u'count': 1},
 {u'_id': u"Eli's", u'count': 1},
 {u'_id': u'APrime Energy', u'count': 1},
 {u'_id': u'US Petroleum', u'count': 1},
 {u'_id': u'Cumberland Farm', u'count': 1},
 {u'_id': u'Valvoline Oil Change', u'count': 1},
 {u'_id': u'Sunoco', u'count': 1}]

Top 10 Banks


In [60]:
banks = boston_db.aggregate([
    {"$match":{"name":{"$exists":1},"amenity":"bank"}},
    {"$group":{"_id":"$name","count":{"$sum":1}}},
    {"$sort":{"count":-1}},
    {"$limit":10}
])

pprint.pprint(list(banks))


[{u'_id': u'Bank of America', u'count': 11},
 {u'_id': u'Citizens Bank', u'count': 7},
 {u'_id': u'TD Bank', u'count': 6},
 {u'_id': u'Eastern Bank', u'count': 4},
 {u'_id': u'Cambridge Savings Bank', u'count': 4},
 {u'_id': u'Santander', u'count': 4},
 {u'_id': u'Sovereign Bank', u'count': 4},
 {u'_id': u'Brookline Bank', u'count': 3},
 {u'_id': u'East Cambridge Savings Bank', u'count': 2},
 {u'_id': u'Cambridge Trust Company', u'count': 2}]

Top 10 Restaurants


In [61]:
restaurants = boston_db.aggregate([
    {"$match":{"name":{"$exists":1},"amenity":"restaurant"}},
    {"$group":{"_id":"$name","count":{"$sum":1}}},
    {"$sort":{"count":-1}},
    {"$limit":10}
])

pprint.pprint(list(restaurants))


[{u'_id': u'Panera Bread', u'count': 6},
 {u'_id': u"Bertucci's", u'count': 4},
 {u'_id': u'Dunkin Donuts', u'count': 2},
 {u'_id': u'Olecito', u'count': 2},
 {u'_id': u'The Elephant Walk', u'count': 2},
 {u'_id': u'Chipotle', u'count': 2},
 {u'_id': u'Boloco', u'count': 2},
 {u'_id': u"Crazy Dough's", u'count': 2},
 {u'_id': u'Ninety Nine', u'count': 2},
 {u'_id': u'Boca Grande', u'count': 2}]

6. Additional Ideas

Analyzing the Boston data, I found that not all nodes and ways include city information, even though their geographical positions fall within the boundaries of a city. What could be done in this case is to check which city each node or way belongs to, based on its latitude and longitude, and ensure that the "address.city" property is properly populated. By doing so, we could get statistics related to cities in a much more reliable way. In fact, I think this is the biggest benefit of anticipating problems and implementing improvements to the data you want to analyze: real world data are very susceptible to being incomplete, noisy, and inconsistent, and low-quality data yields analysis of equally poor quality.
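
As a rough sketch of that idea, assuming the geopy Nominatim geocoder used later in this section (in practice, requests would have to be rate limited and cached to cover ~2 million documents):


In [ ]:
from geopy.geocoders import Nominatim

geolocator = Nominatim()

# Hypothetical sketch: reverse-geocode one document that lacks "address.city"
# and backfill the city name from its "pos" array ([lat, lon]).
doc = boston_db.find_one({"type": "node", "pos": {"$exists": 1},
                          "address.city": {"$exists": 0}})
location = geolocator.reverse("{}, {}".format(doc["pos"][0], doc["pos"][1]))
city = location.raw.get("address", {}).get("city")
if city:
    boston_db.update({"_id": doc["_id"]}, {"$set": {"address.city": city}})
print city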

I think that extending this open source project to include data such as user reviews of establishments, subjective boundaries of good and bad neighborhoods, housing price data, school reviews, walkability/bikeability, and quality of mass transit would form a solid foundation for robust recommender systems. These recommender systems could aid users in anything from finding a new home or apartment to deciding where to spend a weekend afternoon.

Another alternative to fill in missing information for a region would be to use gamification or crowdsourcing to get more people contributing to the map, much as mobile apps like Waze and Minutely have made their users responsible for improving the app and the social network around it.

A different application of this project is the problem of defining city boundaries. Transportation (street) networks and the built environment can be good indicators of a metropolitan area: combining them with an elementary clustering technique, we can consider two street intersections to belong to the same cluster if the distance between them is below a given threshold (in metres), as sketched below. The geospatial information then gives us a working definition of city boundaries through spatial urban networks.
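
A minimal sketch of that clustering idea, using naive O(n²) single-link clustering over the haversine distance and three hypothetical intersection coordinates:


In [ ]:
from math import radians, sin, cos, asin, sqrt

def haversine(p1, p2):
    '''Great-circle distance in metres between two (lat, lon) points.'''
    lat1, lon1, lat2, lon2 = map(radians, [p1[0], p1[1], p2[0], p2[1]])
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def cluster_intersections(points, threshold):
    '''Single-link clustering: two intersections share a cluster if a chain
    of intersections connects them with every hop below the threshold.'''
    parent = range(len(points)) # naive union-find over point indices
    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if haversine(points[i], points[j]) < threshold:
                parent[find(j)] = find(i)
    clusters = {}
    for i, p in enumerate(points):
        clusters.setdefault(find(i), []).append(p)
    return clusters.values()

# Hypothetical intersections: two near Harvard Square, one in Quincy.
pts = [(42.3736, -71.1190), (42.3739, -71.1185), (42.2529, -71.0023)]
print len(cluster_intersections(pts, threshold = 100)) # => 2 clusters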

An interesting fact is that we can use geospatial coordinates to find out a country or city name (Nominatim can search OSM data by name and address and generate synthetic addresses of OSM points). The inverse problem, mapping geospatial coordinates to a location name, is called reverse geocoding, and Nominatim from OpenStreetMap enables us to do that.


In [71]:
from geopy.geocoders import Nominatim
geolocator = Nominatim()
location = geolocator.reverse("42.3725677, -71.1193068")
print(location.address)


The Garage, Mount Auburn Street, Harvard Square, Cambridge, Middlesex County, Massachusetts, 02138, United States of America

However, a potential problem with reverse geocoding is that it may give strange results near the poles and the international date line, or for cities within cities: certain locations in Rome, for example, may or may not return "Vatican City", depending on the lat/lon stored in the database for each point.

For example, the Pontificio Collegio Teutonico di Santa Maria in Campo Santo (Collegio Teutonico) is located in Vatican City, but reverse geocoding its coordinates gives us a location in Roma, Italia.


In [78]:
from geopy.geocoders import Nominatim
geolocator = Nominatim()
vatican=(41.89888433, 12.45376451)
location = geolocator.reverse(vatican)
print(location.address)


Sotto La Cupola - Guest House, 15, Via Cardinal Agliardi, Aurelio, Municipio Roma XIII, Roma, RM, LAZ, 00165, Italia

In [79]:
from geopy.geocoders import Nominatim
geolocator = Nominatim()
pole = (-86.06303611, 6.81517107) # a point near the South Pole
location = geolocator.reverse(pole)
print(location.address)


None

Despite the many issues with reverse geocoding, I think another benefit of this project is that it can be applied to disease mapping, which uses longitude and latitude information to recover the plaintext addresses of patients and so helps identify patterns, correlates, and predictors of disease in academia, government, and the private sector, given the widespread availability of geographic information.

7. Conclusion

This review of the data is cursory, but it seems that the Boston area map is incomplete, though I believe the data has been well cleaned and represented over the course of this project.


Finally, shut down the MongoDB instance that was started earlier:


In [ ]:
os.killpg(pro.pid, signal.SIGTERM)  # Send the signal to all the process groups, killing the MongoDB instance