Wrangling Boston OpenStreetMap Data

Find the data

Using the very helpful Mapzen Weekly OSM Metro Extracts (https://mapzen.com/metro-extracts), I downloaded the OSM file for Boston, Massachusetts, USA.

List the data files


In [7]:
!ls -la /Users/excalibur/Dropbox/nanodegree/data/


total 741872
drwxr-xr-x@  5 excalibur  staff        170 Mar 10 12:14 .
drwxr-xr-x@ 11 excalibur  staff        374 Mar 10 12:07 ..
-rw-r--r--@  1 excalibur  staff       6148 Mar 10 12:14 .DS_Store
-rw-r-----@  1 excalibur  staff  366426696 Feb 23 11:32 boston_massachusetts.osm
-rw-r-----@  1 excalibur  staff   13399735 Feb 23 11:32 boston_massachusetts.osm.pbf

The boston_massachusetts.osm file is 366.4 MB.
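
As a quick sanity check, the size reported by ls can also be computed in Python. This is just a small sketch using the standard library (it reuses the same local path as above; the variable names are only for illustration):

import os

osm_file = '/Users/excalibur/Dropbox/nanodegree/data/boston_massachusetts.osm'

# file size in bytes, reported in decimal megabytes to match the 366.4 MB figure above
print('{:.1f} MB'.format(os.path.getsize(osm_file) / 10.0**6))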

Display head of data file


In [8]:
!head /Users/excalibur/Dropbox/nanodegree/data/boston_massachusetts.osm


<?xml version='1.0' encoding='UTF-8'?>
<osm version="0.6" generator="Osmosis 0.43.1">
  <bounds minlon="-71.19100" minlat="42.22800" maxlon="-70.92300" maxlat="42.39900" origin="http://www.openstreetmap.org/api/0.6"/>
  <node id="26746680" version="1" timestamp="2007-03-24T19:38:02Z" uid="6817" user="lurker" changeset="244358" lat="42.3089253" lon="-71.1191797">
    <tag k="created_by" v="YahooApplet 1.0"/>
  </node>
  <node id="30730952" version="2" timestamp="2012-12-19T19:24:31Z" uid="326503" user="wambag" changeset="14335103" lat="42.3678097" lon="-71.0218711"/>
  <node id="30730953" version="2" timestamp="2012-12-19T19:24:31Z" uid="326503" user="wambag" changeset="14335103" lat="42.3677364" lon="-71.0218568"/>
  <node id="30730954" version="2" timestamp="2012-12-19T19:24:31Z" uid="326503" user="wambag" changeset="14335103" lat="42.3676084" lon="-71.0218168"/>
  <node id="30730955" version="2" timestamp="2012-12-19T19:24:32Z" uid="326503" user="wambag" changeset="14335103" lat="42.3675229" lon="-71.0218486"/>

Import Python packages


In [12]:
import xml.etree.cElementTree as ET
import pprint as pp
import os

General helper function


In [13]:
# system beep
def finished():
    os.system("printf '\a'")
    os.system("printf '\a'")

Iterate over and display tags, attributes, and descendants in data file


In [14]:
filename = '/Users/excalibur/Dropbox/nanodegree/data/boston_massachusetts.osm'

tags = {}

for event,element in ET.iterparse(filename):
    
    if element.tag not in tags:
        tags[element.tag] = {}
        tags[element.tag]['count'] = 1
        tags[element.tag]['attributes'] = {}
        tags[element.tag]['children'] = {}
        tags[element.tag]['grandchildren'] = {}
        tags[element.tag]['greatgrandchildren'] = {}
    else:
        tags[element.tag]['count'] += 1
        
    for attribute_key, attribute_val in element.attrib.items():
        if attribute_key not in tags[element.tag]['attributes']:
            tags[element.tag]['attributes'][attribute_key] = 1
        else:
            tags[element.tag]['attributes'][attribute_key] += 1
                
    for child in element:
        if child.tag not in tags[element.tag]['children']:
            tags[element.tag]['children'][child.tag] = 1
        else:
            tags[element.tag]['children'][child.tag] += 1
        
        for grandchild in child:
            if grandchild.tag not in tags[element.tag]['grandchildren']:
                tags[element.tag]['grandchildren'][grandchild.tag] = 1
            else:
                tags[element.tag]['grandchildren'][grandchild.tag] += 1
                
            for greatgrandchild in grandchild:
                if greatgrandchild.tag not in tags[element.tag]['greatgrandchildren']:
                    tags[element.tag]['greatgrandchildren'][greatgrandchild.tag] = 1
                else:
                    tags[element.tag]['greatgrandchildren'][greatgrandchild.tag] += 1
               
# clean up unused dictionaries
for info in tags.values():
    if not info['attributes']:
        del info['attributes']
    if not info['children']:
        del info['children']
    if not info['grandchildren']:
        del info['grandchildren']
    if not info['greatgrandchildren']:
        del info['greatgrandchildren']

pp.pprint(tags)
finished()


{'bounds': {'attributes': {'maxlat': 1,
                           'maxlon': 1,
                           'minlat': 1,
                           'minlon': 1,
                           'origin': 1},
            'count': 1},
 'member': {'attributes': {'ref': 8328, 'role': 8328, 'type': 8328},
            'count': 8328},
 'nd': {'attributes': {'ref': 1904147}, 'count': 1904147},
 'node': {'attributes': {'changeset': 1601437,
                         'id': 1601437,
                         'lat': 1601437,
                         'lon': 1601437,
                         'timestamp': 1601437,
                         'uid': 1601437,
                         'user': 1601437,
                         'version': 1601437},
          'children': {'tag': 274720},
          'count': 1601437},
 'osm': {'attributes': {'generator': 1, 'version': 1},
         'children': {'bounds': 1,
                      'node': 1601437,
                      'relation': 1050,
                      'way': 245626},
         'count': 1,
         'grandchildren': {'member': 8328, 'nd': 1904147, 'tag': 748353}},
 'relation': {'attributes': {'changeset': 1050,
                             'id': 1050,
                             'timestamp': 1050,
                             'uid': 1050,
                             'user': 1050,
                             'version': 1050},
              'children': {'member': 8328, 'tag': 4366},
              'count': 1050},
 'tag': {'attributes': {'k': 748353, 'v': 748353}, 'count': 748353},
 'way': {'attributes': {'changeset': 245626,
                        'id': 245626,
                        'timestamp': 245626,
                        'uid': 245626,
                        'user': 245626,
                        'version': 245626},
         'children': {'nd': 1904147, 'tag': 469267},
         'count': 245626}}
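
One note on the iteration above: with a 366 MB file, iterparse still builds the full element tree in memory unless parsed elements are discarded along the way. A leaner variant (a sketch, not the code used to produce the output above, and counting only tag names for brevity) clears each element once it has been counted:

import xml.etree.cElementTree as ET

filename = '/Users/excalibur/Dropbox/nanodegree/data/boston_massachusetts.osm'

tag_counts = {}

for event, element in ET.iterparse(filename):
    tag_counts[element.tag] = tag_counts.get(element.tag, 0) + 1
    # drop the element's attributes and children once counted to keep memory use low
    element.clear()

print(tag_counts)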

What to do with over 1.6 million nodes of OpenStreetMap data?

To help focus this project, I realized that this might be the perfect time to try to implement an old idea I had with a buddy of mine.