A Something-to-Do-in-Boston Suggestion Engine
In [2]:
import xml.etree.cElementTree as ET
import pprint as pp
import re
import os
import json
import string
In [3]:
# NOTE(review): hardcoded absolute local path -- point this at your own copy
# of the Boston OpenStreetMap extract before running
filename = '/Users/excalibur/Dropbox/nanodegree/data/boston_massachusetts.osm'
In [4]:
# system beep
def finished():
    """Sound the terminal bell twice to signal that a long step is done."""
    for _ in range(2):
        os.system("printf '\a'")
In [5]:
# replace sets with lists for JSON
def set_default(obj):
    """JSON serializer fallback: turn sets into lists, reject anything else.

    Passed as `default=` to json.dump; raising TypeError for non-sets keeps
    json's normal error behavior for genuinely unserializable objects.
    """
    if not isinstance(obj, set):
        raise TypeError
    return list(obj)
In [6]:
# make a JSON file
def make_places_file(places, path='places.json'):
    """Write `places` to `path` as pretty-printed JSON, then beep.

    `path` defaults to 'places.json', matching the original hard-coded
    behavior, so existing calls are unaffected. Sets inside `places` are
    serialized as lists via set_default.
    """
    with open(path, 'w') as f:
        json.dump(places, f, default=set_default, sort_keys=True,
                  indent=2, separators=(',', ' : '))
    finished()
In [7]:
# Map each OSM tag key we care about to the key stored on a place record.
# (Replaces fourteen copy-pasted if-statements that each re-read tag.attrib.)
OSM_KEY_MAP = {
    'addr:city': 'city',
    'addr:housenumber': 'number',
    'addr:street': 'street',
    'address': 'address',
    'amenity': 'amenity',
    'cuisine': 'cuisine',
    'designation': 'designation',
    'leisure': 'leisure',
    'name': 'name',
    'note': 'note',
    'opening_hours': 'opening_hours',
    'phone': 'phone',
    'shop': 'shop',
    'website': 'website',
}

# Stream the (large) OSM file, capturing every <node> with its coordinates
# and any interesting <tag k="..." v="..."> children.
places = {}
for event, element in ET.iterparse(filename):
    if element.tag == 'node':
        place = {'lat': element.attrib['lat'], 'lon': element.attrib['lon']}
        for tag in element:
            k = tag.attrib['k']
            if k in OSM_KEY_MAP:
                place[OSM_KEY_MAP[k]] = tag.attrib['v']
        places[element.attrib['id']] = place
        # free the processed subtree so memory stays bounded on the big file
        element.clear()
finished()
In [8]:
len(places)  # how many nodes were captured from the raw OSM file
Out[8]:
In [9]:
# snapshot the raw extraction to places.json before pruning
make_places_file(places)
We don't want places with only latitude and longitude.
In [10]:
# drop places that carry only lat/lon -- nothing there worth suggesting
to_remove = [pid for pid, info in places.items() if len(info) <= 2]
for pid in to_remove:
    del places[pid]
Check to see how many places are left.
In [11]:
len(places)  # places remaining after removing lat/lon-only nodes
Out[11]:
Well, there goes most of the data, which makes sense, since it was map data.
Similarly, we don't want places that don't have some sort of helpful label.
In [12]:
# which keys appear on places that lack a 'name'?
keys = set()
for info in places.values():
    if 'name' not in info:
        keys.update(info.keys())
print(keys)
So those labels listed above exist in various combinations for places even when the name label is absent.
Since name might not be used, but something else like shop could be, eliminate places that don't have some sort of helpful label (i.e., one from the above set).
In [13]:
# drop places missing every usable label key
label_keys = ('name', 'shop', 'amenity', 'leisure', 'note')
to_remove = [pid for pid, info in places.items()
             if not any(k in info for k in label_keys)]
for pid in to_remove:
    del places[pid]
Check to see how many places are left.
In [14]:
len(places)  # places remaining after removing unlabeled nodes
Out[14]:
Our places are slowly evaporating, but that's OK.
In [15]:
def display_labels(label):
    """Print the distinct values stored under `label` across all places.

    Output is capped at 51 values so nbviewer does not truncate the page.
    """
    values = set()
    for info in places.values():
        if label in info:
            values.add(info[label])
    for shown, value in enumerate(values):
        print(value)
        # for nbviewer, stop after 51 values so output stays short
        if shown >= 50:
            break
In [17]:
# sample the distinct place names
display_labels('name')
After scanning the above list, some obvious candidates for removal appear. Schools and churches probably wouldn't be popular destinations (some folks would disagree, of course).
Names with @ seem to be cross-streets (while not a terrible idea for suggested destinations, cross-streets obviously lack certain enticing details). Similarly, any names that are merely attempts at addresses should be nixed.
In [18]:
# Names to weed out: schools/churches, cross-streets (@), and address-like
# names. The street abbreviations are wrapped in \b word boundaries so they
# only match as whole tokens -- the unanchored originals also matched inside
# ordinary names ('St' in 'Starbucks', 'Dr' in 'Drive', 'Ave' in 'Avery').
remove_regex = re.compile(r'School|Academy|Elem|Church|@|\bSt\b|\bAve\b|\bPkwy\b|\bRd\b|\bHwy\b|\bDr\b|I-')
In [19]:
# drop places whose name matches the removal regex
to_remove = [pid for pid, info in places.items()
             if 'name' in info and remove_regex.search(info['name'])]
for pid in to_remove:
    del places[pid]
In [20]:
# re-sample names after the regex-based removal
display_labels('name')
There are likely plenty of other types of names to weed out, but the current batch should be good for now.
Get a new count for the number of places that remain.
In [21]:
len(places)  # places remaining after the name-based removal
Out[21]:
Ah! Are we going to have any left when all of this is over!?
Moving on to the other keys: shop, amenity, designation, cuisine, and leisure, all seem like they can function as general labels and, thus, need not be distinguished from one another.
In [22]:
# preview distinct values for each candidate label key
for label in ['shop','amenity','designation','cuisine','leisure']:
print "\n------- " + label.upper() + " -------\n"
display_labels(label)
The above labels are good candidates for snakecasify-ing and should be added to a set for each place (since there could be unwanted overlap between them); the set should then be converted to a list for the sake of JSON.
In [23]:
# Merge shop/amenity/designation/cuisine/leisure into one snake_cased
# 'labels' list per place (built as a set first to de-duplicate overlap,
# then converted to a list for the sake of JSON).
for info in places.values():
    merged = set()
    for key in ('shop', 'amenity', 'designation', 'cuisine', 'leisure'):
        if key in info:
            # snakecasify
            merged.add(info[key].replace(',', '').lower().replace(' ', '_'))
    info['labels'] = list(merged)
In [24]:
def remove_keys(keys):
    """Delete each key in `keys` from every place that has it."""
    for info in places.values():
        for key in keys:
            info.pop(key, None)
In [25]:
# the originals are now folded into 'labels'; drop them
remove_keys(['shop','amenity','designation','cuisine','leisure'])
Save progress.
In [26]:
# save progress to places.json
make_places_file(places)
In [27]:
def check_keys():
    """Print the set of all keys currently in use across places."""
    seen = set()
    for info in places.values():
        seen.update(info.keys())
    print(seen)
In [28]:
# confirm which keys survive so far
check_keys()
In [29]:
# inspect 'note' and 'opening_hours' values before deciding their fate
for label in ['note','opening_hours']:
print "\n------- " + label.upper() + " -------\n"
display_labels(label)
For our purposes, note and opening_hours seem a little lame, chaotic, random, and infrequent. Get rid of 'em.
In [30]:
# note/opening_hours judged too chaotic and sparse to keep
remove_keys(['note','opening_hours'])
In [31]:
# inspect website values before normalizing them
for label in ['website']:
print "\n------- " + label.upper() + " -------\n"
display_labels(label)
People like websites. Fix 'em.
In [32]:
# prepend a scheme to bare domains so links actually work;
# http_regex.match only tests the start of the string, which is what we want
http_regex = re.compile(r'http://|https://')
for place in places:
if 'website' in places[place]:
if not http_regex.match(places[place]['website']):
places[place]['website'] = 'http://' + places[place]['website']
# NOTE(review): indentation was lost in this export -- the print most
# plausibly sits under the 'website' check, echoing every website after
# fixing; confirm against the original notebook before restructuring
print places[place]['website']
In [33]:
# inspect phone values before normalizing them
for label in ['phone']:
print "\n------- " + label.upper() + " -------\n"
display_labels(label)
Bleck. Those phone number strings are nutzo.
I think people still use phones for calling though. Might as well fix 'em.
In [34]:
# Normalize phone numbers to XXX-XXX-XXXX by walking each raw string
# backwards, keeping the last 10 digits and inserting dashes from the right.
phone_regex = re.compile(r'\d{3}-\d{3}-\d{4}')
digit_regex = re.compile(r'\d')  # hoisted: was recompiled for every place
for info in places.values():
    if 'phone' in info:
        if not phone_regex.match(info['phone']):
            new_phone = []
            digit_count = 0
            for ch in reversed(info['phone']):
                if digit_regex.match(ch):
                    new_phone.insert(0, ch)
                    digit_count += 1
                    # dash after the 4th and 7th digit (counting from the end)
                    if digit_count == 4 or digit_count == 7:
                        new_phone.insert(0, '-')
                    # stop once 10 digits are kept (drops country codes)
                    if digit_count > 9:
                        break
            info['phone'] = ''.join(new_phone)
        # print any that got through
        if not phone_regex.match(info['phone']):
            print("problem number: " + info['phone'])
Scanning the numbers displayed in the output window above reveals 617 357 LUCK (mangled into -161-7357) as the likely culprit.
In [35]:
# repair the mangled vanity number: 617-357-LUCK spells 617-357-5825
for info in places.values():
    if info.get('phone') == '-161-7357':
        info['phone'] = '617-357-5825'
In [36]:
# re-inspect phone values after the fix
for label in ['phone']:
print "\n------- " + label.upper() + " -------\n"
display_labels(label)
Scanning again, I noticed , Forest City Management in the displayed results.
It escaped the cleanup because phone_regex.match only checks the start of the string, and this value begins with a well-formed number. Drop it like it's hot.
In [37]:
# confirm the offending value is present before mutating it
for info in places.values():
    if info.get('phone') == '617-494-9330, Forest City Management':
        print(info['phone'])
In [38]:
# strip the trailing company name, keeping just the number
for info in places.values():
    if info.get('phone') == '617-494-9330, Forest City Management':
        info['phone'] = '617-494-9330'
Check for any other oddballs by looking for string length 12.
In [39]:
# XXX-XXX-XXXX is 12 characters; print anything that deviates
for info in places.values():
    if 'phone' in info and len(info['phone']) != 12:
        print(info['phone'])
In [40]:
# re-confirm the surviving keys
check_keys()
In [41]:
# inspect city values before normalizing them
for label in ['city']:
print "\n------- " + label.upper() + " -------\n"
display_labels(label)
Since we're using the cities (we're assuming folks will know we're in Massachusetts), toss any punctuation and references to the state.
Also, capitalize proper nouns.
In [42]:
# Trim ', MA'-style suffixes and capitalize city names.
# (The original also compiled an empty, never-used `state_regex`; removed.)
for info in places.values():
    if 'city' in info:
        # remove state: keep only the part before the first comma
        city = info['city'].split(',')[0]
        # capitalize proper nouns
        info['city'] = string.capwords(city)
In [43]:
# re-inspect city values after normalization
for label in ['city']:
print "\n------- " + label.upper() + " -------\n"
display_labels(label)
Deal with that street address that ended up in the city field. Google says it's in Cambridge.
In [44]:
# show the record whose 'city' is actually a street address
for info in places.values():
    if info.get('city') == '2067 Massachusetts Avenue':
        print(info)
In [45]:
# move the street address into 'address' and set the real city
for info in places.values():
    if info.get('city') == '2067 Massachusetts Avenue':
        info['address'] = info['city']
        info['city'] = 'Cambridge'
In [46]:
# final check on city values
for label in ['city']:
print "\n------- " + label.upper() + " -------\n"
display_labels(label)
In [47]:
# inspect number/street/address values before consolidating them
for label in ['number','street', 'address']:
print "\n------- " + label.upper() + " -------\n"
display_labels(label)
Look for unnecessary duplication, where numbers and/or streets are present when an address is as well.
In [48]:
# show places that carry an address AND redundant number/street fields
for info in places.values():
    if 'address' in info and ('number' in info or 'street' in info):
        print(info)
Remove duplication.
In [49]:
# when a full address exists, the separate number/street fields are redundant
for info in places.values():
    if 'address' in info:
        info.pop('number', None)
        info.pop('street', None)
Numbers without corresponding streets are a bit on the worthless side, so remove them too.
In [50]:
# show places with a house number but no street to attach it to
for info in places.values():
    if 'number' in info and 'street' not in info:
        print(info)
In [51]:
# a house number without a street is useless; drop it
for info in places.values():
    if 'number' in info and 'street' not in info:
        del info['number']
Make addresses from numbers and/or streets.
In [52]:
# synthesize an 'address' from number+street (or street alone) where missing
for info in places.values():
    if 'address' in info:
        continue
    if 'number' in info and 'street' in info:
        info['address'] = info['number'] + " " + info['street']
        del info['number']
        del info['street']
    elif 'street' in info:
        info['address'] = info.pop('street')
In [53]:
# inspect the consolidated addresses
for label in ['address']:
print "\n------- " + label.upper() + " -------\n"
display_labels(label)
Get rid of the city, state, and zip bits.
In [54]:
# keep only the street part of each address (drop city/state/zip after comma)
for info in places.values():
    if 'address' in info:
        info['address'] = info['address'].partition(',')[0]
In [55]:
# re-inspect addresses after trimming
for label in ['address']:
print "\n------- " + label.upper() + " -------\n"
display_labels(label)
Attempt to make street designations uniform.
Check out the last string in each address.
In [56]:
# collect the final whitespace-separated token of every address
last_string_set = set()
for info in places.values():
    if 'address' in info:
        last_string_set.add(info['address'].split(' ')[-1])
print(last_string_set)
In [57]:
# Normalize the street-designation token at the end of each address:
# capitalize it, strip punctuation, expand abbreviations, and blank out
# known junk tokens seen in the results above.
swap = { 'Rd':'Road', 'Plz':'Plaza', 'Ln':'Lane', 'Sq':'Square', 'Hwy':'Highway', 'St':'Street', 'Ave':'Avenue', 'Pkwy':'Parkway', 'Blvd':'Boulevard' }
for info in places.values():
    if 'address' in info:
        parts = info['address'].split(' ')
        # capitalize
        token = string.capwords(parts[-1])
        # remove punctuation (Python 2 str.translate signature)
        token = token.translate(None, '.,')
        # swap abbreviation for full designation
        token = swap.get(token, token)
        # remove problematic ones showing up in results list above
        if token == 'Ma' or token == '02467':
            token = ''
        parts[-1] = token
        # rejoin address string
        info['address'] = ' '.join(parts)
In [58]:
# re-collect the final token of every address to verify the normalization
last_string_set = set()
for info in places.values():
    if 'address' in info:
        last_string_set.add(info['address'].split(' ')[-1])
print(last_string_set)
In [59]:
# re-inspect addresses after designation normalization
for label in ['address']:
print "\n------- " + label.upper() + " -------\n"
display_labels(label)
Remove some others based on cursory spot checking.
In [60]:
# scrub two junk fragments spotted during the cursory checks above
for info in places.values():
    if 'address' in info:
        if 'Chestnut Hill MA' in info['address']:
            info['address'] = info['address'].replace('Chestnut Hill MA', '')
        elif '. Dorchester Education Complex' in info['address']:
            info['address'] = info['address'].replace('. Dorchester Education Complex', '')
In [61]:
# final check on addresses
for label in ['address']:
print "\n------- " + label.upper() + " -------\n"
display_labels(label)
In [62]:
# final key inventory
check_keys()
As checked during earlier investigations, lat and lon seem fine.
In [63]:
# sanity-check lat/lon values (they looked fine in earlier investigations)
for label in ['lat','lon']:
print "\n------- " + label.upper() + " -------\n"
display_labels(label)
How many places are left now?
In [64]:
len(places)  # final count of cleaned places
Out[64]:
No longer need unique ids for each place (MongoDB will provide new ones).
In [65]:
# drop the OSM node ids; MongoDB will assign its own _id to each record
new_places = [info for info in places.values()]
In [70]:
new_places[0:10]  # spot-check the first ten records
Out[70]:
In [67]:
# write the final, id-free records to places.json for import
make_places_file(new_places)
In [68]:
# bulk-load the cleaned records into MongoDB (db: bosroul, collection: places)
!mongoimport --db bosroul --collection places --file places.json --jsonArray