In [3]:
import IPython.display as disp
from pymongo import MongoClient
from pprint import pprint
client = MongoClient("mongodb://localhost:27017")
bayarea = client.examples.bayarea
Several inconsistencies were discovered during the audit:
A large number of keys were prefixed with tiger. After some research, it was found that tiger keys originate from US Census TIGER data, which OpenStreetMap used in its early years to import map information in bulk. Future census data will no longer be imported, and tiger data is slowly being converted to fit the recommended OpenStreetMap schema. Until that conversion is complete, some of these key values need to be moved into the "address" dictionary.
Two keys were converted:
"tiger": "zip_left" -> "address": "postcode"
"tiger": "county" -> "address": "county"
These two keys were converted only if the address dictionary did not already contain a value for the target key.
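The conversion described above can be sketched as follows. This is a minimal sketch, not the exact shaping code: `element` stands for one shaped document, and the flat `tiger:zip_left` / `tiger:county` key names are an assumption about how the raw tags look after parsing.

```python
# Mapping from tiger-prefixed keys to their address-dictionary targets.
TIGER_TO_ADDRESS = {
    "tiger:zip_left": "postcode",
    "tiger:county": "county",
}

def convert_tiger_keys(element):
    """Move selected tiger keys into the address dict, but only when
    the address dict does not already hold a value for that key."""
    address = element.setdefault("address", {})
    for tiger_key, addr_key in TIGER_TO_ADDRESS.items():
        value = element.pop(tiger_key, None)
        if value is not None and addr_key not in address:
            address[addr_key] = value
    return element
```

Existing address values always win: a tiger value is dropped rather than allowed to overwrite hand-entered data.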
In [8]:
bayarea.count_documents({})
Out[8]:
In [9]:
bayarea.count_documents({"type": "node"})
Out[9]:
In [10]:
bayarea.count_documents({"type": "way"})
Out[10]:
In [4]:
pipeline = [{"$match": {"amenity": {"$ne": None}}},
            {"$group": {"_id": "$amenity",
                        "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
            {"$limit": 10}]
result = bayarea.aggregate(pipeline)
pprint(list(result))
In [5]:
# Top 10 fast food chains
pipeline = [{"$match": {"amenity": "fast_food", "name": {"$ne": None}}},
            {"$group": {"_id": "$name", "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
            {"$limit": 10}]
result = bayarea.aggregate(pipeline)
pprint(list(result))
In [6]:
pipeline = [{"$match": {"leisure": {"$exists": 1}}},
            {"$group": {"_id": "$leisure", "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
            {"$limit": 10}]
result = bayarea.aggregate(pipeline)
pprint(list(result))
In [8]:
pipeline = [{"$match": {"leisure": {"$exists": 1}, "address.city": {"$exists": 1}}},
            {"$group": {"_id": "$address.city", "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
            {"$limit": 10}]
result = bayarea.aggregate(pipeline)
pprint(list(result))
In [10]:
pipeline = [{"$match": {"amenity": {"$exists": 1}, "address.city": {"$exists": 1}}},
            {"$group": {"_id": "$address.city", "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
            {"$limit": 10}]
result = bayarea.aggregate(pipeline)
pprint(list(result))
In [11]:
pipeline = [{"$match": {"building": {"$exists": 1}, "address.city": {"$exists": 1}}},
            {"$group": {"_id": "$address.city", "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
            {"$limit": 10}]
result = bayarea.aggregate(pipeline)
pprint(list(result))
In [12]:
pipeline = [{"$match": {"address.city": {"$exists": 1}}},
            {"$group": {"_id": "$address.city", "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
            {"$limit": 10}]
result = bayarea.aggregate(pipeline)
pprint(list(result))
OpenStreetMap data is not only incomplete but also riddled with inconsistencies and a lack of structure. Ideas for organizing OpenStreetMap data into a more structured form should be discussed in a public forum to make this open-source collaboration more relevant compared with Google Maps, Apple Maps, and Yelp.
In [17]:
bayarea.count_documents({"type": "node"})
Out[17]:
In [14]:
bayarea.count_documents({"type": "node", "address.city": {"$exists": 0}})
Out[14]:
In [15]:
bayarea.count_documents({"type": "node", "address.county": {"$exists": 0}})
Out[15]:
In [16]:
bayarea.count_documents({"type": "node", "address.postcode": {"$exists": 0}})
Out[16]:
For city, county, and postcode there is less than 1% coverage across all the nodes. During the shaping of the data to JSON, these values could be filled in programmatically given a geographical database that maps latitude and longitude coordinates to a city, county, and postcode. Once these values are completed, more inferences can be made through city-to-city, county-to-county, or postcode-to-postcode comparisons.
It would also be worth recommending that OpenStreetMap automatically fill in these fields when a user enters or edits a node or way.
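That fill-in step could look something like the sketch below. The `"pos": [lat, lon]` field is an assumption about the shaped document layout, and `reverse_geocode` is a hypothetical stub standing in for the geographical database lookup described above.

```python
def reverse_geocode(lat, lon):
    # Hypothetical stub: a real implementation would query a GIS
    # database mapping coordinates to administrative areas.
    return {"city": "Stockton", "county": "San Joaquin", "postcode": "95202"}

def fill_address(node):
    """Fill in city, county, and postcode only where they are missing,
    leaving any user-entered address values untouched."""
    pos = node.get("pos")
    if not pos:
        return node
    looked_up = reverse_geocode(pos[0], pos[1])
    address = node.setdefault("address", {})
    for field in ("city", "county", "postcode"):
        address.setdefault(field, looked_up[field])
    return node
```

Running this pass during shaping would lift address coverage from under 1% to effectively all nodes that carry coordinates.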
In [18]:
disp.Image("./images/leisure.png")
Out[18]:
Source: http://taginfo.openstreetmap.org/keys/leisure#similar
The majority of the values for the leisure key fall under 6 values. That leaves 1799 items that are irrelevant for making inferences, with each item accounting for less than 1% of the total count. I believe there needs to be a discussion on how these 1799 values can be consolidated into groups.
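One possible consolidation, sketched in plain Python over a list of leisure values: keep the most common values and bucket the long tail under a single "other" group. The cutoff of six follows the counts above; the grouping rule itself is an assumption, not an established OpenStreetMap convention.

```python
from collections import Counter

def consolidate(values, keep=6):
    """Keep the `keep` most common values and bucket the long tail
    of rare values under a single 'other' group."""
    top = {v for v, _ in Counter(values).most_common(keep)}
    return Counter(v if v in top else "other" for v in values)
```

The same idea could be expressed as an aggregation pipeline, using a $cond projection to rewrite any leisure value outside the top six to "other" before grouping.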
Analysis of the queries shows an unusual number of entries for the city of Stockton: Stockton has almost 11x as many documents as San Francisco. More research needs to be done on why.
If there are incentives, or if someone has created an efficient script to input or modify OpenStreetMap data from other sources for the city of Stockton, then we should find out what they are and give those tools more visibility.
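One way to start that research is to count Stockton documents per editing user, since a distribution dominated by a few accounts would point to scripted bulk imports. The helper below is illustrative; it assumes the shaped documents keep the OSM "created.user" field.

```python
from collections import Counter

def top_contributors(docs, n=5):
    """Count documents per editing user; a heavily skewed distribution
    would suggest a small number of bulk-import scripts at work."""
    users = Counter(d.get("created", {}).get("user", "unknown") for d in docs)
    return users.most_common(n)

# Hypothetical usage against the collection:
# pprint(top_contributors(bayarea.find({"address.city": "Stockton"})))
```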
The OpenStreetMap service is valuable because it allows the community to rapidly update information about their surrounding areas. However, a common problem with large-scale open collaboration efforts is that structure is often sacrificed as the number of collaborators grows. There needs to be an open forum on how to combat the chaos that naturally comes with open collaboration, to keep OpenStreetMap from becoming irrelevant compared with Google/Apple Maps, Yelp, etc. OpenStreetMap has enormous potential, and it is up to the community to spearhead efforts to improve its infrastructure.
In [7]:
def css_styling():
    # Load the notebook's custom stylesheet, closing the file handle.
    with open("../css/custom.css", "r") as f:
        styles = f.read()
    return disp.HTML(styles)
css_styling()
Out[7]: