Convert a pandas dataframe to geojson for web mapping


In [1]:
import pandas as pd, requests, json

First, download the data from the City of Berkeley's API. You can use Socrata's $limit parameter to specify how many rows to grab (otherwise the default is 1,000 rows): https://dev.socrata.com/docs/paging.html

Example request: https://data.cityofberkeley.info/resource/k489-uv4i.json?$limit=5
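
If you need more rows than one request comfortably returns, the Socrata paging docs linked above describe combining $limit with $offset. Here is a minimal sketch of that pattern, using the requests import from above; the page size and loop structure are illustrative, not part of the original notebook:

base_url = 'https://data.cityofberkeley.info/resource/k489-uv4i.json'
page_size = 1000  # illustrative page size
offset = 0
all_rows = []
while True:
    # fetch one page of results at a time via $limit and $offset
    page = requests.get(base_url, params={'$limit': page_size, '$offset': offset}).json()
    all_rows.extend(page)
    if len(page) < page_size:
        break  # a short page means we've reached the end of the dataset
    offset += page_size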


In [2]:
# API endpoint for city of Berkeley's 311 calls
endpoint_url = 'https://data.cityofberkeley.info/resource/k489-uv4i.json?$limit=2000'

In [3]:
# fetch the URL and load the data
response = requests.get(endpoint_url)
data = response.json()

Next, turn the json data into a dataframe and clean it up a bit: drop unnecessary columns and any rows that lack lat-long data. We want to make our json file as small as possible (preferably under 5 MB) so that it can be loaded over the Internet by anyone viewing your map, without taking forever to download a huge file.
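
As a quick sanity check (this snippet isn't part of the original notebook), you can roughly estimate how large the raw, untrimmed response would be if serialized straight to JSON, using the json import from above:

# rough size estimate of the raw, untrimmed response serialized as JSON
raw_size_mb = len(json.dumps(data)) / 1024 ** 2
print('Raw response is roughly {:.2f} MB as JSON'.format(raw_size_mb))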


In [4]:
# turn the json data into a dataframe and see how many rows and what columns we have
df = pd.DataFrame(data)

print('We have {} rows'.format(len(df)))
str(df.columns.tolist())


We have 2000 rows
Out[4]:
"['apn', 'city', 'indbdate', 'issue_description', 'issue_type', 'latitude', 'location', 'longitude', 'neighborhood_district', 'object_type', 'secondary_issue_type', 'state', 'street_address', 'ticket_closed_date_time', 'ticket_created_date_time', 'ticket_id', 'ticket_status']"

In [5]:
df.head()


Out[5]:
apn city indbdate issue_description issue_type latitude location longitude neighborhood_district object_type secondary_issue_type state street_address ticket_closed_date_time ticket_created_date_time ticket_id ticket_status
0 NaN Berkeley 2016-10-13T04:17:52 Parking Meter Repair Traffic and Transportation NaN NaN NaN Berkeley Individual Parking CA NaN 2011-08-23T11:27:06 2011-03-01T12:48:53 121000042212 Closed
1 054 170801100 Berkeley 2016-10-13T04:13:37 Residential Service Start Refuse and Recycling 37.86165226 {'longitude': '-122.25267724', 'latitude': '37... -122.25267724 Berkeley Property Residential CA 2717 FOREST AVE 2013-05-09T09:35:31 2013-05-07T13:00:10 121000141381 Closed
2 053 160400800 Berkeley 2016-10-13T04:18:09 Residential Lost or Stolen Cart Refuse and Recycling 37.85527017 {'longitude': '-122.27364915', 'latitude': '37... -122.27364915 Berkeley Property Residential CA 2910 ELLIS ST 2014-10-01T17:32:04 2014-09-25T09:33:44 121000214450 Closed
3 052 143700100 Berkeley 2016-10-13T04:13:29 Commercial Cart Size Decrease Refuse and Recycling 37.84693953 {'longitude': '-122.27597171', 'latitude': '37... -122.27597171 Berkeley Property Commercial CA 1538 SIXTY-THIRD ST 2015-09-18T11:51:12 2015-07-06T09:14:01 121000252506 Closed
4 057 202901300 Berkeley 2016-10-13T04:13:26 Illegal Dumping - City Property Streets, Utilities, and Transportation 37.86833388 {'longitude': '-122.26742331', 'latitude': '37... -122.26742331 Berkeley Property Clean City Program CA 2255 SHATTUCK AVE 2015-06-25T08:33:30 2015-06-24T09:14:52 121000251283 Closed

In [6]:
# convert lat-long to floats and change address from ALL CAPS to Regular Capitalization
df['latitude'] = df['latitude'].astype(float)
df['longitude'] = df['longitude'].astype(float)
df['street_address'] = df['street_address'].str.title()

In [7]:
# we don't need all those columns - only keep useful ones
useful_cols = ['issue_description', 'issue_type', 'latitude', 'longitude', 'street_address', 'ticket_status']
df_subset = df[useful_cols]

In [8]:
# drop any rows that lack lat/long data
df_geo = df_subset.dropna(subset=['latitude', 'longitude'], axis=0, inplace=False)

print('We have {} geotagged rows'.format(len(df_geo)))
df_geo.tail()


We have 985 geotagged rows
Out[8]:
issue_description issue_type latitude longitude street_address ticket_status
1989 Residential Missed Pickup Refuse and Recycling 37.884110 -122.248453 1438 Grizzly Peak Blvd Closed
1991 Residential Bulky Pickup Refuse and Recycling 37.857363 -122.274100 1726 Stuart St Closed
1992 Commercial Site Inspection Refuse and Recycling 37.854159 -122.259605 3075 Telegraph Ave Com Closed
1993 Miscellaneous Internet Request General Questions/information 37.869606 -122.273367 2100 M L King Jr Way Closed
1995 Commercial Bin Size Increase Refuse and Recycling 37.851065 -122.291845 2950 Seventh St Closed

In [9]:
# what is the distribution of issue types?
df_geo['issue_type'].value_counts()


Out[9]:
Refuse and Recycling                            709
General Questions/information                   103
Streets, Utilities, and Transportation           96
Parks, Trees and Vegetation                      39
Environmental Services and Programs              14
Business License                                  7
Traffic and Transportation                        7
Facilities, Electrical & Property Management      5
Graffiti and Vandalism                            3
Other Account Services and Billing                1
Equipment Maintenance                             1
Name: issue_type, dtype: int64

Finally, convert each row in the dataframe to a geojson-formatted feature and save the result as a file. The format is pretty simple and you can see it here: http://geojson.org/
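
For reference, a single geojson Feature for one point looks roughly like the Python dict below (the values here are made up for illustration). Note that geojson orders coordinates as [longitude, latitude], not [latitude, longitude]:

# an illustrative geojson Feature for one point (values are made up)
example_feature = {
    'type': 'Feature',
    'geometry': {
        'type': 'Point',
        'coordinates': [-122.2727, 37.8716]  # [lon, lat], not [lat, lon]
    },
    'properties': {
        'street_address': '2180 Milvia St',
        'issue_type': 'Refuse and Recycling',
        'ticket_status': 'Closed'
    }
}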


In [10]:
def df_to_geojson(df, properties, lat='latitude', lon='longitude'):
    """
    Turn a dataframe containing point data into a geojson formatted python dictionary
    
    df : the dataframe to convert to geojson
    properties : a list of columns in the dataframe to turn into geojson feature properties
    lat : the name of the column in the dataframe that contains latitude data
    lon : the name of the column in the dataframe that contains longitude data
    """
    
    # create a new python dict to contain our geojson data, using geojson format
    geojson = {'type':'FeatureCollection', 'features':[]}

    # loop through each row in the dataframe and convert each row to geojson format
    for _, row in df.iterrows():
        # create a feature template to fill in
        feature = {'type':'Feature',
                   'properties':{},
                   'geometry':{'type':'Point',
                               'coordinates':[]}}

        # fill in the coordinates
        feature['geometry']['coordinates'] = [row[lon],row[lat]]

        # for each column, get the value and add it as a new feature property
        for prop in properties:
            feature['properties'][prop] = row[prop]
        
        # add this feature (aka, converted dataframe row) to the list of features inside our dict
        geojson['features'].append(feature)
    
    return geojson

In [11]:
useful_columns = ['street_address', 'issue_description', 'issue_type', 'ticket_status']
geojson_dict = df_to_geojson(df_geo, properties=useful_columns)
geojson_str = json.dumps(geojson_dict, indent=2)

In [12]:
# save the geojson result to a file
output_filename = 'dataset.js'
with open(output_filename, 'w') as output_file:
    output_file.write('var dataset = {};'.format(geojson_str))
    
# how many features did we save to the geojson file?
print('{} geotagged features saved to file'.format(len(geojson_dict['features'])))


985 geotagged features saved to file
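
The .js wrapper above is handy because the data can then be loaded in a web page with a plain script tag. If you also want a standard .geojson file that GIS tools and web-mapping libraries can read directly, a minimal variation (reusing geojson_dict from above; the filename is just an example) would be:

# alternatively, write plain geojson with no JavaScript variable wrapper
with open('dataset.geojson', 'w') as f:
    json.dump(geojson_dict, f, indent=2)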
