title: "GitHub Gists for GIS in Python: Loading a Zipped Local or Web-based Shapefile with One Function"
author:
  twitter: linwoodc3
summary: This post introduces a utility function that can automatically read web-based or local shapefiles in zip format into the Python ecosystem. It takes one line of code!
Date: {{ page.date | date_to_rfc822 }}
There is nothing worse than not knowing where you are.
We have all experienced it. It’s the panic that overtakes you when you don’t recognize your surroundings. The buildings and roads are unfamiliar. You don’t know where you are. Naturally, you focus on getting to a familiar landmark or location. Reaching that landmark brings a sense of relief. Comfort. Peace. Because you know where you are on a map, it’s easier to plot a course to your final destination.
In the world of data science, we embrace the concept of spatial awareness and knowing where the data are. In the same way that familiar surroundings bring clarity to a lost traveler, spatial context (i.e. geo-referenced data) can bring clarity to a data set. This "where" does not always have to apply to a location on the earth's surface. Spatial context in the sense of analytic geometry, or understanding data in geometric space, is just as enlightening.
Anscombe's quartet is a great example. Despite having nearly identical summary statistics, the four data sets look nothing alike when plotted. This is a reminder to plot your data before drawing conclusions; it can prevent costly errors.
Python's seaborn library ships with this data set, so we can load it and compute the summary statistics directly. Each row below is one data set, and it's clear that the numbers are nearly identical.
In [1]:
import seaborn as sns

df = sns.load_dataset('anscombe')
# Summary statistics for each of the four data sets
ddf = df.groupby('dataset').describe().unstack()
All = slice(None)
# Keep just the mean and standard deviation rows
test = ddf.T.loc[(All, slice('mean', 'std')), All]
test
Out[1]:
In [90]:
import seaborn as sns
import pandas as pd

df = sns.load_dataset('anscombe')
# Standard deviation and mean for each data set, side by side
ddf = pd.concat([df.groupby('dataset').std(), df.groupby('dataset').mean()], axis=1)
print(ddf)
In [16]:
import pytablewriter

# Render the summary statistics as a markdown table
writer = pytablewriter.MarkdownTableWriter()
writer.from_dataframe(ddf)
writer.write_table()
In [74]:
import geopandas as gpd
import numpy as np
import requests
import gdal
import fiona
import uuid
import re
from shapely.geometry import Point

# informed by: https://gis.stackexchange.com/questions/225586/reading-raw-data-into-geopandas/225627
def shapefilereader(target):
    """Convert a zipped shapefile from the web or on disk into a geopandas dataframe.

    Parameters
    ----------
    target : str
        Path to a zipped shapefile on disk, or url to download the zipped shapefile from.

    Returns
    -------
    geopandas.GeoDataFrame
        Pandas dataframe with geospatial features and operations.
    """
    # Detect whether we are using a web-based or local disk shapefile
    r = re.compile('^(http|https)://', re.I)
    if r.search(target):
        # Download into gdal's in-memory filesystem; gdal/ogr requires a .zip extension
        request = requests.get(target)
        target = '/vsimem/{}.zip'.format(uuid.uuid4().hex)
        gdal.FileFromMemBuffer(target, request.content)
    with fiona.Collection(target, vsi='zip') as f:
        return gpd.GeoDataFrame.from_features(f, crs=f.crs)
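The web-versus-local branch hinges on a simple case-insensitive regex. As a standalone sanity check (a minimal sketch; `is_web_target` is a hypothetical helper, not part of the original gist):

```python
import re

# Same case-insensitive pattern shapefilereader uses to spot web targets
pattern = re.compile(r'^(http|https)://', re.I)

def is_web_target(target):
    """Return True when the target string looks like a web url rather than a local path."""
    return bool(pattern.search(target))

print(is_web_target('https://data.cityofnewyork.us/download/i8iw-xf4u'))  # True
print(is_web_target('/tmp/zipcodes.zip'))  # False
```

Anything that does not start with `http://` or `https://` falls through to the local-disk branch, so relative paths and absolute paths both work unchanged.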
def shaper(row):
    """Row-wise function to create a shapely point from the
    latitude/longitude pair in a dataframe row.

    Parameters
    ----------
    row : pandas or dask dataframe row
        Row containing latitude and longitude variables and data.

    Returns
    -------
    shapely.geometry.Point
        Shapely spatial object for geoprocessing in Python.
    """
    # Shapely points are (x, y), i.e. (longitude, latitude)
    return Point(row['longitude'], row['latitude'])
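To see the row-wise point construction in action on a toy frame (made-up coordinates; assumes pandas and shapely are installed):

```python
import pandas as pd
from shapely.geometry import Point

def shaper(row):
    # Shapely points are (x, y), i.e. (longitude, latitude)
    return Point(row['longitude'], row['latitude'])

# Two made-up coordinate pairs roughly in the New York area
sample = pd.DataFrame({'latitude': [40.71, 40.85], 'longitude': [-74.01, -73.87]})
points = sample.apply(shaper, axis=1)
print(points.iloc[0].x, points.iloc[0].y)  # -74.01 40.71
```

The same function works with a dask dataframe's row-wise `apply`, which is why it is written to take a whole row rather than two scalars.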
In [4]:
import dask.dataframe as dd
ds = dd.read_hdf('/Users/linwood/projects/Blogs/drafts/geolocated_social_transcends_political_barriers/data/tweetLanguages.h5','tweets')
In [102]:
# Keep only tweets inside a rough New York bounding box, then pull into memory
d = (ds[(ds.latitude >= 40) & (ds.latitude <= 45) & (ds.longitude >= -80) & (ds.longitude <= -70)]).compute()
d.info(memory_usage='deep')
In [46]:
ny = shapefilereader('https://data.cityofnewyork.us/download/i8iw-xf4u/application%2Fzip')
# Reproject the zip-code polygons to WGS84 latitude/longitude
ny = ny.to_crs({'init': "epsg:4326"})
In [169]:
%matplotlib inline
import matplotlib.pyplot as plt

# nytweets is the spatial join result built in cell In [96] below
f, ax = plt.subplots(figsize=(15, 9))
nytweets.plot(color='red', ax=ax)
ax.set_axis_off()
plt.savefig('./assets/img/newyorktweets.png')
In [170]:
f, ax = plt.subplots(figsize=(15, 9))
ax.set_facecolor('lightblue')
ny.plot(color='tan', ax=ax)
ax.set_axis_off()
plt.savefig('./assets/img/newyorkzips.png')
In [134]:
from IPython.display import Image
Image('./assets/img/newyorktweets.png')
Out[134]:
In [94]:
from shapely.geometry import Point

# Build a geodataframe by turning each latitude/longitude pair into a shapely point
dg = gpd.GeoDataFrame(d.assign(geometry=d.apply(shaper, axis=1)), crs={'init': 'epsg:4326'})
In [96]:
# Spatial join: attach each tweet to the zip-code polygon it falls inside
nytweets = gpd.sjoin(dg.reset_index(), ny[['ZIPCODE', 'geometry', 'CTY_FIPS']], op='intersects', how='inner')
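Under the hood, a spatial join with `op='intersects'` asks, for each point, which polygon it falls in. The core test is shapely's `intersects` predicate, which we can sketch with toy geometries (shapely only, made-up coordinates):

```python
from shapely.geometry import Point

# A toy 'zip code' polygon: a disc of radius 1 around the origin
zip_poly = Point(0, 0).buffer(1.0)
inside = Point(0.5, 0.5)
outside = Point(2.0, 2.0)

print(zip_poly.intersects(inside))   # True
print(zip_poly.intersects(outside))  # False
```

geopandas runs this predicate for every point/polygon candidate pair (with a spatial index to prune most comparisons), then concatenates the matched attributes, which is why `nytweets` carries both the tweet columns and the zip-code columns.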
In [171]:
f, ax = plt.subplots(frameon=False, figsize=(15, 9))
ny.plot(color='white', linewidth=0.2, ax=ax)
nytweets.plot(color='red', ax=ax, alpha=0.1)
ax.set_axis_off()
plt.savefig('./assets/img/newyorkspatialjoin.png')
In [155]:
import pytablewriter
writer = pytablewriter.MarkdownTableWriter()
writer.from_dataframe(test)
writer.write_table()
In [176]:
# Count tweets per zip code, largest first
countedTweets = nytweets.groupby('ZIPCODE')['ZIPCODE']\
    .size().sort_values(ascending=False)\
    .rename('tweetcount').reset_index()
final = ny[['ZIPCODE', 'geometry']].merge(countedTweets, on='ZIPCODE')

# Choropleth of tweet counts by zip code
f, ax = plt.subplots(figsize=(15, 9))
final.plot(column='tweetcount', scheme='Fisher_Jenks', k=5, cmap='OrRd', linewidth=0.1, ax=ax, legend=True)
ax.set_axis_off()
plt.savefig('./assets/img/newyorkchoropleth.png')