Step 0: Estimating Building Sizes

Real world building list can be generated using parcel map shapefile downloaded from data.detroitmi.gov by clicking the 'export' button near upper right corner of the screen, and then select 'CSV' option.

Each valid row contains a multipolygon constituted by a list of coordinates. From this information, we can construct a rectangle using lowerleft corner and upperright corner derived from the polygon. Each rectangle would then represent a building.

Notes:

  • The data files given from the course is slightly different from the posted coordiates at data.detroitmi.gov. (This is The Right Thing since it protects privacy. However, this also means that we can not use the coordiates directly from website. Instead, one has to estimate building dimensions based solely on the given data from course materials.
  • shapefile python documentation can be found here
  • Detroit parcel map shapefile can be cleaned with the help of this script ('test_valid_shp.py').
  • After cleaning, bboxes was extracted using script 'define_buildings.py' with some functions defined in 'bbox.py'. The generated building data was stored as a .csv file located at '../data/buildings.csv'
  • Visualization of the boxes was realized by script 'draw_bboxes.py'. Basemap library was used for this purpose.

Generating data from PARCELS.csv for building size info

One can in principle assume a grid to represent buildings. However, there are two problems with this approach:

  1. buildings can be of very different sizes.
  2. the density of buildings are not uniform.

Since we have access to building info from data.detroitmi.gov. The real world building info is extracted from that file in combination with the provided .csv files from the course to provide a more meaningful analysis.


In [1]:
import numpy as np
import pandas as pd

In [2]:
parcels = pd.read_csv("../data/Parcel_Map/PARCELS.csv")


/usr/lib/python3.5/site-packages/IPython/core/interactiveshell.py:2705: DtypeWarning: Columns (12,47,50,51) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

In [3]:
parcels.columns


Out[3]:
Index(['the_geom', 'OBJECTID', 'Shape_STAr', 'PARCELNO', 'Shape_STLe', 'WARD',
       'DISTRICT', 'ECF', 'PROPADDR', 'PROPNO', 'PROPDIR', 'PROPSTR',
       'PROPZIP', 'TAXPAYER1', 'TAXPAYER2', 'TAXADDR', 'TAXCITY', 'TAXSTATE',
       'PREVPCLASS', 'TAXZIP', 'PROPCLASS', 'PROPSTATUS', 'TAXSTATUS',
       'PREVTSTATU', 'TOTALSQFT', 'TOTALACREA', 'FRONTAGE', 'DEPTH', 'USECODE',
       'HOMESTEAD', 'NEZ', 'CIBFLAREA', 'CIBBLDGNO', 'CIBOCC', 'CIBYRBUILT',
       'RESFLAREA', 'RESBLDGNO', 'RESOCC', 'RESSTYLE', 'RESYRBUILT',
       'ISIMPROVED', 'SALEPRICE', 'SALEDATE', 'AV', 'TV', 'SEV', 'LANDVALUE',
       'LANDMAP', 'XSTRNAME1', 'XSTRNAME2', 'RELATED', 'AKA', 'SUBDIVISIO',
       'SPECACT', 'LEGAL', 'RP'],
      dtype='object')

In [4]:
parcels = parcels[['the_geom', 'PARCELNO','PROPADDR']]
print("original number of parcels: %d" % parcels.shape[0])


original number of parcels: 384675

In [5]:
parcels.dropna(axis=0,inplace=True)
print("after dropna: %d" % parcels.shape[0])


after dropna: 384597

In [6]:
parcels.iloc[0,0]


Out[6]:
'MULTIPOLYGON (((-83.04165895478116 42.329554868661866, -83.03929195571008 42.33057986836967, -83.03912395576653 42.33026786788393, -83.03901095566742 42.33005586809884, -83.03867595539081 42.329624868498854, -83.03841695545209 42.32929186848, -83.03804795627754 42.32881686840838, -83.0379439556402 42.32868386847111, -83.03834395568987 42.32849886867042, -83.0402839555524 42.327599868893444, -83.04063795566806 42.328051869036344, -83.04085495543352 42.32832886835788, -83.04083595526853 42.32833686885066, -83.04087895524218 42.32839186853662, -83.04089995520081 42.328385868700984, -83.04099495573118 42.32850586865732, -83.0409719547512 42.32851186854012, -83.04116295560775 42.328757868534744, -83.04159295493145 42.32929086810708, -83.04165895478116 42.329554868661866)))'

First column ('the_geom') contains coordinates forming a polygon. For simplicity, we can derive a rectangule with lower left corner and upper right corner to represent the area.


In [7]:
parcels.reset_index(inplace=True)

In [8]:
geoms = parcels['the_geom'].astype(str).apply(lambda x: x.split())
geoms = geoms.apply(lambda x: x[1:])
geoms = geoms.apply(lambda x: [x[0].lstrip('[').lstrip('(')]+x[1:-1]+[x[-1].rstrip(']').rstrip(')')])

In [9]:
def extract_lons(list_of_geom):
    '''select longitudes from a list of multipolygon'''
    np_list_of_geom = np.array(list_of_geom)
    np_lons = np_list_of_geom[::2]
    if (not np_lons.size):
        return []
    np_lons = np.char.lstrip(np_lons,'(')
    np_lons = np.char.replace(np_lons,'EMPTY','Nan')
    np_lons = np_lons.astype(float)
    return np_lons

def extract_lats(list_of_geom):
    '''select latitudes from a list of multipolygon'''
    np_list_of_geom = np.array(list_of_geom)
    np_lats = np_list_of_geom[1::2]
    
    if (not np_lats.size):
        return []
    np_lats = np.char.rstrip(np_lats,',')
    np_lats = np.char.rstrip(np_lats,')')
    np_lats = np.char.replace(np_lats,'EMPTY','Nan')
    np_lats = np_lats.astype(float)
    return np_lats

In [10]:
parcels['lons'] = geoms.apply(lambda x: extract_lons(x))
parcels['lats'] = geoms.apply(lambda x: extract_lats(x))

In [11]:
parcels['llcrnrlon'] = parcels['lons'].apply(lambda x: min(x))
parcels['llcrnrlat'] = parcels['lats'].apply(lambda x: min(x))
parcels['urcrnrlon'] = parcels['lons'].apply(lambda x: max(x))
parcels['urcrnrlat'] = parcels['lats'].apply(lambda x: max(x))
parcels['lon'] = parcels['lons'].apply(lambda x: x.mean()) # center of rectangle
parcels['lat'] = parcels['lats'].apply(lambda x: x.mean()) # center of rectangle
parcels.dropna(inplace=True)

In [12]:
parcels['addr'] = parcels['PROPADDR'].apply(lambda x: x.lower())

In [2]:
from matplotlib import pyplot as plt
%matplotlib inline

In [14]:
fig, ax = plt.subplots(1)
ax.scatter(parcels['lon'], parcels['lat'], s = 2, color='darkblue', alpha = 0.1)
ax.set_xlabel('longitude')
ax.set_xlim(min(parcels['lon'])-0.01,max(parcels['lon']+0.01))
ax.set_ylabel('latitude')
ax.set_ylim(min(parcels['lat'])-0.01,max(parcels['lat']+0.01))
plt.title('Distribution of Buildings')
plt.show()



In [15]:
parcels['building_id'] = np.arange(0,parcels.shape[0])

In [16]:
parcels = parcels[['building_id','lon','lat','llcrnrlon','llcrnrlat','urcrnrlon','urcrnrlat','addr','PARCELNO']]

In [17]:
parcels['length'] = parcels['urcrnrlon'] - parcels['llcrnrlon']
parcels['width'] = parcels['urcrnrlat'] - parcels['llcrnrlat']

In [19]:
parcels = parcels[parcels['length'].notnull()]
parcels = parcels[parcels['width'].notnull()]

In [20]:
parcels.reset_index(inplace=True)

In [21]:
parcels.tail()


Out[21]:
index building_id lon lat llcrnrlon llcrnrlat urcrnrlon urcrnrlat addr PARCELNO length width
384591 384592 384591 -83.261211 42.365590 -83.261416 42.365532 -83.260955 42.365639 14150 w outer drive 22126044. 0.000461 0.000107
384592 384593 384592 -83.261208 42.365494 -83.261413 42.365436 -83.260952 42.365543 14158 w outer drive 22126045. 0.000461 0.000107
384593 384594 384593 -83.261225 42.365397 -83.261409 42.365340 -83.260948 42.365447 14164 w outer drive 22126046. 0.000461 0.000107
384594 384595 384594 -83.261219 42.365229 -83.261406 42.365052 -83.260937 42.365351 14180 w outer drive 22126047-9 0.000469 0.000299
384595 384596 384595 -83.261216 42.364934 -83.261396 42.364815 -83.260928 42.365063 14200 w outer drive 22126050-1 0.000468 0.000248

In [22]:
parcels.to_csv('../data/buildings_step_0.csv', index=False)

Visualizing buildings


In [3]:
parcels = pd.read_csv('../data/buildings_step_0.csv')
bboxes = parcels[['llcrnrlon','llcrnrlat','urcrnrlon','urcrnrlat']]
bboxes = bboxes.as_matrix()

In [24]:
from bbox import draw_screen_bbox    # self-defined helper function

A subset of real buildings, just to show:

  • There are variations in the density of buildings.
  • There are overlap between buildings, especially when they were constructed along tilted axis w.r.t. our coordinates.

In [25]:
fig = plt.figure(figsize=(8,6))
for box in bboxes[0:1000]:
    draw_screen_bbox(box, fig)
    
plt.xlim(-83.08,-83.03)
plt.ylim(42.32,42.375)
    
plt.show()


Analyze dimensions of buildings


In [38]:
lengths = parcels['length'].values
widths = parcels['width'].values

indices = np.random.randint(0,lengths.shape[0],int(lengths.shape[0]/10) - 1)
lengths_sample = lengths[indices]
widths_sample = widths[indices]

In [40]:
fig = plt.figure(figsize=(10,4))
ax1 = plt.subplot2grid((1,2),(0,0))
ax1.hist(lengths_sample,200,facecolor='blue', alpha=0.75)
ax1.set_title("Distribution of Lengths")
ax1.set_xlim(0,0.003)
ax2 = plt.subplot2grid((1,2),(0,1))
ax2.hist(widths_sample,200,facecolor='red', alpha=0.75)
ax2.set_title("Distribution of Widths")
ax2.set_xlim(0,0.0015)
plt.show()



In [50]:
fig = plt.figure(figsize=(10,4))
ax1 = plt.subplot2grid((1,2),(0,0))
ax1.boxplot(lengths,0,"")
ax1.set_title("Distribution of Lengths")
ax1.set_ylim(0,0.001)
ax2 = plt.subplot2grid((1,2),(0,1))
ax2.boxplot(widths,0,"")
ax2.set_title("Distribution of Widths")
ax2.set_ylim(0,0.0008)
plt.show()


The distribution of both lengths of widths of rectangular boundaries are close to their median values. Length distribution is almost unimodal and normal. Widths are slightly right skewed.

In both cases, median values should be sufficient to represent each values


In [51]:
median_length = np.median(lengths)
median_width = np.median(widths)

In [52]:
print("Median of lengths: %.6f" % median_length)
print("Median of widths: %.6f" % median_width)


Median of lengths: 0.000411
Median of widths: 0.000204

In [ ]: