In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
import numpy as np
import matplotlib.pyplot as plt
import os
import geopandas as gpd
import math
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import pandas as pd
import mplleaflet
from shapely import geometry, ops
from sklearn import preprocessing, decomposition
from scipy import sparse, stats, spatial
import scipy.sparse.linalg
from tqdm import tqdm_notebook as tqdm
from pygsp import graphs, filters, plotting
plt.rcParams['figure.figsize'] = (30,15)
The assumption that the old parts of a city can be distinguished from the new parts solely on the basis of the road network was tested by analyzing the road networks of different cities.
The dataset used for this task was gathered from OpenStreetMap (freely reusable geospatial data) by extracting the specific city or region of interest. The data was cleaned by keeping only drivable or walkable roads, thereby discarding, for instance, railways. In addition, a one-to-one correspondence between street name and street was enforced by renaming streets whenever geographically distinct streets shared the same name. For each street, a number of features, such as length and curvature, were extracted. These features were then used to construct a graph based on a similarity matrix built from a weighted distance metric. From this graph, the Fiedler eigenvector was computed to finally obtain a clustering of the streets. The initial assumption was verified to a certain extent.
This notebook walks through the processing pipeline by exploring the street network of Switzerland's capital, Bern.
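For orientation, here is a minimal sketch of the spectral clustering step at the heart of that pipeline, assuming a hypothetical, already standardized feature matrix X with one row per street; the actual feature extraction, the feature weighting and the k-nearest-neighbour sparsification used later are omitted here and developed step by step below.

import numpy as np
import scipy.sparse.linalg
from scipy import spatial
from pygsp import graphs

X = np.abs(np.random.RandomState(0).randn(100, 5))      # placeholder feature matrix

dist = spatial.distance.squareform(
    spatial.distance.pdist(X, 'braycurtis'))             # pairwise dissimilarities
W = np.exp(-dist**2 / dist.mean()**2)                    # Gaussian kernel weights
np.fill_diagonal(W, 0)                                   # no self-loops

G = graphs.Graph(np.maximum(W, W.T))                     # symmetric similarity graph
G.compute_laplacian('normalized')
_, eigenvectors = scipy.sparse.linalg.eigsh(G.L, k=2, which='SM')
labels = (eigenvectors[:, 1] < 0).astype(int)            # sign of the Fiedler vector splits the streets in two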
The dataset, which is the road network of a given city, was downloaded from https://mapzen.com/data/metro-extracts/ (a website that can produce extracts from the OpenStreetMap dataset).
The following cities or areas were extracted using Mapzen:
GeoPandas was used to work with the shapefiles provided by the extracts. GeoPandas extends the datatypes used by pandas to allow spatial operations on geometric types.
For more information on GeoPandas: http://geopandas.org/
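As a small, self-contained illustration (a toy example, not the notebook's data) of what such spatial operations look like:

import geopandas as gpd
from shapely.geometry import LineString

# Two hand-made street segments in WGS84 (lon, lat) coordinates
toy = gpd.GeoDataFrame(
    {'name': ['A', 'B']},
    geometry=[LineString([(7.44, 46.94), (7.45, 46.95)]),
              LineString([(7.46, 46.95), (7.47, 46.96)])],
    crs={'init': 'epsg:4326'})
toy = toy.to_crs({'init': 'epsg:3395'})   # project to Mercator so lengths are in metres
print(toy.geometry.length)                # per-row geometric length, returned like a pandas Series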
In [2]:
# Loading the downloaded shapefile
# Bern
shapefile_dir = '../data/bern_switzerland.imposm-shapefiles'
shapefile_name ='bern_switzerland_osm_roads.shp'
shapefile_roads = os.path.join(shapefile_dir, shapefile_name)
# Here we create the GeoPandas Dataset:
df = gpd.GeoDataFrame.from_file(shapefile_roads)
# Rename all streets tagged with "None" to "no_name"
df.loc[df['name'].isnull(), 'name'] = 'no_name'
# In order to have meaningful distances, we use the Mercator projection
# that preserves the angles and the shapes of small objects
# For more information visit: http://geopandas.org/projections.html
df = df.to_crs({'init': 'epsg:3395'})
Let's have a look at the data.
We can see that GeoPandas nicely structures the data provided by OpenStreetMap. The last column contains the geometry, a line together with its coordinates. We can now work with the GeoDataFrame just like with a pandas DataFrame.
In [3]:
df[0:3]
Out[3]:
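Since the GeoDataFrame behaves like a pandas DataFrame, the usual pandas operations apply to it as well; a small illustration, assuming the GeoDataFrame df loaded above with its 'type' and 'name' columns from the OSM roads layer:

print(df['type'].value_counts().head())                   # most frequent OSM road types
print((df['name'] != 'no_name').sum(), 'named road segments')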
...and this is what the data looks like:
In [4]:
ax = df.plot(color='black',figsize=(10, 10))
To see it printed on a real map, this code could be used: mplleaflet.show(fig=ax.figure, crs=df.crs)
Here is an extract as an image:
The following cells take the GeoPandas dataframe and keep only specific types of entries.
First, the entries are restricted to the class "highway" (which in OSM stands for roads).
Then, within this class, only the following road types are kept: residential, living_street, pedestrian, trunk, primary, secondary and tertiary.
In [5]:
total_rows_before_cleaning = len(df.index)
In [6]:
# Get Only Highway classes
df=df[df['class']=='highway']
sel = (df['type']=='residential')| (df['type']=='living_street') | (df['type']=='pedestrian')| (df['type']=='trunk') | (df['type']=='primary') | (df['type']=='secondary') | (df['type']=='tertiary')
df = df[sel]
#check if the types are there
df[sel]["type"].unique()
Out[6]:
In [7]:
total_rows_after_cleaning = len(df.index)
In [8]:
1-total_rows_after_cleaning/total_rows_before_cleaning
Out[8]:
By focusing our interest on streets only we reduced the dataset by slightly more than 80%.
Splitting Streets Which Are Geometrically Separated But Have The Same Name
Inspired by: https://stackoverflow.com/questions/47988849/split-lines-based-on-names (a different naming procedure was implemented).
All street segments that share the same name are clustered using DBSCAN.
Each cluster is then identified and the associated streets are renamed in the following way: oldName_clusterNumber
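Before looking at the implementation, here is a toy illustration (not part of the pipeline) of the clustering step: DBSCAN with eps=0.08 on standardized coordinates assigns two geographically distant groups of points to two different clusters.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Two tight groups of five points each, far apart from one another
group_a = [[0, 0], [0.01, 0.01], [0.02, 0], [0, 0.02], [0.01, 0]]
group_b = [[5, 5], [5.01, 5.01], [5.02, 5], [5, 5.02], [5.01, 5]]
coords = np.array(group_a + group_b)

clusters = DBSCAN(eps=0.08).fit_predict(StandardScaler().fit_transform(coords))
print(clusters)   # expected: [0 0 0 0 0 1 1 1 1 1]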
In [9]:
def split_rename(temp):
    '''Splits streets which are geometrically apart but have the same name.
    INPUT: GeoPandas dataframe of streets which all share the same name.
    OUTPUT: A renamed GeoPandas dataframe based on the geographic locality of the street.
    Naming convention: oldName_clusterNumber'''
    ids = []
    coords = []
    for row in temp.itertuples():
        geom = np.asarray(row.geometry)
        # Record the geometric location of every point
        coords.extend(geom)
        # Record the id of every point
        ids.extend([row.id] * geom.shape[0])
    # Initialize DBSCAN
    clust = DBSCAN(eps=0.08)
    # Normalize the coordinates and fit the clusters
    clusters = clust.fit_predict(StandardScaler().fit_transform(coords))
    # Store point ids and cluster labels in a pandas dataframe
    points_clusters = pd.DataFrame({"id": ids, "cluster": clusters})
    # One street chunk id should only be assigned to one cluster
    # -> assign it to the cluster that contains most of its points
    id_cluster = points_clusters.groupby('id').cluster.apply(lambda x: x.mode()).reset_index().drop('level_1', 1)
    # Rename
    temp_add_cluster = temp.merge(id_cluster, on='id')
    temp_add_cluster['name'] = temp_add_cluster.name + '_' + temp_add_cluster.cluster.astype(str)
    temp_add_cluster = temp_add_cluster.drop('cluster', 1)
    return temp_add_cluster
In [10]:
df_split = df.groupby('name').apply(split_rename)
Here the previous code is demonstrated:
Bahnhofstrasse is a very common name for the street that leads (or used to lead) to the railway station. As there are multiple railway stations around Bern, this street name naturally appears often. However, a one-to-one correspondence between street name and street should be achieved.
In [11]:
# 1
grouped = df.groupby('name')
streets = {}
frequency = {}
for key, values in grouped:
    streets[key] = values
no_split = streets['Bahnhofstrasse'].plot(color='red',linewidth=5.0)
In [12]:
# 2
grouped = df_split.groupby('name')
streets = {}
for key, values in grouped:
    streets[key] = values
fig=plt.figure()
ax = streets['Bahnhofstrasse_0'].plot(color='red',linewidth=5.0)
ax = streets['Bahnhofstrasse_1'].plot(ax=ax, color='green',linewidth=5.0);
ax = streets['Bahnhofstrasse_2'].plot(ax=ax, color='orange',linewidth=5.0);
ax = streets['Bahnhofstrasse_3'].plot(ax=ax, color='black',linewidth=5.0);
ax = streets['Bahnhofstrasse_4'].plot(ax=ax, color='yellow',linewidth=5.0);
ax = streets['Bahnhofstrasse_5'].plot(ax=ax, color='purple',linewidth=5.0);
ax = streets['Bahnhofstrasse_6'].plot(ax=ax, color='magenta',linewidth=5.0);
ax = streets['Bahnhofstrasse_7'].plot(ax=ax, color='grey',linewidth=5.0);
In [14]:
df_split.to_file("../data/cleaned_dataset_bern.shp")
In [15]:
df = gpd.GeoDataFrame.from_file("../data/cleaned_dataset_bern.shp")
In [16]:
street_names= {"name":df.name.unique()}
dfp_features = pd.DataFrame(street_names)
dfp_features.index.name='Progressive_index'
In [17]:
dfp_features[0:5]
Out[17]:
In [18]:
street_length = []
for street_name in tqdm(dfp_features['name']):
    sub_gpdf = df[df.name == street_name]
    street_length_curr = []
    for index, row in sub_gpdf.iterrows():
        street_length_curr.append(row['geometry'].length)
    street_length.append(np.sum(street_length_curr))
dfp_features['length'] = street_length
Let's have a look at the data ordered by length
In [19]:
dfp_features_temp = dfp_features.sort_values('length',ascending=False)
dfp_features_temp[0:7]
Out[19]:
There are some streets that are unnamed. But since they have been split into connected pieces by the splitting algorithm, we can treat them as normal streets (the name is the only information that is missing).
Let's determine the curvature of each of the streets.
By the law of cosines, in a triangle with sides $d_1$, $d_2$ and $d_3$, the angle $\alpha$ opposite the side $d_3$ is given by:
$$\alpha = \arccos\left(\frac{d_1^2 + d_2^2 - d_3^2}{2 d_1 d_2}\right).$$
Note that for a straight line the corresponding curvature value will be $\pi$; the output of this formula lies in the interval $[0,\pi]$, and the curvier the line, the closer the value gets to zero.
Since we compute an angle for every triplet of consecutive coordinates of each street, we need to aggregate these values into a fixed-size representation for each street name. This is accomplished with 7 summary statistics: minimum, maximum, median and the first 4 moments (mean, standard deviation, skewness and kurtosis). The computation first determines an angle for every triplet of coordinates of each chunk of street sharing the same name, and then merges all the angles of that street into a single list (curvature_curr).
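As a quick sanity check of the formula on hand-made triplets (triplet_angle is a small helper written only for this illustration, it is not used in the pipeline):

import math
import numpy as np

def triplet_angle(p0, p1, p2):
    d1 = np.hypot(p0[0] - p1[0], p0[1] - p1[1])
    d2 = np.hypot(p1[0] - p2[0], p1[1] - p2[1])
    d3 = np.hypot(p0[0] - p2[0], p0[1] - p2[1])
    return math.acos((d1**2 + d2**2 - d3**2) / (2 * d1 * d2))

print(triplet_angle((0, 0), (1, 0), (2, 0)))   # straight line -> pi
print(triplet_angle((0, 0), (1, 0), (1, 1)))   # right angle   -> pi/2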
In [20]:
std_curvature=[]
mean_curvature = []
max_curvature = []
min_curvature = []
skew_curvature = []
kurtosis_curvature = []
median_curvature = []
for street_name in tqdm(dfp_features['name']):
    sub_gpdf = df[df.name == street_name]
    curvature_curr = []
    for index, row in sub_gpdf.iterrows():
        line_x = row['geometry'].xy[0]
        line_y = row['geometry'].xy[1]
        temp_ang = []
        if (len(line_x) <= 2):
            # A chunk with only two points is straight by definition
            temp_ang.append(np.pi)
        else:
            for i in range(len(line_x)-2):
                d1 = np.sqrt((line_x[i]-line_x[i+1])**2 + (line_y[i]-line_y[i+1])**2)
                d2 = np.sqrt((line_x[i+1]-line_x[i+2])**2 + (line_y[i+1]-line_y[i+2])**2)
                d3 = np.sqrt((line_x[i]-line_x[i+2])**2 + (line_y[i]-line_y[i+2])**2)
                # fmod guards against the argument of acos leaving its domain due to round-off
                temp_ang.append(math.acos(math.fmod(((d1**2+d2**2-d3**2)/(2*d1*d2)), 1)))
        curvature_curr.extend(temp_ang)
    std_curvature.append(np.std(curvature_curr))
    mean_curvature.append(np.mean(curvature_curr))
    max_curvature.append(np.max(curvature_curr))
    min_curvature.append(np.min(curvature_curr))
    skew_curvature.append(scipy.stats.skew(curvature_curr))
    kurtosis_curvature.append(scipy.stats.kurtosis(curvature_curr))
    median_curvature.append(np.median(curvature_curr))
dfp_features['std_curvature'] = std_curvature
dfp_features['mean_curvature'] = mean_curvature
dfp_features['max_curvature'] = max_curvature
dfp_features['min_curvature'] = min_curvature
dfp_features['skew_curvature'] = skew_curvature
dfp_features['kurtosis_curvature'] = kurtosis_curvature
dfp_features['median_curvature'] = median_curvature
In [21]:
dfp_features['oneway'] = df.oneway
Another feature that can be extracted from the collection of coordinates of each street is the distance between its starting and ending point (the "flight" distance). If a street is not one contiguous line, the method linemerge() merges only the contiguous portions, each becoming a component of a new MultiLineString. In that case, the flight distance is computed as the sum of the flight distances of the connected components.
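A toy illustration of this behaviour of ops.linemerge() (not part of the pipeline):

from shapely import geometry, ops

contiguous = geometry.MultiLineString([[(0, 0), (1, 0)], [(1, 0), (2, 0)]])
disjoint = geometry.MultiLineString([[(0, 0), (1, 0)], [(5, 0), (6, 0)]])
print(type(ops.linemerge(contiguous)).__name__)   # LineString: the pieces touch and merge
print(type(ops.linemerge(disjoint)).__name__)     # MultiLineString: two separate components remain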
In [22]:
street_flight_distance=[]
for street_name in tqdm(dfp_features['name']):
    sub_gpdf = df[df.name == street_name]
    lines = []
    for index, row in sub_gpdf.iterrows():
        line = row['geometry']
        lines.append(line)
    multi_line = geometry.MultiLineString(lines)
    merged_line = ops.linemerge(multi_line)
    if isinstance(merged_line, geometry.multilinestring.MultiLineString):
        # The street consists of several disconnected components:
        # sum the flight distances of the individual components
        lengths = []
        for chunks in merged_line:
            init_x = chunks.xy[0][0]
            init_y = chunks.xy[1][0]
            final_x = chunks.xy[0][-1]
            final_y = chunks.xy[1][-1]
            lengths.append(np.sqrt((init_x-final_x)**2 + (init_y-final_y)**2))
        street_flight_distance.append(np.sum(lengths))
    else:
        init_x = merged_line.xy[0][0]
        init_y = merged_line.xy[1][0]
        final_x = merged_line.xy[0][-1]
        final_y = merged_line.xy[1][-1]
        street_flight_distance.append(np.sqrt((init_x-final_x)**2 + (init_y-final_y)**2))
dfp_features['flight_distance']=street_flight_distance
As expected, the streets with the largest curvature value (which in our convention means that they are straight lines) have a length equal to their flight distance:
In [23]:
dfp_features_temp = dfp_features.sort_values('mean_curvature', ascending=False )
dfp_features_temp[0:5]
Out[23]:
The opposite holds for the curvy streets:
In [24]:
dfp_features_temp = dfp_features.sort_values('mean_curvature')
dfp_features_temp[0:5]
Out[24]:
Let's have a look at the data ordered by flight distance:
In [25]:
dfp_features_temp = dfp_features.sort_values('flight_distance')
dfp_features_temp[0:10]
Out[25]:
So many zeros! Let's have a look at them:
In [26]:
grouped = df.groupby('name')
streets = {}
fig=plt.figure()
fig, axes = plt.subplots(3,3)
for key, values in grouped:
    streets[key] = values
streets['Hallmattstrasse_0'].plot(ax=axes[0][0], color='red',linewidth=5.0)
streets['Kreisel Steinhölzli_-1'].plot(ax=axes[0][1], color='red',linewidth=5.0)
streets['Gilberte-de-Courgenay-Platz_-1'].plot(ax=axes[0][2], color='red',linewidth=5.0)
streets['Hubelmattstrasse_2'].plot(ax=axes[1][0], color='red',linewidth=5.0)
streets['Neuhausplatz_-1'].plot(ax=axes[1][1], color='red',linewidth=5.0)
streets['Bühlplatz_-1'].plot(ax=axes[1][2], color='red',linewidth=5.0)
streets['Pourtalèsstrasse_1'].plot(ax=axes[2][0], color='red',linewidth=5.0)
streets['Rubigenstrasse_0'].plot(ax=axes[2][1], color='red',linewidth=5.0)
streets['Unterdettigenstrasse_1'].plot(ax=axes[2][2], color='red',linewidth=5.0)
Out[26]:
In [27]:
sfdp=[] #street_flight_distance_percentage
for street in tqdm(dfp_features.iterrows()):
    d = street[1]['length']
    fd = street[1]['flight_distance']
    sfdp.append(fd/d)
dfp_features['percentage_flight_distance']=sfdp
Let's see if we have meaningful data (percentage of flight distance should be at most 1):
In [28]:
dfp_features_temp = dfp_features.sort_values('percentage_flight_distance',ascending=False)
dfp_features_temp[0:5]
Out[28]:
It is also possible to compute the derivative of the coordinate points of each street. As before, we first compute the derivative for each chunk of street with the same name and then merge all the derivatives of that street into two lists (derivative_curr_x and derivative_curr_y). Again, since the derivative of each line has the same length as the line itself, to be able to compare derivatives of different streets we aggregate the coefficients with the previously described 7 summary statistics (minimum, maximum, median and the first 4 moments).
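The derivatives are computed with np.gradient(), which returns central differences in the interior of a coordinate sequence and one-sided differences at its endpoints; a toy example:

import numpy as np

x = np.array([0.0, 1.0, 3.0, 6.0])
print(np.gradient(x))   # [1.  1.5  2.5  3. ]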
In [29]:
std_derivative_x = []
mean_derivative_x = []
max_derivative_x = []
min_derivative_x = []
skew_derivative_x = []
kurtosis_derivative_x = []
median_derivative_x = []
std_derivative_y = []
mean_derivative_y = []
max_derivative_y = []
min_derivative_y = []
skew_derivative_y = []
kurtosis_derivative_y = []
median_derivative_y = []
for street_name in tqdm(dfp_features['name']):
    sub_gpdf = df[df.name == street_name]
    derivative_curr_x = []
    derivative_curr_y = []
    for index, row in sub_gpdf.iterrows():
        line_x = row['geometry'].xy[0]
        line_y = row['geometry'].xy[1]
        derivative_x = np.gradient(line_x)
        derivative_y = np.gradient(line_y)
        derivative_curr_x.extend(derivative_x)
        derivative_curr_y.extend(derivative_y)
    std_derivative_x.append(np.std(derivative_curr_x))
    mean_derivative_x.append(np.mean(derivative_curr_x))
    max_derivative_x.append(np.max(derivative_curr_x))
    min_derivative_x.append(np.min(derivative_curr_x))
    skew_derivative_x.append(scipy.stats.skew(derivative_curr_x))
    kurtosis_derivative_x.append(scipy.stats.kurtosis(derivative_curr_x))
    median_derivative_x.append(np.median(derivative_curr_x))
    std_derivative_y.append(np.std(derivative_curr_y))
    mean_derivative_y.append(np.mean(derivative_curr_y))
    max_derivative_y.append(np.max(derivative_curr_y))
    min_derivative_y.append(np.min(derivative_curr_y))
    skew_derivative_y.append(scipy.stats.skew(derivative_curr_y))
    kurtosis_derivative_y.append(scipy.stats.kurtosis(derivative_curr_y))
    median_derivative_y.append(np.median(derivative_curr_y))
dfp_features['std_derivative_x'] = std_derivative_x
dfp_features['mean_derivative_x'] = mean_derivative_x
dfp_features['max_derivative_x'] = max_derivative_x
dfp_features['kurtosis_derivative_x'] = kurtosis_derivative_x
dfp_features['median_derivative_x'] = median_derivative_x
dfp_features['skew_derivative_x'] = skew_derivative_x
dfp_features['min_derivative_x'] = min_derivative_x
dfp_features['std_derivative_y'] = std_derivative_y
dfp_features['mean_derivative_y'] = mean_derivative_y
dfp_features['max_derivative_y'] = max_derivative_y
dfp_features['min_derivative_y'] = min_derivative_y
dfp_features['kurtosis_derivative_y'] = kurtosis_derivative_y
dfp_features['median_derivative_y'] = median_derivative_y
dfp_features['skew_derivative_y'] = skew_derivative_y
In [30]:
dfp_features.to_csv('../data/features_bern.csv', encoding='utf-8')
In [31]:
dfp=pd.read_csv('../data/features_bern.csv',index_col='Progressive_index',encoding='utf-8')
In [32]:
dfp[0:5]
Out[32]:
Because not all the features have the same importance, we assign each of them a weight, a multiplier that "gives more relevance" to one feature with respect to another. For example, curvature and the flight-distance ratio are more important than the absolute length. Even though the features have heterogeneous ranges (curvature, for example, spans from 0 to $\pi$), the normalization is meant to keep the weighting effective, since it is applied using the total mean and variance over the weighted vectors.
In [33]:
#tuning parameters
l=2 #coeff. for length
c=2 #coeff. for mean curvature
f=2 #coeff. for flight-distance
f_r=3.5 #coeff. for ratio flight-distance
o=1.5 #coeff. for one-way
s=1.5 #coeff. for std_curvature
mi=0.25 #coeff. for min curvature
ma=3.5 #coeff. for max curvature
skew=1 #coeff. for skew curvature
kurt=0.5 #coeff. for kurtosis curvature
med=1.5 #coeff. for median curvature
dexstd = 1 #coeff. for std_derivative_x
dexmed = 1 #coeff. for median_derivative_x
dexskew=0.5 #coeff. for skew_derivative_x
dexkurt=0.25 #coeff. for kurtosis derivative x
dexmi = 1 #coeff. for min derivative x
dexma = 1.5 #coeff. for max derivative x
dexmean = 1 #coeff. for mean derivative x
deymean = 1 #coeff. for mean derivative y
deymi = 1 #coeff. for min derivative y
deyma = 1.5 #coeff. for max derivative y
deystd = 1 #coeff. for std derivative y
deymed = 1 #coeff. for median derivative y
deyskew=0.5 #coeff. for skew derivative y
deykurt=0.25 #coeff. for kurtosis derivate y
bias=0.01 #generic bias
dfp['length']=dfp['length']*l
dfp['mean_curvature']=dfp['mean_curvature']*c
dfp['flight_distance']=dfp['flight_distance']*f
dfp['percentage_flight_distance']=dfp['percentage_flight_distance']*f_r
dfp['oneway']=dfp['oneway']*o
dfp['std_curvature']=dfp['std_curvature']*s
dfp['min_curvature']=dfp['min_curvature']*mi
dfp['max_curvature']=dfp['max_curvature']*ma
dfp['skew_curvature'] = dfp['skew_curvature']*skew
dfp['kurtosis_curvature'] = dfp['kurtosis_curvature']*kurt
dfp['median_curvature'] = dfp['median_curvature']*med
dfp['std_derivative_x'] = dfp['std_derivative_x']*dexstd
dfp['median_derivative_x'] = dfp['median_derivative_x']*dexmed
dfp['skew_derivative_x'] = dfp['skew_derivative_x']*dexskew
dfp['kurtosis_derivative_x'] = dfp['kurtosis_derivative_x']*dexkurt
dfp['min_derivative_x'] = dfp['min_derivative_x']*dexmi
dfp['max_derivative_x'] = dfp['max_derivative_x']*dexma
dfp['mean_derivative_x'] = dfp['mean_derivative_x']*dexmean
dfp['mean_derivative_y'] = dfp['mean_derivative_y']*deymean
dfp['min_derivative_y'] = dfp['min_derivative_y']*deymi
dfp['max_derivative_y'] = dfp['max_derivative_y']*deyma
dfp['std_derivative_y'] = dfp['std_derivative_y']*deystd
dfp['median_derivative_y'] = dfp['median_derivative_y']*deymed
dfp['skew_derivative_y'] = dfp['skew_derivative_y']*deyskew
dfp['kurtosis_derivative_y'] = dfp['kurtosis_derivative_y']*deykurt
In [34]:
def normalize_min_max(df):
    return (df-df.min())/(df.max()-df.min())

def normalize_mean_std(df):
    return (df-df.mean())/df.std()
In [35]:
dfp_norm=dfp.copy()
dfp_norm.drop('name', axis=1, inplace=True)
dfp_norm=normalize_mean_std(dfp_norm);
We can also drop some of the features
In [36]:
#dfp_norm.drop('max_curvature', axis=1, inplace=True)
#dfp_norm.drop('min_curvature', axis=1, inplace=True)
#dfp_norm.drop('median_curvature', axis=1, inplace=True)
dfp_norm.drop('median_derivative_x', axis=1, inplace=True)
dfp_norm.drop('median_derivative_y', axis=1, inplace=True)
#dfp_norm.drop('skew_curvature', axis=1, inplace=True)
dfp_norm.drop('skew_derivative_x', axis=1, inplace=True)
dfp_norm.drop('skew_derivative_y', axis=1, inplace=True)
dfp_norm.drop('kurtosis_derivative_x', axis=1, inplace=True)
dfp_norm.drop('kurtosis_derivative_y', axis=1, inplace=True)
dfp_norm.drop('kurtosis_curvature', axis=1, inplace=True)
In [37]:
print('Matrix dimensions: {} x {}'.format(len(dfp_norm),len(dfp_norm)))
We first need to define a dissimilarity measure between the feature vectors of the streets.
In [38]:
dist =spatial.distance.squareform(spatial.distance.pdist(dfp_norm.values,'braycurtis'))
#dist =spatial.distance.squareform(spatial.distance.pdist(dfp_norm.values,'euclidean'))
#dist =spatial.distance.squareform(spatial.distance.pdist(dfp_norm.values,'cityblock'))
#dist =spatial.distance.squareform(spatial.distance.pdist(dfp_norm.values,'minkowski'))
#dist =spatial.distance.squareform(spatial.distance.pdist(dfp_norm.values,'cosine'))
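For reference, the Bray-Curtis dissimilarity used above is, for two feature vectors $u$ and $v$, $\sum_i |u_i - v_i| / \sum_i |u_i + v_i|$; a quick check on toy vectors:

from scipy.spatial import distance

u = [1.0, 2.0, 3.0]
v = [2.0, 2.0, 5.0]
print(distance.braycurtis(u, v))                 # 0.2
print(sum(abs(a - b) for a, b in zip(u, v)) /
      sum(abs(a + b) for a, b in zip(u, v)))     # 0.2 as well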
Let's check how many zero entries the distance matrix contains...
In [39]:
print('Number of zero elements: {}'.format(np.sum(dist==0)))
...and check if we have a symmetric matrix
In [40]:
is_sym = np.sum(np.around((dist-np.transpose(dist)),14)!=0)
# a correct implementation should pass the below test.
assert is_sym == 0
We can also plot a histogram of the distances.
In [41]:
plt.hist(dist.reshape(-1), bins=1000);
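The distances are then turned into edge weights with a Gaussian kernel, using the mean distance as the scale parameter; this is what the next cell implements:
$$W_{ij} = \exp\left(-\frac{d_{ij}^2}{\bar{d}^{\,2}}\right)$$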
In [42]:
W=np.exp(-(dist**2)/(dist.mean()**2))
np.fill_diagonal(W, 0) #fill the diagonal with zeros to avoid self-connections
We can also visualise the results (histogram)
In [43]:
plt.hist(W.reshape(-1), bins=200);
In [44]:
plt.spy(W)
Out[44]:
In [45]:
# k-nearest-neighbour sparsification: keep only the NEIGHBOURS largest weights in each row
NEIGHBOURS = 130
# Sort each row by increasing weight...
indeces = np.argsort(W, axis=1)
# ...and zero out all entries except the NEIGHBOURS largest ones
indeces = indeces[:,:len(W)-NEIGHBOURS]
indeces = np.arange(W.shape[0])[:,None], indeces
W[indeces] = 0
Let's symmetrize it:
In [46]:
W_sparse=np.maximum(W,np.transpose(W))
And visualise it:
In [47]:
plt.spy(W_sparse)
Out[47]:
Degree distributions (weighted and unweighted):
In [48]:
degrees = W_sparse.sum(0)
plt.hist(np.count_nonzero(W_sparse, 0), bins=100, label='non-weighted graph degree distribution')
plt.hist(degrees, bins=100, label='weighted graph degree distribution')
plt.legend(fontsize = 'xx-large')
Out[48]:
In [49]:
G = graphs.Graph(W_sparse)
print('{} nodes, {} edges'.format(G.N, G.Ne))
In [50]:
G.is_connected()
Out[50]:
In [51]:
G.compute_laplacian('normalized')
laplacian = G.L
We compute only the first few eigenvalues (to avoid a full eigendecomposition of a big matrix).
In [52]:
eigenvalues, eigenvectors = sparse.linalg.eigsh(laplacian,k=6,which='SM')
print(eigenvalues)
Note that the first eigenvalue is (numerically) zero, as expected for a graph Laplacian; since the second eigenvalue is strictly positive, the graph is connected.
In [53]:
print(eigenvalues[0])
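As a quick sanity check (a toy graph, not part of the pipeline), a disconnected graph would instead show more than one zero eigenvalue:

import numpy as np
from pygsp import graphs

# Two disjoint edges -> two connected components -> two zero eigenvalues
W_toy = np.array([[0, 1, 0, 0],
                  [1, 0, 0, 0],
                  [0, 0, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
G_toy = graphs.Graph(W_toy)
G_toy.compute_laplacian('normalized')
print(np.round(np.linalg.eigvalsh(G_toy.L.toarray()), 10))   # [0. 0. 2. 2.]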
In [54]:
plt.plot(eigenvalues, '.-', markersize=15);
In [55]:
plt.plot(eigenvectors[:, 1])
Out[55]:
In [56]:
labels = (eigenvectors[:,1]<0).astype(int)
x = eigenvectors[:,1]
y = eigenvectors[:,2]
plt.scatter(x, y, c=labels, cmap='RdBu', alpha=0.5);
Now we store the sign of the Fiedler vector together with the name of the corresponding street in a dataframe, so that the result of the clustering can be visualised on a map.
In [57]:
signal=labels*2-1
In [58]:
labled={}
for index,row in dfp.iterrows():
    labled[str(row['name'])]=signal[index]
name={'name':list(labled.keys())}
pd_label=pd.DataFrame(name)
pd_label['lable']=labled.values()
pd_label.index.name='Progressive_index'
pd_label.to_csv('../data/labled_bern.csv', encoding='utf-8')
pd_label[0:5]
Out[58]:
Let's visualize the Fiedler vector over the graph
In [59]:
G.set_coordinates()
In [60]:
G.plot_signal(eigenvectors[:,1], vertex_size=20)
...and its sign
In [61]:
G.plot_signal(signal, vertex_size=20)
In [62]:
# Loading the downloaded shapefile
# Bern
shapefile_dir = '../data/bern_switzerland.imposm-shapefiles'
shapefile_name ='bern_switzerland_osm_roads.shp'
#Budapest
#shapefile_dir = '../data/budapest'
#shapefile_name ='budapest_hungary_osm_roads.shp'
#Avia
#shapefile_dir = '../data/avia'
#shapefile_name ='ex_S9FVmMQ5ZQ7RRCSD68Hm5SNFNvwcB_osm_roads.shp'
shapefile_roads = os.path.join(shapefile_dir, shapefile_name)
# Here we create the GeoPandas Dataset:
dfp_original = gpd.GeoDataFrame.from_file(shapefile_roads)
# Rename all streets tagged with "None" to "no_name"
dfp_original.loc[dfp_original['name'].isnull(), 'name'] = 'no_name'
dfp_original = dfp_original.to_crs({'init': 'epsg:3395'})
In [63]:
dfp=pd.read_csv('../data/labled_bern.csv',index_col='Progressive_index',encoding='utf-8')
df_gp = gpd.GeoDataFrame.from_file('../data/cleaned_dataset_bern.shp')
df_gp_blue=df_gp.copy()
df_gp_blue=df_gp_blue.set_index('name')
df_gp_red=df_gp.copy()
df_gp_red=df_gp_red.set_index('name')
In [64]:
for row in tqdm(dfp.iterrows()):
    name = row[1]['name']
    if row[1]['lable'] == 1:
        df_gp_blue.drop(name, inplace=True)
    else:
        df_gp_red.drop(name, inplace=True)
In [65]:
fig=plt.figure()
fig, axes = plt.subplots(1, 2)
ax_blue=df_gp_blue.plot(ax=axes[0],color='blue',linewidth=1.0)
ax_red=df_gp_red.plot(ax=axes[1],color='red',linewidth=1.0)
Finally, here is the resulting separation of Bern:
To test the efficacy of the algorithm, we can run it on a dataframe obtained by merging two cities: a very new one (New York) and a rural locality (Lama Mocogno, a municipality in the Province of Modena, Italy). The data is ordered: the first 4283 rows refer to New York and all the others to Lama Mocogno.
In [67]:
df=pd.read_csv('../data/features_two_cities.csv',index_col='index',encoding='utf-8')
dfp=pd.read_csv('../data/features_twocities_lamamocogne_newyork_new.csv',encoding='utf-8')
Let's quickly redo the previous steps for this dataset
In [68]:
#we create the distances
dist =spatial.distance.squareform(spatial.distance.pdist(df.values,'braycurtis'))
#build the weight matrix
W=np.exp(-(dist**2)/(dist.mean()**2))
np.fill_diagonal(W, 0)
#sparsify the matrix with KNN
NEIGHBOURS = 1205
indeces = np.argsort(W, axis=1)
indeces = indeces[:,:len(W)-NEIGHBOURS]
indeces = np.arange(W.shape[0])[:,None], indeces
W[indeces] = 0
#symmetrize it
W_sparse=np.maximum(W,np.transpose(W))
Let's visualize W
In [69]:
plt.spy(W_sparse)
Out[69]:
Note the block structure: distances between streets of the same city are shorter than distances between streets of different cities.
Compute the graph Laplacian
In [70]:
G = graphs.Graph(W_sparse)
G.compute_laplacian('normalized')
laplacian = G.L
Compute eigenvectors and eigenvalues
In [71]:
eigenvalues, eigenvectors = sparse.linalg.eigsh(laplacian,k=6,which='SM')
print(eigenvalues)
print(eigenvalues[1])
As before, we use the value of the second eigenvector of the Laplacian matrix as the x coordinate of a node and the value of the third eigenvector as the y coordinate. The color is indicated by the sign of the Fiedler vector.
In [72]:
labels = (eigenvectors[:,1]<0).astype(int)
x = eigenvectors[:,1]
y = eigenvectors[:,2]
plt.scatter(x, y, c=labels, cmap='RdBu', alpha=0.5);
Let's retrieve the ground-truth city categorization. We've built a dataframe containing all the street names of the two cities and the corresponding label (1 for New York and -1 for Lama Mocogno).
In [73]:
pd_label = pd.read_csv('../data/labled_features_twocities_lamamocogne_newyork_new.csv', encoding='utf-8')
In [74]:
pd_label.head(2)
Out[74]:
In [75]:
pd_label.tail(2)
Out[75]:
In [76]:
cities = preprocessing.LabelEncoder().fit_transform(pd_label['lable'])
Now, let's plot it, but with the color indicated by the ground truth labelling
In [77]:
labels = (eigenvectors[:,1]<0).astype(int)
x = eigenvectors[:,1]
y = eigenvectors[:,2]
plt.scatter(x, y, c=cities, cmap='RdBu', alpha=0.5);
Let's compute how many streets were misclassified by the Fiedler vector with respect to the ground truth. Since the sign of the Fiedler vector is arbitrary, we take the smaller of the two error counts obtained from the two possible label assignments.
In [78]:
err = np.count_nonzero(cities - labels)
err = err if err < len(labels) / 2 else len(labels) - err
print('{} errors ({:.2%})'.format(err, err/len(labels)))
Finally, we can plot the two cities on the map as before
In [79]:
signal=labels*2-1
In [80]:
labled={}
for index,row in dfp.iterrows():
    labled[str(row['name'])]=signal[index]
name={'name':list(labled.keys())}
pd_label=pd.DataFrame(name)
pd_label['lable']=labled.values()
pd_label.index.name='Progressive_index'
pd_lable_new=pd_label[0:4282].copy();
pd_label_lama=pd_label[4283:len(pd_label)].copy();
pd_label.to_csv('../data/labled_new_lama_big.csv', encoding='utf-8')
pd_lable_new.to_csv('../data/labled_new_in_set_lama_big.csv', encoding='utf-8')
pd_label_lama.to_csv('../data/labled_lama_big_in_set_new.csv', encoding='utf-8')
pd_label[0:10]
Out[80]:
Plot
In [81]:
shapefile_dir = '../data/Lama_Mocogne_area'
shapefile_name ='ex_UWxd3ovMDcQwonMEUFyC4SE9R84bu_osm_roads.shp'
shapefile_roads = os.path.join(shapefile_dir, shapefile_name)
# Here we create the GeoPandas Dataset:
dfp_original = gpd.GeoDataFrame.from_file(shapefile_roads)
# Rename all streets tagged with "None" to "no_name"
dfp_original.loc[dfp_original['name'].isnull(), 'name'] = 'no_name'
dfp_original = dfp_original.to_crs({'init': 'epsg:3395'})
dfp=pd.read_csv('../data/labled_lama_big_in_set_new.csv',index_col='Progressive_index',encoding='utf-8')
df_gp = gpd.GeoDataFrame.from_file('../data/cleaned_dataset_lamamocogne_new.shp')
df_gp_blue=df_gp.copy()
df_gp_blue=df_gp_blue.set_index('name')
df_gp_red=df_gp.copy()
df_gp_red=df_gp_red.set_index('name')
for row in tqdm(dfp.iterrows()):
    name = row[1]['name']
    if row[1]['lable'] == 1:
        df_gp_blue.drop(name, inplace=True)
    else:
        df_gp_red.drop(name, inplace=True)
In [82]:
fig=plt.figure()
fig, axes = plt.subplots(1, 2)
ax_blue=df_gp_blue.plot(ax=axes[0],color='blue',linewidth=1.0)
ax_red=df_gp_red.plot(ax=axes[1],color='red',linewidth=1.0)
In [83]:
mplleaflet.show(fig=ax_blue.figure, crs=dfp_original.crs);
In [84]:
#NY
shapefile_dir = '../data/newyork'
shapefile_name ='ex_LrzRkmVP8Si2zoixHivCQeGvUaTaJ_osm_roads.shp'
shapefile_roads = os.path.join(shapefile_dir, shapefile_name)
# Here we create the GeoPandas Dataset:
dfp_original = gpd.GeoDataFrame.from_file(shapefile_roads)
# Rename all streets tagged with "None" to "no_name"
dfp_original.loc[dfp_original['name'].isnull(), 'name'] = 'no_name'
dfp_original = dfp_original.to_crs({'init': 'epsg:3395'})
dfp=pd.read_csv('../data/labled_new_in_set_lama_big.csv',index_col='Progressive_index',encoding='utf-8')
df_gp = gpd.GeoDataFrame.from_file('../data/cleaned_dataset_newyork.shp')
df_gp_blue=df_gp.copy()
df_gp_blue=df_gp_blue.set_index('name')
df_gp_red=df_gp.copy()
df_gp_red=df_gp_red.set_index('name')
for row in tqdm(dfp.iterrows()):
    name = row[1]['name']
    if row[1]['lable'] == 1:
        df_gp_blue.drop(name, inplace=True)
    else:
        df_gp_red.drop(name, inplace=True)
In [85]:
fig=plt.figure()
fig, axes = plt.subplots(1, 2)
ax_blue=df_gp_blue.plot(ax=axes[0],color='blue',linewidth=1.0)
ax_red=df_gp_red.plot(ax=axes[1],color='red',linewidth=1.0)
In [86]:
mplleaflet.show(fig=ax_blue.figure, crs=dfp_original.crs);
Budapest is the capital of Hungary. It was formed in the 19th century by merging the formerly separate cities of Buda and Pest, which lie on opposite banks of the Danube. Out of curiosity, we also ran our algorithm on the roads of Budapest.
The results show that on the left side of the river the roads are mostly blue, while on the right side more of the roads are red. This is consistent with the fact that these were once two different cities.