In [1]:

    
%matplotlib inline
%load_ext autoreload
%autoreload 2
import numpy as np
import matplotlib.pyplot as plt
import os
import geopandas as gpd
import math
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import pandas as pd
import mplleaflet
from shapely import geometry, ops
from sklearn import preprocessing, decomposition
from scipy import sparse, stats, spatial
import scipy.sparse.linalg
from tqdm import tqdm_notebook as tqdm
from pygsp import graphs, filters, plotting

plt.rcParams['figure.figsize'] = (30,15)

BUDA-PEST

The assumption that old parts of a city can be distinguished from the new parts, solely based on the road network was tested by anaylizing the road newtwork of different cities.

The dataset used for this specific task was gethered from OpenStreetMap (freely-reusable geospatial data) by extracting the specifc city or region of interest. The data was cleaned, by only keeping drivable or walkable roads, thereby discarding for instance "railways". In addition, a one-to-one correspondance between street name and the specific street was enforced by renaming streets, if geographically distict streets would show the same name. For each road, a multiple of features, such as street length, street curvature etc. were extracted. Those features were then used to construct a graph based on the similarity matrix build from a weighted distance metric. Using the constructed graph, the Fiedler eigenvector was calculated to finally obtain a data clusterization. The initial assumption was verified to a certain extend.

This notebook visualizes the process pipe-line by exploring the street network of Switzerland's capital Bern.

1 | Getting The Dataset

The Dataset, which will be the road network of a certain city, is downloaded from https://mapzen.com/data/metro-extracts/ This website can produce extracts from the OpenStreetMap Dataset.

The following cities or areas where extracted using mapzen:

Bern
Budapest
New York
Bologna
Denver
Lama Mocogno
Lama Mocogno Area
Seoul
London

GeoPandas was used to work with the shapefiles provided by the extracts. GeoPandas extends the datatypes used by pandas to allow spatial operations on geometric types.

For more information on GeoPandas: http://geopandas.org/



In [2]:

    
# Loading the downloaded shapefile

# Bern
shapefile_dir = '../data/bern_switzerland.imposm-shapefiles'
shapefile_name ='bern_switzerland_osm_roads.shp'

shapefile_roads = os.path.join(shapefile_dir, shapefile_name)

# Here we create the GeoPandas Dataset:
df = gpd.GeoDataFrame.from_file(shapefile_roads)
# Renaming all streets tag with "None" to "no_name
df.name[df.name.isnull() == True] = 'no_name'

# In order to have meaningful distances, we use the Mercator projection 
# that preserves the angles and the shapes of small objects
# For more information visit: http://geopandas.org/projections.html
df = df.to_crs({'init': 'epsg:3395'})

1.1 | Exploring the Data

Let's have a look a the data.

Column 1: GeoPandas id
Column 2: osm_id:
Column 3: type within osm_class
Column 4: name of the line
Column 5: if the line is inside a tunnel
Column 6: if the line is on a bridge
Column 7: if the line is a one way street
Column 8: basically the drawing layer, more preciscly the field in osm2pgsql datamodel that map OSM Layers in OSM tags to a number in osm2pgsql model
Column 9: tag to denote if only specific vehicles or people have access
Column 10: describe the service or services offered by a car shop/garage, or to drill into more detail about a types of "service" roads, railways or waterways
Column 11: class

We can see that GeoPanda nicely structures the data provided by OpenStreetMap. In the last column, the geometry -a line together with its coordinates- can be seen. We can now work with GeoDataFrame just like a pandas dataframe.



In [3]:

    
df[0:3]









    Out[3]:







  
    
      
      id
      osm_id
      type
      name
      tunnel
      bridge
      oneway
      ref
      z_order
      access
      service
      class
      geometry
    
  
  
    
      0
      1
      2955839.0
      rail
      no_name
      0
      0
      0
      None
      7
      None
      None
      railway
      LINESTRING (834180.7799455991 5913106.66249309...
    
    
      1
      2
      3230982.0
      residential
      Hochfeldstrasse
      0
      0
      0
      None
      3
      None
      None
      highway
      LINESTRING (826867.6168769542 5903563.51628549...
    
    
      2
      3
      3728938.0
      primary
      Bernstrasse
      1
      0
      0
      1;12
      -14
      None
      None
      highway
      LINESTRING (834386.4750492289 5913916.23293858...

...and this is how the data looks like:



In [4]:

    
ax = df.plot(color='black',figsize=(10, 10))

To see it printed on a real map, this code could be used: mplleaflet.show(fig=ax.figure, crs=df.crs)

Here is an extract as an image:

2 | Cleaning The Dataset

2.1| Discard All Non-Roads

The following cell takes the geopandas dataframe and only keeps specific types of entries.

First, the classes are restricted to being only of class "highway" (which in OSM stands for roads).

Then within this class, it was choosen to

keep residential streets
living streets
pedestrian streets (often found in the city center)
trunk ( high performance or high importance roads that don't meet the requirement for motorway)
primary, secondary and tertiary roads



In [5]:

    
total_rows_before_cleaning = len(df.index)



In [6]:

    
# Get Only Highway classes
df=df[df['class']=='highway']

sel = (df['type']=='residential')| (df['type']=='living_street') | (df['type']=='pedestrian')| (df['type']=='trunk') | (df['type']=='primary') | (df['type']=='secondary') | (df['type']=='tertiary') 
df = df[sel]
#check if the types are there
df[sel]["type"].unique()









    



C:\Users\sonia\Anaconda3\lib\site-packages\geopandas\geodataframe.py:398: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  result = super(GeoDataFrame, self).__getitem__(key)






    Out[6]:





array(['residential', 'primary', 'secondary', 'tertiary', 'living_street',
       'pedestrian', 'trunk'], dtype=object)



In [7]:

    
total_rows_after_cleaning = len(df.index)



In [8]:

    
1-total_rows_after_cleaning/total_rows_before_cleaning









    Out[8]:





0.8019922879177378

By focusing our interest on streets only we reduced the dataset by slightly more than 80%.

2.2| Split And Rename Streets

Which Are Geometrically Seperated But Have The Same Name

Inspiered by: https://stackoverflow.com/questions/47988849/split-lines-based-on-names (a different naming procedure was implemented )

For all streets, which have the same name, their streets are clustered using DBSCAN clustering techniques.

Then each cluster is identified and the associated streets are renamed in the following way: oldName_clusterNumber



In [9]:

    
def split_rename(temp):
    '''Splits streets which are geometrically apart but have the same name
    INPUT: GeoPandas Dataframe of the streets which all have the same name
    OUTPUT: A renamed GeoPandas Dataframe based on the geographic locality of the street.
    Naming convention: oldName_clusterNumber'''
    ids = []
    coords = []

    for row in temp.itertuples():

        geom = np.asarray(row.geometry)
        # Record the geometric loaction of every point
        coords.extend(geom)
        # Record the id of every point
        ids.extend([row.id] * geom.shape[0])

    # Initilaize DBSCAN 
    clust = DBSCAN(eps=0.08)

    # Normalize the coordinates and fit the clusters
    clusters = clust.fit_predict(StandardScaler().fit_transform(coords))

    # store in a pandas dataframe 
    points_clusters = pd.DataFrame({"id":ids, "cluster":clusters})
    #print(points_clusters)
    # One Street Chunk Id should only be assigned to one cluster
    # -> Assign to the cluster where the most points are
    id_cluster = points_clusters.groupby('id').cluster.apply(lambda x : x.mode()).reset_index().drop('level_1',1)

    # Rename 
    temp_add_cluster = temp.merge(id_cluster, on='id')
    temp_add_cluster['name'] = temp_add_cluster.name + '_' + temp_add_cluster.cluster.astype(str)
    temp_add_cluster = temp_add_cluster.drop('cluster', 1)

    return temp_add_cluster



In [10]:

    
df_split = df.groupby('name').apply(split_rename)

2.2.1 Visualization: "Bahnhofstrasse"

Here the previous code is demonstrated:

Bahnhofstrasse is a very common name for the street which leads to (or used to lead) to the railway station. As there are multiple railway stations around Bern, this street name appears naturally often. However, a one-to-one corespondance between street name and street should be achieved.

Without spliting and renaming
With spliting and renaming



In [11]:

    
# 1 
grouped = df.groupby('name')
streets = {}
frequency = {}
for key, values in grouped:
    streets[key] = values
    
    
no_split = streets['Bahnhofstrasse'].plot(color='red',linewidth=5.0)



In [12]:

    
# 2
grouped = df_split.groupby('name')
streets = {}
for key, values in grouped:
    streets[key] = values
    
fig=plt.figure()
ax = streets['Bahnhofstrasse_0'].plot(color='red',linewidth=5.0)
ax = streets['Bahnhofstrasse_1'].plot(ax=ax, color='green',linewidth=5.0);
ax = streets['Bahnhofstrasse_2'].plot(ax=ax, color='orange',linewidth=5.0);
ax = streets['Bahnhofstrasse_3'].plot(ax=ax, color='black',linewidth=5.0);
ax = streets['Bahnhofstrasse_4'].plot(ax=ax, color='yellow',linewidth=5.0);
ax = streets['Bahnhofstrasse_5'].plot(ax=ax, color='purple',linewidth=5.0);
ax = streets['Bahnhofstrasse_6'].plot(ax=ax, color='magenta',linewidth=5.0);
ax = streets['Bahnhofstrasse_7'].plot(ax=ax, color='grey',linewidth=5.0);









    



C:\Users\sonia\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: 'name' is both a column name and an index level.
Defaulting to column but this will raise an ambiguity error in a future version
  






    





<matplotlib.figure.Figure at 0x1d33cc88278>

2.3| Export Cleaned Dataset

To avoid having to collect the data everytime the kernel is restarted, the dataset is saved as a CSV file.



In [14]:

    
df_split.to_file("../data/cleaned_dataset_bern.shp")

3| Feature Creation

To deal with the heterogeneity in dimention and hugeness of the data points, we need to represent them using a set of features appositely chosen to be maximally relevant to the task.



In [15]:

    
df = gpd.GeoDataFrame.from_file("../data/cleaned_dataset_bern.shp")

3.1| Merging the Streeet Pieces into Streets With the Same Name

To store all the features, we firstly initialize a "classical" Pandas Dataframe with the Street Names



In [16]:

    
street_names= {"name":df.name.unique()}
dfp_features = pd.DataFrame(street_names)
dfp_features.index.name='Progressive_index'



In [17]:

    
dfp_features[0:5]









    Out[17]:







  
    
      
      name
    
    
      Progressive_index
      
    
  
  
    
      0
      ARA-Strasse_-1
    
    
      1
      Aarbergergasse_-1
    
    
      2
      AarbÃ¼hlstrasse_-1
    
    
      3
      Aarehalde_-1
    
    
      4
      Aarestrasse_-1

3.2| Length Of The Street

We can obtain the length of each street by summing up the length of the street chunks with the same name



In [18]:

    
street_length = []
for street_name in tqdm(dfp_features['name']):
    sub_gpdf = df[df.name == street_name]
    street_length_curr = []
    for index, row in sub_gpdf.iterrows():
        street_length_curr.append(row['geometry'].length)
    street_length.append(np.sum(street_length_curr))
dfp_features['length'] = street_length

Let's have a look at the data ordered by length



In [19]:

    
dfp_features_temp = dfp_features.sort_values('length',ascending=False)
dfp_features_temp[0:7]









    Out[19]:







  
    
      
      name
      length
    
    
      Progressive_index
      
      
    
  
  
    
      176
      Bernstrasse_4
      11793.307927
    
    
      679
      Freiburgstrasse_0
      10984.081925
    
    
      1299
      Lindentalstrasse_-1
      8140.113976
    
    
      2253
      Thunstrasse_0
      7975.498903
    
    
      252
      Bolligenstrasse_0
      7153.922945
    
    
      181
      Bernstrasse_9
      6668.876028
    
    
      2586
      no_name_11
      6506.178714

There are some streets that are unnamed. But since they have been splitted in connected pieces by the splitting algorithm, we can treat them as normal streets (the name is the only information that is missing).

3.3| Curvature

Let's determine the curvature of each of the streets.

By the law of cosine, in a triangle of sides $d_1$, $d_2$ and $d_3$, the angle $\alpha$ opposit to the side $d_3$ is given by:

$$acos(\alpha) = \frac{d_1^2 + d_2^2 - d_3^2}{2d_1d_2}.$$

Note that for a straight line the corresponging curvature will be $\pi$ (the output of that formula is included in the interval $[0,\pi]$, the more curvy the line is, the more it tends towards zero.

Since we'll compute an angle for every triplet of coordinates of each street, we need to aggregate these coefficients to have a fixed representation for each street with the same name. This task can be accomplished by exploiting 7 summary statistics: minimum, maximum, median and the first 4 moments (mean, standard deviation, skew and kurtosis). Note that the computation of the curvature angle is performed by firstly computing an angle for each of the triplets of coordinates of each chunk of streets with the same name and then merging all the angles of the street in a list (curvature_curr).



In [20]:

    
std_curvature=[]
mean_curvature = []
max_curvature = []
min_curvature = []
skew_curvature = []
kurtosis_curvature = []
median_curvature = []

for street_name in tqdm(dfp_features['name']):
    sub_gpdf=df[df.name==street_name]
    curvature_curr=[]
    for index, row in sub_gpdf.iterrows():
        line_x= row['geometry'].xy[0]
        line_y= row['geometry'].xy[1]
        temp_ang=[]
        if (len(line_x)<=2):
            temp_ang.append(np.pi)
        else:
            for i in range(len(line_x)-2):
                d1=np.sqrt((line_x[i]-line_x[i+1])**2+(line_y[i]-line_y[i+1])**2)
                d2=np.sqrt((line_x[i+1]-line_x[i+2])**2+(line_y[i+1]-line_y[i+2])**2)
                d3=np.sqrt((line_x[i]-line_x[i+2])**2+(line_y[i]-line_y[i+2])**2)
                temp_ang.append(math.acos(math.fmod(((d1**2+d2**2-d3**2)/(2*d1*d2)),1)))
        curvature_curr.extend(temp_ang)
    std_curvature.append(np.std(curvature_curr))
    mean_curvature.append(np.mean(curvature_curr))
    max_curvature.append(np.max(curvature_curr))
    min_curvature.append(np.min(curvature_curr))
    skew_curvature.append(scipy.stats.skew(curvature_curr))
    kurtosis_curvature.append(scipy.stats.kurtosis(curvature_curr))
    median_curvature.append(np.median(curvature_curr))
    

dfp_features['std_curvature'] = std_curvature
dfp_features['mean_curvature'] = mean_curvature
dfp_features['max_curvature'] = max_curvature
dfp_features['min_curvature'] = min_curvature
dfp_features['skew_curvature'] = skew_curvature
dfp_features['kurtosis_curvature'] = kurtosis_curvature
dfp_features['median_curvature'] = median_curvature

3.4| Oneway

Boolean value that is one in case the street traffic has to moves in a single direction.



In [21]:

    
dfp_features['oneway'] = df.oneway

3.5| Flight length

One of the features that can be extracted from the collection of coordinates that we have for each street, is the distance between the starting and ending point. In case a street is not a contiguous line, the method linemerge() will merge only the contiguous portions into a component of a new multi-linestring. In this case, the flight length is computed as the sum of the flight lengths of each connected line.



In [22]:

    
street_flight_distance=[]
for street_name in tqdm(dfp_features['name']):
    sub_gpdf = df[df.name == street_name]
    lines = []
    for index, row in sub_gpdf.iterrows():
        line = row['geometry']
        lines.append(line)
    multi_line = geometry.MultiLineString(lines)
    merged_line = ops.linemerge(multi_line)
    if isinstance(merged_line, geometry.multilinestring.MultiLineString) == True:
        lengths = []
        for chunks in merged_line:
            init_x = chunks.xy[0][0]
            init_y = chunks.xy[1][0]
            final_x = chunks.xy[0][-1]
            final_y = chunks.xy[1][-1]
            lengths.append(np.sqrt((init_x-final_x)**2 + (init_y-final_y)**2))
        street_flight_distance.append(np.sum(lengths))
        
    else:
        init_x = merged_line.xy[0][0]
        init_y = merged_line.xy[1][0]
        final_x = merged_line.xy[0][-1]
        final_y = merged_line.xy[1][-1]    
        street_flight_distance.append(np.sqrt((init_x-final_x)**2 + (init_y-final_y)**2))
    
    
dfp_features['flight_distance']=street_flight_distance

As expected, the streets with the biggest curvature (that means in our convention that they are straight lines) have a length that is equal to the flight length



In [23]:

    
dfp_features_temp = dfp_features.sort_values('mean_curvature', ascending=False )
dfp_features_temp[0:5]









    Out[23]:







  
    
      
      name
      length
      std_curvature
      mean_curvature
      max_curvature
      min_curvature
      skew_curvature
      kurtosis_curvature
      median_curvature
      oneway
      flight_distance
    
    
      Progressive_index
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      2439
      Weissensteinstrasse_0
      270.547434
      0.0
      3.141593
      3.141593
      3.141593
      0.0
      -3.0
      3.141593
      0
      270.547208
    
    
      2518
      Worbstrasse_-1
      19.625991
      0.0
      3.141593
      3.141593
      3.141593
      0.0
      -3.0
      3.141593
      0
      19.625991
    
    
      1840
      RÃ¼tiweg_0
      12.361306
      0.0
      3.141593
      3.141593
      3.141593
      0.0
      -3.0
      3.141593
      0
      12.361306
    
    
      784
      Grossacher_-1
      24.989844
      0.0
      3.141593
      3.141593
      3.141593
      0.0
      -3.0
      3.141593
      0
      24.989844
    
    
      1738
      Raintalweg_-1
      45.083040
      0.0
      3.141593
      3.141593
      3.141593
      0.0
      -3.0
      3.141593
      0
      45.083040

The opposite for the curvy streets:



In [24]:

    
dfp_features_temp = dfp_features.sort_values('mean_curvature')
dfp_features_temp[0:5]









    Out[24]:







  
    
      
      name
      length
      std_curvature
      mean_curvature
      max_curvature
      min_curvature
      skew_curvature
      kurtosis_curvature
      median_curvature
      oneway
      flight_distance
    
    
      Progressive_index
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1741
      Rainweg_-1
      127.278679
      0.000000
      1.269074
      1.269074
      1.269074
      0.0
      -3.0
      1.269074
      0
      79.385880
    
    
      1549
      Nelkenweg_-1
      176.353850
      0.000000
      1.389613
      1.389613
      1.389613
      0.0
      -3.0
      1.389613
      0
      113.139175
    
    
      2007
      Seerosenstrasse_-1
      247.667410
      0.061194
      1.536513
      1.597707
      1.475318
      0.0
      -2.0
      1.536513
      0
      199.641619
    
    
      786
      Grossackerweg_-1
      148.241237
      0.000000
      1.589940
      1.589940
      1.589940
      0.0
      -3.0
      1.589940
      0
      106.151360
    
    
      108
      Bahnhofmatte_-1
      91.384568
      0.015333
      1.667803
      1.683136
      1.652470
      0.0
      -2.0
      1.667803
      0
      73.508840

Let's have a look at the data ordered by flight length



In [25]:

    
dfp_features_temp = dfp_features.sort_values('flight_distance')
dfp_features_temp[0:10]









    Out[25]:







  
    
      
      name
      length
      std_curvature
      mean_curvature
      max_curvature
      min_curvature
      skew_curvature
      kurtosis_curvature
      median_curvature
      oneway
      flight_distance
    
    
      Progressive_index
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      872
      Hallmattstrasse_0
      88.479177
      0.063947
      2.797525
      2.877147
      2.646033
      -1.073955
      0.327356
      2.822536
      0
      0.0
    
    
      2675
      no_name_100
      70.366378
      0.025497
      2.831483
      2.898344
      2.789635
      0.617011
      1.014966
      2.830027
      0
      0.0
    
    
      2143
      Steinbachstrasse_2
      85.699370
      0.100609
      2.883891
      3.141593
      2.713228
      0.667169
      0.106906
      2.856430
      1
      0.0
    
    
      2578
      no_name_4
      149.296622
      0.138137
      2.745294
      3.134596
      2.531204
      0.608711
      0.317866
      2.733639
      0
      0.0
    
    
      981
      Hubelmattstrasse_2
      52.073222
      0.177815
      2.849940
      3.141593
      2.347747
      -0.765759
      1.332291
      2.835432
      0
      0.0
    
    
      2575
      no_name_1
      91.221479
      0.019748
      2.828464
      2.864245
      2.805554
      0.430145
      -1.066868
      2.828831
      0
      0.0
    
    
      1179
      Kreisel SteinhÃ¶lzli_-1
      92.154250
      0.030541
      2.828423
      2.887306
      2.789778
      0.829138
      -0.411841
      2.824638
      0
      0.0
    
    
      2646
      no_name_71
      202.014067
      0.039547
      2.930642
      3.017461
      2.818854
      -0.635829
      0.413638
      2.932554
      0
      0.0
    
    
      754
      Gilberte-de-Courgenay-Platz_-1
      365.147808
      0.381430
      2.747225
      3.141389
      1.705908
      -1.563873
      2.348194
      2.757519
      0
      0.0
    
    
      2588
      no_name_13
      98.495765
      0.011089
      2.827426
      2.840002
      2.807103
      -0.853359
      -0.916933
      2.832641
      0
      0.0

So many zeros! Let's have a look at them:



In [26]:

    
grouped = df.groupby('name')
streets = {}

fig=plt.figure()
fig, axes = plt.subplots(3,3)

for key, values in grouped:
    streets[key] = values
streets['Hallmattstrasse_0'].plot(ax=axes[0][0], color='red',linewidth=5.0) 
streets['Kreisel SteinhÃ¶lzli_-1'].plot(ax=axes[0][1], color='red',linewidth=5.0)
streets['Gilberte-de-Courgenay-Platz_-1'].plot(ax=axes[0][2], color='red',linewidth=5.0)
streets['Hubelmattstrasse_2'].plot(ax=axes[1][0], color='red',linewidth=5.0)
streets['Neuhausplatz_-1'].plot(ax=axes[1][1], color='red',linewidth=5.0)
streets['BÃ¼hlplatz_-1'].plot(ax=axes[1][2], color='red',linewidth=5.0)
streets['PourtalÃ¨sstrasse_1'].plot(ax=axes[2][0], color='red',linewidth=5.0)
streets['Rubigenstrasse_0'].plot(ax=axes[2][1], color='red',linewidth=5.0)
streets['Unterdettigenstrasse_1'].plot(ax=axes[2][2], color='red',linewidth=5.0)









    Out[26]:





<matplotlib.axes._subplots.AxesSubplot at 0x1d33c8b9320>






    





<matplotlib.figure.Figure at 0x1d33cc381d0>

3.6| Flight length over distance

Another feature that can be interesting to look at is the ratio between flight distance and length computed on the curve.



In [27]:

    
sfdp=[] #street_flight_distance_percentage
for street in tqdm(dfp_features.iterrows()):
    d=street[1]['length']
    fd=street[1]['flight_distance']
    sfdp.append(fd/d)
dfp_features['percentage_flight_distance']=sfdp

Let's see if we have meaningful data (percentage of flight distance should be at most 1):



In [28]:

    
dfp_features_temp = dfp_features.sort_values('percentage_flight_distance',ascending=False)
dfp_features_temp[0:5]









    Out[28]:







  
    
      
      name
      length
      std_curvature
      mean_curvature
      max_curvature
      min_curvature
      skew_curvature
      kurtosis_curvature
      median_curvature
      oneway
      flight_distance
      percentage_flight_distance
    
    
      Progressive_index
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      2655
      no_name_80
      158.322908
      0.0
      3.141593
      3.141593
      3.141593
      0.0
      -3.0
      3.141593
      0
      158.322908
      1.0
    
    
      1738
      Raintalweg_-1
      45.083040
      0.0
      3.141593
      3.141593
      3.141593
      0.0
      -3.0
      3.141593
      0
      45.083040
      1.0
    
    
      305
      Brunnmattstrasse_4
      54.344766
      0.0
      3.141593
      3.141593
      3.141593
      0.0
      -3.0
      3.141593
      0
      54.344766
      1.0
    
    
      2522
      Wydenhofstrasse_-1
      115.013988
      0.0
      3.141593
      3.141593
      3.141593
      0.0
      -3.0
      3.141593
      0
      115.013988
      1.0
    
    
      2042
      Sodweg_-1
      177.119457
      0.0
      3.141593
      3.141593
      3.141593
      0.0
      -3.0
      3.141593
      0
      177.119457
      1.0

3.7| Derivative

It is also possible to computed the derivative of the coordinate points of each street. As before, we firstly compute the derivative for each of the chunk of streets with the same name and then merging all the derivatives of the street in two lists (derivative_curr_x and derivative_curr_y). Again, as the derivate of each line has the same length of the line itself, to be able to compare derivates of different streets, we aggregate the coefficients by using the previous described 7 summary statistics (minimum, maximum, median and the first 4 moments).



In [29]:

    
std_derivative_x = []
mean_derivative_x = []
max_derivative_x = []
min_derivative_x = []
skew_derivative_x = []
kurtosis_derivative_x = []
median_derivative_x = []
std_derivative_y = []
mean_derivative_y = []
max_derivative_y = []
min_derivative_y = []
skew_derivative_y = []
kurtosis_derivative_y = []
median_derivative_y = []


for street_name in tqdm(dfp_features['name']):
    sub_gpdf=df[df.name==street_name]
    derivative_curr_x = []
    derivative_curr_y = []
    for index, row in sub_gpdf.iterrows():
        line_x= row['geometry'].xy[0]
        line_y= row['geometry'].xy[1]
        derivative_x = np.gradient(line_x)
        derivative_y = np.gradient(line_y)       
        derivative_curr_x.extend(derivative_x)
        derivative_curr_y.extend(derivative_y)

    std_derivative_x.append(np.std(derivative_curr_x))
    mean_derivative_x.append(np.mean(derivative_curr_x))
    max_derivative_x.append(np.max(derivative_curr_x))
    min_derivative_x.append(np.min(derivative_curr_x))
    skew_derivative_x.append(scipy.stats.skew(derivative_curr_x))
    kurtosis_derivative_x.append(scipy.stats.kurtosis(derivative_curr_x))
    median_derivative_x.append(np.median(derivative_curr_x))
    std_derivative_y.append(np.std(derivative_curr_y))
    mean_derivative_y.append(np.mean(derivative_curr_y))
    max_derivative_y.append(np.max(derivative_curr_y))
    min_derivative_y.append(np.min(derivative_curr_y))
    skew_derivative_y.append(scipy.stats.skew(derivative_curr_y))
    kurtosis_derivative_y.append(scipy.stats.kurtosis(derivative_curr_y))
    median_derivative_y.append(np.median(derivative_curr_y))
    
dfp_features['std_derivative_x'] = std_derivative_x
dfp_features['mean_derivative_x'] = mean_derivative_x
dfp_features['max_derivative_x'] = max_derivative_x
dfp_features['kurtosis_derivative_x'] = kurtosis_derivative_x
dfp_features['median_derivative_x'] = median_derivative_x
dfp_features['skew_derivative_x'] = skew_derivative_x
dfp_features['min_derivative_x'] = min_derivative_x
dfp_features['std_derivative_y'] = std_derivative_y
dfp_features['mean_derivative_y'] = mean_derivative_y
dfp_features['max_derivative_y'] = max_derivative_y
dfp_features['min_derivative_y'] = min_derivative_y
dfp_features['kurtosis_derivative_y'] = kurtosis_derivative_y
dfp_features['median_derivative_y'] = median_derivative_y
dfp_features['skew_derivative_y'] = skew_derivative_y

3.8| Export In CSV



In [30]:

    
dfp_features.to_csv('../data/features_bern.csv', encoding='utf-8')

4 | Graph Creation

Read csv



In [31]:

    
dfp=pd.read_csv('../data/features_bern.csv',index_col='Progressive_index',encoding='utf-8')



In [32]:

    
dfp[0:5]









    Out[32]:







  
    
      
      name
      length
      std_curvature
      mean_curvature
      max_curvature
      min_curvature
      skew_curvature
      kurtosis_curvature
      median_curvature
      oneway
      ...
      median_derivative_x
      skew_derivative_x
      min_derivative_x
      std_derivative_y
      mean_derivative_y
      max_derivative_y
      min_derivative_y
      kurtosis_derivative_y
      median_derivative_y
      skew_derivative_y
    
    
      Progressive_index
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      0
      ARA-Strasse_-1
      380.789011
      0.109108
      3.015231
      3.141580
      2.738821
      -1.071367
      0.418183
      3.047087
      0
      ...
      11.509408
      1.551789
      4.436744
      7.534558
      10.528047
      26.556762
      1.908593
      -0.560718
      8.680791
      0.762723
    
    
      1
      Aarbergergasse_-1
      401.958124
      0.025441
      3.114920
      3.141403
      3.056522
      -0.733484
      -0.543281
      3.127427
      0
      ...
      -14.443911
      -0.774278
      -52.494473
      4.164734
      6.184609
      17.727013
      1.137742
      0.306972
      4.421553
      0.924600
    
    
      2
      AarbÃ¼hlstrasse_-1
      169.771636
      0.111169
      3.008006
      3.139367
      2.871630
      -0.113310
      -1.749004
      3.028933
      0
      ...
      -4.114835
      -0.720305
      -11.616711
      8.599054
      -19.360872
      -9.058012
      -38.942856
      0.426518
      -17.005812
      -1.016285
    
    
      3
      Aarehalde_-1
      198.987378
      0.314167
      2.893636
      3.136984
      2.450034
      -0.697142
      -1.500000
      3.093891
      1
      ...
      27.530207
      0.126653
      -0.289251
      18.178381
      35.591284
      64.651918
      14.806284
      -1.157604
      38.679548
      0.335735
    
    
      4
      Aarestrasse_-1
      1600.259622
      0.114674
      3.062512
      3.141515
      2.534391
      -2.928118
      9.684166
      3.097732
      1
      ...
      -12.153226
      -0.751334
      -43.210434
      26.111982
      31.960392
      95.933273
      2.699986
      -0.129362
      27.555015
      0.970110
    
  

5 rows × 26 columns

4.1| Weighting

Because not all the features have the same importance we need to assign them a weight, a multiplier that "gives more relevance" to one feature respect to one other. For example: curvature and percentage flight distance are more important than absolute distance. Even if they have heterogeneous ranges (curvature spans from 0 to pi for example) the normalization can make the weighting effective because it's applied using total mean and variance over the weighted vector



In [33]:

    
#tuning parameters
l=2          #coeff. for length
c=2          #coeff. for mean curvature
f=2          #coeff. for flight-distance
f_r=3.5      #coeff. for ratio flight-distance
o=1.5        #coeff. for one-way 
s=1.5        #coeff. for std_curvature
mi=0.25      #coeff. for min curvature
ma=3.5       #coeff. for max curvature
skew=1       #coeff. for skew curvature
kurt=0.5     #coeff. for kurtosis curvature
med=1.5      #coeff. for median curvature
dexstd = 1   #coeff. for std_derivative_x
dexmed = 1   #coeff. for median_derivative_x
dexskew=0.5  #coeff. for skew_derivative_x
dexkurt=0.25 #coeff. for kurtosis derivative x
dexmi = 1    #coeff. for min derivative x 
dexma = 1.5  #coeff. for max derivative x
dexmean = 1  #coeff. for mean derivative x
deymean = 1  #coeff. for mean derivative y
deymi = 1    #coeff. for min derivative y
deyma = 1.5  #coeff. for max derivative y
deystd = 1   #coeff. for std derivative y
deymed = 1   #coeff. for median derivative y
deyskew=0.5  #coeff. for skew derivative y
deykurt=0.25 #coeff. for kurtosis derivate y
bias=0.01    #generic bias


dfp['length']=dfp['length']*l
dfp['mean_curvature']=dfp['mean_curvature']*c
dfp['flight_distance']=dfp['flight_distance']*f
dfp['percentage_flight_distance']=dfp['percentage_flight_distance']*f_r
dfp['oneway']=dfp['oneway']*o
dfp['std_curvature']=dfp['std_curvature']*s
dfp['min_curvature']=dfp['min_curvature']*mi
dfp['max_curvature']=dfp['max_curvature']*ma
dfp['skew_curvature'] = dfp['skew_curvature']*skew
dfp['kurtosis_curvature'] = dfp['kurtosis_curvature']*kurt
dfp['median_curvature'] = dfp['median_curvature']*med
dfp['std_derivative_x'] = dfp['std_derivative_x']*dexstd
dfp['median_derivative_x'] = dfp['median_derivative_x']*dexmed
dfp['skew_derivative_x'] = dfp['skew_derivative_x']*dexskew
dfp['kurtosis_derivative_x'] = dfp['kurtosis_derivative_x']*dexkurt
dfp['min_derivative_x'] = dfp['min_derivative_x']*dexmi
dfp['max_derivative_x'] = dfp['max_derivative_x']*dexma
dfp['mean_derivative_x'] = dfp['mean_derivative_x']*dexmean
dfp['mean_derivative_y'] = dfp['mean_derivative_y']*deymean
dfp['min_derivative_y'] = dfp['min_derivative_y']*deymi
dfp['max_derivative_y'] = dfp['max_derivative_y']*deyma
dfp['std_derivative_y'] = dfp['std_derivative_y']*deystd
dfp['median_derivative_y'] = dfp['median_derivative_y']*deymed
dfp['skew_derivative_y'] = dfp['skew_derivative_y']*deyskew
dfp['kurtosis_derivative_y'] = dfp['kurtosis_derivative_y']*deykurt

4.2| Normalization



In [34]:

    
def normalize_min_max(df):
    return (df-df.min())/(df.max()-df.min())

def normalize_mean_std(df):
    return (df-df.mean())/df.std()



In [35]:

    
dfp_norm=dfp.copy()
dfp_norm.drop('name', axis=1, inplace=True)

dfp_norm=normalize_mean_std(dfp_norm);

We can also drop some of the features



In [36]:

    
#dfp_norm.drop('max_curvature', axis=1, inplace=True)
#dfp_norm.drop('min_curvature', axis=1, inplace=True)
#dfp_norm.drop('median_curvature', axis=1, inplace=True)
dfp_norm.drop('median_derivative_x', axis=1, inplace=True)
dfp_norm.drop('median_derivative_y', axis=1, inplace=True)
#dfp_norm.drop('skew_curvature', axis=1, inplace=True)
dfp_norm.drop('skew_derivative_x', axis=1, inplace=True)
dfp_norm.drop('skew_derivative_y', axis=1, inplace=True)
dfp_norm.drop('kurtosis_derivative_x', axis=1, inplace=True)
dfp_norm.drop('kurtosis_derivative_y', axis=1, inplace=True)
dfp_norm.drop('kurtosis_curvature', axis=1, inplace=True)

4.3| Symilarity matrix

Now, we'll create a similarity matrix between streets. Every street street is represented as a node and each edge has a weight that represents the similarity between two streets.



In [37]:

    
print('Matrix dimentions: {} x {}'.format(len(dfp_norm),len(dfp_norm)))









    



Matrix dimentions: 2677 x 2677

We firstly need to define the similarity measure



In [38]:

    
dist =spatial.distance.squareform(spatial.distance.pdist(dfp_norm.values,'braycurtis')) 
#dist =spatial.distance.squareform(spatial.distance.pdist(dfp_norm.values,'euclidean')) 
#dist =spatial.distance.squareform(spatial.distance.pdist(dfp_norm.values,'cityblock')) 
#dist =spatial.distance.squareform(spatial.distance.pdist(dfp_norm.values,'minkowski')) 
#dist =spatial.distance.squareform(spatial.distance.pdist(dfp_norm.values,'cosine'))

Check if it is symmetric and how many zeros do we have



In [39]:

    
print('Number of zero elements: {}'.format(np.sum(dist==0)))









    



Number of zero elements: 2695

...and check if we have a symmetric matrix



In [40]:

    
is_sym = np.sum(np.around((dist-np.transpose(dist)),14)!=0)
# a correct implementation should pass the below test.
assert is_sym == 0

We can also plot an histogram of the distances.



In [41]:

    
plt.hist(dist.reshape(-1), bins=1000);

4.4| Kernel

A kernel is used to compute the weight of each edge (strong edges for the streets whose distance is short). One of the most used, that works well in this case too is the Gaussian kernel.



In [42]:

    
W=np.exp(-(dist**2)/(dist.mean()**2))
np.fill_diagonal(W, 0) #fill the diagonal with zeros to avoid self-connections

We can also visualise the results (histogram)



In [43]:

    
plt.hist(W.reshape(-1), bins=200);



In [44]:

    
plt.spy(W)









    Out[44]:





<matplotlib.image.AxesImage at 0x1d33cc5b208>

4.5| KNN

To sparsify the matrix, we eliminate the N-K weakest edges for each node (where N is dimention of the matrix).



In [45]:

    
NEIGHBOURS = 130 
indeces = np.argsort(W, axis=1)
indeces = indeces[:,:len(W)-NEIGHBOURS]
indeces = np.arange(W.shape[0])[:,None], indeces
W[indeces] = 0

Let's symmetrize it:



In [46]:

    
W_sparse=np.maximum(W,np.transpose(W))

And visualise it:



In [47]:

    
plt.spy(W_sparse)









    Out[47]:





<matplotlib.image.AxesImage at 0x1d33cc6b390>

Weight distribution:



In [48]:

    
degrees = W_sparse.sum(0)

plt.hist(np.count_nonzero(W_sparse, 0), bins=100, label='non-weighted graph degree distribution')
plt.hist(degrees, bins=100, label='weighted graph degree distribution')
plt.legend(fontsize = 'xx-large')









    Out[48]:





<matplotlib.legend.Legend at 0x1d33cafd5c0>

4.6| Graph



In [49]:

    
G = graphs.Graph(W_sparse)
print('{} nodes, {} edges'.format(G.N, G.Ne))









    



2677 nodes, 233218 edges



In [50]:

    
G.is_connected()









    Out[50]:





True

4.7| Laplacian

Since the clusterization will need the Fiedler vector, we compute the normalized Laplacian



In [51]:

    
G.compute_laplacian('normalized')
laplacian = G.L

We compute the very first eigenvalues (to avoid a full eigendecomposition on big matrices)



In [52]:

    
eigenvalues, eigenvectors = sparse.linalg.eigsh(laplacian,k=6,which='SM')
print(eigenvalues)









    



[ -6.82133705e-17   1.33395326e-01   1.95137560e-01   2.10906166e-01
   2.22253578e-01   3.53695448e-01]

Note that the first eigenvalue is almost zero (meaning that the graph is connected)



In [53]:

    
print(eigenvalues[0])









    



-6.82133704612e-17



In [54]:

    
plt.plot(eigenvalues, '.-', markersize=15);

4.8| Visulazing the Fiedler vector



In [55]:

    
plt.plot(eigenvectors[:, 1])









    Out[55]:





[<matplotlib.lines.Line2D at 0x1d33c9ac828>]

4.9| Graph embedding

We can use the value of the second eigenvector of the Laplacian matrix as the x coordinate of a node and the value of the third eigenvector as the y coordinate. The color is indicated by the sign of the Fiedler vector.



In [56]:

    
labels = (eigenvectors[:,1]<0).astype(int)
x = eigenvectors[:,1]
y = eigenvectors[:,2]

plt.scatter(x, y, c=labels, cmap='RdBu', alpha=0.5);

Now, we'll store the values of the sign of the Fiedler vector together with the name of the corresponding street on a dataframe to be able to visualise the results of the clusterization on a map



In [57]:

    
signal=labels*2-1



In [58]:

    
labled={}
for index,row in dfp.iterrows():
    labled[str(row['name'])]=signal[index]

name={'name':list(labled.keys())}
pd_label=pd.DataFrame(name)
pd_label['lable']=labled.values()
pd_label.index.name='Progressive_index'
pd_label.to_csv('../data/labled_bern.csv', encoding='utf-8')
pd_label[0:5]









    Out[58]:







  
    
      
      name
      lable
    
    
      Progressive_index
      
      
    
  
  
    
      0
      ARA-Strasse_-1
      -1
    
    
      1
      Aarbergergasse_-1
      -1
    
    
      2
      AarbÃ¼hlstrasse_-1
      -1
    
    
      3
      Aarehalde_-1
      1
    
    
      4
      Aarestrasse_-1
      1

Let's visualize the Fiedler vector over the graph



In [59]:

    
G.set_coordinates()



In [60]:

    
G.plot_signal(eigenvectors[:,1], vertex_size=20)

...and its sign



In [61]:

    
G.plot_signal(signal, vertex_size=20)

5 | Results



In [62]:

    
# Loading the downloaded shapefile
# Bern
shapefile_dir = '../data/bern_switzerland.imposm-shapefiles'
shapefile_name ='bern_switzerland_osm_roads.shp'

#Budapest
#shapefile_dir = '../data/budapest'
#shapefile_name ='budapest_hungary_osm_roads.shp'

#Avia
#shapefile_dir = '../data/avia'
#shapefile_name ='ex_S9FVmMQ5ZQ7RRCSD68Hm5SNFNvwcB_osm_roads.shp'

shapefile_roads = os.path.join(shapefile_dir, shapefile_name)

# Here we create the GeoPandas Dataset:
dfp_original = gpd.GeoDataFrame.from_file(shapefile_roads)
# Renaming all streets tag with "None"  to "no_name
dfp_original.name[dfp_original.name.isnull() == True] = 'no_name'
dfp_original = dfp_original.to_crs({'init': 'epsg:3395'})



In [63]:

    
dfp=pd.read_csv('../data/labled_bern.csv',index_col='Progressive_index',encoding='utf-8')
df_gp = gpd.GeoDataFrame.from_file('../data/cleaned_dataset_bern.shp')
df_gp_blue=df_gp.copy()
df_gp_blue=df_gp_blue.set_index('name')
df_gp_red=df_gp.copy()
df_gp_red=df_gp_red.set_index('name')



In [64]:

    
for row in tqdm(dfp.iterrows()):
    name=row[1]['name']
    if row[1]['lable']==1:
        df_gp_blue.drop(name,inplace=True)
    else:
        df_gp_red.drop(name,inplace=True)



In [65]:

    
fig=plt.figure()
fig, axes = plt.subplots(1, 2)

ax_blue=df_gp_blue.plot(ax=axes[0],color='blue',linewidth=1.0)
ax_red=df_gp_red.plot(ax=axes[1],color='red',linewidth=1.0)









    





<matplotlib.figure.Figure at 0x1d33cde64e0>

Finnaly, here is the resulting seperation of Bern:

6 | Algorithm performance

To test the efficacy of the algorithm we can let it run on a dataframe obtained by merging two cities: a very new one (New York) and a rural locality (Lama Mocogno, a municipality in the Province of Modena, Italy). The data is ordered: first 4283 raws refer to New York and all the others to Lama Mocogno.



In [67]:

    
df=pd.read_csv('../data/features_two_cities.csv',index_col='index',encoding='utf-8')
dfp=pd.read_csv('../data/features_twocities_lamamocogne_newyork_new.csv',encoding='utf-8')

Let's quickly redo the previous steps for this dataset



In [68]:

    
#we create the distances
dist =spatial.distance.squareform(spatial.distance.pdist(df.values,'braycurtis')) 
#build the weight matrix
W=np.exp(-(dist**2)/(dist.mean()**2))
np.fill_diagonal(W, 0)
#sparsify the matrix with KNN
NEIGHBOURS = 1205 
indeces = np.argsort(W, axis=1)
indeces = indeces[:,:len(W)-NEIGHBOURS]
indeces = np.arange(W.shape[0])[:,None], indeces
W[indeces] = 0
#symmetrize it
W_sparse=np.maximum(W,np.transpose(W))

Let's visualize W



In [69]:

    
plt.spy(W_sparse)









    Out[69]:





<matplotlib.image.AxesImage at 0x1d33cf2c400>

Note that there are shorter distances between streets from the same city.

Compute the graph Laplacian



In [70]:

    
G = graphs.Graph(W_sparse)
G.compute_laplacian('normalized')
laplacian = G.L

Compute eigenvectors and eigenvalues



In [71]:

    
eigenvalues, eigenvectors = sparse.linalg.eigsh(laplacian,k=6,which='SM')
print(eigenvalues)
print(eigenvalues[1])









    



[  6.91030522e-17   1.01336189e-01   3.77339490e-01   4.46523210e-01
   5.04400081e-01   5.73444806e-01]
0.101336188806

As before, we use the value of the second eigenvector of the Laplacian matrix as the x coordinate of a node and the value of the third eigenvector as the y coordinate. The color is indicated by the sign of the Fiedler vector.



In [72]:

    
labels = (eigenvectors[:,1]<0).astype(int)
x = eigenvectors[:,1]
y = eigenvectors[:,2]

plt.scatter(x, y, c=labels, cmap='RdBu', alpha=0.5);

Let's retrieve the ground truth city categorization. We've built a dataframe containing all the names of the streets of the two cities and the corresponding lable (1 for New York and -1 for Lama)



In [73]:

    
pd_label = pd.read_csv('../data/labled_features_twocities_lamamocogne_newyork_new.csv', encoding='utf-8')



In [74]:

    
pd_label.head(2)









    Out[74]:







  
    
      
      Progressive_index
      name
      lable
    
  
  
    
      0
      0
      0
      1.0
    
    
      1
      1
      1
      1.0



In [75]:

    
pd_label.tail(2)









    Out[75]:







  
    
      
      Progressive_index
      name
      lable
    
  
  
    
      6462
      6462
      6462
      -1.0
    
    
      6463
      6463
      6463
      -1.0



In [76]:

    
cities = preprocessing.LabelEncoder().fit_transform(pd_label['lable'])

Now, let's plot it, but with the color indicated by the ground truth labelling



In [77]:

    
labels = (eigenvectors[:,1]<0).astype(int)
x = eigenvectors[:,1]
y = eigenvectors[:,2]

plt.scatter(x, y, c=cities, cmap='RdBu', alpha=0.5);

Let's compute how many streets were wrongly classified by the Fiedler vector according to the ground truth



In [78]:

    
err = np.count_nonzero(cities - labels)
err = err if err < len(labels) / 2 else len(labels) - err
print('{} errors ({:.2%})'.format(err, err/len(labels)))









    



837 errors (12.95%)

Finally, we can plot the two cities on the map as before



In [79]:

    
signal=labels*2-1



In [80]:

    
labled={}
for index,row in dfp.iterrows():
    labled[str(row['name'])]=signal[index]

name={'name':list(labled.keys())}
pd_label=pd.DataFrame(name)
pd_label['lable']=labled.values()
pd_label.index.name='Progressive_index'
pd_lable_new=pd_label[0:4282].copy();
pd_label_lama=pd_label[4283:len(pd_label)].copy();
pd_label.to_csv('../data/labled_new_lama_big.csv', encoding='utf-8')
pd_lable_new.to_csv('../data/labled_new_in_set_lama_big.csv', encoding='utf-8')
pd_label_lama.to_csv('../data/labled_lama_big_in_set_new.csv', encoding='utf-8')
pd_label[0:10]









    Out[80]:







  
    
      
      name
      lable
    
    
      Progressive_index
      
      
    
  
  
    
      0
      10th Avenue_0
      1
    
    
      1
      10th Avenue_1
      1
    
    
      2
      10th Street_0
      1
    
    
      3
      10th Street_-1
      1
    
    
      4
      10th Street_1
      1
    
    
      5
      10th Street_2
      1
    
    
      6
      11th Avenue_0
      1
    
    
      7
      11th Avenue_-1
      1
    
    
      8
      11th Avenue_4
      1
    
    
      9
      11th Avenue_1
      1

Plot



In [81]:

    
shapefile_dir = '../data/Lama_Mocogne_area'
shapefile_name ='ex_UWxd3ovMDcQwonMEUFyC4SE9R84bu_osm_roads.shp'
shapefile_roads = os.path.join(shapefile_dir, shapefile_name)

# Here we create the GeoPandas Dataset:
dfp_original = gpd.GeoDataFrame.from_file(shapefile_roads)
# Renaming all streets tag with "None"  to "no_name
dfp_original.name[dfp_original.name.isnull() == True] = 'no_name'
dfp_original = dfp_original.to_crs({'init': 'epsg:3395'})

dfp=pd.read_csv('../data/labled_lama_big_in_set_new.csv',index_col='Progressive_index',encoding='utf-8')
df_gp = gpd.GeoDataFrame.from_file('../data/cleaned_dataset_lamamocogne_new.shp')
df_gp_blue=df_gp.copy()
df_gp_blue=df_gp_blue.set_index('name')
df_gp_red=df_gp.copy()
df_gp_red=df_gp_red.set_index('name')

for row in tqdm(dfp.iterrows()):
    name=row[1]['name']
    if row[1]['lable']==1:
        df_gp_blue.drop(name,inplace=True)
    else:
        df_gp_red.drop(name,inplace=True)

Result of Lama Mocogno



In [82]:

    
fig=plt.figure()
fig, axes = plt.subplots(1, 2)

ax_blue=df_gp_blue.plot(ax=axes[0],color='blue',linewidth=1.0)
ax_red=df_gp_red.plot(ax=axes[1],color='red',linewidth=1.0)









    





<matplotlib.figure.Figure at 0x1d33ca85978>



In [83]:

    
mplleaflet.show(fig=ax_blue.figure, crs=dfp_original.crs);



In [84]:

    
#NY
shapefile_dir = '../data/newyork'
shapefile_name ='ex_LrzRkmVP8Si2zoixHivCQeGvUaTaJ_osm_roads.shp'

shapefile_roads = os.path.join(shapefile_dir, shapefile_name)

# Here we create the GeoPandas Dataset:
dfp_original = gpd.GeoDataFrame.from_file(shapefile_roads)
# Renaming all streets tag with "None"  to "no_name
dfp_original.name[dfp_original.name.isnull() == True] = 'no_name'
dfp_original = dfp_original.to_crs({'init': 'epsg:3395'})

dfp=pd.read_csv('../data/labled_new_in_set_lama_big.csv',index_col='Progressive_index',encoding='utf-8')
df_gp = gpd.GeoDataFrame.from_file('../data/cleaned_dataset_newyork.shp')
df_gp_blue=df_gp.copy()
df_gp_blue=df_gp_blue.set_index('name')
df_gp_red=df_gp.copy()
df_gp_red=df_gp_red.set_index('name')

for row in tqdm(dfp.iterrows()):
    name=row[1]['name']
    if row[1]['lable']==1:
        df_gp_blue.drop(name,inplace=True)
    else:
        df_gp_red.drop(name,inplace=True)

Result of New York



In [85]:

    
fig=plt.figure()
fig, axes = plt.subplots(1, 2)

ax_blue=df_gp_blue.plot(ax=axes[0],color='blue',linewidth=1.0)
ax_red=df_gp_red.plot(ax=axes[1],color='red',linewidth=1.0)









    





<matplotlib.figure.Figure at 0x1d33cc05dd8>



In [86]:

    
mplleaflet.show(fig=ax_blue.figure, crs=dfp_original.crs);

7 | Buda + Pest = Budapest

Budapest is the capital of the Hungury. There were two different cities and they merged to become Budapest in 19th Century. Because of curiosity, we have tried our algorithm on the roads of Budapest.

The results shows that on the left side of the river, roads are mostly in blue and on the right side there are more roads with red. This indicates that in fact they were two different cities.

	id	osm_id	type	name	tunnel	ref	z_order	access	service	class	geometry
0	1	2955839.0	rail	no_name	0	None	7	None	None	railway	LINESTRING (834180.7799455991 5913106.66249309...
1	2	3230982.0	residential	Hochfeldstrasse	0	None	3	None	None	highway	LINESTRING (826867.6168769542 5903563.51628549...
2	3	3728938.0	primary	Bernstrasse	1	1;12	-14	None	None	highway	LINESTRING (834386.4750492289 5913916.23293858...

	name
Progressive_index
0	ARA-Strasse_-1
1	Aarbergergasse_-1
2	AarbÃ¼hlstrasse_-1
3	Aarehalde_-1
4	Aarestrasse_-1

	name	length
Progressive_index
176	Bernstrasse_4	11793.307927
679	Freiburgstrasse_0	10984.081925
1299	Lindentalstrasse_-1	8140.113976
2253	Thunstrasse_0	7975.498903
252	Bolligenstrasse_0	7153.922945
181	Bernstrasse_9	6668.876028
2586	no_name_11	6506.178714

	name	length	std_curvature	mean_curvature	max_curvature	min_curvature	skew_curvature	kurtosis_curvature	median_curvature	oneway	flight_distance
Progressive_index
2439	Weissensteinstrasse_0	270.547434	0.0	3.141593	3.141593	3.141593	0.0	-3.0	3.141593	0	270.547208
2518	Worbstrasse_-1	19.625991	0.0	3.141593	3.141593	3.141593	0.0	-3.0	3.141593	0	19.625991
1840	RÃ¼tiweg_0	12.361306	0.0	3.141593	3.141593	3.141593	0.0	-3.0	3.141593	0	12.361306
784	Grossacher_-1	24.989844	0.0	3.141593	3.141593	3.141593	0.0	-3.0	3.141593	0	24.989844
1738	Raintalweg_-1	45.083040	0.0	3.141593	3.141593	3.141593	0.0	-3.0	3.141593	0	45.083040

	name	length	std_curvature	mean_curvature	max_curvature	min_curvature	skew_curvature	kurtosis_curvature	median_curvature	oneway	flight_distance
Progressive_index
1741	Rainweg_-1	127.278679	0.000000	1.269074	1.269074	1.269074	0.0	-3.0	1.269074	0	79.385880
1549	Nelkenweg_-1	176.353850	0.000000	1.389613	1.389613	1.389613	0.0	-3.0	1.389613	0	113.139175
2007	Seerosenstrasse_-1	247.667410	0.061194	1.536513	1.597707	1.475318	0.0	-2.0	1.536513	0	199.641619
786	Grossackerweg_-1	148.241237	0.000000	1.589940	1.589940	1.589940	0.0	-3.0	1.589940	0	106.151360
108	Bahnhofmatte_-1	91.384568	0.015333	1.667803	1.683136	1.652470	0.0	-2.0	1.667803	0	73.508840

	name	length	std_curvature	mean_curvature	max_curvature	min_curvature	skew_curvature	kurtosis_curvature	median_curvature	oneway	flight_distance
Progressive_index
872	Hallmattstrasse_0	88.479177	0.063947	2.797525	2.877147	2.646033	-1.073955	0.327356	2.822536	0	0.0
2675	no_name_100	70.366378	0.025497	2.831483	2.898344	2.789635	0.617011	1.014966	2.830027	0	0.0
2143	Steinbachstrasse_2	85.699370	0.100609	2.883891	3.141593	2.713228	0.667169	0.106906	2.856430	1	0.0
2578	no_name_4	149.296622	0.138137	2.745294	3.134596	2.531204	0.608711	0.317866	2.733639	0	0.0
981	Hubelmattstrasse_2	52.073222	0.177815	2.849940	3.141593	2.347747	-0.765759	1.332291	2.835432	0	0.0
2575	no_name_1	91.221479	0.019748	2.828464	2.864245	2.805554	0.430145	-1.066868	2.828831	0	0.0
1179	Kreisel SteinhÃ¶lzli_-1	92.154250	0.030541	2.828423	2.887306	2.789778	0.829138	-0.411841	2.824638	0	0.0
2646	no_name_71	202.014067	0.039547	2.930642	3.017461	2.818854	-0.635829	0.413638	2.932554	0	0.0
754	Gilberte-de-Courgenay-Platz_-1	365.147808	0.381430	2.747225	3.141389	1.705908	-1.563873	2.348194	2.757519	0	0.0
2588	no_name_13	98.495765	0.011089	2.827426	2.840002	2.807103	-0.853359	-0.916933	2.832641	0	0.0

	name	length	std_curvature	mean_curvature	max_curvature	min_curvature	skew_curvature	kurtosis_curvature	median_curvature	oneway	flight_distance	percentage_flight_distance
Progressive_index
2655	no_name_80	158.322908	0.0	3.141593	3.141593	3.141593	0.0	-3.0	3.141593	0	158.322908	1.0
1738	Raintalweg_-1	45.083040	0.0	3.141593	3.141593	3.141593	0.0	-3.0	3.141593	0	45.083040	1.0
305	Brunnmattstrasse_4	54.344766	0.0	3.141593	3.141593	3.141593	0.0	-3.0	3.141593	0	54.344766	1.0
2522	Wydenhofstrasse_-1	115.013988	0.0	3.141593	3.141593	3.141593	0.0	-3.0	3.141593	0	115.013988	1.0
2042	Sodweg_-1	177.119457	0.0	3.141593	3.141593	3.141593	0.0	-3.0	3.141593	0	177.119457	1.0

	name	length	std_curvature	mean_curvature	max_curvature	min_curvature	skew_curvature	kurtosis_curvature	median_curvature	oneway	...	median_derivative_x	skew_derivative_x	min_derivative_x	std_derivative_y	mean_derivative_y	max_derivative_y	min_derivative_y	kurtosis_derivative_y	median_derivative_y	skew_derivative_y
Progressive_index
0	ARA-Strasse_-1	380.789011	0.109108	3.015231	3.141580	2.738821	-1.071367	0.418183	3.047087	0	...	11.509408	1.551789	4.436744	7.534558	10.528047	26.556762	1.908593	-0.560718	8.680791	0.762723
1	Aarbergergasse_-1	401.958124	0.025441	3.114920	3.141403	3.056522	-0.733484	-0.543281	3.127427	0	...	-14.443911	-0.774278	-52.494473	4.164734	6.184609	17.727013	1.137742	0.306972	4.421553	0.924600
2	AarbÃ¼hlstrasse_-1	169.771636	0.111169	3.008006	3.139367	2.871630	-0.113310	-1.749004	3.028933	0	...	-4.114835	-0.720305	-11.616711	8.599054	-19.360872	-9.058012	-38.942856	0.426518	-17.005812	-1.016285
3	Aarehalde_-1	198.987378	0.314167	2.893636	3.136984	2.450034	-0.697142	-1.500000	3.093891	1	...	27.530207	0.126653	-0.289251	18.178381	35.591284	64.651918	14.806284	-1.157604	38.679548	0.335735
4	Aarestrasse_-1	1600.259622	0.114674	3.062512	3.141515	2.534391	-2.928118	9.684166	3.097732	1	...	-12.153226	-0.751334	-43.210434	26.111982	31.960392	95.933273	2.699986	-0.129362	27.555015	0.970110

	name	lable
Progressive_index
0	10th Avenue_0	1
1	10th Avenue_1	1
2	10th Street_0	1
3	10th Street_-1	1
4	10th Street_1	1
5	10th Street_2	1
6	11th Avenue_0	1
7	11th Avenue_-1	1
8	11th Avenue_4	1
9	11th Avenue_1	1