2017 NCTU Data Mining HW2

0416037 李家安

Info

Schema

  • Station id, name, lat, lng
  • Station's flow id, time, in-flow, out-flow

Task

  • Use the algorithms listed below:
    • Spatial clustering
      • stations’ geo-information
      • K-Means (k=?)
        • elbow method
      • DBSCAN (eps=?, min_samples=?, metric=?)
    • Temporal clustering
      • in-flow and out-flow data in the first week
      • Agglomerative Clustering (affinity=?)
      • PCA (n_components=?) => Agglomerative Clustering(affinity=?)
    • Other

Need

Make some observations, compare the different methods and parameters, explain the results, and check whether the output meets your expectations.

Data Query

Connect SQL


In [1]:
from sqlalchemy import create_engine

engine = create_engine('mysql://calee:110010@localhost/citybike')

Load Station info


In [2]:
import pandas as pd

station = pd.read_sql_table(table_name='station', con=engine)
station


/usr/local/lib/python3.5/dist-packages/pandas/core/computation/__init__.py:18: UserWarning: The installed version of numexpr 2.4.3 is not supported in pandas and will be not be used
The minimum supported version is 2.4.6

  ver=ver, min_ver=_MIN_NUMEXPR_VERSION), UserWarning)
Out[2]:
id name latitude longitude
0 72 W 52 St & 11 Ave 40.767272 -73.993929
1 79 Franklin St & W Broadway 40.719116 -74.006667
2 82 St James Pl & Pearl St 40.711174 -74.000165
3 83 Atlantic Ave & Fort Greene Pl 40.683826 -73.976323
4 116 W 17 St & 8 Ave 40.741776 -74.001497
5 119 Park Ave & St Edwards St 40.696089 -73.978034
6 120 Lexington Ave & Classon Ave 40.686768 -73.959282
7 127 Barrow St & Hudson St 40.731724 -74.006744
8 128 MacDougal St & Prince St 40.727103 -74.002971
9 143 Clinton St & Joralemon St 40.692395 -73.993379
10 144 Nassau St & Navy St 40.698399 -73.980689
11 146 Hudson St & Reade St 40.716250 -74.009106
12 150 E 2 St & Avenue C 40.720874 -73.980858
13 151 Cleveland Pl & Spring St 40.722104 -73.997249
14 152 Warren St & Church St 40.714740 -74.009106
15 153 E 40 St & 5 Ave 40.752062 -73.981632
16 157 Henry St & Atlantic Ave 40.690893 -73.996123
17 161 LaGuardia Pl & W 3 St 40.729170 -73.998102
18 164 E 47 St & 2 Ave 40.753231 -73.970325
19 167 E 39 St & 3 Ave 40.748901 -73.976049
20 168 W 18 St & 6 Ave 40.739713 -73.994564
21 173 Broadway & W 49 St 40.760683 -73.984527
22 174 E 25 St & 1 Ave 40.738177 -73.977387
23 195 Liberty St & Broadway 40.709056 -74.010434
24 212 W 16 St & The High Line 40.743349 -74.006818
25 216 Columbia Heights & Cranberry St 40.700379 -73.995481
26 217 Old Fulton St 40.702772 -73.993836
27 223 W 13 St & 7 Ave 40.737815 -73.999947
28 228 E 48 St & 3 Ave 40.754601 -73.971879
29 229 Great Jones St 40.727434 -73.993790
... ... ... ... ...
604 3436 Greenwich St & Hubert St 40.721319 -74.010065
605 3437 Riverside Dr & W 91 St 40.793135 -73.977004
606 3438 E 76 St & 3 Ave 40.772249 -73.958421
607 3440 Fulton St & Adams St 40.692418 -73.989495
608 3441 10 Hudson Yards 40.752957 -74.002640
609 3443 W 52 St & 6 Ave 40.761330 -73.979820
610 3445 Riverside Dr & W 89 St 40.791812 -73.978602
611 3447 E 71 St & 1 Ave 40.767034 -73.956227
612 3449 Eckford St & Engert Ave 40.721463 -73.948009
613 3452 Bayard St & Leonard St 40.719156 -73.948854
614 3453 Devoe St & Lorimer St 40.713352 -73.949103
615 3454 Leonard St & Maujer St 40.710369 -73.947060
616 3455 Schermerhorn St & 3 Ave 40.686808 -73.980362
617 3456 Jackson St & Leonard St 40.716380 -73.948213
618 3457 E 58 St & Madison Ave 40.763026 -73.972095
619 3458 W 55 St & 6 Ave 40.763094 -73.978350
620 3459 E 53 St & 3 Ave 40.757632 -73.969306
621 3461 Murray St & Greenwich St 40.714852 -74.011223
622 3462 E 44 St & 2 Ave 40.751184 -73.971387
623 3463 E 16 St & Irving Pl 40.735367 -73.987974
624 3464 W 37 St & Broadway 40.752271 -73.987706
625 3466 W 45 St & 6 Ave 40.756687 -73.982577
626 3468 NYCBS Depot - STY - Garage 4 40.730380 -73.974750
627 3469 India St & West St 40.731814 -73.959950
628 3470 Gowanus Tech Station 40.669802 -73.994905
629 3472 W 15 St & 10 Ave 40.742754 -74.007474
630 3474 6 Ave & Spring St 40.725256 -74.004121
631 3476 Norman Ave & Leonard St 40.725770 -73.950740
632 3477 39 St & 2 Ave - Citi Bike HQ at Industry City 40.655400 -74.010628
633 3478 2 Ave & 36 St - Citi Bike HQ at Industry City 40.657089 -74.008702

634 rows × 4 columns

Load station flow info


In [3]:
import pandas as pd
import datetime
import numpy as np

query = '''
SELECT * FROM in_out
WHERE time between "2017-07-01" AND "2017-07-07 23:59:59"
ORDER BY time;
'''

flow = pd.read_sql_query(query, con=engine)
flow['time'] = flow['time'] - datetime.datetime(2017, 7, 1, 0, 0, 0)
flow['time'] = flow['time'] / np.timedelta64(1, 's')
flow['time'] = flow[['time']].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
flow


Out[3]:
id time in out
0 546 0.0 0 3
1 3131 0.0 1 0
2 212 0.0 3 1
3 398 0.0 0 3
4 3120 0.0 0 2
5 3466 0.0 1 0
6 153 0.0 1 0
7 3171 0.0 1 0
8 416 0.0 2 0
9 502 0.0 2 2
10 243 0.0 0 1
11 3233 0.0 2 0
12 532 0.0 0 1
13 344 0.0 0 1
14 3381 0.0 1 0
15 474 0.0 1 0
16 468 0.0 0 1
17 505 0.0 5 1
18 3242 0.0 2 0
19 3067 0.0 3 1
20 492 0.0 0 1
21 308 0.0 0 1
22 457 0.0 3 1
23 79 0.0 0 1
24 3300 0.0 1 2
25 440 0.0 1 0
26 514 0.0 0 2
27 3090 0.0 4 0
28 3305 0.0 0 1
29 3125 0.0 1 0
... ... ... ... ...
131107 394 1.0 2 3
131108 3461 1.0 2 2
131109 348 1.0 1 1
131110 3443 1.0 0 2
131111 363 1.0 0 2
131112 3325 1.0 1 1
131113 3142 1.0 3 2
131114 432 1.0 3 2
131115 405 1.0 2 1
131116 3238 1.0 0 1
131117 173 1.0 2 1
131118 3319 1.0 0 1
131119 344 1.0 0 1
131120 3359 1.0 1 0
131121 410 1.0 1 1
131122 469 1.0 7 0
131123 244 1.0 0 3
131124 358 1.0 0 1
131125 3162 1.0 0 1
131126 3463 1.0 3 0
131127 453 1.0 3 2
131128 450 1.0 1 4
131129 3119 1.0 1 0
131130 3082 1.0 0 2
131131 308 1.0 1 0
131132 3175 1.0 1 1
131133 3120 1.0 0 1
131134 387 1.0 2 0
131135 265 1.0 2 1
131136 467 1.0 2 1

131137 rows × 4 columns

Clustering

K-Means

Use k-means to find the cluster centers and plot them.


In [4]:
from mpl_toolkits.basemap import Basemap
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

my_map = Basemap(projection='merc', lat_0=40.7, lon_0=-73.98,
    resolution = 'h', area_thresh = 0.01,
    llcrnrlon=-74.1, llcrnrlat=40.64,
    urcrnrlon=-73.9, urcrnrlat=40.85)

def draw_kmeans(num):
    # Fit k-means on (latitude, longitude); each row of cluster_centers_ is one centroid.
    kmeans = KMeans(n_clusters=num, random_state=0).fit(station[['latitude','longitude']])
    area = kmeans.cluster_centers_

    lon = station['longitude'].tolist()
    lat = station['latitude'].tolist()
    # Column 0 holds latitude and column 1 longitude, matching the fit order.
    cen_lon = area[:, 1].tolist()
    cen_lat = area[:, 0].tolist()

    fig = plt.figure(frameon=False)
    fig.set_size_inches(15,10)

    my_map.drawcoastlines()
    my_map.drawcountries()
    my_map.fillcontinents(color='coral')
    my_map.drawmapboundary()

    x,y = my_map(lon, lat)
    my_map.plot(x, y, 'bo', markersize=1)

    x,y = my_map(cen_lon, cen_lat)
    my_map.plot(x, y, 'g+', markersize=10)

    plt.show()
    
for num in (2, 3, 5, 10, 15, 20, 30):
    draw_kmeans(num)


/usr/local/lib/python3.5/dist-packages/mpl_toolkits/basemap/__init__.py:1767: MatplotlibDeprecationWarning: The get_axis_bgcolor function was deprecated in version 2.0. Use get_facecolor instead.
  axisbgc = ax.get_axis_bgcolor()
/usr/local/lib/python3.5/dist-packages/mpl_toolkits/basemap/__init__.py:1623: MatplotlibDeprecationWarning: The get_axis_bgcolor function was deprecated in version 2.0. Use get_facecolor instead.
  fill_color = ax.get_axis_bgcolor()
/usr/local/lib/python3.5/dist-packages/mpl_toolkits/basemap/__init__.py:1698: MatplotlibDeprecationWarning: The axesPatch function was deprecated in version 2.1. Use Axes.patch instead.
  limb = ax.axesPatch
/usr/local/lib/python3.5/dist-packages/mpl_toolkits/basemap/__init__.py:3260: MatplotlibDeprecationWarning: The ishold function was deprecated in version 2.0.
  b = ax.ishold()
/usr/local/lib/python3.5/dist-packages/mpl_toolkits/basemap/__init__.py:3269: MatplotlibDeprecationWarning: axes.hold is deprecated.
    See the API Changes document (http://matplotlib.org/api/api_changes.html)
    for more details.
  ax.hold(b)

Judging from the plots above, setting n to 20 looks like a reasonable choice for k-means: it roughly separates each of the small groups of stations.
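The task list also mentions the elbow method, which picks k from the k-means inertia rather than by eye. The sketch below is only an illustration under assumptions: it runs on synthetic points standing in for station[['latitude', 'longitude']], and the helper name elbow_inertias and the list of k values are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for station[['latitude', 'longitude']]: three loose groups.
rng = np.random.RandomState(0)
centers = np.array([[40.70, -74.00], [40.75, -73.98], [40.69, -73.95]])
points = np.vstack([c + 0.005 * rng.randn(100, 2) for c in centers])

def elbow_inertias(data, ks):
    """Return the k-means inertia (within-cluster sum of squares) for each k."""
    return [KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_
            for k in ks]

ks = [1, 2, 3, 5, 10]
inertias = elbow_inertias(points, ks)
for k, inertia in zip(ks, inertias):
    print(k, inertia)
```

Plotting ks against inertias and looking for the point where the curve stops dropping sharply gives a candidate k. On the real station data the bend may be much less clear than on these synthetic groups.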

DBScan

Cluster the stations with DBSCAN and plot the result.


In [6]:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
from sklearn.cluster import DBSCAN

def draw_DBSCAN(method, min_sup, eps):
    dbscan = DBSCAN(eps=eps, min_samples=min_sup, metric=method).fit(station[['latitude','longitude']])
    lb = dbscan.labels_
    lbmx = max(lb)
    # Group coordinates by label; index 0 holds the noise points (label -1).
    llat = []
    llon = []
    for i in range(-1, lbmx+1):
        mask = lb == i
        llon.append(station['longitude'][mask].tolist())
        llat.append(station['latitude'][mask].tolist())
    # draw
    my_map = Basemap(projection='merc', lat_0=40.7, lon_0=-73.98,
        resolution = 'h', area_thresh = 0.01,
        llcrnrlon=-74.1, llcrnrlat=40.64,
        urcrnrlon=-73.9, urcrnrlat=40.85)

    lon = station['longitude'].tolist()
    lat = station['latitude'].tolist()
    labels = station['id'].tolist()

    fig = plt.figure(frameon=False)
    fig.set_size_inches(15,10)

    my_map.drawcoastlines()
    my_map.drawcountries()
    my_map.fillcontinents(color='coral')
    my_map.drawmapboundary()

    colors = cm.rainbow(np.linspace(0, 1, lbmx+2))
    #print(colors)
    #colors = np.array([0,0,0,1]) + colors
    #print(colors)

    for i in range(lbmx+2):
        x,y = my_map(llon[i], llat[i])
        my_map.plot(x, y, color=colors[i], markersize=2, marker='o', linestyle='')

    plt.show()

for method in ['euclidean']:
    for eps in (0.002, 0.004, 0.006):
        for min_sup in (5, 10, 15):
#             if eps == 0.004 and min_sup == 5: continue
            print('method=', method)
            print('eps=', eps)
            print('min_support=', min_sup)
            draw_DBSCAN(method=method, min_sup=min_sup, eps=eps)


method= euclidean
eps= 0.002
min_support= 5
method= euclidean
eps= 0.002
min_support= 10
method= euclidean
eps= 0.002
min_support= 15
method= euclidean
eps= 0.004
min_support= 5
method= euclidean
eps= 0.004
min_support= 10
method= euclidean
eps= 0.004
min_support= 15
method= euclidean
eps= 0.006
min_support= 5
method= euclidean
eps= 0.006
min_support= 10
method= euclidean
eps= 0.006
min_support= 15

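The plots above show that when eps and min_support (DBSCAN's min_samples) are too low, the result tends to be very scattered clusters with little discriminating power; when eps and min_support are too large, all the points tend to be merged into a single region and no real clustering happens.

For this data set, the settings eps = 0.004 with min_support = 2, eps = 0.006 with min_support = 10, eps = 0.006 with min_support = 5, eps = 0.06 with min_support = 2, and eps = 0.06 with min_support = 1 all seem to produce reasonably good clusters.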
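The runs above use the euclidean metric directly on degrees, so eps is a distance in degrees rather than on the ground. Since the task leaves metric open, one alternative is scikit-learn's haversine metric, which lets eps be chosen in meters. This is a minimal sketch under assumptions: the coordinates are synthetic, the helper name dbscan_haversine is made up, and the 500 m radius is an arbitrary illustration, not a tuned value.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, an approximation

# Synthetic stand-in for station coordinates: two tight groups of 20 points,
# each jittered by at most 0.001 degrees around a made-up center.
rng = np.random.RandomState(0)
group_a = np.array([40.75, -73.99]) + rng.uniform(-0.001, 0.001, size=(20, 2))
group_b = np.array([40.68, -73.97]) + rng.uniform(-0.001, 0.001, size=(20, 2))
coords_deg = np.vstack([group_a, group_b])  # columns: latitude, longitude

def dbscan_haversine(coords_deg, eps_m, min_samples):
    """Run DBSCAN with great-circle distance; eps_m is a radius in meters."""
    model = DBSCAN(eps=eps_m / EARTH_RADIUS_M, min_samples=min_samples,
                   metric='haversine', algorithm='ball_tree')
    # The haversine metric expects [latitude, longitude] in radians.
    return model.fit(np.radians(coords_deg)).labels_

labels = dbscan_haversine(coords_deg, eps_m=500, min_samples=5)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)
```

Counting clusters and noise points this way also gives a number to compare across (eps, min_samples) settings, instead of judging each map by eye.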

Temporal clustering

Agglomerative

Extract the in-flow, out-flow, and timestamp values and cluster them with Agglomerative clustering.


In [7]:
from sklearn.cluster import AgglomerativeClustering

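array = []
# Try every linkage/affinity pair; ward linkage only supports euclidean distance.
for method in ('ward', 'complete', 'average'):
    for aff in ('euclidean', 'l1', 'l2', 'manhattan', 'cosine'):
        if method == 'ward' and not aff == 'euclidean':
            continue
        agg = AgglomerativeClustering(n_clusters=2, affinity=aff, linkage=method, memory="agg_cache")
        # Fit on 20,000 rows of flow, skipping row 0. Note that all four columns
        # (id, time, in, out) are passed in, so the station id also
        # contributes to the distances.
        agg.fit(flow.iloc[1:][:20000])
        array.append(agg.labels_)
array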


Out[7]:
[array([0, 1, 1, ..., 1, 1, 1]),
 array([0, 1, 1, ..., 1, 1, 1]),
 array([0, 1, 1, ..., 1, 1, 1]),
 array([0, 1, 1, ..., 1, 1, 1]),
 array([0, 1, 1, ..., 1, 1, 1]),
 array([0, 0, 0, ..., 0, 0, 0]),
 array([0, 1, 1, ..., 1, 1, 1]),
 array([0, 1, 1, ..., 1, 1, 1]),
 array([0, 1, 1, ..., 1, 1, 1]),
 array([0, 1, 1, ..., 1, 1, 1]),
 array([0, 0, 0, ..., 0, 0, 0])]

In [8]:
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection
import matplotlib.pyplot as plt
import numpy as np


def draw(labels):
    # The labels were fitted on flow.iloc[1:][:20000], so take the same rows
    # here to keep each label aligned with its record.
    labels = np.asarray(labels)
    sub = flow.iloc[1:][:len(labels)]

    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')

    # One scatter call per cluster: label 0 in red, everything else in blue.
    first = sub[labels == 0]
    rest = sub[labels != 0]
    ax.scatter(first['time'], first['in'], first['out'], c='r', marker='o', s=1)
    ax.scatter(rest['time'], rest['in'], rest['out'], c='b', marker='^', s=1)

    ax.set_xlabel('timestamp')
    ax.set_ylabel('in-flow')
    ax.set_zlabel('out-flow')

    plt.show()

for tmp in array:
    draw(tmp)


Whichever algorithm is used, there does not seem to be any clear difference in the resulting distributions.
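One way to back this impression with numbers is to count how many records fall into each cluster for every linkage, since the truncated label arrays in the output do not show the sizes. The sketch below assumes synthetic data standing in for the time, in, and out columns; the helper name cluster_sizes is made up, and the distance is left at scikit-learn's default so the example does not depend on the affinity argument.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in for flow[['time', 'in', 'out']]: 300 low-traffic records
# plus 5 very busy ones.
rng = np.random.RandomState(0)
quiet = np.column_stack([rng.rand(300), rng.poisson(1, 300), rng.poisson(1, 300)])
busy = np.column_stack([rng.rand(5), rng.poisson(100, 5), rng.poisson(100, 5)])
features = np.vstack([quiet, busy]).astype(float)

def cluster_sizes(data, linkage):
    """Fit a 2-cluster agglomerative model and return the size of each cluster."""
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(data)
    return np.bincount(labels, minlength=2)

sizes = {linkage: cluster_sizes(features, linkage)
         for linkage in ('ward', 'complete', 'average')}
for linkage, counts in sizes.items():
    print(linkage, counts)
```

If one cluster holds only a handful of records for every setting, the methods are mostly isolating a few outliers, which would be consistent with plots that look alike.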

PCA + Agglomerative

Try reducing the dimensionality with PCA first.


In [23]:
from sklearn.decomposition import PCA

pca = PCA(n_components=3).fit(flow.iloc[1:])
print(pca.components_)


[[  9.99999833e-01   2.35510194e-06  -4.10646657e-04  -4.07537190e-04]
 [  5.78176873e-04   7.37263743e-03   7.33414737e-01   6.79741226e-01]
 [ -1.97542706e-05  -9.93559792e-04   6.79764859e-01  -7.33429443e-01]]
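In the output above, the first component loads almost entirely on the first input column (0.9999998), which is the id column given the column order id, time, in, out. That is what PCA does when one column has a much larger numeric range than the rest. A possible next step, sketched here under assumptions, is to standardize the columns before PCA and then feed the reduced data to Agglomerative clustering. The data below is synthetic, and n_components=2 and n_clusters=2 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in for the four flow columns (id, time, in, out).
rng = np.random.RandomState(0)
n = 500
data = np.column_stack([
    rng.randint(72, 3479, n),   # id: large numeric range
    rng.rand(n),                # time: already scaled to [0, 1]
    rng.poisson(2, n),          # in-flow counts
    rng.poisson(2, n),          # out-flow counts
]).astype(float)

# Without scaling, the first component follows the widest-ranging column.
raw_pca = PCA(n_components=2).fit(data)

# With scaling, every column has unit variance before PCA.
scaled = StandardScaler().fit_transform(data)
pca = PCA(n_components=2).fit(scaled)
reduced = pca.transform(scaled)

labels = AgglomerativeClustering(n_clusters=2, linkage='ward').fit_predict(reduced)
print(np.abs(raw_pca.components_[0]).round(3))
print(pca.explained_variance_ratio_)
print(np.bincount(labels))
```

Dropping the id column before the fit would be another option, since a station identifier is not a flow measurement.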
