2017 NCTU Data Maning HW1

0416037 李家安

Info

Group 3
Dataset: New York Citi Bike Trip Histories, first data

Task

What rules should be discover?
What rules should be discover?

Need

what is a transaction
what rules should be discovered(and discretization method)
what algorithm you use(Apriori or FP-growth or something else) a. algorithm code from github is allowed(cite the repository)
top 3 rules
what did you learned, or comparison between different methods you use

Data Preprocessing

由於 hw0 再做 data preprocess 的部份有點不足，因此我在改了一點
另外我希望能將 data 塞入 mysql 以方便一些 sql 原有的 query
因此只會做處理的部份，比較不會印出來

Load Data



In [1]:

    
import pandas as pd
import datetime

df = pd.read_csv('201707-citibike-tripdata.csv')
df.columns = ['tripduration','starttime','stoptime',\
              'start_station_id','start_station_name','start_station_latitude','start_station_longitude',\
              'end_station_id','end_station_name','end_station_latitude','end_station_longitude',\
              'bikeid','usertype','birth_year','gender']









    



/usr/local/lib/python3.5/dist-packages/pandas/core/computation/__init__.py:18: UserWarning: The installed version of numexpr 2.4.3 is not supported in pandas and will be not be used
The minimum supported version is 2.4.6

  ver=ver, min_ver=_MIN_NUMEXPR_VERSION), UserWarning)

Connect sql



In [2]:

    
from sqlalchemy import create_engine

engine = create_engine('mysql://calee0219:110010@localhost/citybike')

Remove NAN

觀察發現只有 birth 有出現 NAN，用 mean 取代沒有紀錄的 NaN



In [3]:

    
print(df.isnull().sum().sum())
print(pd.isnull(df).sum() > 0)
birth_mean = df['birth_year'].mean()
df = df.fillna(birth_mean)









    



228596
tripduration               False
starttime                  False
stoptime                   False
start_station_id           False
start_station_name         False
start_station_latitude     False
start_station_longitude    False
end_station_id             False
end_station_name           False
end_station_latitude       False
end_station_longitude      False
bikeid                     False
usertype                   False
birth_year                  True
gender                     False
dtype: bool

避免出現 start time >= end time



In [4]:

    
df = df.drop(df.index[df['starttime'] >= df['stoptime']])
df = df.reset_index(drop=True)

刪除 speed 過快不合理的狀況

上網查好像腳踏車時速大概可以到 40 上下，所以我把以上的部分刪掉



In [5]:

    
import datetime
import operator
from pyproj import Geod

wgs84_geod = Geod(ellps='WGS84') 

start = [datetime.datetime.strptime(dt, '%Y-%m-%d %H:%M:%S') for dt in df['starttime'].tolist()]
end = [datetime.datetime.strptime(dt, '%Y-%m-%d %H:%M:%S') for dt in df['stoptime'].tolist()]

def Distance(lat1,lon1,lat2,lon2):
    az12,az21,dist = wgs84_geod.inv(lon1,lat1,lon2,lat2)
    return dist

dist = Distance(df['start_station_latitude'].tolist(), df['start_station_longitude'].tolist(), \
                df['end_station_latitude'].tolist(), df['end_station_longitude'].tolist())

speed = list(map(operator.truediv, [x/1000 for x in dist], [ time.seconds/3600 for time in list(map(operator.sub, end, start))]))
zp = list(zip(speed,list(range(df.shape[0]))))
zp.sort()
zp.reverse()
for i in zp[:6]:
    print(i)
    
df = df.drop(df.index[[716622,320615,1393557,1260345]])
df.reset_index(drop=True, inplace=True)









    



(174.5232169421344, 716622)
(145.63586768356728, 320615)
(85.11378347330736, 1393557)
(55.89713944680872, 1260345)
(42.7109874670386, 375393)
(36.911770167382116, 998509)

將處理完的原始 table 先存一份

過程中先轉型



In [6]:

    
from sqlalchemy import types

try:
    df = pd.read_sql_table(table_name='origin', con=engine)
except:
    df['tripduration'].astype(int)
    df['starttime'] = pd.to_datetime(df['starttime'])
    df['stoptime'] = pd.to_datetime(df['stoptime'])
    df['start_station_id'].astype(int)
    df['start_station_name'].astype(str)
    df['start_station_latitude'].astype(float)
    df['start_station_longitude'].astype(float)
    df['end_station_id'].astype(int)
    df['end_station_name'].astype(str)
    df['end_station_latitude'].astype(float)
    df['end_station_longitude'].astype(float)
    df['bikeid'].astype(int)
    df['usertype'].astype(str)
    df['birth_year'].astype(int)
    df['gender'].astype(int)
    df.to_sql(name='origin', con=engine, if_exists='replace',index=False,\
              dtype={'starttime': types.DATETIME, 'stoptime': types.DATETIME, 'birth_year': types.BIGINT})

將原始資料切出只有 station 的資料

只保留 id, name, lat, lng，並且刪掉重複的



In [7]:

    
try:
    station = pd.read_sql_table(table_name='station', con=engine)
except:
    station = pd.DataFrame(df[['start_station_id', 'start_station_name', 'start_station_latitude', 'start_station_longitude']])
    station.columns = ['id', 'name', 'latitude', 'longitude']
    tmp = pd.DataFrame(df[['end_station_id', 'end_station_name', 'end_station_latitude', 'end_station_longitude']])
    tmp.columns = ['id', 'name', 'latitude', 'longitude']
    station = pd.concat([station, tmp])
    station = station.sort_values('id').drop_duplicates().reset_index(drop=True)
    station.to_sql(name='station', con=engine, if_exists='fail',index=False)

確認是否有不當的位置

看起來沒有



In [8]:

    
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
import numpy as np

my_map = Basemap(projection='merc', lat_0=40.7, lon_0=-73.98,
    resolution = 'h', area_thresh = 0.01,
    llcrnrlon=-74.1, llcrnrlat=40.64,
    urcrnrlon=-73.9, urcrnrlat=40.85)

lon = station['longitude'].tolist()
lat = station['latitude'].tolist()
labels = station['id'].tolist()

fig = plt.figure(frameon=False)
fig.set_size_inches(18,12)

my_map.drawcoastlines()
my_map.drawcountries()
my_map.fillcontinents(color='coral')
my_map.drawmapboundary()

x,y = my_map(lon, lat)
my_map.plot(x, y, 'bo', markersize=2)

plt.show()









    



/usr/local/lib/python3.5/dist-packages/mpl_toolkits/basemap/__init__.py:1767: MatplotlibDeprecationWarning: The get_axis_bgcolor function was deprecated in version 2.0. Use get_facecolor instead.
  axisbgc = ax.get_axis_bgcolor()
/usr/local/lib/python3.5/dist-packages/mpl_toolkits/basemap/__init__.py:1623: MatplotlibDeprecationWarning: The get_axis_bgcolor function was deprecated in version 2.0. Use get_facecolor instead.
  fill_color = ax.get_axis_bgcolor()
/usr/local/lib/python3.5/dist-packages/mpl_toolkits/basemap/__init__.py:1698: MatplotlibDeprecationWarning: The axesPatch function was deprecated in version 2.1. Use Axes.patch instead.
  limb = ax.axesPatch
/usr/local/lib/python3.5/dist-packages/mpl_toolkits/basemap/__init__.py:3260: MatplotlibDeprecationWarning: The ishold function was deprecated in version 2.0.
  b = ax.ishold()
/usr/local/lib/python3.5/dist-packages/mpl_toolkits/basemap/__init__.py:3269: MatplotlibDeprecationWarning: axes.hold is deprecated.
    See the API Changes document (http://matplotlib.org/api/api_changes.html)
    for more details.
  ax.hold(b)

切出 path table

只留下 tripduration, starttime, stoptime, start station id, end station id, bikeid, usertype, birth year, gender



In [9]:

    
from sqlalchemy import types

try:
    path = pd.read_sql_table(table_name='path', con=engine)
except:
    path = df.drop(['start_station_name', 'start_station_latitude', 'start_station_longitude', 'end_station_name', 'end_station_latitude', 'end_station_longitude'], axis=1)
    path.to_csv('path.csv', index=False)
    path.to_sql(name='path', con=engine, if_exists='fail',index=False,\
                dtype={'starttime': types.DATETIME, 'stoptime': types.DATETIME, 'birth_year': types.BIGINT})

切出 in / out flow table

為了方便 query，留下 id, time, in-flow, out-flow



In [10]:

    
import bisect
import datetime

try:
    in_out = pd.read_sql_table(table_name='in_out', con=engine)
except:
    begin = datetime.datetime(2017, 7, 1, 0, 0, 0)
    end = datetime.datetime(2017, 8, 1, 23, 30, 0)
    date_list = [ end - datetime.timedelta(seconds=x*60*30) for x in range(0, 1536)][::-1]

    table = {}
    for idx, row in path.iterrows():
        start_date = row['starttime']
        start = date_list[bisect.bisect_right(date_list, start_date)]
        end_date = row['stoptime']
        end = date_list[bisect.bisect_right(date_list, end_date)]
        start_tmp = (row['start_station_id'], start)
        if table.get(start_tmp) == None:
            table[start_tmp] = (1,0)
        else:
            tmp = list(table[start_tmp])
            tmp[0] += 1
            table[start_tmp] = tuple(tmp)
        stop_tmp = (row['end_station_id'], start)
        if table.get(stop_tmp) == None:
            table[stop_tmp] = (0,1)
        else:
            tmp = list(table[stop_tmp])
            tmp[1] += 1
            table[stop_tmp] = tuple(tmp)
    tmp_in_out = []
    for key in table.keys():
        tmp_in_out.append([key[0], key[1], table[key][0], table[key][1]])
    in_out = pd.DataFrame(tmp_in_out, columns=['id', 'time', 'in', 'out'])
    in_out.to_sql(name='in_out', con=engine, if_exists='replace',index=False,\
                  dtype={'time': types.DATETIME})

Task

Transactions 1

助教給的 transactions
in / out flow when station_id=519
如助教給的作法，我把 in / out binning 成每 10 個 count 為一組



In [11]:

    
import pandas as pd
from mlxtend.preprocessing import OnehotTransactions
from mlxtend.frequent_patterns import apriori

transactions = []

for idx, row in in_out.iterrows():
    if row['id'] == 519:
        transactions.append([('in',row['in']//10), ('out',row['out']//10)])

min_sup = 0.01

oht = OnehotTransactions()
oht_ary = oht.fit(transactions).transform(transactions)
df = pd.DataFrame(oht_ary, columns=oht.columns_)
frequent_itemsets = apriori(df, min_support=min_sup, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
fqs = frequent_itemsets[ (frequent_itemsets['length'] >= 2) &
                   (frequent_itemsets['support'] >= min_sup) ].sort_values(['support'], ascending=False)
print(fqs)
for idx, row in fqs.iterrows():
    cof = row['itemsets'][0]









    



     support             itemsets  length
13  0.574257  [(in, 0), (out, 0)]       2
14  0.079970  [(in, 0), (out, 1)]       2
16  0.079970  [(in, 1), (out, 1)]       2
15  0.076161  [(in, 1), (out, 0)]       2
19  0.023610  [(in, 2), (out, 2)]       2
18  0.022087  [(in, 2), (out, 1)]       2
21  0.020564  [(in, 3), (out, 3)]       2
17  0.019802  [(in, 1), (out, 2)]       2
20  0.013709  [(in, 3), (out, 2)]       2
22  0.011424  [(in, 4), (out, 3)]       2



In [12]:

    
import Orange
from orangecontrib.associate.fpgrowth import *

transactions = []

for idx, row in in_out.iterrows():
    if row['id'] == 519:
        transactions.append([('in',row['in']//10), ('out',row['out']//10)])

#transactions = np.array(transactions)        

import pyfpgrowth
patterns = pyfpgrowth.find_frequent_patterns(transactions, 10)
print(patterns)
rules = pyfpgrowth.generate_association_rules(patterns, 0.3)
for key in rules.keys():
    print(key, rules[key])









    



{(('out', 3),): 58, (('out', 1),): 240, (('out', 0),): 857, (('in', 0),): 863, (('in', 0), ('out', 0)): 754, (('in', 4),): 32, (('in', 1), ('out', 2)): 26, (('out', 4),): 31, (('in', 4), ('out', 3)): 15, (('in', 1), ('out', 0)): 100, (('in', 7),): 13, (('out', 5),): 20, (('in', 3), ('out', 2)): 18, (('in', 0), ('out', 1)): 105, (('in', 3), ('out', 3)): 27, (('in', 4), ('out', 4)): 11, (('in', 6),): 17, (('out', 6),): 17, (('out', 2),): 82, (('in', 2), ('out', 1)): 29, (('in', 1), ('out', 1)): 105, (('in', 2), ('out', 2)): 31, (('in', 5),): 12, (('in', 2), ('out', 3)): 12}
(('in', 4),) ((('out', 4),), 0.34375)
(('out', 3),) ((('in', 3),), 0.46551724137931033)
(('out', 4),) ((('in', 4),), 0.3548387096774194)
(('out', 1),) ((('in', 1),), 0.4375)
(('out', 0),) ((('in', 0),), 0.8798133022170361)
(('out', 2),) ((('in', 2),), 0.3780487804878049)
(('in', 0),) ((('out', 0),), 0.8736964078794901)

我的 transaction 是用:
- 如助教給的策資，是用 station_id=519 的 in-flow, out-flow
應該要找到的 rule / discretization method
- 如助教建議的，對於 in / out flow，做了 10 個數量級切一份的 binning
算法
- 如上面兩個，一個是用 apriori、一個是用 fpgroth 分別找出 frequency itemset
top 3 rules
1. ('out', 0) -> ('in', 0)
2. ('in', 0) -> ('out', 0)
3. ('out', 3) -> ('in', 3)
What do I learned
- 用 fpgroth 的有列出了 association rules，但是都可以看出基本上還是以 in-flow = 0, out-flow = 0 為大宗
- 因此我認為 id = 512 的 itemset 基本上沒有什麼有用的 association rules
- in / out flow 似乎看起來在相同的時間區間中，數量級會相同

Transaction 2

我希望能找到距離中心點位置、時間與 in / out flow 的關係

為避免找 frequency itemset 太慢，之後都用 fpgrowth 的方法找



In [23]:

    
query = "SELECT in_out.id, in_out.time, in_out.in, in_out.out, T.latitude, T.longitude FROM in_out left join ( SELECT id, latitude, longitude from station )T ON T.id = in_out.id ORDER BY id"
table = pd.read_sql_query(query, engine)


lat_mean = station['latitude'].mean()
lon_mean = station['longitude'].mean()
#print(lat_mean, lon_mean)

def Distance(lat1,lon1,lat2,lon2):
    az12,az21,dist = wgs84_geod.inv(lon1,lat1,lon2,lat2)
    return dist

from orangecontrib.associate.fpgrowth import *

rem = {}
for idx, row in station.iterrows():
    rem[row['id']] = Distance(lat_mean, lon_mean, row['latitude'], row['longitude'])//1000 # 以公里為單位

from fp_growth import *

transactions = []

for idx, row in table.iterrows():
    rin = row['in'] // 10
    rout = row['out'] // 10
    if rin == 0 or rout == 0: continue
    transactions.append([(rem[row['id']], row['time'].time().isoformat()), ('in',rin), ('out',rout)])

result = {}
for itemset, support in find_frequent_itemsets(transactions, .02*len(transactions), True):
    result[tuple(itemset)] = support/len(transactions)
    
    
def subs(l):
    assert type(l) is list
    if len(l) == 1:
        return [l]
    x = subs(l[1:])
    return x + [[l[0]] + y for y in x]

    
def assRule(freq, min_conf = 0.6):
    assert type(freq) is dict
    result = []
    for item, sup in freq.items():
        for subitem in subs(list(item)):
            sb = [x for x in item if x not in subitem]
            if sb == [] or subitem == []: continue
            if len(subitem) == 1 and (subitem[0][0] == 'in' or subitem[0][0] == 'out'):
                continue
            conf = sup/freq[tuple(subitem)]
            if conf >= min_conf:
                result.append({'from':subitem, 'to':sb, 'sup':sup, 'conf':conf})
    return result
                
rules = assRule(result, 0.8)
#print(rules)
for ru in rules:
    print(ru)









    



{'sup': 0.027171969967822666, 'conf': 0.853932584269663, 'from': [(1.0, '18:30:00')], 'to': [('out', 1)]}
{'sup': 0.028101537361458706, 'conf': 0.8020408163265306, 'from': [(2.0, '18:30:00')], 'to': [('out', 1)]}
{'sup': 0.02373972112978191, 'conf': 0.8623376623376623, 'from': [(2.0, '19:00:00')], 'to': [('out', 1)]}
{'sup': 0.02209510189488738, 'conf': 0.9035087719298245, 'from': [(1.0, '19:00:00')], 'to': [('out', 1)]}
{'sup': 0.022452627815516626, 'conf': 0.8698060941828255, 'from': [(1.0, '18:00:00')], 'to': [('out', 1)]}
{'sup': 0.021737575974258135, 'conf': 0.9101796407185628, 'from': [('in', 1), (1.0, '18:30:00')], 'to': [('out', 1)]}
{'sup': 0.021737575974258135, 'conf': 0.8, 'from': [('out', 1), (1.0, '18:30:00')], 'to': [('in', 1)]}

我的 transaction 是用:
- 與中心點的距離 (km)
- 時間 30 (min) binning
- in-flow
- out-flow
應該要找到的 rule / discretization method
- 我希望找到與中心點距離、時間、in / out flow 關係，例如
  - 我把位置跟中心距離與時間放在一起看，希望能得到兩個一起出來的結果
  - 可能與中心點一段距離中間，可以找到適當的 in / out flow 關係
- discretization method
  - 距離除到公里整數
  - 時間每 30 分鐘一切，一天 24 hr 分成 48 段
  - in / out flow: 每 10 個數量級一個 binning
算法
- 我用的算法是 fp-groth
top 3 rules 以 confidence 來看了話，應該是：
1. ('in', 1), (1, 18:30) -> ('out', 1)
2. (1, 19:00) -> ('out', 1)
3. (1, 18:00) -> ('out', 1)
What do I learned
- 基本上可以看出，與中心點 1 km - 2 km 的距離的站台，在晚上 18:00 - 19:00 時的 in / out flow 大概會在 10 - 20 的數量級之間

Transaction 3

我想試著找出 start time / end time / in-flow / out-flow / speed / distance 之間是否有神秘的關係



In [52]:

    
query = '''
SELECT in_out.id, in_out.time, in_out.in, in_out.out, T1.st_time, T2.en_time
FROM in_out
  LEFT JOIN (
    SELECT start_station_id AS st_id, SEC_TO_TIME(AVG(TIME_TO_SEC(DATE_FORMAT(starttime, "%%H:%%i:%%s")))) AS st_time
    FROM path
    GROUP BY start_station_id
  )T1 ON in_out.id = T1.st_id
  LEFT JOIN (
    SELECT end_station_id AS en_id, SEC_TO_TIME(AVG(TIME_TO_SEC(DATE_FORMAT(stoptime, "%%H:%%i:%%s")))) AS en_time
    FROM path
    GROUP BY end_station_id
  )T2 ON in_out.id = T2.en_id
ORDER BY in_out.id;
'''

table = pd.read_sql_query(query, engine)

transactions = []

for idx, row in table.iterrows():
    rin = row['in'] // 10
    rout = row['out'] // 10
    if rin == 0 or rout == 0: continue
    st = (datetime.datetime.min+row['st_time']).time().replace(second=0, microsecond=0)
    st = st.replace(minute=st.minute//10 * 10).isoformat()
    en = (datetime.datetime.min+row['en_time']).time().replace(second=0, microsecond=0)
    en = en.replace(minute=en.minute//10 * 10).isoformat()
    transactions.append([('stime', st), ('etime', en), ('in',rin), ('out',rout)])

result = {}
for itemset, support in find_frequent_itemsets(transactions, .04*len(transactions), True):
    result[tuple(itemset)] = support/len(transactions)
    
    
def subs(l):
    assert type(l) is list
    if len(l) == 1:
        return [l]
    x = subs(l[1:])
    return x + [[l[0]] + y for y in x]

    
def assRule(freq, min_conf = 0.6):
    assert type(freq) is dict
    result = []
    for item, sup in freq.items():
        for subitem in subs(list(item)):
            sb = [x for x in item if x not in subitem]
            if sb == [] or subitem == []: continue
            if len(subitem) == 1 and (subitem[0][0] == 'in' or subitem[0][0] == 'out'):
                continue
            conf = sup/freq[tuple(subitem)]
            if conf >= min_conf:
                result.append({'from':subitem, 'to':sb, 'sup':sup, 'conf':conf})
    return result
                
rules = assRule(result, 0.9)
#print(rules)
for ru in rules:
    print(ru)









    



{'sup': 0.053485877726135146, 'conf': 0.9385194479297365, 'from': [('in', 1), ('stime', '15:20:00')], 'to': [('out', 1)]}
{'sup': 0.04397568823739721, 'conf': 0.9165424739195231, 'from': [('in', 1), ('etime', '15:10:00')], 'to': [('out', 1)]}
{'sup': 0.06292456203074723, 'conf': 0.9185803757828809, 'from': [('stime', '15:20:00')], 'to': [('out', 1)]}
{'sup': 0.05355738291026099, 'conf': 0.9002403846153846, 'from': [('in', 1), ('etime', '14:10:00')], 'to': [('out', 1)]}
{'sup': 0.11312120128709331, 'conf': 0.9086731763354394, 'from': [('in', 1), ('stime', '15:10:00')], 'to': [('out', 1)]}
{'sup': 0.043689667500893814, 'conf': 0.9038461538461539, 'from': [('out', 1), ('stime', '14:10:00')], 'to': [('in', 1)]}
{'sup': 0.05677511619592421, 'conf': 0.9022727272727273, 'from': [('in', 1), ('etime', '14:20:00')], 'to': [('out', 1)]}

我的 transaction 是用:
- 每個站點各個的平均借車時間
- 每個站點各個的平均還車時間
- in-flow
- out-flow
應該要找到的 rule / discretization method
- 我希望找到借、還車時間、in / out flow 關係，例如
  - 借車時間越接近晚上的，out-flow 越高
- discretization method
  - 平均借車時間 binning 到每十分鐘一個分隔
  - 平均還車時間 binning 到每十分鐘一個分隔
  - in / out flow: 每 10 個數量級一個 binning
算法
- 我用的算法是 fp-groth
top 3 rules 以 confidence 來看了話，應該是：
1. ('in', 1), ('stime', 15:20) -> ('out', 1)
2. ('in', 1), ('etime', '15:10:00') -> ('out', 1)
3. ('stime', '15:20:00') -> ('out', 1)
What do I learned
- 也許可以推測 in/out flow 數量級是 10 - 20 之間的 station，在 15:00 - 15:30 左右的時間，應該要多放一點車提供借出