2017 NCTU Data Maning HW1

0416037 李家安

Info

Task

  • What rules should be discover?
  • What rules should be discover?

Need

  1. what is a transaction
  2. what rules should be discovered(and discretization method)
  3. what algorithm you use(Apriori or FP-growth or something else) a. algorithm code from github is allowed(cite the repository)
  4. top 3 rules
  5. what did you learned, or comparison between different methods you use

Data Preprocessing

  • 由於 hw0 再做 data preprocess 的部份有點不足,因此我在改了一點
  • 另外我希望能將 data 塞入 mysql 以方便一些 sql 原有的 query
  • 因此只會做處理的部份,比較不會印出來

Load Data


In [1]:
import pandas as pd
import datetime

df = pd.read_csv('201707-citibike-tripdata.csv')
df.columns = ['tripduration','starttime','stoptime',\
              'start_station_id','start_station_name','start_station_latitude','start_station_longitude',\
              'end_station_id','end_station_name','end_station_latitude','end_station_longitude',\
              'bikeid','usertype','birth_year','gender']


/usr/local/lib/python3.5/dist-packages/pandas/core/computation/__init__.py:18: UserWarning: The installed version of numexpr 2.4.3 is not supported in pandas and will be not be used
The minimum supported version is 2.4.6

  ver=ver, min_ver=_MIN_NUMEXPR_VERSION), UserWarning)

Connect sql


In [2]:
from sqlalchemy import create_engine

engine = create_engine('mysql://calee0219:110010@localhost/citybike')

Remove NAN

觀察發現只有 birth 有出現 NAN,用 mean 取代沒有紀錄的 NaN


In [3]:
print(df.isnull().sum().sum())
print(pd.isnull(df).sum() > 0)
birth_mean = df['birth_year'].mean()
df = df.fillna(birth_mean)


228596
tripduration               False
starttime                  False
stoptime                   False
start_station_id           False
start_station_name         False
start_station_latitude     False
start_station_longitude    False
end_station_id             False
end_station_name           False
end_station_latitude       False
end_station_longitude      False
bikeid                     False
usertype                   False
birth_year                  True
gender                     False
dtype: bool

避免出現 start time >= end time


In [4]:
df = df.drop(df.index[df['starttime'] >= df['stoptime']])
df = df.reset_index(drop=True)

刪除 speed 過快不合理的狀況

上網查好像腳踏車時速大概可以到 40 上下,所以我把以上的部分刪掉


In [5]:
import datetime
import operator
from pyproj import Geod

wgs84_geod = Geod(ellps='WGS84') 

start = [datetime.datetime.strptime(dt, '%Y-%m-%d %H:%M:%S') for dt in df['starttime'].tolist()]
end = [datetime.datetime.strptime(dt, '%Y-%m-%d %H:%M:%S') for dt in df['stoptime'].tolist()]

def Distance(lat1,lon1,lat2,lon2):
    az12,az21,dist = wgs84_geod.inv(lon1,lat1,lon2,lat2)
    return dist

dist = Distance(df['start_station_latitude'].tolist(), df['start_station_longitude'].tolist(), \
                df['end_station_latitude'].tolist(), df['end_station_longitude'].tolist())

speed = list(map(operator.truediv, [x/1000 for x in dist], [ time.seconds/3600 for time in list(map(operator.sub, end, start))]))
zp = list(zip(speed,list(range(df.shape[0]))))
zp.sort()
zp.reverse()
for i in zp[:6]:
    print(i)
    
df = df.drop(df.index[[716622,320615,1393557,1260345]])
df.reset_index(drop=True, inplace=True)


(174.5232169421344, 716622)
(145.63586768356728, 320615)
(85.11378347330736, 1393557)
(55.89713944680872, 1260345)
(42.7109874670386, 375393)
(36.911770167382116, 998509)

將處理完的原始 table 先存一份

過程中先轉型


In [6]:
from sqlalchemy import types

try:
    df = pd.read_sql_table(table_name='origin', con=engine)
except:
    df['tripduration'].astype(int)
    df['starttime'] = pd.to_datetime(df['starttime'])
    df['stoptime'] = pd.to_datetime(df['stoptime'])
    df['start_station_id'].astype(int)
    df['start_station_name'].astype(str)
    df['start_station_latitude'].astype(float)
    df['start_station_longitude'].astype(float)
    df['end_station_id'].astype(int)
    df['end_station_name'].astype(str)
    df['end_station_latitude'].astype(float)
    df['end_station_longitude'].astype(float)
    df['bikeid'].astype(int)
    df['usertype'].astype(str)
    df['birth_year'].astype(int)
    df['gender'].astype(int)
    df.to_sql(name='origin', con=engine, if_exists='replace',index=False,\
              dtype={'starttime': types.DATETIME, 'stoptime': types.DATETIME, 'birth_year': types.BIGINT})

將原始資料切出只有 station 的資料

只保留 id, name, lat, lng,並且刪掉重複的


In [7]:
try:
    station = pd.read_sql_table(table_name='station', con=engine)
except:
    station = pd.DataFrame(df[['start_station_id', 'start_station_name', 'start_station_latitude', 'start_station_longitude']])
    station.columns = ['id', 'name', 'latitude', 'longitude']
    tmp = pd.DataFrame(df[['end_station_id', 'end_station_name', 'end_station_latitude', 'end_station_longitude']])
    tmp.columns = ['id', 'name', 'latitude', 'longitude']
    station = pd.concat([station, tmp])
    station = station.sort_values('id').drop_duplicates().reset_index(drop=True)
    station.to_sql(name='station', con=engine, if_exists='fail',index=False)

確認是否有不當的位置

看起來沒有


In [8]:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
import numpy as np

my_map = Basemap(projection='merc', lat_0=40.7, lon_0=-73.98,
    resolution = 'h', area_thresh = 0.01,
    llcrnrlon=-74.1, llcrnrlat=40.64,
    urcrnrlon=-73.9, urcrnrlat=40.85)

lon = station['longitude'].tolist()
lat = station['latitude'].tolist()
labels = station['id'].tolist()

fig = plt.figure(frameon=False)
fig.set_size_inches(18,12)

my_map.drawcoastlines()
my_map.drawcountries()
my_map.fillcontinents(color='coral')
my_map.drawmapboundary()

x,y = my_map(lon, lat)
my_map.plot(x, y, 'bo', markersize=2)

plt.show()


/usr/local/lib/python3.5/dist-packages/mpl_toolkits/basemap/__init__.py:1767: MatplotlibDeprecationWarning: The get_axis_bgcolor function was deprecated in version 2.0. Use get_facecolor instead.
  axisbgc = ax.get_axis_bgcolor()
/usr/local/lib/python3.5/dist-packages/mpl_toolkits/basemap/__init__.py:1623: MatplotlibDeprecationWarning: The get_axis_bgcolor function was deprecated in version 2.0. Use get_facecolor instead.
  fill_color = ax.get_axis_bgcolor()
/usr/local/lib/python3.5/dist-packages/mpl_toolkits/basemap/__init__.py:1698: MatplotlibDeprecationWarning: The axesPatch function was deprecated in version 2.1. Use Axes.patch instead.
  limb = ax.axesPatch
/usr/local/lib/python3.5/dist-packages/mpl_toolkits/basemap/__init__.py:3260: MatplotlibDeprecationWarning: The ishold function was deprecated in version 2.0.
  b = ax.ishold()
/usr/local/lib/python3.5/dist-packages/mpl_toolkits/basemap/__init__.py:3269: MatplotlibDeprecationWarning: axes.hold is deprecated.
    See the API Changes document (http://matplotlib.org/api/api_changes.html)
    for more details.
  ax.hold(b)

切出 path table

只留下 tripduration, starttime, stoptime, start station id, end station id, bikeid, usertype, birth year, gender


In [9]:
from sqlalchemy import types

try:
    path = pd.read_sql_table(table_name='path', con=engine)
except:
    path = df.drop(['start_station_name', 'start_station_latitude', 'start_station_longitude', 'end_station_name', 'end_station_latitude', 'end_station_longitude'], axis=1)
    path.to_csv('path.csv', index=False)
    path.to_sql(name='path', con=engine, if_exists='fail',index=False,\
                dtype={'starttime': types.DATETIME, 'stoptime': types.DATETIME, 'birth_year': types.BIGINT})

切出 in / out flow table

為了方便 query,留下 id, time, in-flow, out-flow


In [10]:
import bisect
import datetime

try:
    in_out = pd.read_sql_table(table_name='in_out', con=engine)
except:
    begin = datetime.datetime(2017, 7, 1, 0, 0, 0)
    end = datetime.datetime(2017, 8, 1, 23, 30, 0)
    date_list = [ end - datetime.timedelta(seconds=x*60*30) for x in range(0, 1536)][::-1]

    table = {}
    for idx, row in path.iterrows():
        start_date = row['starttime']
        start = date_list[bisect.bisect_right(date_list, start_date)]
        end_date = row['stoptime']
        end = date_list[bisect.bisect_right(date_list, end_date)]
        start_tmp = (row['start_station_id'], start)
        if table.get(start_tmp) == None:
            table[start_tmp] = (1,0)
        else:
            tmp = list(table[start_tmp])
            tmp[0] += 1
            table[start_tmp] = tuple(tmp)
        stop_tmp = (row['end_station_id'], start)
        if table.get(stop_tmp) == None:
            table[stop_tmp] = (0,1)
        else:
            tmp = list(table[stop_tmp])
            tmp[1] += 1
            table[stop_tmp] = tuple(tmp)
    tmp_in_out = []
    for key in table.keys():
        tmp_in_out.append([key[0], key[1], table[key][0], table[key][1]])
    in_out = pd.DataFrame(tmp_in_out, columns=['id', 'time', 'in', 'out'])
    in_out.to_sql(name='in_out', con=engine, if_exists='replace',index=False,\
                  dtype={'time': types.DATETIME})

Task

Transactions 1

  • 助教給的 transactions
  • in / out flow when station_id=519
  • 如助教給的作法,我把 in / out binning 成每 10 個 count 為一組

In [11]:
import pandas as pd
from mlxtend.preprocessing import OnehotTransactions
from mlxtend.frequent_patterns import apriori

transactions = []

for idx, row in in_out.iterrows():
    if row['id'] == 519:
        transactions.append([('in',row['in']//10), ('out',row['out']//10)])

min_sup = 0.01

oht = OnehotTransactions()
oht_ary = oht.fit(transactions).transform(transactions)
df = pd.DataFrame(oht_ary, columns=oht.columns_)
frequent_itemsets = apriori(df, min_support=min_sup, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
fqs = frequent_itemsets[ (frequent_itemsets['length'] >= 2) &
                   (frequent_itemsets['support'] >= min_sup) ].sort_values(['support'], ascending=False)
print(fqs)
for idx, row in fqs.iterrows():
    cof = row['itemsets'][0]


     support             itemsets  length
13  0.574257  [(in, 0), (out, 0)]       2
14  0.079970  [(in, 0), (out, 1)]       2
16  0.079970  [(in, 1), (out, 1)]       2
15  0.076161  [(in, 1), (out, 0)]       2
19  0.023610  [(in, 2), (out, 2)]       2
18  0.022087  [(in, 2), (out, 1)]       2
21  0.020564  [(in, 3), (out, 3)]       2
17  0.019802  [(in, 1), (out, 2)]       2
20  0.013709  [(in, 3), (out, 2)]       2
22  0.011424  [(in, 4), (out, 3)]       2

In [12]:
import Orange
from orangecontrib.associate.fpgrowth import *

transactions = []

for idx, row in in_out.iterrows():
    if row['id'] == 519:
        transactions.append([('in',row['in']//10), ('out',row['out']//10)])

#transactions = np.array(transactions)        

import pyfpgrowth
patterns = pyfpgrowth.find_frequent_patterns(transactions, 10)
print(patterns)
rules = pyfpgrowth.generate_association_rules(patterns, 0.3)
for key in rules.keys():
    print(key, rules[key])


{(('out', 3),): 58, (('out', 1),): 240, (('out', 0),): 857, (('in', 0),): 863, (('in', 0), ('out', 0)): 754, (('in', 4),): 32, (('in', 1), ('out', 2)): 26, (('out', 4),): 31, (('in', 4), ('out', 3)): 15, (('in', 1), ('out', 0)): 100, (('in', 7),): 13, (('out', 5),): 20, (('in', 3), ('out', 2)): 18, (('in', 0), ('out', 1)): 105, (('in', 3), ('out', 3)): 27, (('in', 4), ('out', 4)): 11, (('in', 6),): 17, (('out', 6),): 17, (('out', 2),): 82, (('in', 2), ('out', 1)): 29, (('in', 1), ('out', 1)): 105, (('in', 2), ('out', 2)): 31, (('in', 5),): 12, (('in', 2), ('out', 3)): 12}
(('in', 4),) ((('out', 4),), 0.34375)
(('out', 3),) ((('in', 3),), 0.46551724137931033)
(('out', 4),) ((('in', 4),), 0.3548387096774194)
(('out', 1),) ((('in', 1),), 0.4375)
(('out', 0),) ((('in', 0),), 0.8798133022170361)
(('out', 2),) ((('in', 2),), 0.3780487804878049)
(('in', 0),) ((('out', 0),), 0.8736964078794901)
  1. 我的 transaction 是用:
    • 如助教給的策資,是用 station_id=519 的 in-flow, out-flow
  2. 應該要找到的 rule / discretization method
    • 如助教建議的,對於 in / out flow,做了 10 個數量級切一份的 binning
  3. 算法
    • 如上面兩個,一個是用 apriori、一個是用 fpgroth 分別找出 frequency itemset
  4. top 3 rules
    1. ('out', 0) -> ('in', 0)
    2. ('in', 0) -> ('out', 0)
    3. ('out', 3) -> ('in', 3)
  5. What do I learned
    • 用 fpgroth 的有列出了 association rules,但是都可以看出基本上還是以 in-flow = 0, out-flow = 0 為大宗
    • 因此我認為 id = 512 的 itemset 基本上沒有什麼有用的 association rules
    • in / out flow 似乎看起來在相同的時間區間中,數量級會相同

Transaction 2

我希望能找到 距離中心點位置、時間與 in / out flow 的關係

為避免找 frequency itemset 太慢,之後都用 fpgrowth 的方法找


In [23]:
query = "SELECT in_out.id, in_out.time, in_out.in, in_out.out, T.latitude, T.longitude FROM in_out left join ( SELECT id, latitude, longitude from station )T ON T.id = in_out.id ORDER BY id"
table = pd.read_sql_query(query, engine)


lat_mean = station['latitude'].mean()
lon_mean = station['longitude'].mean()
#print(lat_mean, lon_mean)

def Distance(lat1,lon1,lat2,lon2):
    az12,az21,dist = wgs84_geod.inv(lon1,lat1,lon2,lat2)
    return dist

from orangecontrib.associate.fpgrowth import *

rem = {}
for idx, row in station.iterrows():
    rem[row['id']] = Distance(lat_mean, lon_mean, row['latitude'], row['longitude'])//1000 # 以公里為單位

from fp_growth import *

transactions = []

for idx, row in table.iterrows():
    rin = row['in'] // 10
    rout = row['out'] // 10
    if rin == 0 or rout == 0: continue
    transactions.append([(rem[row['id']], row['time'].time().isoformat()), ('in',rin), ('out',rout)])

result = {}
for itemset, support in find_frequent_itemsets(transactions, .02*len(transactions), True):
    result[tuple(itemset)] = support/len(transactions)
    
    
def subs(l):
    assert type(l) is list
    if len(l) == 1:
        return [l]
    x = subs(l[1:])
    return x + [[l[0]] + y for y in x]

    
def assRule(freq, min_conf = 0.6):
    assert type(freq) is dict
    result = []
    for item, sup in freq.items():
        for subitem in subs(list(item)):
            sb = [x for x in item if x not in subitem]
            if sb == [] or subitem == []: continue
            if len(subitem) == 1 and (subitem[0][0] == 'in' or subitem[0][0] == 'out'):
                continue
            conf = sup/freq[tuple(subitem)]
            if conf >= min_conf:
                result.append({'from':subitem, 'to':sb, 'sup':sup, 'conf':conf})
    return result
                
rules = assRule(result, 0.8)
#print(rules)
for ru in rules:
    print(ru)


{'sup': 0.027171969967822666, 'conf': 0.853932584269663, 'from': [(1.0, '18:30:00')], 'to': [('out', 1)]}
{'sup': 0.028101537361458706, 'conf': 0.8020408163265306, 'from': [(2.0, '18:30:00')], 'to': [('out', 1)]}
{'sup': 0.02373972112978191, 'conf': 0.8623376623376623, 'from': [(2.0, '19:00:00')], 'to': [('out', 1)]}
{'sup': 0.02209510189488738, 'conf': 0.9035087719298245, 'from': [(1.0, '19:00:00')], 'to': [('out', 1)]}
{'sup': 0.022452627815516626, 'conf': 0.8698060941828255, 'from': [(1.0, '18:00:00')], 'to': [('out', 1)]}
{'sup': 0.021737575974258135, 'conf': 0.9101796407185628, 'from': [('in', 1), (1.0, '18:30:00')], 'to': [('out', 1)]}
{'sup': 0.021737575974258135, 'conf': 0.8, 'from': [('out', 1), (1.0, '18:30:00')], 'to': [('in', 1)]}
  1. 我的 transaction 是用:
    • 與中心點的距離 (km)
    • 時間 30 (min) binning
    • in-flow
    • out-flow
  2. 應該要找到的 rule / discretization method
    • 我希望找到與中心點距離、時間、in / out flow 關係,例如
      • 我把 位置跟中心距離 與 時間 放在一起看,希望能得到兩個一起出來的結果
      • 可能與中心點一段距離中間,可以找到適當的 in / out flow 關係
    • discretization method
      • 距離除到公里整數
      • 時間每 30 分鐘一切,一天 24 hr 分成 48 段
      • in / out flow: 每 10 個數量級一個 binning
  3. 算法
    • 我用的算法是 fp-groth
  4. top 3 rules 以 confidence 來看了話,應該是:
    1. ('in', 1), (1, 18:30) -> ('out', 1)
    2. (1, 19:00) -> ('out', 1)
    3. (1, 18:00) -> ('out', 1)
  5. What do I learned
    • 基本上可以看出,與中心點 1 km - 2 km 的距離的站台,在晚上 18:00 - 19:00 時的 in / out flow 大概會在 10 - 20 的數量級之間

Transaction 3

我想試著找出 start time / end time / in-flow / out-flow / speed / distance 之間是否有神秘的關係


In [52]:
query = '''
SELECT in_out.id, in_out.time, in_out.in, in_out.out, T1.st_time, T2.en_time
FROM in_out
  LEFT JOIN (
    SELECT start_station_id AS st_id, SEC_TO_TIME(AVG(TIME_TO_SEC(DATE_FORMAT(starttime, "%%H:%%i:%%s")))) AS st_time
    FROM path
    GROUP BY start_station_id
  )T1 ON in_out.id = T1.st_id
  LEFT JOIN (
    SELECT end_station_id AS en_id, SEC_TO_TIME(AVG(TIME_TO_SEC(DATE_FORMAT(stoptime, "%%H:%%i:%%s")))) AS en_time
    FROM path
    GROUP BY end_station_id
  )T2 ON in_out.id = T2.en_id
ORDER BY in_out.id;
'''

table = pd.read_sql_query(query, engine)

transactions = []

for idx, row in table.iterrows():
    rin = row['in'] // 10
    rout = row['out'] // 10
    if rin == 0 or rout == 0: continue
    st = (datetime.datetime.min+row['st_time']).time().replace(second=0, microsecond=0)
    st = st.replace(minute=st.minute//10 * 10).isoformat()
    en = (datetime.datetime.min+row['en_time']).time().replace(second=0, microsecond=0)
    en = en.replace(minute=en.minute//10 * 10).isoformat()
    transactions.append([('stime', st), ('etime', en), ('in',rin), ('out',rout)])

result = {}
for itemset, support in find_frequent_itemsets(transactions, .04*len(transactions), True):
    result[tuple(itemset)] = support/len(transactions)
    
    
def subs(l):
    assert type(l) is list
    if len(l) == 1:
        return [l]
    x = subs(l[1:])
    return x + [[l[0]] + y for y in x]

    
def assRule(freq, min_conf = 0.6):
    assert type(freq) is dict
    result = []
    for item, sup in freq.items():
        for subitem in subs(list(item)):
            sb = [x for x in item if x not in subitem]
            if sb == [] or subitem == []: continue
            if len(subitem) == 1 and (subitem[0][0] == 'in' or subitem[0][0] == 'out'):
                continue
            conf = sup/freq[tuple(subitem)]
            if conf >= min_conf:
                result.append({'from':subitem, 'to':sb, 'sup':sup, 'conf':conf})
    return result
                
rules = assRule(result, 0.9)
#print(rules)
for ru in rules:
    print(ru)


{'sup': 0.053485877726135146, 'conf': 0.9385194479297365, 'from': [('in', 1), ('stime', '15:20:00')], 'to': [('out', 1)]}
{'sup': 0.04397568823739721, 'conf': 0.9165424739195231, 'from': [('in', 1), ('etime', '15:10:00')], 'to': [('out', 1)]}
{'sup': 0.06292456203074723, 'conf': 0.9185803757828809, 'from': [('stime', '15:20:00')], 'to': [('out', 1)]}
{'sup': 0.05355738291026099, 'conf': 0.9002403846153846, 'from': [('in', 1), ('etime', '14:10:00')], 'to': [('out', 1)]}
{'sup': 0.11312120128709331, 'conf': 0.9086731763354394, 'from': [('in', 1), ('stime', '15:10:00')], 'to': [('out', 1)]}
{'sup': 0.043689667500893814, 'conf': 0.9038461538461539, 'from': [('out', 1), ('stime', '14:10:00')], 'to': [('in', 1)]}
{'sup': 0.05677511619592421, 'conf': 0.9022727272727273, 'from': [('in', 1), ('etime', '14:20:00')], 'to': [('out', 1)]}
  1. 我的 transaction 是用:
    • 每個站點各個的平均借車時間
    • 每個站點各個的平均還車時間
    • in-flow
    • out-flow
  2. 應該要找到的 rule / discretization method
    • 我希望找到 借、還車時間、in / out flow 關係,例如
      • 借車時間越接近晚上的,out-flow 越高
    • discretization method
      • 平均借車時間 binning 到每十分鐘一個分隔
      • 平均還車時間 binning 到每十分鐘一個分隔
      • in / out flow: 每 10 個數量級一個 binning
  3. 算法
    • 我用的算法是 fp-groth
  4. top 3 rules 以 confidence 來看了話,應該是:
    1. ('in', 1), ('stime', 15:20) -> ('out', 1)
    2. ('in', 1), ('etime', '15:10:00') -> ('out', 1)
    3. ('stime', '15:20:00') -> ('out', 1)
  5. What do I learned
    • 也許可以推測 in/out flow 數量級是 10 - 20 之間的 station,在 15:00 - 15:30 左右的時間,應該要多放一點車提供借出