To do for 07282017:

  1. User_filter: Filter users based on certain features, e.g., consistency with a theme, viewing at certain times, or a certain time interval before each item view.
  2. Recommendation core: Essentially collaborative filtering (CF), but instead of using the raw items, I'd like to use features extracted from a CNN and dimension-reduced by t-SNE to maybe 20 dimensions.
  3. Processor: Inputs are (a) the log of user history and (b) item features; output is (a) the top-N ranked list of recommended items for each user.
  4. Evaluator: Evaluate whether the user buys an item within the top-N recommended items (a rough interface sketch follows this list).

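A rough sketch of how the recommendation core and evaluator could fit together, assuming hypothetical helper names (recommend_top_n, hit_at_n) and cosine similarity as the scoring function; this is a planning aid under those assumptions, not the final implementation.

In [ ]:
import numpy as np

def recommend_top_n(user_view_vecs, item_vecs, n=10):
    # Hypothetical recommendation core: score candidate items by cosine
    # similarity between the user's aggregated view vector and each item
    # vector, then return the indices of the N most similar items.
    profile = np.asarray(user_view_vecs).mean(axis=0)
    item_vecs = np.asarray(item_vecs)
    sims = item_vecs.dot(profile) / (
        np.linalg.norm(item_vecs, axis=1) * np.linalg.norm(profile) + 1e-12)
    return np.argsort(sims)[::-1][:n]

def hit_at_n(recommended_idx, bought_idx):
    # Hypothetical evaluator: did the bought item land in the top-N list?
    return bought_idx in set(recommended_idx)

The user_filter and processor would then decide which rows of the view log feed recommend_top_n, and aggregate the hits into an overall hit rate.
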
After trial run:

  • t-SNE for this number of samples and the target dimensionality may not be feasible. Need to time it on a small portion, or try PCA instead.

In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

In [2]:
os.chdir('/Users/Walkon302/Desktop/deep-learning-models-master/view2buy')

In [3]:
# Read the preprocessed file containing the user profiles and item features from the view2buy folder
df = pd.read_pickle('user_fea_for_eval.pkl')

In [4]:
# Drop the first column, which stores the data in its original format and is not needed here.
df.drop('0', axis = 1, inplace = True)

In [5]:
# Check the data
df.head()


Out[5]:
user_id buy_spu buy_sn buy_ct3 view_spu view_sn view_ct3 time_interval view_cnt view_secondes view_features buy_features
0 2469583035 4199682998971011301 10013436 334 220189917005230097 10013861 334 37496 7 45 [0.621, 0.542, 0.0, 0.369, 0.062, 0.039, 0.103... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757...
1 2469583035 4199682998971011301 10013436 334 234826617504419925 10003862 334 170826 2 23 [0.15, 0.98, 0.104, 1.295, 0.111, 0.0, 0.0, 0.... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757...
2 2469583035 4199682998971011301 10013436 334 235671027621670949 10003862 334 426968 2 11 [0.106, 0.027, 0.0, 1.398, 0.096, 0.021, 0.072... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757...
3 1488725183 4199682998971011301 10013436 334 235671027621670949 10003862 334 180564 1 22 [0.106, 0.027, 0.0, 1.398, 0.096, 0.021, 0.072... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757...
4 2469583035 4199682998971011301 10013436 334 245522675097001998 10026364 334 83993 2 7 [0.019, 1.415, 0.007, 0.088, 0.055, 0.015, 0.0... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757...

In [6]:
# Slice the data down to the first 100k rows
df = df.iloc[0:100000, :]

In [7]:
# Calculate the average view time (in seconds) over all viewed items, per user and bought item
avg_view_sec = pd.DataFrame(df.groupby(['user_id', 'buy_spu'])['view_secondes'].mean())

In [8]:
# Reset the index and rename the column
avg_view_sec.reset_index(inplace=True)
avg_view_sec.rename(columns = {'view_secondes':'avg_view_sec'}, inplace=True)

In [9]:
# Check the data
avg_view_sec.head()


Out[9]:
user_id buy_spu avg_view_sec
0 512596 300691773357412424 15.222222
1 814009 77763563263074335 13.128440
2 1165283 77200616039542809 14.714286
3 2164430 32446112180051996 13.486486
4 3603923 25972195386798122 22.380952

In [10]:
# Merge the average view time back into the data
df = pd.merge(df, avg_view_sec, on=['user_id', 'buy_spu'])

In [11]:
# Calculate the weight of each viewed item: its view time relative to the user's average view time
df['weight_of_view'] = df['view_secondes']/df['avg_view_sec']

In [12]:
df.head()


Out[12]:
user_id buy_spu buy_sn buy_ct3 view_spu view_sn view_ct3 time_interval view_cnt view_secondes view_features buy_features avg_view_sec weight_of_view
0 2469583035 4199682998971011301 10013436 334 220189917005230097 10013861 334 37496 7 45 [0.621, 0.542, 0.0, 0.369, 0.062, 0.039, 0.103... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 1.474403
1 2469583035 4199682998971011301 10013436 334 234826617504419925 10003862 334 170826 2 23 [0.15, 0.98, 0.104, 1.295, 0.111, 0.0, 0.0, 0.... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.753584
2 2469583035 4199682998971011301 10013436 334 235671027621670949 10003862 334 426968 2 11 [0.106, 0.027, 0.0, 1.398, 0.096, 0.021, 0.072... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.360410
3 2469583035 4199682998971011301 10013436 334 245522675097001998 10026364 334 83993 2 7 [0.019, 1.415, 0.007, 0.088, 0.055, 0.015, 0.0... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.229352
4 2469583035 4199682998971011301 10013436 334 296751124749754369 10005367 334 427866 2 12 [0.066, 0.328, 0.043, 0.0, 0.062, 0.016, 0.303... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.393174

In [13]:
# Generate view_item_vec and buy_item_vec
view_item_vec = df['view_features']
buy_item_vec = df['buy_features']

In [14]:
# Sanity check: both feature vectors should cover all 100k rows
print 'view_item', len(view_item_vec), 'buy_item', len(buy_item_vec)


view_item 100000 buy_item 100000

Try t-SNE and time it

  • It turns out that t-SNE is too time-consuming even for a small subset of the data. Part of the cost also comes from how I assembled the data (growing a DataFrame row by row). For PCA, I therefore collect the rows in a list first and convert everything to a numpy array at once, which is much faster (see the sketch below).

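For reference, a minimal sketch of that faster assembly strategy (the same idea used in the PCA section), assuming each element of view_item_vec is a flat feature list: collect the rows and convert them to a numpy array in a single call instead of repeatedly concatenating DataFrames.

In [ ]:
import numpy as np

# Slow pattern (used in the timing cells below): growing a DataFrame with
# pd.concat copies the accumulated frame on every iteration.
# Fast pattern: one conversion of the whole slice at once.
subset = view_item_vec.iloc[0:250]
fast = np.array(subset.tolist())   # shape: (250, n_features)
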
In [ ]:
# Generate t-SNE model (note: for n_components > 3, scikit-learn's default
# barnes_hut method may not be supported and method='exact' may be needed)
model = TSNE(n_components=10, random_state=0)

In [121]:
%%time
# Time t-SNE with 250 samples: build the input frame row by row, then fit
a = pd.DataFrame()
for vec in view_item_vec.iloc[0:250]:
    a = pd.concat([a, pd.DataFrame(vec).transpose()], axis=0)
vt = model.fit_transform(a)


CPU times: user 22.3 s, sys: 501 ms, total: 22.8 s
Wall time: 22.8 s

In [114]:
%%time
# Time t-SNE with 500 samples
a = pd.DataFrame()
for vec in view_item_vec.iloc[0:500]:
    a = pd.concat([a, pd.DataFrame(vec).transpose()], axis=0)
vt = model.fit_transform(a)


CPU times: user 1min 23s, sys: 2.57 s, total: 1min 25s
Wall time: 1min 31s

In [113]:
%%time
# Time t-SNE with 1000 samples
a = pd.DataFrame()
for vec in view_item_vec.iloc[0:1000]:
    a = pd.concat([a, pd.DataFrame(vec).transpose()], axis=0)
vt = model.fit_transform(a)


CPU times: user 4min 25s, sys: 6.05 s, total: 4min 31s
Wall time: 4min 33s

Try PCA instead

  • PCA looks reasonable. We can process ~300k rows in around 30 seconds, as long as it does not blow up my RAM. I will proceed with this setting for the first try.

In [17]:
# Generate PCA model with 200 components
model = PCA(n_components=200, random_state=0)

Append all view_items for PCA processing


In [18]:
%%time
# Collect all view feature vectors in a list, then convert to a numpy array at once
view_item = []
for i in view_item_vec:
    view_item.append(i)
view_item = np.array(view_item)


CPU times: user 5.93 s, sys: 1.15 s, total: 7.08 s
Wall time: 7.1 s

In [19]:
%%time
# Fit PCA on the view features and project them onto 200 components
pca_view_vec = model.fit_transform(view_item)


CPU times: user 1min 1s, sys: 10.3 s, total: 1min 11s
Wall time: 47.3 s

In [20]:
# 200 PCA components explain ~85% of the variance. Beyond that, e.g., 300 components, my computer (8 GB RAM) runs out of memory
sum(model.explained_variance_ratio_)


Out[20]:
0.84791818306676126

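If more components are ever needed despite the 8 GB limit, one hedged option is scikit-learn's IncrementalPCA, which fits in mini-batches and keeps peak memory lower at the cost of an approximate decomposition. A minimal sketch, assuming view_item is the feature array built above and a guessed batch_size:

In [ ]:
from sklearn.decomposition import IncrementalPCA

# Mini-batch PCA to reduce peak memory; batch_size is a guess to tune against RAM.
ipca = IncrementalPCA(n_components=300, batch_size=5000)
pca_view_vec_300 = ipca.fit_transform(view_item)
print(sum(ipca.explained_variance_ratio_))
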
Append all buy_items for PCA processing


In [22]:
%%time
# Collect all buy feature vectors in a list, then convert to a numpy array at once
buy_item = []
for i in buy_item_vec:
    buy_item.append(i)
buy_item = np.array(buy_item)


CPU times: user 4.83 s, sys: 656 ms, total: 5.49 s
Wall time: 5.54 s

In [23]:
%%time
# Fit PCA on the buy features (note: a separate fit from the view features)
pca_buy_vec = model.fit_transform(buy_item)


CPU times: user 1min 2s, sys: 10.6 s, total: 1min 12s
Wall time: 52.3 s

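A side note that is my assumption rather than something established here: because the PCA is re-fitted on the buy features, pca_view and pca_buy end up in two different bases, which would matter if the two vectors are later compared directly (e.g., by cosine similarity in the CF step). A minimal alternative sketch, run in place of the fit_transform cell above so both sets share the basis fitted on the view features:

In [ ]:
# Project the buy features with the PCA already fitted on the view features,
# so view and buy vectors live in the same 200-D space. This would replace
# model.fit_transform(buy_item) above, while `model` still holds the view fit.
pca_buy_vec = model.transform(buy_item)
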
In [24]:
# Insert the PCA results into the data
df['pca_view'] = pca_view_vec.tolist()
df['pca_buy'] = pca_buy_vec.tolist()

In [25]:
# Check the data
df.head()


Out[25]:
user_id buy_spu buy_sn buy_ct3 view_spu view_sn view_ct3 time_interval view_cnt view_secondes view_features buy_features avg_view_sec weight_of_view pca_view pca_buy
0 2469583035 4199682998971011301 10013436 334 220189917005230097 10013861 334 37496 7 45 [0.621, 0.542, 0.0, 0.369, 0.062, 0.039, 0.103... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 1.474403 [-5.74788202222, 6.52824387495, -5.06289858541... [-2.17288636144, -2.94799496304, 2.39064835946...
1 2469583035 4199682998971011301 10013436 334 234826617504419925 10003862 334 170826 2 23 [0.15, 0.98, 0.104, 1.295, 0.111, 0.0, 0.0, 0.... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.753584 [1.43729181654, 1.08420451467, 7.66008274012, ... [-2.17288636144, -2.94799496305, 2.39064835945...
2 2469583035 4199682998971011301 10013436 334 235671027621670949 10003862 334 426968 2 11 [0.106, 0.027, 0.0, 1.398, 0.096, 0.021, 0.072... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.360410 [1.94162062285, -2.13649823253, 10.3696365616,... [-2.17288636142, -2.94799496302, 2.39064835945...
3 2469583035 4199682998971011301 10013436 334 245522675097001998 10026364 334 83993 2 7 [0.019, 1.415, 0.007, 0.088, 0.055, 0.015, 0.0... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.229352 [-7.96945015131, 5.27389714219, -0.67157260350... [-2.17288636144, -2.94799496305, 2.39064835948...
4 2469583035 4199682998971011301 10013436 334 296751124749754369 10005367 334 427866 2 12 [0.066, 0.328, 0.043, 0.0, 0.062, 0.016, 0.303... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.393174 [9.4119644988, 2.66645049386, 7.34774326651, 2... [-2.17288636142, -2.94799496304, 2.39064835947...

In [15]:
# Reload the previously saved weighted data (cell numbers reset after a kernel restart)
df = pd.read_pickle('df_weighted.pkl')

In [123]:
# Calculate the weighted pca_view
df['weighted_view_pca'] = df.apply(lambda x: [y*x['weight_of_view'] for y in x['pca_view']], axis=1)

In [16]:
# Calculate the weighted pca_buy
df['weighted_buy_pca'] = df.apply(lambda x: [y*x['weight_of_view'] for y in x['pca_buy']], axis=1)

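The row-wise apply above works, but on larger slices a numpy-broadcast version tends to be much faster. A hedged sketch, assuming the PCA columns hold fixed-length lists:

In [ ]:
import numpy as np

# Broadcast each row's view weight across its 200 PCA components in one shot.
pca_view_arr = np.array(df['pca_view'].tolist())        # shape (n_rows, 200)
weights = df['weight_of_view'].values.reshape(-1, 1)    # shape (n_rows, 1)
df['weighted_view_pca'] = (pca_view_arr * weights).tolist()
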
In [20]:
# Check the data
df.head()


Out[20]:
user_id buy_spu buy_sn buy_ct3 view_spu view_sn view_ct3 time_interval view_cnt view_secondes view_features buy_features avg_view_sec weight_of_view pca_view pca_buy weighted_view_pca
0 2469583035 4199682998971011301 10013436 334 220189917005230097 10013861 334 37496 7 45 [0.621, 0.542, 0.0, 0.369, 0.062, 0.039, 0.103... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 1.474403 [-5.74788202222, 6.52824387495, -5.06289858541... [-2.17288636144, -2.94799496304, 2.39064835946... [-8.47469294744, 9.62526059379, -7.46475149794...
1 2469583035 4199682998971011301 10013436 334 234826617504419925 10003862 334 170826 2 23 [0.15, 0.98, 0.104, 1.295, 0.111, 0.0, 0.0, 0.... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.753584 [1.43729181654, 1.08420451467, 7.66008274012, ... [-2.17288636144, -2.94799496305, 2.39064835945... [1.08311956687, 0.817038760544, 5.77251286354,...
2 2469583035 4199682998971011301 10013436 334 235671027621670949 10003862 334 426968 2 11 [0.106, 0.027, 0.0, 1.398, 0.096, 0.021, 0.072... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.360410 [1.94162062285, -2.13649823253, 10.3696365616,... [-2.17288636142, -2.94799496302, 2.39064835945... [0.699778627213, -0.770014380052, 3.7373161122...
3 2469583035 4199682998971011301 10013436 334 245522675097001998 10026364 334 83993 2 7 [0.019, 1.415, 0.007, 0.088, 0.055, 0.015, 0.0... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.229352 [-7.96945015131, 5.27389714219, -0.67157260350... [-2.17288636144, -2.94799496305, 2.39064835948... [-1.82780563197, 1.2095764094, -0.154026208039...
4 2469583035 4199682998971011301 10013436 334 296751124749754369 10005367 334 427866 2 12 [0.066, 0.328, 0.043, 0.0, 0.062, 0.016, 0.303... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.393174 [9.4119644988, 2.66645049386, 7.34774326651, 2... [-2.17288636142, -2.94799496304, 2.39064835947... [3.70054030806, 1.04837917028, 2.88894206247, ...

Save the file for further processing


In [ ]:
df.to_pickle('top100k_user_pca.pkl')
