To do for 07282017:

  1. User_filter: Filter users based on certain features, e.g., consistency with a theme, viewing at certain times, or a certain time interval before each item view.
  2. Recommendation core: Essentially collaborative filtering (CF), but instead of using the raw items, I'd like to use features extracted from a CNN and dimension-reduced by t-SNE to maybe 20 dimensions.
  3. Processor: Inputs are (a) the log of user history and (b) item features; output is (a) the top-N ranked list of recommended items for each user.
  4. Evaluator: Evaluate whether the user buys an item within the top-N recommended items (a rough interface sketch follows this list).

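A rough sketch of how the recommendation core and evaluator could fit together, assuming hypothetical helper names (recommend_top_n, hit_at_n) and cosine similarity as the scoring function; this is a planning aid under those assumptions, not the final implementation.

In [ ]:
import numpy as np

def recommend_top_n(user_view_vecs, item_vecs, n=10):
    # Hypothetical recommendation core: score candidate items by cosine
    # similarity between the user's aggregated view vector and each item
    # vector, then return the indices of the N most similar items.
    profile = np.asarray(user_view_vecs).mean(axis=0)
    item_vecs = np.asarray(item_vecs)
    sims = item_vecs.dot(profile) / (
        np.linalg.norm(item_vecs, axis=1) * np.linalg.norm(profile) + 1e-12)
    return np.argsort(sims)[::-1][:n]

def hit_at_n(recommended_idx, bought_idx):
    # Hypothetical evaluator: did the bought item land in the top-N list?
    return bought_idx in set(recommended_idx)

The user_filter and processor would then decide which rows of the view log feed recommend_top_n, and aggregate the hits into an overall hit rate.
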
After trial run:

  • t-SNE for this number of samples and the target dimensionality may not be feasible. Need to time it on a small portion, or try PCA instead.

In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

In [2]:
os.chdir('/Users/Walkon302/Desktop/deep-learning-models-master/view2buy')

In [3]:
# Read the preprocessed file containing the user profiles and item features from the view2buy folder
df = pd.read_pickle('user_fea_for_eval.pkl')

In [4]:
# Drop the first column, which stores the data in its original format and is not needed here.
df.drop('0', axis = 1, inplace = True)

In [5]:
# Check the data
df.head()


Out[5]:
user_id buy_spu buy_sn buy_ct3 view_spu view_sn view_ct3 time_interval view_cnt view_secondes view_features buy_features
0 2469583035 4199682998971011301 10013436 334 220189917005230097 10013861 334 37496 7 45 [0.621, 0.542, 0.0, 0.369, 0.062, 0.039, 0.103... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757...
1 2469583035 4199682998971011301 10013436 334 234826617504419925 10003862 334 170826 2 23 [0.15, 0.98, 0.104, 1.295, 0.111, 0.0, 0.0, 0.... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757...
2 2469583035 4199682998971011301 10013436 334 235671027621670949 10003862 334 426968 2 11 [0.106, 0.027, 0.0, 1.398, 0.096, 0.021, 0.072... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757...
3 1488725183 4199682998971011301 10013436 334 235671027621670949 10003862 334 180564 1 22 [0.106, 0.027, 0.0, 1.398, 0.096, 0.021, 0.072... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757...
4 2469583035 4199682998971011301 10013436 334 245522675097001998 10026364 334 83993 2 7 [0.019, 1.415, 0.007, 0.088, 0.055, 0.015, 0.0... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757...

In [6]:
# Slice the data down to the first 100k rows
df = df.iloc[0:100000, :]

In [7]:
# Calculate the average view time (in seconds) over all viewed items, per user and bought item
avg_view_sec = pd.DataFrame(df.groupby(['user_id', 'buy_spu'])['view_secondes'].mean())

In [8]:
# Reset the index and rename the column
avg_view_sec.reset_index(inplace=True)
avg_view_sec.rename(columns = {'view_secondes':'avg_view_sec'}, inplace=True)

In [9]:
# Check the data
avg_view_sec.head()


Out[9]:
user_id buy_spu avg_view_sec
0 512596 300691773357412424 15.222222
1 814009 77763563263074335 13.128440
2 1165283 77200616039542809 14.714286
3 2164430 32446112180051996 13.486486
4 3603923 25972195386798122 22.380952

In [10]:
# Merge the average view time back into the data
df = pd.merge(df, avg_view_sec, on=['user_id', 'buy_spu'])

In [11]:
# Calculate the weight of each viewed item: its view time relative to the user's average view time
df['weight_of_view'] = df['view_secondes']/df['avg_view_sec']

In [12]:
df.head()


Out[12]:
user_id buy_spu buy_sn buy_ct3 view_spu view_sn view_ct3 time_interval view_cnt view_secondes view_features buy_features avg_view_sec weight_of_view
0 2469583035 4199682998971011301 10013436 334 220189917005230097 10013861 334 37496 7 45 [0.621, 0.542, 0.0, 0.369, 0.062, 0.039, 0.103... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 1.474403
1 2469583035 4199682998971011301 10013436 334 234826617504419925 10003862 334 170826 2 23 [0.15, 0.98, 0.104, 1.295, 0.111, 0.0, 0.0, 0.... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.753584
2 2469583035 4199682998971011301 10013436 334 235671027621670949 10003862 334 426968 2 11 [0.106, 0.027, 0.0, 1.398, 0.096, 0.021, 0.072... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.360410
3 2469583035 4199682998971011301 10013436 334 245522675097001998 10026364 334 83993 2 7 [0.019, 1.415, 0.007, 0.088, 0.055, 0.015, 0.0... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.229352
4 2469583035 4199682998971011301 10013436 334 296751124749754369 10005367 334 427866 2 12 [0.066, 0.328, 0.043, 0.0, 0.062, 0.016, 0.303... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.393174

In [13]:
# Generate view_item_vec and buy_item_vec
view_item_vec = df['view_features']
buy_item_vec = df['buy_features']

In [14]:
# Sanity check: both feature vectors should cover all 100k rows
print 'view_item', len(view_item_vec), 'buy_item', len(buy_item_vec)


view_item 100000 buy_item 100000

Try t-SNE and time it

  • It turns out that t-SNE is too time-consuming even for a small subset of the data. Part of the cost also comes from how I assembled the data (growing a DataFrame row by row). For PCA, I therefore collect the rows in a list first and convert everything to a numpy array at once, which is much faster (see the sketch below).

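For reference, a minimal sketch of that faster assembly strategy (the same idea used in the PCA section), assuming each element of view_item_vec is a flat feature list: collect the rows and convert them to a numpy array in a single call instead of repeatedly concatenating DataFrames.

In [ ]:
import numpy as np

# Slow pattern (used in the timing cells below): growing a DataFrame with
# pd.concat copies the accumulated frame on every iteration.
# Fast pattern: one conversion of the whole slice at once.
subset = view_item_vec.iloc[0:250]
fast = np.array(subset.tolist())   # shape: (250, n_features)
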
In [ ]:
# Generate t-SNE model (note: for n_components > 3, scikit-learn's default
# barnes_hut method may not be supported and method='exact' may be needed)
model = TSNE(n_components=10, random_state=0)

In [121]:
%%time
# Time t-SNE with 250 samples: build the input frame row by row, then fit
a = pd.DataFrame()
for vec in view_item_vec.iloc[0:250]:
    a = pd.concat([a, pd.DataFrame(vec).transpose()], axis=0)
vt = model.fit_transform(a)


CPU times: user 22.3 s, sys: 501 ms, total: 22.8 s
Wall time: 22.8 s

In [114]:
%%time
# Time t-SNE with 500 samples
a = pd.DataFrame()
for vec in view_item_vec.iloc[0:500]:
    a = pd.concat([a, pd.DataFrame(vec).transpose()], axis=0)
vt = model.fit_transform(a)


CPU times: user 1min 23s, sys: 2.57 s, total: 1min 25s
Wall time: 1min 31s

In [113]:
%%time
# Time t-SNE with 1000 samples
a = pd.DataFrame()
for vec in view_item_vec.iloc[0:1000]:
    a = pd.concat([a, pd.DataFrame(vec).transpose()], axis=0)
vt = model.fit_transform(a)


CPU times: user 4min 25s, sys: 6.05 s, total: 4min 31s
Wall time: 4min 33s

Try PCA instead

  • PCA looks reasonable. We can process ~300k rows in around 30 seconds, as long as it does not blow up my RAM. I will proceed with this setting for the first try.

In [17]:
# Generate PCA model with 200 components
model = PCA(n_components=200, random_state=0)

Append all view_items for PCA processing


In [18]:
%%time
# Collect all view feature vectors in a list, then convert to a numpy array at once
view_item = []
for i in view_item_vec:
    view_item.append(i)
view_item = np.array(view_item)


CPU times: user 5.93 s, sys: 1.15 s, total: 7.08 s
Wall time: 7.1 s

In [19]:
%%time
# Fit PCA on the view features and project them onto 200 components
pca_view_vec = model.fit_transform(view_item)


CPU times: user 1min 1s, sys: 10.3 s, total: 1min 11s
Wall time: 47.3 s

In [20]:
# 200 PCA components explain ~85% of the variance. Beyond that, e.g., 300 components, my computer (8 GB RAM) runs out of memory
sum(model.explained_variance_ratio_)


Out[20]:
0.84791818306676126

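If more components are ever needed despite the 8 GB limit, one hedged option is scikit-learn's IncrementalPCA, which fits in mini-batches and keeps peak memory lower at the cost of an approximate decomposition. A minimal sketch, assuming view_item is the feature array built above and a guessed batch_size:

In [ ]:
from sklearn.decomposition import IncrementalPCA

# Mini-batch PCA to reduce peak memory; batch_size is a guess to tune against RAM.
ipca = IncrementalPCA(n_components=300, batch_size=5000)
pca_view_vec_300 = ipca.fit_transform(view_item)
print(sum(ipca.explained_variance_ratio_))
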
Append all buy_items for PCA processing


In [22]:
%%time
# Collect all buy feature vectors in a list, then convert to a numpy array at once
buy_item = []
for i in buy_item_vec:
    buy_item.append(i)
buy_item = np.array(buy_item)


CPU times: user 4.83 s, sys: 656 ms, total: 5.49 s
Wall time: 5.54 s

In [23]:
%%time
# Fit PCA on the buy features (note: a separate fit from the view features)
pca_buy_vec = model.fit_transform(buy_item)


CPU times: user 1min 2s, sys: 10.6 s, total: 1min 12s
Wall time: 52.3 s

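A side note that is my assumption rather than something established here: because the PCA is re-fitted on the buy features, pca_view and pca_buy end up in two different bases, which would matter if the two vectors are later compared directly (e.g., by cosine similarity in the CF step). A minimal alternative sketch, run in place of the fit_transform cell above so both sets share the basis fitted on the view features:

In [ ]:
# Project the buy features with the PCA already fitted on the view features,
# so view and buy vectors live in the same 200-D space. This would replace
# model.fit_transform(buy_item) above, while `model` still holds the view fit.
pca_buy_vec = model.transform(buy_item)
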
In [24]:
# Insert the PCA results into the data
df['pca_view'] = pca_view_vec.tolist()
df['pca_buy'] = pca_buy_vec.tolist()

In [25]:
# Check the data
df.head()


Out[25]:
user_id buy_spu buy_sn buy_ct3 view_spu view_sn view_ct3 time_interval view_cnt view_secondes view_features buy_features avg_view_sec weight_of_view pca_view pca_buy
0 2469583035 4199682998971011301 10013436 334 220189917005230097 10013861 334 37496 7 45 [0.621, 0.542, 0.0, 0.369, 0.062, 0.039, 0.103... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 1.474403 [-5.74788202222, 6.52824387495, -5.06289858541... [-2.17288636144, -2.94799496304, 2.39064835946...
1 2469583035 4199682998971011301 10013436 334 234826617504419925 10003862 334 170826 2 23 [0.15, 0.98, 0.104, 1.295, 0.111, 0.0, 0.0, 0.... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.753584 [1.43729181654, 1.08420451467, 7.66008274012, ... [-2.17288636144, -2.94799496305, 2.39064835945...
2 2469583035 4199682998971011301 10013436 334 235671027621670949 10003862 334 426968 2 11 [0.106, 0.027, 0.0, 1.398, 0.096, 0.021, 0.072... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.360410 [1.94162062285, -2.13649823253, 10.3696365616,... [-2.17288636142, -2.94799496302, 2.39064835945...
3 2469583035 4199682998971011301 10013436 334 245522675097001998 10026364 334 83993 2 7 [0.019, 1.415, 0.007, 0.088, 0.055, 0.015, 0.0... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.229352 [-7.96945015131, 5.27389714219, -0.67157260350... [-2.17288636144, -2.94799496305, 2.39064835948...
4 2469583035 4199682998971011301 10013436 334 296751124749754369 10005367 334 427866 2 12 [0.066, 0.328, 0.043, 0.0, 0.062, 0.016, 0.303... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.393174 [9.4119644988, 2.66645049386, 7.34774326651, 2... [-2.17288636142, -2.94799496304, 2.39064835947...

In [15]:
# Reload the previously saved weighted data (cell numbers reset after a kernel restart)
df = pd.read_pickle('df_weighted.pkl')

In [123]:
# Calculate the weighted pca_view
df['weighted_view_pca'] = df.apply(lambda x: [y*x['weight_of_view'] for y in x['pca_view']], axis=1)

In [16]:
# Calculate the weighted pca_buy
df['weighted_buy_pca'] = df.apply(lambda x: [y*x['weight_of_view'] for y in x['pca_buy']], axis=1)

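The row-wise apply above works, but on larger slices a numpy-broadcast version tends to be much faster. A hedged sketch, assuming the PCA columns hold fixed-length lists:

In [ ]:
import numpy as np

# Broadcast each row's view weight across its 200 PCA components in one shot.
pca_view_arr = np.array(df['pca_view'].tolist())        # shape (n_rows, 200)
weights = df['weight_of_view'].values.reshape(-1, 1)    # shape (n_rows, 1)
df['weighted_view_pca'] = (pca_view_arr * weights).tolist()
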
In [20]:
# Check the data
df.head()


Out[20]:
user_id buy_spu buy_sn buy_ct3 view_spu view_sn view_ct3 time_interval view_cnt view_secondes view_features buy_features avg_view_sec weight_of_view pca_view pca_buy weighted_view_pca
0 2469583035 4199682998971011301 10013436 334 220189917005230097 10013861 334 37496 7 45 [0.621, 0.542, 0.0, 0.369, 0.062, 0.039, 0.103... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 1.474403 [-5.74788202222, 6.52824387495, -5.06289858541... [-2.17288636144, -2.94799496304, 2.39064835946... [-8.47469294744, 9.62526059379, -7.46475149794...
1 2469583035 4199682998971011301 10013436 334 234826617504419925 10003862 334 170826 2 23 [0.15, 0.98, 0.104, 1.295, 0.111, 0.0, 0.0, 0.... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.753584 [1.43729181654, 1.08420451467, 7.66008274012, ... [-2.17288636144, -2.94799496305, 2.39064835945... [1.08311956687, 0.817038760544, 5.77251286354,...
2 2469583035 4199682998971011301 10013436 334 235671027621670949 10003862 334 426968 2 11 [0.106, 0.027, 0.0, 1.398, 0.096, 0.021, 0.072... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.360410 [1.94162062285, -2.13649823253, 10.3696365616,... [-2.17288636142, -2.94799496302, 2.39064835945... [0.699778627213, -0.770014380052, 3.7373161122...
3 2469583035 4199682998971011301 10013436 334 245522675097001998 10026364 334 83993 2 7 [0.019, 1.415, 0.007, 0.088, 0.055, 0.015, 0.0... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.229352 [-7.96945015131, 5.27389714219, -0.67157260350... [-2.17288636144, -2.94799496305, 2.39064835948... [-1.82780563197, 1.2095764094, -0.154026208039...
4 2469583035 4199682998971011301 10013436 334 296751124749754369 10005367 334 427866 2 12 [0.066, 0.328, 0.043, 0.0, 0.062, 0.016, 0.303... [0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757... 30.520833 0.393174 [9.4119644988, 2.66645049386, 7.34774326651, 2... [-2.17288636142, -2.94799496304, 2.39064835947... [3.70054030806, 1.04837917028, 2.88894206247, ...

Save the file for further processing


In [ ]:
df.to_pickle('top100k_user_pca.pkl')
