Artem Zaika: User authentication at the visited web pages
In this project we will solve the problem of identification of the user behavior on the Internet. This is a complex and interesting challenge at the intersection of data Analytics and behavioral psychology. As an example, Yandex solves the problem of identifying the attacker's mailbox by his behavior. In a nutshell, the attacker will behave not as the owner of the box: he may not delete messages immediately after reading, as did the master, it will be another way to put check boxes, messages, and even to move the mouse. Then this attacker can be identified and "thrown out" from a mailbox, inviting the owner to enter the SMS code.
We will solve this problem: for a sequence of web sites visited in succession one and the same person, we're going to identify this person. The idea is this: Internet users in different ways click on links and it can help to identify them (someone at first in mail, then about football to read, then news, contact, then finally to work, someone from work).
We use data from article "A Tool for Classification of Sequential Data". And although this this article can't be recommend (the methods described are far from state-of-the-art, better to turn to the book "Frequent Pattern Mining", and the latest articles from ICDM), but the data they collected carefully and are of interest.
Data collected from proxy servers of Blaise Pascal University and have a very simple view:
Download the original data can be link there is description in the article.
For this work I'll be using just 10 and 150 users. Link archive capstone_websites_data.zip (~7.1 Mb, and in expanded form ~ 65 Mb)
In [2]:
import warnings
warnings.filterwarnings('ignore')
from glob import glob
from datetime import datetime as dttime
import numpy as np
import pandas as pd
pd.set_option('display.max.columns', 25)
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
from collections import defaultdict, Counter
from scipy.sparse import hstack as sp_hstack
from tsfresh.transformers import RelevantFeatureAugmenter
import xgboost as xgb
import pickle
import seaborn as sns
%pylab inline
In my project, I will follow these steps: first, I will examine the data collected from the servers of the University Blaise Pascal, research it, try to find information useful for making features. Then I'll train the model and compare the results.
After we downloaded the data, in the directory "capstone_websites_data" should be nested directories "10users" and "150users" contains data of 10 and 150 users, respectively, collected in a separate file for each user. Upload data for a single user.
In [3]:
data = pd.read_csv('capstone_websites_data/150users/user0006.csv', names=['user_id', 'time' ,'sites'])
data.head()
Out[3]:
As we can see, nothing but the time of visit and visited sites we have. This is very little for building the model. What can we do here?
I think each of us has our own habits, not only in life but also in the behavior of the Internet. Someone in the morning, loves to read wall in facebook while drinking his morning coffee, someone reads the news, and someone just listens to music from your favorite streaming service.
Let us look for such regularities in the behavior of each user, they will help us in future to identify a user when we see some behavior, but still do not know who owns this behavior. More specific, we will classify user's session from 10 consecutively visited sites. Object in this task, you will session of the 10 sites visited sequentially by the same user and features – indexes of these 10 sites (later there will be a "bag" of sites, the approach Bag of Words). The target class is the user id.
To accelerate further process of generating and building new features, build a frequency dictionary of sites. {'site_string': (site_id, site_freq)}.
It should look like: {'vk.com': (1, 2), 'google.com': (2, 2), 'yandex.ru': (3, 3), 'facebook.com': (4, 1)}
In [4]:
def build_site_freq(csv_files_mask):
# sites visiting frequency
freq_dict = defaultdict(int)
# site index in dictionary
site_idx = 0
for file_name in sorted(glob(csv_files_mask)):
data = pd.read_csv(file_name, names=['user_id', 'date', 'site'])
# count site frequencies, build companies index
for id, row in enumerate(data_slice.iterrows()):
site_name = row[1].get('site')
if freq_dict[site_name]:
freq_dict[site_name][1] += 1
else:
site_idx += 1
freq_dict[site_name] = [site_idx, 1]
return freq_dict
In [5]:
site_freq_150users = prepare_train_set('capstone_websites_data/150users/*')
In [6]:
with open('capstone_websites_data/site_freq_150users.pkl', 'wb') as site_freq_150users_pkl:
pickle.dump(site_freq_150users, site_freq_150users_pkl)
Implement the function – prepare_train_set_with_fe, (from the "feature engineering"), which will create the following 13 features:
The function should return a new DataFrame. The order in which you add features: site1, ... site10, time_diff1, ..., time_diff9, session_timespan, #unique_sites, start_hour,day_of_week,target (this can be seen just below on how the function is called)
In [8]:
def time_diff(data):
FMT = '%Y-%m-%dT%H:%M:%S'
data_start_idx = data.index.values[:1][0]
ts_min, ts_max = dttime.strptime(data.ix[data_start_idx]['date'], FMT), dttime.strptime(data.ix[data_start_idx]['date'], FMT)
d_len = len(data.index)-1
st_diff = []
for idx in xrange(d_len):
next_st = dttime.strptime(data.ix[data_start_idx+idx+1]['date'], FMT)
present_st = dttime.strptime(data.ix[data_start_idx+idx]['date'], FMT)
tdelta = next_st - present_st
st_diff.append(tdelta.seconds)
if next_st > ts_max:
ts_max = next_st
elif next_st < ts_min:
ts_min = next_st
if len(st_diff) == 0:
st_diff = [0]
return {"day_of_week": 1+ts_min.weekday(), "start_hour": ts_min.hour, "timespan": (ts_max - ts_min).seconds, "time_diff": st_diff}
def prepare_train_set_with_fe(csv_files_mask, site_freq_path, feature_names, session_length=10, window_size=1):
# user sessions matrix
user_session = []
features_len = len(feature_names)
if window_size == 1:
window_size = session_length
with open(site_freq_path, 'rb') as XX_pkl:
freq_dict = pickle.load(XX_pkl)
for file_name in sorted(glob(csv_files_mask)):
data = pd.read_csv(file_name, names=['user_id', 'date', 'site'])
user_id = data['user_id'][0]
end_idx = session_length
rows_count = data.shape[0]
for start_idx in xrange(0, rows_count, window_size):
if end_idx >= rows_count:
data_slice = data[start_idx:rows_count]
else:
data_slice = data[start_idx:end_idx]
if end_idx + window_size > rows_count:
end_idx = rows_count
else:
end_idx += window_size
# build session row
session = np.zeros(features_len, dtype=np.int32)
data_slice_len = len(data_slice)
tm_diff = time_diff(data_slice)
# session ids
session[0:data_slice_len] = [freq_dict[site][0] for site in data_slice['site'].values.tolist()]
# time diff
session[10:10 + len(tm_diff["time_diff"])] = tm_diff["time_diff"]
# timespan
session[features_len - 5] = tm_diff["timespan"]
# uniques
session[features_len - 4] = len(data_slice['site'].unique())
# start_hour
session[features_len - 3] = tm_diff["start_hour"]
# day_of_week
session[features_len - 2] = tm_diff["day_of_week"]
# target
session[features_len - 1] = user_id
user_session.append(session.tolist())
return pd.DataFrame(data=user_session, columns=feature_names)
Apply the function prepare_train_set_with_fe data for 150 users, specify session_length=10.
In [9]:
feature_names = ['site' + str(i) for i in range(1,11)] + \
['time_diff' + str(j) for j in range(1,10)] + \
['session_timespan', '#unique_sites', 'start_hour', 'day_of_week', 'target']
train_data_150users = prepare_train_set_with_fe('capstone_websites_data/150users/*',
site_freq_path='capstone_websites_data/site_freq_150users.pkl',
feature_names=feature_names, session_length=10)
In [139]:
train_data_150users.head()
Out[139]:
If we look carefully, features site1, ..., site10 does not make sense as features in a classification problem. But if we use the idea of bag-of-words from analysis of text is another matter. Let's create a new matrix in which rows correspond to sessions of the 10 sites, and the columns – the indices of the sites. At the intersection of row $i$ and column $j$ will be the number $n_{ij}$ is the number of times the site $j$ found in session room $i$. We will do this using sparse matrices Scipy – csr_matrix.
Create a method that translates data on visits to websites in a bag sites
In [11]:
def getSparseMatrix(data):
n_data, rows, cols = [], [], []
for row in data.iterrows():
counter = Counter(row[1])
for itm in counter.items():
if itm[0] != 0:
rows.append(row[0])
n_data.append(itm[1])
cols.append(itm[0] - 1)
return n_data, rows, cols
In [186]:
X_150users, y_150users = train_data_150users.iloc[:, :-1], train_data_150users.iloc[:, -1]
In [157]:
X_150users.head()
Out[157]:
In [23]:
drc = getSparseMatrix(X_150users[['site' + str(i) for i in range(1,11)]])
X_sparse_150users = csr_matrix((drc[0],(drc[1],drc[2])), shape=(np.shape(X_150users)[0], len(site_freq_150users)), dtype=int)
In [15]:
with open('capstone_websites_data/X_sparse_150users.pkl', 'wb') as X150_pkl:
pickle.dump(X_sparse_150users, X150_pkl)
Visual analysis we need in order to visually assess the change indication of different users. If all users of any feature is changed slightly or not changed at all - we assume such a feature is weak, as it does not carry any signal, which will help us to distinguish users from each other.
Choose 10 users to analyze
In [24]:
id_name_dict = {128: 'Mary-Kate', 39: 'Ashley', 207: 'Lindsey', 127: 'Naomi', 237: 'Avril',
33: 'Bob', 50: 'Bill', 31: 'John', 100: 'Dick', 241: 'Ed'}
train_data_10users = train_data_150users[(train_data_150users.target == 128) |
(train_data_150users.target == 39) |
(train_data_150users.target == 207) |
(train_data_150users.target == 127) |
(train_data_150users.target == 33) |
(train_data_150users.target == 50) |
(train_data_150users.target == 31) |
(train_data_150users.target == 100) |
(train_data_150users.target == 241) |
(train_data_150users.target == 237)]
train_data_10users['target'] = train_data_10users['target'].map(id_name_dict)
In [25]:
color_dic = {'Mary-Kate': 'pink', 'Ashley': 'darkviolet', 'Lindsey':'blueviolet',
'Naomi': 'hotpink', 'Avril': 'orchid',
'Bob': 'firebrick', 'Bill': 'gold', 'John': 'forestgreen',
'Dick': 'slategrey', 'Ed':'brown'}
1. Build the histogram of session length in seconds (session_timespan). The limit x a value of 200 (otherwise we get too heavy in the tail).
In [27]:
s = train_data_10users['session_timespan']
ax = s[s < 200].plot(kind='hist', color='darkviolet', legend=True)
ax.set_ylabel("Number of sessions")
ax.set_xlabel("Session length(s)")
Out[27]:
As the graph shows, the distribution of the lengths of user sessions sloping to the right, dominated by short sessions. This uneven distribution tells us that the sign of the "Long session" will be useful when building the model.
2. Build a histogram of distribution of the number of unique sites in the session(#unique_sites).
In [28]:
ax = train_data_10users['#unique_sites'].plot(kind='hist', color='aqua', legend=True)
ax.set_ylabel("Visits frequency")
ax.set_xlabel("Number of unique sites in session")
Out[28]:
The distribution of the number of unique sites in the session almost even with the average in point 5-6. "The number of unique sites in session" will also be useful when building the model
3. Build a histogram of the distribution of the number of unique sites in the session (#unique_sites) for each of the 10 users separately.
In [29]:
fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(16, 10))
for idx, (user, sub_df) in enumerate(pd.groupby(train_data_10users, 'target')):
ax = sub_df['#unique_sites'].plot(subplots=True, kind="hist", color=color_dic[user], ax=axes[idx/4, idx%4])
ax[0].legend(labels=[user])
ax[0].set_ylabel("Visits frequency")
ax[0].set_xlabel("Number of unique sites in session")
fig.tight_layout()
As we can see from the graphs the seven distributions of the number of unique site in the session is very similar, but three users - Ashley, Dick, Mary-Kate have two favorite website and often visit them. Feature "Number of unique sites in the session" I also find it useful when building the model.
4. Build a histogram of the distribution of start hour of the session (start_hour).
In [30]:
ax = train_data_10users['start_hour'].plot(kind='hist', color='darkgreen', legend=True)
ax.set_ylabel("Frequency")
ax.set_xlabel("Start time of visiting")
Out[30]:
5. Build a histogram of the distribution of start hour of users session (start_hour) for each of the 10 users separately.
In [31]:
fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(16, 10))
for idx, (user, sub_df) in enumerate(pd.groupby(train_data_10users, 'target')):
ax = sub_df['start_hour'].plot(subplots=True, kind="hist", color=color_dic[user], ax=axes[idx/4, idx%4])
ax[0].legend(labels=[user])
ax[0].set_ylabel("Frequency")
ax[0].set_xlabel("Start time of visiting")
fig.tight_layout()
As we can see, start time distribution in the session for each user is different. Maybe it depends on the pace of life, habits and age. The majority of visits occur during lunch time, which would mean that these users are the employees of the office and actively use the Internet while lunch break.
6. Build a histogram of the distribution of day of week on which the session began (day_of_week).
In [34]:
ax = train_data_10users['day_of_week'].hist(color='sienna', bins=7)
ax.set_ylabel("Frequency of visiting")
ax.set_xlabel("Day of week(1-Mon, 7-Sun)")
Out[34]:
7. Build a histogram day of week distribution when the session began (day_of_week) for each of the 10 users separately.
In [35]:
fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(16, 10))
for idx, (user, sub_df) in enumerate(pd.groupby(train_data_10users, 'target')):
ax = sub_df['day_of_week'].plot(subplots=True, kind="hist", color=color_dic[user], ax=axes[idx/4, idx%4],
bins=7)
ax[0].legend(labels=[user])
ax[0].set_ylabel("Frequency")
ax[0].set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
fig.tight_layout()
As we can see the behavior of users differs according to the days of the week.
Let's take a loot at top 10 most visited websites (top10_sites) and the corresponding number of visits (top10_freqs).
In [37]:
top_10_sorted = sorted(site_freq_150users.items(), key=lambda x: x[1][1], reverse=True)
top10_freqs = [freq[1][1] for freq in top_10_sorted[:10]]
top10_sites = [freq[0] for freq in top_10_sorted[:10]]
top10_sites
Out[37]:
9. Frequencies of visits to the top 10 websites.
In [38]:
ax = sns.barplot(top10_sites, top10_freqs)
ax.set(xlabel='Sites visited', ylabel='Frequency')
xticks(rotation=90)
Out[38]:
Popularity of the sites among users is also different. This again points us to take into account the this feature, which will show how many sites from top popular user visited.
Sometimes training of models for 150 users takes huge amount of time, sometimes even more than 24 hours, I decided separate limited 10 users sample and use it for experiments.
In [39]:
with open('capstone_websites_data/X_sparse_10users.pkl', 'rb') as X10_pkl:
X_sparse_10users = pickle.load(X10_pkl)
with open('capstone_websites_data/y_10users.pkl', 'rb') as y10_pkl:
y_10users = pickle.load(y10_pkl)
Divide the sample into 2 parts. One will carry out cross-validation, the second is to evaluate the model trained after cross-validation.
In [40]:
X_train, X_valid, y_train, y_valid = train_test_split(X_sparse_10users, y_10users,
test_size=0.3,
random_state=7, stratify=y_10users)
Utility function for drawing curves validation after launch of GridSearchCV (or RandomizedCV).
In [41]:
def plot_validation_curves(param_values, grid_cv_results_):
train_mu, train_std = grid_cv_results_['mean_train_score'], grid_cv_results_['std_train_score']
valid_mu, valid_std = grid_cv_results_['mean_test_score'], grid_cv_results_['std_test_score']
train_line = plt.plot(param_values, train_mu, '-', label='train', color='green')
valid_line = plt.plot(param_values, valid_mu, '-', label='test', color='red')
plt.fill_between(param_values, train_mu - train_std, train_mu + train_std, edgecolor='none',
facecolor=train_line[0].get_color(), alpha=0.2)
plt.fill_between(param_values, valid_mu - valid_std, valid_mu + valid_std, edgecolor='none',
facecolor=valid_line[0].get_color(), alpha=0.2)
plt.legend()
1. Train random forest (RandomForestClassifier) of 100 trees. Take a look at the OOB-estimate
In [42]:
from sklearn.ensemble import RandomForestClassifier
In [43]:
forest = RandomForestClassifier(n_estimators=100, random_state=7, oob_score=True, n_jobs=-1 )
forest.fit(X_train, y_train)
Out[43]:
In [44]:
rf_valid_acc = accuracy_score(y_valid, forest.predict(X_valid))
print 'Accuracy score:', rf_valid_acc
print 'OOB score:', forest.oob_score_
2. Train a logistic regression (LogisticRegression) with default parameter C and random_state=7 (for reproducibility). Look at the proportion of correct answers on cross-validation (use skf created earlier) and sample (X_valid, y_valid).
In [45]:
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=7)
In [46]:
logit = LogisticRegression(random_state=7, n_jobs=-1)
Evaluate Logistic Regression on cross-validation and pending sample
In [47]:
%%time
logit_cv_scores = cross_val_score(logit, X_train, y_train, cv=skf)
print u'Scross-val score ', logit_cv_scores
In [48]:
logit_ac = LogisticRegression(random_state=7, n_jobs=-1)
logit_ac.fit(X_train, y_train)
acc_score = accuracy_score(y_valid, logit_ac.predict(X_valid))
print u'Hold-out score ', acc_score
The average proportion of correct answers on cross-validation for each of the 10 parameters C.
In [49]:
%%time
logit_c_values1 = np.linspace(1e-4, 1e4, 10)
logit_mean_cv_scores1 = [cross_val_score(LogisticRegression(C=sc, random_state=7, n_jobs=-1),
X_train, y_train, cv=skf).mean() for sc in logit_c_values1]
print logit_mean_cv_scores1
Draw a graph of the proportion of correct answers on cross-validation according to С.
In [50]:
plot(logit_c_values1, logit_mean_cv_scores1);
Now do the same, but search parameter C in range np.linspace(1e-3, 1, 30).
In [51]:
%%time
logit_c_values2 = np.linspace(1e-3, 1, 10)
logit_mean_cv_scores2 = [round(cross_val_score(LogisticRegression(C=sc, random_state=7, n_jobs=-1),
X_train, y_train, cv=skf).mean(), 5) for sc in logit_c_values2]
print logit_mean_cv_scores2
In [52]:
plot(logit_c_values2, logit_mean_cv_scores2);
Conclusions:
As can be seen from the experiments, LinearRegression and RandomForest show acceptable quality for the data of 10 users, but in fact, with the growth of the user sampling time and consumption of resources grow significantly. Question is these models are useful in real life is still open.
Another thing can be seen from the experiments, that visualization of the customizable options can significantly speed up their adjustment.
3. As we can see the build LinearRegression and RandomForest can take a lot of time, especially if you have many users. Here to the aid can come Vowpal Wabbit bulky, ugly and repulsive, but as an old tractor crushing all in its path, taking minimum time, making good stuff with good quality.
For VW, we will conduct experiments on 150 users
In [53]:
with open('capstone_websites_data/X_sparse_150users.pkl', 'rb') as X150_pkl:
X_sparse_150users = pickle.load(X150_pkl)
with open('capstone_websites_data/y_150users.pkl', 'rb') as y150_pkl:
y_150users = pickle.load(y150_pkl)
Vowpal Wabbit likes that the class labels were distributed from 1 to K, where K is the number of classes in a classification problem (in our case 150). So we'll have to apply LabelEncoder, and even +1 then add (LabelEncoder translates labels in a range from 0 to K-1).
In [54]:
class_encoder = LabelEncoder().fit(y_150users.astype('str'))
y_for_vw = class_encoder.transform(y_150users.astype('str')) + 1
In [55]:
X_train, X_valid, y_train, y_valid = train_test_split(X_sparse_150users, y_for_vw,
test_size=0.3,
random_state=7, stratify=y_for_vw)
Implement a function that transforms the sparse matrix format Vowpal Wabbit.
In [56]:
def sparse_matrix_to_vw(X_sparse, y=None, out_file='tmp.vw'):
with open(out_file, 'w') as out_file_data:
for i in xrange(np.shape(X_sparse)[0]):
if y != None:
s = '{0} | sites'.format(y[i])
else:
s = '1 | sites'
indices_data = zip(X_sparse[i].indices, X_sparse[i].data)
out_file_data.write(s + ''.join(map(lambda pair: ' {0}:{1} '.format(pair[0], pair[1]), indices_data))
+ '\n')
Apply a function to the training sample (X_train, y_train) and hold-out sample (X_valid, y_valid).
In [57]:
sparse_matrix_to_vw(X_train, y_train, 'train_part.vw')
sparse_matrix_to_vw(X_valid, y_valid, 'valid.vw')
Evaluate the first 5 rows of the train, validation.
In [59]:
!head -5 train_part.vw
In [60]:
!head -5 valid.vw
Train the model on a sample of train_part.vw. Point out, that we solve the problem of classification with 150 classes (--oaa), will make 10 passes through the sample (--passes). Set cache file (--cache_file), so the VW will be faster after the first passages in the sample (past the cache file is removed with the help of the argument -k). Also set value of the parameter b=26. Its the number of bits used for hashing, in our case we need more than the default 18.
In [62]:
%%time
!vw -d train_part.vw -k -b 26 --oaa 150 --passes 10 --cache_file train_part.cachef \
-f train_part.model --random_seed=123
Write predictions on a sample valid.vw in vw_valid_pred.csv.
In [63]:
%%time
!vw -i train_part.model -t -d valid.vw -p vw_valid_pred.csv --random_seed=123
Read predictions from vw_valid_pred.csv file and look at the proportion of correct answers on the hold-out part.
In [65]:
vw_valid_pred = pd.read_csv('vw_valid_pred.csv', names=['target'])
acc_score = accuracy_score(y_valid, vw_valid_pred)
print 'accuracy', acc_score
So, we get Accuracy score: 0.4460 on hold-out dataset. Seemed not to be so much, but note how faster model was trained and in addition we used all 150 users! Also note that we have trained model on a sample consisting only of information about visited sites. Let's extend our dataset with more features.
While doing Visual analysis of data, we noticed that user behavior differs according to time and day of the week. Some users more active in the morning, some in the evening. I think these signs are important, because usually if the person was active in the morning, and then started to visit many sites in the evening, and not from a set it is most often visited, - it is a direct hint that this account may be used by someone else. So, I decided to add such features:
In [66]:
def feature_engineering(csv_files_mask, site_freq_path, feature_names, session_length=10, window_size=1, f_type=(True, True)):
FMT = '%Y-%m-%dT%H:%M:%S'
user_session = []
features_len = len(feature_names)
if window_size == 1:
window_size = session_length
with open(site_freq_path, 'rb') as XX_pkl:
freq_dict = pickle.load(XX_pkl)
for file_name in sorted(glob(csv_files_mask)):
data = pd.read_csv(file_name, names=['user_id', 'date', 'site'])
user_id = data['user_id'][0]
end_idx = session_length
rows_count = data.shape[0]
if f_type[0] and f_type[1]:
top_30_sorted = sorted(freq_dict.items(), key=lambda x: x[1][1], reverse=True)
user_top30_sites = [freq[0] for freq in top_30_sorted[:10]]
for start_idx in xrange(0, rows_count, window_size):
if end_idx >= rows_count:
data_slice = data[start_idx:rows_count]
else:
data_slice = data[start_idx:end_idx]
if end_idx + window_size > rows_count:
end_idx = rows_count
else:
end_idx += window_size
# build session row
session = np.zeros(features_len, dtype=np.int32)
data_slice_len = len(data_slice)
tm_diff = time_diff2(data_slice)
if f_type[0] and f_type[1]:
# uniques
uniques = data_slice['site'].unique()
session[0] = len(uniques)
# start_hour
session[1] = tm_diff["start_hour"]
# day_of_week
session[2] = tm_diff["day_of_week"]
# how much site from user's top 20 he saw in session
session[3] = len([st for st in uniques if st in user_top30_sites])
# part of the day
if 6 < tm_diff["start_hour"] < 13:
session[4] = 1
if 12 < tm_diff["start_hour"] < 19:
session[4] = 2
if tm_diff["start_hour"] > 18:
session[4] = 3
if 0 < tm_diff["start_hour"] < 7:
session[4] = 4
if tm_diff["day_of_week"] < 6:
session[5] = 1
else:
session[5] = 2
# target
session[features_len - 1] = user_id
user_session.append(session.tolist())
return pd.DataFrame(data=user_session, columns=feature_names)
Prepare dataset for 10 users only
In [67]:
feature_names = ['#unique_sites', 'start_hour', 'day_of_week', '#of_top_30', 'time_of_day', 'working_day', 'target']
new_features_10users = feature_engineering('capstone_websites_data/10users/*', 'capstone_websites_data/site_freq_10users.pkl',
feature_names, session_length=10)
In [68]:
new_features_10users.head()
Out[68]:
In [69]:
fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(16, 10))
for idx, (user, sub_df) in enumerate(pd.groupby(new_features_10users, 'target')):
ax = sub_df['time_of_day'].plot(subplots=True, kind="hist", color=color_dic[id_name_dict[user]],
ax=axes[idx/4, idx%4], bins=4)
ax[0].legend(labels=[id_name_dict[user]])
ax[0].set_ylabel("Visits frequency")
ax[0].set_xlabel("Part of the day")
ax[0].set_xticklabels(['Morning', 'Noon', 'Evening', 'Night'])
fig.tight_layout()
In [70]:
fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(16, 10))
for idx, (user, sub_df) in enumerate(pd.groupby(new_features_10users, 'target')):
ax = sub_df['working_day'].plot(subplots=True, kind="hist", color=color_dic[id_name_dict[user]], legend=False,
ax=axes[idx/4, idx%4], bins=3)
ax[0].yaxis.set_tick_params(pad=35)
ax[0].legend(labels=[id_name_dict[user]])
ax[0].set_ylabel("Visits frequency")
ax[0].set_xticklabels(['Working day', "" , 'Weekend'])
fig.tight_layout()
In [72]:
fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(16, 10))
for idx, (user, sub_df) in enumerate(pd.groupby(new_features_10users, 'target')):
ax = sub_df['#of_top_30'].plot(subplots=True, kind="hist", color=color_dic[id_name_dict[user]],
legend=False, ax=axes[idx/4, idx%4], bins=10)
ax[0].legend(labels=[user])
ax[0].set_ylabel("Visits frequency")
ax[0].set_xlabel("A number of websites from user TOP30")
fig.tight_layout()
As we see, the new features perfectly show the difference in the behavior of users and thus can help us improve accuracy score.
Prepare dataset for all 150 users
In [73]:
feature_names = ['#unique_sites', 'start_hour', 'day_of_week', '#of_top_30', 'time_of_day', 'working_day', 'target']
new_features_150users = feature_engineering('capstone_websites_data/150users/*', 'capstone_websites_data/site_freq_150users.pkl',
feature_names, session_length=10)
In [74]:
new_features_150users.head()
Out[74]:
In [211]:
y_target = new_features_150users.iloc[:, -1]
nf_150users = new_features_150users.iloc[:, :-1]
nf_150users.head()
Out[211]:
We must remember that the features start_hour, day_of_week, time_of_day , working_day - categorical, they have to be binarized before using
In [234]:
from sklearn.preprocessing import OneHotEncoder
binarized = OneHotEncoder().fit_transform(nf_150users)
_X_150users = sp_hstack((X_sparse_150users, binarized)).tocsr()
print np.shape(_X_150users)
In [235]:
X_train, X_valid, y_train, y_valid = train_test_split(_X_150users, y_for_vw,
test_size=0.3,
random_state=7,
stratify=y_for_vw)
In [229]:
print np.shape(_X_150users)
print np.shape(y_for_vw)
Convert to format VW
In [236]:
sparse_matrix_to_vw(X_train, y_train, 'train_part.vw')
sparse_matrix_to_vw(X_valid, y_valid, 'valid.vw')
In [216]:
!head -5 train_part.vw
In [217]:
!head -5 valid.vw
Train model
In [237]:
%%time
!vw -d train_part.vw -k -b 26 --oaa 150 --passes 10 --cache_file train_part.cachef \
-f train_part.model --random_seed=123
Write predictions on a sample valid.vw in vw_valid_pred.csv.
In [238]:
%%time
!vw -i train_part.model -t -d valid.vw -p vw_valid_pred.csv --random_seed=123
Estimation on hold-out dataset
In [239]:
vw_valid_pred = pd.read_csv('vw_valid_pred.csv', names=['target'])
acc_score = accuracy_score(y_valid, vw_valid_pred)
print 'accuracy', acc_score
We see that adding new features improved the accuracy of our predictions from 0.44 to 0.52
From my work I can make some important conclusions:
You can still spend more time and computing resources and to try to adjust and optimize the algorithm. Visualisation can significantly accelerate the selection of parameters, a optimization using a grid search to better fine - tune the model.