This is a simplified version of the competition organised by Cdiscount and published on the datascience.net site. The training data are available on request from Cdiscount, but the solutions of the competition's test sample are not, and will not be, made public. A test sample is therefore built for the purposes of this tutorial. The goal is to predict a product's category from its textual description (text mining). Only the main category (first level, 47 classes) is predicted, instead of the three levels required in the competition. The aim is rather to compare the performance of the methods and technologies as a function of the size of the training set, and to illustrate the preprocessing of textual data on a complex example.
The full dataset (15M products) allows a full-scale test of how the preparation (munging), vectorisation (hashing, TF-IDF) and learning phases scale with data volume, depending on the technology used.
A synthesis of the results obtained is developed in Besse et al. 2016 (section 5).
In notebook 2, we created 2×7 feature matrices corresponding to the same training and validation samples of the Cdiscount product-description texts. These matrices were built with the following methods:

- Count_Vectorizer, no hashing
- Count_Vectorizer, hashing = 300
- TFIDF_vectorizer, no hashing
- TFIDF_vectorizer, hashing = 300
- Word2Vec, CBOW
- Word2Vec, Skip-Gram
- Word2Vec, pre-trained

We will now study the performance of machine-learning algorithms (logistic regression, random forests, multilayer perceptron) on these different feature sets.
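Before training, the vectorisation schemes listed above can be contrasted on a toy corpus; the three product descriptions below are made up purely for illustration and are not part of the Cdiscount data:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer

# Toy corpus standing in for the Cdiscount product descriptions (made-up examples)
corpus = ["coque iphone rose", "coque samsung noire", "chargeur iphone blanc"]

# Plain counts: one column per vocabulary word
X_count = CountVectorizer().fit_transform(corpus)

# TF-IDF: same columns, reweighted by inverse document frequency
X_tfidf = TfidfVectorizer().fit_transform(corpus)

# Feature hashing: the number of columns is fixed in advance (300, as in notebook 2)
X_hash = HashingVectorizer(n_features=300).fit_transform(corpus)

print(X_count.shape, X_tfidf.shape, X_hash.shape)
```

With and without hashing the matrices carry the same documents, but the hashed version has a fixed dimensionality independent of the vocabulary size.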
In [ ]:
# Import the libraries used
import time
import numpy as np
import pandas as pd
import scipy as sc
import scipy.sparse  # ensure the sparse submodule is loaded for sc.sparse.load_npz
import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sb
sb.set_style("whitegrid")
DATA_DIR = "data/features"
Loading the response variables
In [ ]:
Y_train = pd.read_csv("data/cdiscount_train_subset.csv").fillna("")["Categorie1"]
Y_valid = pd.read_csv("data/cdiscount_valid.csv").fillna("")["Categorie1"]
Create a dictionary containing the paths of the files where the feature matrices are stored.
In [ ]:
features_path_dic = {}

parameters = [["count_no_hashing", None, "count"],
              ["count_300", 300, "count"],
              ["tfidf_no_hashing", None, "tfidf"],
              ["tfidf_300", 300, "tfidf"]]

for name, nb_hash, vectorizer in parameters:
    x_train_path = DATA_DIR + "/vec_train_nb_hash_" + str(nb_hash) + "_vectorizer_" + str(vectorizer) + ".npz"
    x_valid_path = DATA_DIR + "/vec_valid_nb_hash_" + str(nb_hash) + "_vectorizer_" + str(vectorizer) + ".npz"
    dic = {"x_train_path": x_train_path, "x_valid_path": x_valid_path, "load": "npz"}
    features_path_dic[name] = dic

parametersw2v = [["word2vec_cbow", "cbow"],
                 ["word2vec_sg", "sg"],
                 ["word2vec_online", "online"]]

for name, mtype in parametersw2v:
    x_train_path = DATA_DIR + "/embedded_train_" + mtype + ".npy"
    x_valid_path = DATA_DIR + "/embedded_valid_" + mtype + ".npy"
    dic = {"x_train_path": x_train_path, "x_valid_path": x_valid_path, "load": "npy"}
    features_path_dic[name] = dic
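Since the feature files were produced by notebook 2, a quick sanity check can verify that every path in `features_path_dic` exists on disk before launching the long training loops below. `check_paths` is a hypothetical helper introduced here, not part of the original notebook:

```python
import os

# Hypothetical helper (not in the original notebook): list the feature files
# referenced by a path dictionary that are missing on disk.
def check_paths(features_path_dic):
    missing = []
    for name, dic in features_path_dic.items():
        for key in ("x_train_path", "x_valid_path"):
            if not os.path.exists(dic[key]):
                missing.append((name, dic[key]))
    return missing
```

Calling `check_paths(features_path_dic)` should return an empty list if notebook 2 completed successfully.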
In [ ]:
metadata_list_lr = []
param_grid = {"C": [10, 1, 0.1]}
# param_grid = {"C": [1]}
for name, dic in features_path_dic.items():
    x_train_path = dic["x_train_path"]
    x_valid_path = dic["x_valid_path"]
    load = dic["load"]
    print("Load features: " + name)
    if load == "npz":
        X_train = sc.sparse.load_npz(x_train_path)
        X_valid = sc.sparse.load_npz(x_valid_path)
    else:
        X_train = np.load(x_train_path)
        X_valid = np.load(x_valid_path)
    print("Start learning: " + name)
    ts = time.time()
    gs = GridSearchCV(LogisticRegression(), param_grid=param_grid, verbose=15)
    gs.fit(X_train, Y_train.values)
    te = time.time()
    t_learning = te - ts
    print("Start prediction: " + name)
    ts = time.time()
    score_train = gs.score(X_train, Y_train)
    score_valid = gs.score(X_valid, Y_valid)
    te = time.time()
    t_predict = te - ts
    metadata = {"name": name, "learning_time": t_learning, "predict_time": t_predict,
                "score_train": score_train, "score_valid": score_valid}
    metadata_list_lr.append(metadata)
pickle.dump(metadata_list_lr, open("data/metadata_lr_part13.pkl", "wb"))
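The cross-validation pattern used in this cell — a `GridSearchCV` over `C` wrapped around `LogisticRegression` — can be checked on a small synthetic problem. The data below are generated with `make_classification` and have nothing to do with Cdiscount; this is a self-contained sketch, not the tutorial's experiment:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small multi-class problem so the grid search runs in seconds
X, y = make_classification(n_samples=200, n_features=20, n_classes=3,
                           n_informative=5, random_state=0)
gs = GridSearchCV(LogisticRegression(max_iter=1000), param_grid={"C": [10, 1, 0.1]})
gs.fit(X, y)
print(gs.best_params_)
```

`gs.best_params_` gives the value of `C` retained by cross-validation, exactly as in the loop above.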
In [ ]:
metadata_list_lr = pickle.load(open("data/metadata_lr_part13.pkl", "rb"))
metadata_list_lr_sorted = sorted(metadata_list_lr, key=lambda x: x["name"])
xlabelticks = [metadata["name"] for metadata in metadata_list_lr_sorted]
fig = plt.figure(figsize=(20, 6))
key_plot = [key for key in metadata_list_lr[0].keys() if key != "name"]
for iplot, key in enumerate(key_plot):
    ax = fig.add_subplot(1, 4, iplot + 1)
    for i, metadata in enumerate(metadata_list_lr_sorted):
        if key == "learning_time":
            scale = 60
            ylabel = "Time (minutes)"
        elif key == "predict_time":
            scale = 1
            ylabel = "Time (seconds)"
        else:
            scale = 0.01
            ylabel = "Accuracy (%)"
        ax.scatter(i, metadata[key] / scale, s=100)
        ax.text(i, metadata[key] / scale, "%.2f" % (metadata[key] / scale), ha="left", va="top")
    ax.set_xticks(np.arange(7))
    ax.set_xticklabels(xlabelticks, rotation=45, fontsize=15, ha="right")
    ax.set_title(key, fontsize=20)
    ax.set_ylabel(ylabel, fontsize=15)
plt.tight_layout()
plt.show()
Q How can we explain the long training time of logistic regression on the Word2Vec features?
Q How can we explain the difference in learning quality depending on the hashing?
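As a hint for the second question, here is a minimal sketch (with made-up tokens, not Cdiscount data) showing that hashing into 300 buckets necessarily merges distinct words:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# 1000 distinct tokens hashed into only 300 buckets: collisions are guaranteed,
# so several distinct words become indistinguishable in the feature space.
words = ["mot%d" % i for i in range(1000)]
X = HashingVectorizer(n_features=300, alternate_sign=False).transform([" ".join(words)])
print(X.nnz)  # at most 300 non-zero columns for 1000 distinct words
```

The information lost to these collisions is one factor to weigh against the memory savings of the hashed representations.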
In [ ]:
metadata_list_rf = []
param_grid = {"n_estimators": [100, 500]}
for name, dic in features_path_dic.items():
    x_train_path = dic["x_train_path"]
    x_valid_path = dic["x_valid_path"]
    load = dic["load"]
    print("Load features: " + name)
    if load == "npz":
        X_train = sc.sparse.load_npz(x_train_path)
        X_valid = sc.sparse.load_npz(x_valid_path)
    else:
        X_train = np.load(x_train_path)
        X_valid = np.load(x_valid_path)
    print("Start learning: " + name)
    ts = time.time()
    gs = GridSearchCV(RandomForestClassifier(), param_grid=param_grid, verbose=15)
    gs.fit(X_train, Y_train.values)
    te = time.time()
    t_learning = te - ts
    print("Start prediction: " + name)
    ts = time.time()
    score_train = gs.score(X_train, Y_train)
    score_valid = gs.score(X_valid, Y_valid)
    te = time.time()
    t_predict = te - ts
    metadata = {"name": name, "learning_time": t_learning, "predict_time": t_predict,
                "score_train": score_train, "score_valid": score_valid}
    metadata_list_rf.append(metadata)
pickle.dump(metadata_list_rf, open("data/metadata_rf_part13.pkl", "wb"))
In [ ]:
metadata_list_rf = pickle.load(open("data/metadata_rf_part13.pkl", "rb"))
metadata_list_rf_sorted = sorted(metadata_list_rf, key=lambda x: x["name"])
xlabelticks = [metadata["name"] for metadata in metadata_list_rf_sorted]
fig = plt.figure(figsize=(20, 6))
key_plot = [key for key in metadata_list_rf[0].keys() if key != "name"]
for iplot, key in enumerate(key_plot):
    ax = fig.add_subplot(1, 4, iplot + 1)
    for i, metadata in enumerate(metadata_list_rf_sorted):
        if key == "learning_time":
            scale = 60
            ylabel = "Time (minutes)"
        elif key == "predict_time":
            scale = 1
            ylabel = "Time (seconds)"
        else:
            scale = 0.01
            ylabel = "Accuracy (%)"
        ax.scatter(i, metadata[key] / scale, s=100)
        ax.text(i, metadata[key] / scale, "%.2f" % (metadata[key] / scale), ha="left", va="top")
    ax.set_xticks(np.arange(7))
    ax.set_xticklabels(xlabelticks, rotation=45, fontsize=15, ha="right")
    ax.set_title(key, fontsize=20)
    ax.set_ylabel(ylabel, fontsize=15)
plt.tight_layout()
plt.show()
In [ ]:
metadata_list_mlp = []
# One hidden layer of each size (hidden_layer_sizes expects a tuple per candidate)
param_grid = {"hidden_layer_sizes": [(32,), (64,), (128,), (256,)]}
for name, dic in features_path_dic.items():
    x_train_path = dic["x_train_path"]
    x_valid_path = dic["x_valid_path"]
    load = dic["load"]
    print("Load features: " + name)
    if load == "npz":
        X_train = sc.sparse.load_npz(x_train_path)
        X_valid = sc.sparse.load_npz(x_valid_path)
    else:
        X_train = np.load(x_train_path)
        X_valid = np.load(x_valid_path)
    print("Start learning: " + name)
    ts = time.time()
    gs = GridSearchCV(MLPClassifier(learning_rate="adaptive"), param_grid=param_grid, verbose=15)
    gs.fit(X_train, Y_train.values)
    te = time.time()
    t_learning = te - ts
    print("Start prediction: " + name)
    ts = time.time()
    score_train = gs.score(X_train, Y_train)
    score_valid = gs.score(X_valid, Y_valid)
    te = time.time()
    t_predict = te - ts
    metadata = {"name": name, "learning_time": t_learning, "predict_time": t_predict,
                "score_train": score_train, "score_valid": score_valid}
    metadata_list_mlp.append(metadata)
pickle.dump(metadata_list_mlp, open("data/metadata_mlp_part13.pkl", "wb"))
In [ ]:
metadata_list_mlp = pickle.load(open("data/metadata_mlp_part13.pkl", "rb"))
metadata_list_mlp_sorted = sorted(metadata_list_mlp, key=lambda x: x["name"])
xlabelticks = [metadata["name"] for metadata in metadata_list_mlp_sorted]
fig = plt.figure(figsize=(20, 6))
key_plot = [key for key in metadata_list_mlp[0].keys() if key != "name"]
for iplot, key in enumerate(key_plot):
    ax = fig.add_subplot(1, 4, iplot + 1)
    for i, metadata in enumerate(metadata_list_mlp_sorted):
        if key == "learning_time":
            scale = 60
            ylabel = "Time (minutes)"
        elif key == "predict_time":
            scale = 1
            ylabel = "Time (seconds)"
        else:
            scale = 0.01
            ylabel = "Accuracy (%)"
        ax.scatter(i, metadata[key] / scale, s=100)
        ax.text(i, metadata[key] / scale, "%.2f" % (metadata[key] / scale), ha="left", va="top")
    ax.set_xticks(np.arange(7))
    ax.set_xticklabels(xlabelticks, rotation=45, fontsize=15, ha="right")
    ax.set_title(key, fontsize=20)
    ax.set_ylabel(ylabel, fontsize=15)
plt.tight_layout()
plt.show()