Flujo de trabajo para utilizar el Contexto Geográfico como entrada para clasificar el sentimiento en tuits.

Imports generales


In [1]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.cross_validation import train_test_split
from helpers.models import fit_model
from helpers.helpers import make_binary, class_info
# set random state for camparability
random_state = np.random.RandomState(0)

Preprocesamiento

Leer los datos y seleccionar las variables del estudio. También recodificamos la variable "intervalo" como numérica


In [2]:
# read data
context = pd.read_csv('data/muestra_variables.csv')
# select variable columns
cols_select = context.columns[6:]
variables = context.ix[:,cols_select]
for c in ['no_se','uname','content','cve_mza']:
    del variables[c]

# reclass intervalo as numerical
def intervalo_to_numbers(x):
    equiv = {'sun':0,'mon':1,'tue':2,'wed':3,'thu':4,'fri':5,'sat':6,'sun':7}
    interval = 0.16666*int(x.split('.')[1])
    day = x.split('.')[0]
    valor = equiv[day] + interval
    return valor

reclass = variables['intervalo'].apply(intervalo_to_numbers)

# drop old 'intervalo' column and replace it with numerical values
del variables['intervalo']
variables = variables.join(reclass,how='inner')

Obtener los datos como np.array y separar los datos en predictor (X) y objetivo (Y)


In [19]:
data = variables.as_matrix()
data_Y = data[:,0]
data_X = data[:,1:]
print("Initial label distribution")
class_info(data_Y)


Initial label distribution
   1.0:    7353  =   66.9%
   2.0:    1120  =   10.2%
   3.0:    1762  =   16.0%
   4.0:     758  =    6.9%

Eliminar los datos con etiqueta 4 (no sé que sean)


In [20]:
data_X, data_Y = data_X[data_Y != 4], data_Y[data_Y != 4]

Hacemos dos binarizaciones de los datos, en una agregamos las clases Pos y Neu (etiquetas 1 y 2) y en la otra agregamos Neg Y Neu (etiquetas 3 y 2).

En el caso de la primera, el problema se convierte en encontrar todos los tuit no-positivos. Mientras que en la segunda, el problema es encontrar todos los no-negativos. Entonces, la etiqueta positiva en el primer caso son los no-negativos, mientras que en el segundo caso son los no-positivos.


In [21]:
Y_pos_neu = make_binary(data_Y, set((1.,2.)))
Y_neg_neu = make_binary(data_Y, set((3.,2.)))
print("Label distribution after binarization")
print("Pos + Neu")
class_info(Y_pos_neu)
print()
print("Neg + Neu")
class_info(Y_neg_neu)


Label distribution after binarization
Pos + Neu
     0:    1762  =   17.2%
     1:    8473  =   82.8%

Neg + Neu
     0:    7353  =   71.8%
     1:    2882  =   28.2%

Separamos en muestras de prueba (40%) y entrenamiento para ambas binarizaciones.

Más adelante podemos utilizar una estrategia de folds e iterar sobre ellos


In [22]:
(X_train_pos_neu, X_test_pos_neu, 
Y_train_pos_neu, Y_test_pos_neu) = train_test_split(data_X, Y_pos_neu,
                                                    test_size=0.4,
                                                    random_state=random_state)

(X_train_neg_neu, X_test_neg_neu, 
Y_train_neg_neu, Y_test_neg_neu) = train_test_split(data_X, Y_neg_neu,
                                                    test_size=0.4,
                                                    random_state=random_state)

Reescalamos las muestras de entrenamiento


In [23]:
X_pos_neu_s = preprocessing.scale(X_train_pos_neu)
X_neg_neu_s = preprocessing.scale(X_train_neg_neu)

Entrenamiento con las muestras sin balancear.

Primero vamos a entrenar SVMs con diferentes métricas utilizando las muestras originales, sin balancear. El primer paso es definir el espacio de parámetros de búsqueda param_grid y las métricas a evaluar:


In [8]:
param_grid = {'C': [1, 10, 100, 1000], 'gamma': [0.01,0.001, 0.0001],
              'kernel': ['rbf']}
metrics = ['f1','accuracy','average_precision','roc_auc','recall']

Ahora sí, ajustamos las SVM con diferentes métricas, primero para la binarización Pos + Neu:


In [9]:
fitted_models_pos_neu = {}
for metric in metrics:
    fitted_models_pos_neu[metric] = fit_model(X_pos_neu_s,Y_train_pos_neu,
                                                param_grid,metric,6)

for metric, model in fitted_models_pos_neu.items():
    print ("Using metric {}".format(metric))
    print("Best parameters set found on development set:")
    print()
    print(model.best_params_)
    print("Grid scores on development set:")
    print()
    for params, mean_score, scores in model.grid_scores_:
        print("%0.3f (+/-%0.03f) for %r"
              % (mean_score, scores.std() * 2, params))

    print()


Using metric average_precision
Best parameters set found on development set:

{'C': 10, 'kernel': 'rbf', 'gamma': 0.001}
Grid scores on development set:

0.843 (+/-0.011) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.01}
0.860 (+/-0.020) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.001}
0.860 (+/-0.016) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.0001}
0.843 (+/-0.021) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.01}
0.862 (+/-0.019) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.001}
0.862 (+/-0.020) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.0001}
0.840 (+/-0.024) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.01}
0.858 (+/-0.020) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.001}
0.862 (+/-0.018) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.0001}
0.844 (+/-0.018) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.01}
0.858 (+/-0.017) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.001}
0.860 (+/-0.013) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.0001}

Using metric f1
Best parameters set found on development set:

{'C': 1, 'kernel': 'rbf', 'gamma': 0.01}
Grid scores on development set:

0.909 (+/-0.000) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.01}
0.909 (+/-0.000) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.001}
0.909 (+/-0.000) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.0001}
0.901 (+/-0.005) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.01}
0.908 (+/-0.001) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.001}
0.909 (+/-0.001) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.0001}
0.866 (+/-0.008) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.01}
0.905 (+/-0.005) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.001}
0.909 (+/-0.001) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.0001}
0.841 (+/-0.015) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.01}
0.885 (+/-0.012) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.001}
0.907 (+/-0.003) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.0001}

Using metric recall
Best parameters set found on development set:

{'C': 1, 'kernel': 'rbf', 'gamma': 0.01}
Grid scores on development set:

1.000 (+/-0.000) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.01}
1.000 (+/-0.000) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.001}
1.000 (+/-0.000) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.0001}
0.981 (+/-0.011) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.01}
0.999 (+/-0.002) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.001}
1.000 (+/-0.001) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.0001}
0.898 (+/-0.019) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.01}
0.989 (+/-0.011) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.001}
0.999 (+/-0.003) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.0001}
0.843 (+/-0.029) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.01}
0.946 (+/-0.020) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.001}
0.995 (+/-0.006) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.0001}

Using metric roc_auc
Best parameters set found on development set:

{'C': 100, 'kernel': 'rbf', 'gamma': 0.0001}
Grid scores on development set:

0.496 (+/-0.022) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.01}
0.538 (+/-0.035) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.001}
0.541 (+/-0.027) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.0001}
0.503 (+/-0.037) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.01}
0.541 (+/-0.031) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.001}
0.547 (+/-0.050) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.0001}
0.510 (+/-0.032) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.01}
0.532 (+/-0.037) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.001}
0.549 (+/-0.039) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.0001}
0.519 (+/-0.022) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.01}
0.534 (+/-0.042) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.001}
0.542 (+/-0.025) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.0001}

Using metric accuracy
Best parameters set found on development set:

{'C': 1, 'kernel': 'rbf', 'gamma': 0.01}
Grid scores on development set:

0.833 (+/-0.001) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.01}
0.833 (+/-0.001) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.001}
0.833 (+/-0.001) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.0001}
0.821 (+/-0.009) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.01}
0.832 (+/-0.002) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.001}
0.833 (+/-0.001) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.0001}
0.768 (+/-0.012) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.01}
0.827 (+/-0.008) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.001}
0.833 (+/-0.002) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.0001}
0.734 (+/-0.022) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.01}
0.796 (+/-0.020) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.001}
0.831 (+/-0.005) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.0001}

Ahora evaluamos sobre la mustra de prueba, para obtener los scores de validación:


In [26]:
#X_pos_neu_s_test = preprocessing.scale(X_test_pos_neu)
for metric, model in fitted_models_pos_neu.items():
    this_estimator = fitted_models_pos_neu[metric].best_estimator_ 
    this_score = this_estimator.score(X_pos_neu_s_test, Y_test_pos_neu)
    y_pred = this_estimator.fit(X_pos_neu_s_test, Y_test_pos_neu).predict(X_pos_neu_s_test)
    #conf_matrix = confusion_matrix(Y_test_pos_neu,y_pred)
    df_confusion = pd.crosstab(Y_test_pos_neu, y_pred, 
                               rownames=['Actual'], 
                               colnames=['Predicted'], margins=True)
    print ("Using metric {}".format(metric))
    print("Validation score {}".format(this_score))
    print("Confusion Matrix:")
    print(df_confusion)
    print()


Using metric average_precision
Validation score 0.827552515876893
Confusion Matrix:
Predicted  0     1   All
Actual                  
0          2   706   708
1          0  3386  3386
All        2  4092  4094

Using metric f1
Validation score 0.8270639960918417
Confusion Matrix:
Predicted     1   All
Actual               
0           708   708
1          3386  3386
All        4094  4094

Using metric recall
Validation score 0.8270639960918417
Confusion Matrix:
Predicted     1   All
Actual               
0           708   708
1          3386  3386
All        4094  4094

Using metric roc_auc
Validation score 0.827552515876893
Confusion Matrix:
Predicted  0     1   All
Actual                  
0          2   706   708
1          0  3386  3386
All        2  4092  4094

Using metric accuracy
Validation score 0.8270639960918417
Confusion Matrix:
Predicted     1   All
Actual               
0           708   708
1          3386  3386
All        4094  4094

Ahora lo mismo pero con la otra binarización, para hacer los dos casos comparables vamos a voltear las etiquetas de las clases:


In [24]:
Y_train_neg_neu = np.array([1 if val == 0 else 0 for val in Y_train_neg_neu])
fitted_models_neg_neu = {}
for metric in metrics:
    fitted_models_neg_neu[metric] = fit_model(X_neg_neu_s,Y_train_neg_neu,
                                                param_grid,metric,6)

for metric, model in fitted_models_neg_neu.items():
    print ("Using metric {}".format(metric))
    print("Best parameters set found on development set:")
    print()
    print(model.best_params_)
    print("Grid scores on development set:")
    print()
    for params, mean_score, scores in model.grid_scores_:
        print("%0.3f (+/-%0.03f) for %r"
              % (mean_score, scores.std() * 2, params))

    print()


Using metric average_precision
Best parameters set found on development set:

{'C': 1000, 'kernel': 'rbf', 'gamma': 0.01}
Grid scores on development set:

0.717 (+/-0.028) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.01}
0.731 (+/-0.021) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.001}
0.734 (+/-0.023) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.0001}
0.723 (+/-0.023) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.01}
0.727 (+/-0.018) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.001}
0.740 (+/-0.027) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.0001}
0.737 (+/-0.020) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.01}
0.727 (+/-0.011) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.001}
0.736 (+/-0.020) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.0001}
0.742 (+/-0.010) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.01}
0.736 (+/-0.023) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.001}
0.733 (+/-0.021) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.0001}

Using metric f1
Best parameters set found on development set:

{'C': 1, 'kernel': 'rbf', 'gamma': 0.001}
Grid scores on development set:

0.833 (+/-0.002) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.01}
0.834 (+/-0.001) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.001}
0.834 (+/-0.000) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.0001}
0.810 (+/-0.010) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.01}
0.832 (+/-0.001) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.001}
0.834 (+/-0.001) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.0001}
0.750 (+/-0.014) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.01}
0.823 (+/-0.010) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.001}
0.832 (+/-0.003) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.0001}
0.733 (+/-0.018) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.01}
0.791 (+/-0.012) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.001}
0.829 (+/-0.004) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.0001}

Using metric recall
Best parameters set found on development set:

{'C': 1, 'kernel': 'rbf', 'gamma': 0.001}
Grid scores on development set:

0.998 (+/-0.003) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.01}
1.000 (+/-0.000) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.001}
1.000 (+/-0.000) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.0001}
0.933 (+/-0.022) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.01}
0.994 (+/-0.003) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.001}
0.999 (+/-0.001) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.0001}
0.783 (+/-0.021) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.01}
0.970 (+/-0.019) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.001}
0.995 (+/-0.006) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.0001}
0.741 (+/-0.028) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.01}
0.887 (+/-0.026) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.001}
0.986 (+/-0.005) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.0001}

Using metric roc_auc
Best parameters set found on development set:

{'C': 1000, 'kernel': 'rbf', 'gamma': 0.01}
Grid scores on development set:

0.493 (+/-0.035) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.01}
0.502 (+/-0.027) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.001}
0.510 (+/-0.025) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.0001}
0.508 (+/-0.031) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.01}
0.501 (+/-0.022) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.001}
0.515 (+/-0.030) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.0001}
0.531 (+/-0.034) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.01}
0.503 (+/-0.008) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.001}
0.507 (+/-0.019) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.0001}
0.536 (+/-0.031) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.01}
0.518 (+/-0.026) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.001}
0.509 (+/-0.013) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.0001}

Using metric accuracy
Best parameters set found on development set:

{'C': 1, 'kernel': 'rbf', 'gamma': 0.001}
Grid scores on development set:

0.714 (+/-0.003) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.01}
0.716 (+/-0.001) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.001}
0.715 (+/-0.001) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.0001}
0.687 (+/-0.014) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.01}
0.713 (+/-0.001) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.001}
0.715 (+/-0.001) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.0001}
0.627 (+/-0.020) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.01}
0.702 (+/-0.015) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.001}
0.713 (+/-0.005) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.0001}
0.615 (+/-0.022) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.01}
0.664 (+/-0.016) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.001}
0.709 (+/-0.006) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.0001}

Y sus métricas sobre la muestra test:


In [27]:
X_neg_neu_s_test = preprocessing.scale(X_test_neg_neu)
for metric, model in fitted_models_neg_neu.items():
    this_estimator = fitted_models_neg_neu[metric].best_estimator_ 
    this_score = this_estimator.score(X_neg_neu_s_test, Y_test_neg_neu)
    y_pred = this_estimator.fit(X_neg_neu_s_test, Y_test_neg_neu).predict(X_neg_neu_s_test)
    #conf_matrix = confusion_matrix(Y_test_pos_neu,y_pred)
    df_confusion = pd.crosstab(Y_test_neg_neu, y_pred, 
                               rownames=['Actual'], 
                               colnames=['Predicted'], margins=True)
    print ("Using metric {}".format(metric))
    print("Validation score {}".format(this_score))
    print()
    print("Confusion Matrix:")
    print(df_confusion)


Using metric average_precision
Validation score 0.3793356130923302

Confusion Matrix:
Predicted     0     1   All
Actual                     
0          2950    11  2961
1            23  1110  1133
All        2973  1121  4094
Using metric f1
Validation score 0.27674645823155836

Confusion Matrix:
Predicted     0  1   All
Actual                  
0          2961  0  2961
1          1129  4  1133
All        4090  4  4094
Using metric recall
Validation score 0.27674645823155836

Confusion Matrix:
Predicted     0  1   All
Actual                  
0          2961  0  2961
1          1129  4  1133
All        4090  4  4094
Using metric roc_auc
Validation score 0.3793356130923302

Confusion Matrix:
Predicted     0     1   All
Actual                     
0          2950    11  2961
1            23  1110  1133
All        2973  1121  4094
Using metric accuracy
Validation score 0.27674645823155836

Confusion Matrix:
Predicted     0  1   All
Actual                  
0          2961  0  2961
1          1129  4  1133
All        4090  4  4094