Model 04: linear_model with DictVectorizer

Using RMSE

Load train, test, questions data from pklz

First, we need to read these three data sets.


In [1]:
import gzip
import cPickle as pickle

In [2]:
with gzip.open("../data/train.pklz", "rb") as train_file:
    train_set = pickle.load(train_file)

with gzip.open("../data/test.pklz", "rb") as test_file:
    test_set = pickle.load(test_file)

with gzip.open("../data/questions.pklz", "rb") as questions_file:
    questions = pickle.load(questions_file)

Make training set

To train a model, we need feature and label pairs. The features are uid, qid, question length, category, and answer; the label is the position at which the user answered.


In [3]:
print train_set[1]
print questions[1].keys()


{'answer': 'cole', 'qid': 1, 'uid': 0, 'position': 61.0}
['answer', 'category', 'group', 'pos_token', 'question']

In [4]:
X = []
Y = []

for key in train_set:
    # Optional filter for positive positions only; it is disabled, so every record is used
    #if train_set[key]['position'] < 0:
    #    continue
    uid = train_set[key]['uid']
    qid = train_set[key]['qid']
    pos = train_set[key]['position']
    q_length = max(questions[qid]['pos_token'].keys())
    category = questions[qid]['category'].lower()
    answer = questions[qid]['answer'].lower()
    feat = {"uid": str(uid), "qid": str(qid), "q_length": q_length, "category": category, "answer": answer}
    X.append(feat)
    Y.append([pos])

In [5]:
print len(X)
print len(Y)
print X[0], Y[0]


28494
28494
{'q_length': 77, 'qid': '1', 'category': 'fine arts', 'answer': 'thomas cole', 'uid': '0'} [61.0]

This means that user 0 attempted question 1, whose text has 77 tokens, and answered at the 61st token.

Train model and make predictions

Let's train a model and make predictions.


In [6]:
from sklearn.feature_extraction import DictVectorizer


vec = DictVectorizer()
X = vec.fit_transform(X)
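To make the encoding step concrete, here is a minimal pure-Python sketch of the kind of transformation DictVectorizer applies: string values become indicator columns named "key=value", while numeric values keep a single column. This would also explain why uid and qid are cast to strings above, so that they are treated as categories rather than magnitudes. The helper `one_hot_encode` and the sample records are hypothetical, the output is dense rather than sparse, and the sketch is written for Python 3, unlike the Python 2 cells in this notebook.

```python
# Illustrative sketch only; this is not scikit-learn's implementation.
def one_hot_encode(records):
    """Map a list of feature dicts to (column names, dense rows of floats).

    String values become indicator columns named "key=value";
    numeric values keep a single column named "key".
    """
    columns = set()
    for record in records:
        for key, value in record.items():
            if isinstance(value, str):
                columns.add("%s=%s" % (key, value))
            else:
                columns.add(key)
    columns = sorted(columns)
    index = {name: i for i, name in enumerate(columns)}

    rows = []
    for record in records:
        row = [0.0] * len(columns)
        for key, value in record.items():
            if isinstance(value, str):
                row[index["%s=%s" % (key, value)]] = 1.0
            else:
                row[index[key]] = float(value)
        rows.append(row)
    return columns, rows


# Hypothetical records shaped like the feature dicts built above.
sample = [
    {"uid": "0", "qid": "1", "q_length": 77},
    {"uid": "1", "qid": "1", "q_length": 77},
]
columns, rows = one_hot_encode(sample)
print(columns)  # ['q_length', 'qid=1', 'uid=0', 'uid=1']
print(rows)     # [[77.0, 1.0, 1.0, 0.0], [77.0, 1.0, 0.0, 1.0]]
```

Each distinct user, question, category, and answer gets its own column, so the real matrix is very wide and mostly zeros, which is why a sparse representation is the natural choice.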

In [7]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.cross_validation import train_test_split, cross_val_score
import math

X_train, X_test, Y_train, Y_test = train_test_split (X, Y)

# Evaluate each linear model with 10-fold cross-validation.
models = [('Linear', LinearRegression()), ('Ridge', Ridge()),
          ('Lasso', Lasso()), ('ElasticNet', ElasticNet())]

for name, regressor in models:
    scores = cross_val_score(regressor, X, Y, cv=10, scoring= 'mean_squared_error')
    # The scorer returns negated MSE, so flip the sign and take the square root.
    for ii in xrange(len(scores)):
        scores[ii] = math.sqrt(-1*scores[ii])
    print name + ' Cross validation RMSE scores:', scores.mean()
    print scores


Linear Cross validation RMSE scores: 82.1830583745
[ 76.67270464  84.86393907  77.6004619   72.8021877   76.24757558
  77.65462341  80.35628102  94.48375011  97.37825476  83.77080556]
Ridge Cross validation RMSE scores: 81.5163799117
[ 75.42551662  84.36449222  76.78385805  72.27574686  75.59057276
  76.94214904  79.64612587  93.98620172  96.90823254  83.24090344]
Lasso Cross validation RMSE scores: 84.8407993557
[  77.81940559   89.31984484   79.45613965   73.92431074   77.7030486
   80.77462146   83.03076112   98.11867225  102.51332651   85.74786278]
ElasticNet Cross validation RMSE scores: 85.0222606404
[  77.67035781   89.52297457   79.89587485   74.34862438   77.93037888
   80.96532162   83.09002602   98.34951419  102.64014747   85.8093866 ]
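The sign flip in the cell above can be shown in isolation. The scorer reports the negated mean squared error for each fold, so the per-fold RMSE is the square root of the negated score. The following is a small sketch of that arithmetic, assuming Python 3 and using made-up fold scores; the name `rmse_per_fold` is hypothetical.

```python
import math


def rmse_per_fold(neg_mse_scores):
    """Convert negated per-fold MSE values into per-fold RMSE values."""
    return [math.sqrt(-score) for score in neg_mse_scores]


# Hypothetical negated MSE values for three folds (not taken from the run above).
neg_mse = [-6400.0, -4900.0, -8100.0]

fold_rmse = rmse_per_fold(neg_mse)
mean_rmse = sum(fold_rmse) / len(fold_rmse)

# Alternative summary: square root of the mean MSE across folds.
pooled_rmse = math.sqrt(-sum(neg_mse) / len(neg_mse))

print(fold_rmse)  # [80.0, 70.0, 90.0]
print(mean_rmse)  # 80.0
print(pooled_rmse > mean_rmse)  # True
```

The summary printed for each model above is the mean of the per-fold RMSE values. That is slightly different from the square root of the mean MSE, which is never smaller, so the two should not be compared directly.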

In [8]:
a = [{1: 2}, {2: 3}]
b = [{3: 2}, {4: 3}]
c = a + b
print c[:len(a)]
print c[len(a):]


[{1: 2}, {2: 3}]
[{3: 2}, {4: 3}]

In [9]:
X_train = []
Y_train = []

for key in train_set:
    # Optional filter for positive positions only; it is disabled, so every record is used
    #if train_set[key]['position'] < 0:
    #    continue
    uid = train_set[key]['uid']
    qid = train_set[key]['qid']
    pos = train_set[key]['position']
    q_length = max(questions[qid]['pos_token'].keys())
    category = questions[qid]['category'].lower()
    answer = questions[qid]['answer'].lower()
    feat = {"uid": str(uid), "qid": str(qid), "q_length": q_length, "category": category, "answer": answer}
    X_train.append(feat)
    Y_train.append(pos)

X_test = []
Y_test = []

for key in test_set:
    uid = test_set[key]['uid']
    qid = test_set[key]['qid']
    q_length = max(questions[qid]['pos_token'].keys())
    category = questions[qid]['category'].lower()
    answer = questions[qid]['answer'].lower()
    feat = {"uid": str(uid), "qid": str(qid), "q_length": q_length, "category": category, "answer": answer}
    X_test.append(feat)
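    # The test set has no labels, so Y_test holds the example ids instead
    Y_test.append(key)

print "Before transform: ", len(X_test)
X_train_length = len(X_train)
# Fit the vectorizer on train and test together so both share the same columns,
# then split the matrix back at the original boundary
X = vec.fit_transform(X_train + X_test)
X_train = X[:X_train_length]
X_test = X[X_train_length:]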


Before transform:  4749

In [10]:
# regressor = LinearRegression()
regressor = Ridge()
regressor.fit(X_train, Y_train)


Out[10]:
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, solver='auto', tol=0.001)

In [11]:
predictions = regressor.predict(X_test)
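# Pair each test id with its prediction and sort by id (test_id avoids shadowing the built-in id)
predictions = sorted([[test_id, predictions[index]] for index, test_id in enumerate(Y_test)])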
print len(predictions)
predictions[:5]


4749
Out[11]:
[[7, 34.876806438006554],
 [14, 55.506806653610681],
 [21, 36.62606574100618],
 [28, 24.247734632069555],
 [35, 74.999528288463722]]

Here are 4,749 predictions, one per test example.

Write submission

OK, let's write the submission to guess.csv. The submission format requires a header row, so we insert the header at the front of predictions and then write everything to the file.


In [12]:
import csv


predictions.insert(0,["id", "position"])
with open('guess.csv', 'wb') as fp:
    writer = csv.writer(fp, delimiter=',')
    writer.writerows(predictions)
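As a variation on the cell above, the header can be written as its own row instead of being inserted into the predictions list, which leaves the list unchanged for later use. This is a minimal sketch, assuming Python 3 and an in-memory buffer so the result can be inspected; the helper `write_submission` and the sample rows are hypothetical.

```python
import csv
import io


def write_submission(rows, out):
    """Write a header line followed by [id, position] rows to a text stream."""
    writer = csv.writer(out, delimiter=",")
    writer.writerow(["id", "position"])
    writer.writerows(rows)


# Hypothetical predictions, rounded for readability.
rows = [[7, 34.88], [14, 55.51]]

buffer = io.StringIO()
write_submission(rows, buffer)
print(buffer.getvalue().splitlines())  # ['id,position', '7,34.88', '14,55.51']
```

To write a real file under Python 3, the stream would be opened in text mode with `open('guess.csv', 'w', newline='')` rather than the `'wb'` mode that Python 2 uses.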

All right. Let's submit!