model05

Model

  • Linear models: LinearRegression, Ridge, Lasso, ElasticNet

Features

  • uid
  • qid
  • q_length
  • category
  • answer
  • avg_per_uid: average response time per user
  • avg_per_qid: average response time per question
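
Each of these fields ends up in a per-example feature dict before vectorization. For reference, this is the dict built for the first training example (the same values are printed again further below):

feat = {
    "uid": "0",                          # user id, kept as a string so it is one-hot encoded
    "qid": "1",                          # question id, also kept as a string
    "q_length": 77,                      # number of tokens in the question
    "category": "fine arts",             # question category, lower-cased
    "answer": "thomas cole",             # reference answer, lower-cased
    "avg_per_uid": 55.708333333333336,   # average response position of user 0
    "avg_per_qid": 51.0,                 # average response position of question 1
}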

In [13]:
import gzip
import cPickle as pickle

In [14]:
with gzip.open("../data/train.pklz", "rb") as train_file:
    train_set = pickle.load(train_file)

with gzip.open("../data/test.pklz", "rb") as test_file:
    test_set = pickle.load(test_file)

with gzip.open("../data/questions.pklz", "rb") as questions_file:
    questions = pickle.load(questions_file)

Make training set

To train a model we need (feature, label) pairs. Each feature dict holds uid, qid, q_length, category, answer, and the two average-position features listed above, and the response position is used as the label.


In [15]:
print train_set[1]
print questions[1].keys()


{'answer': 'cole', 'qid': 1, 'uid': 0, 'position': 61.0}
['answer', 'category', 'group', 'pos_token', 'question']

In [16]:
X = []
Y = []
avg_time_per_user = {}
avg_time_per_que = {}

for key in train_set:
    # Negative positions (wrong answers) are kept for now;
    # uncomment the filter below to use positive cases only.
    #if train_set[key]['position'] < 0:
    #    continue
    uid = train_set[key]['uid']
    qid = train_set[key]['qid']
    pos = train_set[key]['position']
    q_length = max(questions[qid]['pos_token'].keys())
    category = questions[qid]['category'].lower()
    answer = questions[qid]['answer'].lower()
    
    # Calculate average response time per user
    if uid not in avg_time_per_user:
        temp = 0; num = 0
        for keysubset in train_set:
            if train_set[keysubset]['uid'] == uid:
                temp += train_set[keysubset]['position']
                num += 1
        avg_time_per_user[uid] = temp / num

    # Calculate average response time per question
    if qid not in avg_time_per_que:
        temp = 0; num = 0
        for keysubset in train_set:
            if train_set[keysubset]['qid'] == qid:
                temp += train_set[keysubset]['position']
                num += 1
        avg_time_per_que[qid] = temp / num

    feat = {"uid": str(uid), "qid": str(qid), "q_length": q_length,
            "category": category, "answer": answer,
            "avg_per_uid": avg_time_per_user[uid],
            "avg_per_qid": avg_time_per_que[qid]}
    X.append(feat)
    Y.append([pos])
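
A side note on the loop above: it rescans the whole training set once for every new uid and qid, which makes the feature construction quadratic in the number of examples. A minimal sketch of building both average tables in a single pass instead (assuming the same train_set structure), which could replace the two inner loops:

from collections import defaultdict

pos_sum = defaultdict(float); pos_cnt = defaultdict(int)
que_sum = defaultdict(float); que_cnt = defaultdict(int)

for key in train_set:
    rec = train_set[key]
    pos_sum[rec['uid']] += rec['position']; pos_cnt[rec['uid']] += 1
    que_sum[rec['qid']] += rec['position']; que_cnt[rec['qid']] += 1

avg_time_per_user = dict((u, pos_sum[u] / pos_cnt[u]) for u in pos_sum)
avg_time_per_que = dict((q, que_sum[q] / que_cnt[q]) for q in que_sum)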

In [17]:
print len(X)
print len(Y)
print X[0], Y[0]


28494
28494
{'category': 'fine arts', 'avg_per_uid': 55.708333333333336, 'avg_per_qid': 51.0, 'uid': '0', 'qid': '1', 'q_length': 77, 'answer': 'thomas cole'} [61.0]

This means that user 0 attempted question 1, which is 77 tokens long, and answered at the 61st token.

Train model and make predictions

Let's vectorize the features, then train the models and compare them with cross-validation.


In [18]:
from sklearn.feature_extraction import DictVectorizer


vec = DictVectorizer()
X = vec.fit_transform(X)
print X[0]


  (0, 920)	1.0
  (0, 1020)	51.0
  (0, 1021)	55.7083333333
  (0, 1026)	1.0
  (0, 1033)	77.0
  (0, 1034)	1.0
  (0, 6958)	1.0
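
DictVectorizer one-hot encodes the string-valued fields (uid, qid, category, answer) as feature=value indicator columns and passes the numeric fields (q_length, avg_per_uid, avg_per_qid) through unchanged, which is why the sparse row above has a few 1.0 entries plus the three numeric values. A tiny standalone example of that behaviour:

from sklearn.feature_extraction import DictVectorizer

toy = [{"category": "fine arts", "q_length": 77},
       {"category": "history", "q_length": 90}]
v = DictVectorizer()
M = v.fit_transform(toy)
print v.get_feature_names()  # ['category=fine arts', 'category=history', 'q_length']
print M.toarray()            # [[  1.   0.  77.]  [  0.   1.  90.]]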

In [19]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.cross_validation import train_test_split, cross_val_score
import math

X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

regressor = LinearRegression()
scores = cross_val_score(regressor, X, Y, cv=10, scoring='mean_squared_error')
# Flip the sign of the negated MSE values and take their square root to get RMSE.
for ii in xrange(len(scores)):
    scores[ii] = math.sqrt(-1*scores[ii])
print 'Linear Cross validation RMSE scores:', scores.mean()
print scores

regressor = Ridge()
scores = cross_val_score(regressor, X, Y, cv=10, scoring='mean_squared_error')
# Flip the sign of the negated MSE values and take their square root to get RMSE.
for ii in xrange(len(scores)):
    scores[ii] = math.sqrt(-1*scores[ii])
print 'Ridge Cross validation RMSE scores:', scores.mean()
print scores

regressor = Lasso()
scores = cross_val_score(regressor, X, Y, cv=10, scoring='mean_squared_error')
# Flip the sign of the negated MSE values and take their square root to get RMSE.
for ii in xrange(len(scores)):
    scores[ii] = math.sqrt(-1*scores[ii])
print 'Lasso Cross validation RMSE scores:', scores.mean()
print scores

regressor = ElasticNet()
scores = cross_val_score(regressor, X, Y, cv=10, scoring='mean_squared_error')
# Flip the sign of the negated MSE values and take their square root to get RMSE.
for ii in xrange(len(scores)):
    scores[ii] = math.sqrt(-1*scores[ii])
print 'ElasticNet Cross validation RMSE scores:', scores.mean()
print scores


Linear Cross validation RMSE scores: 70.2538395585
[ 68.39945816  73.69032503  66.82839455  63.55430443  63.46247558
  67.52452534  68.75825003  78.33195237  80.50150679  71.4872033 ]
Ridge Cross validation RMSE scores: 70.2511715134
[ 66.96680032  73.97541369  66.47868624  63.23195054  63.31794102
  67.80471778  69.02672314  79.52439608  80.78110191  71.40398441]
Lasso Cross validation RMSE scores: 68.8414140964
[ 65.55219108  72.41009122  65.33503897  61.77487946  62.15263106
  66.44985357  67.68136606  77.41133691  79.46512163  70.18163101]
ElasticNet Cross validation RMSE scores: 68.8415869663
[ 65.55229672  72.41029208  65.33515005  61.77533796  62.1532062
  66.44999469  67.6812933   77.41127578  79.46538065  70.18164222]
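
With this version of scikit-learn, scoring='mean_squared_error' returns negated MSE values, which is why each score is multiplied by -1 before taking the square root. The same conversion can be done in one vectorized step (a sketch, assuming scores is the raw array returned by cross_val_score):

import numpy as np

rmse = np.sqrt(-scores)  # negate the (negative) MSE values, then take the square root
print 'Mean RMSE:', rmse.mean()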

A quick check that concatenating two lists and then slicing at the length of the first recovers both pieces in order; we rely on this below when vectorizing the train and test feature dicts together and splitting the resulting matrix back apart.

In [20]:
a = [{1: 2}, {2: 3}]
b = [{3: 2}, {4: 3}]
c = a + b
print c[:len(a)]
print c[len(a):]


[{1: 2}, {2: 3}]
[{3: 2}, {4: 3}]

In [21]:
X_train = []
Y_train = []

for key in train_set:
    # Negative positions (wrong answers) are kept for now;
    # uncomment the filter below to use positive cases only.
    #if train_set[key]['position'] < 0:
    #    continue
    uid = train_set[key]['uid']
    qid = train_set[key]['qid']
    pos = train_set[key]['position']
    q_length = max(questions[qid]['pos_token'].keys())
    category = questions[qid]['category'].lower()
    answer = questions[qid]['answer'].lower()
    feat = {"uid": str(uid), "qid": str(qid), "q_length": q_length, "category": category, "answer": answer}
    X_train.append(feat)
    Y_train.append(pos)

X_test = []
Y_test = []

for key in test_set:
    uid = test_set[key]['uid']
    qid = test_set[key]['qid']
    q_length = max(questions[qid]['pos_token'].keys())
    category = questions[qid]['category'].lower()
    answer = questions[qid]['answer'].lower()
    feat = {"uid": str(uid), "qid": str(qid), "q_length": q_length, "category": category, "answer": answer}
    X_test.append(feat)
    Y_test.append(key)

print "Before transform: ", len(X_test)
X_train_length = len(X_train)
X = vec.fit_transform(X_train + X_test)
X_train = X[:X_train_length]
X_test = X[X_train_length:]


Before transform:  4749
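
Fitting the vectorizer on the concatenated train and test feature lists guarantees that uid, qid, and answer values that only appear in the test set still get their own columns. An alternative (a sketch, where X_train_dicts and X_test_dicts stand for the two lists of feature dicts built above) is to fit on the training dicts only, so that unseen test-only values simply map to all-zero indicator columns:

from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer()
X_train = vec.fit_transform(X_train_dicts)  # learn the feature vocabulary from training data only
X_test = vec.transform(X_test_dicts)        # values not seen during fit produce no active column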

In [28]:
regressor = Ridge()
regressor.fit(X_train, Y_train)


Out[28]:
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, solver='auto', tol=0.001)

In [29]:
predictions = regressor.predict(X_test)
predictions = sorted([[id, predictions[index]] for index, id in enumerate(Y_test)])
print len(predictions)
predictions[:5]


4749
Out[29]:
[[7, 34.876806438006554],
 [14, 55.506806653610681],
 [21, 36.62606574100618],
 [28, 24.247734632069555],
 [35, 74.999528288463722]]

That gives 4749 predictions, one per test example.

Write the submission file

OK, let's write the submission to a guess.csv file. The given submission format requires a header row, so we insert the header at the front of the predictions list and then write everything out as a file.


In [30]:
import csv


predictions.insert(0,["id", "position"])
with open('guess.csv', 'wb') as fp:
    writer = csv.writer(fp, delimiter=',')
    writer.writerows(predictions)
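
A quick sanity check (just a sketch) that guess.csv contains the header row plus one prediction per test example:

import csv

with open('guess.csv', 'rb') as fp:
    rows = list(csv.reader(fp))
print rows[0]         # ['id', 'position']
print len(rows) - 1   # 4749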

All right. Let's submit!