Model 03: linear_model with DictVectorizer

Load train, test, questions data from pklz

First, we need to read the three data sets.



In [23]:

    
import gzip
import cPickle as pickle



In [24]:

    
with gzip.open("../data/train.pklz", "rb") as train_file:
    train_set = pickle.load(train_file)

with gzip.open("../data/test.pklz", "rb") as test_file:
    test_set = pickle.load(test_file)

with gzip.open("../data/questions.pklz", "rb") as questions_file:
    questions = pickle.load(questions_file)
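
The three loads above follow the same pattern, so they could be wrapped in one helper. The sketch below is only an illustration: the name `load_pklz` is hypothetical, and it is written for Python 3 (plain `pickle` instead of `cPickle`), unlike the Python 2 cells in this notebook.

```python
import gzip
import pickle


def load_pklz(path):
    """Load one gzip-compressed pickle file and return the unpickled object."""
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)
```

With this helper, each data set would be read with a call such as `train_set = load_pklz("../data/train.pklz")`. Pickles written by Python 2 may need an explicit `encoding` argument to `pickle.load` when read from Python 3.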

Make training set

To train the model, we need feature and label pairs. Here the features are uid, qid, question length, category, and answer, and the label is position.



In [25]:

    
print train_set[1]
print questions[1].keys()









    



{'answer': 'cole', 'qid': 1, 'uid': 0, 'position': 61.0}
['answer', 'category', 'group', 'pos_token', 'question']



In [26]:

    
X = []
Y = []

for key in train_set:
    # Skip records with a negative position; only non-negative ones are used for now
    if train_set[key]['position'] < 0:
        continue
    uid = train_set[key]['uid']
    qid = train_set[key]['qid']
    pos = train_set[key]['position']
    q_length = max(questions[qid]['pos_token'].keys())
    category = questions[qid]['category'].lower()
    answer = questions[qid]['answer'].lower()
    feat = {"uid": str(uid), "qid": str(qid), "q_length": q_length, "category": category, "answer": answer}
    X.append(feat)
    Y.append([pos])



In [27]:

    
print len(X)
print len(Y)
print X[0], Y[0]









    



20819
20819
{'q_length': 77, 'qid': '1', 'category': 'fine arts', 'answer': 'thomas cole', 'uid': '0'} [61.0]

This means that user 0 attempted question 1, whose text has 77 tokens, and answered at the 61st token.
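
The same feature construction is needed again for the test set, so it may help to see it as a single function. This is a minimal sketch: `make_feature` is a hypothetical name, and the toy question and record below are invented for illustration (only their keys follow the data printed above).

```python
def make_feature(record, questions):
    """Build the feature dict for one train or test record."""
    question = questions[record["qid"]]
    return {
        "uid": str(record["uid"]),
        "qid": str(record["qid"]),
        # Largest token index in the question, used as its length.
        "q_length": max(question["pos_token"].keys()),
        "category": question["category"].lower(),
        "answer": question["answer"].lower(),
    }


# Toy data, invented for illustration only.
toy_questions = {
    1: {
        "pos_token": {0: "", 1: "painter", 77: "name"},
        "category": "Fine Arts",
        "answer": "Thomas Cole",
    }
}
toy_record = {"answer": "cole", "qid": 1, "uid": 0, "position": 61.0}
feature = make_feature(toy_record, toy_questions)
```

Casting `uid` and `qid` to strings matters here: it makes the vectorizer treat them as categories rather than as numbers with an order.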

Train model and make predictions

Let's train the model and make predictions.



In [28]:

    
from sklearn.feature_extraction import DictVectorizer


vec = DictVectorizer()
X = vec.fit_transform(X)
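
To make the encoding step concrete, here is a simplified pure-Python sketch of the kind of transformation DictVectorizer performs: string values become 0/1 indicator columns named `key=value`, and numeric values keep their value under the column `key`. It is not scikit-learn's implementation (the real class returns a sparse matrix by default); the function name `one_hot_dicts` is hypothetical, and the code assumes Python 3.

```python
def one_hot_dicts(records):
    """Return (column_names, dense_rows) for a list of feature dicts."""
    def column_and_value(key, value):
        # Strings are categorical: one indicator column per distinct value.
        if isinstance(value, str):
            return "%s=%s" % (key, value), 1.0
        # Numbers pass through unchanged under the key's own column.
        return key, float(value)

    names = sorted({column_and_value(k, v)[0]
                    for record in records for k, v in record.items()})
    index = {name: i for i, name in enumerate(names)}
    rows = []
    for record in records:
        row = [0.0] * len(names)
        for key, value in record.items():
            name, cell = column_and_value(key, value)
            row[index[name]] = cell
        rows.append(row)
    return names, rows


names, rows = one_hot_dicts([
    {"uid": "0", "q_length": 77},
    {"uid": "1", "q_length": 50},
])
```

Here `names` holds the columns `q_length`, `uid=0`, and `uid=1`, and each row has the question length plus a single 1.0 marking the user.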



In [29]:

    
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.cross_validation import train_test_split, cross_val_score


# Hold-out split; the cross-validation below uses the full X and Y instead.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

# Compare four linear models with 10-fold cross-validation.
# For regressors the default score is R-squared.
for regressor in [LinearRegression(), Ridge(), Lasso(), ElasticNet()]:
    scores = cross_val_score(regressor, X, Y, cv=10)
    print 'Cross validation r-squared scores:', scores.mean()
    print scores









    



Cross validation r-squared scores: 0.403165663739
[ 0.33473043  0.41661155  0.41591862  0.39790069  0.36957474  0.35839188
  0.37661004  0.48799347  0.39546295  0.47846227]
Cross validation r-squared scores: 0.401224022105
[ 0.34303511  0.42162677  0.4232788   0.39211215  0.32800715  0.35658753
  0.3945086   0.48386448  0.38650055  0.48271909]
Cross validation r-squared scores: 0.270919114599
[ 0.26622343  0.31533955  0.2629531   0.20948286  0.27379395  0.21278792
  0.22111511  0.34427635  0.25370233  0.34951655]
Cross validation r-squared scores: 0.271713588634
[ 0.26709454  0.31595901  0.26393381  0.21039138  0.27486122  0.21352889
  0.22170847  0.34463347  0.25453584  0.35048926]
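
The scores above are R-squared values (the coefficient of determination), which is the default score for scikit-learn regressors. As a reminder of what the number means, here is a small pure-Python sketch of the formula; `r_squared` is a hypothetical helper and the sample values are invented.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / float(len(y_true))
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Perfect predictions score 1.0; always predicting the mean scores 0.0.
perfect = r_squared([10.0, 20.0, 30.0], [10.0, 20.0, 30.0])
baseline = r_squared([10.0, 20.0, 30.0], [20.0, 20.0, 20.0])
```

Read this way, LinearRegression and Ridge explain roughly 40% of the variance in answer position, while Lasso and ElasticNet with their default settings explain about 27%.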



In [30]:

    
# Sanity check: a concatenated list can be split back by slicing at len(a)
a = [{1: 2}, {2: 3}]
b = [{3: 2}, {4: 3}]
c = a + b
print c[:len(a)]
print c[len(a):]









    



[{1: 2}, {2: 3}]
[{3: 2}, {4: 3}]



In [35]:

    
X_train = []
Y_train = []

for key in train_set:
    # Skip records with a negative position; only non-negative ones are used for now
    if train_set[key]['position'] < 0:
        continue
    uid = train_set[key]['uid']
    qid = train_set[key]['qid']
    pos = train_set[key]['position']
    q_length = max(questions[qid]['pos_token'].keys())
    category = questions[qid]['category'].lower()
    answer = questions[qid]['answer'].lower()
    feat = {"uid": str(uid), "qid": str(qid), "q_length": q_length, "category": category, "answer": answer}
    X_train.append(feat)
    Y_train.append(pos)

X_test = []
Y_test = []

for key in test_set:
    uid = test_set[key]['uid']
    qid = test_set[key]['qid']
    q_length = max(questions[qid]['pos_token'].keys())
    category = questions[qid]['category'].lower()
    answer = questions[qid]['answer'].lower()
    feat = {"uid": str(uid), "qid": str(qid), "q_length": q_length, "category": category, "answer": answer}
    X_test.append(feat)
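    # The test set has no labels, so Y_test stores the record ids instead.
    Y_test.append(key)

print "Before transform: ", len(X_test)
# Fit the vectorizer on train and test together so both share the same columns,
# then split the matrix back at the original training length.
X_train_length = len(X_train)
X = vec.fit_transform(X_train + X_test)
X_train = X[:X_train_length]
X_test = X[X_train_length:]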









    



Before transform:  4749



In [44]:

    
# regressor = LinearRegression()
regressor = Ridge()
regressor.fit(X_train, Y_train)









    Out[44]:





Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, solver='auto', tol=0.001)



In [45]:

    
predictions = regressor.predict(X_test)
# Pair each test id with its prediction and sort by id
predictions = sorted([[test_id, pred] for test_id, pred in zip(Y_test, predictions)])
print len(predictions)
predictions[:5]









    



4749






    Out[45]:





[[7, 23.033451312008083],
 [14, 55.545269901372571],
 [21, 69.150945002240292],
 [28, 44.660760734187321],
 [35, 71.862678877980883]]

There are 4749 predictions, one for each test record.

Write submission

OK, let's write the submission to the guess.csv file. The required submission format includes a header row, so we insert the header at the front of predictions and then write everything to the file.



In [46]:

    
import csv


predictions.insert(0,["id", "position"])
with open('guess.csv', 'wb') as fp:
    writer = csv.writer(fp, delimiter=',')
    writer.writerows(predictions)
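
The cell above targets Python 2, where csv files are opened in binary mode. Under Python 3 the file would instead be opened in text mode with `newline=""`. The sketch below shows that variant, assuming Python 3; `write_submission` is a hypothetical name, and it writes the header separately instead of inserting it into the list.

```python
import csv


def write_submission(path, predictions):
    """Write [id, position] rows to a CSV file, preceded by a header row."""
    with open(path, "w", newline="") as fp:
        writer = csv.writer(fp, delimiter=",")
        writer.writerow(["id", "position"])
        writer.writerows(predictions)
```

A call such as `write_submission("guess.csv", predictions)` would produce the same file layout while leaving `predictions` unchanged.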

All right. Let's submit!


