First model

This is our first model. In this notebook, you can explore how to read the data, train a model, make predictions, and save them in the submission format.

Load train, test, and questions data from pklz files

First of all, we need to read the three data sets.


In [1]:
import gzip
import cPickle as pickle

In [6]:
with gzip.open("../data/train.pklz", "rb") as train_file:
    train_set = pickle.load(train_file)

with gzip.open("../data/test.pklz", "rb") as test_file:
    test_set = pickle.load(test_file)

with gzip.open("../data/questions.pklz", "rb") as questions_file:
    questions = pickle.load(questions_file)
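The three reads above follow the same pattern, so they can be wrapped in one small helper. This is a minimal sketch, not part of the original notebook: `load_pklz`, `demo_records`, and `demo_path` are hypothetical names, and the demo writes a throwaway file with one made-up record instead of touching the real `../data/*.pklz` files.

```python
import gzip
import os
import pickle
import tempfile


def load_pklz(path):
    """Load one gzip-compressed pickle file and return the stored object."""
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)


# Round-trip demo on a temporary file; the record below is illustrative only.
demo_records = {1: {"qid": 1, "uid": 0, "position": 61.0, "answer": "cole"}}
demo_path = os.path.join(tempfile.mkdtemp(), "demo.pklz")
with gzip.open(demo_path, "wb") as handle:
    pickle.dump(demo_records, handle)

loaded = load_pklz(demo_path)
print(loaded[1]["position"])
```

With the real data this would be called as `train_set = load_pklz("../data/train.pklz")`. If the files were pickled under Python 2 and are read under Python 3, `pickle.load` may need its `encoding` argument (for example `encoding="latin1"`); whether that applies to these files is an assumption.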

Let's take a look at the loaded data sets.


In [8]:
print "train_set: ", len(train_set)
print "test_set: ", len(test_set)
print "questions: ", len(questions)


train_set:  28494
test_set:  4749
questions:  7949

In [21]:
print sorted(train_set.keys())[:10]
print train_set[1]
print train_set[1].keys()


[1, 2, 3, 4, 5, 6, 8, 9, 10, 11]
{'answer': 'cole', 'qid': 1, 'uid': 0, 'position': 61.0}
['answer', 'qid', 'uid', 'position']

In [24]:
print sorted(test_set.keys())[:10]
print test_set[7]
print test_set[7].keys()


[7, 14, 21, 28, 35, 42, 49, 56, 63, 70]
{'qid': 1, 'uid': 6}
['qid', 'uid']

In [25]:
print sorted(questions.keys())[:10]
print questions[1]
print questions[1].keys()


[1, 4, 5, 6, 11, 19, 35, 36, 42, 44]
{'answer': 'thomas cole', 'category': 'Fine Arts', 'group': 'test', 'pos_token': {0: '', 1: 'painters', 2: 'indulgence', 4: 'visual', 5: 'fantasy', 7: 'appreciation', 9: 'different', 10: 'historic', 11: 'architectural', 12: 'styles', 15: 'seen', 18: '1840', 19: 'architects', 20: 'dream', 23: 'series', 25: 'paintings', 28: 'last', 31: 'mohicans', 33: 'made', 35: 'three', 36: 'year', 37: 'trip', 39: 'europe', 41: '1829', 45: 'better', 46: 'known', 49: 'trip', 50: 'four', 51: 'years', 52: 'earlier', 56: 'journeyed', 59: 'hudson', 60: 'river', 63: 'catskill', 64: 'mountains', 65: 'ftp', 66: 'name', 68: 'this_painter', 71: 'oxbow', 74: 'voyage', 76: 'life', 77: 'series'}, 'question': "This painter's indulgence of visual fantasy, and appreciation of different historic architectural styles can be seen in his 1840 Architect's Dream. After a series of paintings on The Last of the Mohicans, he made a three year trip to Europe in 1829, but he is better known for a trip four years earlier in which he journeyed up the Hudson River to the Catskill Mountains. FTP, name this painter of The Oxbow and The Voyage of Life series."}
['answer', 'category', 'group', 'pos_token', 'question']

Make training set

To train a model, we need feature and label pairs. In this case, the features are uid, qid, and the question length, and the label is position.


In [43]:
X_train = []
Y_train = []

for key in train_set:
    # For now, keep only the records with a non-negative position
    if train_set[key]['position'] < 0:
        continue
    uid = train_set[key]['uid']
    qid = train_set[key]['qid']
    pos = train_set[key]['position']
    q_length = max(questions[qid]['pos_token'].keys())
    feat = [uid, qid, q_length]
    X_train.append(feat)
    Y_train.append([pos])

In [44]:
print len(X_train)
print len(Y_train)


20819
20819

In [45]:
print X_train[0], Y_train[0]


[0, 1, 77] [61.0]

This means that user 0 tried question 1, whose largest token index is 77 (our measure of question length), and he or she answered at position 61.
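The same three features are needed again for the test set, so the loop can be factored into one function. The following is a sketch under assumptions: `build_features` and its `with_labels` flag are hypothetical names, and the toy dictionaries only mimic the shape of `train_set`, `test_set`, and `questions` with made-up values.

```python
def build_features(records, questions, with_labels=True):
    """Turn answer records into [uid, qid, q_length] rows, plus labels if asked."""
    keys, features, labels = [], [], []
    for key in sorted(records):
        record = records[key]
        # Same filter as the training loop: skip records with a negative position.
        if with_labels and record["position"] < 0:
            continue
        # Question length is taken as the largest token index in pos_token.
        q_length = max(questions[record["qid"]]["pos_token"])
        keys.append(key)
        features.append([record["uid"], record["qid"], q_length])
        if with_labels:
            labels.append([record["position"]])
    return keys, features, labels


# Toy data shaped like the real dictionaries (values are made up).
toy_questions = {1: {"pos_token": {0: "", 1: "painters", 77: "series"}}}
toy_train = {
    1: {"uid": 0, "qid": 1, "position": 61.0, "answer": "cole"},
    2: {"uid": 3, "qid": 1, "position": -40.0, "answer": "monet"},
}
toy_test = {7: {"uid": 6, "qid": 1}}

train_keys, X_toy, Y_toy = build_features(toy_train, toy_questions)
test_keys, X_toy_test, _ = build_features(toy_test, toy_questions, with_labels=False)
print(X_toy)
print(X_toy_test)
```

Returning the keys alongside the rows keeps each prediction tied to its test id, which is what the `test_id` list does in the test-set cell.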

Train model and make predictions

Let's train a model and make predictions. We will use a simple linear regression for now.


In [77]:
from sklearn.linear_model import LinearRegression


model = LinearRegression()
model.fit(X_train, Y_train)


Out[77]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Let's build the test set in the same way.


In [78]:
X_test = []
test_id = []

for key in test_set:
    test_id.append(key)
    uid = test_set[key]['uid']
    qid = test_set[key]['qid']
    q_length = max(questions[qid]['pos_token'].keys())
    feat = [uid, qid, q_length]
    X_test.append(feat)

In [81]:
predictions = model.predict(X_test)
# Each label in Y_train is a one-element list, so each prediction is a one-element row as well
predictions = sorted([[key, predictions[index][0]] for index, key in enumerate(test_id)])
print len(predictions)
predictions[:5]


4749
Out[81]:
[[7, 55.969826511813125],
 [14, 53.595157324938121],
 [21, 54.428857298633709],
 [28, 44.809868026370609],
 [35, 71.263467403313086]]

There are 4749 predictions, one for each test example.

Write submission

OK, let's write the submission to the guess.csv file. The given submission format requires a header, so we insert the header at the front of predictions and then write everything to the file.


In [82]:
import csv


predictions.insert(0,["id", "position"])
with open('guess.csv', 'wb') as fp:
    writer = csv.writer(fp, delimiter=',')
    writer.writerows(predictions)
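The cell above opens the file in `'wb'` mode, which is what the Python 2 `csv` module expects. Under Python 3 the file must be opened in text mode instead. Here is a minimal sketch of the same step for Python 3, writing the header with `writerow` rather than inserting it into the list; `write_submission`, `demo_rows`, and `demo_csv` are hypothetical names and the rows are made-up values.

```python
import csv
import os
import tempfile


def write_submission(path, rows):
    """Write [id, position] rows to a CSV file under an "id,position" header."""
    # Python 3: the csv module wants a text-mode file opened with newline="".
    with open(path, "w", newline="") as fp:
        writer = csv.writer(fp)
        writer.writerow(["id", "position"])
        writer.writerows(rows)


# Demo with illustrative rows, written to a temporary file.
demo_rows = [[7, 55.5], [14, 53.25]]
demo_csv = os.path.join(tempfile.mkdtemp(), "guess_demo.csv")
write_submission(demo_csv, demo_rows)

# Read the file back to check the header and the rows.
with open(demo_csv, newline="") as fp:
    written = list(csv.reader(fp))
print(written)
```

Writing the header separately leaves the `predictions` list untouched, so it can still be sorted or inspected after the file is written.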

All right. Let's submit!

And... we got... 5th place. That's worse than the first submission. Let's think about why.

5 new CU_K-ml_Stars 97.07613 1 Sun, 05 Apr 2015 23:15:52

