model06: refactoring model05

This model is just for code cleanup, so there is no difference between model05 and model06 in terms of results. However, this version should serve as a better code base for the next step than model01 through model05.

About this model

Model

  • Linear models: LinearRegression, Ridge, Lasso, ElasticNet

Features

  • uid
  • qid
  • q_length
  • category
  • answer
  • avg_pos_uid: average response time (position) per user
  • avg_pos_qid: average response time (position) per question

Let's start our experiment

Step1: Read train and test data

Read the files for the train and test sets

We already pickled the given CSV files for our convenience.


In [138]:
import gzip
import cPickle as pickle


with gzip.open("../data/train.pklz", "rb") as train_file:
    train_set = pickle.load(train_file)

with gzip.open("../data/test.pklz", "rb") as test_file:
    test_set = pickle.load(test_file)

with gzip.open("../data/questions.pklz", "rb") as questions_file:
    questions = pickle.load(questions_file)
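For reference, the gzip-pickled files could have been produced from the original CSVs with something like the sketch below. The file and column names here are hypothetical, and Python 3's `pickle` stands in for Python 2's `cPickle`:

```python
import csv
import gzip
import pickle


def pickle_csv(csv_path, pklz_path, key_field="id"):
    # Read each CSV row into a dict keyed by its integer id column.
    data = {}
    with open(csv_path) as f:
        for row in csv.DictReader(f):
            data[int(row[key_field])] = dict(row)
    # Write the whole dict as a gzip-compressed pickle.
    with gzip.open(pklz_path, "wb") as f:
        pickle.dump(data, f)
```

Loading such a file back with `gzip.open(..., "rb")` and `pickle.load` then yields the keyed dicts used throughout this notebook.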

What do they contain?

Let's take a quick look at what each set holds.


In [139]:
print "* train_set:", train_set[1]
print "* test_set:", test_set[7]
print "* question keys:", questions[1].keys()
"* question contents:", questions[1]


* train_set: {'answer': 'cole', 'qid': 1, 'uid': 0, 'position': 61.0}
* test_set: {'qid': 1, 'uid': 6}
* question keys: ['answer', 'category', 'group', 'pos_token', 'question']
Out[139]:
('* question contents:',
 {'answer': 'thomas cole',
  'category': 'Fine Arts',
  'group': 'test',
  'pos_token': {0: '',
   1: 'painters',
   2: 'indulgence',
   4: 'visual',
   5: 'fantasy',
   7: 'appreciation',
   9: 'different',
   10: 'historic',
   11: 'architectural',
   12: 'styles',
   15: 'seen',
   18: '1840',
   19: 'architects',
   20: 'dream',
   23: 'series',
   25: 'paintings',
   28: 'last',
   31: 'mohicans',
   33: 'made',
   35: 'three',
   36: 'year',
   37: 'trip',
   39: 'europe',
   41: '1829',
   45: 'better',
   46: 'known',
   49: 'trip',
   50: 'four',
   51: 'years',
   52: 'earlier',
   56: 'journeyed',
   59: 'hudson',
   60: 'river',
   63: 'catskill',
   64: 'mountains',
   65: 'ftp',
   66: 'name',
   68: 'this_painter',
   71: 'oxbow',
   74: 'voyage',
   76: 'life',
   77: 'series'},
  'question': "This painter's indulgence of visual fantasy, and appreciation of different historic architectural styles can be seen in his 1840 Architect's Dream. After a series of paintings on The Last of the Mohicans, he made a three year trip to Europe in 1829, but he is better known for a trip four years earlier in which he journeyed up the Hudson River to the Catskill Mountains. FTP, name this painter of The Oxbow and The Voyage of Life series."})

Step2: Feature Engineering

We derive a set of features from the given data.


In [140]:
from collections import defaultdict

"""
Calculate average position(response time) per user(uid) and question(qid).
"""
def get_avg_pos(data):
    pos_uid = defaultdict(list)
    pos_qid = defaultdict(list)

    for key in data:
        uid = data[key]['uid']
        qid = data[key]['qid']
        pos = data[key]['position']
        pos_uid[uid].append(pos)
        pos_qid[qid].append(pos)

    avg_pos_uid = {}
    avg_pos_qid = {}

    for key in pos_uid:
        avg_pos_uid[key] = sum(pos_uid[key]) / len(pos_uid[key])

    for key in pos_qid:
        avg_pos_qid[key] = sum(pos_qid[key]) / len(pos_qid[key])
    
    return [avg_pos_uid, avg_pos_qid]


"""
Make feature vectors for given data set
"""
def featurize(data, avg_pos):
    X = []
    avg_pos_uid = avg_pos[0]
    avg_pos_qid = avg_pos[1]
    for key in data:
        uid = data[key]['uid']
        qid = data[key]['qid']
        q_length = max(questions[qid]['pos_token'].keys())
        category = questions[qid]['category'].lower()
        answer = questions[qid]['answer'].lower()
        if uid in avg_pos_uid:
            pos_uid = avg_pos_uid[uid]
        else:
            pos_uid = sum(avg_pos_uid.values()) / float(len(avg_pos_uid.values()))
            
        if qid in avg_pos_qid:
            pos_qid = avg_pos_qid[qid]
        else:
            pos_qid = sum(avg_pos_qid.values()) / float(len(avg_pos_qid.values()))
            
        feat = {"uid": str(uid),
                "qid": str(qid),
                "q_length": q_length,
                "category": category,
                "answer": answer,
                "avg_pos_uid": pos_uid,
                "avg_pos_qid": pos_qid
               }
        X.append(feat)
    
    return X


"""
Get positions
"""
def get_positions(data):
    Y = []
    for key in data:
        position = data[key]['position']
        Y.append([position])
    
    return Y
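To see what `get_avg_pos` computes, here is a self-contained restatement of the same averaging logic run on a toy data set (the toy uids, qids, and positions are made up for illustration):

```python
from collections import defaultdict


def get_avg_pos(data):
    # Group positions by user and by question, then average each group.
    pos_uid, pos_qid = defaultdict(list), defaultdict(list)
    for key in data:
        pos_uid[data[key]['uid']].append(data[key]['position'])
        pos_qid[data[key]['qid']].append(data[key]['position'])
    avg = lambda xs: sum(xs) / len(xs)
    return [{k: avg(v) for k, v in pos_uid.items()},
            {k: avg(v) for k, v in pos_qid.items()}]


# Toy data: user 0 answered qid 1 at position 40 and qid 2 at 60;
# user 1 answered qid 1 at position 20.
toy = {0: {'uid': 0, 'qid': 1, 'position': 40.0},
       1: {'uid': 0, 'qid': 2, 'position': 60.0},
       2: {'uid': 1, 'qid': 1, 'position': 20.0}}
avg_uid, avg_qid = get_avg_pos(toy)
# avg_uid[0] == 50.0 (mean of 40 and 60); avg_qid[1] == 30.0 (mean of 40 and 20)
```

`featurize` falls back to the global mean of these per-key averages whenever a test uid or qid was never seen in training.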

Look at the feature vector.


In [141]:
X_train = featurize(train_set, get_avg_pos(train_set))
Y_train = get_positions(train_set)
print len(X_train)
print len(Y_train)
print X_train[0], Y_train[0]


28494
28494
{'category': 'fine arts', 'uid': '0', 'qid': '1', 'avg_pos_uid': 55.708333333333336, 'q_length': 77, 'answer': 'thomas cole', 'avg_pos_qid': 51.0} [61.0]

Step3: Cross validation


In [142]:
from sklearn.feature_extraction import DictVectorizer


vec = DictVectorizer()
X_train = vec.fit_transform(X_train)
print X_train[0]


  (0, 920)	1.0
  (0, 1020)	51.0
  (0, 1021)	55.7083333333
  (0, 1026)	1.0
  (0, 1033)	77.0
  (0, 1034)	1.0
  (0, 6958)	1.0
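DictVectorizer one-hot encodes string-valued features and passes numeric features through unchanged, which is why the sparse rows above mix 1.0 indicators with raw values like 77.0 and 51.0. A minimal standalone illustration (toy feature dicts, dense output for readability):

```python
from sklearn.feature_extraction import DictVectorizer

# String values become indicator columns ("category=physics"),
# numeric values keep their own column.
toy = [{"category": "physics", "q_length": 77},
       {"category": "literature", "q_length": 104}]
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(toy)
# Columns (sorted): category=literature, category=physics, q_length
# Row 0 -> [0.0, 1.0, 77.0]; row 1 -> [1.0, 0.0, 104.0]
```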

In [143]:
from sklearn import linear_model
from sklearn.cross_validation import train_test_split, cross_val_score
import math
from numpy import abs, sqrt


regressor_names = """
LinearRegression
Ridge
Lasso
ElasticNet
"""
print "=== Linear Cross validation RMSE scores:"
for regressor in regressor_names.split():
    scores = cross_val_score(getattr(linear_model, regressor)(), X_train, Y_train, cv=10,\
                             scoring='mean_squared_error')
    print regressor, sqrt(abs(scores)).mean()


=== Linear Cross validation RMSE scores:
LinearRegression 70.2538305442
Ridge 70.2511715134
Lasso 68.8414140964
ElasticNet 68.8415869663
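cross_val_score negates error-based scores so that "higher is better" holds for every scorer, which is why the code above takes `sqrt(abs(scores))` to recover per-fold RMSE. The conversion in isolation (the fold scores below are made-up numbers, not the notebook's actual results):

```python
import numpy as np

# Hypothetical negated-MSE scores for three CV folds.
neg_mse_scores = np.array([-4900.0, -5200.0, -4700.0])

# abs() undoes the negation; sqrt() turns MSE into RMSE per fold.
rmse_per_fold = np.sqrt(np.abs(neg_mse_scores))
mean_rmse = rmse_per_fold.mean()
```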

Step4: Prediction

Training model


In [149]:
X_train = featurize(train_set, get_avg_pos(train_set))
X_test = featurize(test_set, get_avg_pos(train_set))
for x in X_test[:10]:
    print x

X_train_length = len(X_train)
X = vec.fit_transform(X_train + X_test)
X_train = X[:X_train_length]
X_test = X[X_train_length:]


{'category': 'physics', 'uid': '62', 'qid': '113722', 'avg_pos_uid': 31.233590733590734, 'q_length': 112, 'answer': 'ferromagnetism', 'avg_pos_qid': 113.0}
{'category': 'mathematics', 'uid': '131', 'qid': '9967', 'avg_pos_uid': 36.31506849315068, 'q_length': 104, 'answer': 'david hilbert', 'avg_pos_qid': 15.571428571428571}
{'category': 'literature', 'uid': '20', 'qid': '103709', 'avg_pos_uid': 55.74681753889675, 'q_length': 121, 'answer': 'duino elegies', 'avg_pos_qid': 41.74693710638917}
{'category': 'literature', 'uid': '115', 'qid': '4841', 'avg_pos_uid': 63.7429718875502, 'q_length': 101, 'answer': 'light in august', 'avg_pos_qid': 35.23076923076923}
{'category': 'fine arts', 'uid': '6', 'qid': '1', 'avg_pos_uid': 36.80373831775701, 'q_length': 77, 'answer': 'thomas cole', 'avg_pos_qid': 51.0}
{'category': 'social studies', 'uid': '64', 'qid': '113725', 'avg_pos_uid': 62.46045694200352, 'q_length': 147, 'answer': 'karl marx', 'avg_pos_qid': 101.33333333333333}
{'category': 'mathematics', 'uid': '45', 'qid': '9972', 'avg_pos_uid': 55.51875, 'q_length': 108, 'answer': 'quicksort', 'avg_pos_qid': -87.0}
{'category': 'literature', 'uid': '43', 'qid': '5408', 'avg_pos_uid': 33.85839160839161, 'q_length': 92, 'answer': 'midnights children', 'avg_pos_qid': 55.4}
{'category': 'social studies', 'uid': '119', 'qid': '108613', 'avg_pos_uid': 58.797814207650276, 'q_length': 108, 'answer': 'beowulf', 'avg_pos_qid': 77.77777777777777}
{'category': 'literature', 'uid': '12', 'qid': '5', 'avg_pos_uid': 31.45751633986928, 'q_length': 74, 'answer': 'swing', 'avg_pos_qid': 75.0}
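Note that `vec.fit_transform` is called on `X_train + X_test` combined so that both matrices share one column space; a vectorizer fitted on the train set alone would silently zero out categories that appear only in the test set. A small sketch of the difference, using hypothetical feature dicts:

```python
from sklearn.feature_extraction import DictVectorizer

train_feats = [{"category": "physics"}, {"category": "literature"}]
test_feats = [{"category": "fine arts"}]  # category unseen in train

# Fitting on train + test together gives one shared column space.
vec = DictVectorizer(sparse=False)
X_all = vec.fit_transform(train_feats + test_feats)
X_train, X_test = X_all[:len(train_feats)], X_all[len(train_feats):]

# A vectorizer fitted on train only drops the unseen category:
# the test row comes out all zeros.
vec2 = DictVectorizer(sparse=False)
vec2.fit(train_feats)
zeroed = vec2.transform(test_feats)
```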

Testing the model (prediction)


In [150]:
regressor = linear_model.Lasso()
regressor.fit(X_train, Y_train)


Out[150]:
Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [133]:
predictions = regressor.predict(X_test)
predictions = sorted([[id, predictions[index]] for index, id in enumerate(test_set.keys())])
print len(predictions)
predictions[:5]


4749
Out[133]:
[[7, 48.586204647017659],
 [14, 67.451992463706517],
 [21, 37.980573324627933],
 [28, 61.857326253311747],
 [35, 74.678727574716788]]

Step5: Writing the submission


In [134]:
import csv


predictions.insert(0,["id", "position"])
with open('guess.csv', 'wb') as fp:
    writer = csv.writer(fp, delimiter=',')
    writer.writerows(predictions)

This submission scores 85.85977.