model 02

Load train, test, questions data from pklz

First of all, we need to read those three data set.


In [1]:
import gzip
import cPickle as pickle

In [2]:
with gzip.open("../data/train.pklz", "rb") as train_file:
    train_set = pickle.load(train_file)

with gzip.open("../data/test.pklz", "rb") as test_file:
    test_set = pickle.load(test_file)

with gzip.open("../data/questions.pklz", "rb") as questions_file:
    questions = pickle.load(questions_file)

Make training set

For training model, we might need to make feature and lable pair. In this case, we will use only uid, qid, and position for feature.


In [3]:
X = []
Y = []

for key in train_set:
    # We only care about positive case at this time
    if train_set[key]['position'] < 0:
        continue
    uid = train_set[key]['uid']
    qid = train_set[key]['qid']
    pos = train_set[key]['position']
    q_length = max(questions[qid]['pos_token'].keys())
    feat = [uid, qid, q_length]
    X.append(feat)
    Y.append([pos])

In [4]:
print len(X)
print len(Y)
print X[0], Y[0]


20819
20819
[0, 1, 77] [61.0]

It means that user 0 tried to solve question number 1 which has 77 tokens for question and he or she answered at 61st token.

Train model and make predictions

Let's train model and make predictions. We will use simple Linear Regression at this moment.


In [5]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.cross_validation import train_test_split, cross_val_score


X_train, X_test, Y_train, Y_test = train_test_split (X, Y)

regressor = LinearRegression()
scores = cross_val_score(regressor, X, Y, cv=10)
print 'Cross validation r-squared scores:', scores.mean()
print scores

regressor = Ridge()
scores = cross_val_score(regressor, X, Y, cv=10)
print 'Cross validation r-squared scores:', scores.mean()
print scores

regressor = Lasso()
scores = cross_val_score(regressor, X, Y, cv=10)
print 'Cross validation r-squared scores:', scores.mean()
print scores

regressor = ElasticNet()
scores = cross_val_score(regressor, X, Y, cv=10)
print 'Cross validation r-squared scores:', scores.mean()
print scores


Cross validation r-squared scores: 0.270443792668
[ 0.27141933  0.31467769  0.2553982   0.20843913  0.27169493  0.20757046
  0.22137434  0.34932668  0.25841249  0.34612469]
Cross validation r-squared scores: 0.270443792439
[ 0.27141933  0.31467769  0.2553982   0.20843913  0.27169493  0.20757046
  0.22137435  0.34932668  0.25841248  0.34612468]
Cross validation r-squared scores: 0.27043903016
[ 0.27139779  0.31474643  0.25536653  0.20835685  0.27168556  0.20768149
  0.22161699  0.34940817  0.25826054  0.34586995]
Cross validation r-squared scores: 0.270438825246
[ 0.27142147  0.31472377  0.25538215  0.20838956  0.27167852  0.2076529
  0.22154786  0.349398    0.25828273  0.3459113 ]

In [182]:
from sklearn.linear_model import SGDRegressor
from sklearn.cross_validation import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import train_test_split


X_scaler = StandardScaler()
Y_scaler = StandardScaler()
X_train, X_test, Y_train, Y_test = train_test_split (X, Y)
X_train = X_scaler.fit_transform(X_train)
Y_train = Y_scaler.fit_transform(Y_train)
X_test = X_scaler.fit_transform(X_test)
Y_test = Y_scaler.fit_transform(Y_test)

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html

There has four loss-function. ‘squared_loss’, ‘huber’, ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’. Among those, squared_loss is the best in this case.


In [153]:
regressor = SGDRegressor(loss='squared_loss', penalty='l1')
scores = cross_val_score(regressor, X_train, Y_train, cv=10)
print 'Cross validation r-squared scores:', scores.mean()
print scores


/home/sanghee/.pyenv/versions/ml/lib/python2.7/site-packages/sklearn/utils/validation.py:444: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/home/sanghee/.pyenv/versions/ml/lib/python2.7/site-packages/sklearn/utils/validation.py:444: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/home/sanghee/.pyenv/versions/ml/lib/python2.7/site-packages/sklearn/utils/validation.py:444: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/home/sanghee/.pyenv/versions/ml/lib/python2.7/site-packages/sklearn/utils/validation.py:444: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
Cross validation r-squared scores: 0.314327371132
[ 0.31767204  0.26263766  0.29267964  0.34132513  0.33904661  0.29226138
  0.29671757  0.29541327  0.37105898  0.33446145]
/home/sanghee/.pyenv/versions/ml/lib/python2.7/site-packages/sklearn/utils/validation.py:444: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/home/sanghee/.pyenv/versions/ml/lib/python2.7/site-packages/sklearn/utils/validation.py:444: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/home/sanghee/.pyenv/versions/ml/lib/python2.7/site-packages/sklearn/utils/validation.py:444: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/home/sanghee/.pyenv/versions/ml/lib/python2.7/site-packages/sklearn/utils/validation.py:444: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/home/sanghee/.pyenv/versions/ml/lib/python2.7/site-packages/sklearn/utils/validation.py:444: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/home/sanghee/.pyenv/versions/ml/lib/python2.7/site-packages/sklearn/utils/validation.py:444: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

In [154]:
X_test = []
test_id = []

for key in test_set:
    test_id.append(key)
    uid = test_set[key]['uid']
    qid = test_set[key]['qid']
    q_length = max(questions[qid]['pos_token'].keys())
    feat = [uid, qid, q_length]
    X_test.append(feat)

    
X_scaler = StandardScaler()
Y_scaler = StandardScaler()
X_train = X_scaler.fit_transform(X)
Y_train = Y_scaler.fit_transform(Y)
X_test = X_scaler.fit_transform(X_test)

In [155]:
regressor.fit(X_train, Y_train)
predictions = regressor.predict(X_test)
predictions = Y_scaler.inverse_transform(predictions)
predictions = sorted([[id, predictions[index]] for index, id in enumerate(test_id)])
print len(predictions)
predictions[:5]


4749
/home/sanghee/.pyenv/versions/ml/lib/python2.7/site-packages/sklearn/utils/validation.py:444: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
Out[155]:
[[7, 54.823288886368189],
 [14, 52.562402581610151],
 [21, 53.435212507132782],
 [28, 44.051237652777864],
 [35, 69.858409893632682]]

Here is 4749 predictions.

Writing submission.

OK, let's writing submission into guess.csv file. In the given submission form, we realized that we need to put header. So, we will insert header at the first of predictions, and then make it as a file.


In [156]:
import csv


predictions.insert(0,["id", "position"])
with open('guess.csv', 'wb') as fp:
    writer = csv.writer(fp, delimiter=',')
    writer.writerows(predictions)

All right. Let's submit!

And... we got... 5th ranked. It's worse than the first submission. Let's think about why.

5 new CU_K-ml_Stars 96.50206 2 Mon, 06 Apr 2015 21:13:50


In [ ]: