MNIST with Vowpal Wabbit

Convert CSV to VW format

See http://hunch.net/~vw/usage.html

VW Input format

[Label] [Importance] ['Tag]|Namespace Features |Namespace Features ... |Namespace Features

Label

is the real number that we are trying to predict for this example. If the label is omitted, then no training will be performed with the corresponding example, although VW will still compute a prediction.

Importance (importance weight)

is a non-negative real number indicating the relative importance of this example over the others. Omitting this gives a default importance of 1 to the example.

Tag

is a string that serves as an identifier for the example. It is reported back when predictions are made. It doesn't have to be unique. The default value if it is not provided is the empty string. If you provide a tag without a weight you need to disambiguate: either make the tag touch the | (no trailing spaces) or mark it with a leading single-quote '. If you don't provide a tag, you need to have a space before the |.

Namespace

is an identifier of a source of information for the example, optionally followed by a float (e.g., MetricFeatures:3.28), which acts as a global scaling of all the values of the features in this namespace. If the value is omitted, the default is 1. There must be no space between the separator | and the namespace; otherwise the namespace is interpreted as a feature.

Features

is a sequence of whitespace separated strings, each of which is optionally followed by a float (e.g., NumberOfLegs:4.0 HasStripes). Each string is a feature and the value is the feature value for that example. Omitting a feature means that its value is zero. Including a feature but omitting its value means that its value is 1.
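As a minimal sketch of the rules above, the following helper assembles one input line from its parts. The name format_vw_line and its parameters are hypothetical and are not part of VW; the sketch handles a single namespace only and writes the tag with a leading single quote, touching the | separator.

```python
def format_vw_line(label, features, namespace="image", importance=None, tag=None):
    """Build one VW input line (hypothetical helper, not part of VW)."""
    head = [str(label)]
    if importance is not None:
        head.append(str(importance))
    prefix = " ".join(head)
    if tag is not None:
        # A leading single quote marks the tag, which must touch the |.
        prefix += " '" + tag
    else:
        # Without a tag, a space is required before the |.
        prefix += " "
    feats = []
    for name, value in features:
        # A feature written without a value is read by VW as having value 1.
        feats.append(name if value is None else "%s:%s" % (name, value))
    # No space between | and the namespace name.
    return prefix + "|" + namespace + " " + " ".join(feats)


print(format_vw_line(3, [("pxl10", 0.5), ("pxl11", None)]))
# prints: 3 |image pxl10:0.5 pxl11
print(format_vw_line(3, [("pxl10", 0.5)], importance=2, tag="img42"))
# prints: 3 2 'img42|image pxl10:0.5
```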


In [1]:
import sys, csv
from itertools import izip

Function to convert to VW format


In [2]:
def convert_XY_toVW(inputX_file_path, inputY_file_path, output_file_path):
    with open(inputX_file_path, 'rb') as inputX_f, \
         open(inputY_file_path, 'rb') as inputY_f, \
         open(output_file_path, 'wb') as output_f:
        readerX   = csv.reader(inputX_f)
        readerY   = csv.reader(inputY_f)

        # for each line of trainY, trainX
        for row, (X_line, Y_line) in enumerate(izip(readerX, readerY)):

            # write the Y label and the namespace
            # NOTE the digit labels 0..9 are shifted by one,
            # so the VW labels run from 1 to 10
            output_line = str(int(Y_line[0])+1) + " |image "

            # for each non-zero comma-separated value in the csv line
            for i, item in enumerate(X_line):
                if float(item) != 0.0:
                    # write pixel_no:value
                    output_line += "pxl" + str(i) + ":" + str(item) + " "

            output_f.write(output_line+"\n")
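To see what the conversion produces for a single example, here is a small in-memory sketch of the same per-row logic. The helper name row_to_vw is hypothetical, and the sample row is made up for illustration. Unlike the function above, it joins the features with single spaces, so there is no trailing space.

```python
def row_to_vw(pixels, label):
    """Convert one CSV row of pixel strings and its digit label to a VW line."""
    # Shift the digit 0..9 to 1..10, as the function above does.
    vw_label = int(label) + 1
    # Keep only non-zero pixels; VW reads an omitted feature as zero.
    feats = ["pxl%d:%s" % (i, v) for i, v in enumerate(pixels) if float(v) != 0.0]
    return "%d |image %s" % (vw_label, " ".join(feats))


print(row_to_vw(["0", "0", "128", "0", "255"], "7"))
# prints: 8 |image pxl2:128 pxl4:255
```

The sparse encoding matters for MNIST because most pixels in each image are zero, so the VW file lists only a fraction of the 784 columns per example.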

Convert the training data to VW format


In [17]:
inputX_file_path = '../data/trainX.csv' 
inputY_file_path = '../data/trainY.csv' 

output_file_path = '../vw/data/mnist_train.vw'

In [18]:
convert_XY_toVW(inputX_file_path, inputY_file_path, output_file_path)

Convert the test data to VW format


In [20]:
inputX_file_path = '../data/testX.csv' 
inputY_file_path = '../data/testY.csv' 

output_file_path = '../vw/data/mnist_test.vw'

In [21]:
convert_XY_toVW(inputX_file_path, inputY_file_path, output_file_path)

Convert the training PCA data to VW format


In [3]:
inputX_file_path = '../data/trainX_pca.csv' 
inputY_file_path = '../data/trainY.csv' 

output_file_path = '../vw/data/mnist_train_pca.vw'

In [4]:
convert_XY_toVW(inputX_file_path, inputY_file_path, output_file_path)

Convert the test PCA data to VW format


In [5]:
inputX_file_path = '../data/testX_pca.csv' 
inputY_file_path = '../data/testY.csv' 

output_file_path = '../vw/data/mnist_test_pca.vw'

In [6]:
convert_XY_toVW(inputX_file_path, inputY_file_path, output_file_path)

Convert the Kaggle training data to VW format


In [23]:
inputX_file_path = '../kaggle/data/kaggle_trainX.csv' 
inputY_file_path = '../kaggle/data/kaggle_trainY.csv' 

output_file_path = '../vw/data/kaggle_train.vw'

In [24]:
convert_XY_toVW(inputX_file_path, inputY_file_path, output_file_path)

Convert Kaggle PCA training data to VW format


In [7]:
inputX_file_path = '../kaggle/data/kaggle_trainX_pca.csv' 
inputY_file_path = '../kaggle/data/kaggle_trainY.csv' 

output_file_path = '../vw/data/kaggle_train_pca.vw'

In [8]:
convert_XY_toVW(inputX_file_path, inputY_file_path, output_file_path)

Convert the Kaggle test data (X only) to VW format


In [26]:
inputX_file_path = '../kaggle/data/kaggle_testX_deskewed.csv'

output_file_path = '../vw/data/kaggle_test.vw'

In [27]:
with open(inputX_file_path, 'rb') as inputX_f, \
     open(output_file_path, 'wb') as output_f:
    readerX   = csv.reader(inputX_f)

    # for each line of testX
    for row, X_line in enumerate(readerX):

        # write a placeholder label and the namespace
        # NOTE the test set is unlabeled, so every example
        # gets the placeholder label 1
        output_line = str(1) + " |image "

        # for each non-zero comma-separated value in the csv line
        for i, item in enumerate(X_line):
            if float(item) != 0.0:
                # write pixel_number:value
                output_line += "pxl" + str(i) + ":" + str(item) + " "

        output_f.write(output_line+"\n")
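The loop for unlabeled data could also be factored into a reusable generator, sketched below. The name unlabeled_rows_to_vw and the placeholder_label parameter are hypothetical. It accepts any iterable of rows, such as a csv.reader, and yields one VW line per row without a trailing newline.

```python
def unlabeled_rows_to_vw(rows, placeholder_label=1):
    """Yield one VW line per row, using a placeholder label for every example."""
    for row in rows:
        # Keep only non-zero values, written as pxl<column>:<value>.
        feats = ["pxl%d:%s" % (i, v) for i, v in enumerate(row) if float(v) != 0.0]
        yield "%d |image %s" % (placeholder_label, " ".join(feats))


sample_rows = [["0", "12", "0"], ["3.5", "0", "-1"]]
for line in unlabeled_rows_to_vw(sample_rows):
    print(line)
# prints: 1 |image pxl1:12
# prints: 1 |image pxl0:3.5 pxl2:-1
```

With an open input file and output file, the lines would be written with something like output_f.write(line + "\n") for each line yielded from csv.reader(inputX_f).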

Convert the Kaggle PCA test data (X only) to VW format


In [9]:
inputX_file_path = '../kaggle/data/kaggle_testX_pca.csv'

output_file_path = '../vw/data/kaggle_test_pca.vw'

In [10]:
with open(inputX_file_path, 'rb') as inputX_f, \
     open(output_file_path, 'wb') as output_f:
    readerX   = csv.reader(inputX_f)

    # for each line of testX
    for row, X_line in enumerate(readerX):

        # write a placeholder label and the namespace
        # NOTE the test set is unlabeled, so every example
        # gets the placeholder label 1
        output_line = str(1) + " |image "

        # for each non-zero comma-separated value in the csv line
        for i, item in enumerate(X_line):
            if float(item) != 0.0:
                # write pixel_number:value
                output_line += "pxl" + str(i) + ":" + str(item) + " "

        output_f.write(output_line+"\n")
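One way to spot-check a converted file is to parse a line back into its parts. The sketch below assumes the simple layout written by this notebook (a label, no importance or tag, and a single namespace); the name parse_vw_line is hypothetical, and the function is not a general VW parser.

```python
def parse_vw_line(line):
    """Split a single-namespace VW line into (label, namespace, features)."""
    head, _, body = line.partition("|")
    head_tokens = head.split()
    # The label is the first token before the |, if there is one.
    label = head_tokens[0] if head_tokens else None
    tokens = body.split()
    # The first token after the | is the namespace (assumed to be present).
    namespace = tokens[0]
    feats = {}
    for tok in tokens[1:]:
        name, _, value = tok.partition(":")
        # A feature without an explicit value has value 1.
        feats[name] = float(value) if value else 1.0
    return label, namespace, feats


print(parse_vw_line("8 |image pxl2:128 pxl4:255 "))
```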
