Vowpal Wabbit: the format of the training and testing data files is:

[Label] [Importance [Tag]]|Namespace Features |Namespace Features ... |Namespace Features

with

Namespace=String[:Value]
Features=(String[:Value] )*

Label is the real number that we are trying to predict for this example. If the label is omitted, no training is performed on that example, although VW still computes a prediction. For classification, the label is in {+1,-1}.

Importance (importance weight) is a non-negative real number indicating the relative importance of this example over the others. If omitted, the importance defaults to 1.

Tag is a string that serves as an identifier for the example. It is reported back when predictions are made and does not have to be unique. If it is not provided, it defaults to the empty string. If you provide a tag without a weight, you need to disambiguate: either make the tag touch the | (no trailing spaces) or mark it with a leading single quote '. If you don't provide a tag, you need a space before the |.

Namespace is an identifier of a source of information for the example, optionally followed by a float (e.g., MetricFeatures:3.28) that acts as a global scaling of all the feature values in this namespace. If the value is omitted, the default is 1. The namespace must touch the | separator (no space in between); otherwise it is interpreted as a feature.

Features is a sequence of whitespace-separated strings, each optionally followed by a float (e.g., NumberOfLegs:4.0 HasStripes). Each string is a feature, and the value is the feature value for that example. Omitting a feature means that its value is zero. Including a feature but omitting its value means that its value is 1.
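As a minimal sketch of the format described above, the helper below assembles one line from a label, an optional importance and tag, and a list of namespaces. The function name `vw_line` and its argument layout are assumptions made for illustration; they are not part of VW or of this notebook.

```python
def vw_line(label, namespaces, importance=None, tag=None):
    """Build one VW-format line (hypothetical helper, for illustration only).

    namespaces: list of (namespace_name, [(feature, value_or_None), ...]).
    """
    head = []
    if label is not None:
        head.append(str(label))
    if importance is not None:
        head.append(str(importance))
    if tag is not None:
        # A leading single quote marks the token as a tag, not a weight.
        head.append("'" + str(tag))
    parts = [" ".join(head)]
    for name, feats in namespaces:
        # The namespace name must touch the '|' separator.
        tokens = ["|" + name]
        for feat, value in feats:
            # A feature without an explicit value has value 1.
            tokens.append(feat if value is None else "{0}:{1}".format(feat, value))
        parts.append(" ".join(tokens))
    return " ".join(parts)


print(vw_line(1, [("m", [("NumberOfLegs", 4.0), ("HasStripes", None)])]))
print(repr(vw_line(None, [("t", [("hour_14", None)])])))
```

When the label is omitted, the line starts with a space before the first `|`, which matches the rule for examples without a tag.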

The CSV data have already been saved in HDF5 format (click_data.h5).


In [1]:
import tables
import time
import numpy as np
import cPickle
from itertools import izip

In [2]:
file_handler = tables.open_file("click_data.h5", mode = "r")

In [3]:
X = file_handler.root.train.train_raw.X

In [4]:
y = file_handler.root.train.train_raw.y

In [5]:
X_t = file_handler.root.test.test_raw.X_t

training data


In [6]:
colnames = X.colnames

In [7]:
lines = X.shape[0]

In [8]:
%%time
i = 0
with open('train.vw', 'wb') as fw:
    for row, target in izip(X.iterrows(),y.iterrows()):
        out = "{0} |".format(target[0]*2-1)
        for name in colnames:
            value = str(row[name])
            out += " {0}_{1}".format(name, value)
        out += '\n'
        fw.write(out)
        
        i+=1
        if (i % 1000000) == 0:
            print "{0} lines of {1} written ({2}%)".format(i, lines, 100*i/lines)


1000000 lines of 40428967 written (2%)
2000000 lines of 40428967 written (4%)
3000000 lines of 40428967 written (7%)
4000000 lines of 40428967 written (9%)
5000000 lines of 40428967 written (12%)
6000000 lines of 40428967 written (14%)
7000000 lines of 40428967 written (17%)
8000000 lines of 40428967 written (19%)
9000000 lines of 40428967 written (22%)
10000000 lines of 40428967 written (24%)
11000000 lines of 40428967 written (27%)
12000000 lines of 40428967 written (29%)
13000000 lines of 40428967 written (32%)
14000000 lines of 40428967 written (34%)
15000000 lines of 40428967 written (37%)
16000000 lines of 40428967 written (39%)
17000000 lines of 40428967 written (42%)
18000000 lines of 40428967 written (44%)
19000000 lines of 40428967 written (46%)
20000000 lines of 40428967 written (49%)
21000000 lines of 40428967 written (51%)
22000000 lines of 40428967 written (54%)
23000000 lines of 40428967 written (56%)
24000000 lines of 40428967 written (59%)
25000000 lines of 40428967 written (61%)
26000000 lines of 40428967 written (64%)
27000000 lines of 40428967 written (66%)
28000000 lines of 40428967 written (69%)
29000000 lines of 40428967 written (71%)
30000000 lines of 40428967 written (74%)
31000000 lines of 40428967 written (76%)
32000000 lines of 40428967 written (79%)
33000000 lines of 40428967 written (81%)
34000000 lines of 40428967 written (84%)
35000000 lines of 40428967 written (86%)
36000000 lines of 40428967 written (89%)
37000000 lines of 40428967 written (91%)
38000000 lines of 40428967 written (93%)
39000000 lines of 40428967 written (96%)
40000000 lines of 40428967 written (98%)
CPU times: user 15min 36s, sys: 13.6 s, total: 15min 50s
Wall time: 19min
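The conversion above does two things per row: it maps the target to a VW label with `target[0]*2-1`, and it turns every column into a `name_value` token. The sketch below isolates that logic in a function that can be tried on a plain dict. The name `flat_vw_line` is hypothetical, and the sketch assumes the target column holds 0/1 clicks (so that 0 becomes -1 and 1 becomes +1).

```python
def flat_vw_line(row, colnames, click=None):
    """One VW line without namespaces (hypothetical helper).

    row      : anything that supports row[name], e.g. a dict
    colnames : column names, written in this order
    click    : 0 or 1 (assumed encoding), or None for an unlabeled example
    """
    # 0 -> -1, 1 -> +1; an empty label leaves a space before the '|'
    label = "" if click is None else str(int(click) * 2 - 1)
    feats = " ".join("{0}_{1}".format(name, row[name]) for name in colnames)
    return "{0} | {1}\n".format(label, feats)


example = {"C1": 1005, "hour": 14}
print(repr(flat_vw_line(example, ["C1", "hour"], click=1)))
print(repr(flat_vw_line(example, ["C1", "hour"])))
```

Because each token embeds the column name, the same raw value in two different columns yields two distinct features.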

with namespace


In [8]:
timelist = ['day','hour']
ban = ['banner_pos']
site = ['site_id', 'site_domain', 'site_category']
app = ['app_id', 'app_domain', 'app_category']
device = ['device_id', 'device_ip', 'device_model', 'device_type', 'device_conn_type']
clist = ['C1', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21']

In [9]:
%%time
i = 0
with open('train.vw', 'wb') as fw:
    for row, target in izip(X.iterrows(),y.iterrows()):
        out = "{0}".format(target[0]*2-1)
        
        out += " |t"
        for name in timelist:
            value = str(row[name])
            out += " {0}_{1}".format(name, value)
        
        out += " |b"
        for name in ban:
            value = str(row[name])
            out += " {0}_{1}".format(name, value)
            
        out += " |s"
        for name in site:
            value = str(row[name])
            out += " {0}_{1}".format(name, value)
            
        out += " |a"
        for name in app:
            value = str(row[name])
            out += " {0}_{1}".format(name, value)
            
        out += " |d"
        for name in device:
            value = str(row[name])
            out += " {0}_{1}".format(name, value)
            
        out += " |c"
        for name in clist:
            value = str(row[name])
            out += " {0}_{1}".format(name, value)
             
        out += '\n'
        fw.write(out)
        
        i+=1
        
        if (i % 1000000) == 0:
            print "{0} lines of {1} written ({2}%)".format(i, lines, 100*i/lines)


1000000 lines of 40428967 written (2%)
2000000 lines of 40428967 written (4%)
3000000 lines of 40428967 written (7%)
4000000 lines of 40428967 written (9%)
5000000 lines of 40428967 written (12%)
6000000 lines of 40428967 written (14%)
7000000 lines of 40428967 written (17%)
8000000 lines of 40428967 written (19%)
9000000 lines of 40428967 written (22%)
10000000 lines of 40428967 written (24%)
11000000 lines of 40428967 written (27%)
12000000 lines of 40428967 written (29%)
13000000 lines of 40428967 written (32%)
14000000 lines of 40428967 written (34%)
15000000 lines of 40428967 written (37%)
16000000 lines of 40428967 written (39%)
17000000 lines of 40428967 written (42%)
18000000 lines of 40428967 written (44%)
19000000 lines of 40428967 written (46%)
20000000 lines of 40428967 written (49%)
21000000 lines of 40428967 written (51%)
22000000 lines of 40428967 written (54%)
23000000 lines of 40428967 written (56%)
24000000 lines of 40428967 written (59%)
25000000 lines of 40428967 written (61%)
26000000 lines of 40428967 written (64%)
27000000 lines of 40428967 written (66%)
28000000 lines of 40428967 written (69%)
29000000 lines of 40428967 written (71%)
30000000 lines of 40428967 written (74%)
31000000 lines of 40428967 written (76%)
32000000 lines of 40428967 written (79%)
33000000 lines of 40428967 written (81%)
34000000 lines of 40428967 written (84%)
35000000 lines of 40428967 written (86%)
36000000 lines of 40428967 written (89%)
37000000 lines of 40428967 written (91%)
38000000 lines of 40428967 written (93%)
39000000 lines of 40428967 written (96%)
40000000 lines of 40428967 written (98%)
CPU times: user 16min 9s, sys: 13.3 s, total: 16min 22s
Wall time: 19min 31s
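The per-namespace blocks in the cell above differ only in the namespace letter and the column list, so they can be driven from a single list of (namespace, columns) pairs. The following is a sketch of that idea, not the code that produced the output above; `NAMESPACES` and `row_to_vw` are names introduced here for illustration, and the row is assumed to support `row[name]`.

```python
# (namespace, columns) pairs, mirroring the lists defined in the notebook
NAMESPACES = [
    ("t", ["day", "hour"]),
    ("b", ["banner_pos"]),
    ("s", ["site_id", "site_domain", "site_category"]),
    ("a", ["app_id", "app_domain", "app_category"]),
    ("d", ["device_id", "device_ip", "device_model",
           "device_type", "device_conn_type"]),
    ("c", ["C1", "C14", "C15", "C16", "C17", "C18", "C19", "C20", "C21"]),
]


def row_to_vw(row, label=None, namespaces=NAMESPACES):
    """Build one namespaced VW line (hypothetical helper)."""
    out = "" if label is None else str(label)
    for ns, columns in namespaces:
        # The namespace letter touches the '|'; features follow after spaces.
        out += " |" + ns
        for name in columns:
            out += " {0}_{1}".format(name, row[name])
    return out + "\n"


toy = {"day": 21, "hour": 0, "banner_pos": 1}
print(repr(row_to_vw(toy, -1, [("t", ["day", "hour"]), ("b", ["banner_pos"])])))
```

Keeping the pairs in a list rather than a dict fixes the order in which namespaces are written.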

with one hot encoding some features


In [9]:
with open('indexdiconehot.pkl', 'rb') as f:
    indexdic = cPickle.load(f)

In [10]:
names = set(['day', 'hour', 'banner_pos', 'site_category', 'app_category', 
         'device_type', 'device_conn_type', 
         'C1', 'C15', 'C16', 'C18', 'C20'])

In [11]:
%%time
i = 0
with open('train1.vw', 'wb') as fw:
    for row, target in izip(X.iterrows(),y.iterrows()):
        out = "{0}".format(target[0]*2-1)
        
        out += " |t"
        for name in timelist:
            value = str(row[name])
            if name in names:
                out += " {0}".format(indexdic[name][value])
            else:
                out += " {0}_{1}".format(name, value)
        
        out += " |b"
        for name in ban:
            value = str(row[name])
            if name in names:
                out += " {0}".format(indexdic[name][value])
            else:
                out += " {0}_{1}".format(name, value)
            
        out += " |s"
        for name in site:
            value = str(row[name])
            if name in names:
                out += " {0}".format(indexdic[name][value])
            else:
                out += " {0}_{1}".format(name, value)
            
        out += " |a"
        for name in app:
            value = str(row[name])
            if name in names:
                out += " {0}".format(indexdic[name][value])
            else:
                out += " {0}_{1}".format(name, value)
            
        out += " |d"
        for name in device:
            value = str(row[name])
            if name in names:
                out += " {0}".format(indexdic[name][value])
            else:
                out += " {0}_{1}".format(name, value)
            
        out += " |c"
        for name in clist:
            value = str(row[name])
            if name in names:
                out += " {0}".format(indexdic[name][value])
            else:
                out += " {0}_{1}".format(name, value)
             
        out += '\n'
        fw.write(out)
        
        i+=1
        
        if (i % 1000000) == 0:
            print "{0} lines of {1} written ({2}%)".format(i, lines, 100*i/lines)


1000000 lines of 40428967 written (2%)
2000000 lines of 40428967 written (4%)
3000000 lines of 40428967 written (7%)
4000000 lines of 40428967 written (9%)
5000000 lines of 40428967 written (12%)
6000000 lines of 40428967 written (14%)
7000000 lines of 40428967 written (17%)
8000000 lines of 40428967 written (19%)
9000000 lines of 40428967 written (22%)
10000000 lines of 40428967 written (24%)
11000000 lines of 40428967 written (27%)
12000000 lines of 40428967 written (29%)
13000000 lines of 40428967 written (32%)
14000000 lines of 40428967 written (34%)
15000000 lines of 40428967 written (37%)
16000000 lines of 40428967 written (39%)
17000000 lines of 40428967 written (42%)
18000000 lines of 40428967 written (44%)
19000000 lines of 40428967 written (46%)
20000000 lines of 40428967 written (49%)
21000000 lines of 40428967 written (51%)
22000000 lines of 40428967 written (54%)
23000000 lines of 40428967 written (56%)
24000000 lines of 40428967 written (59%)
25000000 lines of 40428967 written (61%)
26000000 lines of 40428967 written (64%)
27000000 lines of 40428967 written (66%)
28000000 lines of 40428967 written (69%)
29000000 lines of 40428967 written (71%)
30000000 lines of 40428967 written (74%)
31000000 lines of 40428967 written (76%)
32000000 lines of 40428967 written (79%)
33000000 lines of 40428967 written (81%)
34000000 lines of 40428967 written (84%)
35000000 lines of 40428967 written (86%)
36000000 lines of 40428967 written (89%)
37000000 lines of 40428967 written (91%)
38000000 lines of 40428967 written (93%)
39000000 lines of 40428967 written (96%)
40000000 lines of 40428967 written (98%)
CPU times: user 17min 58s, sys: 9.13 s, total: 18min 8s
Wall time: 20min 39s
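The rule applied to each column in the cell above is: columns listed in `names` are replaced by their index from `indexdic`, and all other columns keep the `name_value` token. A small sketch of that rule follows. The helper name `encode_feature` is hypothetical, and the structure of `indexdic` (column name mapped to a dict from string value to integer index) is an assumption inferred from how it is indexed in the notebook.

```python
def encode_feature(name, value, indexdic, names):
    """Return the VW token for one column value (hypothetical helper)."""
    if name in names:
        # Index-encoded column: the token is the index itself.
        # A value missing from indexdic[name] raises KeyError here.
        return str(indexdic[name][value])
    # Every other column keeps the readable name_value form.
    return "{0}_{1}".format(name, value)


# Toy stand-in for the pickled dictionary (assumed structure).
toy_indexdic = {"hour": {"14": 3}}
toy_names = set(["hour"])
print(encode_feature("hour", "14", toy_indexdic, toy_names))
print(encode_feature("site_id", "85f751fd", toy_indexdic, toy_names))
```

The values are looked up as strings because the notebook converts each cell with `str(row[name])` before indexing.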

with one hot encoding all the features: version 1


In [9]:
with open('indexdicless.pkl', 'rb') as f:
    indexdic = cPickle.load(f)

In [10]:
%%time
i = 0
# (namespace, columns) pairs, written in this order on every line
groups = [('t', timelist), ('b', ban), ('s', site),
          ('a', app), ('d', device), ('c', clist)]
with open('train2.vw', 'wb') as fw:
    for row, target in izip(X.iterrows(),y.iterrows()):
        out = "{0}".format(target[0]*2-1)

        for ns, columns in groups:
            out += " |" + ns
            for name in columns:
                value = str(row[name])
                if value in indexdic[name]:
                    out += " {0}".format(indexdic[name][value])
                elif 'other' in indexdic[name]:
                    # value has no index of its own: use the 'other' bucket
                    out += " {0}".format(indexdic[name]['other'])
                else:
                    # a bare break would only leave the inner loop and still
                    # write an incomplete line, so stop the conversion instead
                    raise KeyError("no index for {0}={1}".format(name, value))

        out += '\n'
        fw.write(out)

        i+=1

        if (i % 1000000) == 0:
            print "{0} lines of {1} written ({2}%)".format(i, lines, 100*i/lines)


1000000 lines of 40428967 written (2%)
2000000 lines of 40428967 written (4%)
3000000 lines of 40428967 written (7%)
4000000 lines of 40428967 written (9%)
5000000 lines of 40428967 written (12%)
6000000 lines of 40428967 written (14%)
7000000 lines of 40428967 written (17%)
8000000 lines of 40428967 written (19%)
9000000 lines of 40428967 written (22%)
10000000 lines of 40428967 written (24%)
11000000 lines of 40428967 written (27%)
12000000 lines of 40428967 written (29%)
13000000 lines of 40428967 written (32%)
14000000 lines of 40428967 written (34%)
15000000 lines of 40428967 written (37%)
16000000 lines of 40428967 written (39%)
17000000 lines of 40428967 written (42%)
18000000 lines of 40428967 written (44%)
19000000 lines of 40428967 written (46%)
20000000 lines of 40428967 written (49%)
21000000 lines of 40428967 written (51%)
22000000 lines of 40428967 written (54%)
23000000 lines of 40428967 written (56%)
24000000 lines of 40428967 written (59%)
25000000 lines of 40428967 written (61%)
26000000 lines of 40428967 written (64%)
27000000 lines of 40428967 written (66%)
28000000 lines of 40428967 written (69%)
29000000 lines of 40428967 written (71%)
30000000 lines of 40428967 written (74%)
31000000 lines of 40428967 written (76%)
32000000 lines of 40428967 written (79%)
33000000 lines of 40428967 written (81%)
34000000 lines of 40428967 written (84%)
35000000 lines of 40428967 written (86%)
36000000 lines of 40428967 written (89%)
37000000 lines of 40428967 written (91%)
38000000 lines of 40428967 written (93%)
39000000 lines of 40428967 written (96%)
40000000 lines of 40428967 written (98%)
CPU times: user 21min 2s, sys: 9.7 s, total: 21min 12s
Wall time: 22min 45s
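The lookup order used above (the exact value first, then the column's 'other' bucket, then failure) can be isolated in a small helper. This is a sketch under the assumption that `indexdic` maps each column name to a dict from string value to index; the name `lookup_index` is hypothetical. Raising an exception in the last case stops the conversion, whereas a `break` inside a per-namespace loop only skips the rest of that namespace and the incomplete line is still written.

```python
def lookup_index(indexdic, name, value):
    """Index for one column value, falling back to 'other' (hypothetical helper)."""
    table = indexdic[name]
    if value in table:
        return table[value]
    if 'other' in table:
        # Values without their own index share one bucket per column.
        return table['other']
    # No exact match and no fallback bucket: fail loudly.
    raise KeyError("no index for {0}={1!r} and no 'other' bucket".format(name, value))


# Toy stand-in for the pickled dictionary (assumed structure).
toy = {"C1": {"1005": 7, "other": 9}, "day": {"21": 1}}
print(lookup_index(toy, "C1", "1005"))
print(lookup_index(toy, "C1", "9999"))
```

A column such as `day` in the toy dictionary has no 'other' entry, so an unseen day raises `KeyError` instead of producing a silent gap in the output line.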

with one hot encoding all the features: version 2


In [9]:
with open('indexdicless2.pkl', 'rb') as f:
    indexdic = cPickle.load(f)

In [10]:
%%time
i = 0
# (namespace, columns) pairs, written in this order on every line
groups = [('t', timelist), ('b', ban), ('s', site),
          ('a', app), ('d', device), ('c', clist)]
# NOTE: 'train1.vw' is also the output name of the "one hot encoding some
# features" cell, so running this cell overwrites that file
with open('train1.vw', 'wb') as fw:
    for row, target in izip(X.iterrows(),y.iterrows()):
        out = "{0}".format(target[0]*2-1)

        for ns, columns in groups:
            out += " |" + ns
            for name in columns:
                value = str(row[name])
                if value in indexdic[name]:
                    out += " {0}".format(indexdic[name][value])
                elif 'other' in indexdic[name]:
                    # value has no index of its own: use the 'other' bucket
                    out += " {0}".format(indexdic[name]['other'])
                else:
                    # a bare break would only leave the inner loop and still
                    # write an incomplete line, so stop the conversion instead
                    raise KeyError("no index for {0}={1}".format(name, value))

        out += '\n'
        fw.write(out)

        i+=1

        if (i % 1000000) == 0:
            print "{0} lines of {1} written ({2}%)".format(i, lines, 100*i/lines)


1000000 lines of 40428967 written (2%)
2000000 lines of 40428967 written (4%)
3000000 lines of 40428967 written (7%)
4000000 lines of 40428967 written (9%)
5000000 lines of 40428967 written (12%)
6000000 lines of 40428967 written (14%)
7000000 lines of 40428967 written (17%)
8000000 lines of 40428967 written (19%)
9000000 lines of 40428967 written (22%)
10000000 lines of 40428967 written (24%)
11000000 lines of 40428967 written (27%)
12000000 lines of 40428967 written (29%)
13000000 lines of 40428967 written (32%)
14000000 lines of 40428967 written (34%)
15000000 lines of 40428967 written (37%)
16000000 lines of 40428967 written (39%)
17000000 lines of 40428967 written (42%)
18000000 lines of 40428967 written (44%)
19000000 lines of 40428967 written (46%)
20000000 lines of 40428967 written (49%)
21000000 lines of 40428967 written (51%)
22000000 lines of 40428967 written (54%)
23000000 lines of 40428967 written (56%)
24000000 lines of 40428967 written (59%)
25000000 lines of 40428967 written (61%)
26000000 lines of 40428967 written (64%)
27000000 lines of 40428967 written (66%)
28000000 lines of 40428967 written (69%)
29000000 lines of 40428967 written (71%)
30000000 lines of 40428967 written (74%)
31000000 lines of 40428967 written (76%)
32000000 lines of 40428967 written (79%)
33000000 lines of 40428967 written (81%)
34000000 lines of 40428967 written (84%)
35000000 lines of 40428967 written (86%)
36000000 lines of 40428967 written (89%)
37000000 lines of 40428967 written (91%)
38000000 lines of 40428967 written (93%)
39000000 lines of 40428967 written (96%)
40000000 lines of 40428967 written (98%)
CPU times: user 21min 9s, sys: 11.2 s, total: 21min 20s
Wall time: 22min 59s

with one hot encoding all the features: version 3


In [9]:
with open('indexdicless3.pkl', 'rb') as f:
    indexdic = cPickle.load(f)

In [10]:
%%time
i = 0
# (namespace, columns) pairs, written in this order on every line
groups = [('t', timelist), ('b', ban), ('s', site),
          ('a', app), ('d', device), ('c', clist)]
# NOTE: 'train1.vw' is the same output name used by earlier cells, so
# running this cell overwrites that file
with open('train1.vw', 'wb') as fw:
    for row, target in izip(X.iterrows(),y.iterrows()):
        out = "{0}".format(target[0]*2-1)

        for ns, columns in groups:
            out += " |" + ns
            for name in columns:
                value = str(row[name])
                if value in indexdic[name]:
                    out += " {0}".format(indexdic[name][value])
                elif 'other' in indexdic[name]:
                    # value has no index of its own: use the 'other' bucket
                    out += " {0}".format(indexdic[name]['other'])
                else:
                    # a bare break would only leave the inner loop and still
                    # write an incomplete line, so stop the conversion instead
                    raise KeyError("no index for {0}={1}".format(name, value))

        out += '\n'
        fw.write(out)

        i+=1

        if (i % 1000000) == 0:
            print "{0} lines of {1} written ({2}%)".format(i, lines, 100*i/lines)


1000000 lines of 40428967 written (2%)
2000000 lines of 40428967 written (4%)
3000000 lines of 40428967 written (7%)
4000000 lines of 40428967 written (9%)
5000000 lines of 40428967 written (12%)
6000000 lines of 40428967 written (14%)
7000000 lines of 40428967 written (17%)
8000000 lines of 40428967 written (19%)
9000000 lines of 40428967 written (22%)
10000000 lines of 40428967 written (24%)
11000000 lines of 40428967 written (27%)
12000000 lines of 40428967 written (29%)
13000000 lines of 40428967 written (32%)
14000000 lines of 40428967 written (34%)
15000000 lines of 40428967 written (37%)
16000000 lines of 40428967 written (39%)
17000000 lines of 40428967 written (42%)
18000000 lines of 40428967 written (44%)
19000000 lines of 40428967 written (46%)
20000000 lines of 40428967 written (49%)
21000000 lines of 40428967 written (51%)
22000000 lines of 40428967 written (54%)
23000000 lines of 40428967 written (56%)
24000000 lines of 40428967 written (59%)
25000000 lines of 40428967 written (61%)
26000000 lines of 40428967 written (64%)
27000000 lines of 40428967 written (66%)
28000000 lines of 40428967 written (69%)
29000000 lines of 40428967 written (71%)
30000000 lines of 40428967 written (74%)
31000000 lines of 40428967 written (76%)
32000000 lines of 40428967 written (79%)
33000000 lines of 40428967 written (81%)
34000000 lines of 40428967 written (84%)
35000000 lines of 40428967 written (86%)
36000000 lines of 40428967 written (89%)
37000000 lines of 40428967 written (91%)
38000000 lines of 40428967 written (93%)
39000000 lines of 40428967 written (96%)
40000000 lines of 40428967 written (98%)
CPU times: user 20min 47s, sys: 9.86 s, total: 20min 56s
Wall time: 22min 35s

test data


In [11]:
colnames = X_t.colnames

In [12]:
lines = X_t.shape[0]

In [11]:
%%time
i = 0
with open('test.vw', 'wb') as fw:
    for row in X_t.iterrows():
        out = " |"
        for name in colnames:
            value = str(row[name])
            out += " {0}_{1}".format(name, value)
        out += '\n'
        fw.write(out)
        
        i+=1
        if (i % 100000) == 0:
            print "{0} lines of {1} written ({2}%)".format(i, lines, 100*i/lines)


100000 lines of 4577464 written (2%)
200000 lines of 4577464 written (4%)
300000 lines of 4577464 written (6%)
400000 lines of 4577464 written (8%)
500000 lines of 4577464 written (10%)
600000 lines of 4577464 written (13%)
700000 lines of 4577464 written (15%)
800000 lines of 4577464 written (17%)
900000 lines of 4577464 written (19%)
1000000 lines of 4577464 written (21%)
1100000 lines of 4577464 written (24%)
1200000 lines of 4577464 written (26%)
1300000 lines of 4577464 written (28%)
1400000 lines of 4577464 written (30%)
1500000 lines of 4577464 written (32%)
1600000 lines of 4577464 written (34%)
1700000 lines of 4577464 written (37%)
1800000 lines of 4577464 written (39%)
1900000 lines of 4577464 written (41%)
2000000 lines of 4577464 written (43%)
2100000 lines of 4577464 written (45%)
2200000 lines of 4577464 written (48%)
2300000 lines of 4577464 written (50%)
2400000 lines of 4577464 written (52%)
2500000 lines of 4577464 written (54%)
2600000 lines of 4577464 written (56%)
2700000 lines of 4577464 written (58%)
2800000 lines of 4577464 written (61%)
2900000 lines of 4577464 written (63%)
3000000 lines of 4577464 written (65%)
3100000 lines of 4577464 written (67%)
3200000 lines of 4577464 written (69%)
3300000 lines of 4577464 written (72%)
3400000 lines of 4577464 written (74%)
3500000 lines of 4577464 written (76%)
3600000 lines of 4577464 written (78%)
3700000 lines of 4577464 written (80%)
3800000 lines of 4577464 written (83%)
3900000 lines of 4577464 written (85%)
4000000 lines of 4577464 written (87%)
4100000 lines of 4577464 written (89%)
4200000 lines of 4577464 written (91%)
4300000 lines of 4577464 written (93%)
4400000 lines of 4577464 written (96%)
4500000 lines of 4577464 written (98%)
CPU times: user 1min 25s, sys: 1.6 s, total: 1min 27s
Wall time: 1min 48s

with namespace


In [13]:
timelist = ['day','hour']
ban = ['banner_pos']
site = ['site_id', 'site_domain', 'site_category']
app = ['app_id', 'app_domain', 'app_category']
device = ['device_id', 'device_ip', 'device_model', 'device_type', 'device_conn_type']
clist = ['C1', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21']

In [13]:
%%time
i = 0
with open('test.vw', 'wb') as fw:
    for row in X_t.iterrows():
        out = " "
        
        out += " |t"
        for name in timelist:
            value = str(row[name])
            out += " {0}_{1}".format(name, value)
        
        out += " |b"
        for name in ban:
            value = str(row[name])
            out += " {0}_{1}".format(name, value)
            
        out += " |s"
        for name in site:
            value = str(row[name])
            out += " {0}_{1}".format(name, value)
            
        out += " |a"
        for name in app:
            value = str(row[name])
            out += " {0}_{1}".format(name, value)
            
        out += " |d"
        for name in device:
            value = str(row[name])
            out += " {0}_{1}".format(name, value)
            
        out += " |c"
        for name in clist:
            value = str(row[name])
            out += " {0}_{1}".format(name, value)
            
        out += '\n'
        fw.write(out)
        
        i+=1
        
        if (i % 100000) == 0:
            print "{0} lines of {1} written ({2}%)".format(i, lines, 100*i/lines)


100000 lines of 4577464 written (2%)
200000 lines of 4577464 written (4%)
300000 lines of 4577464 written (6%)
400000 lines of 4577464 written (8%)
500000 lines of 4577464 written (10%)
600000 lines of 4577464 written (13%)
700000 lines of 4577464 written (15%)
800000 lines of 4577464 written (17%)
900000 lines of 4577464 written (19%)
1000000 lines of 4577464 written (21%)
1100000 lines of 4577464 written (24%)
1200000 lines of 4577464 written (26%)
1300000 lines of 4577464 written (28%)
1400000 lines of 4577464 written (30%)
1500000 lines of 4577464 written (32%)
1600000 lines of 4577464 written (34%)
1700000 lines of 4577464 written (37%)
1800000 lines of 4577464 written (39%)
1900000 lines of 4577464 written (41%)
2000000 lines of 4577464 written (43%)
2100000 lines of 4577464 written (45%)
2200000 lines of 4577464 written (48%)
2300000 lines of 4577464 written (50%)
2400000 lines of 4577464 written (52%)
2500000 lines of 4577464 written (54%)
2600000 lines of 4577464 written (56%)
2700000 lines of 4577464 written (58%)
2800000 lines of 4577464 written (61%)
2900000 lines of 4577464 written (63%)
3000000 lines of 4577464 written (65%)
3100000 lines of 4577464 written (67%)
3200000 lines of 4577464 written (69%)
3300000 lines of 4577464 written (72%)
3400000 lines of 4577464 written (74%)
3500000 lines of 4577464 written (76%)
3600000 lines of 4577464 written (78%)
3700000 lines of 4577464 written (80%)
3800000 lines of 4577464 written (83%)
3900000 lines of 4577464 written (85%)
4000000 lines of 4577464 written (87%)
4100000 lines of 4577464 written (89%)
4200000 lines of 4577464 written (91%)
4300000 lines of 4577464 written (93%)
4400000 lines of 4577464 written (96%)
4500000 lines of 4577464 written (98%)
CPU times: user 1min 24s, sys: 1.12 s, total: 1min 25s
Wall time: 1min 47s

with one hot encoding some features


In [ ]:
with open('indexdiconehot.pkl', 'rb') as f:
    indexdic = cPickle.load(f)

In [ ]:
names = set(['day', 'hour', 'banner_pos', 'site_category', 'app_category', 
         'device_type', 'device_conn_type', 
         'C1', 'C15', 'C16', 'C18', 'C20'])

In [ ]:
%%time
# Namespace letter -> columns written under it, in output order.
namespaces = [('t', timelist), ('b', ban), ('s', site),
              ('a', app), ('d', device), ('c', clist)]
i = 0
with open('test1.vw', 'wb') as fw:
    for row in X_t.iterrows():
        # No label: VW only computes a prediction for these examples.
        out = " "

        for ns, columns in namespaces:
            out += " |" + ns
            for name in columns:
                value = str(row[name])
                if name in names:
                    # One-hot encoded column: write its integer index.
                    out += " {0}".format(indexdic[name][value])
                else:
                    # Remaining columns: write a "name_value" string feature.
                    out += " {0}_{1}".format(name, value)

        out += '\n'
        fw.write(out)

        i += 1

        if (i % 100000) == 0:
            print "{0} lines of {1} written ({2}%)".format(i, lines, 100*i/lines)
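
To see what a single output line looks like, the loop body of the cell above can be exercised on one toy row. The sketch below mirrors that logic in a function. The name `format_vw_line`, the toy columns, and the toy index values are made up for illustration, and a plain dict stands in for a PyTables row.

```python
def format_vw_line(row, namespaces, indexdic, names):
    """Build one unlabeled VW line.

    Columns listed in names are written as their integer index,
    all other columns as a 'name_value' string feature.
    """
    out = " "
    for ns, columns in namespaces:
        out += " |" + ns
        for name in columns:
            value = str(row[name])
            if name in names:
                out += " {0}".format(indexdic[name][value])
            else:
                out += " {0}_{1}".format(name, value)
    return out + "\n"

# Toy inputs (assumptions, not taken from the click data).
toy_row = {'hour': 14, 'site_id': 'abc', 'banner_pos': 0}
toy_namespaces = [('t', ['hour']), ('b', ['banner_pos']), ('s', ['site_id'])]
toy_indexdic = {'hour': {'14': 3}, 'banner_pos': {'0': 7}}
toy_names = set(['hour', 'banner_pos'])

line = format_vw_line(toy_row, toy_namespaces, toy_indexdic, toy_names)
print(repr(line))  # '  |t 3 |b 7 |s site_id_abc\n'
```

The line starts with whitespace and carries no label, which matches the format description at the top: VW computes a prediction for such an example but does not train on it.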

With one-hot encoding of all the features: version 1


In [14]:
%%time
# Namespace letter -> columns written under it, in output order.
namespaces = [('t', timelist), ('b', ban), ('s', site),
              ('a', app), ('d', device), ('c', clist)]
# Index written when a value is unknown and the column has no 'other' entry.
unseen_index = 617958
i = 0
with open('test2.vw', 'wb') as fw:
    for row in X_t.iterrows():
        # No label: VW only computes a prediction for these examples.
        out = " "

        for ns, columns in namespaces:
            out += " |" + ns
            for name in columns:
                value = str(row[name])
                if value in indexdic[name]:
                    out += " {0}".format(indexdic[name][value])
                elif 'other' in indexdic[name]:
                    out += " {0}".format(indexdic[name]['other'])
                else:
                    out += " {0}".format(unseen_index)

        out += '\n'
        fw.write(out)

        i += 1

        if (i % 100000) == 0:
            print "{0} lines of {1} written ({2}%)".format(i, lines, 100*i/lines)


100000 lines of 4577464 written (2%)
200000 lines of 4577464 written (4%)
300000 lines of 4577464 written (6%)
400000 lines of 4577464 written (8%)
500000 lines of 4577464 written (10%)
600000 lines of 4577464 written (13%)
700000 lines of 4577464 written (15%)
800000 lines of 4577464 written (17%)
900000 lines of 4577464 written (19%)
1000000 lines of 4577464 written (21%)
1100000 lines of 4577464 written (24%)
1200000 lines of 4577464 written (26%)
1300000 lines of 4577464 written (28%)
1400000 lines of 4577464 written (30%)
1500000 lines of 4577464 written (32%)
1600000 lines of 4577464 written (34%)
1700000 lines of 4577464 written (37%)
1800000 lines of 4577464 written (39%)
1900000 lines of 4577464 written (41%)
2000000 lines of 4577464 written (43%)
2100000 lines of 4577464 written (45%)
2200000 lines of 4577464 written (48%)
2300000 lines of 4577464 written (50%)
2400000 lines of 4577464 written (52%)
2500000 lines of 4577464 written (54%)
2600000 lines of 4577464 written (56%)
2700000 lines of 4577464 written (58%)
2800000 lines of 4577464 written (61%)
2900000 lines of 4577464 written (63%)
3000000 lines of 4577464 written (65%)
3100000 lines of 4577464 written (67%)
3200000 lines of 4577464 written (69%)
3300000 lines of 4577464 written (72%)
3400000 lines of 4577464 written (74%)
3500000 lines of 4577464 written (76%)
3600000 lines of 4577464 written (78%)
3700000 lines of 4577464 written (80%)
3800000 lines of 4577464 written (83%)
3900000 lines of 4577464 written (85%)
4000000 lines of 4577464 written (87%)
4100000 lines of 4577464 written (89%)
4200000 lines of 4577464 written (91%)
4300000 lines of 4577464 written (93%)
4400000 lines of 4577464 written (96%)
4500000 lines of 4577464 written (98%)
CPU times: user 1min 53s, sys: 952 ms, total: 1min 54s
Wall time: 2min 3s
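
The three-way lookup in the version 1 cell (known value, then the column's `'other'` entry, then a fixed fallback index) can also be expressed as a small function. This is a sketch with a hypothetical name, `lookup_index`, and toy dictionaries. The fallback value 617958 is the constant from that cell.

```python
def lookup_index(indexdic, name, value, fallback):
    """Return the index for value, else the column's 'other' index, else fallback."""
    column = indexdic[name]
    if value in column:
        return column[value]
    return column.get('other', fallback)

# Toy dictionary (assumed contents, for illustration only).
toy_indexdic = {
    'site_id': {'abc': 10, 'other': 11},  # column with an 'other' entry
    'C1': {'1005': 20},                   # column without one
}
FALLBACK = 617958  # constant taken from the version 1 cell

known = lookup_index(toy_indexdic, 'site_id', 'abc', FALLBACK)   # value present
rare = lookup_index(toy_indexdic, 'site_id', 'zzz', FALLBACK)    # falls back to 'other'
unseen = lookup_index(toy_indexdic, 'C1', '9999', FALLBACK)      # falls back to constant
```

Keeping the lookup in one place means the fallback constant only has to change in a single spot when the index dictionary changes between versions.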

With one-hot encoding of all the features: version 2


In [14]:
%%time
# Namespace letter -> columns written under it, in output order.
namespaces = [('t', timelist), ('b', ban), ('s', site),
              ('a', app), ('d', device), ('c', clist)]
# Index written when a value is unknown and the column has no 'other' entry.
unseen_index = 947464
i = 0
with open('test1.vw', 'wb') as fw:
    for row in X_t.iterrows():
        # No label: VW only computes a prediction for these examples.
        out = " "

        for ns, columns in namespaces:
            out += " |" + ns
            for name in columns:
                value = str(row[name])
                if value in indexdic[name]:
                    out += " {0}".format(indexdic[name][value])
                elif 'other' in indexdic[name]:
                    out += " {0}".format(indexdic[name]['other'])
                else:
                    out += " {0}".format(unseen_index)

        out += '\n'
        fw.write(out)

        i += 1

        if (i % 100000) == 0:
            print "{0} lines of {1} written ({2}%)".format(i, lines, 100*i/lines)


100000 lines of 4577464 written (2%)
200000 lines of 4577464 written (4%)
300000 lines of 4577464 written (6%)
400000 lines of 4577464 written (8%)
500000 lines of 4577464 written (10%)
600000 lines of 4577464 written (13%)
700000 lines of 4577464 written (15%)
800000 lines of 4577464 written (17%)
900000 lines of 4577464 written (19%)
1000000 lines of 4577464 written (21%)
1100000 lines of 4577464 written (24%)
1200000 lines of 4577464 written (26%)
1300000 lines of 4577464 written (28%)
1400000 lines of 4577464 written (30%)
1500000 lines of 4577464 written (32%)
1600000 lines of 4577464 written (34%)
1700000 lines of 4577464 written (37%)
1800000 lines of 4577464 written (39%)
1900000 lines of 4577464 written (41%)
2000000 lines of 4577464 written (43%)
2100000 lines of 4577464 written (45%)
2200000 lines of 4577464 written (48%)
2300000 lines of 4577464 written (50%)
2400000 lines of 4577464 written (52%)
2500000 lines of 4577464 written (54%)
2600000 lines of 4577464 written (56%)
2700000 lines of 4577464 written (58%)
2800000 lines of 4577464 written (61%)
2900000 lines of 4577464 written (63%)
3000000 lines of 4577464 written (65%)
3100000 lines of 4577464 written (67%)
3200000 lines of 4577464 written (69%)
3300000 lines of 4577464 written (72%)
3400000 lines of 4577464 written (74%)
3500000 lines of 4577464 written (76%)
3600000 lines of 4577464 written (78%)
3700000 lines of 4577464 written (80%)
3800000 lines of 4577464 written (83%)
3900000 lines of 4577464 written (85%)
4000000 lines of 4577464 written (87%)
4100000 lines of 4577464 written (89%)
4200000 lines of 4577464 written (91%)
4300000 lines of 4577464 written (93%)
4400000 lines of 4577464 written (96%)
4500000 lines of 4577464 written (98%)
CPU times: user 1min 57s, sys: 1.02 s, total: 1min 58s
Wall time: 2min 9s

With one-hot encoding of all the features: version 3


In [14]:
%%time
# Namespace letter -> columns written under it, in output order.
namespaces = [('t', timelist), ('b', ban), ('s', site),
              ('a', app), ('d', device), ('c', clist)]
# Index written when a value is unknown and the column has no 'other' entry.
unseen_index = 636705
i = 0
with open('test1.vw', 'wb') as fw:
    for row in X_t.iterrows():
        # No label: VW only computes a prediction for these examples.
        out = " "

        for ns, columns in namespaces:
            out += " |" + ns
            for name in columns:
                value = str(row[name])
                if value in indexdic[name]:
                    out += " {0}".format(indexdic[name][value])
                elif 'other' in indexdic[name]:
                    out += " {0}".format(indexdic[name]['other'])
                else:
                    out += " {0}".format(unseen_index)

        out += '\n'
        fw.write(out)

        i += 1

        if (i % 100000) == 0:
            print "{0} lines of {1} written ({2}%)".format(i, lines, 100*i/lines)


100000 lines of 4577464 written (2%)
200000 lines of 4577464 written (4%)
300000 lines of 4577464 written (6%)
400000 lines of 4577464 written (8%)
500000 lines of 4577464 written (10%)
600000 lines of 4577464 written (13%)
700000 lines of 4577464 written (15%)
800000 lines of 4577464 written (17%)
900000 lines of 4577464 written (19%)
1000000 lines of 4577464 written (21%)
1100000 lines of 4577464 written (24%)
1200000 lines of 4577464 written (26%)
1300000 lines of 4577464 written (28%)
1400000 lines of 4577464 written (30%)
1500000 lines of 4577464 written (32%)
1600000 lines of 4577464 written (34%)
1700000 lines of 4577464 written (37%)
1800000 lines of 4577464 written (39%)
1900000 lines of 4577464 written (41%)
2000000 lines of 4577464 written (43%)
2100000 lines of 4577464 written (45%)
2200000 lines of 4577464 written (48%)
2300000 lines of 4577464 written (50%)
2400000 lines of 4577464 written (52%)
2500000 lines of 4577464 written (54%)
2600000 lines of 4577464 written (56%)
2700000 lines of 4577464 written (58%)
2800000 lines of 4577464 written (61%)
2900000 lines of 4577464 written (63%)
3000000 lines of 4577464 written (65%)
3100000 lines of 4577464 written (67%)
3200000 lines of 4577464 written (69%)
3300000 lines of 4577464 written (72%)
3400000 lines of 4577464 written (74%)
3500000 lines of 4577464 written (76%)
3600000 lines of 4577464 written (78%)
3700000 lines of 4577464 written (80%)
3800000 lines of 4577464 written (83%)
3900000 lines of 4577464 written (85%)
4000000 lines of 4577464 written (87%)
4100000 lines of 4577464 written (89%)
4200000 lines of 4577464 written (91%)
4300000 lines of 4577464 written (93%)
4400000 lines of 4577464 written (96%)
4500000 lines of 4577464 written (98%)
CPU times: user 1min 56s, sys: 920 ms, total: 1min 57s
Wall time: 2min
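
One quick way to sanity-check the generated files is to parse a few lines back into namespaces and look at the features in each. The following is a minimal sketch, assuming lines shaped like the ones these cells write (no label, no explicit `:value` suffixes). The function `parse_vw_line` and the sample line are illustrative and not part of the notebook.

```python
def parse_vw_line(line):
    """Split an unlabeled VW line into {namespace: [feature, ...]}."""
    parsed = {}
    # Everything before the first '|' is the (here empty) label/tag part.
    for chunk in line.strip().split('|')[1:]:
        tokens = chunk.split()
        if not tokens:
            continue  # skip an empty namespace chunk
        parsed[tokens[0]] = tokens[1:]
    return parsed

# Made-up line in the same layout as the files written above.
sample = "  |t 3 12 |b 7 |s 40 41 42\n"
features = parse_vw_line(sample)
print(sorted(features.items()))
```

Applied to the first few lines of a written file, this makes it easy to confirm that all six namespaces (`t`, `b`, `s`, `a`, `d`, `c`) are present and that each holds the expected number of features.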

In [ ]: