In [1]:
%load_ext vimception



In [2]:
%load_ext autoreload
%autoreload 2

Long input strings

For certain tasks it might make more sense to tokenize input strings first and then extract features on these string lists rather than on the original character lists.

To demonstrate this I'll take some example strings from highered and learn models using these two feature extraction techniques.

Training examples


In [5]:
X = [(u'caring hands a step ahead', u'el valor little tykes ii'),
  (u'dulles', u"chicago public schools o'keeffe, isabell c."),
  (u'erie neighborhood house fcch-carmen l. vega site',
   u'erie neighborhood house fcch-servia galva site'),
  (u'chicago public schools dvorak math & science tech academy, anton',
   u'chicago public schools perez, manuel'),
  (u'v & j day care center', u"henry booth house granny's day care center"),
  (u'home of life community dev. corp. - home of life just for you',
   u'urban family and community centers'),
  (u'carole robertson center for learning fcch-ileana gonzalez',
   u'carole robertson center for learning fcch-rhonda culverson'),
  (u'bethel new life bethel child development',
   u'mary crane league mary crane center (lake & pulaski)'),
  (u'easter seals society of metropolitan chicago - stepping stones early/childhood lear',
   u"marcy newberry association kenyatta's day care"),
  (u'westside holistic family services westside holistic family services',
   u'childserv lawndale'),
  
  (u'higgins', u'higgins'),
  (u'ymca south side', u'ymca of metropolitan chicago - south side ymca'),
  (u'chicago commons association paulo freire',
   u'chicago commons association paulo freire'),
  (u'fresh start daycare, inc.',
   u'easter seals society of metropolitan chicago fresh start day care center'),
  (u'el valor teddy bear 3', u'teddy bear 3'),
  (u'chicago child care society chicago child care society',
   u'chicago child care society-child and family dev center'),
  (u'hull house - uptown', u'uptown family care center')]
Y = [u'distinct',
  u'distinct',
  u'distinct',
  u'distinct',
  u'distinct',
  u'distinct',
  u'distinct',
  u'distinct',
  u'distinct',
  u'distinct',
  u'match',
  u'match',
  u'match',
  u'match',
  u'match',
  u'match',
  u'match']

In [6]:
from pyhacrf import StringPairFeatureExtractor, Hacrf
from scipy.optimize import fmin_l_bfgs_b
import numpy as np

Character level features


In [7]:
# Extract features
feature_extractor = StringPairFeatureExtractor(match=True, numeric=True)
X_extracted = feature_extractor.fit_transform(X)

In [9]:
%%timeit -n1 -r1
# Train model
model = Hacrf(l2_regularization=1.0, optimizer=fmin_l_bfgs_b, optimizer_kwargs={'maxfun': 10})
model.fit(X_extracted, Y, verbosity=1)


Iteration  Log-likelihood |gradient|
         0     -11.78      650.6
         1     -609.0  1.571e+03
         2     -54.72  1.567e+03
         3     -11.31      560.6
         4     -10.83      142.5
         5     -10.78      118.5
         6      -10.7      143.8
         7     -10.43      249.6
         8     -10.13      328.6
         9     -9.796      250.5
        10     -9.573      102.2
1 loops, best of 1: 8.73 s per loop

In [10]:
%%timeit -n1 -r1
# Evaluate
from sklearn.metrics import confusion_matrix
predictions = model.predict(X_extracted)
print(confusion_matrix(Y, predictions))
print(model.predict_proba(X_extracted))


[[8 2]
 [4 3]]
[[ 0.64197473  0.35802527]
 [ 0.351784    0.648216  ]
 [ 0.6553065   0.3446935 ]
 [ 0.87671132  0.12328868]
 [ 0.47772325  0.52227675]
 [ 0.878586    0.121414  ]
 [ 0.70987436  0.29012564]
 [ 0.64765774  0.35234226]
 [ 0.93360185  0.06639815]
 [ 0.92714317  0.07285683]
 [ 0.48782793  0.51217207]
 [ 0.40930797  0.59069203]
 [ 0.59444836  0.40555164]
 [ 0.39622435  0.60377565]
 [ 0.63782341  0.36217659]
 [ 0.69982284  0.30017716]
 [ 0.5777424   0.4222576 ]]
1 loops, best of 1: 1.67 s per loop

Token level features


In [14]:
from pyhacrf import PairFeatureExtractor

In [15]:
tokX = [[sentence.split(' ') for sentence in pair] for pair in X]

In [16]:
real = [
    lambda i, j, s1, s2: 1.0,
    lambda i, j, s1, s2: 1.0 if s1[i] == s2[j] else 0.0,
    lambda i, j, s1, s2: 1.0 if s1[i] == s2[j] and len(s1[i]) >= 6 else 0.0,
    lambda i, j, s1, s2: 1.0 if s1[i].isdigit() and s2[j].isdigit() and s1[i] == s2[j] else 0.0,
    lambda i, j, s1, s2: 1.0 if s1[i].isalpha() and s2[j].isalpha() and s1[i] == s2[j] else 0.0,
    lambda i, j, s1, s2: 1.0 if not s1[i].isalpha() and not s2[j].isalpha() else 0.0
]
# Other ideas are:
#  to look up whether words are dictionary words,
#  longest common subsequence,
#  standard edit distance
feature_extractor = PairFeatureExtractor(real=real)
X_extracted = feature_extractor.fit_transform(tokX)

In [17]:
#%%timeit -n1 -r1
# Train model
model = Hacrf(l2_regularization=1.0, optimizer=fmin_l_bfgs_b, optimizer_kwargs={'maxfun': 400})
model.fit(X_extracted, Y, verbosity=10)


Iteration  Log-likelihood |gradient|
         0     -11.78      113.8
        10     -8.721      16.12
        20     -8.366      1.147
        30     -8.362    0.06527
        40     -8.362   0.005777
Out[17]:
<pyhacrf.pyhacrf.Hacrf at 0x1068fb750>

In [18]:
%%timeit -n1 -r1
# Evaluate
from sklearn.metrics import confusion_matrix
predictions = model.predict(X_extracted)
print(confusion_matrix(Y, predictions))
print(model.predict_proba(X_extracted))


[[9 1]
 [2 5]]
[[ 0.72215688  0.27784312]
 [ 0.41200325  0.58799675]
 [ 0.56910178  0.43089822]
 [ 0.92672238  0.07327762]
 [ 0.56921501  0.43078499]
 [ 0.98737206  0.01262794]
 [ 0.56762697  0.43237303]
 [ 0.70141322  0.29858678]
 [ 0.97308327  0.02691673]
 [ 0.94721007  0.05278993]
 [ 0.32690805  0.67309195]
 [ 0.20741219  0.79258781]
 [ 0.30060707  0.69939293]
 [ 0.47280063  0.52719937]
 [ 0.4531238   0.5468762 ]
 [ 0.59051241  0.40948759]
 [ 0.66717449  0.33282551]]
1 loops, best of 1: 30.8 ms per loop

Edit distance and word frequency features

Let's also add the the Levenschtein distance as a features.

When we peek at the training examples, it looks as if less common words should be more informative of a match - let's add a feature for the word frequency as well.


In [19]:
import editdistance

In [20]:
editdistance.eval('cheese', 'kaas')


Out[20]:
5L

In [ ]:
tokX = [[sentence.split(' ') for sentence in pair] for pair in X]

In [48]:
real = [
    lambda i, j, s1, s2: 1.0,
    lambda i, j, s1, s2: 1.0 if s1[i] == s2[j] else 0.0,
    lambda i, j, s1, s2: 1.0 if s1[i].isdigit() and s2[j].isdigit() and s1[i] == s2[j] else 0.0,
    lambda i, j, s1, s2: 1.0 if not s1[i].isalpha() and not s2[j].isalpha() else 0.0,
    lambda i, j, s1, s2: editdistance.eval(s1[i], s2[j]),
    lambda i, j, s1, s2: np.log(editdistance.eval(s1[i], s2[j]) + 1),
    lambda i, j, s1, s2: (editdistance.eval(s1[i], s2[j])) / max(len(s1[i]), len(s2[j])),
    lambda i, j, s1, s2: 1.0 - (editdistance.eval(s1[i], s2[j])) / max(len(s1[i]), len(s2[j]))
]
# Other ideas are:
#  to look up whether words are dictionary words,
#  longest common subsequence,
#  standard edit distance

In [46]:
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.cross_validation import train_test_split

In [51]:
# Train model
errors_val = []
errors_train = []
for i, featureset in enumerate([[0, 1],
                                [0, 1, 2],
                                [0, 1, 2, 3],
                                [0, 4], 
                                [0, 1, 4], 
                                [0, 1, 2, 3, 4],
                                [0, 5],
                                [0, 1, 5],
                                [0, 1, 2, 3, 5],
                                [0, 6],
                                [0, 1, 6],
                                [0, 1, 2, 3, 6],
                                [0, 7],
                                [0, 1, 7],
                                [0, 1, 2, 3, 7]]):
    print '{:4}{:18}'.format(i, featureset),
    errs_val = []
    errs_train = []
    for repeat in xrange(15):
        x_train, x_val, y_train, y_val = train_test_split(tokX, Y, test_size=0.2)
        feature_extractor = PairFeatureExtractor(real=[real[f] for f in featureset])
        X_extracted = feature_extractor.fit_transform(x_train)

        model = Hacrf(l2_regularization=1.0, optimizer=fmin_l_bfgs_b, optimizer_kwargs={'maxfun': 400})
        model.fit(X_extracted, y_train)
        
        predictions = model.predict(X_extracted)
        err_train = 1.0 - accuracy_score(y_train, predictions)
        
        X_extracted = feature_extractor.transform(x_val)
        predictions = model.predict(X_extracted)
        err_val = 1.0 - accuracy_score(y_val, predictions)
        if repeat % 10 == 0:
            print '{:.2f}'.format(err_train),
            print '{:.2f}'.format(err_val),
        errs_val.append(err_val)
        errs_train.append(err_train)
    print '  => {:.2f} +- {:.2f} | {:.2f} +- {:.2f}'.format(np.average(errs_train), 
                                                            np.std(errs_train),
                                                            np.average(errs_val), 
                                                            np.std(errs_val))
    errors_train.append(errs_train)
    errors_val.append(errs_val)


   0[0, 1]             0.46 0.25 0.31 0.00   => 0.28 +- 0.11 | 0.43 +- 0.21
   1[0, 1, 2]          0.23 0.25 0.31 0.75   => 0.24 +- 0.09 | 0.50 +- 0.24
   2[0, 1, 2, 3]       0.23 0.50 0.15 0.75   => 0.21 +- 0.05 | 0.57 +- 0.19
   3[0, 4]             0.08 0.25 0.08 0.75   => 0.12 +- 0.04 | 0.40 +- 0.22
   4[0, 1, 4]          0.08 0.25 0.23 0.25   => 0.13 +- 0.07 | 0.42 +- 0.20
   5[0, 1, 2, 3, 4]    0.15 0.25 0.08 0.50   => 0.09 +- 0.07 | 0.43 +- 0.17
   6[0, 5]             0.15 0.50 0.23 0.00   => 0.17 +- 0.07 | 0.40 +- 0.18
   7[0, 1, 5]          0.23 0.25 0.15 0.50   => 0.17 +- 0.09 | 0.40 +- 0.29
   8[0, 1, 2, 3, 5]    0.23 0.25 0.15 0.50   => 0.16 +- 0.05 | 0.52 +- 0.17
   9[0, 6]             0.31 0.50 0.31 0.75   => 0.24 +- 0.05 | 0.42 +- 0.24
  10[0, 1, 6]          0.15 0.75 0.23 0.75   => 0.22 +- 0.09 | 0.52 +- 0.27
  11[0, 1, 2, 3, 6]    0.08 0.50 0.00 0.50   => 0.14 +- 0.08 | 0.53 +- 0.20
  12[0, 7]             0.23 0.75 0.23 0.50   => 0.24 +- 0.07 | 0.52 +- 0.23
  13[0, 1, 7]          0.23 0.75 0.23 0.50   => 0.24 +- 0.09 | 0.52 +- 0.23
  14[0, 1, 2, 3, 7]    0.23 0.50 0.15 0.75   => 0.21 +- 0.03 | 0.38 +- 0.22

Conclusion

It seems that tokenising the text not only speeds up training and scoring by 40x, it also improves the predictions. We definitely need more data to do this properly though.


In [11]:
from time import sleep

In [21]:
from IPython import parallel
c = parallel.Client()
view = c.load_balanced_view()


/Users/dirkocoetsee/anaconda/lib/python2.7/site-packages/IPython/parallel/client/client.py:446: RuntimeWarning: 
            Controller appears to be listening on localhost, but not on this machine.
            If this is true, you should specify Client(...,sshserver='you@192.168.43.8')
            or instruct your controller to listen on an external IP.
  RuntimeWarning)

In [23]:
def k():
    sleep(8)
    print 'kaas'

In [37]:
%%px --noblock
from time import sleep
sleep(15)
print 'kaas'
a=4


Out[37]:
<AsyncResult: execute>

In [38]:
1+1


Out[38]:
2

In [39]:
%pxresult


[stdout:0] kaas

In [ ]: