In this notebook we train a basic CRF model for Named Entity Recognition on CoNLL2002 data (following https://github.com/TeamHG-Memex/sklearn-crfsuite/blob/master/docs/CoNLL2002.ipynb) and check its weights to see what it learned.
To follow this tutorial you need the NLTK >= 3.x and sklearn-crfsuite Python packages (the CoNLL 2002 corpus can be fetched with nltk.download('conll2002')). The tutorial uses Python 3.
In [1]:
import nltk
import sklearn_crfsuite
import eli5
The CoNLL 2002 dataset contains a list of Spanish sentences with annotated Named Entities, using IOB2 encoding. CoNLL 2002 data also provides POS tags.
In [2]:
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))
train_sents[0]
Out[2]:
[('Melbourne', 'NP', 'B-LOC'),
('(', 'Fpa', 'O'),
('Australia', 'NP', 'B-LOC'),
(')', 'Fpt', 'O'),
(',', 'Fc', 'O'),
('25', 'Z', 'O'),
('may', 'NC', 'O'),
('(', 'Fpa', 'O'),
('EFE', 'NC', 'B-ORG'),
(')', 'Fpt', 'O'),
('.', 'Fp', 'O')]
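In IOB2 encoding, B-TYPE marks the first token of an entity, I-TYPE marks a continuation token, and O marks tokens outside any entity. As a quick illustration (not part of the original tutorial), here is a minimal decoder that recovers entity spans from a tagged sentence like the one above:

```python
def iob2_spans(tagged):
    """Recover (entity_type, start, end) spans from IOB2 tags.

    `tagged` is a list of (token, pos, iob_tag) triples as in the
    CoNLL 2002 data; `end` is exclusive. Stray I- tags without a
    preceding B- are ignored (a simplification).
    """
    spans, start, etype = [], None, None
    for i, (_, _, tag) in enumerate(tagged):
        # Close the current span on O, on a new B-, or on a type change.
        if tag == 'O' or tag.startswith('B-') or (
                tag.startswith('I-') and tag[2:] != etype):
            if start is not None:
                spans.append((etype, start, i))
                start, etype = None, None
        if tag.startswith('B-'):
            start, etype = i, tag[2:]
    if start is not None:
        spans.append((etype, start, len(tagged)))
    return spans

sent = [('Melbourne', 'NP', 'B-LOC'), ('(', 'Fpa', 'O'),
        ('Australia', 'NP', 'B-LOC'), (')', 'Fpt', 'O')]
print(iob2_spans(sent))  # [('LOC', 0, 1), ('LOC', 2, 3)]
```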
POS tags can be seen as pre-extracted features. Let's extract more features (word parts, simplified POS tags, lower/title/upper flags, features of nearby words) and convert them to sklearn-crfsuite format: each sentence should be converted to a list of dicts. This is a very simple baseline; you certainly can do better.
In [3]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]
This is what the features extracted from a single token look like:
In [4]:
X_train[0][1]
Out[4]:
{'+1:postag': 'NP',
'+1:postag[:2]': 'NP',
'+1:word.istitle()': True,
'+1:word.isupper()': False,
'+1:word.lower()': 'australia',
'-1:postag': 'NP',
'-1:postag[:2]': 'NP',
'-1:word.istitle()': True,
'-1:word.isupper()': False,
'-1:word.lower()': 'melbourne',
'bias': 1.0,
'postag': 'Fpa',
'postag[:2]': 'Fp',
'word.isdigit()': False,
'word.istitle()': False,
'word.isupper()': False,
'word.lower()': '(',
'word[-3:]': '('}
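Internally, each feature dict is flattened into string feature names: string values are folded into the name (producing names like word.lower():australia), while boolean and numeric values keep the key as the name and contribute a numeric weight. This matches the feature names you will see in the weight tables below. A rough sketch (a simplification of what python-crfsuite does internally, not the exact code):

```python
def flatten_features(feats):
    """Approximate how a feature dict becomes (name, value) pairs.

    Strings become 'key:value' names with value 1.0; booleans become
    the key with value 1.0/0.0; numbers keep their value. This is an
    illustrative simplification, not python-crfsuite's actual code.
    """
    out = {}
    for key, value in feats.items():
        if isinstance(value, str):
            out['%s:%s' % (key, value)] = 1.0
        elif isinstance(value, bool):
            out[key] = 1.0 if value else 0.0
        else:
            out[key] = float(value)
    return out

print(flatten_features({'bias': 1.0, 'word.lower()': '(',
                        'word.istitle()': False}))
# {'bias': 1.0, 'word.lower():(': 1.0, 'word.istitle()': 0.0}
```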
In [5]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=20,
    all_possible_transitions=False,
)
crf.fit(X_train, y_train);
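The notebook goes straight to inspecting weights; if you want a quick quality check first, sklearn-crfsuite provides sequence metrics such as sklearn_crfsuite.metrics.flat_f1_score(y_test, crf.predict(X_test), average='weighted', labels=...). As a self-contained illustration of the "flat" idea (toy label sequences here, not the trained model):

```python
def flat_accuracy(y_true, y_pred):
    """Token-level accuracy over lists of label sequences.

    A toy stand-in for sklearn_crfsuite.metrics: the per-sentence
    label sequences are flattened and compared position by position.
    """
    flat_true = [tag for sent in y_true for tag in sent]
    flat_pred = [tag for sent in y_pred for tag in sent]
    correct = sum(t == p for t, p in zip(flat_true, flat_pred))
    return correct / len(flat_true)

gold = [['B-LOC', 'O'], ['B-ORG', 'I-ORG', 'O']]
pred = [['B-LOC', 'O'], ['B-ORG', 'O', 'O']]
print(flat_accuracy(gold, pred))  # 0.8
```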
In [6]:
eli5.show_weights(crf, top=30)
Out[6]:
From \ To    O       B-LOC   I-LOC   B-MISC  I-MISC  B-ORG   I-ORG   B-PER   I-PER
O            3.281   2.204   0.0     2.101   0.0     3.468   0.0     2.325   0.0
B-LOC        -0.259  -0.098  4.058   0.0     0.0     0.0     0.0     -0.212  0.0
I-LOC        -0.173  -0.609  3.436   0.0     0.0     0.0     0.0     0.0     0.0
B-MISC       -0.673  -0.341  0.0     0.0     4.069   -0.308  0.0     -0.331  0.0
I-MISC       -0.803  -0.998  0.0     -0.519  4.977   -0.817  0.0     -0.611  0.0
B-ORG        -0.096  -0.242  0.0     -0.57   0.0     -1.012  4.739   -0.306  0.0
I-ORG        -0.339  -1.758  0.0     -0.841  0.0     -1.382  5.062   -0.472  0.0
B-PER        -0.4    -0.851  0.0     0.0     0.0     -1.013  0.0     -0.937  4.329
I-PER        -0.676  -0.47   0.0     0.0     0.0     0.0     0.0     -0.659  3.754
y=O top features
Weight  Feature
+4.416  postag[:2]:Fp
+3.116  BOS
+2.401  bias
+2.297  postag[:2]:Fc
+2.297  word.lower():,
+2.297  postag:Fc
+2.297  word[-3:]:,
+2.124  postag[:2]:CC
+2.124  postag:CC
+1.984  EOS
+1.859  word.lower():y
+1.684  postag:RG
+1.684  postag[:2]:RG
+1.610  word.lower():-
+1.610  postag[:2]:Fg
+1.610  word[-3:]:-
+1.610  postag:Fg
+1.582  postag:Fp
+1.582  word[-3:]:.
+1.582  word.lower():.
+1.372  word[-3:]:y
+1.187  postag:CS
+1.187  postag[:2]:CS
+1.150  word[-3:]:(
+1.150  postag:Fpa
+1.150  word.lower():(
… 16444 more positive …
… 3771 more negative …
-2.106  postag:NP
-2.106  postag[:2]:NP
-3.723  word.isupper()
-6.166  word.istitle()

y=B-LOC top features
Weight  Feature
+2.530  word.istitle()
+2.224  -1:word.lower():en
+0.906  word[-3:]:rid
+0.905  word.lower():madrid
+0.646  word.lower():españa
+0.640  word[-3:]:ona
+0.595  word[-3:]:aña
+0.595  +1:postag[:2]:Fp
+0.515  word.lower():parís
+0.514  word[-3:]:rís
+0.424  word.lower():barcelona
+0.420  -1:postag:Fg
+0.420  -1:word.lower():-
+0.420  -1:postag[:2]:Fg
+0.413  -1:word.isupper()
+0.390  -1:postag[:2]:Fp
+0.389  -1:postag:Fpa
+0.389  -1:word.lower():(
+0.388  word.lower():san
+0.385  postag:NC
… 2282 more positive …
… 413 more negative …
-0.389  -1:word.lower():"
-0.389  -1:postag:Fe
-0.389  -1:postag[:2]:Fe
-0.406  -1:postag[:2]:VM
-0.646  word[-3:]:ión
-0.759  -1:word.lower():del
-0.818  bias
-0.986  postag:SP
-0.986  postag[:2]:SP
-1.354  -1:word.istitle()

y=I-LOC top features
Weight  Feature
+0.886  -1:word.istitle()
+0.664  -1:word.lower():de
+0.582  word[-3:]:de
+0.578  word.lower():de
+0.529  -1:word.lower():san
+0.444  +1:word.istitle()
+0.441  word.istitle()
+0.335  -1:word.lower():la
+0.262  postag:SP
+0.262  postag[:2]:SP
+0.235  word[-3:]:la
+0.228  word[-3:]:iro
+0.226  word[-3:]:oja
+0.218  word[-3:]:del
+0.215  word.lower():del
+0.213  -1:postag:NC
+0.213  -1:postag[:2]:NC
+0.205  -1:word.lower():nueva
… 1665 more positive …
… 258 more negative …
-0.206  -1:postag[:2]:Z
-0.206  -1:postag:Z
-0.213  -1:postag[:2]:CC
-0.213  -1:postag:CC
-0.219  -1:word.lower():en
-0.222  +1:word.isupper()
-0.235  +1:postag:VMI
-0.342  word.isupper()
-0.366  +1:postag[:2]:AQ
-0.366  +1:postag:AQ
-0.392  +1:postag[:2]:VM
-1.690  BOS

y=B-MISC top features
Weight  Feature
+1.770  word.isupper()
+0.693  word.istitle()
+0.606  word.lower():"
+0.606  word[-3:]:"
+0.606  postag:Fe
+0.606  postag[:2]:Fe
+0.538  +1:word.istitle()
+0.508  -1:word.lower():"
+0.508  -1:postag:Fe
+0.508  -1:postag[:2]:Fe
+0.484  -1:postag[:2]:DA
+0.484  -1:postag:DA
+0.479  +1:word.isupper()
+0.457  postag[:2]:NC
+0.457  postag:NC
+0.400  word.lower():liga
+0.399  word[-3:]:iga
+0.367  -1:word.lower():la
+0.354  postag:Z
+0.354  postag[:2]:Z
+0.332  -1:word.lower():del
+0.286  +1:postag[:2]:Z
+0.286  +1:postag:Z
+0.284  +1:postag:NC
+0.284  +1:postag[:2]:NC
… 2284 more positive …
… 314 more negative …
-0.308  BOS
-0.377  -1:postag[:2]:VM
-0.908  postag[:2]:SP
-0.908  postag:SP
-1.094  -1:word.istitle()

y=I-MISC top features
Weight  Feature
+1.364  -1:word.istitle()
+0.675  -1:word.lower():de
+0.597  +1:postag:Fe
+0.597  +1:word.lower():"
+0.597  +1:postag[:2]:Fe
+0.369  -1:postag:NC
+0.369  -1:postag[:2]:NC
+0.324  -1:word.lower():liga
+0.318  word[-3:]:de
+0.304  word.lower():de
+0.303  word.isdigit()
+0.261  -1:postag[:2]:SP
+0.261  -1:postag:SP
+0.258  -1:word.lower():copa
+0.240  word.lower():campeones
+0.235  word[-3:]:000
+0.234  +1:postag:Z
+0.234  +1:postag[:2]:Z
+0.229  word.lower():2000
… 3675 more positive …
… 573 more negative …
-0.235  EOS
-0.264  -1:word.lower():y
-0.265  word.lower():y
-0.265  +1:postag:VMI
-0.274  postag[:2]:VM
-0.306  -1:postag:CC
-0.306  -1:postag[:2]:CC
-0.320  postag:CC
-0.320  postag[:2]:CC
-0.370  +1:postag[:2]:VM
-0.641  bias

y=B-ORG top features
Weight  Feature
+2.695  word.lower():efe
+2.519  word.isupper()
+2.084  word[-3:]:EFE
+1.174  word.lower():gobierno
+1.142  word.istitle()
+1.018  -1:word.lower():del
+0.958  word[-3:]:rno
+0.671  word[-3:]:PP
+0.671  word.lower():pp
+0.667  -1:word.lower():al
+0.555  -1:word.lower():el
+0.499  word[-3:]:eal
+0.413  word.lower():real
+0.393  word.lower():ayuntamiento
+0.391  postag:AQ
+0.391  postag[:2]:AQ
… 3518 more positive …
… 619 more negative …
-0.430  -1:postag[:2]:AQ
-0.430  -1:postag:AQ
-0.450  +1:word.lower():de
-0.455  postag[:2]:Z
-0.455  postag:Z
-0.500  -1:word.istitle()
-0.642  -1:word.lower():los
-0.664  -1:word.lower():de
-0.707  -1:word.isupper()
-0.746  -1:word.lower():en
-0.747  -1:postag[:2]:VM
-1.100  bias
-1.289  postag[:2]:SP
-1.289  postag:SP

y=I-ORG top features
Weight  Feature
+1.499  -1:word.istitle()
+1.200  -1:word.lower():de
+0.539  -1:word.lower():real
+0.511  word[-3:]:rid
+0.446  word[-3:]:de
+0.433  word.lower():de
+0.428  -1:postag:SP
+0.428  -1:postag[:2]:SP
+0.399  word.lower():madrid
+0.368  word[-3:]:la
+0.365  -1:word.lower():consejo
+0.363  word.istitle()
+0.352  -1:word.lower():comisión
+0.336  postag[:2]:AQ
+0.336  postag:AQ
+0.332  +1:postag:Fpa
+0.332  +1:word.lower():(
+0.311  -1:word.lower():estados
+0.306  word.lower():unidos
… 3473 more positive …
… 703 more negative …
-0.304  postag[:2]:NP
-0.304  postag:NP
-0.306  -1:word.lower():a
-0.384  +1:postag[:2]:NC
-0.384  +1:postag:NC
-0.391  -1:word.isupper()
-0.507  +1:postag:AQ
-0.507  +1:postag[:2]:AQ
-0.535  postag[:2]:VM
-0.540  postag:VMI
-1.195  bias

y=B-PER top features
Weight  Feature
+1.698  word.istitle()
+0.683  -1:postag:VMI
+0.601  +1:postag[:2]:VM
+0.589  postag:NP
+0.589  postag[:2]:NP
+0.589  +1:postag:VMI
+0.565  -1:word.lower():a
+0.520  word[-3:]:osé
+0.503  word.lower():josé
+0.476  -1:postag[:2]:VM
+0.472  postag:NC
+0.472  postag[:2]:NC
+0.452  -1:postag[:2]:Fc
+0.452  -1:word.lower():,
+0.452  -1:postag:Fc
… 4117 more positive …
… 351 more negative …
-0.472  -1:word.lower():en
-0.475  -1:postag[:2]:Fe
-0.475  -1:word.lower():"
-0.475  -1:postag:Fe
-0.543  word.lower():la
-0.572  -1:word.lower():de
-0.693  -1:word.istitle()
-0.712  postag[:2]:SP
-0.712  postag:SP
-0.778  -1:word.lower():del
-0.818  -1:postag[:2]:DA
-0.818  -1:postag:DA
-0.923  -1:word.lower():la
-1.319  postag:DA
-1.319  postag[:2]:DA

y=I-PER top features
Weight  Feature
+2.742  -1:word.istitle()
+0.736  word.istitle()
+0.660  -1:word.lower():josé
+0.598  -1:postag[:2]:AQ
+0.598  -1:postag:AQ
+0.510  -1:postag[:2]:VM
+0.487  -1:word.lower():juan
+0.419  -1:word.lower():maría
+0.413  -1:postag:VMI
+0.345  -1:word.lower():luis
+0.319  -1:word.lower():manuel
+0.315  postag[:2]:NC
+0.315  postag:NC
+0.309  -1:word.lower():carlos
… 3903 more positive …
… 365 more negative …
-0.301  postag[:2]:NP
-0.301  postag:NP
-0.301  word[-3:]:ión
-0.305  postag[:2]:Fe
-0.305  word.lower():"
-0.305  postag:Fe
-0.305  word[-3:]:"
-0.305  +1:word.lower():que
-0.324  -1:word.lower():el
-0.377  +1:postag[:2]:Z
-0.377  +1:postag:Z
-0.396  postag:VMI
-0.433  +1:postag:SP
-0.433  +1:postag[:2]:SP
-0.485  postag[:2]:VM
-1.431  bias
Transition features make sense: at least the model learned that I-ENTITY must follow B-ENTITY. It also learned that some transitions are unlikely; e.g. it is not common in this dataset to have a location right after an organization name (I-ORG -> B-LOC has a large negative weight).
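To see why transition weights matter, recall that a linear-chain CRF scores a candidate tag sequence as the sum of per-position state-feature weights plus a transition weight for each pair of adjacent tags, and decoding picks the highest-scoring sequence. A toy sketch with made-up weights (not the trained model's actual values):

```python
def sequence_score(tags, state_scores, transitions):
    """Score of a tag sequence in a linear-chain CRF:
    per-token state scores plus tag-to-tag transition weights."""
    score = sum(state_scores[i][tag] for i, tag in enumerate(tags))
    score += sum(transitions[(a, b)] for a, b in zip(tags, tags[1:]))
    return score

# Hypothetical weights: B-ORG -> I-ORG is rewarded, O -> I-ORG punished.
transitions = {('B-ORG', 'I-ORG'): 4.7, ('O', 'I-ORG'): -6.2,
               ('B-ORG', 'O'): 0.1, ('O', 'O'): 3.3}
state_scores = [{'B-ORG': 2.0, 'O': 0.5}, {'I-ORG': 1.0, 'O': 0.2}]
print(sequence_score(['B-ORG', 'I-ORG'], state_scores, transitions))
```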
Our features don't use gazetteers, so the model had to memorize some geographic names from the training data, e.g. that España is a location.
If we regularize the CRF more, we can expect that only generic features will remain, while memorized tokens will go away. With L1 regularization (the c1 parameter) coefficients of most features should be driven to zero. Let's check what effect regularization has on the CRF weights:
In [7]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=200,
    c2=0.1,
    max_iterations=20,
    all_possible_transitions=False,
)
crf.fit(X_train, y_train)
eli5.show_weights(crf, top=30)
Out[7]:
From \ To    O       B-LOC   I-LOC   B-MISC  I-MISC  B-ORG   I-ORG   B-PER   I-PER
O            3.232   1.76    0.0     2.026   0.0     2.603   0.0     1.593   0.0
B-LOC        0.035   0.0     2.773   0.0     0.0     0.0     0.0     0.0     0.0
I-LOC        -0.02   0.0     3.099   0.0     0.0     0.0     0.0     0.0     0.0
B-MISC       -0.382  0.0     0.0     0.0     4.758   0.0     0.0     0.0     0.0
I-MISC       -0.256  0.0     0.0     0.0     4.155   0.0     0.0     0.0     0.0
B-ORG        0.161   0.0     0.0     0.0     0.0     0.0     3.344   0.0     0.0
I-ORG        -0.126  -0.081  0.0     0.0     0.0     0.0     4.048   0.0     0.0
B-PER        0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     3.449
I-PER        -0.085  0.0     0.0     0.0     0.0     0.0     0.0     0.0     2.254
y=O top features
Weight  Feature
+3.363  BOS
+2.842  bias
+2.478  postag[:2]:Fp
+0.665  -1:word.isupper()
+0.439  +1:postag[:2]:AQ
+0.439  +1:postag:AQ
+0.400  postag[:2]:Fc
+0.400  word.lower():,
+0.400  word[-3:]:,
+0.400  postag:Fc
+0.391  postag:CC
+0.391  postag[:2]:CC
+0.365  EOS
+0.363  +1:postag:NC
+0.363  +1:postag[:2]:NC
+0.315  postag:SP
+0.315  postag[:2]:SP
+0.302  +1:word.isupper()
… 15 more positive …
… 14 more negative …
-0.216  postag:AQ
-0.216  postag[:2]:AQ
-0.334  -1:postag:SP
-0.334  -1:postag[:2]:SP
-0.417  postag[:2]:NP
-0.417  postag:NP
-0.547  postag[:2]:NC
-0.547  postag:NC
-0.547  word.lower():de
-0.600  word[-3:]:de
-3.552  word.isupper()
-5.446  word.istitle()

y=B-LOC top features
Weight  Feature
+1.417  -1:word.lower():en
+1.183  word.istitle()
+0.498  +1:postag[:2]:Fp
+0.150  +1:word.lower():,
+0.150  +1:postag:Fc
+0.150  +1:postag[:2]:Fc
+0.098  -1:postag[:2]:Fp
+0.081  -1:postag:Fpa
+0.081  -1:word.lower():(
+0.080  postag[:2]:NP
+0.080  postag:NP
+0.056  -1:postag:SP
+0.056  -1:postag[:2]:SP
+0.022  postag:NC
+0.022  postag[:2]:NC
+0.019  BOS
-0.008  +1:word.istitle()
-0.028  -1:word.lower():del
-0.572  -1:word.istitle()

y=I-LOC top features
Weight  Feature
+0.788  -1:word.istitle()
+0.248  word[-3:]:de
+0.237  word.lower():de
+0.199  -1:word.lower():de
+0.190  postag[:2]:SP
+0.190  postag:SP
+0.060  -1:postag:SP
+0.060  -1:postag[:2]:SP
+0.040  +1:word.istitle()

y=B-MISC top features
Weight  Feature
+0.349  word.isupper()
+0.053  -1:postag[:2]:DA
+0.053  -1:postag:DA
+0.030  word.istitle()
-0.009  -1:postag:SP
-0.009  -1:postag[:2]:SP
-0.060  bias
-0.172  -1:word.istitle()

y=I-MISC top features
Weight  Feature
+0.432  -1:word.istitle()
+0.158  -1:postag[:2]:NC
+0.158  -1:postag:NC
+0.146  +1:postag[:2]:Fe
+0.146  +1:word.lower():"
+0.146  +1:postag:Fe
+0.030  postag[:2]:SP
+0.030  postag:SP
-0.087  word.istitle()
-0.094  bias
-0.119  word.isupper()
-0.120  -1:word.isupper()
-0.121  +1:word.isupper()
-0.211  +1:word.istitle()

y=B-ORG top features
Weight  Feature
+1.681  word.isupper()
+0.507  -1:word.lower():del
+0.350  -1:postag:DA
+0.350  -1:postag[:2]:DA
+0.282  word.lower():efe
+0.234  word[-3:]:EFE
+0.195  -1:word.lower():(
+0.195  -1:postag:Fpa
+0.192  word.istitle()
+0.178  +1:postag:Fpt
+0.178  +1:word.lower():)
+0.173  -1:postag[:2]:Fp
+0.136  -1:word.lower():el
+0.110  postag[:2]:NC
+0.110  postag:NC
-0.004  +1:word.istitle()
-0.023  +1:postag[:2]:Fp
-0.041  +1:postag:NC
-0.041  +1:postag[:2]:NC
-0.210  -1:word.lower():de
-0.515  bias

y=I-ORG top features
Weight  Feature
+1.318  -1:word.istitle()
+0.762  -1:word.lower():de
+0.185  -1:postag:SP
+0.185  -1:postag[:2]:SP
+0.185  word[-3:]:de
+0.058  word.lower():de
-0.043  -1:word.isupper()
-0.267  +1:word.istitle()
-0.536  bias

y=B-PER top features
Weight  Feature
+0.800  word.istitle()
+0.463  -1:word.lower():,
+0.463  -1:postag[:2]:Fc
+0.463  -1:postag:Fc
+0.148  +1:postag:VMI
+0.125  +1:word.istitle()
+0.095  +1:postag[:2]:VM
+0.007  +1:postag:AQ
+0.007  +1:postag[:2]:AQ
-0.039  -1:word.istitle()
-0.058  postag:DA
-0.058  postag[:2]:DA
-0.063  bias
-0.067  -1:word.lower():de
-0.159  -1:postag:SP
-0.159  -1:postag[:2]:SP
-0.263  -1:postag:DA
-0.263  -1:postag[:2]:DA

y=I-PER top features
Weight  Feature
+2.127  -1:word.istitle()
+0.331  word.istitle()
+0.016  +1:postag[:2]:Fc
+0.016  +1:word.lower():,
+0.016  +1:postag:Fc
-0.089  +1:postag:SP
-0.089  +1:postag[:2]:SP
-0.648  bias
As you can see, memorized tokens are mostly gone and the model now relies on word shapes and POS tags. Only a few non-zero features remain. In our example the change probably made the quality worse, but that's a separate question.
Let's focus on transition weights. We would expect O -> I-ENTITY transitions to have large negative weights because they are impossible. But these transitions have zero weights, not negative weights, both in the heavily regularized model and in our initial model. Something is going on here.
The reason they are zero is that crfsuite hasn't seen these transitions in the training data and assumed there is no need to learn weights for them, to save some computation time. This is the default behavior, but it is possible to turn it off using the sklearn_crfsuite.CRF all_possible_transitions option. Let's check how it affects the result:
In [8]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=20,
    all_possible_transitions=True,
)
crf.fit(X_train, y_train);
In [9]:
eli5.show_weights(crf, top=5, show=['transition_features'])
Out[9]:
From \ To    O       B-LOC   I-LOC   B-MISC  I-MISC  B-ORG   I-ORG   B-PER   I-PER
O            2.732   1.217   -4.675  1.515   -5.785  1.36    -6.19   0.968   -6.236
B-LOC        -0.226  -0.091  3.378   -0.433  -1.065  -0.861  -1.783  -0.295  -1.57
I-LOC        -0.184  -0.585  2.404   -0.276  -0.485  -0.582  -0.749  -0.442  -0.647
B-MISC       -0.714  -0.353  -0.539  -0.278  3.512   -0.412  -1.047  -0.336  -0.895
I-MISC       -0.697  -0.846  -0.587  -0.297  4.252   -0.84   -1.206  -0.523  -1.001
B-ORG        0.419   -0.187  -1.074  -0.567  -1.607  -1.13   5.392   -0.223  -2.122
I-ORG        -0.117  -1.715  -0.863  -0.631  -1.221  -1.442  5.141   -0.397  -1.908
B-PER        -0.127  -0.806  -0.834  -0.52   -1.228  -1.089  -2.076  -1.01   4.04
I-PER        -0.766  -0.242  -0.67   -0.418  -0.856  -0.903  -1.472  -0.692  2.909
With all_possible_transitions=True the CRF learned large negative weights for impossible transitions like O -> I-ORG.
In [10]:
eli5.show_weights(crf, top=10, targets=['O', 'B-ORG', 'I-ORG'])
Out[10]:
From \ To    O       B-ORG   I-ORG
O            2.732   1.36    -6.19
B-ORG        0.419   -1.13   5.392
I-ORG        -0.117  -1.442  5.141
y=O top features
Weight  Feature
+4.931  BOS
+3.754  postag[:2]:Fp
+3.539  bias
+2.328  word[-3:]:,
+2.328  word.lower():,
+2.328  postag[:2]:Fc
+2.328  postag:Fc
… 15039 more positive …
… 3905 more negative …
-2.187  postag[:2]:NP
-3.685  word.isupper()
-7.025  word.istitle()

y=B-ORG top features
Weight  Feature
+3.041  word.isupper()
+2.952  word.lower():efe
+1.851  word[-3:]:EFE
+1.278  word.lower():gobierno
+1.033  word[-3:]:rno
+1.005  word.istitle()
+0.864  -1:word.lower():del
… 3524 more positive …
… 621 more negative …
-0.842  -1:word.lower():en
-1.416  postag[:2]:SP
-1.416  postag:SP

y=I-ORG top features
Weight  Feature
+1.159  -1:word.lower():de
+0.993  -1:word.istitle()
+0.637  -1:postag[:2]:SP
+0.637  -1:postag:SP
+0.570  -1:word.lower():real
+0.547  word.istitle()
… 3517 more positive …
… 676 more negative …
-0.480  postag:VMI
-0.508  postag[:2]:VM
-0.533  -1:word.isupper()
-1.290  bias
Another option is to check only some of the features: it helps to verify that a feature function works as intended. For example, let's check how word shape features are used by the model, using the feature_re argument, and hide the transition table:
In [11]:
eli5.show_weights(crf, top=10, feature_re='^word\.is',
                  horizontal_layout=False, show=['targets'])
Out[11]:
y=O top features
Weight  Feature
-3.685  word.isupper()
-7.025  word.istitle()

y=B-LOC top features
Weight  Feature
+2.397  word.istitle()
+0.099  word.isupper()
-0.152  word.isdigit()

y=I-LOC top features
Weight  Feature
+0.460  word.istitle()
-0.018  word.isdigit()
-0.345  word.isupper()

y=B-MISC top features
Weight  Feature
+2.017  word.isupper()
+0.603  word.istitle()
-0.012  word.isdigit()

y=I-MISC top features
Weight  Feature
+0.271  word.isdigit()
-0.072  word.isupper()
-0.106  word.istitle()

y=B-ORG top features
Weight  Feature
+3.041  word.isupper()
+1.005  word.istitle()
-0.044  word.isdigit()

y=I-ORG top features
Weight  Feature
+0.547  word.istitle()
+0.014  word.isdigit()
-0.012  word.isupper()

y=B-PER top features
Weight  Feature
+1.757  word.istitle()
+0.050  word.isupper()
-0.123  word.isdigit()

y=I-PER top features
Weight  Feature
+0.976  word.istitle()
+0.193  word.isupper()
-0.106  word.isdigit()
Looks fine - UPPERCASE and Titlecase words are likely to be entities of some kind.
It is also possible to format the result as text (this can be useful in a console):
In [12]:
expl = eli5.explain_weights(crf, top=5, targets=['O', 'B-LOC', 'I-LOC'])
print(eli5.format_as_text(expl))
Explained as: CRF
Transition features:
O B-LOC I-LOC
----- ------ ------- -------
O 2.732 1.217 -4.675
B-LOC -0.226 -0.091 3.378
I-LOC -0.184 -0.585 2.404
y='O' top features
Weight Feature
------ --------------
+4.931 BOS
+3.754 postag[:2]:Fp
+3.539 bias
… 15043 more positive …
… 3906 more negative …
-3.685 word.isupper()
-7.025 word.istitle()
y='B-LOC' top features
Weight Feature
------ ------------------
+2.397 word.istitle()
+2.147 -1:word.lower():en
… 2284 more positive …
… 433 more negative …
-1.080 postag[:2]:SP
-1.080 postag:SP
-1.273 -1:word.istitle()
y='I-LOC' top features
Weight Feature
------ ------------------
+0.882 -1:word.lower():de
+0.780 -1:word.istitle()
+0.718 word[-3:]:de
+0.711 word.lower():de
… 1684 more positive …
… 268 more negative …
-1.965 BOS
Content source: TeamHG-Memex/eli5