In [1]:
import os, sys
import pandas as pd

# Make the deepcut repository (which contains train.py) importable from this notebook
sys.path.append('../deepcut')

# train.py provides generate_best_dataset, train_model, and evaluate used below
from train import *


Using TensorFlow backend.

In [2]:
# The BEST corpus should be extracted into an 'input' folder in the same directory as this notebook

generate_best_dataset('input')


Save article to CSV file
Save encyclopedia to CSV file
Save news to CSV file
Save novel to CSV file
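
A quick way to sanity-check the generated data is to load the CSV files back with pandas. The snippet below assumes the files were written to 'cleaned_data', the same directory that train_model is pointed at in the next cell; adjust the path if your version of generate_best_dataset writes elsewhere.

import os
import pandas as pd

# List whatever CSV files generate_best_dataset produced (assumed output
# directory 'cleaned_data', matching the path passed to train_model below).
csv_files = sorted(f for f in os.listdir('cleaned_data') if f.endswith('.csv'))
print(csv_files)

# Peek at one file as a sanity check; the exact column layout depends on
# the deepcut version, so no fixed schema is assumed here.
sample = pd.read_csv(os.path.join('cleaned_data', csv_files[0]))
print(sample.head())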

In [3]:
model = train_model('cleaned_data')


Train on 18724989 samples, validate on 2044274 samples
Epoch 1/10
1238s - loss: 0.0612 - acc: 0.9774 - val_loss: 0.0445 - val_acc: 0.9844
Epoch 2/10
1244s - loss: 0.0433 - acc: 0.9846 - val_loss: 0.0399 - val_acc: 0.9861
Epoch 3/10
1241s - loss: 0.0387 - acc: 0.9863 - val_loss: 0.0379 - val_acc: 0.9870
Epoch 4/10
1223s - loss: 0.0362 - acc: 0.9873 - val_loss: 0.0363 - val_acc: 0.9876
Epoch 5/10
1227s - loss: 0.0345 - acc: 0.9879 - val_loss: 0.0349 - val_acc: 0.9879
Epoch 6/10
1241s - loss: 0.0331 - acc: 0.9884 - val_loss: 0.0345 - val_acc: 0.9882
Epoch 7/10
1238s - loss: 0.0321 - acc: 0.9888 - val_loss: 0.0343 - val_acc: 0.9884
Epoch 8/10
1232s - loss: 0.0313 - acc: 0.9891 - val_loss: 0.0330 - val_acc: 0.9887
Epoch 9/10
1216s - loss: 0.0307 - acc: 0.9893 - val_loss: 0.0331 - val_acc: 0.9887
Epoch 10/10
1209s - loss: 0.0301 - acc: 0.9895 - val_loss: 0.0332 - val_acc: 0.9887
Train on 18724989 samples, validate on 2044274 samples
Epoch 1/3
945s - loss: 0.0284 - acc: 0.9901 - val_loss: 0.0320 - val_acc: 0.9891
Epoch 2/3
944s - loss: 0.0278 - acc: 0.9903 - val_loss: 0.0321 - val_acc: 0.9891
Epoch 3/3
945s - loss: 0.0275 - acc: 0.9904 - val_loss: 0.0313 - val_acc: 0.9892
Train on 18724989 samples, validate on 2044274 samples
Epoch 1/3
780s - loss: 0.0257 - acc: 0.9910 - val_loss: 0.0313 - val_acc: 0.9896
Epoch 2/3
782s - loss: 0.0254 - acc: 0.9912 - val_loss: 0.0309 - val_acc: 0.9895
Epoch 3/3
782s - loss: 0.0251 - acc: 0.9912 - val_loss: 0.0311 - val_acc: 0.9896
Train on 18724989 samples, validate on 2044274 samples
Epoch 1/3
752s - loss: 0.0245 - acc: 0.9915 - val_loss: 0.0308 - val_acc: 0.9896
Epoch 2/3
750s - loss: 0.0244 - acc: 0.9915 - val_loss: 0.0308 - val_acc: 0.9896
Epoch 3/3
751s - loss: 0.0242 - acc: 0.9915 - val_loss: 0.0304 - val_acc: 0.9898
Train on 18724989 samples, validate on 2044274 samples
Epoch 1/3
751s - loss: 0.0237 - acc: 0.9917 - val_loss: 0.0305 - val_acc: 0.9897
Epoch 2/3
749s - loss: 0.0236 - acc: 0.9918 - val_loss: 0.0305 - val_acc: 0.9898
Epoch 3/3
753s - loss: 0.0235 - acc: 0.9918 - val_loss: 0.0307 - val_acc: 0.9898
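
The training run above takes several hours in total, so it is worth persisting the result before evaluating. model is a plain Keras model, so the standard Keras save calls apply; the file names below are arbitrary examples (saving to .h5 requires the h5py package).

# Persist the trained Keras model so the multi-hour training run does not
# have to be repeated. File names are arbitrary examples.
model.save_weights('best_cnn_weights.h5')   # weights only
model.save('best_cnn_model.h5')             # architecture + weights + optimizer state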

In [4]:
evaluate('cleaned_data', model)


Out[4]:
(0.98121855103256694, 0.97789486109173629, 0.98456491128245205)
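
The three numbers are the scores returned by deepcut's evaluate; they appear to be f1-score, precision, and recall over character-level word-boundary labels, but the exact order should be confirmed in train.py. As an illustration only, the same kind of metrics can be computed with scikit-learn from binary boundary labels:

import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Illustrative sketch: binary labels where 1 marks a character that starts a
# new word. The real computation lives inside deepcut's evaluate().
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])

print(f1_score(y_true, y_pred),
      precision_score(y_true, y_pred),
      recall_score(y_true, y_pred))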

The performance is lower when we strip the named-entity and abbreviation tags (<NE>...</NE> and <AB>...</AB>) out of the training and testing data.
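
For reference, one way to produce such a tag-free variant of the corpus is a simple regular-expression pass over the raw BEST text. This is a minimal sketch under that assumption, not necessarily the exact cleaning deepcut applies:

import re

def strip_best_tags(text):
    # Drop <NE>...</NE> and <AB>...</AB> markers but keep the enclosed text;
    # the word-boundary pipes ('|') of the BEST format are left untouched.
    return re.sub(r'</?(?:NE|AB)>', '', text)

print(strip_best_tags('<NE>กรุงเทพมหานคร</NE>|เป็น|เมือง|หลวง|ของ|<NE>ประเทศไทย</NE>'))
# กรุงเทพมหานคร|เป็น|เมือง|หลวง|ของ|ประเทศไทย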


In [ ]: