Feature: POS/NER Tag Similarity

Derive bag-of-POS-tag and bag-of-NER-tag count vectors from each question in a pair and compute vector distances between them.
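
For intuition, here is a toy sketch of the vectors being compared (made-up counts, not taken from the data; the actual tag whitelists are defined below):

# Toy illustration with made-up counts:
# pos_vector(q1) = [ADJ: 0, ADV: 0, NOUN: 1, PROPN: 2, NUM: 0, VERB: 1]
# pos_vector(q2) = [ADJ: 1, ADV: 0, NOUN: 2, PROPN: 0, NUM: 0, VERB: 1]
# ner_vector(q1) = [GPE: 1, LOC: 0, ORG: 0, ...]
# ner_vector(q2) = [GPE: 0, LOC: 0, ORG: 0, ...]
# The features are distances between these per-question vectors
# (cosine, euclidean, and absolute count difference).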

Imports

The pygoose utility package imports numpy, pandas, matplotlib, and a helper kg module into the root namespace.


In [1]:
from pygoose import *
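
Judging by how names are used in this notebook, the wildcard import is roughly equivalent to the following sketch (the exact exports of pygoose may differ):

# Approximate effect of `from pygoose import *`, inferred from usage below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pygoose import kg            # project discovery, pickle I/O, and parallel-job helpers
# plus a `progressbar` helper, assumed to be a tqdm-style progress bar wrapper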

In [2]:
import os
import warnings

In [3]:
from collections import Counter

In [4]:
from scipy.spatial.distance import cosine, euclidean, jaccard

In [5]:
import spacy

Config

Automatically discover the paths to various data folders and compose the project structure.


In [6]:
project = kg.Project.discover()
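
For reference, the parts of the discovered project object used later in this notebook are:

# project.data_dir               -> directory containing train.csv / test.csv
# project.preprocessed_data_dir  -> directory containing the tokenized question pickles
# project.save_features(...)     -> persists the final feature matrices (see "Save Features")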

Identifier for storing these features on disk and referring to them later.


In [7]:
feature_list_id = 'nlp_tags'

Read Data

Original question datasets.


In [8]:
df_train = pd.read_csv(project.data_dir + 'train.csv').fillna('')
df_test = pd.read_csv(project.data_dir + 'test.csv').fillna('')

Preprocessed and tokenized questions.

We should not use lowercased tokens here: the named entity recognizer relies on capitalization cues, so lowercasing would hurt its accuracy.


In [9]:
tokens_train = kg.io.load(project.preprocessed_data_dir + 'tokens_spellcheck_train.pickle')
tokens_test = kg.io.load(project.preprocessed_data_dir + 'tokens_spellcheck_test.pickle')
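
As a quick illustration of why case matters, consider a hypothetical check with the spaCy pipeline loaded below (the exact entities found depend on the model):

# Hypothetical example: the same question with and without capitalization.
doc_cased = nlp('Why did Apple acquire Beats?')
doc_lower = nlp('why did apple acquire beats?')
print([(ent.text, ent.label_) for ent in doc_cased.ents])  # e.g. [('Apple', 'ORG'), ('Beats', 'ORG')]
print([(ent.text, ent.label_) for ent in doc_lower.ents])  # often [] -- the entities are missed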

In [10]:
df_all_texts = pd.DataFrame(
    [[' '.join(pair[0]), ' '.join(pair[1])] for pair in tokens_train + tokens_test],
    columns=['question1', 'question2'],
)
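
As an optional sanity check, the stacked frame should contain one row per training pair followed by one row per test pair:

assert len(df_all_texts) == len(df_train) + len(df_test)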

Dependency parsing takes a lot of time, and we don't derive any features from it, so we disable it in the pipeline.

If model loading fails, run python -m spacy download en


In [11]:
nlp = spacy.load('en', parser=False)  # parser=False skips the (slow) dependency parser
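
Note that this is the spaCy 1.x API. On spaCy 2.x/3.x, the rough equivalent (shown for reference only, as an assumption about newer versions) would be:

# python -m spacy download en_core_web_sm
# nlp = spacy.load('en_core_web_sm', disable=['parser'])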

Build Features


In [12]:
pos_tags_whitelist = ['ADJ', 'ADV', 'NOUN', 'PROPN', 'NUM', 'VERB']
ner_tags_whitelist = ['GPE', 'LOC', 'ORG', 'NORP', 'PERSON', 'PRODUCT', 'DATE', 'TIME', 'QUANTITY', 'CARDINAL']
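
For reference, the whitelisted tags are standard Universal POS tags and OntoNotes-style NER labels as used by spaCy:

# POS: ADJ adjective, ADV adverb, NOUN common noun, PROPN proper noun, NUM numeral, VERB verb.
# NER: GPE countries/cities/states, LOC other locations, ORG organizations,
#      NORP nationalities/religious/political groups, PERSON people, PRODUCT products,
#      DATE dates, TIME times, QUANTITY measurements, CARDINAL other numerals.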

In [13]:
num_raw_features = len(pos_tags_whitelist) + len(ner_tags_whitelist)

In [14]:
X1 = np.zeros((len(df_all_texts), num_raw_features))
X2 = np.zeros((len(df_all_texts), num_raw_features))

In [15]:
X1.shape, X2.shape


Out[15]:
((2750086, 16), (2750086, 16))

Collect POS and NER tags


In [16]:
pipe_q1 = nlp.pipe(df_all_texts['question1'].values, n_threads=os.cpu_count())
pipe_q2 = nlp.pipe(df_all_texts['question2'].values, n_threads=os.cpu_count())

In [17]:
for i, doc in progressbar(enumerate(pipe_q1), total=len(df_all_texts)):
    pos_counter = Counter(token.pos_ for token in doc)
    ner_counter = Counter(ent.label_ for ent in doc.ents)
    X1[i, :] = np.array(
        [pos_counter[pos_tag] for pos_tag in pos_tags_whitelist] +
        [ner_counter[ner_tag] for ner_tag in ner_tags_whitelist]
    )


100%|██████████| 2750086/2750086 [05:21<00:00, 8558.72it/s] 

In [18]:
for i, doc in progressbar(enumerate(pipe_q2), total=len(df_all_texts)):
    pos_counter = Counter(token.pos_ for token in doc)
    ner_counter = Counter(ent.label_ for ent in doc.ents)
    X2[i, :] = np.array(
        [pos_counter[pos_tag] for pos_tag in pos_tags_whitelist] +
        [ner_counter[ner_tag] for ner_tag in ner_tags_whitelist]
    )


100%|██████████| 2750086/2750086 [05:25<00:00, 8460.16it/s] 
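
The resulting row layout, which the slicing in the next section relies on, is:

# X1[i, 0:6]   -> POS counts in whitelist order: ADJ, ADV, NOUN, PROPN, NUM, VERB
# X1[i, 6:16]  -> NER counts in whitelist order: GPE, LOC, ORG, NORP, PERSON,
#                 PRODUCT, DATE, TIME, QUANTITY, CARDINAL
# (and likewise for X2)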

Create tag feature sets


In [19]:
df_pos_q1 = pd.DataFrame(
    X1[:, 0:len(pos_tags_whitelist)],
    columns=['pos_q1_' + pos_tag.lower() for pos_tag in pos_tags_whitelist]
)

In [20]:
df_pos_q2 = pd.DataFrame(
    X2[:, 0:len(pos_tags_whitelist)],
    columns=['pos_q2_' + pos_tag.lower() for pos_tag in pos_tags_whitelist]
)

In [21]:
df_ner_q1 = pd.DataFrame(
    X1[:, -len(ner_tags_whitelist):],
    columns=['ner_q1_' + ner_tag.lower() for ner_tag in ner_tags_whitelist]
)

In [22]:
df_ner_q2 = pd.DataFrame(
    X2[:, -len(ner_tags_whitelist):],
    columns=['ner_q2_' + ner_tag.lower() for ner_tag in ner_tags_whitelist]
)

Compute pairwise distances


In [24]:
def get_vector_distances(i):
    return [
        # POS distances.
        cosine(X1[i, 0:len(pos_tags_whitelist)], X2[i, 0:len(pos_tags_whitelist)]),
        euclidean(X1[i, 0:len(pos_tags_whitelist)], X2[i, 0:len(pos_tags_whitelist)]),

        # NER distances.
        euclidean(X1[i, -len(ner_tags_whitelist):], X2[i, -len(ner_tags_whitelist):]),
        np.abs(np.sum(X1[i, -len(ner_tags_whitelist):]) - np.sum(X2[i, -len(ner_tags_whitelist):])),
    ]
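
A tiny worked example of these metrics on made-up count vectors (for intuition only; cosine, euclidean, and np are already imported above):

a = np.array([2.0, 1.0, 0.0])
b = np.array([1.0, 1.0, 1.0])
cosine(a, b)               # 1 - 3 / (sqrt(5) * sqrt(3)) ~= 0.2254
euclidean(a, b)            # sqrt(1 + 0 + 1) ~= 1.4142
np.abs(a.sum() - b.sum())  # |3 - 3| = 0, the analogue of ner_tag_count_diff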

In [38]:
# Cosine distance is undefined when one of the tag vectors is all zeros; suppress the resulting warnings.
warnings.filterwarnings('ignore')
X_distances = kg.jobs.map_batch_parallel(
    list(range(len(df_all_texts))),
    item_mapper=get_vector_distances,
    batch_size=1000,
)


Batches: 100%|██████████| 2751/2751 [00:41<00:00, 66.33it/s]

In [26]:
X_distances = np.array(X_distances)

In [27]:
df_distances = pd.DataFrame(
    X_distances,
    columns=[
        'pos_tag_cosine',
        'pos_tag_euclidean',
        'ner_tag_euclidean',
        'ner_tag_count_diff',
    ]
)

Build master feature list


In [28]:
df_master = pd.concat(
    [df_pos_q1, df_ner_q1, df_pos_q2, df_ner_q2, df_distances],
    axis=1,
    ignore_index=True,
)

In [29]:
df_master.columns = list(df_pos_q1.columns) + \
    list(df_ner_q1.columns) + \
    list(df_pos_q2.columns) + \
    list(df_ner_q2.columns) + \
    list(df_distances.columns)

In [30]:
df_master.describe().T


Out[30]:
count mean std min 25% 50% 75% max
pos_q1_adj 2750086.000000 1.067322 1.083106 0.000000 0.000000 1.000000 2.000000 26.000000
pos_q1_adv 2750086.000000 0.727720 0.860922 0.000000 0.000000 1.000000 1.000000 18.000000
pos_q1_noun 2750086.000000 2.930388 1.832767 0.000000 2.000000 3.000000 4.000000 42.000000
pos_q1_propn 2750086.000000 0.868396 1.336260 0.000000 0.000000 0.000000 1.000000 41.000000
pos_q1_num 2750086.000000 0.451231 1.490798 0.000000 0.000000 0.000000 0.000000 83.000000
pos_q1_verb 2750086.000000 2.349555 1.552640 0.000000 1.000000 2.000000 3.000000 59.000000
ner_q1_gpe 2750086.000000 0.165916 0.446753 0.000000 0.000000 0.000000 0.000000 10.000000
ner_q1_loc 2750086.000000 0.013586 0.121908 0.000000 0.000000 0.000000 0.000000 4.000000
ner_q1_org 2750086.000000 0.219017 0.501661 0.000000 0.000000 0.000000 0.000000 8.000000
ner_q1_norp 2750086.000000 0.050499 0.256038 0.000000 0.000000 0.000000 0.000000 8.000000
ner_q1_person 2750086.000000 0.109720 0.356363 0.000000 0.000000 0.000000 0.000000 6.000000
ner_q1_product 2750086.000000 0.003194 0.057656 0.000000 0.000000 0.000000 0.000000 3.000000
ner_q1_date 2750086.000000 0.048328 0.236150 0.000000 0.000000 0.000000 0.000000 11.000000
ner_q1_time 2750086.000000 0.008575 0.097877 0.000000 0.000000 0.000000 0.000000 4.000000
ner_q1_quantity 2750086.000000 0.008518 0.098519 0.000000 0.000000 0.000000 0.000000 5.000000
ner_q1_cardinal 2750086.000000 0.220541 0.750349 0.000000 0.000000 0.000000 0.000000 29.000000
pos_q2_adj 2750086.000000 1.071430 1.093956 0.000000 0.000000 1.000000 2.000000 27.000000
pos_q2_adv 2750086.000000 0.732874 0.868020 0.000000 0.000000 1.000000 1.000000 18.000000
pos_q2_noun 2750086.000000 2.921520 1.852080 0.000000 2.000000 3.000000 4.000000 43.000000
pos_q2_propn 2750086.000000 0.867678 1.334859 0.000000 0.000000 0.000000 1.000000 40.000000
pos_q2_num 2750086.000000 0.456342 1.489910 0.000000 0.000000 0.000000 0.000000 83.000000
pos_q2_verb 2750086.000000 2.376725 1.609811 0.000000 1.000000 2.000000 3.000000 60.000000
ner_q2_gpe 2750086.000000 0.167076 0.449481 0.000000 0.000000 0.000000 0.000000 9.000000
ner_q2_loc 2750086.000000 0.013705 0.122938 0.000000 0.000000 0.000000 0.000000 4.000000
ner_q2_org 2750086.000000 0.218546 0.501982 0.000000 0.000000 0.000000 0.000000 8.000000
ner_q2_norp 2750086.000000 0.050230 0.255864 0.000000 0.000000 0.000000 0.000000 8.000000
ner_q2_person 2750086.000000 0.109059 0.354765 0.000000 0.000000 0.000000 0.000000 7.000000
ner_q2_product 2750086.000000 0.003225 0.057834 0.000000 0.000000 0.000000 0.000000 3.000000
ner_q2_date 2750086.000000 0.049798 0.240537 0.000000 0.000000 0.000000 0.000000 11.000000
ner_q2_time 2750086.000000 0.008598 0.098287 0.000000 0.000000 0.000000 0.000000 6.000000
ner_q2_quantity 2750086.000000 0.008721 0.099433 0.000000 0.000000 0.000000 0.000000 6.000000
ner_q2_cardinal 2750086.000000 0.222137 0.749245 0.000000 0.000000 0.000000 0.000000 30.000000
pos_tag_cosine 2749307.000000 0.170882 0.163781 -0.000000 0.053271 0.119591 0.237230 1.000000
pos_tag_euclidean 2750086.000000 3.109987 2.110255 0.000000 1.732051 2.645751 4.000000 81.030858
ner_tag_euclidean 2750086.000000 0.754844 0.957642 0.000000 0.000000 0.000000 1.000000 28.017851
ner_tag_count_diff 2750086.000000 0.641285 0.998784 0.000000 0.000000 0.000000 1.000000 31.000000

In [32]:
X_train = df_master[:len(tokens_train)].values
X_test = df_master[len(tokens_train):].values

In [33]:
print('X train:', X_train.shape)
print('X test: ', X_test.shape)


X train: (404290, 36)
X test:  (2345796, 36)

Save Features


In [34]:
feature_names = list(df_master.columns)

In [35]:
project.save_features(X_train, X_test, feature_names, feature_list_id)