Feature: POS/NER Tag Similarity

Derive bag-of-POS-tag and bag-of-NER-tag count vectors from each question in a pair and compute vector distances between them.
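
For intuition, here is a toy sketch of the vectors being compared (made-up counts, not taken from the data; the actual tag whitelists are defined below):

# Toy illustration with made-up counts:
# pos_vector(q1) = [ADJ: 0, ADV: 0, NOUN: 1, PROPN: 2, NUM: 0, VERB: 1]
# pos_vector(q2) = [ADJ: 1, ADV: 0, NOUN: 2, PROPN: 0, NUM: 0, VERB: 1]
# ner_vector(q1) = [GPE: 1, LOC: 0, ORG: 0, ...]
# ner_vector(q2) = [GPE: 0, LOC: 0, ORG: 0, ...]
# The features are distances between these per-question vectors
# (cosine, euclidean, and absolute count difference).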

Imports

The pygoose utility package imports numpy, pandas, matplotlib, and a helper kg module into the root namespace.


In [1]:
from pygoose import *
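
Judging by how names are used in this notebook, the wildcard import is roughly equivalent to the following sketch (the exact exports of pygoose may differ):

# Approximate effect of `from pygoose import *`, inferred from usage below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pygoose import kg            # project discovery, pickle I/O, and parallel-job helpers
# plus a `progressbar` helper, assumed to be a tqdm-style progress bar wrapper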

In [2]:
import os
import warnings

In [3]:
from collections import Counter

In [4]:
from scipy.spatial.distance import cosine, euclidean, jaccard

In [5]:
import spacy

Config

Automatically discover the paths to various data folders and compose the project structure.


In [6]:
project = kg.Project.discover()
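
For reference, the parts of the discovered project object used later in this notebook are:

# project.data_dir               -> directory containing train.csv / test.csv
# project.preprocessed_data_dir  -> directory containing the tokenized question pickles
# project.save_features(...)     -> persists the final feature matrices (see "Save Features")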

Identifier for storing these features on disk and referring to them later.


In [7]:
feature_list_id = 'nlp_tags'

Read Data

Original question datasets.


In [8]:
df_train = pd.read_csv(project.data_dir + 'train.csv').fillna('')
df_test = pd.read_csv(project.data_dir + 'test.csv').fillna('')

Preprocessed and tokenized questions.

We should not use lowercased tokens here: the named entity recognizer relies on capitalization cues, so lowercasing would hurt its accuracy.


In [9]:
tokens_train = kg.io.load(project.preprocessed_data_dir + 'tokens_spellcheck_train.pickle')
tokens_test = kg.io.load(project.preprocessed_data_dir + 'tokens_spellcheck_test.pickle')
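
As a quick illustration of why case matters, consider a hypothetical check with the spaCy pipeline loaded below (the exact entities found depend on the model):

# Hypothetical example: the same question with and without capitalization.
doc_cased = nlp('Why did Apple acquire Beats?')
doc_lower = nlp('why did apple acquire beats?')
print([(ent.text, ent.label_) for ent in doc_cased.ents])  # e.g. [('Apple', 'ORG'), ('Beats', 'ORG')]
print([(ent.text, ent.label_) for ent in doc_lower.ents])  # often [] -- the entities are missed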

In [10]:
df_all_texts = pd.DataFrame(
    [[' '.join(pair[0]), ' '.join(pair[1])] for pair in tokens_train + tokens_test],
    columns=['question1', 'question2'],
)
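
As an optional sanity check, the stacked frame should contain one row per training pair followed by one row per test pair:

assert len(df_all_texts) == len(df_train) + len(df_test)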

Dependency parsing takes a lot of time, and we don't derive any features from it, so we disable it in the pipeline.

If model loading fails, run python -m spacy download en


In [11]:
nlp = spacy.load('en', parser=False)  # parser=False skips the (slow) dependency parser
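
Note that this is the spaCy 1.x API. On spaCy 2.x/3.x, the rough equivalent (shown for reference only, as an assumption about newer versions) would be:

# python -m spacy download en_core_web_sm
# nlp = spacy.load('en_core_web_sm', disable=['parser'])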

Build Features


In [12]:
pos_tags_whitelist = ['ADJ', 'ADV', 'NOUN', 'PROPN', 'NUM', 'VERB']
ner_tags_whitelist = ['GPE', 'LOC', 'ORG', 'NORP', 'PERSON', 'PRODUCT', 'DATE', 'TIME', 'QUANTITY', 'CARDINAL']
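
For reference, the whitelisted tags are standard Universal POS tags and OntoNotes-style NER labels as used by spaCy:

# POS: ADJ adjective, ADV adverb, NOUN common noun, PROPN proper noun, NUM numeral, VERB verb.
# NER: GPE countries/cities/states, LOC other locations, ORG organizations,
#      NORP nationalities/religious/political groups, PERSON people, PRODUCT products,
#      DATE dates, TIME times, QUANTITY measurements, CARDINAL other numerals.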

In [13]:
num_raw_features = len(pos_tags_whitelist) + len(ner_tags_whitelist)

In [14]:
X1 = np.zeros((len(df_all_texts), num_raw_features))
X2 = np.zeros((len(df_all_texts), num_raw_features))

In [15]:
X1.shape, X2.shape


Out[15]:
((2750086, 16), (2750086, 16))

Collect POS and NER tags


In [16]:
pipe_q1 = nlp.pipe(df_all_texts['question1'].values, n_threads=os.cpu_count())
pipe_q2 = nlp.pipe(df_all_texts['question2'].values, n_threads=os.cpu_count())

In [17]:
for i, doc in progressbar(enumerate(pipe_q1), total=len(df_all_texts)):
    pos_counter = Counter(token.pos_ for token in doc)
    ner_counter = Counter(ent.label_ for ent in doc.ents)
    X1[i, :] = np.array(
        [pos_counter[pos_tag] for pos_tag in pos_tags_whitelist] +
        [ner_counter[ner_tag] for ner_tag in ner_tags_whitelist]
    )


100%|██████████| 2750086/2750086 [05:21<00:00, 8558.72it/s] 

In [18]:
for i, doc in progressbar(enumerate(pipe_q2), total=len(df_all_texts)):
    pos_counter = Counter(token.pos_ for token in doc)
    ner_counter = Counter(ent.label_ for ent in doc.ents)
    X2[i, :] = np.array(
        [pos_counter[pos_tag] for pos_tag in pos_tags_whitelist] +
        [ner_counter[ner_tag] for ner_tag in ner_tags_whitelist]
    )


100%|██████████| 2750086/2750086 [05:25<00:00, 8460.16it/s] 
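
The resulting row layout, which the slicing in the next section relies on, is:

# X1[i, 0:6]   -> POS counts in whitelist order: ADJ, ADV, NOUN, PROPN, NUM, VERB
# X1[i, 6:16]  -> NER counts in whitelist order: GPE, LOC, ORG, NORP, PERSON,
#                 PRODUCT, DATE, TIME, QUANTITY, CARDINAL
# (and likewise for X2)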

Create tag feature sets


In [19]:
df_pos_q1 = pd.DataFrame(
    X1[:, 0:len(pos_tags_whitelist)],
    columns=['pos_q1_' + pos_tag.lower() for pos_tag in pos_tags_whitelist]
)

In [20]:
df_pos_q2 = pd.DataFrame(
    X2[:, 0:len(pos_tags_whitelist)],
    columns=['pos_q2_' + pos_tag.lower() for pos_tag in pos_tags_whitelist]
)

In [21]:
df_ner_q1 = pd.DataFrame(
    X1[:, -len(ner_tags_whitelist):],
    columns=['ner_q1_' + ner_tag.lower() for ner_tag in ner_tags_whitelist]
)

In [22]:
df_ner_q2 = pd.DataFrame(
    X2[:, -len(ner_tags_whitelist):],
    columns=['ner_q2_' + ner_tag.lower() for ner_tag in ner_tags_whitelist]
)

Compute pairwise distances


In [24]:
def get_vector_distances(i):
    return [
        # POS distances.
        cosine(X1[i, 0:len(pos_tags_whitelist)], X2[i, 0:len(pos_tags_whitelist)]),
        euclidean(X1[i, 0:len(pos_tags_whitelist)], X2[i, 0:len(pos_tags_whitelist)]),

        # NER distances.
        euclidean(X1[i, -len(ner_tags_whitelist):], X2[i, -len(ner_tags_whitelist):]),
        np.abs(np.sum(X1[i, -len(ner_tags_whitelist):]) - np.sum(X2[i, -len(ner_tags_whitelist):])),
    ]
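
A tiny worked example of these metrics on made-up count vectors (for intuition only; cosine, euclidean, and np are already imported above):

a = np.array([2.0, 1.0, 0.0])
b = np.array([1.0, 1.0, 1.0])
cosine(a, b)               # 1 - 3 / (sqrt(5) * sqrt(3)) ~= 0.2254
euclidean(a, b)            # sqrt(1 + 0 + 1) ~= 1.4142
np.abs(a.sum() - b.sum())  # |3 - 3| = 0, the analogue of ner_tag_count_diff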

In [38]:
# Cosine distance is undefined when one of the tag vectors is all zeros; suppress the resulting warnings.
warnings.filterwarnings('ignore')
X_distances = kg.jobs.map_batch_parallel(
    list(range(len(df_all_texts))),
    item_mapper=get_vector_distances,
    batch_size=1000,
)


Batches: 100%|██████████| 2751/2751 [00:41<00:00, 66.33it/s]

In [26]:
X_distances = np.array(X_distances)

In [27]:
df_distances = pd.DataFrame(
    X_distances,
    columns=[
        'pos_tag_cosine',
        'pos_tag_euclidean',
        'ner_tag_euclidean',
        'ner_tag_count_diff',
    ]
)

Build master feature list


In [28]:
df_master = pd.concat(
    [df_pos_q1, df_ner_q1, df_pos_q2, df_ner_q2, df_distances],
    axis=1,
    ignore_index=True,
)

In [29]:
df_master.columns = list(df_pos_q1.columns) + \
    list(df_ner_q1.columns) + \
    list(df_pos_q2.columns) + \
    list(df_ner_q2.columns) + \
    list(df_distances.columns)

In [30]:
df_master.describe().T


Out[30]:
count mean std min 25% 50% 75% max
pos_q1_adj 2750086.000000 1.067322 1.083106 0.000000 0.000000 1.000000 2.000000 26.000000
pos_q1_adv 2750086.000000 0.727720 0.860922 0.000000 0.000000 1.000000 1.000000 18.000000
pos_q1_noun 2750086.000000 2.930388 1.832767 0.000000 2.000000 3.000000 4.000000 42.000000
pos_q1_propn 2750086.000000 0.868396 1.336260 0.000000 0.000000 0.000000 1.000000 41.000000
pos_q1_num 2750086.000000 0.451231 1.490798 0.000000 0.000000 0.000000 0.000000 83.000000
pos_q1_verb 2750086.000000 2.349555 1.552640 0.000000 1.000000 2.000000 3.000000 59.000000
ner_q1_gpe 2750086.000000 0.165916 0.446753 0.000000 0.000000 0.000000 0.000000 10.000000
ner_q1_loc 2750086.000000 0.013586 0.121908 0.000000 0.000000 0.000000 0.000000 4.000000
ner_q1_org 2750086.000000 0.219017 0.501661 0.000000 0.000000 0.000000 0.000000 8.000000
ner_q1_norp 2750086.000000 0.050499 0.256038 0.000000 0.000000 0.000000 0.000000 8.000000
ner_q1_person 2750086.000000 0.109720 0.356363 0.000000 0.000000 0.000000 0.000000 6.000000
ner_q1_product 2750086.000000 0.003194 0.057656 0.000000 0.000000 0.000000 0.000000 3.000000
ner_q1_date 2750086.000000 0.048328 0.236150 0.000000 0.000000 0.000000 0.000000 11.000000
ner_q1_time 2750086.000000 0.008575 0.097877 0.000000 0.000000 0.000000 0.000000 4.000000
ner_q1_quantity 2750086.000000 0.008518 0.098519 0.000000 0.000000 0.000000 0.000000 5.000000
ner_q1_cardinal 2750086.000000 0.220541 0.750349 0.000000 0.000000 0.000000 0.000000 29.000000
pos_q2_adj 2750086.000000 1.071430 1.093956 0.000000 0.000000 1.000000 2.000000 27.000000
pos_q2_adv 2750086.000000 0.732874 0.868020 0.000000 0.000000 1.000000 1.000000 18.000000
pos_q2_noun 2750086.000000 2.921520 1.852080 0.000000 2.000000 3.000000 4.000000 43.000000
pos_q2_propn 2750086.000000 0.867678 1.334859 0.000000 0.000000 0.000000 1.000000 40.000000
pos_q2_num 2750086.000000 0.456342 1.489910 0.000000 0.000000 0.000000 0.000000 83.000000
pos_q2_verb 2750086.000000 2.376725 1.609811 0.000000 1.000000 2.000000 3.000000 60.000000
ner_q2_gpe 2750086.000000 0.167076 0.449481 0.000000 0.000000 0.000000 0.000000 9.000000
ner_q2_loc 2750086.000000 0.013705 0.122938 0.000000 0.000000 0.000000 0.000000 4.000000
ner_q2_org 2750086.000000 0.218546 0.501982 0.000000 0.000000 0.000000 0.000000 8.000000
ner_q2_norp 2750086.000000 0.050230 0.255864 0.000000 0.000000 0.000000 0.000000 8.000000
ner_q2_person 2750086.000000 0.109059 0.354765 0.000000 0.000000 0.000000 0.000000 7.000000
ner_q2_product 2750086.000000 0.003225 0.057834 0.000000 0.000000 0.000000 0.000000 3.000000
ner_q2_date 2750086.000000 0.049798 0.240537 0.000000 0.000000 0.000000 0.000000 11.000000
ner_q2_time 2750086.000000 0.008598 0.098287 0.000000 0.000000 0.000000 0.000000 6.000000
ner_q2_quantity 2750086.000000 0.008721 0.099433 0.000000 0.000000 0.000000 0.000000 6.000000
ner_q2_cardinal 2750086.000000 0.222137 0.749245 0.000000 0.000000 0.000000 0.000000 30.000000
pos_tag_cosine 2749307.000000 0.170882 0.163781 -0.000000 0.053271 0.119591 0.237230 1.000000
pos_tag_euclidean 2750086.000000 3.109987 2.110255 0.000000 1.732051 2.645751 4.000000 81.030858
ner_tag_euclidean 2750086.000000 0.754844 0.957642 0.000000 0.000000 0.000000 1.000000 28.017851
ner_tag_count_diff 2750086.000000 0.641285 0.998784 0.000000 0.000000 0.000000 1.000000 31.000000

In [32]:
X_train = df_master[:len(tokens_train)].values
X_test = df_master[len(tokens_train):].values

In [33]:
print('X train:', X_train.shape)
print('X test: ', X_test.shape)


X train: (404290, 36)
X test:  (2345796, 36)

Save Features


In [34]:
feature_names = list(df_master.columns)

In [35]:
project.save_features(X_train, X_test, feature_names, feature_list_id)