(Based on the kernel Unusual meaning map: Treating question pairs as image / surface by Puneeth Singh Ludu)
Other people have already written very nice exploratory kernels, which helped me keep my own code minimal.
In this kernel, I have tried to extract a different type of feature that can be fed to any algorithm that learns from images. The basic assumption behind this exercise is that we can capture non-sequential closeness between words this way.
For example:
Imagine a question pair with arrows pointing from each word of one sentence to each word of the other sentence.
To capture this, we can create an NxM matrix of the Word2Vec similarity between each word of one question and each word of the other, resize the matrix just like an image to a fixed 10x10 grid, and use it as a feature for XGBoost; a toy sketch follows.
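For illustration, here is a toy sketch of the idea (the words are hypothetical, and random numbers stand in for the Word2Vec similarities; note that np.resize tiles the data to the target shape rather than interpolating):
import numpy as np

t1 = ['how', 'do', 'i', 'train', 'my', 'dog']   # N = 6 words
t2 = ['best', 'way', 'teach', 'a', 'puppy']     # M = 5 words
rng = np.random.RandomState(0)
Z = rng.rand(len(t2), len(t1))    # stand-in for the pairwise similarities
img = np.resize(Z, (10, 10))      # fixed-size grid, as in img_feature below
print(img.shape)                  # (10, 10)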
This utility package imports numpy, pandas, matplotlib and a helper kg module into the root namespace.
In [1]:
from pygoose import *
In [2]:
import math
import re
In [3]:
import nltk
import gensim
In [4]:
from gensim import corpora, models, similarities
Automatically discover the paths to various data folders and compose the project structure.
In [5]:
project = kg.Project.discover()
Identifier for storing these features on disk and referring to them later.
In [6]:
feature_list_id = '3rdparty_image_similarity'
Original question sets.
In [7]:
df = pd.concat([
pd.read_csv(project.data_dir + 'train.csv'),
pd.read_csv(project.data_dir + 'test.csv'),
])
Unique document corpus.
In [8]:
sentences = kg.io.load(project.preprocessed_data_dir + 'unique_questions_tokenized.pickle')
Create a simple Word2Vec model from the question corpus. A pre-trained model could be used instead to get better results, as sketched after this cell.
In [9]:
model = gensim.models.Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
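As a sketch of that alternative, pre-trained vectors could be plugged in (the file name is an assumption; the GoogleNews binary must be downloaded separately):
# Hypothetical: load 300-dimensional GoogleNews vectors instead of
# training from scratch. The path below is illustrative only.
pretrained = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
# w2v_sim below would then call pretrained.similarity(w1, w2).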
A very simple term frequency and document frequency extractor.
In [10]:
tf = dict()    # token -> total occurrences across the corpus
docf = dict()  # token -> number of documents containing it
total_docs = 0
for sentence in sentences:
    total_docs += 1
    uniq_toks = set(sentence)
    for i in sentence:
        if i not in tf:
            tf[i] = 1
        else:
            tf[i] += 1
    for i in uniq_toks:
        if i not in docf:
            docf[i] = 1
        else:
            docf[i] += 1
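The same counts can be built more compactly with collections.Counter (an equivalent sketch, not what this kernel ships):
from collections import Counter

tf_alt = Counter(tok for sentence in sentences for tok in sentence)
docf_alt = Counter(tok for sentence in sentences for tok in set(sentence))
# tf_alt and docf_alt match tf and docf from the loop above.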
Mimic the IDF function, but penalize words that would otherwise score fairly high, and give a strong boost to words that appear only sporadically.
In [11]:
def idf(word):
    return 1 - math.sqrt(docf[word] / total_docs)
In [12]:
print(idf("kenya"))
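To see how this behaves, here is a small illustration with made-up document frequencies (N = 1000 documents): the damped variant stays within [0, 1], while the classical log IDF grows without bound for rare words.
import math

N = 1000
for df_count in (1, 10, 100, 1000):
    damped = 1 - math.sqrt(df_count / N)
    classic = math.log(N / df_count)
    print(df_count, round(damped, 3), round(classic, 3))
# 1     0.968  6.908
# 10    0.9    4.605
# 100   0.684  2.303
# 1000  0.0    0.0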
A simple cleaning function for feature extraction.
In [13]:
def basic_cleaning(string):
    string = str(string).lower()
    # Strip digits and punctuation (raw string; the hyphen is escaped so it
    # is not interpreted as a character range).
    string = re.sub(r'[0-9()!^%$\'".;,\-?{}\[\]\\/]', ' ', string)
    # Drop a small hand-picked stopword list.
    string = ' '.join([i for i in string.split() if i not in ["a", "and", "of", "the", "to", "on", "in", "at", "is"]])
    string = re.sub(' +', ' ', string)
    return string
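A quick sanity check on a made-up question:
print(basic_cleaning("What is the best way to learn Python in 2017?"))
# -> 'what best way learn python'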
In [14]:
def w2v_sim(w1, w2):
    # Word2Vec cosine similarity weighted by both words' damped IDF;
    # out-of-vocabulary words contribute 0.
    try:
        return model.similarity(w1, w2) * idf(w1) * idf(w2)
    except Exception:
        return 0.0
In [15]:
def img_feature(row):
    s1 = row['question1']
    s2 = row['question2']
    t1 = basic_cleaning(s1).split()
    t2 = basic_cleaning(s2).split()
    # Pairwise weighted similarities form the variable-size "image".
    Z = [[w2v_sim(x, y) for x in t1] for y in t2]
    a = np.array(Z, order='C')
    # np.resize tiles/truncates to a fixed 10x10 grid, flattened to 100 pixels.
    return [np.resize(a, (10, 10)).flatten()]
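As a quick check on a made-up pair (the values depend on the trained model, so only the shape is meaningful here):
sample = pd.Series({
    'question1': 'How can I learn machine learning?',
    'question2': 'What is the best way to study machine learning?',
})
print(img_feature(sample)[0].shape)  # (100,)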
In [16]:
s = df
# The default raw=False is required here: img_feature indexes each row
# by column name, which an ndarray (raw=True) would not support.
img = s.apply(img_feature, axis=1)
pix_col = [[] for y in range(100)]
for k in img.items():
    for f in range(len(k[1][0])):
        pix_col[f].append(k[1][0][f])
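The same pixel table can be built in one step (an equivalent sketch, assuming every entry of img holds a single 100-element array):
pix = np.vstack([v[0] for v in img])  # shape: (n_rows, 100)
# pix[:, f] equals pix_col[f] from the loop above.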
Extracting Features
In [17]:
df_X = pd.DataFrame()
for g in range(len(pix_col)):
    df_X[f'img{g:03d}'] = pix_col[g]
In [18]:
# The first 404290 rows are the training pairs; the rest are the test pairs.
# The 100 pixel columns are collapsed into a single scalar by summation.
X_train = np.sum(df_X[:404290].values, axis=1).reshape(-1, 1)
X_test = np.sum(df_X[404290:].values, axis=1).reshape(-1, 1)
In [19]:
print('X_train:', X_train.shape)
print('X_test: ', X_test.shape)
In [20]:
feature_names = [
'image_similarity'
]
In [21]:
project.save_features(X_train, X_test, feature_names, feature_list_id)