(Based on the kernel Unusual meaning map: Treating question pairs as image / surface by Puneeth Singh Ludu)
Other people have already written very nice exploratory kernels, which helped me keep my own code minimal.
In this kernel, I have tried to extract a different type of feature that can be fed to any algorithm that learns from images. The basic assumption behind this exercise is that we can capture non-sequential closeness between words this way.
For example:
Imagine a question pair with arrows pointing from each word of one sentence to each word of the other sentence.
To capture this, we can create an NxM matrix of the Word2Vec similarity between each word of one question and each word of the other, resize the matrix just like an image to a fixed 10x10 grid, and use it as a feature for XGBoost; a toy sketch follows.
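For illustration, here is a toy sketch of the idea (the words are hypothetical, and random numbers stand in for the Word2Vec similarities; note that np.resize tiles the data to the target shape rather than interpolating):
import numpy as np

t1 = ['how', 'do', 'i', 'train', 'my', 'dog']   # N = 6 words
t2 = ['best', 'way', 'teach', 'a', 'puppy']     # M = 5 words
rng = np.random.RandomState(0)
Z = rng.rand(len(t2), len(t1))    # stand-in for the pairwise similarities
img = np.resize(Z, (10, 10))      # fixed-size grid, as in img_feature below
print(img.shape)                  # (10, 10)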
This utility package imports numpy, pandas, matplotlib and a helper kg module into the root namespace.
In [1]:
from pygoose import *
In [2]:
import math
import re
In [3]:
import nltk
import gensim
In [4]:
from gensim import corpora, models, similarities
Automatically discover the paths to various data folders and compose the project structure.
In [5]:
project = kg.Project.discover()
Identifier for storing these features on disk and referring to them later.
In [6]:
feature_list_id = '3rdparty_image_similarity'
Original question sets.
In [7]:
df = pd.concat([
pd.read_csv(project.data_dir + 'train.csv'),
pd.read_csv(project.data_dir + 'test.csv'),
])
Unique document corpus.
In [8]:
sentences = kg.io.load(project.preprocessed_data_dir + 'unique_questions_tokenized.pickle')
Create a simple Word2Vec model from the question corpus. A pre-trained model could be used instead to get better results, as sketched after this cell.
In [9]:
model = gensim.models.Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
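As a sketch of that alternative, pre-trained vectors could be plugged in (the file name is an assumption; the GoogleNews binary must be downloaded separately):
# Hypothetical: load 300-dimensional GoogleNews vectors instead of
# training from scratch. The path below is illustrative only.
pretrained = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
# w2v_sim below would then call pretrained.similarity(w1, w2).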
A very simple term frequency and document frequency extractor.
In [10]:
tf = dict()    # token -> total occurrences across the corpus
docf = dict()  # token -> number of documents containing it
total_docs = 0
for sentence in sentences:
    total_docs += 1
    uniq_toks = set(sentence)
    for i in sentence:
        if i not in tf:
            tf[i] = 1
        else:
            tf[i] += 1
    for i in uniq_toks:
        if i not in docf:
            docf[i] = 1
        else:
            docf[i] += 1
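The same counts can be built more compactly with collections.Counter (an equivalent sketch, not what this kernel ships):
from collections import Counter

tf_alt = Counter(tok for sentence in sentences for tok in sentence)
docf_alt = Counter(tok for sentence in sentences for tok in set(sentence))
# tf_alt and docf_alt match tf and docf from the loop above.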
Mimic the IDF function, but penalize words that would otherwise score fairly high, and give a strong boost to words that appear only sporadically.
In [11]:
def idf(word):
    return 1 - math.sqrt(docf[word] / total_docs)
In [12]:
print(idf("kenya"))
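To see how this behaves, here is a small illustration with made-up document frequencies (N = 1000 documents): the damped variant stays within [0, 1], while the classical log IDF grows without bound for rare words.
import math

N = 1000
for df_count in (1, 10, 100, 1000):
    damped = 1 - math.sqrt(df_count / N)
    classic = math.log(N / df_count)
    print(df_count, round(damped, 3), round(classic, 3))
# 1     0.968  6.908
# 10    0.9    4.605
# 100   0.684  2.303
# 1000  0.0    0.0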
A simple cleaning function for feature extraction.
In [13]:
def basic_cleaning(string):
    string = str(string).lower()
    # Strip digits and punctuation (raw string; the hyphen is escaped so it
    # is not interpreted as a character range).
    string = re.sub(r'[0-9()!^%$\'".;,\-?{}\[\]\\/]', ' ', string)
    # Drop a small hand-picked stopword list.
    string = ' '.join([i for i in string.split() if i not in ["a", "and", "of", "the", "to", "on", "in", "at", "is"]])
    string = re.sub(' +', ' ', string)
    return string
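A quick sanity check on a made-up question:
print(basic_cleaning("What is the best way to learn Python in 2017?"))
# -> 'what best way learn python'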
In [14]:
def w2v_sim(w1, w2):
    # Word2Vec cosine similarity weighted by both words' damped IDF;
    # out-of-vocabulary words contribute 0.
    try:
        return model.similarity(w1, w2) * idf(w1) * idf(w2)
    except Exception:
        return 0.0
In [15]:
def img_feature(row):
    s1 = row['question1']
    s2 = row['question2']
    t1 = basic_cleaning(s1).split()
    t2 = basic_cleaning(s2).split()
    # Pairwise weighted similarities form the variable-size "image".
    Z = [[w2v_sim(x, y) for x in t1] for y in t2]
    a = np.array(Z, order='C')
    # np.resize tiles/truncates to a fixed 10x10 grid, flattened to 100 pixels.
    return [np.resize(a, (10, 10)).flatten()]
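As a quick check on a made-up pair (the values depend on the trained model, so only the shape is meaningful here):
sample = pd.Series({
    'question1': 'How can I learn machine learning?',
    'question2': 'What is the best way to study machine learning?',
})
print(img_feature(sample)[0].shape)  # (100,)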
In [16]:
s = df
# The default raw=False is required here: img_feature indexes each row
# by column name, which an ndarray (raw=True) would not support.
img = s.apply(img_feature, axis=1)
pix_col = [[] for y in range(100)]
for k in img.items():
    for f in range(len(k[1][0])):
        pix_col[f].append(k[1][0][f])
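The same pixel table can be built in one step (an equivalent sketch, assuming every entry of img holds a single 100-element array):
pix = np.vstack([v[0] for v in img])  # shape: (n_rows, 100)
# pix[:, f] equals pix_col[f] from the loop above.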
Extracting Features
In [17]:
df_X = pd.DataFrame()
for g in range(len(pix_col)):
    df_X[f'img{g:03d}'] = pix_col[g]
In [18]:
# The first 404290 rows are the training pairs; the rest are the test pairs.
# The 100 pixel columns are collapsed into a single scalar by summation.
X_train = np.sum(df_X[:404290].values, axis=1).reshape(-1, 1)
X_test = np.sum(df_X[404290:].values, axis=1).reshape(-1, 1)
In [19]:
print('X_train:', X_train.shape)
print('X_test: ', X_test.shape)
In [20]:
feature_names = [
'image_similarity'
]
In [21]:
project.save_features(X_train, X_test, feature_names, feature_list_id)