In this assignment you will learn how to calculate similarity between pieces of text. Using this approach, you will be able to find duplicate questions on StackOverflow.
In this task you will need the following libraries: Gensim, Numpy, and scikit-learn.
In [ ]:
import sys
sys.path.append("..")
from common.download_utils import download_week3_resources
download_week3_resources()
We will create a grader instance below and use it to collect your answers. Note that these outputs will be stored locally inside the grader and will be uploaded to the platform only after you run the submit function in the last part of this assignment. If you want to make a partial submission, you can run that cell at any time.
In [ ]:
from grader import Grader
In [ ]:
grader = Grader()
To solve the problem, you will use two different models of embeddings: pre-trained word2vec vectors from Google and embeddings trained on our own data with StarSpace.
It's always easier to start with pre-trained embeddings. Unpack the pre-trained Google vectors and load them using the function KeyedVectors.load_word2vec_format from the gensim library with the parameter binary=True. If the size of the embeddings is larger than the available memory, you can load only a part of them by setting the parameter limit (recommended: 500000).
In [ ]:
import gensim
In [ ]:
wv_embeddings = ######### YOUR CODE HERE #############
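If you are unsure how to fill in the cell above, here is a minimal sketch. It assumes the unpacked vectors file is named GoogleNews-vectors-negative300.bin and sits in the working directory; both the path and the limit value are assumptions, not graded requirements.
In [ ]:
# A hedged sketch: the file path below is an assumption about where you unpacked the vectors.
# limit=500000 loads only the first 500,000 vectors to fit into memory.
wv_embeddings = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True, limit=500000)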
Once you have loaded the representations, make sure you can access them. First, you can check if the loaded embeddings contain a word:
'word' in wv_embeddings
Second, to get the corresponding embedding you can use the square brackets:
wv_embeddings['word']
To prevent errors at this first stage, we can check that the loaded embeddings are correct. Call the function check_embeddings, implemented below, which runs 3 tests: a most_similar query (king - man + woman should give queen), a doesnt_match query (cereal should stand out from the meal names), and a most_similar_to_given query (sound should be closest to music).
If everything is correct, the function returns the string These embeddings look good. Otherwise, you need to revisit the previous steps.
In [ ]:
def check_embeddings(embeddings):
    error_text = "Something wrong with your embeddings ('%s test isn't correct)."
    most_similar = embeddings.most_similar(positive=['woman', 'king'], negative=['man'])
    if len(most_similar) < 1 or most_similar[0][0] != 'queen':
        return error_text % "Most similar"
    doesnt_match = embeddings.doesnt_match(['breakfast', 'cereal', 'dinner', 'lunch'])
    if doesnt_match != 'cereal':
        return error_text % "Doesn't match"
    most_similar_to_given = embeddings.most_similar_to_given('music', ['water', 'sound', 'backpack', 'mouse'])
    if most_similar_to_given != 'sound':
        return error_text % "Most similar to given"
    return "These embeddings look good."
In [ ]:
print(check_embeddings(wv_embeddings))
Task 1 (Question2Vec). Usually we have word-based embeddings, but for this task we need a representation for the whole question. It could be done in different ways; in our case we will use the mean of all the word vectors in the question. Now you need to implement the function question_to_vec, which calculates the question representation described above.
Note that there could be words without corresponding embeddings. In this case, just skip these words and don't take them into account when calculating the result. If the question doesn't contain any word with a known embedding, the function should return a zero vector.
In [ ]:
import numpy as np
In [ ]:
def question_to_vec(question, embeddings, dim=300):
    """
    question: a string
    embeddings: dict where the key is a word and the value is its embedding
    dim: size of the representation
    result: vector representation for the question
    """
    ######################################
    ######### YOUR CODE HERE #############
    ######################################
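If you get stuck, here is one possible implementation as a hedged sketch. It assumes tokens are separated by whitespace; the name question_to_vec_sketch is ours, so it doesn't clash with your graded function.
In [ ]:
def question_to_vec_sketch(question, embeddings, dim=300):
    # Keep only the words that have an embedding.
    word_vectors = [embeddings[word] for word in question.split() if word in embeddings]
    # No known words: return a zero vector of the requested dimension.
    if not word_vectors:
        return np.zeros(dim)
    # Otherwise, average the word vectors.
    return np.mean(word_vectors, axis=0)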
To check the basic correctness of your implementation, run the function question_to_vec_tests.
In [ ]:
def question_to_vec_tests():
    if (np.zeros(300) != question_to_vec('', wv_embeddings)).any():
        return "You need to return a zero vector for an empty question."
    if (np.zeros(300) != question_to_vec('thereisnosuchword', wv_embeddings)).any():
        return "You need to return a zero vector for a question which consists only of unknown words."
    if (wv_embeddings['word'] != question_to_vec('word', wv_embeddings)).any():
        return "You need to check the correctness of your function."
    if ((wv_embeddings['I'] + wv_embeddings['am']) / 2 != question_to_vec('I am', wv_embeddings)).any():
        return "Your function should calculate the mean of the word vectors."
    return "Basic tests are passed."
In [ ]:
print(question_to_vec_tests())
To earn the points, submit the embeddings for the questions from the file test_embeddings.tsv. In this task you don't need to transform the text of the questions in any way.
In [ ]:
from util import array_to_string
In [ ]:
question2vec_result = []
for question in open('data/test_embeddings.tsv'):
    question = question.strip()
    answer = question_to_vec(question, wv_embeddings)
    question2vec_result = np.append(question2vec_result, answer)
grader.submit_tag('Question2Vec', array_to_string(question2vec_result))
Now we have a method to create a representation of any sentence and we are ready for the first evaluation. So, let's check how well our solution (Google's vectors + question_to_vec) will work.
We can expect that, with good embeddings, the cosine similarity between duplicate sentences should be higher than between random ones. Overall, for each pair of duplicate sentences we can generate R random negative examples and find out the position of the correct duplicate among them.
For example, suppose we have the question "Exceptions What really happens" and we know that the question "How does the catch keyword determine the type of exception that was thrown" is its duplicate. Our model doesn't know this and has to pick the best option from a pool that also contains questions like "How Can I Make These Links Rotate in PHP", "NSLog array description not memory address" and "PECL_HTTP not recognised php ubuntu". The goal of the model is to rank all these 4 questions (1 positive and R = 3 negative) so that the correct one comes first.
However, it is unrealistic to expect that the correct candidate will always be in first place. So let us consider the position of the correct duplicate in the sorted list of candidates and formulate a metric based on it. We fix some K, a reasonable number of top-ranked elements, and N, the number of queries (the size of the sample).
The first, simple metric is the fraction of correct hits for some K: $$ \text{Hits@K} = \frac{1}{N}\sum_{i=1}^N \, [dup_i \in topK(q_i)],$$
where $q_i$ is the i-th query, $dup_i$ is its duplicate, topK($q_i$) is the top K of the ranked sentences provided by our model, and $[\cdot]$ equals 1 if the condition holds and 0 otherwise.
The second one is a simplified DCG metric:
$$ DCG@K = \frac{1}{N} \sum_{i=1}^N\frac{1}{\log_2(1+rank_{dup_i})}\cdot[rank_{dup_i} \le K],$$
where $rank_{dup_i}$ is the position of the duplicate in the sorted list of the nearest sentences for the query $q_i$. According to this metric, the model gets a higher reward for a higher position of the correct answer; if the answer does not appear in the topK at all, the reward is zero.
Let's calculate the described metrics for the toy example introduced above. Consider the following ranking of the candidates (with the duplicate in second place):
1. "How Can I Make These Links Rotate in PHP"
2. "How does the catch keyword determine the type of exception that was thrown"
3. "NSLog array description not memory address"
4. "PECL_HTTP not recognised php ubuntu"
Using the ranking above, calculate the Hits@K metric for K = 1, 2, 4. Then calculate the DCG@K metric for the same values of K.
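As a check on your arithmetic: with the duplicate in second place and a single query (N = 1), Hits@1 = 0 because the duplicate is not in the top 1, while Hits@2 = Hits@4 = 1. Similarly, DCG@1 = 0, while DCG@2 = DCG@4 = $\frac{1}{\log_2(1+2)} \approx 0.63$.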
Tasks 2 and 3 (HitsCount and DCGScore). Implement the functions hits_count and dcg_score as described above.
In [ ]:
def hits_count(best_ranks, k):
    """
    best_ranks: list with the rank of the duplicate for each query (the best rank is 1, the worst equals the number of candidates)
    k: number of top-ranked elements to consider
    result: float number
    """
    ######################################
    ######### YOUR CODE HERE #############
    ######################################
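For reference, a hedged one-line sketch of the same idea, under a hypothetical name so it doesn't overwrite your graded function:
In [ ]:
def hits_count_sketch(best_ranks, k):
    # Fraction of queries whose duplicate is ranked within the top k.
    return float(np.mean([rank <= k for rank in best_ranks]))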
Test your code on the tiny examples:
In [ ]:
def test_hits():
    answers = ['woman', 'man']
    candidates_ranking = [['woman', 'queen'], ['man', 'king']]
    best_ranks = [1, 1]
    correct_answers = [1, 1]
    for k in range(1, 3):
        if not np.isclose(hits_count(best_ranks, k), correct_answers[k - 1]):
            return "Check the function."
    candidates_ranking = [['woman', 'queen'], ['king', 'man']]
    best_ranks = [1, 2]
    correct_answers = [0.5, 1]
    for k in range(1, 3):
        if not np.isclose(hits_count(best_ranks, k), correct_answers[k - 1]):
            return "Check the function."
    return "Basic tests are passed."
In [ ]:
print(test_hits())
In [ ]:
def dcg_score(best_ranks, k):
    """
    best_ranks: list with the rank of the duplicate for each query (the best rank is 1, the worst equals the number of candidates)
    k: number of top-ranked elements to consider
    result: float number
    """
    ######################################
    ######### YOUR CODE HERE #############
    ######################################
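Again, a hedged sketch under a hypothetical name:
In [ ]:
def dcg_score_sketch(best_ranks, k):
    # Average discounted gain; a query contributes 0 if its duplicate is below position k.
    return float(np.mean([1.0 / np.log2(1.0 + rank) if rank <= k else 0.0
                          for rank in best_ranks]))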
In [ ]:
def test_dcg():
    answers = ['woman', 'man']
    candidates_ranking = [['woman', 'queen'], ['man', 'king']]
    best_ranks = [1, 1]
    correct_answers = [1.0, 1.0]
    for k in range(1, 3):
        if not np.isclose(dcg_score(best_ranks, k), correct_answers[k - 1]):
            return "Check the function."
    candidates_ranking = [['woman', 'queen'], ['king', 'man']]
    best_ranks = [1, 2]
    correct_answers = [0.5, 0.8154]
    for k in range(1, 3):
        if not np.isclose(dcg_score(best_ranks, k), correct_answers[k - 1], atol=1e-03):
            return "Check the function."
    return "Basic tests are passed."
In [ ]:
print(test_dcg())
Submit results of the functions hits_count and dcg_score for the following examples to earn the points.
In [ ]:
test_examples = [
[1],
[1, 2],
[2, 1],
[1, 2, 3],
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[9, 5, 4, 2, 8, 10, 7, 6, 1, 3],
[4, 3, 5, 1, 9, 10, 7, 8, 2, 6],
[5, 1, 7, 6, 2, 3, 8, 9, 10, 4],
[6, 3, 1, 4, 7, 2, 9, 8, 10, 5],
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
]
In [ ]:
hits_results = []
for example in test_examples:
    for k in range(len(example)):
        hits_results.append(hits_count(example, k + 1))
grader.submit_tag('HitsCount', array_to_string(hits_results))
In [ ]:
dcg_results = []
for example in test_examples:
    for k in range(len(example)):
        dcg_results.append(dcg_score(example, k + 1))
grader.submit_tag('DCGScore', array_to_string(dcg_results))
We will work with predefined train, validation, and test corpora. All the files are tab-separated, but have different formats: the train corpus contains pairs of similar sentences on the same row; the validation corpus contains the columns question, similar question, negative example 1, negative example 2, ... (so the first candidate is always the duplicate); the test corpus contains the columns question, example 1, example 2, ...
The validation corpus will be used for the intermediate evaluation of the models. The test data will be used to submit the quality of your model to the grading system.
Now load the validation corpus to evaluate the current solution.
In [ ]:
def read_corpus(filename):
    data = []
    for line in open(filename, encoding='utf-8'):
        data.append(line.strip().split('\t'))
    return data
In [ ]:
validation = ######### YOUR CODE HERE #############
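A minimal sketch for the cell above; the path data/validation.tsv is an assumption based on the data/ directory used earlier in this notebook.
In [ ]:
validation = read_corpus('data/validation.tsv')  # the path is an assumption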
In [ ]:
from sklearn.metrics.pairwise import cosine_similarity
We will use the cosine distance to rank candidate questions, which you need to implement in the function rank_candidates. The function should return a sorted list of pairs (initial position in the candidates list, question); the index of a pair in this list corresponds to its rank (the first is the best). For example, if the list of candidates was [a, b, c] and the most similar is c, followed by a and then b, the function should return the list [(2, c), (0, a), (1, b)].
Pay attention if you use the function cosine_similarity from sklearn.metrics.pairwise, because it works in the opposite way to a distance: the most similar objects have the greatest similarity, so you need to sort in descending order.
In [ ]:
def rank_candidates(question, candidates, embeddings, dim=300):
    """
    question: a string
    candidates: a list of strings (candidates) which we want to rank
    embeddings: some embeddings
    dim: dimension of the current embeddings
    result: a list of pairs (initial position in the list, question)
    """
    ######################################
    ######### YOUR CODE HERE #############
    ######################################
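A hedged sketch of one way to implement it, reusing question_to_vec and cosine_similarity. Ties are broken arbitrarily here; the graded version may need a stable ordering.
In [ ]:
def rank_candidates_sketch(question, candidates, embeddings, dim=300):
    # Embed the query and all the candidates.
    question_vec = question_to_vec(question, embeddings, dim).reshape(1, -1)
    candidate_vecs = np.array([question_to_vec(c, embeddings, dim) for c in candidates])
    # Similarity of the query to each candidate (higher means more similar).
    similarities = cosine_similarity(question_vec, candidate_vecs)[0]
    # Sort candidate indices by descending similarity.
    order = np.argsort(-similarities)
    return [(int(i), candidates[i]) for i in order]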
Test your code on the tiny examples:
In [ ]:
def test_rank_candidates():
    questions = ['converting string to list', 'Sending array via Ajax fails']
    candidates = [['Convert Google results object (pure js) to Python object',
                   'C# create cookie from string and send it',
                   'How to use jQuery AJAX for an outside domain?'],
                  ['Getting all list items of an unordered list in PHP',
                   'WPF- How to update the changes in list item of a list',
                   'select2 not displaying search results']]
    results = [[(1, 'C# create cookie from string and send it'),
                (0, 'Convert Google results object (pure js) to Python object'),
                (2, 'How to use jQuery AJAX for an outside domain?')],
               [(0, 'Getting all list items of an unordered list in PHP'),
                (2, 'select2 not displaying search results'),
                (1, 'WPF- How to update the changes in list item of a list')]]
    for question, q_candidates, result in zip(questions, candidates, results):
        ranks = rank_candidates(question, q_candidates, wv_embeddings, 300)
        if not np.all(ranks == result):
            return "Check the function."
    return "Basic tests are passed."
In [ ]:
print(test_rank_candidates())
Now we can test the quality of the current approach. Run the next two cells to get the results. Note that computing the similarities between the vectors takes time: these cells run for approximately 10 minutes.
In [ ]:
wv_ranking = []
for line in validation:
    q, *ex = line
    ranks = rank_candidates(q, ex, wv_embeddings)
    wv_ranking.append([r[0] for r in ranks].index(0) + 1)
In [ ]:
for k in [1, 5, 10, 100, 500, 1000]:
    print("DCG@%4d: %.3f | Hits@%4d: %.3f" % (k, dcg_score(wv_ranking, k), k, hits_count(wv_ranking, k)))
If you did all the steps correctly, you should be frustrated by the results. Let's try to understand why the quality is so low. First of all, when you work with data, it is necessary to have an idea of what it looks like. Print several questions from the data:
In [ ]:
for line in validation[:3]:
    q, *examples = line
    print(q, *examples[:3])
As you can see, we are dealing with raw data: it contains many punctuation marks, special characters, and letters that are not lowercased. In our case, this can prevent us from finding some embeddings, e.g. for the token "grid?".
To solve this problem, use the function text_prepare from the previous assignments to prepare the data.
In [ ]:
from util import text_prepare
Now transform all the questions from the validation set:
In [ ]:
prepared_validation = []
for line in validation:
    ######### YOUR CODE HERE #############
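One possible body for the loop above, as a hedged sketch:
In [ ]:
prepared_validation = []
for line in validation:
    prepared_validation.append([text_prepare(q) for q in line])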
Let's evaluate the approach again after the preparation:
In [ ]:
wv_prepared_ranking = []
for line in prepared_validation:
    q, *ex = line
    ranks = rank_candidates(q, ex, wv_embeddings)
    wv_prepared_ranking.append([r[0] for r in ranks].index(0) + 1)
In [ ]:
for k in [1, 5, 10, 100, 500, 1000]:
    print("DCG@%4d: %.3f | Hits@%4d: %.3f" % (k, dcg_score(wv_prepared_ranking, k),
                                              k, hits_count(wv_prepared_ranking, k)))
Now also prepare the train and test data, because you will need them later:
In [ ]:
def prepare_file(in_, out_):
    out = open(out_, 'w')
    for line in open(in_, encoding='utf8'):
        line = line.strip().split('\t')
        new_line = [text_prepare(q) for q in line]
        print(*new_line, sep='\t', file=out)
    out.close()
In [ ]:
######################################
######### YOUR CODE HERE #############
######################################
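A hedged sketch of the calls you might make here; all four file names are assumptions about how the data is laid out, not graded paths.
In [ ]:
# The file names are assumptions for illustration.
prepare_file('data/train.tsv', 'data/prepared_train.tsv')
prepare_file('data/test.tsv', 'data/prepared_test.tsv')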
Task 4 (W2VTokenizedRanks). For each question from the prepared test.tsv, submit the ranks of the candidates to earn the points. It should take about 3-5 minutes. Pay attention that the function rank_candidates returns a ranking, while here you should submit the position of each candidate in that ranking. Ranks should start with 1.
In [ ]:
from util import matrix_to_string
In [ ]:
w2v_ranks_results = []
prepared_test_data = ######### YOUR CODE HERE #############
for line in open(prepared_test_data):
    q, *ex = line.strip().split('\t')
    ranks = rank_candidates(q, ex, wv_embeddings, 300)
    ranked_candidates = [r[0] for r in ranks]
    w2v_ranks_results.append([ranked_candidates.index(i) + 1 for i in range(len(ranked_candidates))])
grader.submit_tag('W2VTokenizedRanks', matrix_to_string(w2v_ranks_results))
Now you are ready to train your own word embeddings! In particular, you will train embeddings specifically for our task of duplicate detection. Unfortunately, StarSpace cannot be run on Windows, so we recommend using the provided Docker container or other alternatives.
The main point of this section is that StarSpace can be trained specifically for a given task. In contrast to the word2vec model, which trains similar embeddings for words that occur in similar contexts, StarSpace builds embeddings for whole sentences (as a sum of the embeddings of their words and phrases). Although in both cases we get word embeddings as a result of the training, StarSpace embeddings are trained on supervised data, e.g. a set of similar sentence pairs, and thus can better suit the task at hand.
In our case, StarSpace should use two types of sentence pairs for training: "positive" and "negative". "Positive" examples are extracted from the train sample (duplicates, high similarity) and the "negative" examples are generated randomly (low similarity assumed).
Normally, you would start with some default choice and then run extensive experiments to compare different strategies. However, we have recommendations for the training parameters ready for you, so follow them to save time; a hedged example invocation is sketched below.
Train StarSpace embeddings for unigrams on the train dataset. You don't need to change the format of the input data, but don't forget to use the prepared version of the training data.
If you follow the instructions, the training process will take about 1 hour.
In [ ]:
######### TRAINING HAPPENING HERE #############
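As a starting point, here is a hedged sketch of one possible StarSpace invocation from a notebook cell. The binary location, file names, and every flag value are illustrative assumptions rather than the official recommended settings; in StarSpace, trainMode 3 with fileFormat labelDoc samples positive sentence pairs from the sentences on each input line, which matches our training data of duplicate pairs.
In [ ]:
# All values below are assumptions for illustration; tune them per the recommendations.
!./starspace train \
  -trainFile data/prepared_train.tsv \
  -model starspace_model \
  -trainMode 3 \
  -fileFormat labelDoc \
  -ngrams 1 \
  -dim 100 \
  -epoch 5 \
  -similarity cosine \
  -minCount 2 \
  -adagrad true \
  -verbose true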
And now we can compare the new embeddings with the previous ones. You can find the trained word vectors in the file [model_file_name].tsv. Load the embeddings from StarSpace into a dict.
In [ ]:
starspace_embeddings = ######### YOUR CODE HERE #############
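A hedged sketch for loading the TSV produced by StarSpace into a dict of numpy arrays; the file name starspace_model.tsv is an assumption that matches the -model name sketched above.
In [ ]:
starspace_embeddings = {}
for line in open('starspace_model.tsv', encoding='utf-8'):  # the file name is an assumption
    word, *vector = line.strip().split('\t')
    starspace_embeddings[word] = np.array(vector, dtype=np.float32)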
In [ ]:
ss_prepared_ranking = []
for line in prepared_validation:
    q, *ex = line
    ranks = rank_candidates(q, ex, starspace_embeddings, 100)
    ss_prepared_ranking.append([r[0] for r in ranks].index(0) + 1)
In [ ]:
for k in [1, 5, 10, 100, 500, 1000]:
    print("DCG@%4d: %.3f | Hits@%4d: %.3f" % (k, dcg_score(ss_prepared_ranking, k),
                                              k, hits_count(ss_prepared_ranking, k)))
Because these embeddings are trained for the particular task on supervised data, you should expect them to achieve higher quality than the previous approach. In addition, despite the fact that StarSpace's trained vectors have a smaller dimension than word2vec's, they provide better results on this task.
Task 5 (StarSpaceRanks). For each question from the prepared test.tsv, submit the ranks of the candidates for the trained representation.
In [ ]:
starspace_ranks_results = []
prepared_test_data = ######### YOUR CODE HERE #############
for line in open(prepared_test_data):
    q, *ex = line.strip().split('\t')
    ranks = rank_candidates(q, ex, starspace_embeddings, 100)
    ranked_candidates = [r[0] for r in ranks]
    starspace_ranks_results.append([ranked_candidates.index(i) + 1 for i in range(len(ranked_candidates))])
grader.submit_tag('StarSpaceRanks', matrix_to_string(starspace_ranks_results))
In [ ]:
STUDENT_EMAIL = # EMAIL
STUDENT_TOKEN = # TOKEN
grader.status()
If you want to submit these answers, run the cell below:
In [ ]:
grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)