Load the dataset

The text8 dataset used to train the word2vec model is taken from http://mattmahoney.net/dc/; if the zip file is missing it is downloaded again. The sample contains roughly 17M words, of which roughly 200k are unique. The download can take a while. Training is then done in the run_Word2Vec module; we load both it and Load_Text_Set, the interface to the data.
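The download-if-missing behaviour follows the usual pattern sketched below; this is a minimal sketch only (the real implementation lives in Load_Text_Set and may differ in names and details):

import os
import urllib.request

URL = 'http://mattmahoney.net/dc/'
EXPECTED_BYTES = 31344016  # known size of text8.zip, used to verify the download

def maybe_download(filename='text8.zip'):
    # Fetch the file only if it is not already on disk, then check its size.
    if not os.path.exists(filename):
        print('File is missing, retrieving ', URL + filename)
        filename, _ = urllib.request.urlretrieve(URL + filename, filename)
    if os.stat(filename).st_size != EXPECTED_BYTES:
        raise Exception('Failed to verify ' + filename)
    print('Found and verified', filename)
    return filename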


In [1]:
import Load_Text_Set as l_data
import run_Word2Vec as w2v


File is missing, retrieving  http://mattmahoney.net/dc/text8.zip
Found and verified text8.zip 
Words in file  17005207

Train the model

Run the training and return the final normalized embeddings. The model can be trained using skip-gram or continuous bag of words (CBOW), controlled by a switch in the file _tfWord2Vec.py. Other parameters of the model are also hard-coded there.
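For reference, the two objectives differ only in how training pairs are cut from a context window. A toy sketch of the distinction (this is not the code in _tfWord2Vec.py):

def make_pairs(token_ids, window=2, mode='skipgram'):
    # skipgram: predict each context word from the centre word.
    # cbow: predict the centre word from its surrounding context.
    pairs = []
    for i, centre in enumerate(token_ids):
        context = [token_ids[j]
                   for j in range(max(0, i - window), min(len(token_ids), i + window + 1))
                   if j != i]
        if mode == 'skipgram':
            pairs.extend((centre, c) for c in context)  # (input, target)
        else:
            pairs.append((context, centre))             # (inputs, target)
    return pairs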


In [10]:
words = l_data.text_8(200000)
embeddings = w2v.run_embeddings()


Found and verified text8.zip 
Words in file  17005207
Initialized embeddings
(200000, 128)
Initialized
Average loss at step 0: 9.255837
Nearest to by: chickpea, device, filming, compound, homebase, stergiopoulos, jahrb, durante,
Nearest to people: alexandrov, cherno, slipper, phaenomenologica, getaway, kiskil, lushan, ajiva,
Nearest to four: karachaganak, kirsan, utility, ngen, mucel, arguers, stilpo, kirundo,
Nearest to can: tylos, berdan, gnd, trollish, aumeier, brilliants, specte, lorenzattractor,
Nearest to d: pk, enumerate, surr, valkyries, succoth, vakatakas, onesicritus, mirv,
Nearest to been: burgundes, mudhol, chene, schemer, fiurenzu, lavey, tarcher, pastries,
Nearest to so: hydrocactus, perata, mora, cursing, dkos, lookalikes, diaphragmic, langland,
Nearest to while: halite, butkus, zapp, chdir, mascott, inbox, toolchain, grunt,
Nearest to seven: sprengel, leavelle, pertain, improvisers, sportbund, friable, ionization, backlighting,
Nearest to were: skvirsky, cushioned, lantern, computationally, seychellois, voluntaryism, oleo, yuba,
Nearest to UNK: father, webspace, leva, cased, atropat, caldara, ussy, ipsilateral,
Nearest to system: socoh, vardy, caimans, phumi, mikel, oilgate, biogenic, prescence,
Nearest to have: gelling, andechs, oliwa, kielten, hits, pepelu, vaishali, homebush,
Nearest to see: jestem, earnest, edonkey, negreiros, stiggs, roundway, muqtana, vulcain,
Nearest to after: gonzo, snorra, reservoirs, udalski, hatounian, hasidim, kocher, gedara,
Nearest to five: ciliophora, lateline, bonaj, square, circumvent, elapsed, harrigan, patission,
Average loss at step 2000: 4.520401
Average loss at step 4000: 3.777777
Average loss at step 6000: 3.549556
Average loss at step 8000: 3.378672
Average loss at step 10000: 3.282246
Nearest to by: dragut, with, cooeeing, for, caveats, from, quadrivium, adn,
Nearest to people: years, kiskil, crouched, furies, disobey, cordoning, getaway, ideas,
Nearest to four: six, eight, seven, five, three, nine, zero, two,
Nearest to can: may, would, could, will, should, must, to, makrani,
Nearest to d: b, anuran, waive, biometrical, pelee, rula, biopreparat, aerarium,
Nearest to been: schemer, be, stejneger, developed, sepia, lavey, info, wyche,
Nearest to so: hydrocactus, dkos, ljungby, mora, langland, hortons, cleomedes, hrithik,
Nearest to while: jasta, thielmann, mascott, for, judiciously, baffin, yevsektsiya, ariadne,
Nearest to seven: eight, six, nine, four, five, three, zero, two,
Nearest to were: are, was, have, had, brinsley, ditko, astyanax, tenchian,
Nearest to UNK: webspace, leva, cased, atropat, caldara, ussy, ipsilateral, chey,
Nearest to system: titanomachy, vardy, ogopogo, caimans, mikel, hlfic, dummar, radiological,
Nearest to have: had, has, were, be, are, creature, vaishali, blacklight,
Nearest to see: aos, nanotubes, sedates, premade, zezuru, heer, kindergartens, palmiest,
Nearest to after: brustein, wiwaxia, from, typographies, snorra, when, hodierna, rossman,
Nearest to five: four, six, three, seven, eight, nine, zero, two,
Average loss at step 12000: 3.286572
Average loss at step 14000: 3.238507
Average loss at step 16000: 3.256216
Average loss at step 18000: 3.187976
Average loss at step 20000: 3.039966
Nearest to by: dragut, hpcc, farago, was, hundertwasserhaus, durey, chickpea, spectroscopic,
Nearest to people: ideas, furies, crouched, rideau, caradon, correcting, getaway, grignard,
Nearest to four: six, three, eight, five, seven, two, nine, zero,
Nearest to can: may, would, will, could, must, should, might, cannot,
Nearest to d: b, anuran, waive, pelee, terni, neurodevelopmental, peterzano, semantically,
Nearest to been: schemer, be, become, was, polytopes, were, bristow, stejneger,
Nearest to so: mla, libby, mora, ljungby, datadisk, lifecycle, fox, ishq,
Nearest to while: although, though, however, and, but, casamayor, including, ariadne,
Nearest to seven: eight, six, nine, five, three, four, two, zero,
Nearest to were: are, have, was, sarcocheilichthys, pamper, tenchian, ditko, being,
Nearest to UNK: webspace, leva, cased, atropat, caldara, ussy, ipsilateral, chey,
Nearest to system: systems, ngm, ely, vardy, strolled, suttee, stratocaster, aramean,
Nearest to have: has, had, were, are, be, creature, include, fce,
Nearest to see: sedates, belemnite, heer, comunali, but, kindergartens, praesidentiae, palmiest,
Nearest to after: before, when, wiwaxia, snorra, brustein, until, usurer, abdank,
Nearest to five: six, eight, seven, three, four, nine, two, zero,
Average loss at step 22000: 3.114988
Average loss at step 24000: 3.071707
Average loss at step 26000: 3.043790
Average loss at step 28000: 3.051388
Average loss at step 30000: 3.025266
Nearest to by: durey, exitu, dragut, verein, through, sarit, zmrzl, sagal,
Nearest to people: men, children, alcindor, ideas, crouched, codeworks, macheth, furies,
Nearest to four: five, six, eight, seven, nine, three, two, zero,
Nearest to can: may, could, would, will, must, should, might, cannot,
Nearest to d: b, pelee, anuran, waive, neurodevelopmental, ertzaintza, c, yachtsmen,
Nearest to been: be, become, schemer, polytopes, were, was, homse, already,
Nearest to so: mla, gerund, antoinette, datadisk, less, filmmakers, veel, hinting,
Nearest to while: when, although, though, after, however, before, kefauver, coudreau,
Nearest to seven: eight, six, nine, four, five, three, zero, two,
Nearest to were: are, was, have, had, been, astyanax, narain, informative,
Nearest to UNK: webspace, leva, cased, atropat, caldara, ussy, ipsilateral, chey,
Nearest to system: systems, situation, ely, titanomachy, stratocaster, strolled, iaw, coca,
Nearest to have: had, has, are, were, having, include, webexhibits, iditarod,
Nearest to see: but, references, kindergartens, burle, bouzouki, liberalize, sedates, called,
Nearest to after: before, when, during, until, while, saw, wiwaxia, within,
Nearest to five: four, six, three, eight, seven, nine, zero, two,
Average loss at step 32000: 2.846431
Average loss at step 34000: 2.979519
Average loss at step 36000: 2.965738
Average loss at step 38000: 2.957362
Average loss at step 40000: 2.976418
Nearest to by: when, songfic, through, farago, durey, from, using, in,
Nearest to people: men, children, women, alcindor, crouched, tft, systemizes, fasti,
Nearest to four: five, eight, seven, three, six, nine, two, zero,
Nearest to can: may, could, will, must, would, should, cannot, might,
Nearest to d: b, pelee, wikifiction, anuran, waive, fraction, ertzaintza, penzias,
Nearest to been: become, be, schemer, polytopes, was, maglev, already, were,
Nearest to so: gerund, mla, antoinette, filmmakers, mossadegh, if, sulfuric, veel,
Nearest to while: although, though, however, and, when, but, after, or,
Nearest to seven: eight, six, five, nine, four, three, zero, two,
Nearest to were: are, have, was, had, pamper, include, although, been,
Nearest to UNK: webspace, leva, cased, atropat, caldara, ussy, ipsilateral, chey,
Nearest to system: systems, device, strolled, situation, bouloustra, appell, axonal, process,
Nearest to have: had, has, were, include, are, having, be, webexhibits,
Nearest to see: references, but, include, sedates, posavac, burle, heer, noematic,
Nearest to after: before, when, during, while, despite, recognitio, through, if,
Nearest to five: four, six, seven, eight, three, nine, two, zero,
Average loss at step 42000: 2.965305
Average loss at step 44000: 2.986973
Average loss at step 46000: 2.911373
Average loss at step 48000: 2.869899
Average loss at step 50000: 2.858816
Nearest to by: without, using, through, paco, songfic, quadrivium, hpcc, during,
Nearest to people: men, children, women, players, those, individuals, alcindor, students,
Nearest to four: five, six, seven, eight, three, nine, two, zero,
Nearest to can: may, could, will, must, would, cannot, should, might,
Nearest to d: b, anuran, rula, biometrical, pelee, trevino, waive, catecholamines,
Nearest to been: become, be, schemer, polytopes, already, was, grown, maglev,
Nearest to so: mla, then, gerund, too, alleviating, sometimes, sulfuric, jeremia,
Nearest to while: although, though, when, however, but, after, before, if,
Nearest to seven: eight, six, nine, four, five, three, zero, two,
Nearest to were: are, have, had, was, although, include, been, tonio,
Nearest to UNK: webspace, leva, cased, caldara, atropat, ussy, ipsilateral, chey,
Nearest to system: systems, situation, process, bidinotto, titanomachy, smollett, device, crisscrossed,
Nearest to have: had, has, were, include, having, are, be, provide,
Nearest to see: include, references, moammar, but, includes, posavac, laetitia, can,
Nearest to after: before, when, during, despite, without, until, while, ralegh,
Nearest to five: four, six, seven, three, eight, nine, zero, two,
Average loss at step 52000: 2.907825
Average loss at step 54000: 2.880096
Average loss at step 56000: 2.857745
Average loss at step 58000: 2.757673
Average loss at step 60000: 2.841038
Nearest to by: songfic, tacs, zmrzl, disuade, hpcc, ishshan, gansz, verein,
Nearest to people: men, children, women, players, individuals, sherkin, those, pendantes,
Nearest to four: five, seven, six, eight, three, nine, two, zero,
Nearest to can: could, may, will, would, must, should, might, cannot,
Nearest to d: b, ertzaintza, brahmins, maxima, plagued, yachtsmen, supercpu, pelee,
Nearest to been: become, be, polytopes, schemer, grown, already, mmixmasters, amidock,
Nearest to so: mla, too, gerund, dut, sometimes, antoinette, thus, sulfuric,
Nearest to while: although, though, when, however, before, after, csarevich, are,
Nearest to seven: eight, six, five, nine, four, three, zero, two,
Nearest to were: are, have, had, was, pamper, be, although, arubans,
Nearest to UNK: webspace, leva, cased, caldara, atropat, ussy, ipsilateral, chey,
Nearest to system: systems, process, device, situation, crisscrossed, macnelly, rebalances, strolled,
Nearest to have: had, has, having, were, provide, include, are, be,
Nearest to see: references, include, includes, but, siddons, integralist, moammar, winnetou,
Nearest to after: before, when, during, despite, without, ralegh, while, within,
Nearest to five: four, seven, six, eight, three, nine, zero, two,
Average loss at step 62000: 2.853227
Average loss at step 64000: 2.795200
Average loss at step 66000: 2.759334
Average loss at step 68000: 2.724250
Average loss at step 70000: 2.846458
Nearest to by: durey, without, industrialists, caveats, disuade, paco, zmrzl, songfic,
Nearest to people: women, men, children, players, authors, individuals, words, those,
Nearest to four: three, five, seven, six, eight, two, nine, zero,
Nearest to can: could, will, must, may, would, should, might, cannot,
Nearest to d: b, brahmins, pumi, dmort, biometrical, biopreparat, bickler, wights,
Nearest to been: become, be, already, were, schemer, grown, was, eschnapur,
Nearest to so: mla, too, sometimes, gerund, antoinette, very, mathematik, prevalance,
Nearest to while: although, when, though, before, however, after, during, but,
Nearest to seven: five, eight, six, nine, three, four, zero, two,
Nearest to were: are, have, was, including, had, pamper, include, been,
Nearest to UNK: tautomer, cased, laxmanniaceae, caldara, lavaur, girdler, cedille, ussy,
Nearest to system: systems, process, device, macnelly, project, strolled, archein, crisscrossed,
Nearest to have: had, has, include, are, were, having, provide, be,
Nearest to see: references, includes, kindergartens, include, moammar, integralist, handhelds, rapida,
Nearest to after: before, during, when, despite, while, without, if, from,
Nearest to five: six, seven, eight, four, nine, three, zero, two,
Average loss at step 72000: 2.800327
Average loss at step 74000: 2.632470
Average loss at step 76000: 2.805637
Average loss at step 78000: 2.834397
Average loss at step 80000: 2.784232
Nearest to by: without, using, during, hpcc, durey, tacs, reexported, gansz,
Nearest to people: children, women, men, individuals, jews, scholars, authors, persons,
Nearest to four: five, seven, three, six, eight, nine, two, zero,
Nearest to can: could, must, may, will, would, should, might, cannot,
Nearest to d: b, yachtsmen, pelee, swindlers, denotational, maxima, moreever, brahmins,
Nearest to been: become, be, schemer, polytopes, was, grown, malophoros, already,
Nearest to so: too, mla, sometimes, alexi, very, varnishes, franchot, gerund,
Nearest to while: although, though, when, before, after, where, however, csarevich,
Nearest to seven: six, eight, five, nine, four, three, two, zero,
Nearest to were: are, have, was, had, including, those, include, vai,
Nearest to UNK: vallens, lect, handwraps, hissariik, centralising, ovate, millefiori, fleming,
Nearest to system: systems, process, device, macnelly, archein, situation, crisscrossed, waserfl,
Nearest to have: had, has, include, were, are, refer, provide, produce,
Nearest to see: include, includes, references, integralist, but, kindergartens, laetitia, vialli,
Nearest to after: before, during, despite, when, without, while, ralegh, contempl,
Nearest to five: six, seven, eight, four, three, nine, two, zero,
Average loss at step 82000: 2.659964
Average loss at step 84000: 2.757100
Average loss at step 86000: 2.728796
Average loss at step 88000: 2.746146
Average loss at step 90000: 2.727805
Nearest to by: through, using, including, sarit, zmrzl, durey, maliki, gaillard,
Nearest to people: women, men, children, individuals, players, authors, users, jews,
Nearest to four: six, five, eight, seven, three, nine, two, zero,
Nearest to can: could, would, must, should, will, might, may, cannot,
Nearest to d: b, pelee, yachtsmen, brahmins, swindlers, ceramicists, legio, denotational,
Nearest to been: become, be, were, schemer, grown, polytopes, undergone, mmixmasters,
Nearest to so: too, mla, antoinette, sometimes, gerund, zeners, alleviating, felsite,
Nearest to while: although, though, when, before, including, csarevich, but, however,
Nearest to seven: eight, six, nine, five, four, three, zero, two,
Nearest to were: are, have, had, those, was, been, although, arubans,
Nearest to UNK: amanah, l, te, catalan, shikar, darlene, caesius, san,
Nearest to system: systems, process, macnelly, archein, device, chonas, bidinotto, donors,
Nearest to have: had, has, are, were, include, having, provide, be,
Nearest to see: include, includes, laetitia, but, references, integralist, kindergartens, external,
Nearest to after: before, during, despite, when, without, ralegh, wiwaxia, recognitio,
Nearest to five: four, six, seven, eight, three, nine, zero, two,
Average loss at step 92000: 2.662648
Average loss at step 94000: 2.718229
Average loss at step 96000: 2.693342
Average loss at step 98000: 2.368214
Average loss at step 100000: 2.399122
Nearest to by: through, without, with, from, zmrzl, durey, industrialists, when,
Nearest to people: men, women, children, individuals, authors, persons, citizens, players,
Nearest to four: six, five, seven, three, eight, nine, zero, two,
Nearest to can: could, must, might, will, cannot, should, would, may,
Nearest to d: b, posies, rula, dmort, kretschmer, anuran, lactis, blaisdell,
Nearest to been: become, be, already, were, schemer, undergone, remained, grown,
Nearest to so: too, mla, then, sometimes, therefore, very, felsite, thus,
Nearest to while: although, when, though, before, after, but, however, where,
Nearest to seven: eight, six, four, five, nine, three, zero, two,
Nearest to were: are, was, have, include, had, been, be, is,
Nearest to UNK: hicom, classic, tejaswini, catalan, inadequate, keesville, escudos, usud,
Nearest to system: systems, process, device, macnelly, archein, chonas, bidinotto, crisscrossed,
Nearest to have: had, has, include, having, refer, are, were, contain,
Nearest to see: include, laetitia, includes, references, prays, vinca, inoxia, called,
Nearest to after: before, when, without, despite, during, while, ralegh, for,
Nearest to five: six, four, seven, eight, three, nine, zero, two,

Some crude attempts at sentiment analysis

As part of the project we would like to associate a mood with a memory by assigning a value between 0 and 1 to each of a series of moods:

  • Five moods are scored in this study: joy, sadness, scariness, disgust and anger.

  • For each mood, a set of synonyms is used, and the score is the average cosine similarity between the words of the sentence and the mood's synonyms (see the toy check below).
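Since run_Word2Vec returns normalized embeddings, the dot product of two embedding rows already equals their cosine similarity; the scoring code below leans on this. A two-dimensional toy check (numbers are illustrative):

import numpy as np

u = np.array([0.6, 0.8])   # already unit length
v = np.array([1.0, 0.0])   # already unit length
dot = u @ v                # 0.6
cosine = dot / (np.linalg.norm(u) * np.linalg.norm(v))  # also 0.6
assert np.isclose(dot, cosine)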


In [11]:
import numpy as np
import regex as re

joy_words = ['happy','joy','pleasure','glee']
sad_words = ['sad','unhappy','gloomy']
scary_words = ['scary','frightening','terrifying', 'horrifying']
disgust_words = ['disgust', 'distaste', 'repulsion']
anger_words = ['anger','rage','irritated']

def syn_average(word_id, list_words = []):
    # Average dot product between one word and a list of mood synonyms;
    # the embeddings are normalized, so this is an average cosine similarity.
    to_ret = 0
    count = 0  # count hits in case a word isn't in dict
    for syn in list_words:
        if syn in words.dictionary:
            syn_id = words.dictionary[syn]
            to_ret += np.matmul(embeddings[word_id].reshape(1,128), embeddings[syn_id].reshape(128,1))
            count += 1
        else:
            print(syn," is not in dict")
    return to_ret/count if count else 0

def test(string_words):
    d2happy = 0
    d2sad = 0
    d2scary = 0
    d2disgust = 0
    d2anger = 0
    for a in string_words:
        if a in words.dictionary:
            in_dict = words.dictionary[a]
            d2happy += syn_average(in_dict,joy_words)
            d2sad += syn_average(in_dict,sad_words)
            d2scary += syn_average(in_dict,scary_words)
            d2disgust += syn_average(in_dict,disgust_words)
            d2anger += syn_average(in_dict,anger_words)

    # average over the full sentence; out-of-vocabulary words contribute 0
    d2happy = d2happy/len(string_words)
    d2sad = d2sad/len(string_words)
    d2scary = d2scary/len(string_words)
    d2disgust = d2disgust/len(string_words)
    d2anger = d2anger/len(string_words)
    print(max(d2happy,0),"\t",max(d2sad,0),"\t", max(d2scary,0),"\t", max(d2disgust,0),"\t", max(d2anger,0))

def plot_emotions(top = 8):
    emotion_names = ['joy', 'fear', 'sad', 'disgust', 'anger']
    emotion_ids = [words.dictionary[w] for w in emotion_names]
    # similarity of each emotion word to the whole vocabulary; the full
    # vocabulary-by-vocabulary matrix would not fit in memory, so compute
    # only these few rows
    sim = np.matmul(embeddings[emotion_ids], embeddings.T)
    for i, name in enumerate(emotion_names):
        nearest = (-sim[i, :]).argsort()[1:top+1]
        print('Nearest to ', name, ': ')
        for k in range(top):
            close_word = words.reverse_dictionary[nearest[k]]
            print('\t', close_word)
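plot_emotions is not called in this notebook; a call such as plot_emotions(top=5) would list the five nearest neighbours of each mood word, mirroring the "Nearest to" blocks printed during training.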

Proof of principle - ish

To test that the algorithm is working we run over a few sentences, identified by a human to be happy, scary and angry respectively.

Negative scores (i.e. the sentence's average embedding vector points in the opposite direction to the mood) are set to 0.


In [12]:
happy_string_ = "Even Harry, who knew nothing about the different brooms, thought it looked wonderful. Sleek and shiny, with a mahogany handle, it had a long tail of neat, straight twigs and Nimbus Two Thousand written in gold near the top. As seven o'clock drew nearer, Harry left the castle and set off in the dusk toward the Quidditch field. Held never been inside the stadium before. Hundreds of seats were raised in stands around the field so that the spectators were high enough to see what was going on. At either end of the field were three golden poles with hoops on the end. They reminded Harry of the little plastic sticks Muggle children blew bubbles through, except that they were fifty feet high. Too eager to fly again to wait for Wood, Harry mounted his broomstick and kicked off from the ground. What a feeling -- he swooped in and out of the goal posts and then sped up and down the field. The Nimbus Two Thousand turned wherever he wanted at his lightest touch."
scary_string = "and the next second, Harry felt Quirrell's hand close on his wrist. At once, a needle-sharp pain seared across Harry's scar; his head felt as though it was about to split in two; he yelled, struggling with all his might, and to his surprise, Quirrell let go of him. The pain in his head lessened -- he looked around wildly to see where Quirrell had gone, and saw him hunched in pain, looking at his fingers -- they were blistering before his eyes."
angry_string = 'He’d forgotten all about the people in cloaks until he passed a group of them next to the baker’s. He eyed them angrily as he passed. He didn’t know why, but they made him uneasy. This bunch were whispering  excitedly, too, and he couldn’t see a single collectingtin. It was on his way back past them, clutching a large doughnut in a bag, that he caught a few words of what they were saying.'

happy_string_words = re.sub(r"\p{P}+", "", happy_string_).split()
scary_string_words = re.sub(r"\p{P}+", "", scary_string).split()
angry_string_words = re.sub(r"\p{P}+", "",angry_string).split()
print("\n")
print("Sentence: ")
print(happy_string_)
print("Similarity to: ")
print("happy \t\t sad \t\t scary \t\t disgust \t\t anger")
test(happy_string_words)
print("\n")
print("Sentence: ")
print(scary_string)
print("Similarity to: ")
print("happy \t\t sad \t\t scary \t\t disgust \t\t anger")
test(scary_string_words)
print("\n")
print("Sentence: ")
print(angry_string)
print("Similarity to: ")
print("happy \t\t sad \t\t scary \t\t disgust \t\t anger")
test(angry_string_words)



Sentence: 
Even Harry, who knew nothing about the different brooms, thought it looked wonderful. Sleek and shiny, with a mahogany handle, it had a long tail of neat, straight twigs and Nimbus Two Thousand written in gold near the top. As seven o'clock drew nearer, Harry left the castle and set off in the dusk toward the Quidditch field. Held never been inside the stadium before. Hundreds of seats were raised in stands around the field so that the spectators were high enough to see what was going on. At either end of the field were three golden poles with hoops on the end. They reminded Harry of the little plastic sticks Muggle children blew bubbles through, except that they were fifty feet high. Too eager to fly again to wait for Wood, Harry mounted his broomstick and kicked off from the ground. What a feeling -- he swooped in and out of the goal posts and then sped up and down the field. The Nimbus Two Thousand turned wherever he wanted at his lightest touch.
Similarity to: 
happy 		 sad 		 scary 		 disgust 		 anger
[[ 0.00192716]] 	 [[ 0.01867508]] 	 0 	 [[ 0.00227247]] 	 [[ 0.01164252]]


Sentence: 
and the next second, Harry felt Quirrell's hand close on his wrist. At once, a needle-sharp pain seared across Harry's scar; his head felt as though it was about to split in two; he yelled, struggling with all his might, and to his surprise, Quirrell let go of him. The pain in his head lessened -- he looked around wildly to see where Quirrell had gone, and saw him hunched in pain, looking at his fingers -- they were blistering before his eyes.
Similarity to: 
happy 		 sad 		 scary 		 disgust 		 anger
[[ 0.00046383]] 	 [[ 0.0208657]] 	 0 	 [[ 0.00422002]] 	 [[ 0.0167667]]


Sentence: 
He’d forgotten all about the people in cloaks until he passed a group of them next to the baker’s. He eyed them angrily as he passed. He didn’t know why, but they made him uneasy. This bunch were whispering  excitedly, too, and he couldn’t see a single collectingtin. It was on his way back past them, clutching a large doughnut in a bag, that he caught a few words of what they were saying.
Similarity to: 
happy 		 sad 		 scary 		 disgust 		 anger
0 	 [[ 0.01294638]] 	 0 	 [[ 0.01072488]] 	 [[ 0.02178478]]

Results

While the results are not exactly promising, three examples are in no way statistically significant. Ideally I would like to establish a test dataset of sentences labelled by humans to get a more representative picture of performance.
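A sketch of what such an evaluation could look like, reusing the scoring from syn_average above; the labelled sentences here are made-up placeholders, not real annotations:

# Hypothetical harness; a real labelled set would come from human annotation.
labelled = [
    ("what a wonderful cheerful day", "joy"),
    ("the dark corridor filled him with dread", "scary"),
]

def predict_mood(sentence):
    # Score the sentence against every mood list and return the best match.
    tokens = re.sub(r"\p{P}+", "", sentence).split()
    scores = {}
    for mood, syns in [("joy", joy_words), ("sad", sad_words),
                       ("scary", scary_words), ("disgust", disgust_words),
                       ("anger", anger_words)]:
        total = sum(syn_average(words.dictionary[t], syns)
                    for t in tokens if t in words.dictionary)
        scores[mood] = float(np.squeeze(total)) / len(tokens)
    return max(scores, key=scores.get)

accuracy = sum(predict_mood(s) == m for s, m in labelled) / len(labelled)
print("labelled-set accuracy:", accuracy)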

Outlook

Once we establish a method to gauge the performance of the algorithm, we can try to improve it (if necessary). Potential improvements:

  • Skipgram versus CBOW
  • Larger training data-sets?
  • Which words carry the most weight in a sentence? (Only use some combination of verbs, nouns, adjectives, ...)
  • This was fun, but there are pre-trained models out there that are most likely more efficient and better trained. Use them (see the sketch below).
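For instance, gensim ships downloadable pre-trained vectors; a sketch (assuming the gensim package and the glove-wiki-gigaword-100 model name, both of which may change between releases):

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads on first use
print(vectors.similarity("happy", "joy"))      # cosine similarity of two words
print(vectors.most_similar("anger", topn=5))   # nearest neighbours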

In [ ]: