In [4]:
import enchant
In [5]:
# The underlying programming model provided by the Enchant library is based on the notion of Providers.
# A provider is a piece of code that provides spell-checking services which Enchant can use to perform its work.
# Different providers exist for performing spellchecking using different frameworks -
# for example there is an aspell provider and a MySpell provider.
## no need to work with brokers directly when using enchant; this is just a quick check that everything is installed
b = enchant.Broker()
print(b.describe())
b.list_dicts()
Out[5]:
In [6]:
enchant.list_languages()
Out[6]:
In [7]:
d = enchant.Dict("it_IT")
In [8]:
d.check('Giulia'), d.check('pappapero')
Out[8]:
In [9]:
print( d.suggest("potreima") )
print( d.suggest("marema") )
print( d.suggest("se metto troppe parole lo impallo") )
print( d.suggest("van no") )
print( d.suggest("due parole") )
In [10]:
# Dict objects can also be used to check words against a custom list of correctly-spelled words
# known as a Personal Word List. This is simply a file listing the words to be considered, one word per line.
# The following example creates a Dict object for the personal word list stored in “mywords.txt”:
pwl = enchant.request_pwl_dict("../../Data_nlp/mywords.txt")
In [11]:
pwl.check('pappapero'), pwl.suggest('cittin'), pwl.check('altro')
Out[11]:
In [12]:
# PyEnchant also provides the class DictWithPWL which can be used to combine a language dictionary
# and a personal word list file:
d2 = enchant.DictWithPWL("it_IT", "../../Data_nlp/mywords.txt")
In [13]:
d2.check('altro') and d2.check('pappapero'), d2.suggest('cittin')
Out[13]:
In [14]:
%%timeit
d2.suggest('poliza')
In [15]:
from enchant.checker import SpellChecker
chkr = SpellChecker("it_IT")
In [16]:
chkr.set_text("questo è un picclo esmpio per dire cm funziona")
for err in chkr:
    print(err.word)
    print(chkr.suggest(err.word))
In [17]:
print(chkr.word, chkr.wordpos)
In [18]:
chkr.replace('pippo')
chkr.get_text()
Out[18]:
As explained above, the module enchant.tokenize provides the ability to split text into its component words. The current implementation is based only on the rules for the English language, and so might not be completely suitable for your language of choice. Fortunately, it is straightforward to extend the functionality of this module.
To implement a new tokenization routine for the language TAG, simply create a class/function “tokenize” within the module “enchant.tokenize.TAG”. This function will automatically be detected by the module’s get_tokenizer function and used when appropriate. The easiest way to accomplish this is to copy the module “enchant.tokenize.en” and modify it to suit your needs.
In [19]:
from enchant.tokenize import get_tokenizer
tknzr = get_tokenizer("en_US") # not tak for it_IT up to now
[w for w in tknzr("this is some simple text")]
Out[19]:
In [20]:
from enchant.tokenize import get_tokenizer, HTMLChunker
tknzr = get_tokenizer("en_US")
[w for w in tknzr("this is <span class='important'>really important</span> text")]
Out[20]:
In [28]:
tknzr = get_tokenizer("en_US",chunkers=(HTMLChunker,))
[w for w in tknzr("this is <span class='important'>really important</span> text")]
Out[28]:
In [21]:
from enchant.tokenize import get_tokenizer, EmailFilter
tknzr = get_tokenizer("en_US")
[w for w in tknzr("send an email to fake@example.com please")]
Out[21]:
In [22]:
tknzr = get_tokenizer("en_US", filters = [EmailFilter])
[w for w in tknzr("send an email to fake@example.com please")]
Out[22]:
Other modules:
The module enchant.checker.CmdLineChecker provides the class CmdLineChecker, which can be used to interactively check the spelling of some text. It uses standard input and standard output to interact with the user through a command-line interface. The code below shows how to create and use this class from within a Python application:
The module enchant.checker.wxSpellCheckerDialog provides the class wxSpellCheckerDialog which can be used to interactively check the spelling of some text. The code below shows how to create and use such a dialog from within a wxPython application.
In [23]:
import gensim, logging
from gensim.models import Word2Vec
In [32]:
model = gensim.models.KeyedVectors.load_word2vec_format(
'../../Data_nlp/GoogleNews-vectors-negative300.bin.gz', binary=True)
In [33]:
model.doesnt_match("breakfast brian dinner lunch".split())
Out[33]:
In [35]:
# pass a file of word pairs ("w1 w2 expected_similarity" per line) to check whether the
# model's similarities correlate with the human-assigned ones;
# a pairs file is required -- the path below is a placeholder
model.evaluate_word_pairs('../../Data_nlp/wordpairs.tsv')
In [36]:
len(model.index2word)
Out[36]:
In [37]:
# check analogy accuracy against premade groups of test questions
questions_words = model.accuracy('../../Data_nlp/word2vec/trunk/questions-words.txt')
phrases_words = model.accuracy('../../Data_nlp/word2vec/trunk/questions-phrases.txt')
In [38]:
questions_words[4]['incorrect']
Out[38]:
In [39]:
print( model.n_similarity(['pasta'], ['spaghetti']) )
print( model.n_similarity(['pasta'], ['tomato']) )
print( model.n_similarity(['pasta'], ['car']) )
print( model.n_similarity(['cat'], ['dog']) )
In [40]:
model.similar_by_vector( model.word_vec('welcome') )
Out[40]:
In [41]:
model.similar_by_word('welcome')
Out[41]:
In [42]:
model.syn0[4,]
Out[42]:
In [43]:
model.index2word[4]
Out[43]:
In [44]:
model.word_vec('is')
Out[44]:
In [45]:
model.syn0norm[4,]
Out[45]:
In [46]:
model.vector_size
Out[46]:
In [47]:
import numpy as np
model.similar_by_vector( (model.word_vec('Goofy') + model.word_vec('Minni'))/2 )
Out[47]:
In [48]:
import pyemd
# This method only works if `pyemd` is installed (can be installed via pip, but requires a C compiler).
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()
# Remove their stopwords.
import nltk
stopwords = nltk.corpus.stopwords.words('english')
sentence_obama = [w for w in sentence_obama if w not in stopwords]
sentence_president = [w for w in sentence_president if w not in stopwords]
# Compute WMD.
distance = model.wmdistance(sentence_obama, sentence_president)
print(distance)
In [49]:
import nltk
stopwords = nltk.corpus.stopwords.words('english')
def sentence_distance(s1, s2):
    s1_words = [w for w in s1.split() if w not in stopwords]
    s2_words = [w for w in s2.split() if w not in stopwords]
    print(s1_words, s2_words, sep='\t')
    print(model.wmdistance(s1_words, s2_words), end='\n\n')
In [50]:
sentence_distance('I run every day in the morning', 'I like football')
sentence_distance('I run every day in the morning', 'I run since I was born')
sentence_distance('I run every day in the morning', 'you are idiot')
sentence_distance('I run every day in the morning', 'Are you idiot?')
sentence_distance('I run every day in the morning', 'Is it possible to die?')
sentence_distance('I run every day in the morning', 'Is it possible to die')
sentence_distance('I run every day in the morning', 'I run every day')
sentence_distance('I run every day in the morning', 'I eat every day')
sentence_distance('I run every day in the morning', 'I have breakfast in the morning')
sentence_distance('I run every day in the morning', 'I have breakfast every day in the morning')
sentence_distance('I run every day in the morning', 'Each day I run')
sentence_distance('I run every day in the morning', 'I run every day in the morning')
In [51]:
sentence_distance('I run every day in the morning', 'Each day I run')
sentence_distance('I run every day in the morning', 'Each I run')
sentence_distance('I run every day in the morning', 'Each day run')
sentence_distance('I run every day in the morning', 'Each day I')
sentence_distance('I every day in the morning', 'Each day I run')
sentence_distance('I run day in the morning', 'Each day I run')
sentence_distance('I run every in morning', 'Each day I run')
sentence_distance('I run every in', 'Each day I run')
In [52]:
def get_vect(w):
    # fall back to a zero vector for out-of-vocabulary words
    try:
        return model.word_vec(w)
    except KeyError:
        return np.zeros(model.vector_size)

def calc_avg(s):
    # average the word vectors of the non-stopword tokens
    ws = [get_vect(w) for w in s.split() if w not in stopwords]
    return sum(ws) / len(ws)

from scipy.spatial import distance

def get_euclidean(s1, s2):
    return distance.euclidean(calc_avg(s1), calc_avg(s2))
In [53]:
# same questions
s1 = 'Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?'
s2 = "I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?"
sentence_distance(s1, s2)
print(get_euclidean(s1, s2))
In [54]:
# same questions as above without punctuations
s1 = 'Astrology I am a Capricorn Sun Cap moon and cap rising what does that say about me'
s2 = "I am a triple Capricorn Sun Moon and ascendant in Capricorn What does this say about me"
sentence_distance(s1, s2)
print(get_euclidean(s1, s2))
In [55]:
# same questions
s1 = 'What is best way to make money online'
s2 = 'What is best way to ask for money online?'
sentence_distance(s1,s2)
print(get_euclidean(s1, s2))
In [56]:
# different questions
s1 = 'How did Darth Vader fought Darth Maul in Star Wars Legends?'
s2 = 'Does Quora have a character limit for profile descriptions?'
sentence_distance(s1,s2)
print(get_euclidean(s1, s2))
In [57]:
# the order of the words doesn't change the distance between the two phrases
s1ws = [w for w in s1.split() if w not in stopwords]
s2ws = [w for w in s2.split() if w not in stopwords]
print(model.wmdistance(s1ws, s2ws) )
print(model.wmdistance(s1ws[::-1], s2ws) )
print(model.wmdistance(s1ws, s2ws[::-1]) )
print(model.wmdistance(s1ws[3:]+s1ws[0:3], s2ws[::-1]) )
conclusion: WMD treats sentences as bags of words, so reordering the words leaves the distance unchanged.
In [60]:
from googletrans import Translator
In [61]:
with open("../../AliceNelPaeseDelleMeraviglie.txt") as f:
    text = f.read()
In [62]:
translator = Translator()
In [63]:
for i in range(42, 43):
    print(text[i * 1000:i * 1000 + 1000], end='\n\n')
    print(translator.translate(text[i * 1000:i * 1000 + 1000], dest='en').text)
In [64]:
## if the source language is not passed it is auto-detected, so the translator can also be used to detect a language
frase = "Ciao Giulia, ti va un gelato?"
det = translator.detect(frase)
print("Languge:", det.lang, " with confidence:", det.confidence)
In [65]:
# command-line usage, though it doesn't seem to work for me
!translate "veritas lux mea" -s la -d en
In [66]:
translations = translator.translate(
['The quick brown fox', 'jumps over', 'the lazy dog'], dest='ko')
for translation in translations:
    print(translation.origin, ' -> ', translation.text)
In [67]:
phrase = translator.translate(frase, 'en')
phrase.origin, phrase.text, phrase.src, phrase.pronunciation, phrase.dest
Out[67]:
How to install:
Info:
In [70]:
from treetagger import TreeTagger
tt = TreeTagger(language='english')
tt.tag('What is the airspeed of an unladen swallow?')
Out[70]:
In [71]:
tt = TreeTagger(language='italian')
tt.tag('Proviamo a vedere un pò se funziona bene questo tagger')
Out[71]: