Importing our wordlists

Here we import all of our wordlists and add them to an array which me can merge at the end.

This wordlists should not be filtered at this point. However they should all contain the same columns to make merging easier for later.



In [2]:

    
wordlists = []

Dictcc

Download the dictionary from http://www.dict.cc/?s=about%3Awordlist

Print out the first 20 lines of the dictionary



In [3]:

    
!head -n 20 de-en.txt









    



# DE-EN vocabulary database	compiled by dict.cc
# Date and time	2016-08-29 23:46
# License	THIS WORK IS PROTECTED BY INTERNATIONAL COPYRIGHT LAWS!
# License	Private use is allowed as long as the data, or parts of it, are not published or given away.
# License	By using this file, you agree to be bound to the Terms of Use published at the following URL:  
# License	http://www.dict.cc/translation_file_request.php
# Contains data from	http://dict.tu-chemnitz.de/ with friendly permission by Frank Richter, TU Chemnitz 
# Brought to you by	Paul Hemetsberger and the users of http://www.dict.cc/, 2002 - 2016

&#945;-Keratin {n}	&#945;-keratin	noun
&#945;-Lactalbumin {n} <&#945;-La>	&#945;-lactalbumin <&#945;-La>	noun
&#946;-Mercaptoethanol {n}	&#946;-mercaptoethanol	noun
&#963;-Algebra {f}	&#963;-field	noun
&#963;-Algebra {f}	sigma algebra	noun
& Co.	and company <& Co.>	
'Die' heißt mein Unterrock, und 'der' hängt im Schrank. [regional] [Satz, mit dem Kinder gerügt werden, die von einer (anwesenden) Frau mit 'die' sprechen]	'She' is the cat's mother. [used to encourage children to use names instead of pronouns to refer to females to whom they should show respect]	
'n Abend allerseits! [ugs.]	Evening all! [coll.]	
'nauf [regional] [hinauf]	up	adv
'Nduja {f} [auch: Nduja]	'nduja [also: nduja]	noun
'ne Macke haben [ugs.]	to be off one's head [coll.]	verb

Use pandas library to import csv file



In [4]:

    
import pandas as pd


dictcc_df = pd.read_csv("de-en.txt", 
                        sep='\t',
                        skiprows=8,
                        header=None, 
                        names=["GermanWord","Word","WordType"])

Preview a few entries of the wordlist



In [5]:

    
dictcc_df[90:100]









    Out[5]:






  
    
      
      GermanWord
      Word
      WordType
    
  
  
    
      90
      (aktiv) Werbung machen für
      to tout
      verb
    
    
      91
      (aktive) Langzeitverbindung {f} [Standverbindu...
      nailed-up connection <NUC>
      noun
    
    
      92
      (aktuelles) Zeitgeschehen {n}
      current events {pl}
      noun
    
    
      93
      (akustisch) verstehen
      to hear
      verb
    
    
      94
      (akustische) Haarzelle {f}
      auditory cell
      noun
    
    
      95
      (akustischer) Dissipationsgrad {m}
      (acoustic) dissipation factor
      noun
    
    
      96
      (akute) Rückenmuskelnekrose {f}
      (acute) back muscle necrosis
      noun
    
    
      97
      (akuter) Hörsturz {m}
      acute hearing loss
      noun
    
    
      98
      (akuter) Myokardinfarkt {m} <AMI / MI>
      (acute) myocardial infarction <AMI / MI>
      noun
    
    
      99
      (akutes) Lungenversagen {n}
      acute respiratory distress syndrome <ARDS>
      noun

We only need "Word" and "WordType" column



In [6]:

    
dictcc_df = dictcc_df[["Word", "WordType"]][:].copy()

Convert WordType Column to a pandas.Categorical



In [7]:

    
word_types = dictcc_df["WordType"].astype('category')
dictcc_df["WordType"] = word_types
# show data types of each column in the dataframe
dictcc_df.dtypes









    Out[7]:





Word          object
WordType    category
dtype: object

List the current distribution of word types in dictcc dataframe



In [8]:

    
# nltk TaggedCorpusParses requires uppercase WordType
dictcc_df["WordType"] = dictcc_df["WordType"].str.upper()
dictcc_df["WordType"].value_counts().head()









    Out[8]:





NOUN          759619
VERB          126806
ADJ            94507
ADV            26277
ADJ PAST-P     12519
Name: WordType, dtype: int64

Add dictcc corpus to our wordlists array



In [9]:

    
wordlists.append(dictcc_df)

Moby

Download the corpus from http://icon.shef.ac.uk/Moby/mpos.html

Perform some basic cleanup on the wordlist



In [10]:

    
# the readme file in `nltk/corpora/moby/mpos` gives some information on how to parse the file

result = []
# replace all DOS line endings '\r' with newlines then change encoding to UTF8
moby_words = !cat nltk/corpora/moby/mpos/mobyposi.i | iconv --from-code=ISO88591 --to-code=UTF8 | tr -s '\r' '\n' | tr -s '×' '/'
result.extend(moby_words)
moby_df = pd.DataFrame(data = result, columns = ['Word'])



In [288]:

    
moby_df.tail(10)









    Out[288]:






  
    
      
      Word
      WordType
    
  
  
    
      233216
      zoomorphic
      ADJ
    
    
      233220
      zoonal
      ADJ
    
    
      233223
      zoophagous
      ADJ
    
    
      233227
      zoophilous
      ADJ
    
    
      233229
      zoophobous
      ADJ
    
    
      233230
      zoophoric
      ADJ
    
    
      233235
      zooplastic
      ADJ
    
    
      233333
      zygomorphic
      ADJ
    
    
      233334
      zygophyllaceous
      ADJ
    
    
      233342
      zymogenic
      ADJ

sort out the nouns, verbs and adjectives



In [12]:

    
# Matches nouns
nouns = moby_df[moby_df["Word"].str.contains('/[Np]$')].copy()
nouns["WordType"] = "NOUN"
# Matches verbs
verbs = moby_df[moby_df["Word"].str.contains('/[Vti]$')].copy()
verbs["WordType"] = "VERB"
# Magtches adjectives
adjectives = moby_df[moby_df["Word"].str.contains('/A$')].copy()
adjectives["WordType"] = "ADJ"

remove the trailing stuff and concatenate the nouns, verbs and adjectives



In [13]:

    
nouns["Word"] = nouns["Word"].str.replace(r'/N$','')
verbs["Word"] = verbs["Word"].str.replace(r'/[Vti]$','')
adjectives["Word"] = adjectives["Word"].str.replace(r'/A$','')
# Merge nouns, verbs and adjectives into one dataframe
moby_df = pd.concat([nouns,verbs,adjectives])

Add moby corpus to wordlists array



In [284]:

    
wordlists.append(moby_df)

Combine all wordlists



In [14]:

    
wordlist = pd.concat(wordlists)

Filter for results that we want

We want to remove words that aren't associated with a type (null WordType)



In [15]:

    
wordlist_filtered = wordlist[wordlist["WordType"].notnull()]

We want to remove words that contain non word characters (whitespace, hypens, etc.)



In [16]:

    
# we choose [a-z] here and not [A-Za-z] because we do _not_
# want to match words starting with uppercase characters.
# ^to matches verbs in the infinitive from `dictcc`
word_chars = r'^[a-z]+$|^to\s'
is_word_chars = wordlist_filtered["Word"].str.contains(word_chars, na=False)
wordlist_filtered = wordlist_filtered[is_word_chars]
wordlist_filtered.describe()
wordlist_filtered["WordType"].value_counts()









    Out[16]:





NOUN                  132318
VERB                  126665
ADJ                    50659
ADV                    12748
ADJ PAST-P              9327
ADJ PRES-P              4223
PAST-P                  1291
ADJ ADV                  620
PREP                     252
PRON                     222
PRES-P                   173
CONJ                     124
PAST-P ADJ                33
PRES-P ADJ                26
ADV PREP                  20
ADJ PRON                  16
PREFIX                    10
ADJ ARCHAIC:ADV           10
ADV CONJ                   9
PREP CONJ                  5
ADJ.                       4
ADV ADJ                    4
ADV PAST-P                 3
[NONE]                     2
ADJ ARCHAIC:PAST-P         2
ADV DATED:ADJ              2
ADV PREP CONJ              2
ADJ OBS:PAST-P             1
ADJ RARE:ADV               1
PRES-P ARCHAIC:ADJ         1
ADJ RARE:PAST-P            1
ADJ ADV NOUN               1
ADJ ADV PREP CONJ          1
ADV.                       1
ADJ PRED                   1
AD JPAST-P                 1
ADV PRON                   1
ADJ COLL:ADV               1
ADV ARCHAIC:ADJ            1
PREP ADV                   1
PRES-P RARE:ADJ            1
ADJ PREP                   1
Name: WordType, dtype: int64

We want results that are less than 'x' letters long (x+3 for verbs since they are in their infinitive form in the dictcc wordlist)



In [17]:

    
lt_x_letters = (wordlist_filtered["Word"].str.len() < 9) |\
               ((wordlist_filtered["Word"].str.contains('^to\s\w+\s')) &\
                (wordlist_filtered["Word"].str.len() < 11)\
               )
wordlist_filtered = wordlist_filtered[lt_x_letters]
wordlist_filtered.describe()

We want to remove all duplicates



In [18]:

    
wordlist_filtered = wordlist_filtered.drop_duplicates("Word")
wordlist_filtered.describe()
wordlist_filtered["WordType"].value_counts()









    Out[18]:





NOUN                  24671
ADJ                    6901
VERB                   2663
ADJ PAST-P             2130
ADV                    1250
ADJ PRES-P              705
PAST-P                  622
ADJ ADV                 132
PRON                     45
PREP                     43
PRES-P                   34
CONJ                     23
PREFIX                    8
PAST-P ADJ                8
ADJ PRON                  5
PRES-P ADJ                4
ADJ ARCHAIC:ADV           2
ADV CONJ                  2
ADJ OBS:PAST-P            1
ADV DATED:ADJ             1
ADV PREP                  1
ADV PRON                  1
ADJ ARCHAIC:PAST-P        1
PRES-P ARCHAIC:ADJ        1
ADV PREP CONJ             1
[NONE]                    1
ADV ADJ                   1
Name: WordType, dtype: int64

Load our wordlists into nltk



In [20]:

    
# The TaggedCorpusReader likes to use the forward slash character '/'
# as seperator between the word and part-of-speech tag (WordType).
wordlist_filtered.to_csv("dictcc_moby.csv",index=False,sep="/",header=None)



In [21]:

    
from nltk.corpus import TaggedCorpusReader
from nltk.tokenize import WhitespaceTokenizer
nltk_wordlist = TaggedCorpusReader("./", "dictcc_moby.csv")

NLTK

Use NLTK to help us merge our wordlists



In [178]:

    
# Our custom wordlist
import nltk
custom_cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in nltk_wordlist.tagged_words() if len(word) < 9 and word.isalpha)



In [179]:

    
# Brown Corpus
import nltk
brown_cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in nltk.corpus.brown.tagged_words() if word.isalpha() and len(word) < 9)



In [196]:

    
# Merge Nouns from all wordlists
nouns = set(brown_cfd["NN"]) | set(brown_cfd["NP"]) | set(custom_cfd["NOUN"])
# Lowercase all words to remove duplicates
nouns = set([noun.lower() for noun in nouns])
print("Total nouns count: " + str(len(nouns)))









    



Total nouns count: 31178



In [195]:

    
# Merge Verbs from all wordlists
verbs = set(brown_cfd["VB"]) | set(brown_cfd["VBD"]) | set(custom_cfd["VERB"])
# Lowercase all words to remove duplicates
verbs = set([verb.lower() for verb in verbs])
print("Total verbs count: " + str(len(verbs)))









    



Total verbs count: 4991



In [197]:

    
# Merge Adjectives from all wordlists
adjectives = set(brown_cfd["JJ"]) | set(custom_cfd["ADJ"])
# Lowercase all words to remove duplicates
adjectives = set([adjective.lower() for adjective in adjectives])
print("Total adjectives count: " + str(len(adjectives)))









    



Total adjectives count: 10541

Make Some Placewords Magic Happen



In [266]:

    
def populate_degrees(nouns):
    degrees = {}
    nouns_copy = nouns.copy()
    for latitude in range(60):
        for longtitude in range(190):
            degrees[(latitude,longtitude)] = nouns_copy.pop()
    return degrees



In [267]:

    
def populate_minutes(verbs):
    minutes = {}
    verbs_copy = verbs.copy()
    for latitude in range(60):
        for longtitude in range(60):
            minutes[(latitude,longtitude)] = verbs_copy.pop()
    return minutes



In [268]:

    
def populate_seconds(adjectives):
    seconds = {}
    adjectives_copy = adjectives.copy()
    for latitude in range(60):
        for longtitude in range(60):
            seconds[(latitude,longtitude)] = adjectives_copy.pop()
    return seconds



In [269]:

    
def populate_fractions(nouns):
    fractions = {}
    nouns_copy = nouns.copy()
    for latitude in range(10):
        for longtitude in range(10):
            fractions[(latitude,longtitude)] = nouns_copy.pop()
    return fractions



In [271]:

    
def placewords(degrees,minutes,seconds,fractions):
    result = []
    result.append(populate_degrees(nouns).get(degrees))
    result.append(populate_minutes(verbs).get(minutes))
    result.append(populate_seconds(adjectives).get(seconds))
    result.append(populate_fractions(nouns).get(fractions))
    return "-".join(result)



In [281]:

    
# Located at 50°40'47.9" N 10°55'55.2" E
ilmenau_home = placewords((50,10),(40,55),(47,55),(9,2))
print("Feel free to stalk me at " + ilmenau_home)









    



Feel free to stalk me at canards-rallied-planked-corium

TODO (wordlist filtering)

We want to remove stopwords from wordlist

from nltk.corpus import stopwords
dif = set(wordlist_filtered['Word']) - set(stopwords.words('english'))
names = nltk.corpus.names
names.fileids()

We want to remove all names and animals
We want to remove words that are difficult to spell
- Words with uncommon vowel duplicates (examples: ["piing", "reeject"])
We want to remove homonyms that are used in different parts of speech (example: saw (as verb) and saw (as noun))
We want to remove arcane and unusual words

import nltk

def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab - english_vocab
    return sorted(unusual)

	Word	WordType
count	108112	108112
unique	39257	39
top	boom	NOUN
freq	35	64792

	GermanWord	Word	WordType
90	(aktiv) Werbung machen für	to tout	verb
91	(aktive) Langzeitverbindung {f} [Standverbindu...	nailed-up connection <NUC>	noun
92	(aktuelles) Zeitgeschehen {n}	current events {pl}	noun
93	(akustisch) verstehen	to hear	verb
94	(akustische) Haarzelle {f}	auditory cell	noun
95	(akustischer) Dissipationsgrad {m}	(acoustic) dissipation factor	noun
96	(akute) Rückenmuskelnekrose {f}	(acute) back muscle necrosis	noun
97	(akuter) Hörsturz {m}	acute hearing loss	noun
98	(akuter) Myokardinfarkt {m} <AMI / MI>	(acute) myocardial infarction <AMI / MI>	noun
99	(akutes) Lungenversagen {n}	acute respiratory distress syndrome <ARDS>	noun

	Word	WordType
233216	zoomorphic	ADJ
233220	zoonal	ADJ
233223	zoophagous	ADJ
233227	zoophilous	ADJ
233229	zoophobous	ADJ
233230	zoophoric	ADJ
233235	zooplastic	ADJ
233333	zygomorphic	ADJ
233334	zygophyllaceous	ADJ
233342	zymogenic	ADJ