Importing our wordlists

Here we import all of our wordlists and add them to an array which me can merge at the end.

This wordlists should not be filtered at this point. However they should all contain the same columns to make merging easier for later.


In [2]:
wordlists = []

Dictcc

Download the dictionary from http://www.dict.cc/?s=about%3Awordlist


In [3]:
!head -n 20 de-en.txt


# DE-EN vocabulary database	compiled by dict.cc
# Date and time	2016-08-29 23:46
# License	THIS WORK IS PROTECTED BY INTERNATIONAL COPYRIGHT LAWS!
# License	Private use is allowed as long as the data, or parts of it, are not published or given away.
# License	By using this file, you agree to be bound to the Terms of Use published at the following URL:  
# License	http://www.dict.cc/translation_file_request.php
# Contains data from	http://dict.tu-chemnitz.de/ with friendly permission by Frank Richter, TU Chemnitz 
# Brought to you by	Paul Hemetsberger and the users of http://www.dict.cc/, 2002 - 2016

α-Keratin {n}	α-keratin	noun
&#945;-Lactalbumin {n} <&#945;-La>	&#945;-lactalbumin <&#945;-La>	noun
&#946;-Mercaptoethanol {n}	&#946;-mercaptoethanol	noun
&#963;-Algebra {f}	&#963;-field	noun
&#963;-Algebra {f}	sigma algebra	noun
& Co.	and company <& Co.>	
'Die' heißt mein Unterrock, und 'der' hängt im Schrank. [regional] [Satz, mit dem Kinder gerügt werden, die von einer (anwesenden) Frau mit 'die' sprechen]	'She' is the cat's mother. [used to encourage children to use names instead of pronouns to refer to females to whom they should show respect]	
'n Abend allerseits! [ugs.]	Evening all! [coll.]	
'nauf [regional] [hinauf]	up	adv
'Nduja {f} [auch: Nduja]	'nduja [also: nduja]	noun
'ne Macke haben [ugs.]	to be off one's head [coll.]	verb

Use pandas library to import csv file


In [4]:
import pandas as pd


dictcc_df = pd.read_csv("de-en.txt", 
                        sep='\t',
                        skiprows=8,
                        header=None, 
                        names=["GermanWord","Word","WordType"])

Preview a few entries of the wordlist


In [5]:
dictcc_df[90:100]


Out[5]:
GermanWord Word WordType
90 (aktiv) Werbung machen für to tout verb
91 (aktive) Langzeitverbindung {f} [Standverbindu... nailed-up connection <NUC> noun
92 (aktuelles) Zeitgeschehen {n} current events {pl} noun
93 (akustisch) verstehen to hear verb
94 (akustische) Haarzelle {f} auditory cell noun
95 (akustischer) Dissipationsgrad {m} (acoustic) dissipation factor noun
96 (akute) Rückenmuskelnekrose {f} (acute) back muscle necrosis noun
97 (akuter) Hörsturz {m} acute hearing loss noun
98 (akuter) Myokardinfarkt {m} <AMI / MI> (acute) myocardial infarction <AMI / MI> noun
99 (akutes) Lungenversagen {n} acute respiratory distress syndrome <ARDS> noun

We only need "Word" and "WordType" column


In [6]:
dictcc_df = dictcc_df[["Word", "WordType"]][:].copy()

Convert WordType Column to a pandas.Categorical


In [7]:
word_types = dictcc_df["WordType"].astype('category')
dictcc_df["WordType"] = word_types
# show data types of each column in the dataframe
dictcc_df.dtypes


Out[7]:
Word          object
WordType    category
dtype: object

List the current distribution of word types in dictcc dataframe


In [8]:
# nltk TaggedCorpusParses requires uppercase WordType
dictcc_df["WordType"] = dictcc_df["WordType"].str.upper()
dictcc_df["WordType"].value_counts().head()


Out[8]:
NOUN          759619
VERB          126806
ADJ            94507
ADV            26277
ADJ PAST-P     12519
Name: WordType, dtype: int64

Add dictcc corpus to our wordlists array


In [9]:
wordlists.append(dictcc_df)

Moby

Download the corpus from http://icon.shef.ac.uk/Moby/mpos.html

Perform some basic cleanup on the wordlist


In [10]:
# the readme file in `nltk/corpora/moby/mpos` gives some information on how to parse the file

result = []
# replace all DOS line endings '\r' with newlines then change encoding to UTF8
moby_words = !cat nltk/corpora/moby/mpos/mobyposi.i | iconv --from-code=ISO88591 --to-code=UTF8 | tr -s '\r' '\n' | tr -s '×' '/'
result.extend(moby_words)
moby_df = pd.DataFrame(data = result, columns = ['Word'])

In [288]:
moby_df.tail(10)


Out[288]:
Word WordType
233216 zoomorphic ADJ
233220 zoonal ADJ
233223 zoophagous ADJ
233227 zoophilous ADJ
233229 zoophobous ADJ
233230 zoophoric ADJ
233235 zooplastic ADJ
233333 zygomorphic ADJ
233334 zygophyllaceous ADJ
233342 zymogenic ADJ
  • sort out the nouns, verbs and adjectives

In [12]:
# Matches nouns
nouns = moby_df[moby_df["Word"].str.contains('/[Np]$')].copy()
nouns["WordType"] = "NOUN"
# Matches verbs
verbs = moby_df[moby_df["Word"].str.contains('/[Vti]$')].copy()
verbs["WordType"] = "VERB"
# Magtches adjectives
adjectives = moby_df[moby_df["Word"].str.contains('/A$')].copy()
adjectives["WordType"] = "ADJ"
  • remove the trailing stuff and concatenate the nouns, verbs and adjectives

In [13]:
nouns["Word"] = nouns["Word"].str.replace(r'/N$','')
verbs["Word"] = verbs["Word"].str.replace(r'/[Vti]$','')
adjectives["Word"] = adjectives["Word"].str.replace(r'/A$','')
# Merge nouns, verbs and adjectives into one dataframe
moby_df = pd.concat([nouns,verbs,adjectives])

Add moby corpus to wordlists array


In [284]:
wordlists.append(moby_df)

Combine all wordlists


In [14]:
wordlist = pd.concat(wordlists)

Filter for results that we want

  • We want to remove words that aren't associated with a type (null WordType)

In [15]:
wordlist_filtered = wordlist[wordlist["WordType"].notnull()]
  • We want to remove words that contain non word characters (whitespace, hypens, etc.)

In [16]:
# we choose [a-z] here and not [A-Za-z] because we do _not_
# want to match words starting with uppercase characters.
# ^to matches verbs in the infinitive from `dictcc`
word_chars = r'^[a-z]+$|^to\s'
is_word_chars = wordlist_filtered["Word"].str.contains(word_chars, na=False)
wordlist_filtered = wordlist_filtered[is_word_chars]
wordlist_filtered.describe()
wordlist_filtered["WordType"].value_counts()


Out[16]:
NOUN                  132318
VERB                  126665
ADJ                    50659
ADV                    12748
ADJ PAST-P              9327
ADJ PRES-P              4223
PAST-P                  1291
ADJ ADV                  620
PREP                     252
PRON                     222
PRES-P                   173
CONJ                     124
PAST-P ADJ                33
PRES-P ADJ                26
ADV PREP                  20
ADJ PRON                  16
PREFIX                    10
ADJ ARCHAIC:ADV           10
ADV CONJ                   9
PREP CONJ                  5
ADJ.                       4
ADV ADJ                    4
ADV PAST-P                 3
[NONE]                     2
ADJ ARCHAIC:PAST-P         2
ADV DATED:ADJ              2
ADV PREP CONJ              2
ADJ OBS:PAST-P             1
ADJ RARE:ADV               1
PRES-P ARCHAIC:ADJ         1
ADJ RARE:PAST-P            1
ADJ ADV NOUN               1
ADJ ADV PREP CONJ          1
ADV.                       1
ADJ PRED                   1
AD JPAST-P                 1
ADV PRON                   1
ADJ COLL:ADV               1
ADV ARCHAIC:ADJ            1
PREP ADV                   1
PRES-P RARE:ADJ            1
ADJ PREP                   1
Name: WordType, dtype: int64
  • We want results that are less than 'x' letters long (x+3 for verbs since they are in their infinitive form in the dictcc wordlist)

In [17]:
lt_x_letters = (wordlist_filtered["Word"].str.len() < 9) |\
               ((wordlist_filtered["Word"].str.contains('^to\s\w+\s')) &\
                (wordlist_filtered["Word"].str.len() < 11)\
               )
wordlist_filtered = wordlist_filtered[lt_x_letters]
wordlist_filtered.describe()


Out[17]:
Word WordType
count 108112 108112
unique 39257 39
top boom NOUN
freq 35 64792
  • We want to remove all duplicates

In [18]:
wordlist_filtered = wordlist_filtered.drop_duplicates("Word")
wordlist_filtered.describe()
wordlist_filtered["WordType"].value_counts()


Out[18]:
NOUN                  24671
ADJ                    6901
VERB                   2663
ADJ PAST-P             2130
ADV                    1250
ADJ PRES-P              705
PAST-P                  622
ADJ ADV                 132
PRON                     45
PREP                     43
PRES-P                   34
CONJ                     23
PREFIX                    8
PAST-P ADJ                8
ADJ PRON                  5
PRES-P ADJ                4
ADJ ARCHAIC:ADV           2
ADV CONJ                  2
ADJ OBS:PAST-P            1
ADV DATED:ADJ             1
ADV PREP                  1
ADV PRON                  1
ADJ ARCHAIC:PAST-P        1
PRES-P ARCHAIC:ADJ        1
ADV PREP CONJ             1
[NONE]                    1
ADV ADJ                   1
Name: WordType, dtype: int64

Load our wordlists into nltk


In [20]:
# The TaggedCorpusReader likes to use the forward slash character '/'
# as seperator between the word and part-of-speech tag (WordType).
wordlist_filtered.to_csv("dictcc_moby.csv",index=False,sep="/",header=None)

In [21]:
from nltk.corpus import TaggedCorpusReader
from nltk.tokenize import WhitespaceTokenizer
nltk_wordlist = TaggedCorpusReader("./", "dictcc_moby.csv")

NLTK

  • Use NLTK to help us merge our wordlists

In [178]:
# Our custom wordlist
import nltk
custom_cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in nltk_wordlist.tagged_words() if len(word) < 9 and word.isalpha)

In [179]:
# Brown Corpus
import nltk
brown_cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in nltk.corpus.brown.tagged_words() if word.isalpha() and len(word) < 9)

In [196]:
# Merge Nouns from all wordlists
nouns = set(brown_cfd["NN"]) | set(brown_cfd["NP"]) | set(custom_cfd["NOUN"])
# Lowercase all words to remove duplicates
nouns = set([noun.lower() for noun in nouns])
print("Total nouns count: " + str(len(nouns)))


Total nouns count: 31178

In [195]:
# Merge Verbs from all wordlists
verbs = set(brown_cfd["VB"]) | set(brown_cfd["VBD"]) | set(custom_cfd["VERB"])
# Lowercase all words to remove duplicates
verbs = set([verb.lower() for verb in verbs])
print("Total verbs count: " + str(len(verbs)))


Total verbs count: 4991

In [197]:
# Merge Adjectives from all wordlists
adjectives = set(brown_cfd["JJ"]) | set(custom_cfd["ADJ"])
# Lowercase all words to remove duplicates
adjectives = set([adjective.lower() for adjective in adjectives])
print("Total adjectives count: " + str(len(adjectives)))


Total adjectives count: 10541

Make Some Placewords Magic Happen


In [266]:
def populate_degrees(nouns):
    degrees = {}
    nouns_copy = nouns.copy()
    for latitude in range(60):
        for longtitude in range(190):
            degrees[(latitude,longtitude)] = nouns_copy.pop()
    return degrees

In [267]:
def populate_minutes(verbs):
    minutes = {}
    verbs_copy = verbs.copy()
    for latitude in range(60):
        for longtitude in range(60):
            minutes[(latitude,longtitude)] = verbs_copy.pop()
    return minutes

In [268]:
def populate_seconds(adjectives):
    seconds = {}
    adjectives_copy = adjectives.copy()
    for latitude in range(60):
        for longtitude in range(60):
            seconds[(latitude,longtitude)] = adjectives_copy.pop()
    return seconds

In [269]:
def populate_fractions(nouns):
    fractions = {}
    nouns_copy = nouns.copy()
    for latitude in range(10):
        for longtitude in range(10):
            fractions[(latitude,longtitude)] = nouns_copy.pop()
    return fractions

In [271]:
def placewords(degrees,minutes,seconds,fractions):
    result = []
    result.append(populate_degrees(nouns).get(degrees))
    result.append(populate_minutes(verbs).get(minutes))
    result.append(populate_seconds(adjectives).get(seconds))
    result.append(populate_fractions(nouns).get(fractions))
    return "-".join(result)

In [281]:
# Located at 50°40'47.9" N 10°55'55.2" E
ilmenau_home = placewords((50,10),(40,55),(47,55),(9,2))
print("Feel free to stalk me at " + ilmenau_home)


Feel free to stalk me at canards-rallied-planked-corium

TODO (wordlist filtering)

  • We want to remove stopwords from wordlist
from nltk.corpus import stopwords
dif = set(wordlist_filtered['Word']) - set(stopwords.words('english'))
names = nltk.corpus.names
names.fileids()
  • We want to remove all names and animals

  • We want to remove words that are difficult to spell

    • Words with uncommon vowel duplicates (examples: ["piing", "reeject"])
  • We want to remove homonyms that are used in different parts of speech (example: saw (as verb) and saw (as noun))

  • We want to remove arcane and unusual words

import nltk

def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab - english_vocab
    return sorted(unusual)