In [1]:
from nltk.corpus import gutenberg

In [2]:
fileids = gutenberg.fileids()
print len(fileids), "files"
print fileids


18 files
[u'austen-emma.txt', u'austen-persuasion.txt', u'austen-sense.txt', u'bible-kjv.txt', u'blake-poems.txt', u'bryant-stories.txt', u'burgess-busterbrown.txt', u'carroll-alice.txt', u'chesterton-ball.txt', u'chesterton-brown.txt', u'chesterton-thursday.txt', u'edgeworth-parents.txt', u'melville-moby_dick.txt', u'milton-paradise.txt', u'shakespeare-caesar.txt', u'shakespeare-hamlet.txt', u'shakespeare-macbeth.txt', u'whitman-leaves.txt']

In [3]:
alice_raw = gutenberg.raw(fileids=['carroll-alice.txt'])

In [4]:
print 'type: ', type(alice_raw)


type:  <type 'unicode'>

In [5]:
print alice_raw[:250]


[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister 

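The Gutenberg reader also exposes pre-tokenized views of the same file. A minimal sketch (not executed here) using the words() and sents() accessors:

In [ ]:
alice_words = gutenberg.words(fileids=['carroll-alice.txt'])   # flat list of word tokens
alice_sents = gutenberg.sents(fileids=['carroll-alice.txt'])   # list of sentences, each a list of tokens
print len(alice_words), "tokens,", len(alice_sents), "sentences"
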
In [6]:
from nltk.corpus import genesis

In [7]:
fileids = genesis.fileids()
print len(fileids), "files"
print fileids


8 files
[u'english-kjv.txt', u'english-web.txt', u'finnish.txt', u'french.txt', u'german.txt', u'lolcat.txt', u'portuguese.txt', u'swedish.txt']

In [8]:
for fileid in fileids:
    print genesis.raw(fileids=[fileid])[:100] + "\n"


In the beginning God created the heaven and the earth.
And the earth was without form, and void; and

In the beginning God created the heavens and the earth.
Now the earth was formless and empty.  Darkn

Alussa Jumala loi taivaan ja maan. 
Maa oli
autio ja tyhjä, pimeys peitti syvyydet, ja Jumalan
henki

Au commencement, Dieu créa les cieux et la terre.
La terre était informe et vide: il y avait des tén

Am Anfang schuf Gott Himmel und Erde.
Und die Erde war wüst und leer, und es war finster auf der Tie

Oh hai. In teh beginnin Ceiling Cat maded teh skiez An da Urfs, but he did not eated dem.
Da Urfs no

No princípio, criou Deus os céus e a terra.
E a terra era sem forma e vazia; e {havia} trevas sobre 

I begynnelsen skapade Gud himmel och jord.
Och jorden var öde och tom, och mörker var över djupet, o



In [9]:
import nltk
text = nltk.bigrams('Hello')

In [10]:
for b in text:
    print b


('H', 'e')
('e', 'l')
('l', 'l')
('l', 'o')

In [11]:
words = nltk.bigrams(['This', 'is', 'gonna', 'be', 'great!'])

In [12]:
for b in words:
    print b


('This', 'is')
('is', 'gonna')
('gonna', 'be')
('be', 'great!')

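nltk.util also provides trigrams() and a general ngrams() helper. A quick, unexecuted sketch along the same lines:

In [ ]:
from nltk.util import ngrams
for t in ngrams(['This', 'is', 'gonna', 'be', 'great!'], 3):
    print t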

In [13]:
from langdetect import detect

In [14]:
print detect("War doesn't show who's right, just who's left.")
print detect("Ein, zwei, drei, vier")


en
de

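langdetect is probabilistic, so very short strings can come back differently between runs. detect_langs() exposes the candidate languages with their probabilities, and seeding DetectorFactory makes the results repeatable. A sketch:

In [ ]:
from langdetect import detect_langs, DetectorFactory
DetectorFactory.seed = 0                 # make detection deterministic across runs
print detect_langs("Ein, zwei, drei, vier")
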
In [15]:
import unicodecsv

In [17]:
!cat data/7languages.txt


천천히 말씀해 주세요
Können Sie bitte langsamer sprechen
麻煩你講慢一點
تكلم ببطء من فضلك
Por favor hable más despacio
ゆっくり話してください

In [34]:
with open('data/7languages.txt', 'rb') as input_file:
    row_reader = unicodecsv.reader(input_file)
    result = []
    for row in row_reader:
        lang = detect(row[0])                         # detect the language of each phrase
        result += [lang.encode('ascii', 'ignore')]
        print row[0], "|", lang

    truth = ['ko', 'de', 'zh', 'ar', 'es', 'ja']      # expected language codes, in file order
    print "\n", truth
    print result

    print "\n", nltk.ConfusionMatrix(truth, result)


천천히 말씀해 주세요 | ko
Können Sie bitte langsamer sprechen | de
麻煩你講慢一點 | ko
تكلم ببطء من فضلك | ar
Por favor hable más despacio | es
ゆっくり話してください | ja

['ko', 'de', 'zh', 'ar', 'es', 'ja']
['ko', 'de', 'ko', 'ar', 'es', 'ja']

   | a d e j k z |
   | r e s a o h |
---+-------------+
ar |<1>. . . . . |
de | .<1>. . . . |
es | . .<1>. . . |
ja | . . .<1>. . |
ko | . . . .<1>. |
zh | . . . . 1<.>|
---+-------------+
(row = reference; col = test)

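The matrix shows a single error: the Chinese phrase was detected as Korean. The overall accuracy can be read off directly with nltk.metrics.accuracy; a sketch:

In [ ]:
from nltk.metrics import accuracy
print accuracy(truth, result)            # fraction of phrases detected correctly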

In [37]:
result = []
for fileid in fileids:
    lang = detect(genesis.raw(fileids=[fileid])[:100])     # detect from the first 100 characters
    result += [lang.encode('ascii', 'ignore')]
    print genesis.raw(fileids=[fileid])[:100], "|", lang, "\n"


In the beginning God created the heaven and the earth.
And the earth was without form, and void; and | en 

In the beginning God created the heavens and the earth.
Now the earth was formless and empty.  Darkn | en 

Alussa Jumala loi taivaan ja maan. 
Maa oli
autio ja tyhjä, pimeys peitti syvyydet, ja Jumalan
henki | fi 

Au commencement, Dieu créa les cieux et la terre.
La terre était informe et vide: il y avait des tén | fr 

Am Anfang schuf Gott Himmel und Erde.
Und die Erde war wüst und leer, und es war finster auf der Tie | de 

Oh hai. In teh beginnin Ceiling Cat maded teh skiez An da Urfs, but he did not eated dem.
Da Urfs no | en 

No princípio, criou Deus os céus e a terra.
E a terra era sem forma e vazia; e {havia} trevas sobre  | pt 

I begynnelsen skapade Gud himmel och jord.
Och jorden var öde och tom, och mörker var över djupet, o | sv 


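The genesis detections can be scored the same way as the phrase file above. A sketch, assuming expected labels for the eight files in fileid order (the lolcat translation has no ISO code of its own, so 'en' is used as a stand-in):

In [ ]:
genesis_truth = ['en', 'en', 'fi', 'fr', 'de', 'en', 'pt', 'sv']   # assumed expected labels
print nltk.ConfusionMatrix(genesis_truth, result)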

In [38]:
some_text = "This is some #@*!$ text! This can't be right!"
print nltk.word_tokenize(some_text)
print nltk.wordpunct_tokenize(some_text)


['This', 'is', 'some', '#', '@', '*', '!', '$', 'text', '!', 'This', 'ca', "n't", 'be', 'right', '!']
['This', 'is', 'some', '#@*!$', 'text', '!', 'This', 'can', "'", 't', 'be', 'right', '!']

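Both tokenizers ship with NLTK: word_tokenize follows the Penn Treebank conventions (hence "ca" / "n't"), while wordpunct_tokenize simply splits alphabetic from non-alphabetic runs. If neither split is what you want, RegexpTokenizer lets you define your own; a minimal sketch that keeps only alphanumeric runs:

In [ ]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')      # keep alphanumeric runs, drop punctuation
print tokenizer.tokenize(some_text)
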
In [41]:
with open('data/7languages.txt', 'rb') as input_file:
    row_reader = unicodecsv.reader(input_file)
    for row in row_reader:
        tokens = nltk.word_tokenize(row[0])
        for t in tokens:
            print t, "|||"
        print


천천히 |||
말씀해 |||
주세요 |||

Können |||
Sie |||
bitte |||
langsamer |||
sprechen |||

麻煩你講慢一點 |||

تكلم |||
ببطء |||
من |||
فضلك |||

Por |||
favor |||
hable |||
más |||
despacio |||

ゆっくり話してください |||


In [44]:
from rosette.api import API, RosetteParameters

In [49]:
api = API(service_url="https://api.rosette.com/rest/v1", user_key="40fe14de7872ebf3b8c5e11c17fb7a5f")
params = RosetteParameters()
op = api.morphology()

In [51]:
with open('data/7languages.txt', 'rb') as input_file:
    row_reader = unicodecsv.reader(input_file)
    for row in row_reader:
        params["content"] = row[0]                    # send each phrase to the morphology endpoint
        result = op.operate(params)
        tokens = result['lemmas']                     # one lemma record per token
        for t in tokens:
            print t['text'], "|||",
        print


천천히 ||| 말씀해 ||| 주세요 |||
Können ||| Sie ||| bitte ||| langsamer ||| sprechen |||
麻煩 ||| 你 ||| 講 ||| 慢 ||| 一點 |||
تكلم ||| ببطء ||| من ||| فضلك |||
Por ||| favor ||| hable ||| más ||| despacio |||
ゆっくり ||| 話し ||| て ||| ください |||