In [3]:
from nltk.corpus import gutenberg

In [8]:
# List every file bundled with the Gutenberg corpus.
fileids = gutenberg.fileids()
print("%d files" % len(fileids))
print(fileids)


18 files
[u'austen-emma.txt', u'austen-persuasion.txt', u'austen-sense.txt', u'bible-kjv.txt', u'blake-poems.txt', u'bryant-stories.txt', u'burgess-busterbrown.txt', u'carroll-alice.txt', u'chesterton-ball.txt', u'chesterton-brown.txt', u'chesterton-thursday.txt', u'edgeworth-parents.txt', u'melville-moby_dick.txt', u'milton-paradise.txt', u'shakespeare-caesar.txt', u'shakespeare-hamlet.txt', u'shakespeare-macbeth.txt', u'whitman-leaves.txt']

In [10]:
alice_raw = gutenberg.raw(fileids=['carroll-alice.txt'])

In [11]:
print('type:  %s' % type(alice_raw))


type:  <type 'unicode'>

In [13]:
# Preview the first 250 characters of the raw text.
print(alice_raw[:250])


[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister 

In [14]:
from nltk.corpus import genesis

In [16]:
# The Genesis corpus holds the same text in several languages.
fileids = genesis.fileids()
print("%d files" % len(fileids))
print(fileids)


8 files
[u'english-kjv.txt', u'english-web.txt', u'finnish.txt', u'french.txt', u'german.txt', u'lolcat.txt', u'portuguese.txt', u'swedish.txt']

In [23]:
# Show the first 100 characters of each translation.
for fileid in fileids:
    print(genesis.raw(fileids=[fileid])[:100] + "\n")


In the beginning God created the heaven and the earth.
And the earth was without form, and void; and

In the beginning God created the heavens and the earth.
Now the earth was formless and empty.  Darkn

Alussa Jumala loi taivaan ja maan. 
Maa oli
autio ja tyhjä, pimeys peitti syvyydet, ja Jumalan
henki

Au commencement, Dieu créa les cieux et la terre.
La terre était informe et vide: il y avait des tén

Am Anfang schuf Gott Himmel und Erde.
Und die Erde war wüst und leer, und es war finster auf der Tie

Oh hai. In teh beginnin Ceiling Cat maded teh skiez An da Urfs, but he did not eated dem.
Da Urfs no

No princípio, criou Deus os céus e a terra.
E a terra era sem forma e vazia; e {havia} trevas sobre 

I begynnelsen skapade Gud himmel och jord.
Och jorden var öde och tom, och mörker var över djupet, o



In [26]:
import nltk
text = nltk.bigrams('Hello')

In [27]:
for b in text:
    print(b)


('H', 'e')
('e', 'l')
('l', 'l')
('l', 'o')

In [28]:
words = nltk.bigrams(['This', 'is', 'gonna', 'be', 'great!'])

In [29]:
for b in words:
    print(b)


('This', 'is')
('is', 'gonna')
('gonna', 'be')
('be', 'great!')
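
The pairs printed above are simply consecutive items of the input, whether that input is a string of characters or a list of words. A minimal sketch of the same idea in plain Python follows. The helper name `pairwise` is hypothetical and chosen only for illustration; this is not NLTK's implementation, just one way to reproduce the pairing.

```python
def pairwise(sequence):
    """Return consecutive pairs of items from a string or list.

    Hypothetical helper for illustration; not part of NLTK.
    """
    items = list(sequence)
    # Zip the sequence against itself shifted by one position.
    return list(zip(items, items[1:]))


print(pairwise('Hello'))
print(pairwise(['This', 'is', 'gonna', 'be', 'great!']))
```

An input with fewer than two items yields an empty list, since there is no consecutive pair to form.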


In [30]:
from langdetect import detect

In [32]:
print(detect("War doesn't show who's right, just who's left."))
print(detect("Ein, zwei, drei, vier"))


en
de

In [34]:
import unicodecsv

In [ ]:
!cat data/language