Stem Text from words


In [1]:
%matplotlib inline
from stemgraphic.alpha import stem_text
from stemgraphic.stopwords import VOWELS, EN

In [2]:
source = '../datasets/A Case of Identity by Arthur Conan Doyle.txt'

In [3]:
stem_text(source, column=VOWELS, display=250);


['a', 'e', 'i', 'o', 'u']: 
count     250
unique     75
top       and
freq       35
Name: word, dtype: object
sampled  250

e| afiilnnnvvxxy
u| nnnppppppppppsss
i| ffmnnnnnnnnnnnnnnnnnnnnnnnnnsssssssssstttttttttt
o| bbbbfffffffffffffffffffffffffffnnnnnnnnnnnrrtttuuuuuuvvvww
a|                          bbbbccdddfffllllmnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnrrrsssssssssssssssstttttttww
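
The four-line summary above the plot (count, unique, top, freq) has the shape of a pandas `describe()` result on a column of strings, and the rows suggest that `column=VOWELS` keeps only words whose stem is a vowel. A minimal sketch of both ideas follows. The word list and the `vowels` name are hypothetical stand-ins, and this is not necessarily how stemgraphic computes the summary internally.

```python
import pandas as pd

# Hypothetical word list; the notebook itself reads words from the story file.
words = ["and", "the", "of", "in", "and", "up", "holmes", "and", "of"]
vowels = ["a", "e", "i", "o", "u"]  # assumed to mirror the VOWELS list printed above

# Keep only the words whose first letter (the stem) is a vowel.
vowel_words = pd.Series([w for w in words if w[0] in vowels],
                        name="word", dtype=object)

# describe() on string data reports count, unique, top and freq.
summary = vowel_words.describe()
print(summary)
```

Here `top` is the most frequent word in the filtered sample and `freq` is how many times it occurs.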

In [4]:
stem_text(source, caps=False, display=750, reverse=False, stop_words=EN, legend_pos=None);


z| e
j| ouuuu
q| uuuuuu
n| aeeeeo
y| eeeeeee
v| aaaeeeiiooo
k| eeeeiinnnnn
u| nnnnnnnnnnnnnsss
o| bbbbbcffhhhhilpp’
e|  aaaafllmmnnnnstxxxy
g| aaeeeelllllooooooorrrr
r| aaeeeeeeeeeeeeeeiiiiioo
i| dmmmmnnnnnnnnnnnnnnnnnst
p| aaaaaeeeilllloooooooorrrrrruuu
b| aaaeeeeeeiillllooorrrrrrrruuuuu
a| cdddfffillmnnnnnnnnnnrrrssssssssvw
l| aaaeeeeeeiiiiiiiiiiiiiiiiiioooooooooy
f| aaaaaaaaaaaaaaaaaaaeeeiiiiooooorrrrruu
d| aaaaaaeeeeeeeeeeeeiiiiiiiioooooooorrrru
t| aaaaaaeeeehhhhhhhhhhhhhhiiiiiiooooorrrrrrrrruwwyyy
w| aaaaaaaaaaeeeeeeeeehiiiiiiiiiiiiiiiiiooooooooorrrr
m| aaaaaaaaaaaaaaaaaaaaaaaeeeeeeiiiiooooorrrrrrrrrrrrryy
c| aaaaaaaaaaeehhhhhilllllllllooooooooooooooooooooooooooorrrrru
h| aaaaaaaaaaaaaaaaeeeeeeeeeeeeeiiooooooooooooooooooooooooooooouuuu
s| aaaaaaaaaaaaaaaaaaaaaaacceeeeeehhhhhhhhhhhhiiiiiiillllnoooppppptttttttttttttttttttuuuuuuuuuwww’

In [5]:
# Looking at words in reverse: 'word' becomes the bigram 'dr', with stem 'd' and leaf 'r'.
stem_text(source, caps=False, display=750, reverse=True, stop_words=EN, legend_pos=None);


x| o
c| ii
p| orsuu
w| oooooo
u| ooooooooo
k| acccnnnnnooo
m| aiiiiiiiiiooooor
h| cccccccggostttttttttt
l| aaeeellllllllllllllllll
g| nnnnnnnnnnnnnnnnnnnnnnnn
f| iiiiiillllooooooooooooooooo
o| dddddgosssssstttttttttttttttw
y| aaaabddeeeefhlllllllllmmmmmmnnrrrrrrrrrr
n| aaaaaaaaaaeeeeeeeeeeeeeeeeeeeiiiiiiiiiiiiiiiioooooooooooorw
r|  aaaaaeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeimmmmoooooooooooooouuuuuuuu
d| aaaaaaaaaaaeeeeeeeeeeeeeeeeeeeeiiiiiilllllnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnr
s| aaaaaaaaaaaaaaaaaaaaaaaddddeeeeeeeeeeeeeeegiiiiiiiiiiiilnppprrrrrrrrrssttttuuyy’’’’’
t| aaaaaaaaaaaaaaaaaaaaaaacccceeeeeehhhhhiiiiiiiiiiiiiiiiiiinnnnnooooooooorssssssssssssssuuuuuuuuuxx’’
e| bbbbcccccccefggghhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhkklllllllllmmmmmmmmmpprrrrrrrrrrrrsssssssssssttttttttuuvvvvvvvvvvvvvvvvvvvvvwwwww
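
The forward and reversed plots differ only in which end of the word supplies the two letters. A small sketch of that mapping, using a hypothetical helper `stem_and_leaf` (not part of stemgraphic); padding one-letter words with a blank leaf is an assumption based on the leading spaces seen in some rows.

```python
def stem_and_leaf(word, reverse=False):
    """Return (stem, leaf) taken from the first two letters of a non-empty word.

    With reverse=True the word is reversed first, so the stem is the last
    letter and the leaf is the second-to-last letter.
    """
    letters = word[::-1] if reverse else word
    stem = letters[0]
    # Assumption: a one-letter word gets a blank leaf.
    leaf = letters[1] if len(letters) > 1 else " "
    return stem, leaf


forward = stem_and_leaf("word")                 # first two letters: 'wo'
backward = stem_and_leaf("word", reverse=True)  # last two letters, reversed: 'dr'
print(forward, backward)
```

Read this way, the long `g| nnnn...` row in the reversed plot is made of words ending in "ng", and the `e| ...hhhh...` run is made of words ending in "he".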

In [6]:
rows, hm, df = stem_text(source, break_on='m', caps=False, display=1200, random_state=120,
                         rows_only=False, sort_by='alpha', stop_words=EN);


: 
count     1200
unique     781
top       said
freq        23
Name: word, dtype: object
sampled 1200

a| bbccccccddddfffggilllllll
a| mmnnnnnnnnnnnnnnnpppprrssssssssssstuwwww
b| aaaaaaaaeeeeeeeeeeiiillllll
b| ooooooooooorrrrrrrrrrruuuuuuuu
c| aaaaaaaaaaaaaaaaaaaaaeeehhhhhhhhhhiiiiilllllllll
c| oooooooooooooooooooooooooooooooooooooooorrrrrrrrrrrrruuuu
d| aaaaaaeeeeeeeeeeeeeeeeeiiiiiiiiiiiiiiiii
d| ooooooooooooorrrrrrrrrrué
e| aaaaff
e| nnnnnnnnnnnqtvvxxxxxyyyy
f| aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaeeeeeiiiiiiiiiiill
f| oooooooooooooooorrrrrrrrrtu
g| aaaaaeeeeeeiiiillllllllllll
g| oooooooooorrrrrrrrr
h| aaaaaaaaaaaaaaaaaaaaaaaaaaeeeeeeei
h| oooooooooooooooooooooooooooooooooooooooouuuuuu
i| dl
i| mmmmmmmmnnnnnnnnnnnnnnrss
j| aaa
j| oouuu
k| eiiiiii
k| nnnnnnnnnn
l| aaaaaaaeeeeeeeeeeeeeeeeeeeeeeeiiiiiiiiiiiiiiiiiiiiiiiiii
l| oooooooooooooy
m| aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaeeeeeeeeiiiiiiiiiiiiiii
m| oooooooooorrrrrrrrrrrrrrrrrrrrry
n| aeeeeeeei
n| ooooo
o| bbbbbbbbbbccfffhhhhhhllll
o| prsu’’
p| aaaaaaaaaeeeeeeeeehillllllll
p| ooooooooorrrrrrrrrrruuuuu
q| uuuuuuuuuuu
q| 
r|   eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeiiiiiiiii
r| ooooouuu
s| aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaceeeeeeeeehhhhhhhhhhhhhiiiiiiiiiiillllllll
s| mmmmnnnnoooooppppttttttttttttuuuuuuuuuuuuuuuuuuwwwy’’
t| aaaaaaaaaaaaeeeeehhhhhhhhhhhhhhhhhiiiiiii
t| oooooooorrrrrrrrrrrrrrrruwwyyyyyy
u| nnnnnssssssss
u| 
v| aeiii
v| ooooouu
w| aaaaaaaaaaaeeeeeeeeeeeeehhhiiiiiiiiiiiiii
w| ooooooooooooooorrrrrrrr
y| eeeeeeeeee
y| o
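
Judging from the output above, `break_on='m'` prints each stem twice: the first row holds leaves that sort before 'm', the second holds 'm' and everything after it (including ’ and é). A rough sketch of that split, using a hypothetical helper `split_leaves`; the comparison rule is inferred from the printed rows, not taken from the stemgraphic source.

```python
def split_leaves(leaves, break_on="m"):
    """Split a string of leaves into (before break_on, break_on and after)."""
    lower = "".join(ch for ch in leaves if ch < break_on)
    upper = "".join(ch for ch in leaves if ch >= break_on)
    return lower, upper


# Hypothetical row of leaves, already sorted alphabetically.
first_row, second_row = split_leaves("bcdlmnop’")
print(first_row)
print(second_row)
```

Under this rule a space sorts before every letter, which matches the blank leaves appearing at the start of the first `r|` row.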

The o stem has apostrophe (’) leaves. An Irish name, perhaps?


In [7]:
df[df.word.str[:2]=='o’']


Out[7]:
index word stem leaf ngram
68 8287 o’clock o o’
524 9263 o’clock o o’

Ah, o’clock. That explains it.
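
The same lookup can also be written with `str.startswith`, and `value_counts()` then shows which words are behind a given bigram. A minimal sketch on a hypothetical stand-in frame (the real `df` comes from the `stem_text` call above):

```python
import pandas as pd

# Hypothetical stand-in for the sampled words returned by stem_text.
df = pd.DataFrame({"word": ["o’clock", "over", "only", "o’clock", "said"]})

# Rows whose word starts with the bigram in question.
matches = df[df["word"].str.startswith("o’")]

# Tally the distinct words that produced that bigram.
counts = matches["word"].value_counts()
print(counts)
```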