Stem Text from words



In [1]:

    
%matplotlib inline
from stemgraphic.alpha import stem_text
from stemgraphic.stopwords import VOWELS, EN



In [2]:

    
source = '../datasets/A Case of Identity by Arthur Conan Doyle.txt'



In [3]:

    
stem_text(source, column=VOWELS, display=250);









    



['a', 'e', 'i', 'o', 'u']: 
count     250
unique     75
top       and
freq       35
Name: word, dtype: object
sampled  250

e| afiilnnnvvxxy
u| nnnppppppppppsss
i| ffmnnnnnnnnnnnnnnnnnnnnnnnnnsssssssssstttttttttt
o| bbbbfffffffffffffffffffffffffffnnnnnnnnnnnrrtttuuuuuuvvvww
a|                          bbbbccdddfffllllmnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnrrrsssssssssssssssstttttttww



In [4]:

    
stem_text(source, caps=False, display=750, reverse=False, stop_words=EN, legend_pos=None);









    



z| e
j| ouuuu
q| uuuuuu
n| aeeeeo
y| eeeeeee
v| aaaeeeiiooo
k| eeeeiinnnnn
u| nnnnnnnnnnnnnsss
o| bbbbbcffhhhhilpp’
e|  aaaafllmmnnnnstxxxy
g| aaeeeelllllooooooorrrr
r| aaeeeeeeeeeeeeeeiiiiioo
i| dmmmmnnnnnnnnnnnnnnnnnst
p| aaaaaeeeilllloooooooorrrrrruuu
b| aaaeeeeeeiillllooorrrrrrrruuuuu
a| cdddfffillmnnnnnnnnnnrrrssssssssvw
l| aaaeeeeeeiiiiiiiiiiiiiiiiiioooooooooy
f| aaaaaaaaaaaaaaaaaaaeeeiiiiooooorrrrruu
d| aaaaaaeeeeeeeeeeeeiiiiiiiioooooooorrrru
t| aaaaaaeeeehhhhhhhhhhhhhhiiiiiiooooorrrrrrrrruwwyyy
w| aaaaaaaaaaeeeeeeeeehiiiiiiiiiiiiiiiiiooooooooorrrr
m| aaaaaaaaaaaaaaaaaaaaaaaeeeeeeiiiiooooorrrrrrrrrrrrryy
c| aaaaaaaaaaeehhhhhilllllllllooooooooooooooooooooooooooorrrrru
h| aaaaaaaaaaaaaaaaeeeeeeeeeeeeeiiooooooooooooooooooooooooooooouuuu
s| aaaaaaaaaaaaaaaaaaaaaaacceeeeeehhhhhhhhhhhhiiiiiiillllnoooppppptttttttttttttttttttuuuuuuuuuwww’



In [5]:

    
# Looking at words in reverse: 'word' is read from its end, giving the bigram 'dr' (stem 'd', leaf 'r').
stem_text(source, caps=False, display=750, reverse=True, stop_words=EN, legend_pos=None);









    



x| o
c| ii
p| orsuu
w| oooooo
u| ooooooooo
k| acccnnnnnooo
m| aiiiiiiiiiooooor
h| cccccccggostttttttttt
l| aaeeellllllllllllllllll
g| nnnnnnnnnnnnnnnnnnnnnnnn
f| iiiiiillllooooooooooooooooo
o| dddddgosssssstttttttttttttttw
y| aaaabddeeeefhlllllllllmmmmmmnnrrrrrrrrrr
n| aaaaaaaaaaeeeeeeeeeeeeeeeeeeeiiiiiiiiiiiiiiiioooooooooooorw
r|  aaaaaeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeimmmmoooooooooooooouuuuuuuu
d| aaaaaaaaaaaeeeeeeeeeeeeeeeeeeeeiiiiiilllllnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnr
s| aaaaaaaaaaaaaaaaaaaaaaaddddeeeeeeeeeeeeeeegiiiiiiiiiiiilnppprrrrrrrrrssttttuuyy’’’’’
t| aaaaaaaaaaaaaaaaaaaaaaacccceeeeeehhhhhiiiiiiiiiiiiiiiiiiinnnnnooooooooorssssssssssssssuuuuuuuuuxx’’
e| bbbbcccccccefggghhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhkklllllllllmmmmmmmmmpprrrrrrrrrrrrsssssssssssttttttttuuvvvvvvvvvvvvvvvvvvvvvwwwww
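
The stem and leaf assignment described in the comment of cell [5] can be sketched in plain Python. This is only an illustration of the idea, not stemgraphic's own code: the helper name `stem_and_leaf` is hypothetical, and it assumes the stem is the first character and the leaf the second, taken after the word is optionally reversed.

```python
def stem_and_leaf(word, reverse=False):
    """Split a word into (stem, leaf) using its first two characters.

    Hypothetical helper for illustration only. With reverse=True the word
    is flipped first, so the stem is the last letter and the leaf is the
    second-to-last one.
    """
    text = word.lower()
    if reverse:
        text = text[::-1]
    bigram = text[:2]
    stem = bigram[:1]
    leaf = bigram[1:2]  # empty string for one-letter words
    return stem, leaf


print(stem_and_leaf("word"))                # ('w', 'o')
print(stem_and_leaf("word", reverse=True))  # ('d', 'r')
```

Read this way, the long `e|` row in the reversed plot collects words ending in e, with the leaf showing the letter just before it.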



In [6]:

    
rows, hm, df = stem_text(source, break_on='m', caps=False, display=1200, random_state=120,
                         rows_only=False, sort_by='alpha', stop_words=EN);









    



: 
count     1200
unique     781
top       said
freq        23
Name: word, dtype: object
sampled 1200

a| bbccccccddddfffggilllllll
a| mmnnnnnnnnnnnnnnnpppprrssssssssssstuwwww
b| aaaaaaaaeeeeeeeeeeiiillllll
b| ooooooooooorrrrrrrrrrruuuuuuuu
c| aaaaaaaaaaaaaaaaaaaaaeeehhhhhhhhhhiiiiilllllllll
c| oooooooooooooooooooooooooooooooooooooooorrrrrrrrrrrrruuuu
d| aaaaaaeeeeeeeeeeeeeeeeeiiiiiiiiiiiiiiiii
d| ooooooooooooorrrrrrrrrrué
e| aaaaff
e| nnnnnnnnnnnqtvvxxxxxyyyy
f| aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaeeeeeiiiiiiiiiiill
f| oooooooooooooooorrrrrrrrrtu
g| aaaaaeeeeeeiiiillllllllllll
g| oooooooooorrrrrrrrr
h| aaaaaaaaaaaaaaaaaaaaaaaaaaeeeeeeei
h| oooooooooooooooooooooooooooooooooooooooouuuuuu
i| dl
i| mmmmmmmmnnnnnnnnnnnnnnrss
j| aaa
j| oouuu
k| eiiiiii
k| nnnnnnnnnn
l| aaaaaaaeeeeeeeeeeeeeeeeeeeeeeeiiiiiiiiiiiiiiiiiiiiiiiiii
l| oooooooooooooy
m| aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaeeeeeeeeiiiiiiiiiiiiiii
m| oooooooooorrrrrrrrrrrrrrrrrrrrry
n| aeeeeeeei
n| ooooo
o| bbbbbbbbbbccfffhhhhhhllll
o| prsu’’
p| aaaaaaaaaeeeeeeeeehillllllll
p| ooooooooorrrrrrrrrrruuuuu
q| uuuuuuuuuuu
q| 
r|   eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeiiiiiiiii
r| ooooouuu
s| aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaceeeeeeeeehhhhhhhhhhhhhiiiiiiiiiiillllllll
s| mmmmnnnnoooooppppttttttttttttuuuuuuuuuuuuuuuuuuwwwy’’
t| aaaaaaaaaaaaeeeeehhhhhhhhhhhhhhhhhiiiiiii
t| oooooooorrrrrrrrrrrrrrrruwwyyyyyy
u| nnnnnssssssss
u| 
v| aeiii
v| ooooouu
w| aaaaaaaaaaaeeeeeeeeeeeeehhhiiiiiiiiiiiiii
w| ooooooooooooooorrrrrrrr
y| eeeeeeeeee
y| o

The o stem has an apostrophe (’) as a leaf, so some sampled words begin with o’. An Irish name, perhaps?



In [7]:

    
# Select the sampled words whose first two characters are o’ (typographic apostrophe)
df[df.word.str[:2] == 'o’']









    Out[7]:

         index     word stem leaf ngram
    68    8287  o’clock    o    ’    o’
    524   9263  o’clock    o    ’    o’

Ah, o’clock. That explains it.
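
To see why the apostrophe ends up at the tail of the second o row, here is a minimal sketch of grouping leaves by stem and splitting each stem at a break letter. It is an assumption inferred from the printed rows in cell [6], not stemgraphic's implementation; the function `stem_rows` and the word list `sample` are made up for the example.

```python
from collections import defaultdict


def stem_rows(words, break_on=None):
    """Group second letters (leaves) under first letters (stems).

    Hypothetical helper for illustration only. When break_on is given,
    each stem gets two rows: leaves sorting before break_on, then the
    rest. Characters compare by Unicode code point, so the typographic
    apostrophe (U+2019) sorts after every ASCII letter.
    """
    rows = defaultdict(list)
    for word in words:
        w = word.lower()
        if len(w) < 2:
            continue  # no leaf available for one-letter words
        stem, leaf = w[0], w[1]
        half = 1 if break_on is not None and leaf >= break_on else 0
        rows[(stem, half)].append(leaf)
    return {key: "".join(sorted(leaves)) for key, leaves in sorted(rows.items())}


# Illustrative words only, not a sample from the story.
sample = ["o’clock", "obvious", "office", "only", "our", "open"]
for (stem, _), leaves in stem_rows(sample, break_on="m").items():
    print(f"{stem}| {leaves}")
```

With `break_on="m"`, the leaves b and f land on the first o row, while n, p, u and the apostrophe land on the second, with the apostrophe last.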