Stem_graphic

For this demonstration, we import stem_graphic along with some helpers, such as a list of French stopwords (common words we typically want to exclude from our models and exploratory analysis). Stemgraphic also has English (EN) and Spanish (ES) stopword lists built in, and other lists can be used as well.


In [1]:
%matplotlib inline
from stemgraphic.alpha import stem_graphic
from stemgraphic.helpers import APOSTROPHE
from stemgraphic.stopwords import VOYELLES, FR

The data source is "Arsene Lupin contre Herlock Sholmes, La lampe Juive" by Maurice Leblanc, in French. We load the unprocessed text directly from disk.


In [2]:
fig, ax, df = stem_graphic('../lupin_fr.txt', figure_only=False)


stem_graphic expects a list of words in some form. The df returned by the call above holds the stem, leaf and ngram alongside each word, so we select df.word.

reverse=True lets us look at word endings: the stem is the last letter of the word, and the leaf is the letter preceding it.


In [3]:
_,_, df2 = stem_graphic(df.word, reverse=True, figure_only=False);


Quite interesting to see the inversion of the shape of the distribution...


In [4]:
df2.head()  # Oh, the fun times! C'est kazy du verlan...


Out[4]:
index word stem leaf ngram
0 267 sèmlohS s è
1 362 ed e d ed
2 495 rus r u ru
3 217 dnanidreF-tniaS d n dn
4 420 tpes t p tp
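
The rows above can be reproduced by hand. The following is a minimal sketch in plain Python, not stemgraphic's actual implementation; the helper name reversed_stem_leaf is hypothetical and only illustrates what reading a word from its ending means for the stem and the leaf.

```python
def reversed_stem_leaf(word):
    """Return (reversed word, stem, leaf) for a word read from its ending."""
    backwards = word[::-1]   # e.g. 'sur' becomes 'rus'
    stem = backwards[:1]     # last letter of the original word
    leaf = backwards[1:2]    # letter that precedes it in the original word
    return backwards, stem, leaf


# Words taken from the output above, in their original (unreversed) form.
for w in ["de", "sur", "sept"]:
    print(reversed_stem_leaf(w))
```

Slicing with [:1] and [1:2] rather than indexing keeps the helper safe for one-letter words, which then simply get an empty leaf.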

In [5]:
fig, ax, df = stem_graphic('../lupin_fr.txt', ascending=False, caps=False, display=1500,
                           figure_only=False, random_state=42, sort_by='alpha', stop_words=FR);


Many elisions appear in the two charts above. Elision is much more common in French than in English and is rare in Spanish, so this is not surprising given that the text is in French. Let's see which word pairs they are part of:


In [6]:
df[df.word.str[1:2]==APOSTROPHE]


Out[6]:
index word stem leaf ngram
9 11597 n’en n n’
13 11727 l’état l l’
18 21827 n’est n n’
22 21316 c’est c c’
46 10439 l’endroit l l’
54 2524 l’on l l’
63 23176 j’ignore j j’
65 15921 n’en n n’
67 2693 d’hésitation d d’
83 12101 c’est c c’
85 12906 l’affaire l l’
87 14964 c’est c c’
93 19002 d’un d d’
96 5544 d’autre d d’
98 12892 c’est c c’
114 9444 c’est c c’
135 19745 d’autres d d’
142 17457 d’un d d’
148 12789 m’avez m m’
152 2978 s’était s s’
159 17673 l’avait l l’
167 17996 l’avait l l’
174 5918 c’est c c’
189 7311 j’en j j’
198 3248 d’abord d d’
207 18798 l’histoire l l’
214 8842 m’avait m m’
234 7088 l’éclairât l l’
244 16940 s’accrocher s s’
245 16757 c’est c c’
... ... ... ... ... ...
1185 7161 l’ennemi l l’
1195 2486 l’un l l’
1196 16476 d’une d d’
1214 22522 l’anglais l l’
1215 4180 d’imblevalle d d’
1228 1806 j’ai j j’
1269 17377 c’est c c’
1287 5687 l’autre l l’
1298 21182 l’épouvante l l’
1300 8941 l’avait l l’
1302 7741 l’une l l’
1310 1668 d’être d d’
1344 8093 l’énigme l l’
1347 20235 l’anglais l l’
1367 16830 l’idée l l’
1371 22032 s’installer s s’
1386 19030 n’ose n n’
1414 1042 j’en j j’
1425 21976 j’ai j j’
1431 16425 n’aurais n n’
1450 16554 s’enfonçait s s’
1453 2106 s’arranger s s’
1457 17464 l’ajusta l l’
1464 946 m’embêter m m’
1474 9504 s’assirent s s’
1477 12332 s’approcha s s’
1483 625 l’occasion l l’
1490 17173 l’unique l l’
1493 3032 c’est c c’
1494 7129 l’aventure l l’

159 rows × 5 columns

Note that there is a difference between a straight quote (') and a proper typographical apostrophe, which is angled (’). If your text does not use the proper typographical form, you will have to run it through something like sed first. Be careful not to replace actual quotation marks with apostrophes.
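
As an alternative to sed, the same clean-up can be sketched in Python. This is a heuristic under a stated assumption: a straight quote with a letter on both sides is treated as an elision, and every other straight quote is left alone as a quotation mark. The names ELISION and normalize_apostrophes are hypothetical and are not part of stemgraphic.

```python
import re

# A straight quote with a word character immediately before and after it,
# as in l'état or c'est. Quotes at the edge of a word are not matched.
ELISION = re.compile(r"(?<=\w)'(?=\w)")


def normalize_apostrophes(text):
    """Replace elision-style straight quotes with the typographic apostrophe."""
    return ELISION.sub("\u2019", text)


sample = "Il dit : 'c'est l'heure' et s'en va."
print(normalize_apostrophes(sample))
```

The opening and closing quotes in the sample are preceded or followed by a space, so they stay untouched. A quotation that closes directly against a following letter would be misread by this rule, so the result is worth checking on real text.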


In [7]:
res = df[df.word.str[1:2]=='’'].sort_values(by='word')
print(res.word.unique())


['c’est' 'c’était' 'd’abord' 'd’agir' 'd’ailleurs' 'd’annoncer' 'd’autant'
 'd’autre' 'd’autres' 'd’eux' 'd’hommes' 'd’hésitation' 'd’imblevalle'
 'd’influence' 'd’où' 'd’un' 'd’une' 'd’utiliser' 'd’y' 'd’émeraudes'
 'd’être' 'd’œil' 'j’admire' 'j’ai' 'j’en' 'j’ignore' 'j’étais' 'l’a'
 'l’adresse' 'l’affaire' 'l’aie' 'l’air' 'l’ajusta' 'l’album' 'l’alphabet'
 'l’ami' 'l’amie' 'l’anglais' 'l’angoisse' 'l’aplomb' 'l’aurez'
 'l’autorisation' 'l’autre' 'l’avait' 'l’aventure' 'l’eau' 'l’effroyable'
 'l’embarcation' 'l’endroit' 'l’ennemi' 'l’entende' 'l’erreur' 'l’histoire'
 'l’honneur' 'l’idée' 'l’individu' 'l’inquiétude' 'l’insignifiante'
 'l’occasion' 'l’on' 'l’ordinaire' 'l’un' 'l’une' 'l’unique' 'l’échelle'
 'l’écho' 'l’éclairât' 'l’énigme' 'l’épaule' 'l’épouvante' 'l’état' 'l’œil'
 'm’avait' 'm’avez' 'm’avez-vous' 'm’embêter' 'm’ont' 'm’étonner'
 'n’admettait' 'n’agit' 'n’ai' 'n’aurais' 'n’avais' 'n’avez' 'n’effrayait'
 'n’en' 'n’est' 'n’est-ce' 'n’ont' 'n’ose' 'n’y' 'n’était' 's’accentua'
 's’accrocher' 's’agit' 's’aperçut' 's’approcha' 's’arranger' 's’assirent'
 's’en' 's’enfonçait' 's’informa' 's’installer' 's’interrompit' 's’opérait'
 's’y' 's’écria' 's’éloignait' 's’étaient' 's’était']

Taking the above list, we plot a stem-and-leaf, skipping the apostrophe.


In [8]:
stem_graphic(res, title="word pair elisions from the original French edition of 'Arsene Lupin vs. Herlock Sholmes'",
             leaf_skip=1, random_state=121);


Some observations:

Besides accented letters, we also see typographical ligatures, such as œ (o and e joined, as in l’œil), as the first character of the second word.

Of those starting with n’ (the contracted negation ne), the word right after the apostrophe most often starts with an e, followed closely by a. There is also a y, which corresponds to n’y. It would be interesting to compare the use of y by different authors and in different eras of French literature in its different forms, such as d’y, n’y, m’y, l’y, s’y, and t’y.


In [9]:
stem_graphic(res, title="stem-and-leaf of words starting with n’", 
             column=['n'],  # look only at words starting with n
             stem_skip=2,  # skip the n’, leaves now become the stem
             leaf_order=2,  # and show pair of leaves
             random_state=121);


Not a lot of repetition: n’est is found 4 times, and everything else is a one-off.
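
The same kind of tally can be cross-checked without a plot. The sketch below counts the letter that follows n’ in a small, hand-picked sample of words from the listing above; it is not the full data set and does not reproduce the chart. APOSTROPHE here is a local stand-in, assumed to be the typographic apostrophe, and letter_after_prefix is a hypothetical helper.

```python
from collections import Counter

APOSTROPHE = "\u2019"  # typographic apostrophe, assumed to match the helper constant

# Hand-picked sample for illustration only, not the full word list.
words = ["n’en", "n’est", "n’est", "n’ose", "n’aurais", "n’y", "c’est", "l’état"]


def letter_after_prefix(words, prefix):
    """Count the first letter following prefix + apostrophe."""
    marker = prefix + APOSTROPHE
    return Counter(
        w[len(marker)]
        for w in words
        if w.startswith(marker) and len(w) > len(marker)
    )


counts = letter_after_prefix(words, "n")
print(counts.most_common())
```

Skipping the two-character marker plays the same role as stem_skip=2 in the cell above: the letter after the apostrophe becomes the unit being counted.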

