For this demonstration, we will import stem_graphic along with some helpers, such as a list of French stopwords (common words we typically want to exclude from our models and exploratory analysis). Stemgraphic also has English (EN) and Spanish (ES) stopwords built in. Other lists can also be used.
In [1]:
%matplotlib inline
from stemgraphic.alpha import stem_graphic
from stemgraphic.helpers import APOSTROPHE
from stemgraphic.stopwords import VOYELLES, FR
The source of data will be "Arsene Lupin contre Herlock Sholmes, La lampe Juive" by Maurice Leblanc, in French. We load the unprocessed text directly from disk.
In [2]:
fig, ax, df = stem_graphic('../lupin_fr.txt', figure_only=False)
stem_graphic expects a list of words in some form. The df returned by stem_graphic above holds stems, leaves and ngrams, so we select df.word.
Setting reverse=True lets us look at word endings: the stem is the last letter, and the leaf is the letter preceding it.
In [3]:
_,_, df2 = stem_graphic(df.word, reverse=True, figure_only=False);
Quite interesting to see the inversion of the shape of the distribution...
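To make the reversed reading concrete, here is a minimal pure-Python sketch of the split described above (stem is the last letter, leaf is the letter before it). The helper `reversed_stem_leaf` is a hypothetical name introduced for illustration; it is not part of stemgraphic and only mimics the description, not the library's internals.

```python
def reversed_stem_leaf(word):
    """Illustrative only: split a word as a reversed stem-and-leaf would read it."""
    flipped = word[::-1]   # e.g. 'lampe' becomes 'epmal'
    stem = flipped[0:1]    # last letter of the original word
    leaf = flipped[1:2]    # letter preceding it in the original word
    return stem, leaf


for w in ["lampe", "juive", "contre"]:
    print(w, reversed_stem_leaf(w))
```

Slicing (rather than indexing) keeps the helper safe on one-letter words, where the leaf is simply empty.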
In [4]:
df2.head() # Oh, the fun times! C'est kazy du verlan...
Out[4]:
In [5]:
fig, ax, df = stem_graphic('../lupin_fr.txt', ascending=False, caps=False, display=1500,
figure_only=False, random_state=42, sort_by='alpha', stop_words=FR);
Many elisions can be seen in the above two charts. Elision is much more common in French than in English and is rare in Spanish, so this is not surprising given that the text is in French. Let's see what word pairs they are part of:
In [6]:
df[df.word.str[1:2]==APOSTROPHE]
Out[6]:
Note that there is a difference between a straight quote ('), and a proper typographical apostrophe, which is angled (’). If your text does not use the proper typographical form, you will have to run it through something like sed. Be careful not to replace actual quote marks with apostrophes.
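As an alternative to sed, the same cleanup can be sketched in Python. This assumes that a straight quote sitting directly between two word characters is an apostrophe, while a quote next to a space or punctuation is a real quote mark. The function name `curl_apostrophes` is hypothetical, and the heuristic will misfire on text where a closing quote is immediately followed by a letter.

```python
import re

# Assumption: a straight quote with a word character on both sides is an apostrophe.
_ELISION = re.compile(r"(?<=\w)'(?=\w)")


def curl_apostrophes(text):
    """Replace in-word straight quotes with the typographical apostrophe."""
    return _ELISION.sub("’", text)


sample = "Il dit 'non', puis: l'oeil, n'est, d'y."
print(curl_apostrophes(sample))
```

The quotes around 'non' are left alone because each one touches a space or a comma, while the three elisions are converted.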
In [7]:
res = df[df.word.str[1:2]=='’'].sort_values(by='word')
print(res.word.unique())
Taking the above list, we’ll plot a stem-and-leaf, skipping the apostrophe.
In [8]:
stem_graphic(res, title="word pair ellipsis from the original French edition of 'Arsene Lupin vs. Herlock Sholmes'",
leaf_skip=1, random_state=121);
Besides accented letters, we also see typographical ligatures such as o&e (œ, as in l'œil) as the first character.
Of those starting with n’ (a contracted negation ne), we see that the word right after the apostrophe most often starts with an e, followed closely by a. There is also a y, which means: n’y. It would be interesting to compare the use of y by different authors and different eras in French literature in its different forms, such as d’y, n’y, m’y, l’y, s’y, and t’y.
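As a possible starting point for that comparison, the sketch below counts the elided y forms in a piece of text using only the standard library. The name `count_y_forms` and the sample sentence are made up for illustration, and the pattern assumes the text uses the typographical apostrophe (’).

```python
import re
from collections import Counter

# One elided letter, a typographical apostrophe, then y as a whole word.
Y_FORMS = re.compile(r"\b([dnmlst])’y\b", re.IGNORECASE)


def count_y_forms(text):
    """Count occurrences of d’y, n’y, m’y, l’y, s’y and t’y (case-insensitive)."""
    return Counter(m.group(1).lower() + "’y" for m in Y_FORMS.finditer(text))


sample = "Il n’y a rien. On s’y rend; il n’y tient pas. D’y penser..."
print(count_y_forms(sample))
```

Running the same function over texts by different authors or from different periods would give comparable counts, ideally normalized by text length.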
In [9]:
stem_graphic(res, title="stem-and-leaf of words starting with n’",
column=['n'], # look only at words starting with n
stem_skip=2, # skip the n’, leaves now become the stem
leaf_order=2, # and show pair of leaves
random_state=121);
Not a lot of repetition: n’est is found 4 times. Everything else is a one-off.
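To see what the stem_skip and leaf_order settings do to an individual word, here is a small sketch of the split as the comments in the cell above describe it. `split_word` and its `stem_order` parameter are assumptions made for this illustration; this is not stemgraphic's own code.

```python
def split_word(word, stem_skip=0, stem_order=1, leaf_order=1):
    """Illustrative split: skip characters, take the stem, then take the leaves."""
    start = stem_skip
    stem = word[start:start + stem_order]
    leaf = word[start + stem_order:start + stem_order + leaf_order]
    return stem, leaf


# Skip the two characters of n’, so the first letter after the apostrophe
# becomes the stem and the next two letters become the leaves.
for w in ["n’est", "n’avait", "n’y"]:
    print(w, split_word(w, stem_skip=2, leaf_order=2))
```

With these settings n’est splits into stem e and leaves st, while n’y has a stem of y and no leaves.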