Stem_graphic

For this demonstration, we will import stem_graphic along with some helpers, such as a list of French stop words (common words we typically want to exclude from our models and exploratory analysis). Stemgraphic also has English (EN) and Spanish (ES) stop words built in. Other lists can also be used.



In [1]:

    
%matplotlib inline
from stemgraphic.alpha import stem_graphic
from stemgraphic.helpers import APOSTROPHE
from stemgraphic.stopwords import VOYELLES, FR
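
As a rough illustration of what a stop-word list does, the sketch below drops common words from a list of tokens in plain Python. The tiny set MINI_STOPWORDS and the helper drop_stop_words are hypothetical names made up for this example; they are not the contents of stemgraphic's FR list, nor its implementation.

```python
# Hypothetical, tiny stop-word set for illustration only
# (NOT the FR list shipped with stemgraphic).
MINI_STOPWORDS = {"le", "la", "les", "de", "et", "un", "une"}


def drop_stop_words(words, stop_words=MINI_STOPWORDS):
    """Return the words whose lowercase form is not in stop_words."""
    return [w for w in words if w.lower() not in stop_words]


tokens = ["La", "lampe", "juive", "et", "le", "détective"]
print(drop_stop_words(tokens))
```

Lowercasing before the lookup is a choice made for this sketch, so that "La" and "la" are treated alike.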

The source of data will be "Arsene Lupin contre Herlock Sholmes, La lampe Juive" by Maurice Leblanc, in French. We load the unprocessed text directly from disk.



In [2]:

    
fig, ax, df = stem_graphic('../lupin_fr.txt', figure_only=False)

stem_graphic expects a list of words in some form. The df returned by the call above holds the word, stem, leaf and ngram columns, so we select df.word.

reverse=True lets us look at word endings: the stem is the last letter of the word, and the leaf is the letter preceding it.



In [3]:

    
_,_, df2 = stem_graphic(df.word, reverse=True, figure_only=False);

Quite interesting to see the inversion of the shape of the distribution...



In [4]:

    
df2.head()  # Oh, the fun times! C'est kazy du verlan...









    Out[4]:

       index  word             stem  leaf  ngram
    0    267  sèmlohS          s     è     sè
    1    362  ed               e     d     ed
    2    495  rus              r     u     ru
    3    217  dnanidreF-tniaS  d     n     dn
    4    420  tpes             t     p     tp


In [5]:

    
fig, ax, df = stem_graphic('../lupin_fr.txt', ascending=False, caps=False, display=1500,
                           figure_only=False, random_state=42, sort_by='alpha', stop_words=FR);

Many elisions are visible in the two charts above. Elision is much more common in French than in English and is rare in Spanish, so this is not surprising given that the text is in French. Let's see which word pairs they are part of:



In [6]:

    
df[df.word.str[1:2]==APOSTROPHE]









    Out[6]:

          index  word          stem  leaf  ngram
       9  11597  n’en          n     ’     n’
      13  11727  l’état        l     ’     l’
      18  21827  n’est         n     ’     n’
      22  21316  c’est         c     ’     c’
      46  10439  l’endroit     l     ’     l’
      54   2524  l’on          l     ’     l’
      63  23176  j’ignore      j     ’     j’
      65  15921  n’en          n     ’     n’
      67   2693  d’hésitation  d     ’     d’
      83  12101  c’est         c     ’     c’
      85  12906  l’affaire     l     ’     l’
      87  14964  c’est         c     ’     c’
      93  19002  d’un          d     ’     d’
      96   5544  d’autre       d     ’     d’
      98  12892  c’est         c     ’     c’
     114   9444  c’est         c     ’     c’
     135  19745  d’autres      d     ’     d’
     142  17457  d’un          d     ’     d’
     148  12789  m’avez        m     ’     m’
     152   2978  s’était       s     ’     s’
     159  17673  l’avait       l     ’     l’
     167  17996  l’avait       l     ’     l’
     174   5918  c’est         c     ’     c’
     189   7311  j’en          j     ’     j’
     198   3248  d’abord       d     ’     d’
     207  18798  l’histoire    l     ’     l’
     214   8842  m’avait       m     ’     m’
     234   7088  l’éclairât    l     ’     l’
     244  16940  s’accrocher   s     ’     s’
     245  16757  c’est         c     ’     c’
     ...    ...  ...           ...   ...   ...
    1185   7161  l’ennemi      l     ’     l’
    1195   2486  l’un          l     ’     l’
    1196  16476  d’une         d     ’     d’
    1214  22522  l’anglais     l     ’     l’
    1215   4180  d’imblevalle  d     ’     d’
    1228   1806  j’ai          j     ’     j’
    1269  17377  c’est         c     ’     c’
    1287   5687  l’autre       l     ’     l’
    1298  21182  l’épouvante   l     ’     l’
    1300   8941  l’avait       l     ’     l’
    1302   7741  l’une         l     ’     l’
    1310   1668  d’être        d     ’     d’
    1344   8093  l’énigme      l     ’     l’
    1347  20235  l’anglais     l     ’     l’
    1367  16830  l’idée        l     ’     l’
    1371  22032  s’installer   s     ’     s’
    1386  19030  n’ose         n     ’     n’
    1414   1042  j’en          j     ’     j’
    1425  21976  j’ai          j     ’     j’
    1431  16425  n’aurais      n     ’     n’
    1450  16554  s’enfonçait   s     ’     s’
    1453   2106  s’arranger    s     ’     s’
    1457  17464  l’ajusta      l     ’     l’
    1464    946  m’embêter     m     ’     m’
    1474   9504  s’assirent    s     ’     s’
    1477  12332  s’approcha    s     ’     s’
    1483    625  l’occasion    l     ’     l’
    1490  17173  l’unique      l     ’     l’
    1493   3032  c’est         c     ’     c’
    1494   7129  l’aventure    l     ’     l’

159 rows × 5 columns
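
The selection used in the cell above (second character equal to the apostrophe) can be sketched without pandas. The constant APOSTROPHE_CHAR is an assumption of this example: it is set to the typographical apostrophe, which is the character the rest of this notebook compares against. The helper name elided is hypothetical.

```python
# Assumed value: the typographical (curly) apostrophe.
APOSTROPHE_CHAR = "’"


def elided(words):
    """Keep words whose second character is the typographical apostrophe."""
    # w[1:2] is an empty string for one-letter words, so no IndexError.
    return [w for w in words if w[1:2] == APOSTROPHE_CHAR]


sample = ["n’en", "lampe", "l’état", "c'est", "a"]
print(elided(sample))
```

Note that "c'est", written with a straight quote, is not selected.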

Note that there is a difference between a straight quote (') and a proper typographical apostrophe, which is curved (’). If your text does not use the typographical form, you will have to run it through something like sed. Be careful not to replace actual quotation marks with apostrophes.
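
As an alternative to sed, here is a minimal Python sketch of that conversion. It relies on a heuristic that is an assumption of this example: a straight quote is treated as an apostrophe only when it sits directly between two word characters. The function name curl_apostrophes is hypothetical, and the heuristic will not be right for every text.

```python
import re

# A straight quote with a word character on both sides, e.g. the one in l'on.
_INNER_QUOTE = re.compile(r"(?<=\w)'(?=\w)")


def curl_apostrophes(text):
    """Replace word-internal straight quotes with typographical apostrophes."""
    return _INNER_QUOTE.sub("’", text)


print(curl_apostrophes("l'on dit 'oui' et c'est tout"))
```

Quote marks that open or close a quotation have whitespace or punctuation on at least one side, so they are left untouched by this pattern.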



In [7]:

    
res = df[df.word.str[1:2]=='’'].sort_values(by='word')
print(res.word.unique())









    



['c’est' 'c’était' 'd’abord' 'd’agir' 'd’ailleurs' 'd’annoncer' 'd’autant'
 'd’autre' 'd’autres' 'd’eux' 'd’hommes' 'd’hésitation' 'd’imblevalle'
 'd’influence' 'd’où' 'd’un' 'd’une' 'd’utiliser' 'd’y' 'd’émeraudes'
 'd’être' 'd’œil' 'j’admire' 'j’ai' 'j’en' 'j’ignore' 'j’étais' 'l’a'
 'l’adresse' 'l’affaire' 'l’aie' 'l’air' 'l’ajusta' 'l’album' 'l’alphabet'
 'l’ami' 'l’amie' 'l’anglais' 'l’angoisse' 'l’aplomb' 'l’aurez'
 'l’autorisation' 'l’autre' 'l’avait' 'l’aventure' 'l’eau' 'l’effroyable'
 'l’embarcation' 'l’endroit' 'l’ennemi' 'l’entende' 'l’erreur' 'l’histoire'
 'l’honneur' 'l’idée' 'l’individu' 'l’inquiétude' 'l’insignifiante'
 'l’occasion' 'l’on' 'l’ordinaire' 'l’un' 'l’une' 'l’unique' 'l’échelle'
 'l’écho' 'l’éclairât' 'l’énigme' 'l’épaule' 'l’épouvante' 'l’état' 'l’œil'
 'm’avait' 'm’avez' 'm’avez-vous' 'm’embêter' 'm’ont' 'm’étonner'
 'n’admettait' 'n’agit' 'n’ai' 'n’aurais' 'n’avais' 'n’avez' 'n’effrayait'
 'n’en' 'n’est' 'n’est-ce' 'n’ont' 'n’ose' 'n’y' 'n’était' 's’accentua'
 's’accrocher' 's’agit' 's’aperçut' 's’approcha' 's’arranger' 's’assirent'
 's’en' 's’enfonçait' 's’informa' 's’installer' 's’interrompit' 's’opérait'
 's’y' 's’écria' 's’éloignait' 's’étaient' 's’était']

Taking the above list, we’ll plot a stem-and-leaf, skipping the apostrophe.



In [8]:

    
stem_graphic(res, title="word pair ellipsis from the original French edition of 'Arsene Lupin vs. Herlock Sholmes'",
             leaf_skip=1, random_state=121);
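
To make the effect of skipping the apostrophe concrete, here is a small sketch in plain Python. It reflects one reading of the parameter (keep the first letter as the stem, then skip a number of characters before taking the leaf); it is not stemgraphic's code, and the helper stem_and_leaf is a hypothetical name.

```python
def stem_and_leaf(word, leaf_skip=0):
    """Return (stem, leaf): first letter, then one letter taken after
    skipping leaf_skip characters."""
    stem = word[:1]
    leaf = word[1 + leaf_skip:2 + leaf_skip]
    return stem, leaf


for w in ["n’est", "l’œil", "d’y"]:
    # Without skipping, the leaf is the apostrophe itself;
    # skipping one character exposes the first letter of the second word.
    print(w, stem_and_leaf(w), stem_and_leaf(w, leaf_skip=1))
```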

Some observations:

Besides accented letters, we also see typographical ligatures such as œ (o and e joined, as in l’œil) as a first character.

Of those starting with n’ (a contracted form of the negation ne), we see that the word right after the apostrophe most often starts with an e, followed closely by a. There is also a y, as in n’y. It would be interesting to compare the use of y across authors and eras in French literature in its different forms, such as d’y, n’y, m’y, l’y, s’y, and t’y.



In [9]:

    
stem_graphic(res, title="stem-and-leaf of words starting with n’", 
             column=['n'],  # look only at words starting with n
             stem_skip=2,  # skip the n’, leaves now become the stem
             leaf_order=2,  # and show pair of leaves
             random_state=121);

Not a lot of repetition: n’est is found 4 times, and everything else is a one-off.
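
A quick tally of the letter that follows n’ can also be made with collections.Counter. The sketch below uses the unique n’ forms from the list printed earlier, so each form counts once; the chart counts every occurrence in the sample, so the two rankings need not agree.

```python
from collections import Counter

# Unique forms starting with n’, copied from the printed list of unique words.
n_forms = [
    "n’admettait", "n’agit", "n’ai", "n’aurais", "n’avais", "n’avez",
    "n’effrayait", "n’en", "n’est", "n’est-ce", "n’ont", "n’ose",
    "n’y", "n’était",
]

# Skip the two leading characters (n and the apostrophe), keep the next letter.
first_letters = Counter(w[2:3] for w in n_forms)
print(first_letters.most_common())
```

Over unique forms, a comes first with six entries and e second with four; repeated occurrences of forms such as n’est are what shift the balance when every occurrence is counted.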


