In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from stemgraphic.alpha import ngram_data, word_scatter, stem_scatter, scatter
from stemgraphic.stopwords import EN
We enable interactive plots with cufflinks.
In [2]:
import cufflinks as cf
cf.go_offline()
We will compare stem-and-leaf scatter plots with leaf orders of 1, 2, and 3 for two stories by Arthur Conan Doyle. We will also enable jitter so that overlapping points can be told apart.
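To make the idea of jitter concrete, here is a minimal sketch of the general technique: a small random offset is added to each coordinate so that points sharing the same counts no longer sit on top of each other. The `add_jitter` helper, its `scale` and `seed` parameters, and the sample counts are hypothetical and only illustrate the idea; stemgraphic may implement its `jitter` option differently.

```python
import random

def add_jitter(values, scale=0.3, seed=0):
    """Return a copy of values, each shifted by a uniform offset in [-scale, scale].

    Hypothetical helper for illustration; not part of stemgraphic.
    """
    rng = random.Random(seed)
    return [v + rng.uniform(-scale, scale) for v in values]

# Made-up counts: several items share the same (x, y) position.
counts_x = [1, 1, 1, 2, 2]
counts_y = [1, 1, 1, 3, 3]

jittered_x = add_jitter(counts_x, seed=1)
jittered_y = add_jitter(counts_y, seed=2)

distinct_before = len(set(zip(counts_x, counts_y)))
distinct_after = len(set(zip(jittered_x, jittered_y)))
print(distinct_before, distinct_after)
```

Without jitter the five points collapse onto two positions. With it, each point is drawn separately while staying within `scale` of its true count.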
In [3]:
src1 = '../datasets/The Red Headed League by Arthur Conan Doyle.txt'
src2 = '../datasets/A Case of Identity by Arthur Conan Doyle.txt'
scatter defaults to stem order of 1 and leaf order of 1, so we can use it directly. There is also a stem_scatter alias.
In [4]:
scatter(src1, src2, jitter=True, stop_words=EN);
scatter(src1, src2, leaf_order=2, jitter=True, stop_words=EN);
scatter(src1, src2, leaf_order=3, jitter=True, stop_words=EN);
'co', 'hol' and 'mr' end up as the most common with leaf order 1 (bigram), leaf order 2 (trigram) and leaf order 3 (4-gram), respectively. 'mr' is only 2 characters long even though we are looking at a stem order of 1 and a leaf order of 3 (4-gram). That is normal: any complete word counts, even if it is shorter.
If you prefer to be explicit that you are using a stem-and-leaf count rather than a word count, scatter can be called as stem_scatter (and note the effect of not specifying jitter; the default is False):
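The counting described above can be sketched in plain Python. This is an illustration of the idea only, assuming that a stem-and-leaf key is simply the first `stem_order + leaf_order` characters of each word. The `prefix_counts` helper, its tokenizing regular expression, and the sample sentence are hypothetical and are not stemgraphic's implementation.

```python
import re
from collections import Counter

def prefix_counts(text, stem_order=1, leaf_order=1):
    """Count words by their first stem_order + leaf_order characters.

    Hypothetical helper for illustration; not part of stemgraphic.
    Words shorter than the key width are kept whole.
    """
    width = stem_order + leaf_order
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word[:width] for word in words)

# Made-up sentence, not taken from the stories.
sample = "Mr. Holmes held the hollow holder; Mr. Wilson came"

# Stem order 1 and leaf order 2 give three-character keys.
counts = prefix_counts(sample, stem_order=1, leaf_order=2)
print(counts.most_common(2))  # [('hol', 3), ('mr', 2)]
```

Slicing a two-character word such as 'mr' with a wider key returns the whole word, which is why short complete words still show up in the higher-order counts.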
In [5]:
ax, red = stem_scatter(src1, src2, leaf_order=2, stop_words=EN);
red.head()
Out[5]:
Even with cufflinks loaded, we can force a matplotlib output by specifying interactive=False. By default, log scale is enabled. If we disable it, the chart is much less readable.
In [6]:
word_scatter(src1, src2, interactive=False, log_scale=False, jitter=True, stop_words=EN);
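One way to see why the linear chart is harder to read: word counts are heavily skewed, so on a linear axis most points are squeezed next to the origin. The sketch below uses made-up counts (not taken from the stories) and compares where each value would land on a normalized linear axis and on a normalized log axis.

```python
import math

# Made-up counts: many rare items and a few very frequent ones.
counts = [1, 2, 3, 5, 40, 400]
top = max(counts)

# Position of each count along the axis, scaled to the range [0, 1].
linear_pos = [c / top for c in counts]
log_pos = [math.log10(c) / math.log10(top) for c in counts]

# How many points fall within the first 5% of the axis?
crowded_linear = sum(p < 0.05 for p in linear_pos)
crowded_log = sum(p < 0.05 for p in log_pos)
print(crowded_linear, crowded_log)  # 4 1
```

On the linear axis, four of the six points share the first 5% of the axis. On the log axis only the count of 1 does, and the rest are spread out.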
We return to the default log scale. We will now compare the plot without stop word removal against the plot with it. This also shows how to stack two word_scatter plots in one figure.
In [7]:
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12,12))
word_scatter(src1, src2, interactive=False, jitter=True, ax=ax1);
word_scatter(src1, src2, interactive=False, jitter=True, stop_words=EN, ax=ax2);
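The difference between the two panels comes down to filtering before counting. Here is a minimal sketch of that idea, using a tiny hypothetical stop word set (`TINY_STOP`, which is not stemgraphic's `EN` list), a hypothetical `word_counts` helper, and a made-up sentence.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word set; stemgraphic's EN list is not reproduced here.
TINY_STOP = {"the", "a", "of", "and", "to"}

def word_counts(text, stop_words=None):
    """Count lowercase words, skipping any that appear in stop_words.

    Hypothetical helper for illustration; not part of stemgraphic.
    """
    stop_words = stop_words or set()
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words)

sample = "the league of the red headed men and the case of identity"

raw = word_counts(sample)
filtered = word_counts(sample, stop_words=TINY_STOP)

print(raw.most_common(2))   # [('the', 3), ('of', 2)]
print(sorted(filtered))     # ['case', 'headed', 'identity', 'league', 'men', 'red']
```

Without filtering, function words dominate the top of the counts. With filtering, only content words remain to be compared.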
Finally, we plot the stop-word-filtered comparison from In [7] in interactive mode.
In [8]:
ax, res = word_scatter(src1, src2, jitter=True, stop_words=EN);