stemgraphic.alpha

Demonstration of the text handling capabilities of stemgraphic. Available on PyPI, GitHub and at stemgraphic.org.


In [1]:
%matplotlib inline
from stemgraphic import alpha
from stemgraphic.stopwords import *

In [2]:
help(alpha)


Help on module stemgraphic.alpha in stemgraphic:

NAME
    stemgraphic.alpha - stemgraphic.alpha.

DESCRIPTION
    Stemgraphic provides a complete set of functions to handle everything related to stem-and-leaf plots. alpha is a
    module of the stemgraphic package to add support for categorical and text variables.
    
    The module also adds functionality to handle whole words, besides stem-and-leaf bigrams and n-grams.
    
    For example, for the word "alabaster":
    
    With word_ functions, we can look at the word frequency in a text, or compare it through a distance function
    (defaults to Levenshtein) to other words in a corpus.
    
    With stem_ functions, we can look at the fundamental stem-and-leaf: the stem would be 'a' and the leaf would be
    'l', for a bigram 'al'. With a stem_order of 1 and a leaf_order of 2, we would have 'a' and 'la', for a trigram
    'ala', and so on.

FUNCTIONS
    add_missing_letters(mat, stem_order, leaf_order, letters=None)
        Add missing stems based on LETTERS. Defaults to the a-z alphabet.
        
        :param mat: matrix to modify
        :param stem_order: how many stem characters per data point to display, defaults to 1
        :param leaf_order: how many leaf characters per data point to display, defaults to 1
        :param letters: letters that must be present as stems
        :return: the modified matrix
    
    heatmap(src, alpha_only=False, annotate=False, asFigure=False, ax=None, caps=False, compact=True, display=None, interactive=True, leaf_order=1, random_state=None, stem_order=1, stop_words=None)
        The heatmap displays the same underlying data as the stem-and-leaf plot, but instead of stacking the leaves,
        they are left in their respective columns. Row 'a' and Column 'b' would have the count of words starting
        with 'ab'. The heatmap is useful to look at patterns. For distribution, stem_graphic is better suited.
        
        :param src: string, filename, url, list, numpy array, time series, pandas or dask dataframe
        :param alpha_only: only use stems from a-z alphabet
        :param annotate: display annotations (Z) on heatmap
        :param asFigure: return plot as plotly figure (for web applications)
        :param ax:  matplotlib axes instance, usually from a figure or other plot
        :param caps: bool, True to be case sensitive
        :param compact: remove empty stems
        :param display: maximum number of data points to display, forces sampling if smaller than len(df)
        :param interactive: if cufflinks is loaded, renders as interactive plot in notebook
        :param leaf_order: how many leaf characters per data point to display, defaults to 1
        :param random_state: initial random seed for the sampling process, for reproducible research
        :param stem_order: how many stem characters per data point to display, defaults to 1
        :param stop_words: stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
        :return:
    
    heatmap_grid(src1, src2, src3=None, src4=None, alpha_only=True, annot=False, caps=False, center=0, cmap=None, display=1000, leaf_order=1, random_state=None, robust=False, stem_order=1, stop_words=None, threshold=0)
        heatmap_grid.
        
        With stem_graphic, it is possible to directly compare two different sources. In the case of a heatmap,
        two different data sets cannot be visualized directly on a single heatmap. For this task, we designed
        heatmap_grid to adapt to the number of sources to build a layout. It can take from 2 to 4 different sources.
        
        With 2 sources, a square grid will be generated, allowing for horizontal and vertical comparisons,
        with an extra heatmap showing the difference between the two matrices. It also computes a norm for that
        difference matrix. The smaller the value, the closer the two heatmaps are.
        
        With 3 sources, it builds a triangular grid, with each source heatmap in a corner and the difference between
        each pair in between.
        
        Finally, with 4 sources, a 3 x 3 grid is built, each source in a corner and the
        difference between each pair in between, with the center expressing the difference between top left
        and bottom right diagonal.
        
        :param src1: string, filename, url, list, numpy array, time series, pandas or dask dataframe (required)
        :param src2: string, filename, url, list, numpy array, time series, pandas or dask dataframe (required)
        :param src3: string, filename, url, list, numpy array, time series, pandas or dask dataframe (optional)
        :param src4: string, filename, url, list, numpy array, time series, pandas or dask dataframe (optional)
        :param alpha_only: only use stems from a-z alphabet
        :param annot: display annotations (Z) on heatmap
        :param caps: bool, True to be case sensitive, defaults to False, recommended for comparisons.
        :param center: the center of the divergent color map for the difference heatmaps
        :param cmap: color map for difference heatmap or None (default) to use the builtin red / blue divergent map
        :param display: maximum number of data points to display, forces sampling if smaller than len(df)
        :param leaf_order: how many leaf characters per data point to display, defaults to 1
        :param robust: reduce effect of outliers on difference heatmap
        :param random_state: initial random seed for the sampling process, for reproducible research
        :param stem_order: how many stem characters per data point to display, defaults to 1
        :param stop_words: stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
        :param threshold: absolute value minimum count difference for a difference heatmap element to be visible
        :return:
    
    matrix_difference(mat1, mat2, thresh=0)
        matrix_difference
        
        :param mat1: first heatmap dataframe
        :param mat2: second heatmap dataframe
        :param thresh: absolute value minimum count difference for a difference heatmap element to be visible
        :return: difference matrix, norm and ratio of the sum of the first matrix over the second
    
    ngram_data(df, alpha_only=False, ascending=True, binary=False, break_on=None, caps=True, char_filter=None, column=None, compact=False, display=750, leaf_order=1, leaf_skip=0, persistence=None, random_state=None, remove_accents=False, rows_only=True, sort_by='len', stem_order=1, stem_skip=0, stop_words=None)
        ngram_data
        
        This is the main text ingestion function for stemgraphic.alpha. It is used by most of the visualizations. It
        can also be used directly, to feed a pipeline, for example.
        
        If selected (rows_only=False), the returned dataframe includes in each row a single word, the stem, the leaf and
        the ngram (stem + leaf) - the index is the 'token' position in the original source:
        
            word    stem    leaf    ngram
            -----------------------------
        12  salut   s       a       sa
        13  chéri   c       h       ch
        
        :param df: list, numpy array, series, pandas or dask dataframe
        :param alpha_only: only use stems from a-z alphabet (NA on dataframe)
        :param ascending: bool if the sort is ascending
        :param binary: bool if True forces counts to 1 for anything greater than 0
        :param break_on: letter on which to break a row, or None (default)
        :param caps: bool, True to be case sensitive, defaults to False, recommended for comparisons.(NA on dataframe)
        :param char_filter: list of characters to ignore. If None (default) CHAR_FILTER list will be used
        :param column: specify which column (string or number) of the dataframe to use, or group of columns (stems)
                       else the frame is assumed to only have one column with words.
        :param compact: remove empty stems
        :param display: maximum number of data points to display, forces sampling if smaller than len(df)
        :param leaf_order: how many leaf characters per data point to display, defaults to 1
        :param leaf_skip: how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: 'wol','wor','woo'
        :param persistence: will save the sampled dataframe to filename (with csv or pkl extension) or None
        :param random_state: initial random seed for the sampling process, for reproducible research
        :param remove_accents: bool if True strips accents (NA on dataframe)
        :param rows_only: bool by default returns only the stem and leaf rows. If false, also return the matrix and dataframe
        :param sort_by: default to 'len', can also be 'alpha'
        :param stem_order: how many stem characters per data point to display, defaults to 1
        :param stem_skip: how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
        :param stop_words: stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
        :return: ordered rows if rows_only, else also returns the matrix and dataframe
    
    plot_sunburst_level(normalized, ax, label=True, level=0, offset=0, ngram=False, plot=True, stem=None, vis=0)
        plot_sunburst_level
        
        utility function for sunburst function.
        
        :param normalized:
        :param ax:
        :param label:
        :param level:
        :param ngram:
        :param offset:
        :param plot:
        :param stem:
        :param vis:
        :return:
    
    polar_word_plot(ax, word, words, label, min_dist, max_dist, metric, offset, step)
        polar_word_plot
        
        Utility function for radar plot.
        
        :param ax: matplotlib ax
        :param word: string, the reference word that will be placed in the middle
        :param words: list of words to compare
        :param label:  bool if True display words centered at coordinate
        :param min_dist: minimum distance based on metric to include a word for display
        :param max_dist: maximum distance for a given section
        :param metric: any metric function accepting two values and returning that metric in a range from 0 to x
        :param offset: where to start plotting in degrees
        :param step: how many degrees to step between plots
        :return:
    
    radar(word, comparisons, ascending=True, display=100, label=True, metric=None, min_distance=1, max_distance=None, random_state=None, sort_by='alpha')
        radar
        
        The radar plot compares a reference word with a corpus. By default, it calculates the Levenshtein
        distance between the reference word and each word in the corpus. An alternate distance or metric
        function can be provided. Each word is then plotted around the center based on 3 criteria.
        
        1) If the word length is longer, it is plotted on the left side, else on the right side.
        
        2) Distance from center is based on the distance function.
        
        3) The words are equidistant, and their order defined alphabetically or by count (only applicable
           if the corpus is a text and not a list of unique words, such as a password dictionary).
        
        Stem-and-leaf support is upcoming.
        
        :param word: string, the reference word that will be placed in the middle
        :param comparisons: external file, list or string or dataframe of words
        :param ascending: bool if the sort is ascending
        :param display: maximum number of data points to display, forces sampling if smaller than len(df)
        :param label: bool if True display words centered at coordinate
        :param metric: Levenshtein (default), or any metric function accepting two values and returning that metric
        :param min_distance: minimum distance based on metric to include a word for display
        :param max_distance: maximum distance based on metric to include a word for display
        :param random_state: initial random seed for the sampling process, for reproducible research
        :param sort_by: default to 'alpha', can also be 'len'
        :return:
    
    radians(...)
        radians(x)
        
        Convert angle x from degrees to radians.
    
    stem_freq_plot(df, alpha_only=False, asFigure=False, column=None, compact=True, caps=False, display=2600, interactive=True, kind='barh', leaf_order=1, leaf_skip=0, random_state=None, stem_order=1, stem_skip=0, stop_words=None)
        stem_freq_plot
        
        Word frequency plot is the most common visualization in NLP. In this version it supports stem-and-leaf / n-grams.
        
        Each row is the stem, and similar leaves are grouped together and each different group is stacked
        in bar charts.
        
        Default is horizontal bar chart, but vertical, histograms, area charts and even pie charts are
        supported by this one visualization.
        
        
        :param df: string, filename, url, list, numpy array, time series, pandas or dask dataframe
        :param alpha_only: only use stems from a-z alphabet (NA on dataframe)
        :param asFigure: return plot as plotly figure (for web applications)
        :param column: specify which column (string or number) of the dataframe to use, or group of columns (stems)
                       else the frame is assumed to only have one column with words.
        :param compact: do not display empty stem rows (with no leaves), defaults to False
        :param caps: bool, True to be case sensitive, defaults to False, recommended for comparisons.(NA on dataframe)
        :param display: maximum number of data points to display, forces sampling if smaller than len(df)
        :param interactive: if cufflinks is loaded, renders as interactive plot in notebook
        :param kind: defaults to 'barh'. One of 'bar','barh','area','hist'. Non-interactive also supports 'pie'
        :param leaf_order: how many leaf characters per data point to display, defaults to 1
        :param leaf_skip: how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: 'wol','wor','woo'
        :param random_state: initial random seed for the sampling process, for reproducible research
        :param stem_order: how many stem characters per data point to display, defaults to 1
        :param stem_skip: how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
        :param stop_words: stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
        :return:
    
    stem_graphic(df, df2=None, aggregation=True, alpha=0.1, alpha_only=True, ascending=False, ax=None, bar_color='C0', bar_outline=None, break_on=None, caps=True, column=None, combined=None, compact=False, delimiter_color='C3', display=750, figure_only=True, flip_axes=False, font_kw=None, leaf_color='k', leaf_order=1, leaf_skip=0, legend_pos='best', median_color='magenta', mirror=False, persistence=None, primary_kw=None, random_state=None, remove_accents=False, secondary=False, show_stem=True, sort_by='len', stop_words=None, stem_order=1, stem_skip=0, title=None, trim_blank=False, underline_color=None)
        stem_graphic
        
        The principal visualization of stemgraphic.alpha is stem_graphic. It offers all the
        options of stem_text (3.1) and adds automatic title, mirroring, flipping of axes,
        export (to pdf, svg, png, through fig.savefig) and many more options to change the
        visual appearance of the plot (font size, color, background color, underlining and more).
        
        By providing a secondary text source, the plot will enable comparison through a back-to-back display.
        
        
        :param df: string, filename, url, list, numpy array, time series, pandas or dask dataframe
        :param df2: string, filename, url, list, numpy array, time series, pandas or dask dataframe (optional).
                    for back 2 back stem-and-leaf plots
        :param aggregation: Boolean for sum, else specify function
        :param alpha: opacity of the bars, median and outliers, defaults to 10%
        :param alpha_only: only use stems from a-z alphabet (NA on dataframe)
        :param ascending: stem sorted in ascending order, defaults to True
        :param ax: matplotlib axes instance, usually from a figure or other plot
        :param bar_color: the fill color of the bar representing the leaves
        :param bar_outline: the outline color of the bar representing the leaves
        :param break_on: force a break of the leaves at that letter, the rest of the leaves will appear on the next line
        :param caps: bool, True to be case sensitive, defaults to False, recommended for comparisons.(NA on dataframe)
        :param column: specify which column (string or number) of the dataframe to use, or group of columns (stems)
                       else the frame is assumed to only have one column with words.
        :param combined: list (specific subset to automatically include, say, for comparisons), or None
        :param compact: do not display empty stem rows (with no leaves), defaults to False
        :param delimiter_color: color of the line between aggregate and stem and stem and leaf
        :param display: maximum number of data points to display, forces sampling if smaller than len(df)
        :param figure_only: bool if True (default) returns matplotlib (fig,ax), False returns (fig,ax,df)
        :param flip_axes: X becomes Y and Y becomes X
        :param font_kw: keyword dictionary, font parameters
        :param leaf_color: font color of the leaves
        :param leaf_order: how many leaf characters per data point to display, defaults to 1
        :param leaf_skip: how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: 'wol','wor','woo'
        :param legend_pos: One of 'top', 'bottom', 'best' or None, defaults to 'best'.
        :param median_color: color of the box representing the median
        :param mirror: mirror the plot in the axis of the delimiters
        :param persistence: filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)
        :param primary_kw: stem-and-leaf plot additional arguments
        :param random_state: initial random seed for the sampling process, for reproducible research
        :param remove_accents: bool if True strips accents (NA on dataframe)
        :param secondary: bool if True, this is a secondary plot - mostly used for back-to-back plots
        :param show_stem: bool if True (default) displays the stems
        :param sort_by: default to 'len', can also be 'alpha'
        :param stem_order: how many stem characters per data point to display, defaults to 1
        :param stem_skip: how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
        :param stop_words: stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
        :param title: string, or None. When None and source is a file, filename will be used.
        :param trim_blank: remove the blank between the delimiter and the first leaf, defaults to True
        :param underline_color: color of the horizontal line under the leaves, None for no display
        :return: matplotlib figure and axes instance, and dataframe if figure_only is False
    
    stem_sunburst(words, alpha_only=True, ascending=False, caps=False, compact=True, display=None, hole=True, label=True, leaf_order=1, leaf_skip=0, median=True, ngram=False, random_state=None, sort_by='alpha', statistics=True, stem_order=1, stem_skip=0, stop_words=None, top=0)
        stem_sunburst
        
        Stem-and-leaf based sunburst. See sunburst for details
        
        :param words: string, filename, url, list, numpy array, time series, pandas or dask dataframe
        :param alpha_only: only use stems from a-z alphabet (NA on dataframe)
        :param ascending: stem sorted in ascending order, defaults to True
        :param caps: bool, True to be case sensitive, defaults to False, recommended for comparisons.(NA on dataframe)
        :param compact: do not display empty stem rows (with no leaves), defaults to False
        :param display: maximum number of data points to display, forces sampling if smaller than len(df)
        :param hole: bool if True (default) leave space in middle for statistics
        :param label: bool if True display words centered at coordinate
        :param leaf_order: how many leaf characters per data point to display, defaults to 1
        :param leaf_skip: how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: 'wol','wor','woo'
        :param median: bool if True (default) display an origin and a median mark
        :param ngram: bool if True display full n-gram as leaf label
        :param random_state: initial random seed for the sampling process, for reproducible research
        :param sort_by: sort by 'alpha' (default) or 'count'
        :param stem_order: how many stem characters per data point to display, defaults to 1
        :param stem_skip: how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
        :param stop_words: stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
        :param top: how many different words to count by order frequency. If negative, this will be the least frequent
        :return:
    
    stem_text(df, aggr=False, alpha_only=True, ascending=True, binary=False, break_on=None, caps=True, column=None, compact=False, display=750, legend_pos='top', leaf_order=1, leaf_skip=0, persistence=None, remove_accents=False, rows_only=False, sort_by='len', stem_order=1, stem_skip=0, stop_words=None, random_state=None)
        stem_text
        
        Tukey's original stem-and-leaf plot was text, with a vertical delimiter to separate stem from
        leaves. Just as stemgraphic implements a text version of the plot for numbers,
        stemgraphic.alpha implements a text version for words. This type of plot serves a similar
        purpose as a stacked bar chart with each data point annotated.
        
        It also displays some basic statistics on the whole text (or subset if using column).
        
        :param df: list, numpy array, time series, pandas or dask dataframe
        :param aggr: bool if True display the aggregated count of leaves by row
        :param alpha_only: only use stems from a-z alphabet (NA on dataframe)
        :param ascending: bool if the sort is ascending
        :param binary: bool if True forces counts to 1 for anything greater than 0
        :param break_on: force a break of the leaves at that letter, the rest of the leaves will appear on the next line
        :param caps: bool, True to be case sensitive, defaults to False, recommended for comparisons.(NA on dataframe)
        :param column: specify which column (string or number) of the dataframe to use, or group of columns (stems)
                       else the frame is assumed to only have one column with words.
        :param compact: do not display empty stem rows (with no leaves), defaults to False
        :param display: maximum number of data points to display, forces sampling if smaller than len(df)
        :param leaf_order: how many leaf characters per data point to display, defaults to 1
        :param leaf_skip: how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: 'wol','wor','woo'
        :param legend_pos: where to put the legend: 'top' (default), 'bottom' or None
        :param persistence: will save the sampled dataframe to filename (with csv or pkl extension) or None
        :param random_state: initial random seed for the sampling process, for reproducible research
        :param remove_accents: bool if True strips accents (NA on dataframe)
        :param rows_only: by default returns only the stem and leaf rows. If false, also return the matrix and dataframe
        :param sort_by: default to 'len', can also be 'alpha'
        :param stem_order: how many stem characters per data point to display, defaults to 1
        :param stem_skip: how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
        :param stop_words: stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
    
    sunburst(words, alpha_only=True, ascending=False, caps=False, compact=True, display=None, hole=True, label=True, leaf_order=1, leaf_skip=0, median=True, ngram=True, random_state=None, sort_by='alpha', statistics=True, stem_order=1, stem_skip=0, stop_words=None, top=40)
        sunburst
        
        Word sunburst charts are similar to pie or donut charts, but add some statistics
        in the middle of the chart, including the percentage of total words targeted for a given
        number of unique words (i.e. top 50 words, 48% coverage).
        
        With stem-and-leaf, the first level of the sunburst represents the stem and the second
        level subdivides each stem by leaves.
        
        :param words: string, filename, url, list, numpy array, time series, pandas or dask dataframe
        :param alpha_only: only use stems from a-z alphabet (NA on dataframe)
        :param ascending: stem sorted in ascending order, defaults to True
        :param caps: bool, True to be case sensitive, defaults to False, recommended for comparisons.(NA on dataframe)
        :param compact: do not display empty stem rows (with no leaves), defaults to False
        :param display: maximum number of data points to display, forces sampling if smaller than len(df)
        :param hole: bool if True (default) leave space in middle for statistics
        :param label: bool if True display words centered at coordinate
        :param leaf_order: how many leaf characters per data point to display, defaults to 1
        :param leaf_skip: how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: 'wol','wor','woo'
        :param median: bool if True (default) display an origin and a median mark
        :param ngram: bool if True (default) display full n-gram as leaf label
        :param random_state: initial random seed for the sampling process, for reproducible research
        :param statistics: bool if True (default) displays statistics in center - hole has to be True
        :param sort_by: sort by 'alpha' (default) or 'count'
        :param stem_order: how many stem characters per data point to display, defaults to 1
        :param stem_skip: how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
        :param stop_words: stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
        :param top: how many different words to count by order frequency. If negative, this will be the least frequent
        :return: matplotlib polar ax, dataframe
    
    word_freq_plot(src, ascending=False, alpha_only=False, asFigure=False, caps=False, display=None, interactive=True, kind='barh', random_state=None, sort_by='count', stop_words=None, top=100)
        word frequency bar chart.
        
        This function creates a classical word frequency bar chart.
        
        :param src: Either a filename including path, a url or a ready to process text in a dataframe or a tokenized format.
        :param alpha_only: words only if True, words and numbers if False
        :param ascending: stem sorted in ascending order, defaults to True
        :param asFigure: if interactive, the function will return a plotly figure instead of a matplotlib ax
        :param caps: keep capitalization (True, False)
        :param display: if specified, sample that quantity of words
        :param interactive: interactive graphic (True, False)
        :param kind: horizontal bar chart (barh) - also 'bar', 'area', 'hist' and non interactive 'kde' and 'pie'
        :param random_state: initial random seed for the sampling process, for reproducible research
        :param sort_by: default to 'count', can also be 'alpha'
        :param stop_words: a list of words to ignore
        :param top: how many different words to count by order frequency. If negative, this will be the least frequent
        :return: text as dataframe and plotly figure or matplotlib ax
    
    word_radar(word, comparisons, ascending=True, display=100, label=True, metric=None, min_distance=1, max_distance=None, random_state=None, sort_by='alpha')
        word_radar
        
        Radar plot based on words. Currently, the only type of radar plot supported. See 'radar' for more detail.
        
        :param word: string, the reference word that will be placed in the middle
        :param comparisons: external file, list or string or dataframe of words
        :param ascending: bool if the sort is ascending
        :param display: maximum number of data points to display, forces sampling if smaller than len(df)
        :param label: bool if True display words centered at coordinate
        :param metric: any metric function accepting two values and returning that metric in a range from 0 to x
        :param min_distance: minimum distance based on metric to include a word for display
        :param max_distance: maximum distance based on metric to include a word for display
        :param random_state: initial random seed for the sampling process, for reproducible research
        :param sort_by: default to 'alpha', can also be 'len'
        :return:
    
    word_sunburst(words, alpha_only=True, ascending=False, caps=False, compact=True, display=None, hole=True, label=True, leaf_order=None, leaf_skip=0, median=True, ngram=True, random_state=None, sort_by='alpha', statistics=True, stem_order=None, stem_skip=0, stop_words=None, top=40)
        word_sunburst
        
        Word based sunburst. See sunburst for details
        
        :param words: string, filename, url, list, numpy array, time series, pandas or dask dataframe
        :param alpha_only: only use stems from a-z alphabet (NA on dataframe)
        :param ascending: stem sorted in ascending order, defaults to True
        :param caps: bool, True to be case sensitive, defaults to False, recommended for comparisons.(NA on dataframe)
        :param compact: do not display empty stem rows (with no leaves), defaults to False
        :param display: maximum number of data points to display, forces sampling if smaller than len(df)
        :param hole: bool if True (default) leave space in middle for statistics
        :param label: bool if True display words centered at coordinate
        :param leaf_order: how many leaf characters per data point to display, defaults to 1
        :param leaf_skip: how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: 'wol','wor','woo'
        :param median: bool if True (default) display an origin and a median mark
        :param ngram: bool if True (default) display full n-gram as leaf label
        :param random_state: initial random seed for the sampling process, for reproducible research
        :param statistics: bool if True (default) displays statistics in center - hole has to be True
        :param sort_by: sort by 'alpha' (default) or 'count'
        :param stem_order: how many stem characters per data point to display, defaults to 1
        :param stem_skip: how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
        :param stop_words: stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
        :param top: how many different words to count by order frequency. If negative, this will be the least frequent
        :return:

DATA
    CHAR_FILTER = ['\t', '\n', r'\', '/', '`', '*', '_', '{', '}', '[', ']...
    LETTERS = 'abcdefghijklmnopqrstuvwxyz'
    NON_ALPHA = ['-', '+', '/', '[', ']', '_', '£', '1', '2', '3', '4', '5...

FILE
    /home/fdion/github/stemgraphic/stemgraphic/alpha.py



In [3]:
help(alpha.word_freq_plot)


Help on function word_freq_plot in module stemgraphic.alpha:

word_freq_plot(src, ascending=False, alpha_only=False, asFigure=False, caps=False, display=None, interactive=True, kind='barh', random_state=None, sort_by='count', stop_words=None, top=100)
    word frequency bar chart.
    
    This function creates a classical word frequency bar chart.
    
    :param src: Either a filename including path, a url or a ready to process text in a dataframe or a tokenized format.
    :param alpha_only: words only if True, words and numbers if False
    :param ascending: stem sorted in ascending order, defaults to True
    :param asFigure: if interactive, the function will return a plotly figure instead of a matplotlib ax
    :param caps: keep capitalization (True, False)
    :param display: if specified, sample that quantity of words
    :param interactive: interactive graphic (True, False)
    :param kind: horizontal bar chart (barh) - also 'bar', 'area', 'hist' and non interactive 'kde' and 'pie'
    :param random_state: initial random seed for the sampling process, for reproducible research
    :param sort_by: default to 'count', can also be 'alpha'
    :param stop_words: a list of words to ignore
    :param top: how many different words to count by order frequency. If negative, this will be the least frequent
    :return: text as dataframe and plotly figure or matplotlib ax

The usual suspect: word frequency

Word Frequency Plot


In [4]:
alpha.word_freq_plot('A Case of Identity by Arthur Conan Doyle.txt', top=40);


Interactive plot requested, but cufflinks not loaded. Falling back to matplotlib.
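
For readers who want to see what a word frequency count involves, here is a minimal sketch in plain Python. It is not the stemgraphic implementation: the function name `word_frequencies`, the regular expression used for tokenizing, and the tiny `STOP_WORDS` set are all assumptions made for illustration. Only the `top`, `stop_words` and `caps` parameter names echo the help text above.

```python
import re
from collections import Counter

# Hypothetical stand-in for a stop word list such as EN; not the stemgraphic list.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "it"}


def word_frequencies(text, top=40, stop_words=None, caps=False):
    """Return the `top` most frequent words as (word, count) pairs."""
    if not caps:
        text = text.lower()  # ignore capitalization unless caps=True
    words = re.findall(r"[A-Za-z']+", text)  # assumed tokenizer: runs of letters
    if stop_words:
        words = [w for w in words if w.lower() not in stop_words]
    return Counter(words).most_common(top)


sample = "The case of identity is a case of the mind, and the mind is the case."
result = word_frequencies(sample, top=3, stop_words=STOP_WORDS)
print(result)  # [('case', 3), ('mind', 2), ('identity', 1)]
```

The resulting pairs are what a bar chart such as the one above would plot, one bar per word.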

Word Sunburst


In [5]:
alpha.word_sunburst('iden.txt', top=40, sort_by='count', ascending=False);


Interactivity

Let's try the same plots after importing cufflinks, so the interactive versions are used instead of the matplotlib fallback.


In [6]:
import cufflinks as cf
cf.go_offline()


Interactive Word Frequency Bar Chart Plot


In [7]:
alpha.word_freq_plot('iden.txt', top=40, stop_words=EN);


Interactive Word Frequency Area Chart Plot


In [8]:
alpha.word_freq_plot('iden.txt', kind='area', top=40, stop_words=EN);


Interactive Heatmap, Stem-and-Leaf


In [9]:
alpha.heatmap('iden.txt', alpha_only=True);
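
The help text describes the heatmap as a grid where row 'a' and column 'b' hold the count of words starting with 'ab'. A rough sketch of that counting step, assuming a stem order and leaf order of 1, might look like the following. The function name `bigram_matrix` is hypothetical, and the library's own data structure is likely different (for example a dataframe).

```python
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def bigram_matrix(words, letters=LETTERS):
    """Count words by (first letter, second letter); rows are stems, columns are leaves."""
    counts = Counter(
        (w[0], w[1])
        for w in (w.lower() for w in words)
        if len(w) >= 2 and w[0] in letters and w[1] in letters
    )
    # Counter returns 0 for missing pairs, so every cell of the 26x26 grid is filled.
    return [[counts[(stem, leaf)] for leaf in letters] for stem in letters]


words = ["about", "above", "able", "case", "cab", "identity"]
matrix = bigram_matrix(words)
row, col = LETTERS.index("a"), LETTERS.index("b")
print(matrix[row][col])  # 3 words start with 'ab'
```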


Exploratory text analysis

We read 'A Case of Identity' by Arthur Conan Doyle directly from disk and get a quick view of a few basic statistics, along with the shape of the overall distribution, using the default settings of stem_text.

Text stem-and-leaf plot


In [10]:
alpha.stem_text('iden.txt');


: 
              index
count    750.000000
mean    6264.593333
std     3654.379436
min       41.000000
25%     3012.250000
50%     6160.000000
75%     9491.250000
max    12699.000000
sampled  750

D| o
F| r
v| i
J| a
P| l
N| oo
L| ey
Y| oo
k| en
j| aou
O| Ffu
q| uuuuu
W| eeeiii
e| aalnvy
M| aorryy
S| hhhitu
g| aaeirrr
A|  bnnnnn
T| hhhhhhhh
H| aeeeoooooooo
y| eoooooooooooo
p| aaehilrrrruuu
l| aeeeeeeiilooo
u| nnnnppppppssss
d| aaaeeiiioooorr
n| eiooooooooooooo
r| aeeeeeeeeeeeiiuu
c| aaaaaaehhhhhooooooorru
m| aaaaaeeeeeiiiiioouuuuyyyy
I|                       ttt
f| aaaaaeeeeiiiooooooooorrrrrtuu
i| dffmnnnnnnnnnnnnnnnsssssstttttttt
b| aaaaeeeeeeeeeeelloorrrrruuuuuuuuy
s|    aaaaaaaaaceeeeeehhhhhhhiillmooooooooppttuw
o| bcffffffffffffffffffffffnnnnnnnrrtttuuuuuvvwww
h| aaaaaaaaaaaeeeeeeeeeeeeeeeeeeeeeeiiiiiiiiiioouy
w| aaaaaaaaaaaaaaaeeeeeeeeehhhhhhhhhhhhhhiiiiiiiiiioooooooooo
a|                       bbbbccddfffglllllllmmnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnrrrrsssssssssttttttt
t|  eehhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhiioooooooooooooooooooooooooorrrwwwyyy
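
To make the layout above concrete, here is a small sketch of how a text stem-and-leaf could be assembled by hand. It is an illustration under assumptions, not the library's code: the function name `stem_and_leaf`, the padding of one-letter words with a blank leaf, and the tie-breaking between rows of equal length are all choices made for this example.

```python
from collections import defaultdict


def stem_and_leaf(words, stem_order=1, leaf_order=1):
    """Group words by their first `stem_order` characters; leaves are the next `leaf_order`."""
    rows = defaultdict(list)
    for word in words:
        stem = word[:stem_order]
        # Words too short to have a leaf get a blank one (assumption).
        leaf = word[stem_order:stem_order + leaf_order].ljust(leaf_order)
        rows[stem].append(leaf)
    # Shortest rows first, leaves sorted within each row; ties broken by stem (assumption).
    return sorted(
        ((stem, sorted(leaves)) for stem, leaves in rows.items()),
        key=lambda row: (len(row[1]), row[0]),
    )


rows = stem_and_leaf(["the", "that", "to", "a", "an", "I", "it", "he"])
for stem, leaves in rows:
    print(f"{stem}| {''.join(leaves)}")
```

With a stem_order of 1 and a leaf_order of 2, the same function would split "alabaster" into stem 'a' and leaf 'la', matching the trigram example in the module description.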

Graphical stem-and-leaf plot

The same data, rendered graphically with stem_graphic.


In [11]:
alpha.stem_graphic('iden.txt');


Here we sort alphabetically, ignore capitalization and use an English stop word list. The stem and/or leaf order can always be specified.


In [12]:
alpha.stem_graphic('iden.txt', caps=False, sort_by='alpha', stop_words=EN, stem_order=1, leaf_order=1);


Sunburst stem-and-leaf plot


In [13]:
alpha.stem_sunburst('iden.txt');


Interactive Stem-and-Leaf stacked frequency plot

As an example, we limit the stems to English vowels.


In [14]:
alpha.stem_freq_plot('iden.txt', column=VOWELS);


Comparisons

Comparing one word with a list of words

Radar

By default, the metric is Levenshtein.distance.


In [15]:
alpha.radar('air', ['are', 'hare', 'hair', 'eyre', 'heir', 'err', 'ere']);
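
The radar plot relies on a distance between the reference word and each word in the list. As a sketch of what the Levenshtein distance measures, the pure Python function below counts the minimum number of single-character insertions, deletions and substitutions needed to turn one word into another. It is a textbook dynamic programming version written for this example, not the routine the library calls.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free when characters match
            ))
        previous = current
    return previous[-1]


word = "air"
others = ["are", "hare", "hair", "eyre", "heir", "err", "ere"]
distances = {other: levenshtein(word, other) for other in others}
print(distances)
```

Each value in `distances` corresponds to how far the matching spoke would sit from the center of a radar chart like the one above.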


Heatmap Grid


In [16]:
fig = alpha.heatmap_grid('redh.txt', 'iden.txt');



In [17]:
fig.savefig('heatmap_grid.pdf')

In [18]:
alpha.heatmap_grid('redh.txt', 'iden.txt', 'goldbug.txt');



In [19]:
alpha.heatmap_grid('donquixote_es.txt', 'viagealparnaso.txt', 'donquixote_en.txt', 'donquixote_fr.txt');


Back-to-back stem graphic


In [20]:
fig, ax = alpha.stem_graphic('redh.txt', 'iden.txt', caps=False, display=500, random_state=42, stop_words=EN);
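
The call above passes display=500 and random_state=42. According to the help text, display caps the number of data points and forces sampling, while random_state seeds that sampling for reproducibility. The sketch below shows the idea with the standard library only; the function name `sample_words` is hypothetical and the library may well sample differently (for instance through pandas).

```python
import random


def sample_words(words, display=None, random_state=None):
    """Return at most `display` words, sampled without replacement with a seeded generator."""
    words = list(words)
    if display is None or display >= len(words):
        return words  # nothing to sample, keep everything
    rng = random.Random(random_state)  # a fixed seed gives a repeatable sample
    return rng.sample(words, display)


corpus = [f"word{i}" for i in range(2000)]
first = sample_words(corpus, display=500, random_state=42)
second = sample_words(corpus, display=500, random_state=42)
print(len(first), first == second)  # 500 True
```

Using the same seed for both texts in a comparison keeps the plot stable from one run to the next.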


Saving to SVG, PNG, PDF, etc.


In [21]:
fig.savefig('red_headed_league_vs_a_case_of_identity.svg', bbox_inches='tight')

In [22]:
!ls -al red_headed_league_vs_a_case_of_identity.svg


-rw-rw-r-- 1 fdion fdion 126844 Feb 23 17:01 red_headed_league_vs_a_case_of_identity.svg
