.. _linear_categorical:

.. currentmodule:: seaborn

Linear models with categorical data
===================================


In [ ]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

In [ ]:
sns.set(style="whitegrid")
np.random.seed(sum(map(ord, "linear_categorical")))

In [ ]:
titanic = sns.load_dataset("titanic")
exercise = sns.load_dataset("exercise")
attend = sns.load_dataset("attention")
.. _factorplot:

Plotting categorical data with :func:`factorplot`
-------------------------------------------------

As with the quantitative functions :func:`lmplot` and :func:`regplot`, you can draw categorical plots with functions that operate at two different levels. In most cases, you'll want to use the :func:`factorplot` function. Like :func:`lmplot`, it plots onto a :class:`FacetGrid` and can visualize :ref:`a lot ` of data quickly. In some cases, however, you may want more control over the figure you're making, in which case you can use the lower-level functions :func:`pointplot` and :func:`barplot`. The :func:`factorplot` function uses these behind the scenes, and you can control which one is used with the ``kind`` parameter.

The API for the :func:`factorplot` function will be familiar by now. It draws data from a tidy DataFrame, and the positional arguments specify the names of the variables that will be placed on the x and y axes of the plot.

:func:`factorplot` also takes a third positional argument. It is named ``hue`` because it plays a role similar to that of the ``hue`` variable in :func:`lmplot`, plotting subsets of the data for easy direct comparison. In some cases, however, the ``hue`` variable will also affect the location on the x axis where data is plotted.

Because these functions are intended for use with *categorical* data, the x axis is not treated as a quantitative scale. Even so, there will be cases where the x variable has a natural ordering.

The two main kinds of categorical plots show the same data, but with a different emphasis. ``point`` plots are better for comparing between conditions:

In [ ]:
sns.factorplot("kind", "pulse", "diet", exercise, kind="point");
``bar`` plots, in contrast, are better for understanding overall magnitude and how far each value is from 0:

In [ ]:
sns.factorplot("kind", "pulse", "diet", exercise, kind="bar");
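Both kinds draw one summary value for each combination of the ``x`` and ``hue`` variables. As a rough sketch of that aggregation (not the library's own code), the following uses pandas on a small, made-up tidy DataFrame. The column names echo the ``exercise`` dataset, but the values are invented for illustration:

```python
import pandas as pd

# Hypothetical tidy data: one row per observation, one column per variable.
# The values are invented; only the column names echo the exercise dataset.
toy = pd.DataFrame({
    "kind":  ["rest", "rest", "rest", "rest",
              "running", "running", "running", "running"],
    "diet":  ["low fat", "low fat", "no fat", "no fat",
              "low fat", "low fat", "no fat", "no fat"],
    "pulse": [88, 92, 90, 94, 100, 110, 120, 130],
})

# One mean per (x, hue) cell: roughly the height of each point or bar.
cell_means = toy.groupby(["kind", "diet"])["pulse"].mean().unstack("diet")
print(cell_means)
```

Each row of ``cell_means`` corresponds to a position on the x axis, and each column to a level of the ``hue`` variable.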
You can also plot a :func:`factorplot` with a boxplot representation (using ``kind="box"``). While the above plots focus on the central tendency of the data (with a measure of the error associated with that value), the boxplot should be used when you care about the *distribution* of the data in different categories.

In [ ]:
sns.factorplot("kind", "pulse", "diet", exercise, kind="box");
When the ``kind`` is not specified, :func:`factorplot` uses a few heuristic rules to choose the appropriate kind of plot to draw. These rules are rough and may change over time, so it's better to specify ``kind`` explicitly.

In [ ]:
sns.factorplot("kind", "pulse", "diet", exercise);
Naturally, you can specify the palette to render the ``hue`` variable in. Any seaborn palette definition will work, and you can also pass a dictionary mapping values of the ``hue`` variable to colors.

In [ ]:
sns.factorplot("class", "fare", "sex", data=titanic, palette="Pastel1");
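A dictionary palette might be built as in the following sketch. The level names and hex colors here are assumptions chosen for illustration, and the check simply confirms that every level of the ``hue`` variable has a color assigned:

```python
# Hypothetical hue levels and colors, chosen only for illustration.
hue_levels = ["male", "female"]
palette = {"male": "#6495ED", "female": "#F08080"}

# Every level of the hue variable should have a color in the mapping.
missing = [level for level in hue_levels if level not in palette]
print(missing)
```

A dictionary like this would then be passed as ``palette=palette`` in place of a palette name.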
Options for grouping the categories
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It's not necessary to use a ``hue`` variable:

In [ ]:
sns.factorplot("class", "fare", data=titanic);
In that case, the palette is mapped to the levels of the ``x`` variable:

In [ ]:
sns.factorplot("class", "fare", data=titanic, palette="Greens_d");
In fact, you don't need to provide a ``y`` variable either. When ``y`` is missing, the height of the plot shows the *count* of observations in each category:

In [ ]:
sns.factorplot("class", data=titanic, palette="PuBuGn_d");

In [ ]:
sns.factorplot("class", hue="sex", data=titanic, palette="Pastel1");
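The counts drawn when ``y`` is missing can be reproduced by hand. This is a minimal sketch with pandas on invented records (the column names follow the ``titanic`` dataset, the values do not):

```python
import pandas as pd

# Hypothetical passenger records; the values are invented for illustration.
toy = pd.DataFrame({
    "class": ["First", "First", "Second", "Third", "Third", "Third"],
    "sex":   ["female", "male", "female", "male", "male", "female"],
})

# With only x: one count per category.
counts = toy["class"].value_counts().sort_index()

# With x and hue: one count per (category, hue level) pair.
counts_by_sex = toy.groupby(["class", "sex"]).size().unstack("sex", fill_value=0)

print(counts)
print(counts_by_sex)
```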
Estimators of central tendency and their error
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By default, the height of each bar or point shows the mean, and the error bars show a 95% confidence interval. Both can be changed.

In [ ]:
sns.factorplot("class", "fare", data=titanic, palette="Greens_d", estimator=np.median);
Remember that a 68% confidence interval corresponds approximately to one standard error of the estimator:

In [ ]:
sns.factorplot("class", "fare", data=titanic, kind="point", ci=68);
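The link between the ``ci`` value and the standard error can be checked numerically. The sketch below computes a percentile bootstrap interval with NumPy for an invented sample. The helper ``bootstrap_ci`` is hypothetical and only illustrates the idea; it is not seaborn's implementation:

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical sample standing in for the observations in one category.
sample = rng.normal(loc=30.0, scale=10.0, size=200)


def bootstrap_ci(data, estimator=np.mean, ci=95, n_boot=2000, rng=rng):
    """Percentile bootstrap interval for `estimator` (illustrative helper)."""
    # Resample the data with replacement n_boot times.
    idx = rng.randint(0, len(data), size=(n_boot, len(data)))
    stats = np.array([estimator(data[row]) for row in idx])
    # Cut off equal tails on each side of the bootstrap distribution.
    tail = (100 - ci) / 2
    return np.percentile(stats, [tail, 100 - tail])


low68, high68 = bootstrap_ci(sample, ci=68)
low95, high95 = bootstrap_ci(sample, ci=95)

# Standard error of the mean, for comparison with the 68% interval.
sem = sample.std(ddof=1) / np.sqrt(len(sample))
print((high68 - low68) / 2, sem)
```

The half-width of the 68% interval should land close to the standard error, while the 95% interval is roughly twice as wide. Passing ``estimator=np.median`` to the helper gives the analogous interval for the median.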
Plotting on different facets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Remember, :func:`factorplot` uses a :class:`FacetGrid`, so all of the :ref:`options ` for structuring the plot into different subsets are available:

In [ ]:
sns.factorplot("class", "survived", "sex", data=titanic, col="sex", aspect=.5, palette="Set1");

In [ ]:
sns.factorplot("deck", data=titanic, row="class", x_order=list("ABCDEF"),
               margin_titles=True, aspect=3, size=2, palette="BuPu_d");
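Conceptually, a ``col`` or ``row`` variable splits the data into one subset per level, and the same categorical summary is then drawn on each subset's axes. A minimal sketch of that split with pandas, using invented records rather than the real ``titanic`` data:

```python
import pandas as pd

# Hypothetical records; the values are invented for illustration.
toy = pd.DataFrame({
    "sex":      ["male", "male", "male", "female", "female", "female"],
    "class":    ["First", "Third", "Third", "First", "Second", "Third"],
    "survived": [1, 0, 0, 1, 1, 0],
})

# Each level of the faceting variable gets its own subset (and its own axes).
facets = {level: subset for level, subset in toy.groupby("sex")}

# Within each facet, the plot summarizes y for every x category.
summaries = {level: subset.groupby("class")["survived"].mean()
             for level, subset in facets.items()}

print(summaries["male"])
print(summaries["female"])
```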
Choices in visual presentation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are a few other choices for how the plot gets drawn when you use a ``point`` plot. Sometimes, the errorbars for different hue categories will overlap:

In [ ]:
sns.factorplot("solutions", "score", "attention", attend);
When this happens, you may want to "dodge" the different hue categories a bit so the extent of the overlap is clearer:

In [ ]:
sns.factorplot("solutions", "score", "attention", attend, dodge=.05);
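To see what a dodge of this size does to the x positions, here is a sketch with NumPy. It assumes the hue levels are spread evenly over a span of width ``dodge`` and centered on the category position; the helper ``dodge_offsets`` is hypothetical and illustrates the idea rather than documenting seaborn's exact spacing:

```python
import numpy as np


def dodge_offsets(n_hue_levels, dodge):
    """Evenly spaced offsets centered on zero (illustrative helper)."""
    offsets = np.linspace(0, dodge, n_hue_levels)
    return offsets - offsets.mean()


# Two hue levels, with the same dodge amount as in the call above.
offsets = dodge_offsets(2, 0.05)

# Three categorical x positions (0, 1, 2), shifted once per hue level.
x_positions = np.arange(3)[:, None] + offsets[None, :]

print(offsets)
print(x_positions)
```

Under this assumption, each pair of points sits symmetrically around its category position, so the category labels stay centered.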
It's also not strictly necessary to join the points for each of the different ``hue`` levels:

In [ ]:
sns.factorplot("solutions", "score", "attention", attend, join=False, dodge=.2);
The ``x`` and ``hue`` values are plotted in sorted order by default, but sometimes it makes more sense to provide a specific order:

In [ ]:
sns.factorplot("time", "pulse", "kind", exercise, col="diet",
               hue_order=["rest", "walking", "running"],
               palette="YlGnBu_d", aspect=.75);
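The difference between the default and an explicit order is easy to see in plain Python. The list of levels below is invented for illustration:

```python
# Levels of a hypothetical hue variable, as they might appear in the data.
levels = ["walking", "rest", "running", "rest", "walking"]

# Default: unique values in sorted (here alphabetical) order.
default_order = sorted(set(levels))

# Explicit: an order that reflects increasing intensity.
hue_order = ["rest", "walking", "running"]

print(default_order)  # ['rest', 'running', 'walking']
print(hue_order)
```

Alphabetical sorting puts "running" before "walking", which is why an explicit ``hue_order`` reads more naturally here.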
Although by default the ``hue`` variable is only mapped to different colors (by the ``palette`` argument), you can also use different markers and linestyles for each level of the ``hue`` variable:

In [ ]:
sns.factorplot("solutions", "score", "attention", attend,
               markers=["o", "D"], linestyles=["-", "--"]);
.. _barplot:

.. _pointplot:

Plotting with :func:`pointplot` and :func:`barplot`
---------------------------------------------------

As noted above, :func:`factorplot` is a combination of a :class:`FacetGrid` and a lower-level plotting function. If you want to build up a more complicated figure with different kinds of presentation in different subplots, you can use the :func:`pointplot` and :func:`barplot` functions directly. They take all of the same arguments as :func:`factorplot`, aside from those that control the faceting.

In [ ]:
f, (ax1, ax2) = plt.subplots(1, 2)
sns.pointplot("time", "pulse", "kind", data=exercise, ax=ax1)
sns.barplot("kind", "pulse", "time", data=exercise, ax=ax2);
Like :func:`regplot`, the lower-level categorical functions also accept their data directly in the form of a Series or array.

In [ ]:
survivors = titanic.query("alive == 'yes'")
gs = plt.GridSpec(3, 1)
plt.subplot(gs[:2])
ax1 = sns.pointplot(survivors["class"], survivors["age"], label="survivors", color=".3")
ax1.set(xticks=[], xlabel="")
plt.subplot(gs[-1])
ax2 = sns.barplot(survivors["class"], color=".5")
ax1.set_xlim(ax2.get_xlim())
sns.despine(left=True, bottom=True)
