seaborn.swarmplot


Swarmplots display numeric data over a set of categories. Each observation is drawn as an individual datapoint within its category, and the datapoints are offset along the categorical axis (which has no numeric meaning) so that they do not overlap, which conveys the density of the data. Swarmplots are particularly useful in combination with other types of categorical/numeric figures, for example overlaid on a boxplot.

dataset: IMDB 5000 Movie Dataset
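The overlay idea can be sketched on a small synthetic frame before loading the real data. Everything here is an assumption made for illustration: the `toy` frame, its `group` and `value` columns, and the styling choices are not part of the movie dataset or the cells that follow.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (hypothetical; not the movie dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 30),
    "value": np.concatenate([rng.normal(loc, 1.0, 30) for loc in (0.0, 1.0, 2.0)]),
})

fig, ax = plt.subplots(figsize=(8, 4))
# Draw the summary first in a neutral color; outlier markers are hidden
# because the swarm already shows every observation.
sns.boxplot(data=toy, x="group", y="value", color="lightgray",
            showfliers=False, ax=ax)
# Draw the individual observations on top of the boxes.
sns.swarmplot(data=toy, x="group", y="value", color="black", size=4, ax=ax)
```

Passing the same `ax` to both calls is what puts the two layers on one set of axes.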


In [28]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
plt.rcParams['figure.figsize'] = (20.0, 10.0)
plt.rcParams['font.family'] = "serif"

In [29]:
df = pd.read_csv('../../../datasets/movie_metadata.csv')

In [30]:
df.head()


Out[30]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary ... NaN NaN NaN NaN NaN NaN 12.0 7.1 NaN 0

5 rows × 28 columns

For the swarm plot, let's look at movie durations within each genre category, allowing each movie to be counted in more than one category.


In [31]:
# split each movie's genre list, then form a set from the unwrapped list of all genres
categories = {s for genre_list in df.genres.unique() for s in genre_list.split("|")}

# one-hot encode each movie's classification, matching whole genre names
# (a plain substring test would also count "Musical" as "Music")
for cat in categories:
    df[cat] = df.genres.apply(lambda s: int(cat in s.split("|")))
# drop other columns
df = df[['director_name','genres','duration'] + list(categories)]
df.head()


Out[31]:
director_name genres duration Crime Comedy Thriller War History Horror Animation ... Western Mystery Short Musical News Sci-Fi Reality-TV Family Action Sport
0 James Cameron Action|Adventure|Fantasy|Sci-Fi 178.0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 1 0
1 Gore Verbinski Action|Adventure|Fantasy 169.0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
2 Sam Mendes Action|Adventure|Thriller 148.0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
3 Christopher Nolan Action|Thriller 164.0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
4 Doug Walker Documentary NaN 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 29 columns
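As an aside, pandas can build the same kind of indicator columns directly from a delimited string column with `Series.str.get_dummies`. The sketch below uses a made-up three-row frame (the `toy` rows and titles are hypothetical) and is offered as an alternative to the loop, not as what this notebook does.

```python
import pandas as pd

# Toy stand-in for the genres column (hypothetical rows, not from the dataset).
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["Action|Sci-Fi", "Musical|Comedy", "Music|Drama"],
})

# Split each string on "|" and build one 0/1 indicator column per distinct token.
dummies = toy["genres"].str.get_dummies(sep="|")
encoded = toy.join(dummies)
print(encoded)
```

Because the split is on whole tokens, the "Musical|Comedy" row gets a 1 under Musical and a 0 under Music.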


In [32]:
# convert from wide to long format and remove null classifications
df = pd.melt(df,
             id_vars=['duration'],
             value_vars = list(categories),
             var_name = 'Category',
             value_name = 'Count')
df = df.loc[df.Count>0]
top_categories = df.groupby('Category')['Count'].sum().sort_values(ascending=False).index
howmany = 10
df = df.loc[df.Category.isin(top_categories[:howmany])]
df = df.rename(columns={"duration": "Duration"})

In [ ]:
df.head()


Out[ ]:
Duration Category Count
45 140.0 Crime 1
59 91.0 Crime 1
66 152.0 Crime 1
100 106.0 Crime 1
157 90.0 Crime 1
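To make the wide-to-long step concrete, here is the same `pd.melt` pattern on a two-movie frame. The `wide` frame and its values are invented for illustration.

```python
import pandas as pd

# Two hypothetical movies with one indicator column per genre.
wide = pd.DataFrame({
    "duration": [100.0, 120.0],
    "Action": [1, 0],
    "Comedy": [1, 1],
})

# Stack the genre columns: one row per (movie, genre) pair.
long = pd.melt(wide,
               id_vars=["duration"],
               value_vars=["Action", "Comedy"],
               var_name="Category",
               value_name="Count")

# Keep only the pairs where the movie actually belongs to the genre.
long = long.loc[long["Count"] > 0]
print(long)
```

The melted frame gets a fresh 0..n-1 index and the filter keeps the surviving labels, which is why the index in the output above does not start at 0.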

Basic plot


In [ ]:
p = sns.swarmplot(data=df,
                  x = 'Category',
                  y = 'Duration')

The outliers compress the rest of the plot, so I'll remove movies that run 250 minutes or longer, since I am only interested in demonstrating the visualization tool.


In [ ]:
df = df.loc[df.Duration < 250]

In [ ]:
p = sns.swarmplot(data=df,
                  x = 'Category',
                  y = 'Duration')

Change the order of categories


In [ ]:
p = sns.swarmplot(data=df,
                  x = 'Category',
                  y = 'Duration',
                  order = sorted(df.Category.unique()))

Change the order that the colors are chosen
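One way to control which color each category receives, sketched here as an assumption rather than as the original cell, is to pass an explicit list of colors through `palette`. The `toy` frame and the choice of the "deep" palette are hypothetical. Recent seaborn releases may warn when `palette` is given without `hue`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small hypothetical frame with three categories.
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c", "c"],
    "value": [1.0, 2.0, 2.0, 3.0, 3.0, 4.0],
})

# A named palette is a list of RGB tuples; reversing the list reverses
# the order in which colors are handed out to the categories.
colors = sns.color_palette("deep", n_colors=3)
reversed_colors = list(reversed(colors))

fig, ax = plt.subplots()
sns.swarmplot(data=toy, x="group", y="value",
              order=["a", "b", "c"],
              palette=reversed_colors, ax=ax)
```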

Change orientation to horizontal


In [ ]:
p = sns.swarmplot(data=df,
                  y = 'Category',
                  x = 'Duration',
                  order = sorted(df.Category.unique()),
                  orient="h")

Desaturate


In [ ]:
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()),
                   saturation=.25)

Adjust width of violins


In [ ]:
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()),
                   width=.25)

Change the size of the markers


In [ ]:
p = sns.swarmplot(data=df,
                  x = 'Category',
                  y = 'Duration',
                  order = sorted(df.Category.unique()),
                  size=3)

Adjust the bandwidth of the kernel density estimate (KDE). Smaller values use a narrower kernel, which resolves finer features but can also pick up noise. Here are examples of low and high settings to demonstrate the difference.


In [ ]:
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()),
                   bw=.05)

In [ ]:
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()),
                   bw=5)
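The effect of the bandwidth can also be seen without any plotting, using a hand-written Gaussian KDE. This is a simplified sketch: the helper names `gaussian_kde_1d` and `count_peaks` and the five sample durations are made up, and the bandwidth here is given directly in data units, whereas a numeric `bw` in seaborn acts as a scale factor.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    # One Gaussian bump per sample, averaged over all samples.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)

def count_peaks(density):
    """Count strict interior local maxima of a sampled curve."""
    inner = density[1:-1]
    return int(np.sum((inner > density[:-2]) & (inner > density[2:])))

# Hypothetical durations in minutes.
samples = np.array([90.0, 95.0, 100.0, 120.0, 150.0])
grid = np.linspace(60.0, 180.0, 601)

narrow = gaussian_kde_1d(samples, grid, bandwidth=1.0)
wide = gaussian_kde_1d(samples, grid, bandwidth=30.0)

print(count_peaks(narrow), count_peaks(wide))
```

With the narrow kernel every sample shows up as its own peak, while the wide kernel merges all five into a single hump.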

Finalize


In [ ]:
sns.set(rc={"axes.facecolor":"#e6e6e6",
            "axes.grid":False,
            'axes.labelsize':30,
            'figure.figsize':(20.0, 10.0),
            'xtick.labelsize':25,
            'ytick.labelsize':20})


# 'nipy_spectral' is the current matplotlib name of the old 'spectral' colormap
p = sns.swarmplot(data=df,
                  x = 'Category',
                  y = 'Duration',
                  palette = 'nipy_spectral',
                  order = sorted(df.Category.unique()))
plt.xticks(rotation=45)
l = plt.xlabel('')
plt.ylabel('Duration (min)')
plt.text(4.85,200, "Swarm Plot", fontsize = 95, color="black", fontstyle='italic')

In [ ]:
p.get_figure().savefig('../../figures/swarmplot.png')