seaborn.violinplot


Violin plots summarize numeric data over a set of categories. Each violin is essentially a box plot with a kernel density estimate (KDE) drawn along the range of the data and mirrored about the center line to give a symmetric shape. They convey more than a box plot because the KDE also shows how the data is distributed, including within the interquartile range.

Dataset: IMDB 5000 Movie Dataset
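To make the "mirrored KDE plus box plot" description concrete, here is a minimal NumPy sketch of the two ingredients. It is not how seaborn computes its violins internally; the helper `gaussian_kde_1d`, the toy durations, and the bandwidth of 8 minutes are all assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    # one Gaussian bump per sample, evaluated at every grid point
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    # the density estimate is the average of the bumps
    return bumps.mean(axis=1)

# toy durations in minutes (made-up values, not taken from the dataset)
durations = np.array([88, 90, 92, 95, 101, 104, 110, 118, 125, 142], dtype=float)
grid = np.linspace(durations.min(), durations.max(), 200)
density = gaussian_kde_1d(durations, grid, bandwidth=8.0)

# the violin outline is the density mirrored about the category's center line
center = 0.0
left_edge = center - density
right_edge = center + density

# the box plot part: quartiles of the same data
q1, median, q3 = np.percentile(durations, [25, 50, 75])
```

Plotting `left_edge` and `right_edge` against `grid` would trace one violin, and the quartiles are what the inner box marks.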


In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
plt.rcParams['figure.figsize'] = (20.0, 10.0)
plt.rcParams['font.family'] = "serif"

In [2]:
df = pd.read_csv('../../../datasets/movie_metadata.csv')

In [3]:
df.head()


Out[3]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary ... NaN NaN NaN NaN NaN NaN 12.0 7.1 NaN 0

5 rows × 28 columns

For the violin plot, let's look at movie durations in each genre category, allowing each movie to be counted in more than one category.


In [4]:
# split each movie's genre list, then form a set from the unwrapped list of all genres
categories = set(s for genre_list in df.genres.unique() for s in genre_list.split("|"))

# one-hot encode each movie's classification
# (compare whole genre names; a substring test would let "Music" match "Musical")
for cat in categories:
    df[cat] = df.genres.apply(lambda s: int(cat in s.split("|")))
# drop other columns
df = df[['director_name','genres','duration'] + list(categories)]
df.head()


Out[4]:
director_name genres duration Crime Comedy Thriller War History Horror Animation ... Western Mystery Short Musical News Sci-Fi Reality-TV Family Action Sport
0 James Cameron Action|Adventure|Fantasy|Sci-Fi 178.0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 1 0
1 Gore Verbinski Action|Adventure|Fantasy 169.0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
2 Sam Mendes Action|Adventure|Thriller 148.0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
3 Christopher Nolan Action|Thriller 164.0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
4 Doug Walker Documentary NaN 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 29 columns
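As an aside, pandas can do this kind of one-hot encoding of a delimited string column in a single call with `Series.str.get_dummies`. The sketch below runs on a made-up toy frame (the rows are not from the dataset) and is offered as an alternative to the loop, not as what the notebook originally ran. It treats each genre name as a whole token, so "Music" and "Musical" stay distinct.

```python
import pandas as pd

# toy frame with made-up rows, mimicking the pipe-separated `genres` column
toy = pd.DataFrame({
    "genres": ["Action|Adventure", "Music|Drama", "Musical", "Action|Thriller"],
    "duration": [178.0, 95.0, 120.0, 164.0],
})

# one column per genre; a cell is 1 when the movie carries that genre
one_hot = toy["genres"].str.get_dummies(sep="|")

# attach the indicator columns to the original frame
toy = pd.concat([toy, one_hot], axis=1)
print(toy)
```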


In [5]:
# convert from wide to long format and remove null classifications
df = pd.melt(df,
             id_vars=['duration'],
             value_vars = list(categories),
             var_name = 'Category',
             value_name = 'Count')
df = df.loc[df.Count>0]
# rank categories by how many movies they contain
top_categories = df.groupby('Category')['Count'].sum().sort_values(ascending=False).index
howmany=10
df = df.loc[df.Category.isin(top_categories[:howmany])]
df = df.rename(columns={"duration":"Duration"})

In [ ]:
df.head()


Out[ ]:
Duration Category Count
45 140.0 Crime 1
59 91.0 Crime 1
66 152.0 Crime 1
100 106.0 Crime 1
157 90.0 Crime 1
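The wide-to-long step can be easier to follow on a tiny example. The sketch below uses a made-up three-movie frame (hypothetical values) and repeats the same `pd.melt`, filter, and ranking steps. Rows that survive the filter keep their melted index, which is why the index in the output above is not contiguous.

```python
import pandas as pd

# made-up wide frame: one indicator column per genre
wide = pd.DataFrame({
    "duration": [178.0, 95.0, 164.0],
    "Action":   [1, 0, 1],
    "Drama":    [0, 1, 0],
    "Thriller": [0, 0, 1],
})

# one row per (movie, genre) pair
long_df = pd.melt(wide,
                  id_vars=["duration"],
                  value_vars=["Action", "Drama", "Thriller"],
                  var_name="Category",
                  value_name="Count")

# keep only the pairs where the movie actually belongs to the genre
long_df = long_df.loc[long_df.Count > 0]

# number of movies per genre, largest first
counts = long_df.groupby("Category")["Count"].sum().sort_values(ascending=False)
print(long_df)
print(counts)
```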

Basic plot


In [ ]:
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration')

The outliers here are making things a bit squished, so I'll remove them since I am just interested in demonstrating the visualization tool.


In [ ]:
df = df.loc[df.Duration < 250]

In [ ]:
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration')

Change the order of categories


In [ ]:
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()))

Change the order that the colors are chosen

Change orientation to horizontal


In [ ]:
p = sns.violinplot(data=df,
                   y = 'Category',
                   x = 'Duration',
                   order = sorted(df.Category.unique()),
                   orient="h")

Desaturate


In [ ]:
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()),
                   saturation=.25)

Adjust width of violins


In [ ]:
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()),
                   width=.25)

Change the line width (violin plots draw no outlier markers, so the boxplot option fliersize does not apply)


In [ ]:
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()),
                   linewidth=4)

Adjust the bandwidth of the KDE. Smaller values use a narrower kernel, which resolves finer features but can also pick up noise. Here are examples of low and high settings to demonstrate the difference. (In seaborn 0.13 and later this parameter is named bw_method; bw is the older spelling.)


In [ ]:
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()),
                   bw=.05)

In [ ]:
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()),
                   bw=5)
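The effect of the bandwidth can also be checked numerically. The following NumPy sketch builds a Gaussian KDE by hand and counts the peaks it produces at a narrow and a wide setting. The helpers `gaussian_kde_1d` and `count_peaks`, the sample values, and the two bandwidths are assumptions for illustration. Here the bandwidth is the kernel's standard deviation in data units, which is a simplification; seaborn hands its value to its own KDE routine, whose scaling may differ.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centered on each sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))

def count_peaks(curve):
    """Count strict local maxima in the interior of a sampled curve."""
    interior = curve[1:-1]
    return int(np.sum((interior > curve[:-2]) & (interior > curve[2:])))

# made-up durations in minutes
samples = np.array([85.0, 90.0, 95.0, 100.0, 120.0, 150.0])
grid = np.linspace(60, 180, 601)

narrow = gaussian_kde_1d(samples, grid, bandwidth=1.0)   # thin kernel
wide = gaussian_kde_1d(samples, grid, bandwidth=30.0)    # broad kernel

print("peaks with narrow kernel:", count_peaks(narrow))
print("peaks with wide kernel:", count_peaks(wide))
```

With the narrow kernel every sample shows up as its own bump, while the wide kernel merges them into one broad mode. That is the same trade-off between resolution and noise seen in the two plots above.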

Finalize


In [ ]:
sns.set(rc={"axes.facecolor":"#e6e6e6",
            "axes.grid":False,
            'axes.labelsize':30,
            'figure.figsize':(20.0, 10.0),
            'xtick.labelsize':25,
            'ytick.labelsize':20})


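# matplotlib renamed the 'spectral' colormap to 'nipy_spectral';
# notch is a boxplot-only option, so it is not passed here
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   palette = 'nipy_spectral',
                   order = sorted(df.Category.unique()))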
plt.xticks(rotation=45)
l = plt.xlabel('')
plt.ylabel('Duration (min)')
plt.text(4.85,200, "Violin Plot", fontsize = 95, color="black", fontstyle='italic')

In [ ]:
p.get_figure().savefig('../../figures/swarmplot.png')