seaborn.violinplot


Violin plots summarize numeric data over a set of categories. Each violin is essentially a box plot with a kernel density estimate (KDE) drawn along the range of the data and mirrored about the central axis. They convey more than a box plot because they also show how the data is distributed between and beyond the quartiles, for example whether a category has more than one mode.

Dataset: IMDB 5000 Movie Dataset
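
To make the comparison with a box plot concrete, here is a minimal NumPy sketch. It uses a synthetic two-cluster sample (made up for illustration, not taken from the movie dataset) and a hand-written Gaussian KDE helper, `gaussian_kde_at`, which is a hypothetical name and not part of seaborn. The quartiles alone place the median between the two clusters, while the density estimate shows that almost no data lies there.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic sample (assumption): two clusters of "durations" near 90 and 150
sample = np.concatenate([rng.normal(90, 5, 500), rng.normal(150, 5, 500)])

# What a box plot reports: the three quartiles
q1, median, q3 = np.percentile(sample, [25, 50, 75])

def gaussian_kde_at(data, points, bandwidth):
    """Gaussian kernel density estimate of `data`, evaluated at `points`."""
    points = np.asarray(points, dtype=float)
    z = (points[:, None] - data[None, :]) / bandwidth
    kernel_sum = np.exp(-0.5 * z ** 2).sum(axis=1)
    return kernel_sum / (len(data) * bandwidth * np.sqrt(2 * np.pi))

# What a violin adds: the estimated density along the value axis
dens_low, dens_median, dens_high = gaussian_kde_at(sample, [90.0, median, 150.0], 5.0)

print(f"quartiles: {q1:.1f}, {median:.1f}, {q3:.1f}")
print(f"density at 90: {dens_low:.4f}, at the median: {dens_median:.6f}, at 150: {dens_high:.4f}")
```

A box drawn from these quartiles would look like one wide, unremarkable box; the violin's outline would pinch to almost nothing around the median.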


In [9]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
plt.rcParams['figure.figsize'] = (20.0, 10.0)
plt.rcParams['font.family'] = "serif"

In [10]:
df = pd.read_csv('../../datasets/movie_metadata.csv')

In [11]:
df.head()


Out[11]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary ... NaN NaN NaN NaN NaN NaN 12.0 7.1 NaN 0

5 rows × 28 columns

For the violin plot, let's look at the distribution of movie durations within each genre category, allowing each movie to be counted in more than one category.


In [12]:
# split each movie's genre list, then form a set from the unwrapped list of all genres
categories = set([s for genre_list in df.genres.unique() for s in genre_list.split("|")])

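# one-hot encode each movie's classification; compare against the split
# genre list so that a name contained in another (e.g. "Music" inside
# "Musical") is not miscounted by a substring match
for cat in categories:
    df[cat] = df.genres.apply(lambda s: int(cat in s.split("|")))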
# drop other columns
df = df[['director_name','genres','duration'] + list(categories)]
df.head()


Out[12]:
director_name genres duration Animation Film-Noir News Short Thriller Game-Show Action ... Horror Sport Documentary Western Crime Adventure Reality-TV Biography Mystery Romance
0 James Cameron Action|Adventure|Fantasy|Sci-Fi 178.0 0 0 0 0 0 0 1 ... 0 0 0 0 0 1 0 0 0 0
1 Gore Verbinski Action|Adventure|Fantasy 169.0 0 0 0 0 0 0 1 ... 0 0 0 0 0 1 0 0 0 0
2 Sam Mendes Action|Adventure|Thriller 148.0 0 0 0 0 1 0 1 ... 0 0 0 0 0 1 0 0 0 0
3 Christopher Nolan Action|Thriller 164.0 0 0 0 0 1 0 1 ... 0 0 0 0 0 0 0 0 0 0
4 Doug Walker Documentary NaN 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0

5 rows × 29 columns
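
As an aside, the one-hot step above can also be sketched with pandas' built-in `Series.str.get_dummies`, which splits on a separator and matches whole labels. The three rows below are toy values invented for illustration, not rows from the dataset.

```python
import pandas as pd

# Toy stand-in for the `genres` column (hypothetical rows)
genres = pd.Series(["Action|Adventure", "Musical", "Music|Drama"])

# Split on "|" and build one 0/1 indicator column per distinct genre
onehot = genres.str.get_dummies(sep="|")
print(onehot)
```

Because whole labels are compared, the row tagged only "Musical" gets a 0 in the "Music" column.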


In [13]:
# convert from wide to long format and remove null classifications
df = pd.melt(df,
             id_vars=['duration'],
             value_vars = list(categories),
             var_name = 'Category',
             value_name = 'Count')
df = df.loc[df.Count>0]
# rank categories by how many movies carry each label
top_categories = df.groupby('Category')['Count'].sum().sort_values(ascending=False).index
howmany=10
df = df.loc[df.Category.isin(top_categories[:howmany])]
df = df.rename(columns={"duration":"Duration"})

In [14]:
df.head()


Out[14]:
Duration Category Count
20174 148.0 Thriller 1
20175 164.0 Thriller 1
20200 131.0 Thriller 1
20201 124.0 Thriller 1
20202 143.0 Thriller 1
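
The wide-to-long step is easier to follow on a tiny frame. This is a sketch on made-up values (two movies, two genre columns), showing that a movie tagged with two genres contributes one row per genre after melting and filtering.

```python
import pandas as pd

# Toy wide table (hypothetical values): one row per movie, one 0/1 column per genre
wide = pd.DataFrame({"duration": [100.0, 120.0],
                     "Action": [1, 0],
                     "Drama": [1, 1]})

# Stack the genre columns into (Category, Count) pairs, one block per genre
long = pd.melt(wide,
               id_vars=["duration"],
               value_vars=["Action", "Drama"],
               var_name="Category",
               value_name="Count")

# Keep only the genres each movie actually belongs to
long = long.loc[long.Count > 0]
counts = long.groupby("Category")["Count"].sum()
print(long)
print(counts)
```

The first movie appears twice (once under Action, once under Drama), which is why the long table has more rows than there are movies and why its index no longer starts at 0.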

Basic plot


In [15]:
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration')


A few very long movies stretch the axis and squash the violins, so I'll keep only movies shorter than 250 minutes; the goal here is just to demonstrate the visualization tool.


In [16]:
df = df.loc[df.Duration < 250]
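
One side effect of this filter is worth knowing: a comparison with NaN evaluates to False, so rows with a missing duration are dropped along with the long movies. A small sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Toy durations (hypothetical): one missing value and one very long movie
toy = pd.DataFrame({"Duration": [95.0, np.nan, 330.0, 120.0]})

# NaN < 250 is False, so the missing row is removed as well
kept = toy.loc[toy.Duration < 250]
print(kept)
```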

In [17]:
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration')


Change the order of categories


In [18]:
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()))
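
`order` accepts any list of category labels, so alphabetical sorting is only one option. As a sketch on made-up data, the labels could instead be ranked by median duration and the resulting list passed as `order`:

```python
import pandas as pd

# Toy long-format data (hypothetical values)
toy = pd.DataFrame({
    "Category": ["Drama", "Action", "Drama", "Comedy", "Action", "Comedy"],
    "Duration": [130.0, 110.0, 150.0, 90.0, 120.0, 100.0],
})

# Rank categories from shortest to longest median duration
median_order = (toy.groupby("Category")["Duration"]
                   .median()
                   .sort_values()
                   .index
                   .tolist())
print(median_order)
```

With the real data, the same expression applied to `df` would give a list suitable for `order=`.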


Change the order that the colors are chosen

Change orientation to horizontal


In [28]:
p = sns.violinplot(data=df,
                   y = 'Category',
                   x = 'Duration',
                   order = sorted(df.Category.unique()),
                   orient="h")


Desaturate


In [29]:
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()),
                   saturation=.25)


Adjust width of violins


In [31]:
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()),
                   width=.25)


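Change what is drawn inside each violin. Violin plots do not draw outlier markers, so boxplot's fliersize has no counterpart here; the inner parameter selects the interior annotation instead (for example "box", "point", "stick", or None).


In [32]:
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()),
                   inner="point")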


Adjust the bandwidth of the KDE. Smaller values use a narrower kernel, which resolves finer features but also follows noise in the sample; larger values smooth the estimate and can hide real structure. Here are examples of low and high settings to demonstrate the difference.


In [47]:
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()),
                   bw=.05)



In [50]:
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()),
                   bw=5)
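
The effect of the bandwidth can be checked numerically without plotting. The sketch below uses a hand-written Gaussian KDE (`kde_on_grid` and `count_peaks` are hypothetical helper names) on a synthetic sample, and counts the local maxima of the estimated curve. It takes the bandwidth in data units, which is an assumption made for simplicity: a numeric `bw` in seaborn is a scale factor multiplied by the standard deviation of the data, and newer seaborn releases replace `bw` with `bw_method` and `bw_adjust`.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic single-cluster sample (assumption), loosely "durations" around 100
sample = rng.normal(100, 15, 200)
grid = np.linspace(sample.min(), sample.max(), 2000)

def kde_on_grid(data, grid, bandwidth):
    """Gaussian KDE of `data` on `grid`; bandwidth is in data units."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(data) * bandwidth * np.sqrt(2 * np.pi))

def count_peaks(curve):
    """Number of interior grid points higher than both neighbours."""
    inner = curve[1:-1]
    return int(np.sum((inner > curve[:-2]) & (inner > curve[2:])))

peaks_narrow = count_peaks(kde_on_grid(sample, grid, 0.5))
peaks_wide = count_peaks(kde_on_grid(sample, grid, 30.0))
print(f"peaks with narrow kernel: {peaks_narrow}, with wide kernel: {peaks_wide}")
```

The narrow kernel produces many small bumps from individual observations, which is the jagged outline seen in the low-bandwidth plot; the wide kernel merges them into a smooth shape.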


Finalize


In [56]:
sns.set(rc={"axes.facecolor":"#ccddff",
            "axes.grid":False,
            'axes.labelsize':30,
            'figure.figsize':(20.0, 10.0),
            'xtick.labelsize':25,
            'ytick.labelsize':20})


p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   palette = 'Paired',
                   order = sorted(df.Category.unique()))
plt.xticks(rotation=45)
l = plt.xlabel('')
plt.ylabel('Duration (min)')
plt.text(4.85,200, "Violin Plot", fontsize = 95, color="black", fontstyle='italic')


Out[56]:
<matplotlib.text.Text at 0x7f0e529be828>

In [27]:
p.get_figure().savefig('../figures/violinplot.png')