Violin plots summarize numeric data over a set of categories. A violin plot is essentially a box plot with a kernel density estimate (KDE) drawn along the range of the data and mirrored about the central axis so the shape is symmetric. It provides more information than a box plot because it also shows how the data are distributed within and beyond the interquartile range.
Dataset: IMDB 5000 Movie Dataset
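To make the idea of a mirrored density concrete, here is a minimal sketch using only NumPy. It is not how seaborn computes its violins; the sample is synthetic, and the helper name gaussian_kde and the bandwidth of 5 are assumptions chosen for illustration. The sample is built from two clusters, so the quartiles look unremarkable while the density curve has more than one peak.

```python
import numpy as np

# Synthetic sample built from two clusters (illustrative only, not the IMDB data)
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(90, 5, 500), rng.normal(150, 5, 500)])

# What a box plot reports: the quartiles
q1, median, q3 = np.percentile(sample, [25, 50, 75])

def gaussian_kde(data, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth` centred on each data point."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(data) * bandwidth * np.sqrt(2 * np.pi))

grid = np.linspace(sample.min(), sample.max(), 200)
density = gaussian_kde(sample, grid, bandwidth=5.0)

# A violin draws the same density on both sides of a centre line
left_edge, right_edge = -density, density

# Count strict interior local maxima of the density curve
peaks = int(np.sum((density[1:-1] > density[:-2]) & (density[1:-1] > density[2:])))
print(f"Q1={q1:.1f}, median={median:.1f}, Q3={q3:.1f}, KDE peaks={peaks}")
```

The three quartiles alone cannot distinguish this sample from a single broad cluster, whereas the density curve makes the separate clusters visible.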
In [9]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
plt.rcParams['figure.figsize'] = (20.0, 10.0)
plt.rcParams['font.family'] = "serif"
In [10]:
df = pd.read_csv('../../datasets/movie_metadata.csv')
In [11]:
df.head()
Out[11]:
For the violin plot, let's look at the distribution of movie durations within each genre category, allowing each movie to be counted in more than one category.
In [12]:
# split each movie's genre list, then form a set from the unwrapped list of all genres
categories = set(s for genre_list in df.genres.unique() for s in genre_list.split("|"))
# one-hot encode each movie's classification (match whole genre names, not substrings)
for cat in categories:
    df[cat] = df.genres.transform(lambda s: int(cat in s.split("|")))
# drop other columns
df = df[['director_name','genres','duration'] + list(categories)]
df.head()
Out[12]:
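As a quick check of the encoding step, the sketch below runs it on three made-up rows. The toy frame and its genre strings are hypothetical and not taken from the dataset. It shows why a substring test and a test against the split list can disagree when one genre name is contained in another, and it also shows pandas' Series.str.get_dummies, which builds the whole indicator table in one call.

```python
import pandas as pd

# Toy stand-in for the `genres` column (hypothetical rows, not from the dataset)
toy = pd.DataFrame({"genres": ["Action|Musical", "Drama|Music", "Comedy"]})

# Substring test: 'Music' is also found inside 'Musical'
substring_flags = toy.genres.apply(lambda s: int("Music" in s))

# Exact test against the split list of genres
exact_flags = toy.genres.apply(lambda s: int("Music" in s.split("|")))

# pandas can build every indicator column at once
indicators = toy.genres.str.get_dummies(sep="|")

print(substring_flags.tolist())  # [1, 1, 0]
print(exact_flags.tolist())      # [0, 1, 0]
print(indicators)
```

Whether this matters for a given dataset depends on whether any genre name is a prefix or substring of another.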
In [13]:
# convert from wide to long format and remove null classifications
df = pd.melt(df,
id_vars=['duration'],
value_vars = list(categories),
var_name = 'Category',
value_name = 'Count')
# keep only the (movie, category) pairs that actually apply
df = df.loc[df.Count>0]
# rank categories by number of movies and keep the most common ones
top_categories = df.groupby('Category')['Count'].sum().sort_values(ascending=False).index
howmany=10
df = df.loc[df.Category.isin(top_categories[:howmany])]
df = df.rename(columns={"duration":"Duration"})
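The wide-to-long step is easier to follow on a tiny table. The sketch below uses a hypothetical two-movie frame (the values are made up) and applies the same melt, filter, and ranking operations.

```python
import pandas as pd

# Hypothetical wide table: one row per movie, one indicator column per genre
wide = pd.DataFrame({
    "duration": [120, 95],
    "Action":   [1, 0],
    "Comedy":   [1, 1],
})

# One row per (movie, genre) pair
long = pd.melt(wide, id_vars=["duration"], value_vars=["Action", "Comedy"],
               var_name="Category", value_name="Count")

# Keep only the genres each movie actually belongs to
long = long.loc[long.Count > 0]

# Rank genres by how many movies they contain
ranking = long.groupby("Category")["Count"].sum().sort_values(ascending=False)

print(long)
print(ranking.index.tolist())  # ['Comedy', 'Action']
```

The 120-minute movie appears twice in the long table, once per genre, which is what allows each movie to contribute to more than one violin.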
In [14]:
df.head()
Out[14]:
Basic plot
In [15]:
p = sns.violinplot(data=df,
x = 'Category',
y = 'Duration')
A few very long movies are squishing the violins, so I'll drop every movie that runs 250 minutes or longer, since I am just interested in demonstrating the visualization tool.
In [16]:
df = df.loc[df.Duration < 250]
In [17]:
p = sns.violinplot(data=df,
x = 'Category',
y = 'Duration')
Change the order of categories
In [18]:
p = sns.violinplot(data=df,
x = 'Category',
y = 'Duration',
order = sorted(df.Category.unique()))
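Alphabetical order is only one choice. As a sketch of an alternative, the categories could be ordered by their median duration; the toy frame below is hypothetical and only mimics the shape of df.

```python
import pandas as pd

# Hypothetical long-format data in the same shape as `df`
toy = pd.DataFrame({
    "Category": ["Drama", "Drama", "Action", "Action", "Comedy", "Comedy"],
    "Duration": [130, 150, 110, 120, 90, 100],
})

# Categories sorted from shortest to longest median duration
median_order = toy.groupby("Category")["Duration"].median().sort_values().index.tolist()
print(median_order)  # ['Comedy', 'Action', 'Drama']
```

The resulting list could then be passed as order=median_order in place of the sorted category names.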
Change the order that the colors are chosen
Change orientation to horizontal
In [28]:
p = sns.violinplot(data=df,
y = 'Category',
x = 'Duration',
order = sorted(df.Category.unique()),
orient="h")
Desaturate
In [29]:
p = sns.violinplot(data=df,
x = 'Category',
y = 'Duration',
order = sorted(df.Category.unique()),
saturation=.25)
Adjust width of violins
In [31]:
p = sns.violinplot(data=df,
x = 'Category',
y = 'Duration',
order = sorted(df.Category.unique()),
width=.25)
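Change what is drawn inside each violin. Violin plots do not draw outlier markers, so fliersize (a boxplot parameter) does not apply here; the inner parameter controls the interior annotation instead.
In [32]:
p = sns.violinplot(data=df,
x = 'Category',
y = 'Duration',
order = sorted(df.Category.unique()),
inner="quartile")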
Adjust the bandwidth of the KDE. Smaller values use a narrower kernel, which resolves finer features of the distribution but can also pick up noise. Here are examples of low and high settings to demonstrate the difference.
In [47]:
p = sns.violinplot(data=df,
x = 'Category',
y = 'Duration',
order = sorted(df.Category.unique()),
bw=.05)
In [50]:
p = sns.violinplot(data=df,
x = 'Category',
y = 'Duration',
order = sorted(df.Category.unique()),
bw=5)
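The effect of the bandwidth can also be checked numerically. The sketch below is a hand-rolled Gaussian KDE in NumPy on synthetic data, not seaborn's implementation; the helper names kde and count_peaks are assumptions, and the bandwidth here is expressed directly in the data's units, which is not necessarily how seaborn scales bw.

```python
import numpy as np

def kde(data, grid, bandwidth):
    # Gaussian kernel density estimate evaluated on `grid`
    z = (grid[:, None] - data[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(data) * bandwidth * np.sqrt(2 * np.pi))

def count_peaks(curve):
    # Number of strict interior local maxima
    return int(np.sum((curve[1:-1] > curve[:-2]) & (curve[1:-1] > curve[2:])))

rng = np.random.default_rng(1)
durations = rng.normal(105, 15, 300)   # synthetic stand-in for movie durations
grid = np.linspace(40, 170, 400)

narrow = kde(durations, grid, bandwidth=0.5)   # follows individual points closely
wide = kde(durations, grid, bandwidth=20.0)    # smooths the sample into one bump

print(count_peaks(narrow), count_peaks(wide))
```

The narrow kernel produces many small peaks even though the sample was drawn from a single bell curve, which is the noise the low setting picks up; the wide kernel returns a single smooth peak.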
Finalize
In [56]:
sns.set(rc={"axes.facecolor":"#ccddff",
"axes.grid":False,
'axes.labelsize':30,
'figure.figsize':(20.0, 10.0),
'xtick.labelsize':25,
'ytick.labelsize':20})
p = sns.violinplot(data=df,
x = 'Category',
y = 'Duration',
palette = 'Paired',
order = sorted(df.Category.unique()))
plt.xticks(rotation=45)
l = plt.xlabel('')
plt.ylabel('Duration (min)')
plt.text(4.85,200, "Violin Plot", fontsize = 95, color="black", fontstyle='italic')
Out[56]:
In [27]:
p.get_figure().savefig('../figures/violinplot.png')