Swarmplots display numeric data over a set of categories. Each observation is drawn as an individual point within its category, and the points may be shifted slightly along the categorical axis (which has no numeric meaning) so that they do not overlap and the density is easier to see. Swarmplots are particularly useful in combination with other types of categorical/numeric figures, such as overlaying a swarmplot onto a boxplot.
Dataset: IMDB 5000 Movie Dataset
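To make the idea of shifting points along the categorical axis concrete, here is a minimal sketch of a naive placement rule. The helper `naive_swarm` and its `diameter` parameter are hypothetical names used only for illustration; this is a simplification and not the algorithm seaborn actually uses.

```python
import math

def naive_swarm(values, diameter=1.0):
    """Return (offset, value) pairs so that no two markers overlap.

    Hypothetical helper for illustration only; seaborn's own placement
    logic is more refined than this.
    """
    placed = []  # (offset, value) pairs that already have a position
    for v in sorted(values):
        step = 0
        while True:
            # try offsets 0, +d, -d, +2d, -2d, ... along the categorical axis
            candidates = [0.0] if step == 0 else [step * diameter, -step * diameter]
            chosen = None
            for x in candidates:
                if all(math.hypot(x - px, v - pv) >= diameter for px, pv in placed):
                    chosen = x
                    break
            if chosen is not None:
                placed.append((chosen, v))
                break
            step += 1
    return placed

# made-up values: three close together, two close together, one isolated
points = naive_swarm([1.0, 1.1, 1.2, 3.0, 3.05, 5.0])
for offset, value in points:
    print(f"value={value:4.2f}  offset={offset:+.1f}")
```

The numeric value of each point is never changed; only its position on the axis without numeric meaning moves, which is why crowded regions appear wider.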
In [28]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
plt.rcParams['figure.figsize'] = (20.0, 10.0)
plt.rcParams['font.family'] = "serif"
In [29]:
df = pd.read_csv('../../../datasets/movie_metadata.csv')
In [30]:
df.head()
Out[30]:
For the swarm plot, let's look at the durations of the movies in each category, allowing each movie to be counted in more than one category.
In [31]:
# split each movie's genre list, then form a set from the unwrapped list of all genres
categories = set([s for genre_list in df.genres.unique() for s in genre_list.split("|")])
# one-hot encode each movie's classification; split before testing membership
# so that e.g. "Music" does not also match "Musical"
for cat in categories:
    df[cat] = df.genres.transform(lambda s: int(cat in s.split("|")))
# drop other columns
df = df[['director_name','genres','duration'] + list(categories)]
df.head()
Out[31]:
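As a small illustration of the one-hot step, the sketch below encodes three made-up genre strings in plain Python. The helper `one_hot` and the sample strings are assumptions for demonstration, not part of the dataset code. It also shows why matching on the split list can be safer than a substring test.

```python
# made-up genre strings in the same "|"-separated format as the dataset
genres = ["Comedy|Musical", "Drama|Music", "Action"]

# collect every distinct genre label
categories = sorted({g for s in genres for g in s.split("|")})

def one_hot(genre_string, categories):
    """Hypothetical helper: map each category to 1 if the movie has it, else 0."""
    tags = set(genre_string.split("|"))
    return {cat: int(cat in tags) for cat in categories}

rows = [one_hot(s, categories) for s in genres]
for s, row in zip(genres, rows):
    print(s, row)

# a substring test would wrongly flag "Music" for a movie tagged only "Musical"
print("Music" in "Comedy|Musical")
```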
In [32]:
# convert from wide to long format and remove null classifications
df = pd.melt(df,
             id_vars=['duration'],
             value_vars=list(categories),
             var_name='Category',
             value_name='Count')
df = df.loc[df.Count > 0]
# rank the categories by how many movies they contain and keep the largest ones
top_categories = df.groupby('Category')['Count'].sum().sort_values(ascending=False).index
howmany = 10
df = df.loc[df.Category.isin(top_categories[:howmany])]
df = df.rename(columns={"duration": "Duration"})
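Here is a toy version of the wide-to-long step, using a made-up three-movie frame. The values and the names `wide`, `long_df` and `counts` are assumptions for illustration only.

```python
import pandas as pd

# made-up data: one row per movie, one 0/1 column per genre
wide = pd.DataFrame({
    "duration": [100, 120, 95],
    "Comedy": [1, 0, 1],
    "Drama": [0, 1, 1],
})

# melt produces one row per (movie, genre) pair
long_df = pd.melt(wide,
                  id_vars=["duration"],
                  value_vars=["Comedy", "Drama"],
                  var_name="Category",
                  value_name="Count")

# keep only the pairs where the movie actually belongs to the genre
long_df = long_df.loc[long_df.Count > 0]

# number of movies per category, largest first
counts = long_df.groupby("Category")["Count"].sum().sort_values(ascending=False)

print(long_df)
print(counts)
```

The third movie belongs to both genres, so it appears twice in `long_df`. This is how a movie ends up counted in more than one category.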
In [ ]:
df.head()
Out[ ]:
Basic plot
In [ ]:
p = sns.swarmplot(data=df,
                  x='Category',
                  y='Duration')
The outliers here are making things a bit squished, so I'll remove them since I am just interested in demonstrating the visualization tool.
In [ ]:
df = df.loc[df.Duration < 250]
In [ ]:
p = sns.violinplot(data=df,
                   x='Category',
                   y='Duration')
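The introduction mentions overlaying a swarmplot onto a boxplot. A minimal sketch of that combination follows, drawn on a small made-up frame (`toy` and its values are assumptions, not the movie data) and using the non-interactive Agg backend so that it runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# made-up durations for two categories
toy = pd.DataFrame({
    "Category": ["Comedy"] * 6 + ["Drama"] * 6,
    "Duration": [85, 90, 92, 95, 101, 110, 98, 105, 112, 120, 126, 140],
})

fig, ax = plt.subplots()
# the boxplot summarises each category; its own outlier markers are hidden
# because the swarm already shows every observation
sns.boxplot(data=toy, x="Category", y="Duration", color="white",
            showfliers=False, ax=ax)
# individual observations drawn on top of the boxes
sns.swarmplot(data=toy, x="Category", y="Duration", color="black", size=6, ax=ax)
```

Drawing both on the same `ax` is what produces the overlay; the boxplot gives the summary statistics and the swarm shows where the individual points fall.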
Change the order of categories
In [ ]:
p = sns.violinplot(data=df,
                   x='Category',
                   y='Duration',
                   order=sorted(df.Category.unique()))
Change the order that the colors are chosen
Change orientation to horizontal
In [ ]:
p = sns.violinplot(data=df,
                   y='Category',
                   x='Duration',
                   order=sorted(df.Category.unique()),
                   orient="h")
Desaturate
In [ ]:
p = sns.violinplot(data=df,
                   x='Category',
                   y='Duration',
                   order=sorted(df.Category.unique()),
                   saturation=.25)
Adjust width of violins
In [ ]:
p = sns.violinplot(data=df,
                   x='Category',
                   y='Duration',
                   order=sorted(df.Category.unique()),
                   width=.25)
Change the size of the markers
In [ ]:
# fliersize is a boxplot option; for a swarmplot the marker size is set with size
p = sns.swarmplot(data=df,
                  x='Category',
                  y='Duration',
                  order=sorted(df.Category.unique()),
                  size=2)
Adjust the bandwidth of the kernel density estimate (KDE). Smaller values use a narrower kernel, which resolves finer features but may also pick up noise; larger values smooth those features away. Here are examples of low and high settings to demonstrate the difference.
In [ ]:
p = sns.violinplot(data=df,
                   x='Category',
                   y='Duration',
                   order=sorted(df.Category.unique()),
                   bw=.05)
In [ ]:
p = sns.violinplot(data=df,
                   x='Category',
                   y='Duration',
                   order=sorted(df.Category.unique()),
                   bw=5)
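To see why the bandwidth matters, here is a minimal Gaussian KDE written in plain Python. The helper `gaussian_kde` and the sample durations are assumptions for illustration. The `bandwidth` here is an absolute kernel width in minutes, whereas seaborn's `bw`, as I understand it, is a scale factor applied to the spread of the data, so the numbers are not directly comparable.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Hypothetical helper: return a density function built from Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # sum one Gaussian bump per sample, each with the same width
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

# made-up durations in minutes, forming two clusters around 90 and 120
durations = [88, 90, 91, 118, 120, 121]

narrow = gaussian_kde(durations, bandwidth=2)
wide = gaussian_kde(durations, bandwidth=20)

# compare the density at a cluster centre (90) with the gap between clusters (105)
for name, kde in [("narrow", narrow), ("wide", wide)]:
    print(name, kde(90), kde(105))
```

With the narrow kernel the density nearly vanishes in the gap, so both clusters stay visible. With the wide kernel the two clusters merge into a single bump that peaks in between.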
Finalize
In [ ]:
sns.set(rc={"axes.facecolor": "#e6e6e6",
            "axes.grid": False,
            'axes.labelsize': 30,
            'figure.figsize': (20.0, 10.0),
            'xtick.labelsize': 25,
            'ytick.labelsize': 20})
# 'nipy_spectral' is the current matplotlib name for the colormap formerly called 'spectral'
p = sns.violinplot(data=df,
                   x='Category',
                   y='Duration',
                   palette='nipy_spectral',
                   order=sorted(df.Category.unique()))
plt.xticks(rotation=45)
l = plt.xlabel('')
plt.ylabel('Duration (min)')
plt.text(4.85, 200, "Violin Plot", fontsize=95, color="black", fontstyle='italic')
In [ ]:
p.get_figure().savefig('../../figures/swarmplot.png')