Boxplots summarize the distribution of numeric data across a set of categories. Within each category, the quartiles split the data into four equally sized groups. A box is drawn from the first quartile to the third quartile, and a line is drawn across the box at the median (which always falls within the box). Usually, a second set of lines, the "whiskers", is drawn some distance from the box to denote a "minimum" and "maximum" value for the data; values outside these extremes are considered outliers and plotted as individual points. The location of the whiskers varies, but it is generally some multiple of the interquartile range (IQR), which is the range of values covered by the box.
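As a rough illustration of the quantities described above, the following sketch computes the quartiles, the IQR, the whisker positions, and the outliers for a small list of made-up durations. The helper name `boxplot_stats`, the sample values, and the default multiple of 1.5 are assumptions chosen for illustration; plotting libraries may interpolate the quartiles slightly differently.

```python
from statistics import quantiles


def boxplot_stats(values, whis=1.5):
    """Return the numbers a box plot draws for one category (illustrative helper)."""
    data = sorted(values)
    # "inclusive" interpolates linearly between the observed values
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    # whiskers stop at the most extreme observations still inside the fences
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


durations = [85, 90, 95, 100, 105, 110, 115, 120, 240]  # made-up minutes
stats = boxplot_stats(durations)
print(stats)
```

Calling `boxplot_stats(durations, whis=0.2)` pulls the fences much closer to the box, so more of the same values are reported as outliers.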
dataset: IMDB 5000 Movie Dataset
In [302]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
plt.rcParams['figure.figsize'] = (20.0, 10.0)
plt.rcParams['font.family'] = "serif"
In [303]:
df = pd.read_csv('../../datasets/movie_metadata.csv')
In [304]:
df.head()
Out[304]:
For the box plot, let's look at the duration of the movies in each category, allowing each movie to be counted in more than one category.
In [305]:
# split each movie's genre list, then form a set from the unwrapped list of all genres
categories = {s for genre_list in df.genres.unique() for s in genre_list.split("|")}
# one-hot encode each movie's classification; test against the split list so that
# one genre name cannot match as a substring of another (e.g. "Music" in "Musical")
for cat in categories:
    df[cat] = df.genres.apply(lambda s, cat=cat: int(cat in s.split("|")))
# drop other columns
df = df[['director_name','genres','duration'] + list(categories)]
df.head()
Out[305]:
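One thing to watch when one-hot encoding from a delimited string is that a plain substring test can match more than intended. The sketch below uses a hypothetical genre string and a hypothetical helper `one_hot` to contrast the two tests; it is only an illustration, not part of the notebook's pipeline.

```python
genres = "Drama|Musical|Romance"  # hypothetical genre string

# substring test: "Music" is found inside "Musical"
substring_match = "Music" in genres
# membership test on the split labels: only whole genre names match
exact_match = "Music" in genres.split("|")


def one_hot(genre_string, categories):
    """Map each category to 1 if it is one of the pipe-separated labels, else 0."""
    labels = set(genre_string.split("|"))
    return {cat: int(cat in labels) for cat in categories}


encoded = one_hot(genres, ["Drama", "Music", "Musical", "Romance"])
print(substring_match, exact_match, encoded)
```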
In [306]:
# convert from wide to long format and remove null classifications
df = pd.melt(df,
             id_vars=['duration'],
             value_vars = list(categories),
             var_name = 'Category',
             value_name = 'Count')
df = df.loc[df.Count>0]
# rank the categories by number of movies and keep only the most common ones
top_categories = df.groupby('Category')['Count'].sum().sort_values(ascending=False).index
howmany=10
df = df.loc[df.Category.isin(top_categories[:howmany])]
df = df.rename(columns={"duration":"Duration"})
In [307]:
df.head()
Out[307]:
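To make the reshaping step more concrete, here is a plain-Python sketch of the same idea: turn one row per movie (wide) into one row per movie-genre pair (long), drop the pairs whose flag is 0, and keep only the most common genres. The rows, the genre names, and the cutoff of 2 are made up for illustration.

```python
from collections import Counter

# hypothetical wide-format rows: one 0/1 column per genre
wide_rows = [
    {"duration": 120, "Action": 1, "Comedy": 0, "Drama": 1},
    {"duration": 95, "Action": 0, "Comedy": 1, "Drama": 0},
    {"duration": 140, "Action": 0, "Comedy": 0, "Drama": 1},
    {"duration": 110, "Action": 1, "Comedy": 0, "Drama": 0},
]
genre_columns = ["Action", "Comedy", "Drama"]

# wide -> long: one output row per (movie, genre) pair where the flag is set
long_rows = [
    {"Duration": row["duration"], "Category": genre}
    for row in wide_rows
    for genre in genre_columns
    if row[genre] > 0
]

# count movies per genre and keep only the two most common genres
counts = Counter(r["Category"] for r in long_rows)
top = [genre for genre, _ in counts.most_common(2)]
kept = [r for r in long_rows if r["Category"] in top]
print(counts, kept)
```

A movie tagged with several genres appears once per genre in the long table, which is why the per-category counts can add up to more than the number of movies.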
Basic plot
In [308]:
p = sns.boxplot(data=df,
                x = 'Category',
                y = 'Duration')
The outliers here are making things a bit squished, so I'll remove them since I am just interested in demonstrating the visualization tool.
In [309]:
df = df.loc[df.Duration < 250]
In [310]:
p = sns.boxplot(data=df,
                x = 'Category',
                y = 'Duration')
Change the order of categories
In [311]:
p = sns.boxplot(data=df,
                x = 'Category',
                y = 'Duration',
                order = sorted(df.Category.unique()))
Change the order that the colors are chosen
In [ ]:
Change orientation to horizontal
In [312]:
p = sns.boxplot(data=df,
                y = 'Category',
                x = 'Duration',
                order = sorted(df.Category.unique()),
                orient="h")
Desaturate
In [313]:
p = sns.boxplot(data=df,
                x = 'Category',
                y = 'Duration',
                order = sorted(df.Category.unique()),
                saturation=.25)
Adjust width of boxes
In [314]:
p = sns.boxplot(data=df,
                x = 'Category',
                y = 'Duration',
                order = sorted(df.Category.unique()),
                width=.25)
Change the size of outlier markers
In [315]:
p = sns.boxplot(data=df,
                x = 'Category',
                y = 'Duration',
                order = sorted(df.Category.unique()),
                fliersize=20)
Adjust the position of the whiskers as a fraction of IQR
In [316]:
p = sns.boxplot(data=df,
                x = 'Category',
                y = 'Duration',
                order = sorted(df.Category.unique()),
                whis=.2)
Add a notch to the box indicating a confidence interval for the median
In [317]:
p = sns.boxplot(data=df,
                x = 'Category',
                y = 'Duration',
                order = sorted(df.Category.unique()),
                notch=True)
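For a sense of how wide a notch is, a common approximation places it at median ± 1.57 × IQR / √n; to my understanding this is what Matplotlib computes by default when no bootstrapping is requested. The sketch below applies that formula to made-up durations; the helper name `notch_interval` and the data are assumptions for illustration.

```python
from math import sqrt
from statistics import quantiles


def notch_interval(values):
    """Approximate confidence interval for the median (illustrative helper)."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    # the half-width shrinks as the sample grows and widens with the IQR
    half_width = 1.57 * (q3 - q1) / sqrt(len(data))
    return median - half_width, median + half_width


durations = [85, 90, 95, 100, 105, 110, 115, 120, 125]  # made-up minutes
low, high = notch_interval(durations)
print(low, high)
```

When the notches of two boxes do not overlap, that is usually read as evidence that the two medians differ.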
Change the line width
In [318]:
p = sns.boxplot(data=df,
                x = 'Category',
                y = 'Duration',
                order = sorted(df.Category.unique()),
                notch=False,
                linewidth=2.5)
Finalize
In [319]:
sns.set(rc={"axes.facecolor":"#ccddff",
            "axes.grid":False,
            'axes.labelsize':30,
            'figure.figsize':(20.0, 10.0),
            'xtick.labelsize':25,
            'ytick.labelsize':20})
p = sns.boxplot(data=df,
                x = 'Category',
                y = 'Duration',
                palette = 'Paired',
                order = sorted(df.Category.unique()),
                notch=True)
plt.xticks(rotation=45)
l = plt.xlabel('')
plt.ylabel('Duration (min)')
plt.text(5.4,200, "Box Plot", fontsize = 95, color="black", fontstyle='italic')
Out[319]:
In [320]:
p.get_figure().savefig('../figures/boxplot.png')