seaborn.swarmplot

A swarm plot shows a numeric variable across a set of categories, drawing each observation as an individual point. Points are shifted along the categorical axis (which has no numeric meaning) so that they do not overlap, which conveys the density of the data. Swarm plots are particularly useful in combination with other categorical plots, for example overlaid on a box plot.

Dataset: IMDB 5000 Movie Dataset
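As a quick illustration of the overlay idea, here is a minimal sketch that draws a swarm plot on top of a box plot. It uses small synthetic data rather than the movie dataset, and the names `toy`, `group` and `value` are placeholders I made up for the example.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (not the IMDB dataset): 3 groups, 40 values each
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 40),
    "value": np.concatenate([rng.normal(loc, 10, 40) for loc in (90, 110, 130)]),
})

fig, ax = plt.subplots(figsize=(8, 5))
# Box plot first, with outlier markers hidden so no point is drawn twice
sns.boxplot(data=toy, x="group", y="value", color="white", showfliers=False, ax=ax)
# Swarm plot on the same axes: one marker per observation
sns.swarmplot(data=toy, x="group", y="value", color="black", size=4, ax=ax)
```

Hiding the box plot fliers is a design choice: the swarm already shows every observation, including the extreme ones.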



In [28]:

    
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
plt.rcParams['figure.figsize'] = (20.0, 10.0)
plt.rcParams['font.family'] = "serif"



In [29]:

    
df = pd.read_csv('../../../datasets/movie_metadata.csv')



In [30]:

    
df.head()









    Out[30]:







  
    
      
	color	director_name	num_critic_for_reviews	duration	director_facebook_likes	actor_3_facebook_likes	actor_2_name	actor_1_facebook_likes	gross	genres	...	num_user_for_reviews	language	country	content_rating	budget	title_year	actor_2_facebook_likes	imdb_score	aspect_ratio	movie_facebook_likes
0	Color	James Cameron	723.0	178.0	0.0	855.0	Joel David Moore	1000.0	760505847.0	Action|Adventure|Fantasy|Sci-Fi	...	3054.0	English	USA	PG-13	237000000.0	2009.0	936.0	7.9	1.78	33000
1	Color	Gore Verbinski	302.0	169.0	563.0	1000.0	Orlando Bloom	40000.0	309404152.0	Action|Adventure|Fantasy	...	1238.0	English	USA	PG-13	300000000.0	2007.0	5000.0	7.1	2.35	0
2	Color	Sam Mendes	602.0	148.0	0.0	161.0	Rory Kinnear	11000.0	200074175.0	Action|Adventure|Thriller	...	994.0	English	UK	PG-13	245000000.0	2015.0	393.0	6.8	2.35	85000
3	Color	Christopher Nolan	813.0	164.0	22000.0	23000.0	Christian Bale	27000.0	448130642.0	Action|Thriller	...	2701.0	English	USA	PG-13	250000000.0	2012.0	23000.0	8.5	2.35	164000
4	NaN	Doug Walker	NaN	NaN	131.0	NaN	Rob Walker	131.0	NaN	Documentary	...	NaN	NaN	NaN	NaN	NaN	NaN	12.0	7.1	NaN	0
    
  

5 rows × 28 columns

For the swarm plot, let's look at movie durations within each genre category, allowing each movie to be counted in more than one category.



In [31]:

    
# split each movie's genre list, then form a set from the unwrapped list of all genres
categories = set([s for genre_list in df.genres.unique() for s in genre_list.split("|")])

# one-hot encode each movie's classification
for cat in categories:
    df[cat] = df.genres.transform(lambda s: int(cat in s))
# drop other columns
df = df[['director_name','genres','duration'] + list(categories)]
df.head()









    Out[31]:







  
    
      
	director_name	genres	duration	Crime	Comedy	Thriller	War	History	Horror	Animation	...	Western	Mystery	Short	Musical	News	Sci-Fi	Reality-TV	Family	Action	Sport
0	James Cameron	Action|Adventure|Fantasy|Sci-Fi	178.0	0	0	0	0	0	0	0	...	0	0	0	0	0	1	0	0	1	0
1	Gore Verbinski	Action|Adventure|Fantasy	169.0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	1	0
2	Sam Mendes	Action|Adventure|Thriller	148.0	0	0	1	0	0	0	0	...	0	0	0	0	0	0	0	0	1	0
3	Christopher Nolan	Action|Thriller	164.0	0	0	1	0	0	0	0	...	0	0	0	0	0	0	0	0	1	0
4	Doug Walker	Documentary	NaN	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
    
  

5 rows × 29 columns
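The one-hot step above tests `cat in s`, which is a substring check. That works as long as no genre name is contained inside another one. As an alternative sketch, pandas can split on the separator and build the indicator columns directly with `Series.str.get_dummies`. The three toy rows below are illustrative values, not rows from the dataset.

```python
import pandas as pd

# Toy genre strings in the same "A|B|C" format (illustrative values only)
toy = pd.DataFrame({"genres": ["Action|Sci-Fi", "Action|Thriller", "Documentary"]})

# One indicator column per genre; splitting on "|" matches whole labels only
dummies = toy["genres"].str.get_dummies(sep="|")

# Column sums count movies per genre; a multi-genre movie counts once per genre
genre_counts = dummies.sum().sort_values(ascending=False)
print(genre_counts)
```

The resulting `dummies` frame could be joined back onto the original frame with `pd.concat([toy, dummies], axis=1)` if the other columns are still needed.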



In [32]:

    
# convert from wide to long format and remove null classifications
df = pd.melt(df,
             id_vars=['duration'],
             value_vars = list(categories),
             var_name = 'Category',
             value_name = 'Count')
df = df.loc[df.Count>0]
top_categories = df.groupby('Category')['Count'].sum().sort_values(ascending=False).index
howmany=10
df = df.loc[df.Category.isin(top_categories[:howmany])]
df.rename(columns={"duration":"Duration"},inplace=True)
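To make the wide-to-long step easier to follow, here is the same sequence on a tiny hand-made frame. The three rows and the names `wide`, `long_df` and `counts` are made up for illustration.

```python
import pandas as pd

# Tiny wide-format frame: one indicator column per genre (made-up rows)
wide = pd.DataFrame({
    "duration": [178.0, 164.0, 95.0],
    "Action":   [1, 1, 0],
    "Comedy":   [0, 0, 1],
    "Thriller": [0, 1, 0],
})

# Melt: each (movie, genre) pair becomes its own row
long_df = pd.melt(wide,
                  id_vars=["duration"],
                  value_vars=["Action", "Comedy", "Thriller"],
                  var_name="Category",
                  value_name="Count")

# Keep only the pairs where the movie actually belongs to the genre
long_df = long_df.loc[long_df["Count"] > 0]

# Genres ranked by how many movies they contain
counts = long_df.groupby("Category")["Count"].sum().sort_values(ascending=False)
print(long_df)
print(counts)
```

The movie with duration 164.0 appears twice in `long_df`, once under Action and once under Thriller, which is what "counted more than once" means here.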



In [ ]:

    
df.head()

Basic plot



In [ ]:

    
p = sns.swarmplot(data=df,
                  x = 'Category',
                  y = 'Duration')

The outliers here are making things a bit squished, so I'll remove them since I am just interested in demonstrating the visualization tool.



In [ ]:

    
df = df.loc[df.Duration < 250]



In [ ]:

    
p = sns.swarmplot(data=df,
                  x = 'Category',
                  y = 'Duration')

Change the order of categories



In [ ]:

    
p = sns.swarmplot(data=df,
                  x = 'Category',
                  y = 'Duration',
                  order = sorted(df.Category.unique()))
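The `order` argument accepts any list of category labels, so alphabetical sorting is only one choice. As a sketch on made-up data (the values are illustrative, not from the IMDB file), the categories can instead be ordered by median duration:

```python
import pandas as pd
import seaborn as sns

# Made-up durations for three categories
toy = pd.DataFrame({
    "Category": ["Drama", "Drama", "Comedy", "Comedy", "Action", "Action"],
    "Duration": [120.0, 130.0, 90.0, 100.0, 110.0, 115.0],
})

# Category labels sorted from shortest to longest median duration
median_order = (toy.groupby("Category")["Duration"]
                   .median()
                   .sort_values()
                   .index
                   .tolist())
print(median_order)  # ['Comedy', 'Action', 'Drama']

# Pass the computed list to the plotting call
ax = sns.swarmplot(data=toy, x="Category", y="Duration", order=median_order)
```

Ordering by a summary statistic tends to make trends across categories easier to see than alphabetical order.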

Change the order that the colors are chosen

Change orientation to horizontal



In [ ]:

    
p = sns.swarmplot(data=df,
                  y = 'Category',
                  x = 'Duration',
                  order = sorted(df.Category.unique()),
                  orient="h")

Desaturate



In [ ]:

    
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()),
                   saturation=.25)

Adjust width of violins



In [ ]:

    
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()),
                   width=.25)

Change the size of the markers



In [ ]:

    
p = sns.swarmplot(data=df,
                  x = 'Category',
                  y = 'Duration',
                  order = sorted(df.Category.unique()),
                  size=3)

Adjust the bandwidth of the kernel density estimate (KDE) that shapes each violin. Smaller values use a narrower kernel, which resolves finer features of the distribution but may also pick up noise. The two cells below use a low and a high setting to show the difference.



In [ ]:

    
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()),
                   bw=.05)



In [ ]:

    
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()),
                   bw=5)
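To see why a narrow kernel shows more detail (and more noise), here is a small hand-rolled Gaussian KDE in NumPy. This is only a sketch of the idea, not seaborn's implementation: the helper names `gaussian_kde_curve` and `count_peaks` are my own, the data are synthetic, and `h` here is an absolute kernel width in data units, whereas seaborn's numeric `bw` is, as far as I understand it, a scale factor applied to the spread of the data.

```python
import numpy as np

def gaussian_kde_curve(samples, grid, h):
    """Average of Gaussian bumps with standard deviation h, one per sample."""
    z = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

def count_peaks(curve):
    """Number of strict interior local maxima of a sampled curve."""
    inner = curve[1:-1]
    return int(np.sum((inner > curve[:-2]) & (inner > curve[2:])))

# Synthetic bimodal "durations": two clusters around 90 and 150 minutes
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(90, 10, 100), rng.normal(150, 10, 100)])
grid = np.linspace(0, 300, 3001)

narrow = gaussian_kde_curve(samples, grid, h=0.5)   # very thin kernel
wide = gaussian_kde_curve(samples, grid, h=50.0)    # very broad kernel

# The thin kernel yields many spurious bumps; the broad one merges everything
print(count_peaks(narrow), count_peaks(wide))
```

With the broad kernel even the two real clusters blur into a single hump, which is the over-smoothing end of the trade-off.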

Finalize



In [ ]:

    
sns.set(rc={"axes.facecolor":"#e6e6e6",
            "axes.grid":False,
            'axes.labelsize':30,
            'figure.figsize':(20.0, 10.0),
            'xtick.labelsize':25,
            'ytick.labelsize':20})


# nipy_spectral is the current matplotlib name for the old 'spectral' colormap
p = sns.swarmplot(data=df,
                  x = 'Category',
                  y = 'Duration',
                  palette = 'nipy_spectral',
                  order = sorted(df.Category.unique()))
plt.xticks(rotation=45)
l = plt.xlabel('')
plt.ylabel('Duration (min)')
plt.text(4.85,200, "Swarm Plot", fontsize = 95, color="black", fontstyle='italic')



In [ ]:

    
p.get_figure().savefig('../../figures/swarmplot.png')


	Duration	Category	Count
45	140.0	Crime	1
59	91.0	Crime	1
66	152.0	Crime	1
100	106.0	Crime	1
157	90.0	Crime	1