Working with Data



In [1]:

    
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Working through a data munging example

Objective

Read the file movies.csv into a data frame. Plot the popularity of differnet movie genres over the years.

Step 0: Load data set



In [2]:

    
df = pd.read_csv('movies.csv')
df.head()









    Out[2]:






  
    
      
      movieId
      title
      genres
    
  
  
    
      0
      1
      Toy Story (1995)
      Adventure|Animation|Children|Comedy|Fantasy
    
    
      1
      2
      Jumanji (1995)
      Adventure|Children|Fantasy
    
    
      2
      3
      Grumpier Old Men (1995)
      Comedy|Romance
    
    
      3
      4
      Waiting to Exhale (1995)
      Comedy|Drama|Romance
    
    
      4
      5
      Father of the Bride Part II (1995)
      Comedy

Step 1: Extract Year into its own columne



In [3]:

    
df['year'] = df['title'].str.extract('\((\d{4})\)', expand=True)



In [4]:

    
df.head()









    Out[4]:






  
    
      
      movieId
      title
      genres
      year
    
  
  
    
      0
      1
      Toy Story (1995)
      Adventure|Animation|Children|Comedy|Fantasy
      1995
    
    
      1
      2
      Jumanji (1995)
      Adventure|Children|Fantasy
      1995
    
    
      2
      3
      Grumpier Old Men (1995)
      Comedy|Romance
      1995
    
    
      3
      4
      Waiting to Exhale (1995)
      Comedy|Drama|Romance
      1995
    
    
      4
      5
      Father of the Bride Part II (1995)
      Comedy
      1995

Step 2: Extract each genre and put into single column



In [5]:

    
df1 = df['genres'].str.split('|', expand=True)



In [6]:

    
df1 = pd.concat([df, df1], axis=1)



In [7]:

    
df1.head()









    Out[7]:






  
    
      
      movieId
      title
      genres
      year
      0
      1
      2
      3
      4
      5
      6
      7
      8
      9
    
  
  
    
      0
      1
      Toy Story (1995)
      Adventure|Animation|Children|Comedy|Fantasy
      1995
      Adventure
      Animation
      Children
      Comedy
      Fantasy
      None
      None
      None
      None
      None
    
    
      1
      2
      Jumanji (1995)
      Adventure|Children|Fantasy
      1995
      Adventure
      Children
      Fantasy
      None
      None
      None
      None
      None
      None
      None
    
    
      2
      3
      Grumpier Old Men (1995)
      Comedy|Romance
      1995
      Comedy
      Romance
      None
      None
      None
      None
      None
      None
      None
      None
    
    
      3
      4
      Waiting to Exhale (1995)
      Comedy|Drama|Romance
      1995
      Comedy
      Drama
      Romance
      None
      None
      None
      None
      None
      None
      None
    
    
      4
      5
      Father of the Bride Part II (1995)
      Comedy
      1995
      Comedy
      None
      None
      None
      None
      None
      None
      None
      None
      None



In [8]:

    
df2 = pd.melt(df1, id_vars=['title', 'year'], 
        value_vars=list(range(10)), value_name='genre')



In [9]:

    
df2.head()









    Out[9]:






  
    
      
      title
      year
      variable
      genre
    
  
  
    
      0
      Toy Story (1995)
      1995
      0
      Adventure
    
    
      1
      Jumanji (1995)
      1995
      0
      Adventure
    
    
      2
      Grumpier Old Men (1995)
      1995
      0
      Comedy
    
    
      3
      Waiting to Exhale (1995)
      1995
      0
      Comedy
    
    
      4
      Father of the Bride Part II (1995)
      1995
      0
      Comedy



In [10]:

    
df2.dropna(inplace=True)



In [11]:

    
df2.drop('variable', axis=1, inplace=True)



In [12]:

    
df2.head()









    Out[12]:






  
    
      
      title
      year
      genre
    
  
  
    
      0
      Toy Story (1995)
      1995
      Adventure
    
    
      1
      Jumanji (1995)
      1995
      Adventure
    
    
      2
      Grumpier Old Men (1995)
      1995
      Comedy
    
    
      3
      Waiting to Exhale (1995)
      1995
      Comedy
    
    
      4
      Father of the Bride Part II (1995)
      1995
      Comedy

Step 3: Count sub-groups



In [13]:

    
df3 = df2.groupby(['year', 'genre']).count()
df3.head()

Step 4: Unstack to get desired table



In [14]:

    
df4 = df3.unstack(1)



In [15]:

    
df4.head()









    Out[15]:






  
    
      
      title
    
    
      genre
      (no genres listed)
      Action
      Adventure
      Animation
      Children
      Comedy
      Crime
      Documentary
      Drama
      Fantasy
      Film-Noir
      Horror
      IMAX
      Musical
      Mystery
      Romance
      Sci-Fi
      Thriller
      War
      Western
    
    
      year
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1902
      NaN
      1.0
      1.0
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      1.0
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      1.0
      NaN
      NaN
      NaN
    
    
      1915
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      1.0
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      1.0
      NaN
    
    
      1916
      NaN
      1.0
      1.0
      NaN
      NaN
      NaN
      NaN
      NaN
      1.0
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      1.0
      NaN
      NaN
      NaN
    
    
      1917
      NaN
      NaN
      NaN
      NaN
      NaN
      1.0
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      1918
      NaN
      NaN
      NaN
      NaN
      NaN
      1.0
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN

Step 5: Clean up and plotting



In [16]:

    
df4.columns = df4.columns.levels[1]
df4.head()









    Out[16]:






  
    
      genre
      (no genres listed)
      Action
      Adventure
      Animation
      Children
      Comedy
      Crime
      Documentary
      Drama
      Fantasy
      Film-Noir
      Horror
      IMAX
      Musical
      Mystery
      Romance
      Sci-Fi
      Thriller
      War
      Western
    
    
      year
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1902
      NaN
      1.0
      1.0
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      1.0
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      1.0
      NaN
      NaN
      NaN
    
    
      1915
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      1.0
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      1.0
      NaN
    
    
      1916
      NaN
      1.0
      1.0
      NaN
      NaN
      NaN
      NaN
      NaN
      1.0
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      1.0
      NaN
      NaN
      NaN
    
    
      1917
      NaN
      NaN
      NaN
      NaN
      NaN
      1.0
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      1918
      NaN
      NaN
      NaN
      NaN
      NaN
      1.0
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN



In [17]:

    
df4.fillna(0, inplace=True)
df4 = df4.astype('int')
df4.head()









    Out[17]:






  
    
      genre
      (no genres listed)
      Action
      Adventure
      Animation
      Children
      Comedy
      Crime
      Documentary
      Drama
      Fantasy
      Film-Noir
      Horror
      IMAX
      Musical
      Mystery
      Romance
      Sci-Fi
      Thriller
      War
      Western
    
    
      year
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1902
      0
      1
      1
      0
      0
      0
      0
      0
      0
      1
      0
      0
      0
      0
      0
      0
      1
      0
      0
      0
    
    
      1915
      0
      0
      0
      0
      0
      0
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      1
      0
    
    
      1916
      0
      1
      1
      0
      0
      0
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      1
      0
      0
      0
    
    
      1917
      0
      0
      0
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      1918
      0
      0
      0
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0



In [19]:

    
df4.plot.line()
plt.legend(loc=(1.05, 0), fontsize=8)









    Out[19]:





<matplotlib.legend.Legend at 0x7f97fc5c2278>



In [ ]:

	movieId	title	genres
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
1	2	Jumanji (1995)	Adventure\|Children\|Fantasy
2	3	Grumpier Old Men (1995)	Comedy\|Romance
3	4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance
4	5	Father of the Bride Part II (1995)	Comedy

	movieId	title	genres	year
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	1995
1	2	Jumanji (1995)	Adventure\|Children\|Fantasy	1995
2	3	Grumpier Old Men (1995)	Comedy\|Romance	1995
3	4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance	1995
4	5	Father of the Bride Part II (1995)	Comedy	1995

	title	year	genre
0	Toy Story (1995)	1995	Adventure
1	Jumanji (1995)	1995	Adventure
2	Grumpier Old Men (1995)	1995	Comedy
3	Waiting to Exhale (1995)	1995	Comedy
4	Father of the Bride Part II (1995)	1995	Comedy

	title
genre	(no genres listed)	Action	Adventure	Animation	Children	Comedy	Crime	Documentary	Drama	Fantasy	Film-Noir	Horror	IMAX	Musical	Mystery	Romance	Sci-Fi	Thriller	War	Western
year
1902	NaN	1.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN
1915	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN
1916	NaN	1.0	1.0	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN
1917	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1918	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN