A Rubric for Data Wrangling and Exploration

Companion to Lecture 4 of Harvard CS109: Data Science | Prepared by Chris Beaumont

This scene from Cast Away is an accurate metaphor for the amount of time you'll spend cleaning data, and the delirium you'll experience at the end.



In [59]:

    
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

#tell pandas to display wide tables as pretty HTML tables
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)

def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
    """
    Minimize chartjunk by stripping out unnecesasry plot borders and axis ticks
    
    The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn
    """
    ax = axes or plt.gca()
    ax.spines['top'].set_visible(top)
    ax.spines['right'].set_visible(right)
    ax.spines['left'].set_visible(left)
    ax.spines['bottom'].set_visible(bottom)
    
    #turn off all ticks
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')
    
    #now re-enable visibles
    if top:
        ax.xaxis.tick_top()
    if bottom:
        ax.xaxis.tick_bottom()
    if left:
        ax.yaxis.tick_left()
    if right:
        ax.yaxis.tick_right()

I'd like to suggest a basic rubric for the early stages of exploratory data analysis in Python. This isn't universally applicable, but it does cover many patterns which recur in several data analysis contexts. It's useful to keep this rubric in mind when encountering a new dataset.

The basic workflow is as follows:

Build a DataFrame from the data (ideally, put all data in this object)
Clean the DataFrame. It should have the following properties:
- Each row describes a single object
- Each column describes a property of that object
- Columns are numeric whenever appropriate
- Columns contain atomic properties that cannot be further decomposed
Explore global properties. Use histograms, scatter plots, and aggregation functions to summarize the data.
Explore group properties. Use groupby and small multiples to compare subsets of the data.

This process transforms your data into a format which is easier to work with, gives you a basic overview of the data's properties, and likely generates several questions for you to followup in subsequent analysis.

Here's a preview of the raw data we'll use -- it's a list of the 10,000 movies made since 1950 with the most IMDB user ratings. It was scraped about a year ago from pages like this. Download the data at http://bit.ly/cs109_imdb.



In [60]:

    
!head imdb_top_10000.txt









    



tt0111161	The Shawshank Redemption (1994)	1994	 9.2	619479	142 mins.	Crime|Drama
tt0110912	Pulp Fiction (1994)	1994	 9.0	490065	154 mins.	Crime|Thriller
tt0137523	Fight Club (1999)	1999	 8.8	458173	139 mins.	Drama|Mystery|Thriller
tt0133093	The Matrix (1999)	1999	 8.7	448114	136 mins.	Action|Adventure|Sci-Fi
tt1375666	Inception (2010)	2010	 8.9	385149	148 mins.	Action|Adventure|Sci-Fi|Thriller
tt0109830	Forrest Gump (1994)	1994	 8.7	368994	142 mins.	Comedy|Drama|Romance
tt0169547	American Beauty (1999)	1999	 8.6	338332	122 mins.	Drama
tt0499549	Avatar (2009)	2009	 8.1	336855	162 mins.	Action|Adventure|Fantasy|Sci-Fi
tt0108052	Schindler's List (1993)	1993	 8.9	325888	195 mins.	Biography|Drama|History|War
tt0080684	Star Wars: Episode V - The Empire Strikes Back (1980)	1980	 8.8	320105	124 mins.	Action|Adventure|Family|Sci-Fi

1. Build a DataFrame

The textfile is tab-separated, and doesn't have any column headers. We set the appropriate keywords in pd.read_csv to handle this



In [61]:

    
names = ['imdbID', 'title', 'year', 'score', 'votes', 'runtime', 'genres']
data = pd.read_csv('imdb_top_10000.txt', delimiter='\t', names=names).dropna()
print "Number of rows: %i" % data.shape[0]
data.head()  # print the first 5 rows









    



Number of rows: 9999






    Out[61]:






  
    
      
      imdbID
      title
      year
      score
      votes
      runtime
      genres
    
  
  
    
      0
       tt0111161
       The Shawshank Redemption (1994)
       1994
       9.2
       619479
       142 mins.
                            Crime|Drama
    
    
      1
       tt0110912
                   Pulp Fiction (1994)
       1994
       9.0
       490065
       154 mins.
                         Crime|Thriller
    
    
      2
       tt0137523
                     Fight Club (1999)
       1999
       8.8
       458173
       139 mins.
                 Drama|Mystery|Thriller
    
    
      3
       tt0133093
                     The Matrix (1999)
       1999
       8.7
       448114
       136 mins.
                Action|Adventure|Sci-Fi
    
    
      4
       tt1375666
                      Inception (2010)
       2010
       8.9
       385149
       148 mins.
       Action|Adventure|Sci-Fi|Thriller

2. Clean the DataFrame

There are several problems with the DataFrame at this point:

The runtime column describes a number, but is stored as a string
The genres column is not atomic -- it aggregates several genres together. This makes it hard, for example, to extract which movies are Comedies.
The movie year is repeated in the title and year column

Fixing the runtime column

The following snipptet converts a string like '142 mins.' to the number 142:



In [62]:

    
dirty = '142 mins.'
number, text = dirty.split(' ')
clean = int(number)
print number

We can package this up into a list comprehension



In [63]:

    
clean_runtime = [float(r.split(' ')[0]) for r in data.runtime]
data['runtime'] = clean_runtime
data.head()









    Out[63]:






  
    
      
      imdbID
      title
      year
      score
      votes
      runtime
      genres
    
  
  
    
      0
       tt0111161
       The Shawshank Redemption (1994)
       1994
       9.2
       619479
       142
                            Crime|Drama
    
    
      1
       tt0110912
                   Pulp Fiction (1994)
       1994
       9.0
       490065
       154
                         Crime|Thriller
    
    
      2
       tt0137523
                     Fight Club (1999)
       1999
       8.8
       458173
       139
                 Drama|Mystery|Thriller
    
    
      3
       tt0133093
                     The Matrix (1999)
       1999
       8.7
       448114
       136
                Action|Adventure|Sci-Fi
    
    
      4
       tt1375666
                      Inception (2010)
       2010
       8.9
       385149
       148
       Action|Adventure|Sci-Fi|Thriller

Splitting up the genres

We can use the concept of indicator variables to split the genres column into many columns. Each new column will correspond to a single genre, and each cell will be True or False.



In [64]:

    
#determine the unique genres
genres = set()
for m in data.genres:
    genres.update(g for g in m.split('|'))
genres = sorted(genres)

#make a column for each genre
for genre in genres:
    data[genre] = [genre in movie.split('|') for movie in data.genres]
         
data.head()









    Out[64]:






  
    
      
      imdbID
      title
      year
      score
      votes
      runtime
      genres
      Action
      Adult
      Adventure
      Animation
      Biography
      Comedy
      Crime
      Drama
      Family
      Fantasy
      Film-Noir
      History
      Horror
      Music
      Musical
      Mystery
      News
      Reality-TV
      Romance
      Sci-Fi
      Sport
      Thriller
      War
      Western
    
  
  
    
      0
       tt0111161
       The Shawshank Redemption (1994)
       1994
       9.2
       619479
       142
                            Crime|Drama
       False
       False
       False
       False
       False
       False
        True
        True
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
    
    
      1
       tt0110912
                   Pulp Fiction (1994)
       1994
       9.0
       490065
       154
                         Crime|Thriller
       False
       False
       False
       False
       False
       False
        True
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
        True
       False
       False
    
    
      2
       tt0137523
                     Fight Club (1999)
       1999
       8.8
       458173
       139
                 Drama|Mystery|Thriller
       False
       False
       False
       False
       False
       False
       False
        True
       False
       False
       False
       False
       False
       False
       False
        True
       False
       False
       False
       False
       False
        True
       False
       False
    
    
      3
       tt0133093
                     The Matrix (1999)
       1999
       8.7
       448114
       136
                Action|Adventure|Sci-Fi
        True
       False
        True
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
        True
       False
       False
       False
       False
    
    
      4
       tt1375666
                      Inception (2010)
       2010
       8.9
       385149
       148
       Action|Adventure|Sci-Fi|Thriller
        True
       False
        True
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
        True
       False
        True
       False
       False

Removing year from the title

We can fix each element by stripping off the last 7 characters



In [65]:

    
data['title'] = [t[0:-7] for t in data.title]
data.head()









    Out[65]:






  
    
      
      imdbID
      title
      year
      score
      votes
      runtime
      genres
      Action
      Adult
      Adventure
      Animation
      Biography
      Comedy
      Crime
      Drama
      Family
      Fantasy
      Film-Noir
      History
      Horror
      Music
      Musical
      Mystery
      News
      Reality-TV
      Romance
      Sci-Fi
      Sport
      Thriller
      War
      Western
    
  
  
    
      0
       tt0111161
       The Shawshank Redemption
       1994
       9.2
       619479
       142
                            Crime|Drama
       False
       False
       False
       False
       False
       False
        True
        True
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
    
    
      1
       tt0110912
                   Pulp Fiction
       1994
       9.0
       490065
       154
                         Crime|Thriller
       False
       False
       False
       False
       False
       False
        True
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
        True
       False
       False
    
    
      2
       tt0137523
                     Fight Club
       1999
       8.8
       458173
       139
                 Drama|Mystery|Thriller
       False
       False
       False
       False
       False
       False
       False
        True
       False
       False
       False
       False
       False
       False
       False
        True
       False
       False
       False
       False
       False
        True
       False
       False
    
    
      3
       tt0133093
                     The Matrix
       1999
       8.7
       448114
       136
                Action|Adventure|Sci-Fi
        True
       False
        True
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
        True
       False
       False
       False
       False
    
    
      4
       tt1375666
                      Inception
       2010
       8.9
       385149
       148
       Action|Adventure|Sci-Fi|Thriller
        True
       False
        True
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
       False
        True
       False
        True
       False
       False

3. Explore global properties

Next, we get a handle on some basic, global summaries of the DataFrame.

Call `describe` on relevant columns



In [66]:

    
data[['score', 'runtime', 'year', 'votes']].describe()









    Out[66]:






  
    
      
      score
      runtime
      year
      votes
    
  
  
    
      count
       9999.000000
       9999.000000
       9999.000000
         9999.000000
    
    
      mean
          6.385989
        103.580358
       1993.471447
        16605.462946
    
    
      std
          1.189965
         26.629310
         14.830049
        34564.883945
    
    
      min
          1.500000
          0.000000
       1950.000000
         1356.000000
    
    
      25%
          5.700000
         93.000000
       1986.000000
         2334.500000
    
    
      50%
          6.600000
        102.000000
       1998.000000
         4981.000000
    
    
      75%
          7.200000
        115.000000
       2005.000000
        15278.500000
    
    
      max
          9.200000
        450.000000
       2011.000000
       619479.000000



In [67]:

    
#hmmm, a runtime of 0 looks suspicious. How many movies have that?
print len(data[data.runtime == 0])

#probably best to flag those bad data as NAN
data.runtime[data.runtime==0] = np.nan

After flagging bad runtimes, we repeat



In [68]:

    
data.runtime.describe()









    Out[68]:





count    9717.000000
mean      106.586395
std        20.230330
min        45.000000
25%        93.000000
50%       103.000000
75%       115.000000
max       450.000000
dtype: float64

Make some basic plots



In [69]:

    
# more movies in recent years, but not *very* recent movies (they haven't had time to receive lots of votes yet?)
plt.hist(data.year, bins=np.arange(1950, 2013), color='#cccccc')
plt.xlabel("Release Year")
remove_border()



In [70]:

    
plt.hist(data.score, bins=20, color='#cccccc')
plt.xlabel("IMDB rating")
remove_border()



In [71]:

    
plt.hist(data.runtime.dropna(), bins=50, color='#cccccc')
plt.xlabel("Runtime distribution")
remove_border()



In [72]:

    
#hmm, more bad, recent movies. Real, or a selection bias?

plt.scatter(data.year, data.score, lw=0, alpha=.08, color='k')
plt.xlabel("Year")
plt.ylabel("IMDB Rating")
remove_border()



In [73]:

    
plt.scatter(data.votes, data.score, lw=0, alpha=.2, color='k')
plt.xlabel("Number of Votes")
plt.ylabel("IMDB Rating")
plt.xscale('log')
remove_border()

Identify some outliers



In [74]:

    
# low-score movies with lots of votes
data[(data.votes > 9e4) & (data.score < 5)][['title', 'year', 'score', 'votes', 'genres']]









    Out[74]:






  
    
      
      title
      year
      score
      votes
      genres
    
  
  
    
      317
             New Moon
       2009
       4.5
       90457
       Adventure|Drama|Fantasy|Romance
    
    
      334
       Batman & Robin
       1997
       3.5
       91875
           Action|Crime|Fantasy|Sci-Fi



In [75]:

    
# The lowest rated movies
data[data.score == data.score.min()][['title', 'year', 'score', 'votes', 'genres']]









    Out[75]:






  
    
      
      title
      year
      score
      votes
      genres
    
  
  
    
      1982
           Manos: The Hands of Fate
       1966
       1.5
       20927
                                   Horror
    
    
      2793
       Superbabies: Baby Geniuses 2
       2004
       1.5
       13196
                            Comedy|Family
    
    
      3746
                  Daniel the Wizard
       2004
       1.5
        8271
       Comedy|Crime|Family|Fantasy|Horror
    
    
      5158
                       Ben & Arthur
       2002
       1.5
        4675
                            Drama|Romance
    
    
      5993
          Night Train to Mundo Fine
       1966
       1.5
        3542
               Action|Adventure|Crime|War
    
    
      6257
                    Monster a-Go Go
       1965
       1.5
        3255
                            Sci-Fi|Horror
    
    
      6726
                         Dream Well
       2009
       1.5
        2848
                     Comedy|Romance|Sport



In [76]:

    
# The highest rated movies
data[data.score == data.score.max()][['title', 'year', 'score', 'votes', 'genres']]









    Out[76]:






  
    
      
      title
      year
      score
      votes
      genres
    
  
  
    
      0 
       The Shawshank Redemption
       1994
       9.2
       619479
       Crime|Drama
    
    
      26
                  The Godfather
       1972
       9.2
       474189
       Crime|Drama

Run aggregation functions like `sum` over several rows or columns

What genres are the most frequent?



In [77]:

    
#sum sums over rows by default
genre_count = np.sort(data[genres].sum())[::-1]
pd.DataFrame({'Genre Count': genre_count})









    Out[77]:






  
    
      
      Genre Count
    
  
  
    
      Drama
       5697
    
    
      Comedy
       3922
    
    
      Thriller
       2832
    
    
      Romance
       2441
    
    
      Action
       1891
    
    
      Crime
       1867
    
    
      Adventure
       1313
    
    
      Horror
       1215
    
    
      Mystery
       1009
    
    
      Fantasy
        916
    
    
      Sci-Fi
        897
    
    
      Family
        754
    
    
      War
        512
    
    
      Biography
        394
    
    
      Music
        371
    
    
      History
        358
    
    
      Animation
        314
    
    
      Sport
        288
    
    
      Musical
        260
    
    
      Western
        235
    
    
      Film-Noir
         40
    
    
      Adult
          9
    
    
      News
          1
    
    
      Reality-TV
          1

How many genres does a movie have, on average?



In [78]:

    
#axis=1 sums over columns instead
genre_count = data[genres].sum(axis=1) 
print "Average movie has %0.2f genres" % genre_count.mean()
genre_count.describe()









    



Average movie has 2.75 genres






    Out[78]:





count    9999.000000
mean        2.753975
std         1.168910
min         1.000000
25%         2.000000
50%         3.000000
75%         3.000000
max         8.000000
dtype: float64

Explore Group Properties

Let's split up movies by decade



In [79]:

    
decade =  (data.year // 10) * 10

tyd = data[['title', 'year']]
tyd['decade'] = decade

tyd.head()









    Out[79]:






  
    
      
      title
      year
      decade
    
  
  
    
      0
       The Shawshank Redemption
       1994
       1990
    
    
      1
                   Pulp Fiction
       1994
       1990
    
    
      2
                     Fight Club
       1999
       1990
    
    
      3
                     The Matrix
       1999
       1990
    
    
      4
                      Inception
       2010
       2010

GroupBy will gather movies into groups with equal decade values



In [80]:

    
#mean score for all movies in each decade
decade_mean = data.groupby(decade).score.mean()
decade_mean.name = 'Decade Mean'
print decade_mean

plt.plot(decade_mean.index, decade_mean.values, 'o-',
        color='r', lw=3, label='Decade Average')
plt.scatter(data.year, data.score, alpha=.04, lw=0, color='k')
plt.xlabel("Year")
plt.ylabel("Score")
plt.legend(frameon=False)
remove_border()









    



year
1950    7.244522
1960    7.062367
1970    6.842297
1980    6.248693
1990    6.199316
2000    6.277858
2010    6.344552
Name: Decade Mean, dtype: float64

We can go one further, and compute the scatter in each year as well



In [81]:

    
grouped_scores = data.groupby(decade).score

mean = grouped_scores.mean()
std = grouped_scores.std()

plt.plot(decade_mean.index, decade_mean.values, 'o-',
        color='r', lw=3, label='Decade Average')
plt.fill_between(decade_mean.index, (decade_mean + std).values,
                 (decade_mean - std).values, color='r', alpha=.2)
plt.scatter(data.year, data.score, alpha=.04, lw=0, color='k')
plt.xlabel("Year")
plt.ylabel("Score")
plt.legend(frameon=False)
remove_border()

You can also iterate over a GroupBy object. Each iteration yields two variables: one of the distinct values of the group key, and the subset of the dataframe where the key equals that value. To find the most popular movie each year:



In [82]:

    
for year, subset in data.groupby('year'):
    print year, subset[subset.score == subset.score.max()].title.values









    



1950 ['Sunset Blvd.']
1951 ['Strangers on a Train']
1952 ["Singin' in the Rain"]
1953 ['The Wages of Fear' 'Tokyo Story']
1954 ['Seven Samurai']
1955 ['Diabolique']
1956 ['The Killing']
1957 ['12 Angry Men']
1958 ['Vertigo']
1959 ['North by Northwest']
1960 ['Psycho']
1961 ['Yojimbo']
1962 ['To Kill a Mockingbird' 'Lawrence of Arabia']
1963 ['The Great Escape' 'High and Low']
1964 ['Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb']
1965 ['For a Few Dollars More']
1966 ['The Good, the Bad and the Ugly']
1967 ['Cool Hand Luke']
1968 ['Once Upon a Time in the West']
1969 ['Butch Cassidy and the Sundance Kid' 'Army of Shadows']
1970 ['Patton' 'The Conformist' 'Le Cercle Rouge']
1971 ['A Clockwork Orange']
1972 ['The Godfather']
1973 ['The Sting' 'Scenes from a Marriage']
1974 ['The Godfather: Part II']
1975 ['Outrageous Class']
1976 ['Tosun Pasa']
1977 ['Star Wars: Episode IV - A New Hope']
1978 ['The Girl with the Red Scarf']
1979 ['Apocalypse Now']
1980 ['Star Wars: Episode V - The Empire Strikes Back']
1981 ['Raiders of the Lost Ark']
1982 ['The Marathon Family']
1983 ['Star Wars: Episode VI - Return of the Jedi']
1984 ['Balkan Spy']
1985 ['The Broken Landlord']
1986 ['Aliens']
1987 ['Mr. Muhsin']
1988 ['Cinema Paradiso']
1989 ['Indiana Jones and the Last Crusade' "Don't Let Them Shoot the Kite"]
1990 ['Goodfellas']
1991 ['The Silence of the Lambs']
1992 ['Reservoir Dogs']
1993 ["Schindler's List"]
1994 ['The Shawshank Redemption']
1995 ['The Usual Suspects' 'Se7en']
1996 ['Fargo' 'The Bandit']
1997 ['Life Is Beautiful']
1998 ['American History X']
1999 ['Fight Club']
2000 ['Memento']
2001 ['The Lord of the Rings: The Fellowship of the Ring']
2002 ['City of God']
2003 ['The Lord of the Rings: The Return of the King']
2004 ['Eternal Sunshine of the Spotless Mind']
2005 ['My Father and My Son']
2006 ['The Departed' 'The Lives of Others']
2007 ['Like Stars on Earth']
2008 ['The Dark Knight']
2009 ['Inglourious Basterds']
2010 ['Inception']
2011 ['A Separation']

Small multiples

Let's split up the movies by genre, and look at how their release year/runtime/IMDB score vary. The distribution for all movies is shown as a grey background.

This isn't a standard groupby, so we can't use the groupby method here. A manual loop is needed



In [83]:

    
#create a 4x6 grid of plots.
fig, axes = plt.subplots(nrows=4, ncols=6, figsize=(12, 8), 
                         tight_layout=True)

bins = np.arange(1950, 2013, 3)
for ax, genre in zip(axes.ravel(), genres):
    ax.hist(data[data[genre] == 1].year, 
            bins=bins, histtype='stepfilled', normed=True, color='r', alpha=.3, ec='none')
    ax.hist(data.year, bins=bins, histtype='stepfilled', ec='None', normed=True, zorder=0, color='#cccccc')
    
    ax.annotate(genre, xy=(1955, 3e-2), fontsize=14)
    ax.xaxis.set_ticks(np.arange(1950, 2013, 30))
    ax.set_yticks([])
    remove_border(ax, left=False)
    ax.set_xlabel('Year')

Some subtler patterns here:

Westerns and Musicals have a more level distribution
Film Noir movies were much more popular in the 50s and 60s



In [84]:

    
fig, axes = plt.subplots(nrows=4, ncols=6, figsize=(12, 8), tight_layout=True)

bins = np.arange(30, 240, 10)

for ax, genre in zip(axes.ravel(), genres):
    ax.hist(data[data[genre] == 1].runtime, 
            bins=bins, histtype='stepfilled', color='r', ec='none', alpha=.3, normed=True)
               
    ax.hist(data.runtime, bins=bins, normed=True,
            histtype='stepfilled', ec='none', color='#cccccc',
            zorder=0)
    
    ax.set_xticks(np.arange(30, 240, 60))
    ax.set_yticks([])
    ax.set_xlabel("Runtime [min]")
    remove_border(ax, left=False)
    ax.annotate(genre, xy=(230, .02), ha='right', fontsize=12)

Biographies and history movies are longer
Animated movies are shorter
Film-Noir movies have the same mean, but are more conentrated around a 100 minute runtime
Musicals have the same mean, but greater dispersion in runtimes



In [85]:

    
fig, axes = plt.subplots(nrows=4, ncols=6, figsize=(12, 8), tight_layout=True)

bins = np.arange(0, 10, .5)

for ax, genre in zip(axes.ravel(), genres):
    ax.hist(data[data[genre] == 1].score, 
            bins=bins, histtype='stepfilled', color='r', ec='none', alpha=.3, normed=True)
               
    ax.hist(data.score, bins=bins, normed=True,
            histtype='stepfilled', ec='none', color='#cccccc',
            zorder=0)
    
    ax.set_yticks([])
    ax.set_xlabel("Score")
    remove_border(ax, left=False)
    ax.set_ylim(0, .4)
    ax.annotate(genre, xy=(0, .2), ha='left', fontsize=12)

Film-noirs, histories, and biographies have higher ratings (a selection effect?)
Horror movies and adult films have lower ratings

Other Resources

Chapters 5 and 7 of Python for Data Analysis
IPython notebook on Data wrangling with Pandas
Lecture 4 of Harvard's CS 109: Data Science

css tweaks in this cell

	imdbID	title	year	score	votes	runtime	genres
0	tt0111161	The Shawshank Redemption (1994)	1994	9.2	619479	142 mins.	Crime\|Drama
1	tt0110912	Pulp Fiction (1994)	1994	9.0	490065	154 mins.	Crime\|Thriller
2	tt0137523	Fight Club (1999)	1999	8.8	458173	139 mins.	Drama\|Mystery\|Thriller
3	tt0133093	The Matrix (1999)	1999	8.7	448114	136 mins.	Action\|Adventure\|Sci-Fi
4	tt1375666	Inception (2010)	2010	8.9	385149	148 mins.	Action\|Adventure\|Sci-Fi\|Thriller

	imdbID	title	year	score	votes	runtime	genres	Action	Adult	Adventure	Animation	Biography	Comedy	Crime	Drama	Family	Fantasy	Film-Noir	History	Horror	Music	Musical	Mystery	News	Reality-TV	Romance	Sci-Fi	Sport	Thriller	War	Western
0	tt0111161	The Shawshank Redemption	1994	9.2	619479	142	Crime\|Drama	False	False	False	False	False	False	True	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False
1	tt0110912	Pulp Fiction	1994	9.0	490065	154	Crime\|Thriller	False	False	False	False	False	False	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	True	False	False
2	tt0137523	Fight Club	1999	8.8	458173	139	Drama\|Mystery\|Thriller	False	False	False	False	False	False	False	True	False	False	False	False	False	False	False	True	False	False	False	False	False	True	False	False
3	tt0133093	The Matrix	1999	8.7	448114	136	Action\|Adventure\|Sci-Fi	True	False	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	True	False	False	False	False
4	tt1375666	Inception	2010	8.9	385149	148	Action\|Adventure\|Sci-Fi\|Thriller	True	False	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	True	False	True	False	False

	score	runtime	year	votes
count	9999.000000	9999.000000	9999.000000	9999.000000
mean	6.385989	103.580358	1993.471447	16605.462946
std	1.189965	26.629310	14.830049	34564.883945
min	1.500000	0.000000	1950.000000	1356.000000
25%	5.700000	93.000000	1986.000000	2334.500000
50%	6.600000	102.000000	1998.000000	4981.000000
75%	7.200000	115.000000	2005.000000	15278.500000
max	9.200000	450.000000	2011.000000	619479.000000

	title	year	score	votes	genres
317	New Moon	2009	4.5	90457	Adventure\|Drama\|Fantasy\|Romance
334	Batman & Robin	1997	3.5	91875	Action\|Crime\|Fantasy\|Sci-Fi

	title	year	score	votes	genres
1982	Manos: The Hands of Fate	1966	1.5	20927	Horror
2793	Superbabies: Baby Geniuses 2	2004	1.5	13196	Comedy\|Family
3746	Daniel the Wizard	2004	1.5	8271	Comedy\|Crime\|Family\|Fantasy\|Horror
5158	Ben & Arthur	2002	1.5	4675	Drama\|Romance
5993	Night Train to Mundo Fine	1966	1.5	3542	Action\|Adventure\|Crime\|War
6257	Monster a-Go Go	1965	1.5	3255	Sci-Fi\|Horror
6726	Dream Well	2009	1.5	2848	Comedy\|Romance\|Sport

	Genre Count
Drama	5697
Comedy	3922
Thriller	2832
Romance	2441
Action	1891
Crime	1867
Adventure	1313
Horror	1215
Mystery	1009
Fantasy	916
Sci-Fi	897
Family	754
War	512
Biography	394
Music	371
History	358
Animation	314
Sport	288
Musical	260
Western	235
Film-Noir	40
Adult	9
News	1
Reality-TV	1

	title	year	decade
0	The Shawshank Redemption	1994	1990
1	Pulp Fiction	1994	1990
2	Fight Club	1999	1990
3	The Matrix	1999	1990
4	Inception	2010	2010