Pandas

pandas is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python.

pandas build upon numpy and scipy providing easy-to-use data structures and data manipulation functions with integrated indexing.

The main data structures pandas provides are Series and DataFrames. After a brief introduction to these two data structures and data ingestion, the key features of pandas this notebook covers are:

Generating descriptive statistics on data
Data cleaning using built in pandas functions
Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data
Merging multiple datasets using dataframes
Working with timestamps and time-series data

Additional Recommended Resources:

pandas Documentation: http://pandas.pydata.org/pandas-docs/stable/
Python for Data Analysis by Wes McKinney
Python Data Science Handbook by Jake VanderPlas

Let's get started with our first pandas notebook!

Import Libraries



In [1]:

    
import pandas as pd

Introduction to pandas Data Structures

*pandas* has two main data structures it uses, namely, *Series* and *DataFrames*.

pandas Series

pandas Series one-dimensional labeled array.



In [2]:

    
ser = pd.Series([100, 'foo', 300, 'bar', 500], ['tom', 'bob', 'nancy', 'dan', 'eric'])



In [3]:

    
ser









    Out[3]:





tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object



In [4]:

    
ser.index









    Out[4]:





Index(['tom', 'bob', 'nancy', 'dan', 'eric'], dtype='object')



In [5]:

    
ser.loc[['nancy','bob']]









    Out[5]:





nancy    300
bob      foo
dtype: object



In [6]:

    
ser[[4, 3, 1]]









    Out[6]:





eric    500
dan     bar
bob     foo
dtype: object



In [7]:

    
ser.iloc[2]









    Out[7]:





300



In [8]:

    
'bob' in ser









    Out[8]:





True



In [9]:

    
ser









    Out[9]:





tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object



In [10]:

    
ser * 2









    Out[10]:





tom         200
bob      foofoo
nancy       600
dan      barbar
eric       1000
dtype: object



In [11]:

    
ser[['nancy', 'eric']] ** 2









    Out[11]:





nancy     90000
eric     250000
dtype: object

pandas DataFrame

pandas DataFrame is a 2-dimensional labeled data structure.

Create DataFrame from dictionary of Python Series



In [12]:

    
d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}



In [13]:

    
d









    Out[13]:





{'one': apple    100.0
 ball     200.0
 clock    300.0
 dtype: float64, 'two': apple      111.0
 ball       222.0
 cerill     333.0
 dancy     4444.0
 dtype: float64}



In [14]:

    
df = pd.DataFrame(d)
print(df)









    



          one     two
apple   100.0   111.0
ball    200.0   222.0
cerill    NaN   333.0
clock   300.0     NaN
dancy     NaN  4444.0



In [15]:

    
df.index









    Out[15]:





Index(['apple', 'ball', 'cerill', 'clock', 'dancy'], dtype='object')



In [16]:

    
df.columns









    Out[16]:





Index(['one', 'two'], dtype='object')



In [17]:

    
pd.DataFrame(d, index=['dancy', 'ball', 'apple'])



In [18]:

    
pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])









    Out[18]:






  
    
      
      two
      five
    
  
  
    
      dancy
      4444.0
      NaN
    
    
      ball
      222.0
      NaN
    
    
      apple
      111.0
      NaN

Create DataFrame from list of Python dictionaries



In [19]:

    
data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]



In [20]:

    
pd.DataFrame(data)



In [21]:

    
pd.DataFrame(data, index=['orange', 'red'])









    Out[21]:






  
    
      
      alex
      alice
      dora
      ema
      joe
    
  
  
    
      orange
      1.0
      NaN
      NaN
      NaN
      2.0
    
    
      red
      NaN
      20.0
      10.0
      5.0
      NaN



In [22]:

    
pd.DataFrame(data, columns=['joe', 'dora','alice'])

Basic DataFrame operations



In [23]:

    
df



In [24]:

    
df['one']









    Out[24]:





apple     100.0
ball      200.0
cerill      NaN
clock     300.0
dancy       NaN
Name: one, dtype: float64



In [25]:

    
df['three'] = df['one'] * df['two']
df



In [26]:

    
df['flag'] = df['one'] > 250
df









    Out[26]:






  
    
      
      one
      two
      three
      flag
    
  
  
    
      apple
      100.0
      111.0
      11100.0
      False
    
    
      ball
      200.0
      222.0
      44400.0
      False
    
    
      cerill
      NaN
      333.0
      NaN
      False
    
    
      clock
      300.0
      NaN
      NaN
      True
    
    
      dancy
      NaN
      4444.0
      NaN
      False



In [27]:

    
three = df.pop('three')



In [28]:

    
three









    Out[28]:





apple     11100.0
ball      44400.0
cerill        NaN
clock         NaN
dancy         NaN
Name: three, dtype: float64



In [29]:

    
df









    Out[29]:






  
    
      
      one
      two
      flag
    
  
  
    
      apple
      100.0
      111.0
      False
    
    
      ball
      200.0
      222.0
      False
    
    
      cerill
      NaN
      333.0
      False
    
    
      clock
      300.0
      NaN
      True
    
    
      dancy
      NaN
      4444.0
      False



In [30]:

    
del df['two']



In [31]:

    
df









    Out[31]:






  
    
      
      one
      flag
    
  
  
    
      apple
      100.0
      False
    
    
      ball
      200.0
      False
    
    
      cerill
      NaN
      False
    
    
      clock
      300.0
      True
    
    
      dancy
      NaN
      False



In [32]:

    
df.insert(2, 'copy_of_one', df['one'])
df









    Out[32]:






  
    
      
      one
      flag
      copy_of_one
    
  
  
    
      apple
      100.0
      False
      100.0
    
    
      ball
      200.0
      False
      200.0
    
    
      cerill
      NaN
      False
      NaN
    
    
      clock
      300.0
      True
      300.0
    
    
      dancy
      NaN
      False
      NaN



In [33]:

    
df['one_upper_half'] = df['one'][:2]
df









    Out[33]:






  
    
      
      one
      flag
      copy_of_one
      one_upper_half
    
  
  
    
      apple
      100.0
      False
      100.0
      100.0
    
    
      ball
      200.0
      False
      200.0
      200.0
    
    
      cerill
      NaN
      False
      NaN
      NaN
    
    
      clock
      300.0
      True
      300.0
      NaN
    
    
      dancy
      NaN
      False
      NaN
      NaN

Case Study: Movie Data Analysis

This notebook uses a dataset from the MovieLens website. We will describe the dataset further as we explore with it using pandas.

Download the Dataset

Please note that you will need to download the dataset. Although the video for this notebook says that the data is in your folder, the folder turned out to be too large to fit on the edX platform due to size constraints.

Here are the links to the data source and location:

Data Source: MovieLens web site (filename: ml-20m.zip)
Location: https://grouplens.org/datasets/movielens/

Once the download completes, please make sure the data files are in a directory called movielens in your Week-3-pandas folder.

Let us look at the files in this dataset using the UNIX command ls.



In [34]:

    
path = r'/Users/Harish/Documents/HK_Work/Python/ml-20m/'



In [35]:

    
# Note: Adjust the name of the folder to match your local directory

!ls /Users/Harish/Documents/HK_Work/Python/ml-20m/









    



README.txt        genome-tags.csv   movies.csv        tags.csv
genome-scores.csv links.csv         ratings.csv



In [36]:

    
!cat /Users/Harish/Documents/HK_Work/Python/ml-20m/movies.csv | wc -l



In [37]:

    
!head -5 /Users/Harish/Documents/HK_Work/Python/ml-20m/ratings.csv

Use Pandas to Read the Dataset

In this notebook, we will be using three CSV files:

ratings.csv : userId,movieId,rating, timestamp
tags.csv : userId,movieId, tag, timestamp
movies.csv : movieId, title, genres

Using the read_csv function in pandas, we will ingest these three files.



In [38]:

    
movies = pd.read_csv(path+'movies.csv', sep=',')
print(type(movies))
movies.head(15)









    



<class 'pandas.core.frame.DataFrame'>






    Out[38]:






  
    
      
      movieId
      title
      genres
    
  
  
    
      0
      1
      Toy Story (1995)
      Adventure|Animation|Children|Comedy|Fantasy
    
    
      1
      2
      Jumanji (1995)
      Adventure|Children|Fantasy
    
    
      2
      3
      Grumpier Old Men (1995)
      Comedy|Romance
    
    
      3
      4
      Waiting to Exhale (1995)
      Comedy|Drama|Romance
    
    
      4
      5
      Father of the Bride Part II (1995)
      Comedy
    
    
      5
      6
      Heat (1995)
      Action|Crime|Thriller
    
    
      6
      7
      Sabrina (1995)
      Comedy|Romance
    
    
      7
      8
      Tom and Huck (1995)
      Adventure|Children
    
    
      8
      9
      Sudden Death (1995)
      Action
    
    
      9
      10
      GoldenEye (1995)
      Action|Adventure|Thriller
    
    
      10
      11
      American President, The (1995)
      Comedy|Drama|Romance
    
    
      11
      12
      Dracula: Dead and Loving It (1995)
      Comedy|Horror
    
    
      12
      13
      Balto (1995)
      Adventure|Animation|Children
    
    
      13
      14
      Nixon (1995)
      Drama
    
    
      14
      15
      Cutthroat Island (1995)
      Action|Adventure|Romance



In [39]:

    
# Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970

tags = pd.read_csv(path+'tags.csv', sep=',')
tags.head()









    Out[39]:






  
    
      
      userId
      movieId
      tag
      timestamp
    
  
  
    
      0
      18
      4141
      Mark Waters
      1240597180
    
    
      1
      65
      208
      dark hero
      1368150078
    
    
      2
      65
      353
      dark hero
      1368150079
    
    
      3
      65
      521
      noir thriller
      1368149983
    
    
      4
      65
      592
      dark hero
      1368150078



In [40]:

    
ratings = pd.read_csv(path+'ratings.csv', sep=',', parse_dates=['timestamp'])
ratings.head()



In [41]:

    
# For current analysis, we will remove timestamp (we will come back to it!)

del ratings['timestamp']
del tags['timestamp']

Data Structures

Series



In [42]:

    
#Extract 0th row: notice that it is infact a Series

row_0 = tags.iloc[0]
type(row_0)









    Out[42]:





pandas.core.series.Series



In [43]:

    
print(row_0)









    



userId              18
movieId           4141
tag        Mark Waters
Name: 0, dtype: object



In [44]:

    
row_0.index









    Out[44]:





Index(['userId', 'movieId', 'tag'], dtype='object')



In [45]:

    
row_0['userId']









    Out[45]:





18



In [46]:

    
'rating' in row_0









    Out[46]:





False



In [47]:

    
row_0.name









    Out[47]:





0



In [48]:

    
row_0 = row_0.rename('first_row')
row_0.name









    Out[48]:





'first_row'

DataFrames



In [49]:

    
tags.head()









    Out[49]:






  
    
      
      userId
      movieId
      tag
    
  
  
    
      0
      18
      4141
      Mark Waters
    
    
      1
      65
      208
      dark hero
    
    
      2
      65
      353
      dark hero
    
    
      3
      65
      521
      noir thriller
    
    
      4
      65
      592
      dark hero



In [50]:

    
tags.index









    Out[50]:





RangeIndex(start=0, stop=465564, step=1)



In [51]:

    
tags.columns









    Out[51]:





Index(['userId', 'movieId', 'tag'], dtype='object')



In [52]:

    
# Extract row 0, 11, 2000 from DataFrame

tags.iloc[ [0,11,2000] ]









    Out[52]:






  
    
      
      userId
      movieId
      tag
    
  
  
    
      0
      18
      4141
      Mark Waters
    
    
      11
      65
      1783
      noir thriller
    
    
      2000
      910
      68554
      conspiracy theory

Descriptive Statistics

Let's look how the ratings are distributed!



In [53]:

    
ratings['rating'].describe()









    Out[53]:





count    2.000026e+07
mean     3.525529e+00
std      1.051989e+00
min      5.000000e-01
25%      3.000000e+00
50%      3.500000e+00
75%      4.000000e+00
max      5.000000e+00
Name: rating, dtype: float64



In [54]:

    
ratings.describe()









    Out[54]:






  
    
      
      userId
      movieId
      rating
    
  
  
    
      count
      2.000026e+07
      2.000026e+07
      2.000026e+07
    
    
      mean
      6.904587e+04
      9.041567e+03
      3.525529e+00
    
    
      std
      4.003863e+04
      1.978948e+04
      1.051989e+00
    
    
      min
      1.000000e+00
      1.000000e+00
      5.000000e-01
    
    
      25%
      3.439500e+04
      9.020000e+02
      3.000000e+00
    
    
      50%
      6.914100e+04
      2.167000e+03
      3.500000e+00
    
    
      75%
      1.036370e+05
      4.770000e+03
      4.000000e+00
    
    
      max
      1.384930e+05
      1.312620e+05
      5.000000e+00



In [55]:

    
ratings['rating'].mean()









    Out[55]:





3.5255285642993797



In [56]:

    
ratings.mean()









    Out[56]:





userId     69045.872583
movieId     9041.567330
rating         3.525529
dtype: float64



In [57]:

    
ratings['rating'].min()









    Out[57]:





0.5



In [58]:

    
ratings['rating'].max()









    Out[58]:





5.0



In [59]:

    
ratings['rating'].std()









    Out[59]:





1.051988919275684



In [60]:

    
ratings['rating'].mode()









    Out[60]:





0    4.0
dtype: float64



In [61]:

    
ratings.corr()



In [62]:

    
filter_1 = ratings['rating'] > 5
print(filter_1)
filter_1.any()









    



0           False
1           False
2           False
3           False
4           False
5           False
6           False
7           False
8           False
9           False
10          False
11          False
12          False
13          False
14          False
15          False
16          False
17          False
18          False
19          False
20          False
21          False
22          False
23          False
24          False
25          False
26          False
27          False
28          False
29          False
            ...  
20000233    False
20000234    False
20000235    False
20000236    False
20000237    False
20000238    False
20000239    False
20000240    False
20000241    False
20000242    False
20000243    False
20000244    False
20000245    False
20000246    False
20000247    False
20000248    False
20000249    False
20000250    False
20000251    False
20000252    False
20000253    False
20000254    False
20000255    False
20000256    False
20000257    False
20000258    False
20000259    False
20000260    False
20000261    False
20000262    False
Name: rating, dtype: bool






    Out[62]:





False



In [63]:

    
filter_2 = ratings['rating'] > 0
filter_2.all()









    Out[63]:





True

Data Cleaning: Handling Missing Data



In [64]:

    
movies.shape









    Out[64]:





(27278, 3)



In [65]:

    
#is any row NULL ?

movies.isnull().any()









    Out[65]:





movieId    False
title      False
genres     False
dtype: bool

Thats nice ! No NULL values !



In [66]:

    
ratings.shape









    Out[66]:





(20000263, 3)



In [67]:

    
#is any row NULL ?

ratings.isnull().any()









    Out[67]:





userId     False
movieId    False
rating     False
dtype: bool

Thats nice ! No NULL values !



In [68]:

    
tags.shape









    Out[68]:





(465564, 3)



In [69]:

    
#is any row NULL ?

tags.isnull().any()









    Out[69]:





userId     False
movieId    False
tag         True
dtype: bool

We have some tags which are NULL.



In [70]:

    
tags = tags.dropna()



In [71]:

    
#Check again: is any row NULL ?

tags.isnull().any()









    Out[71]:





userId     False
movieId    False
tag        False
dtype: bool



In [72]:

    
tags.shape









    Out[72]:





(465548, 3)

Thats nice ! No NULL values ! Notice the number of lines have reduced.

Data Visualization



In [73]:

    
%matplotlib inline

ratings.hist(column='rating', figsize=(15,10))









    Out[73]:





array([[<matplotlib.axes._subplots.AxesSubplot object at 0x114e30080>]], dtype=object)



In [74]:

    
ratings.boxplot(column='rating', figsize=(15,20))









    Out[74]:





<matplotlib.axes._subplots.AxesSubplot at 0x114fcdac8>

Slicing Out Columns



In [75]:

    
tags['tag'].head()









    Out[75]:





0      Mark Waters
1        dark hero
2        dark hero
3    noir thriller
4        dark hero
Name: tag, dtype: object



In [76]:

    
movies[['title','genres']].head()









    Out[76]:






  
    
      
      title
      genres
    
  
  
    
      0
      Toy Story (1995)
      Adventure|Animation|Children|Comedy|Fantasy
    
    
      1
      Jumanji (1995)
      Adventure|Children|Fantasy
    
    
      2
      Grumpier Old Men (1995)
      Comedy|Romance
    
    
      3
      Waiting to Exhale (1995)
      Comedy|Drama|Romance
    
    
      4
      Father of the Bride Part II (1995)
      Comedy



In [77]:

    
ratings[-10:]



In [78]:

    
tag_counts = tags['tag'].value_counts()
tag_counts[-10:]









    Out[78]:





clockwork orange cinematography                                   1
Im gonna make him an offer he can't refuse. A true masterpiece    1
Nadezhda Kosheverova                                              1
That's my stapler                                                 1
trap                                                              1
creates a new world                                               1
hungarian                                                         1
Ewe Mean Sausage McHoody With Eggs                                1
Anders Nilsson                                                    1
Lawyer                                                            1
Name: tag, dtype: int64



In [79]:

    
tag_counts[:10].plot(kind='bar', figsize=(15,10))









    Out[79]:





<matplotlib.axes._subplots.AxesSubplot at 0x1167330b8>

Filters for Selecting Rows



In [80]:

    
is_highly_rated = ratings['rating'] >= 4.0

ratings[is_highly_rated][30:50]



In [81]:

    
is_animation = movies['genres'].str.contains('Animation')

movies[is_animation][5:15]









    Out[81]:






  
    
      
      movieId
      title
      genres
    
  
  
    
      310
      313
      Swan Princess, The (1994)
      Animation|Children
    
    
      360
      364
      Lion King, The (1994)
      Adventure|Animation|Children|Drama|Musical|IMAX
    
    
      388
      392
      Secret Adventures of Tom Thumb, The (1993)
      Adventure|Animation
    
    
      547
      551
      Nightmare Before Christmas, The (1993)
      Animation|Children|Fantasy|Musical
    
    
      553
      558
      Pagemaster, The (1994)
      Action|Adventure|Animation|Children|Fantasy
    
    
      582
      588
      Aladdin (1992)
      Adventure|Animation|Children|Comedy|Musical
    
    
      588
      594
      Snow White and the Seven Dwarfs (1937)
      Animation|Children|Drama|Fantasy|Musical
    
    
      589
      595
      Beauty and the Beast (1991)
      Animation|Children|Fantasy|Musical|Romance|IMAX
    
    
      590
      596
      Pinocchio (1940)
      Animation|Children|Fantasy|Musical
    
    
      604
      610
      Heavy Metal (1981)
      Action|Adventure|Animation|Horror|Sci-Fi



In [82]:

    
movies[is_animation].head(15)









    Out[82]:






  
    
      
      movieId
      title
      genres
    
  
  
    
      0
      1
      Toy Story (1995)
      Adventure|Animation|Children|Comedy|Fantasy
    
    
      12
      13
      Balto (1995)
      Adventure|Animation|Children
    
    
      47
      48
      Pocahontas (1995)
      Animation|Children|Drama|Musical|Romance
    
    
      236
      239
      Goofy Movie, A (1995)
      Animation|Children|Comedy|Romance
    
    
      241
      244
      Gumby: The Movie (1995)
      Animation|Children
    
    
      310
      313
      Swan Princess, The (1994)
      Animation|Children
    
    
      360
      364
      Lion King, The (1994)
      Adventure|Animation|Children|Drama|Musical|IMAX
    
    
      388
      392
      Secret Adventures of Tom Thumb, The (1993)
      Adventure|Animation
    
    
      547
      551
      Nightmare Before Christmas, The (1993)
      Animation|Children|Fantasy|Musical
    
    
      553
      558
      Pagemaster, The (1994)
      Action|Adventure|Animation|Children|Fantasy
    
    
      582
      588
      Aladdin (1992)
      Adventure|Animation|Children|Comedy|Musical
    
    
      588
      594
      Snow White and the Seven Dwarfs (1937)
      Animation|Children|Drama|Fantasy|Musical
    
    
      589
      595
      Beauty and the Beast (1991)
      Animation|Children|Fantasy|Musical|Romance|IMAX
    
    
      590
      596
      Pinocchio (1940)
      Animation|Children|Fantasy|Musical
    
    
      604
      610
      Heavy Metal (1981)
      Action|Adventure|Animation|Horror|Sci-Fi

Group By and Aggregate



In [83]:

    
ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count



In [84]:

    
average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.head()



In [85]:

    
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()



In [86]:

    
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.tail()

Merge Dataframes



In [87]:

    
tags.head()









    Out[87]:






  
    
      
      userId
      movieId
      tag
    
  
  
    
      0
      18
      4141
      Mark Waters
    
    
      1
      65
      208
      dark hero
    
    
      2
      65
      353
      dark hero
    
    
      3
      65
      521
      noir thriller
    
    
      4
      65
      592
      dark hero



In [88]:

    
movies.head()









    Out[88]:






  
    
      
      movieId
      title
      genres
    
  
  
    
      0
      1
      Toy Story (1995)
      Adventure|Animation|Children|Comedy|Fantasy
    
    
      1
      2
      Jumanji (1995)
      Adventure|Children|Fantasy
    
    
      2
      3
      Grumpier Old Men (1995)
      Comedy|Romance
    
    
      3
      4
      Waiting to Exhale (1995)
      Comedy|Drama|Romance
    
    
      4
      5
      Father of the Bride Part II (1995)
      Comedy



In [89]:

    
t = movies.merge(tags, on='movieId', how='inner')
t.head()









    Out[89]:






  
    
      
      movieId
      title
      genres
      userId
      tag
    
  
  
    
      0
      1
      Toy Story (1995)
      Adventure|Animation|Children|Comedy|Fantasy
      1644
      Watched
    
    
      1
      1
      Toy Story (1995)
      Adventure|Animation|Children|Comedy|Fantasy
      1741
      computer animation
    
    
      2
      1
      Toy Story (1995)
      Adventure|Animation|Children|Comedy|Fantasy
      1741
      Disney animated feature
    
    
      3
      1
      Toy Story (1995)
      Adventure|Animation|Children|Comedy|Fantasy
      1741
      Pixar animation
    
    
      4
      1
      Toy Story (1995)
      Adventure|Animation|Children|Comedy|Fantasy
      1741
      TÃ©a Leoni does not star in this movie

More examples: http://pandas.pydata.org/pandas-docs/stable/merging.html

Combine aggreagation, merging, and filters to get useful analytics



In [103]:

    
avg_ratings = ratings.groupby('movieId', as_index=False).mean()
del avg_ratings['userId']
avg_ratings.head()



In [104]:

    
box_office = movies.merge(avg_ratings, on='movieId', how='inner')
box_office.tail()









    Out[104]:






  
    
      
      movieId
      title
      genres
      year
      rating
    
  
  
    
      26739
      131254
      Kein Bund für's Leben (2007)
      Comedy
      2007
      4.0
    
    
      26740
      131256
      Feuer, Eis & Dosenbier (2002)
      Comedy
      2002
      4.0
    
    
      26741
      131258
      The Pirates (2014)
      Adventure
      2014
      2.5
    
    
      26742
      131260
      Rentun Ruusu (2001)
      (no genres listed)
      2001
      3.0
    
    
      26743
      131262
      Innocence (2014)
      Adventure|Fantasy|Horror
      2014
      4.0



In [105]:

    
is_highly_rated = box_office['rating'] >= 4.0

box_office[is_highly_rated][-5:]









    Out[105]:






  
    
      
      movieId
      title
      genres
      year
      rating
    
  
  
    
      26737
      131250
      No More School (2000)
      Comedy
      2000
      4.0
    
    
      26738
      131252
      Forklift Driver Klaus: The First Day on the Jo...
      Comedy|Horror
      2001
      4.0
    
    
      26739
      131254
      Kein Bund für's Leben (2007)
      Comedy
      2007
      4.0
    
    
      26740
      131256
      Feuer, Eis & Dosenbier (2002)
      Comedy
      2002
      4.0
    
    
      26743
      131262
      Innocence (2014)
      Adventure|Fantasy|Horror
      2014
      4.0



In [106]:

    
is_comedy = box_office['genres'].str.contains('Comedy')

box_office[is_comedy][:5]









    Out[106]:






  
    
      
      movieId
      title
      genres
      year
      rating
    
  
  
    
      0
      1
      Toy Story (1995)
      Adventure|Animation|Children|Comedy|Fantasy
      1995
      3.921240
    
    
      2
      3
      Grumpier Old Men (1995)
      Comedy|Romance
      1995
      3.151040
    
    
      3
      4
      Waiting to Exhale (1995)
      Comedy|Drama|Romance
      1995
      2.861393
    
    
      4
      5
      Father of the Bride Part II (1995)
      Comedy
      1995
      3.064592
    
    
      6
      7
      Sabrina (1995)
      Comedy|Romance
      1995
      3.366484



In [107]:

    
box_office[is_comedy & is_highly_rated][-5:]









    Out[107]:






  
    
      
      movieId
      title
      genres
      year
      rating
    
  
  
    
      26736
      131248
      Brother Bear 2 (2006)
      Adventure|Animation|Children|Comedy|Fantasy
      2006
      4.0
    
    
      26737
      131250
      No More School (2000)
      Comedy
      2000
      4.0
    
    
      26738
      131252
      Forklift Driver Klaus: The First Day on the Jo...
      Comedy|Horror
      2001
      4.0
    
    
      26739
      131254
      Kein Bund für's Leben (2007)
      Comedy
      2007
      4.0
    
    
      26740
      131256
      Feuer, Eis & Dosenbier (2002)
      Comedy
      2002
      4.0

Vectorized String Operations



In [108]:

    
movies.head()









    Out[108]:






  
    
      
      movieId
      title
      genres
      year
    
  
  
    
      0
      1
      Toy Story (1995)
      Adventure|Animation|Children|Comedy|Fantasy
      1995
    
    
      1
      2
      Jumanji (1995)
      Adventure|Children|Fantasy
      1995
    
    
      2
      3
      Grumpier Old Men (1995)
      Comedy|Romance
      1995
    
    
      3
      4
      Waiting to Exhale (1995)
      Comedy|Drama|Romance
      1995
    
    
      4
      5
      Father of the Bride Part II (1995)
      Comedy
      1995

Split 'genres' into multiple columns



In [109]:

    
movie_genres = movies['genres'].str.split('|', expand=True)



In [110]:

    
movie_genres[:10]









    Out[110]:






  
    
      
      0
      1
      2
      3
      4
      5
      6
      7
      8
      9
    
  
  
    
      0
      Adventure
      Animation
      Children
      Comedy
      Fantasy
      None
      None
      None
      None
      None
    
    
      1
      Adventure
      Children
      Fantasy
      None
      None
      None
      None
      None
      None
      None
    
    
      2
      Comedy
      Romance
      None
      None
      None
      None
      None
      None
      None
      None
    
    
      3
      Comedy
      Drama
      Romance
      None
      None
      None
      None
      None
      None
      None
    
    
      4
      Comedy
      None
      None
      None
      None
      None
      None
      None
      None
      None
    
    
      5
      Action
      Crime
      Thriller
      None
      None
      None
      None
      None
      None
      None
    
    
      6
      Comedy
      Romance
      None
      None
      None
      None
      None
      None
      None
      None
    
    
      7
      Adventure
      Children
      None
      None
      None
      None
      None
      None
      None
      None
    
    
      8
      Action
      None
      None
      None
      None
      None
      None
      None
      None
      None
    
    
      9
      Action
      Adventure
      Thriller
      None
      None
      None
      None
      None
      None
      None

Add a new column for comedy genre flag



In [98]:

    
movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')



In [111]:

    
movie_genres[:10]









    Out[111]:






  
    
      
      0
      1
      2
      3
      4
      5
      6
      7
      8
      9
    
  
  
    
      0
      Adventure
      Animation
      Children
      Comedy
      Fantasy
      None
      None
      None
      None
      None
    
    
      1
      Adventure
      Children
      Fantasy
      None
      None
      None
      None
      None
      None
      None
    
    
      2
      Comedy
      Romance
      None
      None
      None
      None
      None
      None
      None
      None
    
    
      3
      Comedy
      Drama
      Romance
      None
      None
      None
      None
      None
      None
      None
    
    
      4
      Comedy
      None
      None
      None
      None
      None
      None
      None
      None
      None
    
    
      5
      Action
      Crime
      Thriller
      None
      None
      None
      None
      None
      None
      None
    
    
      6
      Comedy
      Romance
      None
      None
      None
      None
      None
      None
      None
      None
    
    
      7
      Adventure
      Children
      None
      None
      None
      None
      None
      None
      None
      None
    
    
      8
      Action
      None
      None
      None
      None
      None
      None
      None
      None
      None
    
    
      9
      Action
      Adventure
      Thriller
      None
      None
      None
      None
      None
      None
      None

Extract year from title e.g. (1995)



In [112]:

    
movies['year'] = movies['title'].str.extract('.*\((.*)\).*', expand=True)



In [113]:

    
movies.tail()









    Out[113]:






  
    
      
      movieId
      title
      genres
      year
    
  
  
    
      27273
      131254
      Kein Bund für's Leben (2007)
      Comedy
      2007
    
    
      27274
      131256
      Feuer, Eis & Dosenbier (2002)
      Comedy
      2002
    
    
      27275
      131258
      The Pirates (2014)
      Adventure
      2014
    
    
      27276
      131260
      Rentun Ruusu (2001)
      (no genres listed)
      2001
    
    
      27277
      131262
      Innocence (2014)
      Adventure|Fantasy|Horror
      2014

More here: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods

Parsing Timestamps

Timestamps are common in sensor data or other time series datasets. Let us revisit the tags.csv dataset and read the timestamps!



In [115]:

    
tags = pd.read_csv(path+r'tags.csv', sep=',')



In [116]:

    
tags.dtypes









    Out[116]:





userId        int64
movieId       int64
tag          object
timestamp     int64
dtype: object

Unix time / POSIX time / epoch time records time in seconds
since midnight Coordinated Universal Time (UTC) of January 1, 1970



In [117]:

    
tags.head(5)









    Out[117]:






  
    
      
      userId
      movieId
      tag
      timestamp
    
  
  
    
      0
      18
      4141
      Mark Waters
      1240597180
    
    
      1
      65
      208
      dark hero
      1368150078
    
    
      2
      65
      353
      dark hero
      1368150079
    
    
      3
      65
      521
      noir thriller
      1368149983
    
    
      4
      65
      592
      dark hero
      1368150078



In [118]:

    
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')

Data Type datetime64[ns] maps to either M8[ns] depending on the hardware



In [119]:

    
tags['parsed_time'].dtype









    Out[119]:





dtype('<M8[ns]')



In [120]:

    
tags.head(2)









    Out[120]:






  
    
      
      userId
      movieId
      tag
      timestamp
      parsed_time
    
  
  
    
      0
      18
      4141
      Mark Waters
      1240597180
      2009-04-24 18:19:40
    
    
      1
      65
      208
      dark hero
      1368150078
      2013-05-10 01:41:18

Selecting rows based on timestamps



In [121]:

    
greater_than_t = tags['parsed_time'] > '2015-02-01'

selected_rows = tags[greater_than_t]

tags.shape, selected_rows.shape









    Out[121]:





((465564, 5), (12130, 5))

Sorting the table using the timestamps



In [122]:

    
tags.sort_values(by='parsed_time', ascending=True)[:10]









    Out[122]:






  
    
      
      userId
      movieId
      tag
      timestamp
      parsed_time
    
  
  
    
      333932
      100371
      2788
      monty python
      1135429210
      2005-12-24 13:00:10
    
    
      333927
      100371
      1732
      coen brothers
      1135429236
      2005-12-24 13:00:36
    
    
      333924
      100371
      1206
      stanley kubrick
      1135429248
      2005-12-24 13:00:48
    
    
      333923
      100371
      1193
      jack nicholson
      1135429371
      2005-12-24 13:02:51
    
    
      333939
      100371
      5004
      peter sellers
      1135429399
      2005-12-24 13:03:19
    
    
      333922
      100371
      47
      morgan freeman
      1135429412
      2005-12-24 13:03:32
    
    
      333921
      100371
      47
      brad pitt
      1135429412
      2005-12-24 13:03:32
    
    
      333936
      100371
      4011
      brad pitt
      1135429431
      2005-12-24 13:03:51
    
    
      333937
      100371
      4011
      guy ritchie
      1135429431
      2005-12-24 13:03:51
    
    
      333920
      100371
      32
      bruce willis
      1135429442
      2005-12-24 13:04:02

Average Movie Ratings over Time



In [123]:

    
average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.tail()



In [124]:

    
joined = movies.merge(average_rating, on='movieId', how='inner')
joined.head()
joined.corr()









    Out[124]:






  
    
      
      movieId
      rating
    
  
  
    
      movieId
      1.000000
      -0.090369
    
    
      rating
      -0.090369
      1.000000



In [125]:

    
yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean()
yearly_average[:10]



In [126]:

    
yearly_average[-20:].plot(x='year', y='rating', figsize=(15,10), grid=True)









    Out[126]:





<matplotlib.axes._subplots.AxesSubplot at 0x119b8bf60>

Do some years look better for the boxoffice movies than others?

Does any data point seem like an outlier in some sense?

	userId	movieId	rating
20000253	138493	60816	4.5
20000254	138493	61160	4.0
20000255	138493	65682	4.5
20000256	138493	66762	4.5
20000257	138493	68319	4.5
20000258	138493	68954	4.5
20000259	138493	69526	4.5
20000260	138493	69644	3.0
20000261	138493	70286	5.0
20000262	138493	71619	2.5

	rating
movieId
1	49695
2	22243
3	12735
4	2756
5	12161

	year	rating
0	1891	3.000000
1	1893	3.375000
2	1894	3.071429
3	1895	3.125000
4	1896	3.183036
5	1898	3.850000
6	1899	3.625000
7	1900	3.166667
8	1901	5.000000
9	1902	3.738189

	one	two
apple	100.0	111.0
ball	200.0	222.0
cerill	NaN	333.0
clock	300.0	NaN
dancy	NaN	4444.0

	one	flag
apple	100.0	False
ball	200.0	False
cerill	NaN	False
clock	300.0	True
dancy	NaN	False

	movieId	title	genres
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
1	2	Jumanji (1995)	Adventure\|Children\|Fantasy
2	3	Grumpier Old Men (1995)	Comedy\|Romance
3	4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance
4	5	Father of the Bride Part II (1995)	Comedy
5	6	Heat (1995)	Action\|Crime\|Thriller
6	7	Sabrina (1995)	Comedy\|Romance
7	8	Tom and Huck (1995)	Adventure\|Children
8	9	Sudden Death (1995)	Action
9	10	GoldenEye (1995)	Action\|Adventure\|Thriller
10	11	American President, The (1995)	Comedy\|Drama\|Romance
11	12	Dracula: Dead and Loving It (1995)	Comedy\|Horror
12	13	Balto (1995)	Adventure\|Animation\|Children
13	14	Nixon (1995)	Drama
14	15	Cutthroat Island (1995)	Action\|Adventure\|Romance

	userId	movieId	tag	timestamp
0	18	4141	Mark Waters	1240597180
1	65	208	dark hero	1368150078
2	65	353	dark hero	1368150079
3	65	521	noir thriller	1368149983
4	65	592	dark hero	1368150078

	userId	movieId	rating	timestamp
0	1	2	3.5	1112486027
1	1	29	3.5	1112484676
2	1	32	3.5	1112484819
3	1	47	3.5	1112484727
4	1	50	3.5	1112484580

	userId	movieId	tag
0	18	4141	Mark Waters
11	65	1783	noir thriller
2000	910	68554	conspiracy theory

	userId	movieId	rating
count	2.000026e+07	2.000026e+07	2.000026e+07
mean	6.904587e+04	9.041567e+03	3.525529e+00
std	4.003863e+04	1.978948e+04	1.051989e+00
min	1.000000e+00	1.000000e+00	5.000000e-01
25%	3.439500e+04	9.020000e+02	3.000000e+00
50%	6.914100e+04	2.167000e+03	3.500000e+00
75%	1.036370e+05	4.770000e+03	4.000000e+00
max	1.384930e+05	1.312620e+05	5.000000e+00

	userId	movieId	rating
userId	1.000000	-0.000850	0.001175
movieId	-0.000850	1.000000	0.002606
rating	0.001175	0.002606	1.000000

	userId	movieId	rating
68	1	2021	4.0
69	1	2100	4.0
70	1	2118	4.0
71	1	2138	4.0
72	1	2140	4.0
73	1	2143	4.0
74	1	2173	4.0
75	1	2174	4.0
76	1	2193	4.0
79	1	2288	4.0
80	1	2291	4.0
81	1	2542	4.0
82	1	2628	4.0
90	1	2762	4.0
92	1	2872	4.0
94	1	2944	4.0
96	1	2959	4.0
97	1	2968	4.0
101	1	3081	4.0
102	1	3153	4.0

	movieId	title	genres
310	313	Swan Princess, The (1994)	Animation\|Children
360	364	Lion King, The (1994)	Adventure\|Animation\|Children\|Drama\|Musical\|IMAX
388	392	Secret Adventures of Tom Thumb, The (1993)	Adventure\|Animation
547	551	Nightmare Before Christmas, The (1993)	Animation\|Children\|Fantasy\|Musical
553	558	Pagemaster, The (1994)	Action\|Adventure\|Animation\|Children\|Fantasy
582	588	Aladdin (1992)	Adventure\|Animation\|Children\|Comedy\|Musical
588	594	Snow White and the Seven Dwarfs (1937)	Animation\|Children\|Drama\|Fantasy\|Musical
589	595	Beauty and the Beast (1991)	Animation\|Children\|Fantasy\|Musical\|Romance\|IMAX
590	596	Pinocchio (1940)	Animation\|Children\|Fantasy\|Musical
604	610	Heavy Metal (1981)	Action\|Adventure\|Animation\|Horror\|Sci-Fi

	movieId
rating
0.5	239125
1.0	680732
1.5	279252
2.0	1430997
2.5	883398
3.0	4291193
3.5	2200156
4.0	5561926
4.5	1534824
5.0	2898660

	movieId	title	genres	userId	tag
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	1644	Watched
1	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	1741	computer animation
2	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	1741	Disney animated feature
3	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	1741	Pixar animation
4	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	1741	TÃ©a Leoni does not star in this movie

	movieId	title	genres	year	rating
26739	131254	Kein Bund für's Leben (2007)	Comedy	2007	4.0
26740	131256	Feuer, Eis & Dosenbier (2002)	Comedy	2002	4.0
26741	131258	The Pirates (2014)	Adventure	2014	2.5
26742	131260	Rentun Ruusu (2001)	(no genres listed)	2001	3.0
26743	131262	Innocence (2014)	Adventure\|Fantasy\|Horror	2014	4.0

	movieId	title	genres	year	rating
26737	131250	No More School (2000)	Comedy	2000	4.0
26738	131252	Forklift Driver Klaus: The First Day on the Jo...	Comedy\|Horror	2001	4.0
26739	131254	Kein Bund für's Leben (2007)	Comedy	2007	4.0
26740	131256	Feuer, Eis & Dosenbier (2002)	Comedy	2002	4.0
26743	131262	Innocence (2014)	Adventure\|Fantasy\|Horror	2014	4.0

	movieId	title	genres	year	rating
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	1995	3.921240
2	3	Grumpier Old Men (1995)	Comedy\|Romance	1995	3.151040
3	4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance	1995	2.861393
4	5	Father of the Bride Part II (1995)	Comedy	1995	3.064592
6	7	Sabrina (1995)	Comedy\|Romance	1995	3.366484

	movieId	title	genres	year	rating
26736	131248	Brother Bear 2 (2006)	Adventure\|Animation\|Children\|Comedy\|Fantasy	2006	4.0
26737	131250	No More School (2000)	Comedy	2000	4.0
26738	131252	Forklift Driver Klaus: The First Day on the Jo...	Comedy\|Horror	2001	4.0
26739	131254	Kein Bund für's Leben (2007)	Comedy	2007	4.0
26740	131256	Feuer, Eis & Dosenbier (2002)	Comedy	2002	4.0

	0	1	2	3	4	5	6	7	8	9
0	Adventure	Animation	Children	Comedy	Fantasy	None	None	None	None	None
1	Adventure	Children	Fantasy	None	None	None	None	None	None	None
2	Comedy	Romance	None	None	None	None	None	None	None	None
3	Comedy	Drama	Romance	None	None	None	None	None	None	None
4	Comedy	None	None	None	None	None	None	None	None	None
5	Action	Crime	Thriller	None	None	None	None	None	None	None
6	Comedy	Romance	None	None	None	None	None	None	None	None
7	Adventure	Children	None	None	None	None	None	None	None	None
8	Action	None	None	None	None	None	None	None	None	None
9	Action	Adventure	Thriller	None	None	None	None	None	None	None

	movieId	title	genres	year
27273	131254	Kein Bund für's Leben (2007)	Comedy	2007
27274	131256	Feuer, Eis & Dosenbier (2002)	Comedy	2002
27275	131258	The Pirates (2014)	Adventure	2014
27276	131260	Rentun Ruusu (2001)	(no genres listed)	2001
27277	131262	Innocence (2014)	Adventure\|Fantasy\|Horror	2014

	userId	movieId	tag	timestamp	parsed_time
333932	100371	2788	monty python	1135429210	2005-12-24 13:00:10
333927	100371	1732	coen brothers	1135429236	2005-12-24 13:00:36
333924	100371	1206	stanley kubrick	1135429248	2005-12-24 13:00:48
333923	100371	1193	jack nicholson	1135429371	2005-12-24 13:02:51
333939	100371	5004	peter sellers	1135429399	2005-12-24 13:03:19
333922	100371	47	morgan freeman	1135429412	2005-12-24 13:03:32
333921	100371	47	brad pitt	1135429412	2005-12-24 13:03:32
333936	100371	4011	brad pitt	1135429431	2005-12-24 13:03:51
333937	100371	4011	guy ritchie	1135429431	2005-12-24 13:03:51
333920	100371	32	bruce willis	1135429442	2005-12-24 13:04:02