MovieLens

MovieLens is an online movie rating and recommendation system created by the GroupLens lab in the CS department of the University of Minnesota. What's useful for us is that they post lots of data online that we can use. We'll look at what they call their "latest" "small" dataset, which is enough to give us a sense of what's there. Note specifically the license terms (roughly: mention source, problems are your own). We got the idea from Wes McKinney's pandas book.

There are two aspects of this that take some work. The first is reading individual files from a zip posted on the internet. The easy way to do this is manually (download the file, unzip it, read it from your hard drive), but we prefer automated to easy. The second aspect is merging information from different files.

This IPython notebook was created by Dave Backus and Brian LeBlanc in Python 3.5 for the NYU Stern course Data Bootcamp.

Import packages


In [2]:
import pandas as pd             # data package
import requests, io             # internet and input tools  
import zipfile as zf            # zip file tools 
import sys                      # system module, used to get Python version 
import datetime as dt           # date tools, used to note current date  

%matplotlib inline 

print('\nPython version: ', sys.version) 
print('Pandas version: ', pd.__version__)
print('Requests version: ', requests.__version__)
print("Today's date:", dt.date.today())


Python version:  3.5.1 |Anaconda 4.0.0 (64-bit)| (default, Feb 16 2016, 09:49:46) [MSC v.1900 64 bit (AMD64)]
Pandas version:  0.18.0
Requests version:  2.9.1
Today's date: 2016-04-20

Data input

The data comes as a zip file that contains several csv's. We get the details from the README inside. (It's written in Markdown, so it's easier to read if we use a browser to format it. Or we could cut and paste into a Markdown cell in an IPython notebook, which we do at the bottom of this notebook.)

The file descriptions are:

  • ratings.csv: each line is an individual film rating with the rater and movie id's and the rating. Order: userId, movieId, rating, timestamp.
  • tags.csv: each line is a tag on a specific film. Order: userId, movieId, tag, timestamp.
  • movies.csv: each line is a movie's id, its title, and its genres. Order: movieId, title, genres. Multiple genres are separated by "pipes" |.
  • links.csv: each line contains the movie id and corresponding id's at IMDb and TMDb.
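
The pipe-separated genres can be split apart once the data is in a DataFrame. The sketch below is one possible approach, not part of the original notebook: it types in a few rows by hand in the movies.csv format, and the names movies_sample, genre_lists, and genre_dummies are made up for illustration.

```python
import pandas as pd

# A few rows typed in by hand, following the movies.csv layout
movies_sample = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance'],
})

# Option 1: a Python list of genres for each movie
genre_lists = movies_sample['genres'].str.split('|')

# Option 2: one 0/1 indicator column per genre
genre_dummies = movies_sample['genres'].str.get_dummies(sep='|')

print(genre_lists)
print(genre_dummies)
```

The indicator version is convenient if we later want to select or count movies by genre.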

The easy way to input this data is to download the zip file onto our computer, unzip it, and read the individual csv files using read_csv(). But anyone can do it the easy way. We're looking for an automated way, so that if we do this again, possibly with updated data, the whole process is in our code.

Automated data entry involves these steps:

  • Get the file. This uses the requests package, which handles internet files and comes pre-installed with Anaconda. This kind of thing was hidden behind the scenes in the Pandas read_csv() and read_excel() functions, but here we need to do it for ourselves. The package authors add:

    Recreational use of other HTTP libraries may result in dangerous side-effects, including: security vulnerabilities, verbose code, reinventing the wheel, constantly reading documentation, depression, headaches, or even death.

  • Convert to zip. Requests simply loads whatever's at the given url. The io module's io.BytesIO wraps those bytes in a file-like object, which we can then open as a zip file.
  • Unzip the file. We use the zipfile module, which is part of core Python, to extract the files inside.
  • Read in the csv's. We use read_csv as usual.

We found this Stack Overflow exchange helpful.

Digression. This is probably more than you want to know, but it's a reminder of what goes on behind the scenes when we apply read_csv to a url. Here we grab whatever is at the url. Then we take its contents (a string of bytes), wrap them in a file-like object, open that as a zip file, and read its components using read_csv. It's a lot easier when this happens automatically, but it's a useful reminder of what's involved if we ever have to look into the details.


In [13]:
# get "response" from url 
url = 'http://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
r = requests.get(url) 

print('Response type:', type(r))
print('Response .content:', type(r.content)) 
print('Response headers:\n', r.headers, sep='')


Response type: <class 'requests.models.Response'>
Response .content: <class 'bytes'>
Response headers:
{'Date': 'Sun, 21 Feb 2016 01:44:25 GMT', 'Server': 'Apache/2.2.22 (Ubuntu)', 'Content-Length': '1040425', 'Accept-Ranges': 'bytes', 'Keep-Alive': 'timeout=5, max=100', 'ETag': '"80552-fe029-5291222b37ae7"', 'Connection': 'Keep-Alive', 'Content-Type': 'application/zip', 'Last-Modified': 'Mon, 11 Jan 2016 17:19:11 GMT'}

In [14]:
# convert bytes to zip file  
mlz = zf.ZipFile(io.BytesIO(r.content))   
print('Type of zipfile object:', type(mlz))


Type of zipfile object: <class 'zipfile.ZipFile'>

In [15]:
# what's in the zip file?
mlz.namelist()


Out[15]:
['ml-latest-small/',
 'ml-latest-small/links.csv',
 'ml-latest-small/movies.csv',
 'ml-latest-small/ratings.csv',
 'ml-latest-small/README.txt',
 'ml-latest-small/tags.csv']

In [16]:
# extract and read as csv's
movies  = pd.read_csv(mlz.open(mlz.namelist()[2]))
ratings = pd.read_csv(mlz.open(mlz.namelist()[3]))
tags    = pd.read_csv(mlz.open(mlz.namelist()[5]))
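
Picking files by their position in namelist() works here, but it would break if the order inside the zip ever changed. A possible alternative is to look files up by name. The sketch below is an assumption-laden illustration rather than the notebook's method: read_member is a hypothetical helper, and the tiny zip is built in memory with typed-in rows so the example runs without a download.

```python
import io
import zipfile as zf
import pandas as pd

# Build a tiny stand-in zip in memory so the sketch runs offline.
# With the real data we would use mlz = zf.ZipFile(io.BytesIO(r.content)).
buffer = io.BytesIO()
with zf.ZipFile(buffer, 'w') as z:
    z.writestr('ml-latest-small/movies.csv',
               'movieId,title,genres\n1,Toy Story (1995),Adventure|Animation\n')
    z.writestr('ml-latest-small/ratings.csv',
               'userId,movieId,rating,timestamp\n1,16,4.0,1217897793\n')
demo_zip = zf.ZipFile(io.BytesIO(buffer.getvalue()))

def read_member(zipped, filename):
    """Read the single csv in the zip whose path ends with /filename."""
    matches = [name for name in zipped.namelist()
               if name.endswith('/' + filename)]
    if len(matches) != 1:
        raise ValueError('expected one match for ' + filename +
                         ', found ' + str(len(matches)))
    with zipped.open(matches[0]) as f:
        return pd.read_csv(f)

movies_demo  = read_member(demo_zip, 'movies.csv')
ratings_demo = read_member(demo_zip, 'ratings.csv')
print(movies_demo)
print(ratings_demo)
```

The helper raises an error if a file is missing or ambiguous, which is easier to diagnose than silently reading the wrong file.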

In [7]:
# what do we have? 
for df in [movies, ratings, tags]:
    print('\nType:', type(df))
    print('Dimensions:', df.shape)
    print('Variables:', list(df))
    print('Head:', df.head())


Type: <class 'pandas.core.frame.DataFrame'>
Dimensions: (10329, 3)
Variables: ['movieId', 'title', 'genres']
Head:    movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  

Type: <class 'pandas.core.frame.DataFrame'>
Dimensions: (105339, 4)
Variables: ['userId', 'movieId', 'rating', 'timestamp']
Head:    userId  movieId  rating   timestamp
0       1       16     4.0  1217897793
1       1       24     1.5  1217895807
2       1       32     4.0  1217896246
3       1       47     4.0  1217896556
4       1       50     4.0  1217896523

Type: <class 'pandas.core.frame.DataFrame'>
Dimensions: (6138, 4)
Variables: ['userId', 'movieId', 'tag', 'timestamp']
Head:    userId  movieId             tag   timestamp
0      12       16        20060407  1144396544
1      12       16  robert de niro  1144396554
2      12       16        scorcese  1144396564
3      17    64116    movie to see  1234720092
4      21      260          action  1428011080
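
The timestamp columns are hard to read as they stand. The README at the bottom of this notebook says they count seconds since midnight UTC on January 1, 1970, so one way to make them readable is to convert them to dates. This is a sketch on a few rows copied from the ratings output above; ratings_sample and the new date column are names made up for illustration.

```python
import pandas as pd

# First rows of ratings, copied from the output above
ratings_sample = pd.DataFrame({
    'userId': [1, 1, 1],
    'movieId': [16, 24, 32],
    'rating': [4.0, 1.5, 4.0],
    'timestamp': [1217897793, 1217895807, 1217896246],
})

# unit='s' tells pandas the integers count seconds since 1970-01-01 UTC
ratings_sample['date'] = pd.to_datetime(ratings_sample['timestamp'], unit='s')
print(ratings_sample[['timestamp', 'date']])
```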

Merge movie names
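
Ratings identify movies only by movieId, so to see titles next to ratings we need to combine two files. One possible approach, sketched here under stated assumptions, is a left merge on movieId. The movie rows are copied from the head of movies above, but the ratings rows are invented for illustration (they are not from ratings.csv), and the names movies_sample, ratings_sample, and combo are hypothetical.

```python
import pandas as pd

# Movie rows copied from the head of movies above
movies_sample = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
})

# Hypothetical ratings, invented for illustration (not rows from ratings.csv)
ratings_sample = pd.DataFrame({
    'userId': [101, 101, 102],
    'movieId': [1, 3, 1],
    'rating': [4.0, 2.5, 5.0],
})

# Left merge: keep every rating, attach the title whose movieId matches
combo = pd.merge(ratings_sample, movies_sample, how='left', on='movieId')
print(combo)
```

With how='left', every rating survives the merge even if its movieId has no match in movies; the title is simply missing in that case.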



Open Toy Story's IMDb page...
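
The README below gives Toy Story's IMDb link as http://www.imdb.com/title/tt0114709/, where the digits are the imdbId from links.csv. A rough sketch of building that url from an id follows. The function imdb_url is hypothetical, and the zero-padding to seven digits is an assumption based on that single example (read_csv may read the id as the integer 114709, dropping the leading zero).

```python
import webbrowser   # standard library; only needed for the optional last line

def imdb_url(imdb_id):
    """Build an IMDb title url from an id such as 114709 or '0114709'."""
    # Assumption: ids are padded with zeros to seven digits, as in tt0114709
    return 'http://www.imdb.com/title/tt' + str(imdb_id).zfill(7) + '/'

toy_story_url = imdb_url(114709)
print(toy_story_url)

# Uncomment to open the page in the default browser
# webbrowser.open(toy_story_url)
```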


The README.txt from the MovieLens zip file


Summary

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 105339 ratings and 6138 tag applications across 10329 movies. These data were created by 668 users between April 03, 1996 and January 09, 2016. This dataset was generated on January 11, 2016.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in four files, links.csv, movies.csv, ratings.csv and tags.csv. More details about the contents and use of all these files follows.

This is a development dataset. As such, it may change over time and is not an appropriate dataset for shared research results. See available benchmark datasets if that is your intent.

This and other GroupLens data sets are publicly available for download at http://grouplens.org/datasets/.

Usage License

Neither the University of Minnesota nor any of the researchers involved can guarantee the correctness of the data, its suitability for any particular purpose, or the validity of results based on the use of the data set. The data set may be used for any research purposes under the following conditions:

  • The user may not state or imply any endorsement from the University of Minnesota or the GroupLens Research Group.
  • The user must acknowledge the use of the data set in publications resulting from the use of the data set, and must send us an electronic or paper copy of those publications.
  • The user may redistribute the data set, including transformations, so long as it is distributed under these same license conditions.
  • The user may not use this information for any commercial or revenue-bearing purposes without first obtaining permission from a faculty member of the GroupLens Research Project at the University of Minnesota.
  • The executable software scripts are provided "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The entire risk as to the quality and performance of them is with you. Should the program prove defective, you assume the cost of all necessary servicing, repair or correction.

In no event shall the University of Minnesota, its affiliates or employees be liable to you for any damages arising out of the use or inability to use these programs (including but not limited to loss of data or data being rendered inaccurate).

If you have any further questions or comments, please email grouplens-info@cs.umn.edu

Further Information About GroupLens

GroupLens is a research group in the Department of Computer Science and Engineering at the University of Minnesota. Since its inception in 1992, GroupLens's research projects have explored a variety of fields including:

  • recommender systems
  • online communities
  • mobile and ubiquitous technologies
  • digital libraries
  • local geographic information systems

GroupLens Research operates a movie recommender based on collaborative filtering, MovieLens, which is the source of these data. We encourage you to visit http://movielens.org to try it out! If you have exciting ideas for experimental work to conduct on MovieLens, send us an email at grouplens-info@cs.umn.edu - we are always interested in working with external collaborators.

Content and Use of Files

Formatting and Encoding

The dataset files are written as comma-separated values files with a single header row. Columns that contain commas (,) are escaped using double-quotes ("). These files are encoded as UTF-8. If accented characters in movie titles or tag values (e.g. Misérables, Les (1995)) display incorrectly, make sure that any program reading the data, such as a text editor, terminal, or script, is configured for UTF-8.

User Ids

MovieLens users were selected at random for inclusion. Their ids have been anonymized. User ids are consistent between ratings.csv and tags.csv (i.e., the same id refers to the same user across the two files).

Movie Ids

Only movies with at least one rating or tag are included in the dataset. These movie ids are consistent with those used on the MovieLens web site (e.g., id 1 corresponds to the URL https://movielens.org/movies/1). Movie ids are consistent between ratings.csv, tags.csv, movies.csv, and links.csv (i.e., the same id refers to the same movie across these four data files).

Ratings Data File Structure (ratings.csv)

All ratings are contained in the file ratings.csv. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

userId,movieId,rating,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

Tags Data File Structure (tags.csv)

All tags are contained in the file tags.csv. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:

userId,movieId,tag,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

Movies Data File Structure (movies.csv)

Movie information is contained in the file movies.csv. Each line of this file after the header row represents one movie, and has the following format:

movieId,title,genres

Movie titles are entered manually or imported from https://www.themoviedb.org/, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.

Genres are a pipe-separated list, and are selected from the following:

  • Action
  • Adventure
  • Animation
  • Children's
  • Comedy
  • Crime
  • Documentary
  • Drama
  • Fantasy
  • Film-Noir
  • Horror
  • Musical
  • Mystery
  • Romance
  • Sci-Fi
  • Thriller
  • War
  • Western
  • (no genres listed)

Links Data File Structure (links.csv)

Identifiers that can be used to link to other sources of movie data are contained in the file links.csv. Each line of this file after the header row represents one movie, and has the following format:

movieId,imdbId,tmdbId

movieId is an identifier for movies used by https://movielens.org. E.g., the movie Toy Story has the link https://movielens.org/movies/1.

imdbId is an identifier for movies used by http://www.imdb.com. E.g., the movie Toy Story has the link http://www.imdb.com/title/tt0114709/.

tmdbId is an identifier for movies used by https://www.themoviedb.org. E.g., the movie Toy Story has the link https://www.themoviedb.org/movie/862.

Use of the resources listed above is subject to the terms of each provider.

Cross-Validation

Prior versions of the MovieLens dataset included either pre-computed cross-folds or scripts to perform this computation. We no longer bundle either of these features with the dataset, since most modern toolkits provide this as a built-in feature. If you wish to learn about standard approaches to cross-fold computation in the context of recommender systems evaluation, see LensKit for tools, documentation, and open-source code examples.

