In [4]:
%matplotlib inline

In [5]:
import datetime

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
!unzip ml-100k.zip


Archive:  ml-100k.zip
   creating: ml-100k/
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base         
  inflating: ml-100k/u3.test         
  inflating: ml-100k/u4.base         
  inflating: ml-100k/u4.test         
  inflating: ml-100k/u5.base         
  inflating: ml-100k/u5.test         
  inflating: ml-100k/ua.base         
  inflating: ml-100k/ua.test         
  inflating: ml-100k/ub.base         
  inflating: ml-100k/ub.test         

In [6]:
%load ml-100k/README

SUMMARY & USAGE LICENSE

MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota.

This data set consists of:

* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. This data has been cleaned up - users who had less than 20 ratings or did not have complete demographic information were removed from this data set. Detailed descriptions of the data file can be found at the end of this file.

Neither the University of Minnesota nor any of the researchers involved can guarantee the correctness of the data, its suitability for any particular purpose, or the validity of results based on the use of the data set. The data set may be used for any research purposes under the following conditions:

 * The user may not state or imply any endorsement from the
   University of Minnesota or the GroupLens Research Group.

 * The user must acknowledge the use of the data set in
   publications resulting from the use of the data set, and must
   send us an electronic or paper copy of those publications.

 * The user may not redistribute the data without separate
   permission.

 * The user may not use this information for any commercial or
   revenue-bearing purposes without first obtaining permission
   from a faculty member of the GroupLens Research Project at the
   University of Minnesota.

If you have any further questions or comments, please contact Jon Herlocker herlocke@cs.umn.edu.

ACKNOWLEDGEMENTS

Thanks to Al Borchers for cleaning up this data and writing the accompanying scripts.

PUBLISHED WORK THAT HAS USED THIS DATASET

Herlocker, J., Konstan, J., Borchers, A., Riedl, J.. An Algorithmic Framework for Performing Collaborative Filtering. Proceedings of the 1999 Conference on Research and Development in Information Retrieval. Aug. 1999.

FURTHER INFORMATION ABOUT THE GROUPLENS RESEARCH PROJECT

The GroupLens Research Project is a research group in the Department of Computer Science and Engineering at the University of Minnesota. Members of the GroupLens Research Project are involved in many research projects related to the fields of information filtering, collaborative filtering, and recommender systems. The project is lead by professors John Riedl and Joseph Konstan. The project began to explore automated collaborative filtering in 1992, but is most well known for its world wide trial of an automated collaborative filtering system for Usenet news in 1996. The technology developed in the Usenet trial formed the base for the formation of Net Perceptions, Inc., which was founded by members of GroupLens Research. Since then the project has expanded its scope to research overall information filtering solutions, integrating in content-based methods as well as improving current collaborative filtering technology.

Further information on the GroupLens Research project, including research publications, can be found at the following web site:

    http://www.grouplens.org/

GroupLens Research currently operates a movie recommender based on collaborative filtering:

    http://www.movielens.org/

DETAILED DESCRIPTIONS OF DATA FILES

Here are brief descriptions of the data.

ml-data.tar.gz -- Compressed tar file. To rebuild the u data files do this:

    gunzip ml-data.tar.gz
    tar xvf ml-data.tar
    mku.sh

u.data -- The full u data set, 100000 ratings by 943 users on 1682 items. Each user has rated at least 20 movies. Users and items are numbered consecutively from 1. The data is randomly ordered. This is a tab separated list of user id | item id | rating | timestamp. The time stamps are unix seconds since 1/1/1970 UTC

u.info -- The number of users, items, and ratings in the u data set.

u.item -- Information about the items (movies); this is a tab separated list of

    movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |

The last 19 fields are the genres, a 1 indicates the movie is of that genre, a 0 indicates it is not; movies can be in several genres at once. The movie ids are the ones used in the u.data data set.

u.genre -- A list of the genres.

u.user -- Demographic information about the users; this is a tab separated list of

    user id | age | gender | occupation | zip code

The user ids are the ones used in the u.data data set.

u.occupation -- A list of the occupations.

u1.base, u1.test, u2.base, u2.test, u3.base, u3.test, u4.base, u4.test, u5.base, u5.test -- The data sets u1.base and u1.test through u5.base and u5.test are 80%/20% splits of the u data into training and test data. Each of u1, ..., u5 have disjoint test sets; this if for 5 fold cross validation (where you repeat your experiment with each training and test set and average the results). These data sets can be generated from u.data by mku.sh.

ua.base, ua.test, ub.base, ub.test -- The data sets ua.base, ua.test, ub.base, and ub.test split the u data into a training set and a test set with exactly 10 ratings per user in the test set. The sets ua.test and ub.test are disjoint. These data sets can be generated from u.data by mku.sh.

allbut.pl -- The script that generates training and test sets where all but n of a users ratings are in the training data.

mku.sh -- A shell script to generate all the u data sets from u.data.


In [8]:
ls ml-100k/


README        u.genre       u.user        u2.test       u4.test       ua.test
allbut.pl*    u.info        u1.base       u3.base       u5.base       ub.base
mku.sh*       u.item        u1.test       u3.test       u5.test       ub.test
u.data        u.occupation  u2.base       u4.base       ua.base

In [9]:
!head -n 2 ml-100k/u.data


196	242	3	881250949
186	302	3	891717742

In [13]:
# user id | item id | rating | timestamp
data_cols = ['user_id', 'item_id', 'rating', 'timestamp']

ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=data_cols)
ratings.head()


Out[13]:
user_id item_id rating timestamp
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596

Selecting Data

Select rows by label with .loc[] or by position with .iloc[].


In [30]:
ratings.loc[0]


Out[30]:
user_id            196
item_id            242
rating               3
timestamp    881250949
Name: 0, dtype: int64
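With the default integer index, labels and positions coincide, so the difference between .loc and .iloc is easy to miss. A minimal sketch on a toy frame (hypothetical data, not taken from u.data) whose index labels differ from the row positions:

```python
import pandas as pd

# Toy frame (hypothetical data) whose index labels differ from row positions
toy = pd.DataFrame({'rating': [3, 5, 1]}, index=[10, 20, 30])

by_label = toy.loc[20, 'rating']     # the row labelled 20
by_position = toy.iloc[0, 0]         # the first row, whatever its label

# Label slices include both endpoints; positional slices exclude the stop
label_slice = toy.loc[10:20]
position_slice = toy.iloc[0:1]

print(by_label, by_position, len(label_slice), len(position_slice))
```

The inclusive label slice is why ratings.loc[0:10] returns eleven rows rather than ten.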

Select columns with __getitem__ (square brackets).


In [28]:
ratings[['user_id', 'item_id']].head()


Out[28]:
user_id item_id
0 196 242
1 186 302
2 22 377
3 244 51
4 166 346
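What comes back depends on what goes inside the brackets. A small sketch, using a two-row toy frame built from the first rows shown above:

```python
import pandas as pd

toy = pd.DataFrame({'user_id': [196, 186], 'item_id': [242, 302]})

one_column = toy['user_id']      # a single label returns a Series
sub_frame = toy[['user_id']]     # a list of labels returns a DataFrame

print(type(one_column).__name__, type(sub_frame).__name__)
```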

In [31]:
ratings.loc[0:10, ['user_id', 'rating']]


Out[31]:
user_id rating
0 196 3
1 186 3
2 22 1
3 244 2
4 166 1
5 298 4
6 115 2
7 253 5
8 305 3
9 6 3
10 62 2

Heterogeneous Data Types


In [33]:
ratings['timestamp'] = pd.to_datetime(ratings.timestamp, unit='s')
ratings.dtypes


Out[33]:
user_id               int64
item_id               int64
rating                int64
timestamp    datetime64[ns]
dtype: object
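The conversion can be checked on the first two raw timestamps from u.data. This is a sketch of the same pd.to_datetime call on a standalone Series; the .dt accessor step is an extra illustration, not part of the original cell:

```python
import pandas as pd

# First two raw timestamps from u.data, in Unix seconds
raw = pd.Series([881250949, 891717742])
stamps = pd.to_datetime(raw, unit='s')

# Once the column is datetime64, the .dt accessor exposes its components
years = stamps.dt.year.tolist()

print(stamps.iloc[0], years)
```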

Useful Methods


In [34]:
ratings.describe()


Out[34]:
user_id item_id rating
count 100000.00000 100000.000000 100000.000000
mean 462.48475 425.530130 3.529860
std 266.61442 330.798356 1.125674
min 1.00000 1.000000 1.000000
25% 254.00000 175.000000 3.000000
50% 447.00000 322.000000 4.000000
75% 682.00000 631.000000 4.000000
max 943.00000 1682.000000 5.000000

In [37]:
# ids of the ten most-rated movies
topids = ratings.item_id.value_counts().head(10).index
topids


Out[37]:
Int64Index([50, 258, 100, 181, 294, 286, 288, 1, 300, 121], dtype='int64')

In [40]:
top_movies = ratings[ratings.item_id.isin(topids)]
top_movies.head()


Out[40]:
user_id item_id rating timestamp
24 308 1 4 1998-02-17 17:28:52
50 251 100 4 1998-01-31 18:38:04
53 25 181 5 1998-01-26 22:23:35
61 20 288 1 1997-11-16 08:06:24
100 32 294 3 1998-01-02 02:57:43

In [44]:
top_movies.describe()


Out[44]:
user_id item_id rating
count 4863.000000 4863.000000 4863.000000
mean 470.011310 185.589143 3.772980
std 269.422128 106.855218 1.078888
min 1.000000 1.000000 1.000000
25% 246.000000 100.000000 3.000000
50% 466.000000 181.000000 4.000000
75% 701.000000 288.000000 5.000000
max 943.000000 300.000000 5.000000

In [46]:
top_movies.rating.value_counts().sort_index().plot(kind='bar')


Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x10934fe90>

Are more reviewed movies better liked?

Use groupby:

  1. Split (by item_id)
  2. Apply (count, mean)
  3. Combine (aggregate, 1 row per item)
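The three steps can be made explicit on a toy frame (hypothetical ratings, small enough to check by hand). The dictionary comprehension is only there to show what groupby does group by group; the .agg call is the idiomatic form:

```python
import pandas as pd

toy = pd.DataFrame({'item_id': [1, 1, 2, 2, 2],
                    'rating':  [4, 2, 5, 3, 4]})

# Split by item_id, apply count and mean to each group's ratings,
# and combine into one row per item
summary = toy.groupby('item_id')['rating'].agg(['count', 'mean'])

# The same numbers assembled by hand, one group at a time
manual = {item: (len(group), group['rating'].mean())
          for item, group in toy.groupby('item_id')}

print(summary)
print(manual)
```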

In [58]:
ratings.groupby('item_id')['rating'].agg(['count', 'mean']).head()


Out[58]:
count mean
item_id
1 452 3.878319
2 131 3.206107
3 90 3.033333
4 209 3.550239
5 86 3.302326

In [73]:
ax = (ratings.groupby('item_id')['rating']
             .agg(['count', 'mean'])
             .plot(kind='scatter', x='count', y='mean'))



In [60]:
import statsmodels.api as sm

In [82]:
mat = ratings.groupby('item_id')['rating'].agg(['mean', 'count'])
mod = sm.OLS.from_formula('mean ~ count', mat)
res = mod.fit()
res.summary()


Out[82]:
OLS Regression Results
Dep. Variable: mean R-squared: 0.185
Model: OLS Adj. R-squared: 0.184
Method: Least Squares F-statistic: 380.4
Date: Wed, 12 Nov 2014 Prob (F-statistic): 1.60e-76
Time: 23:05:06 Log-Likelihood: -1800.2
No. Observations: 1682 AIC: 3604.
Df Residuals: 1680 BIC: 3615.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 2.8276 0.021 132.042 0.000 2.786 2.870
count 0.0042 0.000 19.503 0.000 0.004 0.005
Omnibus: 71.045 Durbin-Watson: 1.360
Prob(Omnibus): 0.000 Jarque-Bera (JB): 88.618
Skew: -0.444 Prob(JB): 5.71e-20
Kurtosis: 3.691 Cond. No. 124.
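
Read as a fitted line, the coefficients in the summary say mean ≈ 2.8276 + 0.0042 × count. A rough sketch of what that implies, using the rounded coefficients from the table (so the numbers are approximate); predicted_mean is a hypothetical helper, not part of statsmodels:

```python
# Rounded coefficients copied from the regression summary
INTERCEPT = 2.8276
SLOPE = 0.0042

def predicted_mean(count):
    """Fitted mean rating for a movie with `count` ratings (hypothetical helper)."""
    return INTERCEPT + SLOPE * count

# An extra 100 ratings is associated with roughly 0.42 more rating points
gain_per_100 = predicted_mean(200) - predicted_mean(100)

print(predicted_mean(100), gain_per_100)
```

With an R-squared of 0.185, the rating count accounts for under a fifth of the variation in mean rating, so the line describes an association rather than a tight prediction.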

In [87]:
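def plot_reg(res):
    """Scatter mean rating against rating count and overlay the OLS fit."""
    ax = (ratings.groupby('item_id')['rating']
             .agg(['count', 'mean'])
             .plot(kind='scatter', x='count', y='mean'))
    # Evaluate the fitted line on a grid of rating counts
    xx = np.linspace(0, 500)
    y = res.params['Intercept'] + xx * res.params['count']
    ax.plot(xx, y)
    return ax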

In [88]:
plot_reg(res)


Out[88]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d683ed0>

In [13]:
ratings.item_id.value_counts().hist(bins=80)


Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d837f28>

Bring in some other data


In [17]:
!head -n 3 ml-100k/u.genre


unknown|0
Action|1
Adventure|2

In [11]:
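# u.genre has no header row. Without header=None the first line
# ("unknown|0") is consumed as column names and genre 0 is lost.
genres = (pd.read_csv('ml-100k/u.genre', sep='|', header=None,
                      names=['genre', 'genre_id'], index_col='genre_id')
            .squeeze('columns')
            .to_dict())
genres


Out[11]:
{0: 'unknown',
 1: 'Action',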
 2: 'Adventure',
 3: 'Animation',
 4: "Children's",
 5: 'Comedy',
 6: 'Crime',
 7: 'Documentary',
 8: 'Drama',
 9: 'Fantasy',
 10: 'Film-Noir',
 11: 'Horror',
 12: 'Musical',
 13: 'Mystery',
 14: 'Romance',
 15: 'Sci-Fi',
 16: 'Thriller',
 17: 'War',
 18: 'Western'}

In [12]:
!head -n 3 ml-100k/u.item


1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0

movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |


In [93]:
# Column names copied from the README
names = "movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western"
names = names.replace(' ', '_').split('_|_')
items = pd.read_csv('ml-100k/u.item', names=names, encoding='latin1', sep='|')
items['release_date'] = pd.to_datetime(items.release_date)
items.head()


Out[93]:
movie_id movie_title release_date video_release_date IMDb_URL unknown Action Adventure Animation Children's Comedy Crime Documentary Drama Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
0 1 Toy Story (1995) 1995-01-01 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
1 2 GoldenEye (1995) 1995-01-01 NaN http://us.imdb.com/M/title-exact?GoldenEye%20(... 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
2 3 Four Rooms (1995) 1995-01-01 NaN http://us.imdb.com/M/title-exact?Four%20Rooms%... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
3 4 Get Shorty (1995) 1995-01-01 NaN http://us.imdb.com/M/title-exact?Get%20Shorty%... 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0
4 5 Copycat (1995) 1995-01-01 NaN http://us.imdb.com/M/title-exact?Copycat%20(1995) 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0

In [63]:
ratings.head()


Out[63]:
user_id item_id rating timestamp
0 196 242 3 1997-12-04 15:55:49
1 186 302 3 1998-04-04 19:22:22
2 22 377 1 1997-11-07 07:18:36
3 244 51 2 1997-11-27 05:02:03
4 166 346 1 1998-02-02 05:33:16

In [94]:
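top_raters = ratings.user_id.value_counts().head(10)
# Caution: .max() returns the largest rating count, not a user id, so the
# query below selects the user whose id equals that count (737 in the
# output shown). top_raters.idxmax() would give the most active user's id.
top_rater = top_raters.max()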
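
value_counts returns a Series whose index holds the user ids and whose values hold the counts, so .max() and .idxmax() answer different questions. A minimal sketch on hypothetical user ids:

```python
import pandas as pd

# Hypothetical user ids: user 7 rates three times, user 3 twice, user 9 once
user_ids = pd.Series([7, 3, 7, 9, 3, 7])
counts = user_ids.value_counts()

largest_count = counts.max()     # a number of ratings
most_active = counts.idxmax()    # the user id holding that count

print(largest_count, most_active)
```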

In [86]:
pd.merge(ratings.query('user_id == @top_rater'), items,
         left_on='item_id', right_on='movie_id')


Out[86]:
user_id item_id rating timestamp movie_id movie_title release_date video_release_date IMDb_URL unknown Action Adventure Animation Children's Comedy Crime Documentary Drama Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
0 737 428 4 1998-01-09 03:04:26 428 Harold and Maude (1971) 1971-01-01 NaN http://us.imdb.com/M/title-exact?Harold%20and%... 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
1 737 186 5 1998-01-09 03:02:24 186 Blues Brothers, The (1980) 1980-01-01 NaN http://us.imdb.com/M/title-exact?Blues%20Broth... 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0
2 737 89 4 1998-01-09 02:57:44 89 Blade Runner (1982) 1982-01-01 NaN http://us.imdb.com/M/title-exact?Blade%20Runne... 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0
3 737 222 3 1998-01-09 03:05:27 222 Star Trek: First Contact (1996) 1996-11-22 NaN http://us.imdb.com/M/title-exact?Star%20Trek:%... 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
4 737 47 3 1998-01-09 03:02:50 47 Ed Wood (1994) 1994-01-01 NaN http://us.imdb.com/M/title-exact?Ed%20Wood%20(... 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0
5 737 175 5 1998-01-09 03:07:26 175 Brazil (1985) 1985-01-01 NaN http://us.imdb.com/M/title-exact?Brazil%20(1985) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
6 737 180 4 1998-01-09 02:57:24 180 Apocalypse Now (1979) 1979-01-01 NaN http://us.imdb.com/M/title-exact?Apocalypse%20... 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
7 737 12 4 1998-01-09 03:02:02 12 Usual Suspects, The (1995) 1995-08-14 NaN http://us.imdb.com/M/title-exact?Usual%20Suspe... 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0
8 737 127 5 1998-01-09 03:06:15 127 Godfather, The (1972) 1972-01-01 NaN http://us.imdb.com/M/title-exact?Godfather,%20... 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0
9 737 427 3 1998-01-09 03:02:50 427 To Kill a Mockingbird (1962) 1962-01-01 NaN http://us.imdb.com/M/title-exact?To%20Kill%20a... 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
10 737 156 5 1998-01-09 02:58:13 156 Reservoir Dogs (1992) 1992-01-01 NaN http://us.imdb.com/M/title-exact?Reservoir%20D... 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0
11 737 96 2 1998-01-09 02:58:35 96 Terminator 2: Judgment Day (1991) 1991-01-01 NaN http://us.imdb.com/M/title-exact?Terminator%20... 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0
12 737 173 4 1998-01-09 03:02:50 173 Princess Bride, The (1987) 1987-01-01 NaN http://us.imdb.com/M/title-exact?Princess%20Br... 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
13 737 187 5 1998-01-09 03:06:15 187 Godfather: Part II, The (1974) 1974-01-01 NaN http://us.imdb.com/M/title-exact?Godfather:%20... 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0
14 737 58 4 1998-01-09 03:02:50 58 Quiz Show (1994) 1994-01-01 NaN http://us.imdb.com/M/title-exact?Quiz%20Show%2... 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
15 737 196 3 1998-01-09 02:58:14 196 Dead Poets Society (1989) 1989-01-01 NaN http://us.imdb.com/M/title-exact?Dead%20Poets%... 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
16 737 474 5 1998-01-09 02:59:00 474 Dr. Strangelove or: How I Learned to Stop Worr... 1963-01-01 NaN http://us.imdb.com/M/title-exact?Dr.%20Strange... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0
17 737 11 3 1998-01-09 03:01:43 11 Seven (Se7en) (1995) 1995-01-01 NaN http://us.imdb.com/M/title-exact?Se7en%20(1995) 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0
18 737 64 4 1998-01-09 02:59:00 64 Shawshank Redemption, The (1994) 1994-01-01 NaN http://us.imdb.com/M/title-exact?Shawshank%20R... 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
19 737 154 4 1998-01-09 02:58:14 154 Monty Python's Life of Brian (1979) 1979-01-01 NaN http://us.imdb.com/M/title-exact?Life%20of%20B... 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
20 737 169 4 1998-01-09 02:57:24 169 Wrong Trousers, The (1993) 1993-01-01 NaN http://us.imdb.com/M/title-exact?Wrong%20Trous... 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
21 737 192 5 1998-01-09 03:02:50 192 Raging Bull (1980) 1980-01-01 NaN http://us.imdb.com/M/title-exact?Raging%20Bull... 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
22 737 475 4 1998-01-09 02:58:13 475 Trainspotting (1996) 1996-07-19 NaN http://us.imdb.com/Title?Trainspotting+(1996) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
23 737 32 4 1998-01-09 03:03:13 32 Crumb (1994) 1994-01-01 NaN http://us.imdb.com/M/title-exact?Crumb%20(1994) 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
24 737 171 4 1998-01-09 02:57:24 171 Delicatessen (1991) 1991-01-01 NaN http://us.imdb.com/M/title-exact?Delicatessen%... 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0
25 737 258 5 1998-01-09 03:05:27 258 Contact (1997) 1997-07-11 NaN http://us.imdb.com/Title?Contact+(1997/I) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
26 737 160 4 1998-01-09 03:01:21 160 Glengarry Glen Ross (1992) 1992-01-01 NaN http://us.imdb.com/M/title-exact?Glengarry%20G... 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
27 737 137 5 1998-01-09 02:58:14 137 Big Night (1996) 1996-09-20 NaN http://us.imdb.com/M/title-exact?Big%20Night%2... 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
28 737 501 1 1998-01-09 03:02:02 501 Dumbo (1941) 1941-01-01 NaN http://us.imdb.com/M/title-exact?Dumbo%20(1941) 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0
29 737 174 2 1998-01-09 02:59:00 174 Raiders of the Lost Ark (1981) 1981-01-01 NaN http://us.imdb.com/M/title-exact?Raiders%20of%... 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
30 737 357 5 1998-01-09 03:02:24 357 One Flew Over the Cuckoo's Nest (1975) 1975-01-01 NaN http://us.imdb.com/M/title-exact?One%20Flew%20... 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
31 737 100 5 1998-01-09 02:57:44 100 Fargo (1996) 1997-02-14 NaN http://us.imdb.com/M/title-exact?Fargo%20(1996) 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0
32 737 22 4 1998-01-09 03:03:13 22 Braveheart (1995) 1996-02-16 NaN http://us.imdb.com/M/title-exact?Braveheart%20... 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0

In [13]:
!head -n 2 ml-100k/u.user


1|24|M|technician|85711
2|53|F|other|94043

In [22]:
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip']
users = pd.read_csv('ml-100k/u.user', sep='|', names=user_cols,
                    index_col='user_id')
users.head()


Out[22]:
age gender occupation zip
user_id
1 24 M technician 85711
2 53 F other 94043
3 23 M writer 32067
4 24 M technician 43537
5 33 F other 15213

In [23]:
sns.kdeplot(users.age)


Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x115ab4f28>

In [25]:
ratings.groupby('user_id')


Out[25]:
<pandas.core.groupby.DataFrameGroupBy object at 0x10e9ab780>