3 Important steps before starting anything

3.1 General information

3.1.1 Submission

3.1.2 Tips

3.1.3 Credits

4.0.1 downloading the dataset

5 Loading the dataset

6 Normalizing the dataset

6.1 the .dt and the .str namespaces

6.1.1 .dt namespace

6.1.2 .str namespace

6.2 Let's extract the release year of a movie into a separate

6.2.1 the most common years are

6.2.2 video_release_date is always NaT (not a time), let's drop it

7 Basic analysis of the dataset

8 Basic queries

9 Which movies were released in 1956?

10 How many movies were released in the 80s?

11 When were the Die Hard movies released?

11.1 How many movies are both action and romance? What about action or romance?

12 Problem Set 1: simple queries

13 Q1.1 How many action movies were release before 1985 and in 1985 or later? (2 points)

14 Q1.2 Are there thrillers for children? Find an example. (2 points)

15 Q1.3 How many movies have title longer than 40 characters? (3 points)

16 pd.Series.apply: running arbitrary functions on each element

17 Q1.4* How many content words does the average title have? Which title has the most/least words?

17.1 Step 1. Count words in the title (2 points)

17.2 Step 2. Shortest and longest titles by word count (3 points)

18 Q1.5* How many movies have the word 'and' in their title? Write a function that counts movies with a particular word. (3 points)

19 Groupby and visualization

19.0.1 we can groupby on multiple columns

19.0.2 we can also group on arbitrary conditions

20 Problem Set 2: Groupby and visualization

21 Q2.1 Write a function that takes a genre and groups movies of that genre by year. Do not include movies older than 1985. (3 points)

22 Q2.2 Plot the number of adventure movies from 1985 to 1999 on a bar chart. Use your group_genre_by_year function. (2 points)

23 Q2.3 Plot the distribution of release day (day of month) on a pie chart.

23.1 Step 1. groupby (2 points)

23.2 Step 2. pie chart. Add percent values. (3 points)

24 Q2.4 We're building a traditional lexicon of the titles. What is the distribution of initial letters (i.e. how many titles start with S?)? Plot it on a bar chart.

24.1 Step 1. Compute frequencies. (3 points)

24.2 Step 2. Plot it on a bar chart in descending order. (3 points)

25 Problem Set 3. Handling multiple dataframes

26 Merging with the movies dataframe

26.1 How many ratings are timestamped before the release date?

26.2 Which movies were rated before the release date?

27 Q3.1. How many times was each movie rated?

27.1 Step 1. Compute the frequencies of movie ratings. Use titles instead of movie ids. (2 points)

27.2 Step 2. Plot the frequencies on a histogram. (2 points)

28 Q3.2. How many ratings were submitted by each day of the week? What were their averages?

28.1 Step 1. groupby (3 points)

28.2 Step 2. number of ratings per day (2 points)

28.3 Step 3. mean rating by day (2 points)

29 Q3.3** What is the mean of ratings by genre?

29.1 Step 1. Compute the mean scores. (5 points)

29.2 Step 2. Plot it on a bar chart in descending order by score. Set the limits of the y-axis to (2.5, 4). (3 points)

30 Problem Set 4. User demographics

31 Q4.1 Load the users table from the file u.user. (3 points)

32 Q4.2 Merge the users table with ratings. Do not discard any columns. (3 points)

33 Q 4.3 How strict are people by occupation? Compute the average of ratings by occupation. Plot it on a bar chart in descending order.

33.1 Step 1. Compute the averages by occupation. (2 points)

33.2 Step 2. Plot it on a bar chart. (2 points)

34 Q4.4* Plot the averages by occupation and gender on a multiple bar plot. (4 points)

35 Q4.5 How likely are different age groups to rate movies? Compute the number of ratings by age grouped into 10-19, 20-29, etc. Plot it on a bar chart.

35.1 Step 1. Number of ratings by age group (3 points)

35.2 Step 2. Plot it on a bar chart. (2 points)

36 Q4.6 What hour of the day do different occupations rate? (3 points)

36.1 Create a function that computes the number of ratings per hour for a single occupation.

37 Q4.7* Plot the rating hours of marketing employees and programmers on two pie charts. (4 points)

38 Q4.8* Do older people prefer movies with longer titles? Compute the average title length by age group (0-10, 10-20).

38.1 Step1. compute mean length (4 points)

38.2 Step 2. Plot it on a bar chart. Choose a reasonable range for the y-axis. (2 points)

39 Problem Set 5. A simple recommendation system

40 Q5.1. Extract genre information as a matrix. (2 points)

41 Q5.2 Run the k-nearest neighbor algorithm on X. (3 points)

42 Q5.3 Increment by one (3 points)

43 Q5.4* Find the movies corresponding to these indices (5 points)

44 Q5.5* Replace the index of the movie by its title. (2 points)

45 Q5.6** Improve your recommedation system by adding other columns. (5 points)

46 Q6* Extra (3 points)

47 Submission

YOUR NAME (NEPTUN)

Business Intelligence - Pandas

March 11, 2020

Important steps before starting anything

Pandas is outdated on most lab computers. Upgrading can be done via Anaconda Prompt:

conda upgrade --all
conda upgrade pandas

Both commands ask for confirmation, just press Enter. If Windows asks for any permissions, you can deny it (allowing would require Administrator priviliges).

You should not see a warning here:



In [ ]:

    
import pandas as pd
if pd.__version__ < '1':
    print("WARNING: Pandas version older than 1.0.0: {}".format(pd.__version__))
else:
    print("Pandas version OK: {}".format(pd.__version__))

General information

This goal of this notebook is to give a brief introduction to the pandas library, a popular data manipulation and analysis tool for Python.

Problems are numbered from Q1 to Q5 with many subproblems such as Q1.1. The scores range from 2 to 5 based on the difficulty of the problem. Grades are determined using this table:

score	grade
80+	5
60+	4
40+	3
20+	2
20-	1

Your answer should go in place of YOUR CODE HERE. Please remove raise NotImplementedError. Most of the tasks are automatically graded using visible and hidden tests. Visible tests are available in this version, hidden tests are not available to you. This means that passing all visible tests does not ensure that your answer is correct. Not passing the visible tests means that your answer is incorrect. Do not delete or copy cells and do not edit the source code. You may add cells but they will not be graded. You will not find the hidden tests in the source code but you can mess up the autograder if you manually edit it.

VERY IMPORTANT Do not edit cell metadata or the raw .ipynb. Autograde will fail if you change something carelessly.

Advanced exercises are marked with *. More advanced ones have more stars. Some problems build on each other - it should be obvious how from the code - but advanced problems can safely be skipped. Completing all non-advanced exercises correctly is worth 63 points.

Submission

You only need to submit this notebook (no separate report). Make sure that you save the last version of your notebook. Please rename the notebook to your neptun code (ABC123.ipynb), package it in an archive (.zip) and upload it to the class website. You are free to continue working on this problem set after class but make sure that you upload it by the end of the week (Sunday).

VERY IMPORTANT Run Kernel->Restart & Run All and make sure that it finishes without errors. If you skip exercises, you need to manually run the remaining cells. You can run a single cell and step to the next cell by pressing Shift+Enter. Skipping exercises won't affect the autograder.

Tips

You generally don't need to leave any DataFrames printed as cell outputs. You can do it for debug purposes and it won't affect the autograder but please don't leave long tables in the output. Use .head() instead.

Be concise. All exercises can be solved with less than 5 lines of code.

Avoid for loops. Almost all tasks can be solved with efficient pandas operations.

Avoid overriding Python built-in functions with your own variables (max = 2).

If you mess up, you can always do one of the following

Kernel -> Restart & Run All - this will run all cells from top to bottom until an exception is thrown
Kernel -> Restart, Cell -> Run All Above - this will run all cells from top to bottom until the current cell is reached or and exception is thrown

If your notebook runs for longer than a minute, one or more of your solutions is very inefficient.

It's easy to accidentally change the type of a cell from code to Markdown/raw text. When this happens, you lose syntax highlight in that cell. You can change it back in the toolbar or in the Cell menu.

Jupyter has an extension called Table of Contents (2). Table of Contents lists all headers in a separate frame and makes navigation much easier. You can install and enable it the following way in Anaconda Prompt:

conda install -c conda-forge jupyter_contrib_nbextensions
python -m jupyter nbextension enable toc2/main

Save and refresh this notebook and you should see a new button in the toolbar that looks like a bullet point list. That button turns on Table of Contents.

Credits

This assignment was created by Judit Ács using nbgrader.

Feedback

Please fill out this short survey after you completed the problems.

Code quality

You can get 3 extra points for code quality.

PEP8 style guide

You can get 2 extra points for adhering to the PEP8 Python style guide.

Figure quality

You can get 5 extra points for the quality of your figures. Good figures have labeled axes with meaningful names, reasonable figure size and reasonable axes limits. Extra attention to details also helps.

Pie charts are ugly no matter what you do, do not sweat it too much, they are the Comic Sans of visualization.

Main imports



In [ ]:

    
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

import seaborn as sns
sns.set_context('notebook')

downloading the dataset



In [ ]:

    
import os

data_dir = os.getenv("MOVIELENS")
if data_dir is None:
    data_dir = ""

ml_path = os.path.join(data_dir, "ml.zip")

if os.path.exists(ml_path):
    print("File already exists, skipping download step.")
else:
    print("Downloading the Movielens dataset")
    import urllib
    u = urllib.request.URLopener()
    u.retrieve("http://files.grouplens.org/datasets/movielens/ml-100k.zip", ml_path)



In [ ]:

    
unzip_path = os.path.join(data_dir, "ml-100k")

if os.path.exists(unzip_path):
    print("Dataset already unpacked, skipping unpacking step.")
else:
    print("Unziping the dataset.")
    from zipfile import ZipFile
    with ZipFile(ml_path) as myzip:
        myzip.extractall(data_dir)
        
data_dir = unzip_path

Loading the dataset

pd.read_table loads a tabular dataset. The full function signature is:

pandas.read_table(filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]], sep='t', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal: str = '.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None

let's try it with defaults



In [ ]:

    
# df = pd.read_table("ml-100k/u.item")  # UnicodeDecodeErrort kapunk, mert rossz dekódert használ
df = pd.read_table(os.path.join(data_dir, "u.item"), encoding="latin1")
df.head()

A couple improvements:

Use a different separator. | instead of \t.
The first line of the file is used as the header. The real names of the columns are listed in the README, they can be specified with the names parameters.
read_table added an index (0..N-1), but the dataset already has an index, let's use that one (index_col='movie_id')
two columns, release_date and video_release_date are dates, pandas can parse them and create its own datetype.



In [ ]:

    
column_names = [
    "movie_id", "title", "release_date", "video_release_date", "imdb_url", "unknown", "action", "adventure", "animation",
    "children", "comedy", "crime", "documentary", "drama", "fantasy", "film_noir", "horror", "musical", "mystery",
    "romance", "sci_fi", "thriller", "war", "western"]
df = pd.read_table(
    os.path.join(data_dir, "u.item"), sep="|",
    names=column_names, encoding="latin1", index_col='movie_id',
    parse_dates=['release_date', 'video_release_date']
)
df.head()

Normalizing the dataset

the `.dt` and the `.str` namespaces

The year in the title seems redundant, let's check if it's always the same as the release date. The .dt namespace has various methods and attributes for handling dates, and the .str namespace has string handling methods.

`.dt` namespace



In [ ]:

    
print(", ".join([d for d in dir(df.release_date.dt) if not d.startswith('_')]))

`.str` namespace



In [ ]:

    
print(", ".join([d for d in dir(df.title.str) if not d.startswith('_')]))

.str can also be indexed like a string



In [ ]:

    
df.title.str[1].head()



In [ ]:

    
df.title.str[-5:].tail()

Let's extract the release year of a movie into a separate



In [ ]:

    
df['year'] = df.release_date.dt.year

the most common years are



In [ ]:

    
df.year.value_counts().head()

`video_release_date` is always `NaT` (not a time), let's drop it



In [ ]:

    
df.video_release_date.isnull().value_counts()



In [ ]:

    
df = df.drop('video_release_date', axis=1)

Basic analysis of the dataset

describe generate descriptive statistics.



In [ ]:

    
df.describe()

Only numeric columns are included by default. A single column (pd.Series) has a describe function too



In [ ]:

    
df.release_date.describe()

Numberic statistics are available as separate functions too:

count: number of non-NA cells. NA is NOT the same as 0
mean: average
std: standard deviation
var: variance
min, max

etc.



In [ ]:

    
df.mean()

Basic queries

Which movies were released in 1956?



In [ ]:

    
df[df.release_date.dt.year==1956]

How many movies were released in the 80s?

Let's print 5 examples too.



In [ ]:

    
d = df[(df.release_date.dt.year >=  1980) & (df.release_date.dt.year < 1990)]
print(f"{len(d)} movies were released in the 80s.")
print("\nA few examples:")
print("\n".join(d.sample(5).title))

When were the Die Hard movies released?



In [ ]:

    
df[df.title.str.contains('Die Hard')]

Die Hard 4 and 5 are missing. This is because the dataset only contains movies released between:



In [ ]:

    
df.release_date.min(), df.release_date.max()

and Die Hard 4 and 5 were released in 2007 and 2013 respectively.

How many movies are both action and romance? What about action or romance?

Make sure you parenthesize the conditions



In [ ]:

    
print("Action and romance:", len(df[(df.action==1) & (df.romance==1)]))
print("Action or romance:", len(df[(df.action==1) | (df.romance==1)]))

Problem Set 1: simple queries

Q1.1 How many action movies were release before 1985 and in 1985 or later? (2 points)



In [ ]:

    
def count_movies_before_1985(df):
    # YOUR CODE HERE
    raise NotImplementedError()
    
def count_movies_after_1984(df):
    # YOUR CODE HERE
    raise NotImplementedError()



In [ ]:

    
before = count_movies_before_1985(df)
assert isinstance(before, int) or isinstance(before, np.integer)
# action movies only, not all!
assert before != 272

after = count_movies_after_1984(df)
assert isinstance(after, int) or isinstance(after, np.integer)

Q1.2 Are there thrillers for children? Find an example. (2 points)



In [ ]:

    
def child_thriller(df):
    # YOUR CODE HERE
    raise NotImplementedError()



In [ ]:

    
title = child_thriller(df)
assert isinstance(title, str)

Q1.3 How many movies have title longer than 40 characters? (3 points)



In [ ]:

    
def long_titles(df):
    # YOUR CODE HERE
    raise NotImplementedError()



In [ ]:

    
title_cnt = long_titles(df)
assert isinstance(title_cnt, int) or isinstance(title_cnt, np.integer)

`pd.Series.apply`: running arbitrary functions on each element

The apply function allow running arbitrary functions on pd.Series:



In [ ]:

    
def number_of_words(field):
    return len(field.split(" "))

df.title.apply(number_of_words).value_counts().sort_index()
# or
# df.title.apply(lambda t: len(t.split(" "))).value_counts().sort_index()



In [ ]:

    
df.title.apply(number_of_words).value_counts().sort_index().plot(kind='bar')

Q1.4* How many content words does the average title have? Which title has the most/least words?

Content words are capitalized. The opposite of content words are function words (and, a, an, the, more etc.).

We should not include the release year in the word count. The release year is always the last word of the title.

Step 1. Count words in the title (2 points)



In [ ]:

    
#df['title_word_cnt'] = ...
def count_content_words(title):
    # YOUR CODE HERE
    raise NotImplementedError()



In [ ]:

    
df['title_word_cnt'] = df.title.apply(count_content_words)
assert 'title_word_cnt' in df.columns
assert df.loc[1424, 'title_word_cnt'] == 5
assert df.loc[1170, 'title_word_cnt'] == 2



In [ ]:

    
df.title_word_cnt.value_counts().sort_index().plot(kind='bar')

Step 2. Shortest and longest titles by word count (3 points)



In [ ]:

    
#shortest_title = ...
#longest_title = ...
#shortest_title_len = ...
#longest_title_len = ...
# YOUR CODE HERE
raise NotImplementedError()



In [ ]:

    
assert isinstance(shortest_title, str)
assert isinstance(longest_title, str)
assert isinstance(shortest_title_len, np.integer) or isinstance(shortest_title_len, int)
assert isinstance(longest_title_len, np.integer) or isinstance(longest_title_len, int)
assert shortest_title_len == 0
assert longest_title == 'Englishman Who Went Up a Hill, But Came Down a Mountain, The (1995)'
assert shortest_title == 'unknown'
assert longest_title_len == 10

Q1.5* How many movies have the word 'and' in their title? Write a function that counts movies with a particular word. (3 points)

Disregard case and avoid matching subwords for examples. For example 'and' should not match 'Holland' nor should it match the movie 'Andrew'.



In [ ]:

    
def movies_with_word(df, word):
    # YOUR CODE HERE
    raise NotImplementedError()



In [ ]:

    
and_movies = movies_with_word(df, "and")
assert isinstance(and_movies, pd.DataFrame)
assert and_movies.shape == (66, 24)
assert 'Mr. Holland\'s Opus (1995)' not in and_movies.title.values
assert movies_with_word(df, "The").shape == (465, 24)
assert movies_with_word(df, "the").shape == (465, 24)

Groupby and visualization

How many movies are released each year?



In [ ]:

    
df.groupby('year').size().plot()

another option is the use pd.Series.value_counts



In [ ]:

    
df.year.value_counts().sort_index().plot()

most movies were released in the late 80s and 90s, let's zoom in. Let's also change the figure size.

We create the plot object with one subplot, we then specify which axis pandas should use for plotting (ax=ax).



In [ ]:

    
fig, ax = plt.subplots(1, figsize=(10, 6))
d = df[df.year>1985]
d.groupby('year').size().plot(kind='bar', ax=ax)

we can groupby on multiple columns

Back to the romance-action combinations



In [ ]:

    
df.groupby(['action', 'romance']).size()

we can also group on arbitrary conditions

how many movies were released each decade?



In [ ]:

    
df.groupby(df.year // 10 * 10).size()

Problem Set 2: Groupby and visualization

Q2.1 Write a function that takes a genre and groups movies of that genre by year. Do not include movies older than 1985. (3 points)



In [ ]:

    
def group_genre_by_year(df, genre):
    # YOUR CODE HERE
    raise NotImplementedError()



In [ ]:

    
crime = group_genre_by_year(df, 'crime')
assert type(crime) == pd.core.groupby.DataFrameGroupBy
assert len(crime) <= 15  # movies between 1985-1999

Q2.2 Plot the number of adventure movies from 1985 to 1999 on a bar chart. Use your `group_genre_by_year` function. (2 points)



In [ ]:

    
# YOUR CODE HERE
raise NotImplementedError()

Q2.3 Plot the distribution of release day (day of month) on a pie chart.

Step 1. groupby (2 points)



In [ ]:

    
def groupby_release_day(df):
    # YOUR CODE HERE
    raise NotImplementedError()



In [ ]:

    
by_day = groupby_release_day(df)
assert type(by_day) == pd.core.groupby.DataFrameGroupBy

# the longest month is 31 days
assert len(by_day) < 32

# not month but DAY of month
assert len(by_day) != 12

# shouldn't group on day of week
assert len(by_day) > 7

Step 2. pie chart. Add percent values. (3 points)



In [ ]:

    
# YOUR CODE HERE
raise NotImplementedError()

Q2.4 We're building a traditional lexicon of the titles. What is the distribution of initial letters (i.e. how many titles start with S?)? Plot it on a bar chart.

Step 1. Compute frequencies. (3 points)

You don't need to perform any preprocessing.



In [ ]:

    
def compute_initial_letter_frequencies(df):
    # YOUR CODE HERE
    raise NotImplementedError()



In [ ]:

    
initial = compute_initial_letter_frequencies(df)

assert type(initial) == pd.Series

# frequency counts should be >= 1
assert initial.min() >= 1

# the largest one cannot be larger than the full dataframe
assert initial.max() <= len(df)

# there are 32 initial letters in the dataset
assert len(initial) == 32

assert 'B' in initial.index

Step 2. Plot it on a bar chart in descending order. (3 points)

The most common letter should be the first bar.



In [ ]:

    
# YOUR CODE HERE
raise NotImplementedError()

Problem Set 3. Handling multiple dataframes

The main table of this dataset is u.data with 100000 ratings.



In [ ]:

    
cols = ['user', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table(os.path.join(data_dir, "u.data"), names=cols)
ratings.head()

The timestamp column is a Unix timestamp, let's convert it to pd.DateTime:



In [ ]:

    
ratings['timestamp'] = pd.to_datetime(ratings.timestamp, unit='s')
ratings.head()

Merging with the movies dataframe

We overwrite ratings:



In [ ]:

    
movies = df
ratings = pd.merge(ratings, movies, left_on='movie_id', right_index=True)

How many ratings are timestamped before the release date?



In [ ]:

    
(ratings.timestamp <= ratings.release_date).value_counts()

Which movies were rated before the release date?



In [ ]:

    
ratings[ratings.timestamp <= ratings.release_date].title.value_counts()

Q3.1. How many times was each movie rated?

Step 1. Compute the frequencies of movie ratings. Use titles instead of movie ids. (2 points)



In [ ]:

    
def compute_movie_rated_frequencies(df):
    # YOUR CODE HERE
    raise NotImplementedError()



In [ ]:

    
title_freq = compute_movie_rated_frequencies(ratings)
assert isinstance(title_freq, pd.Series)

# use titles
assert 'Lone Star (1996)' in title_freq.index

Step 2. Plot the frequencies on a histogram. (2 points)

pd.Series has a hist function. It uses 10 bins by default, use more.



In [ ]:

    
# YOUR CODE HERE
raise NotImplementedError()

Q3.2. How many ratings were submitted by each day of the week? What were their averages?

Step 1. groupby (3 points)

Tip: look around in the .dt namespace.



In [ ]:

    
def groupby_day_of_week(ratings):
    # YOUR CODE HERE
    raise NotImplementedError()



In [ ]:

    
by_day = groupby_day_of_week(ratings)
assert isinstance(by_day, pd.core.groupby.generic.DataFrameGroupBy)
# there are 7 days
assert len(by_day) == 7
# use names of the days
assert 'Monday' in by_day.groups

Step 2. number of ratings per day (2 points)



In [ ]:

    
#number_of_ratings_per_day = ...
# YOUR CODE HERE
raise NotImplementedError()



In [ ]:

    
assert isinstance(number_of_ratings_per_day, pd.Series)
assert len(number_of_ratings_per_day) == 7
assert number_of_ratings_per_day.min() > 10000

Step 3. mean rating by day (2 points)



In [ ]:

    
#mean_rating_by_day = ...
# YOUR CODE HERE
raise NotImplementedError()



In [ ]:

    
assert isinstance(mean_rating_by_day, pd.Series)

Q3.3** What is the mean of ratings by genre?

If a movie has multiple genres, include it in every genre.

Step 1. Compute the mean scores. (5 points)

There are many ways to solve this problem. Try to do it without explicit for loops.



In [ ]:

    
genres = ['unknown', 'action', 'adventure', 'animation',
       'children', 'comedy', 'crime', 'documentary', 'drama', 'fantasy',
       'film_noir', 'horror', 'musical', 'mystery', 'romance', 'sci_fi',
       'thriller', 'war', 'western']

def compute_mean_rating_by_genre(ratings):
    # YOUR CODE HERE
    raise NotImplementedError()



In [ ]:

    
genre_rating = compute_mean_rating_by_genre(ratings)
assert len(genre_rating) == len(genres)
# all means are between 3 and 4
assert genre_rating.min() > 3.0
assert genre_rating.max() < 4.0
# film noir is rated highest
assert genre_rating.idxmax() == 'film_noir'
for g in genres:
    assert g in genre_rating.index

Step 2. Plot it on a bar chart in descending order by score. Set the limits of the y-axis to (2.5, 4). (3 points)



In [ ]:

    
# YOUR CODE HERE
raise NotImplementedError()

Problem Set 4. User demographics

Q4.1 Load the users table from the file `u.user`. (3 points)

u.users has the following columns: user_id, age, gender, occupation, zip. Use user_id as the index.



In [ ]:

    
# users = ...
# YOUR CODE HERE
raise NotImplementedError()



In [ ]:

    
assert type(users) == pd.DataFrame

# users are indexed from 0
assert 0 not in users.index
assert users.shape == (943, 4)
# user_id should be the index
assert 'user_id' not in users.columns

Q4.2 Merge the `users` table with `ratings`. Do not discard any columns. (3 points)



In [ ]:

    
# ratings = ratings.merge...
# YOUR CODE HERE
raise NotImplementedError()



In [ ]:

    
assert type(ratings) == pd.DataFrame
# all movies have ratings (nunique return the number of unique elements)
assert ratings.movie_id.nunique() == 1682

Q 4.3 How strict are people by occupation? Compute the average of ratings by occupation. Plot it on a bar chart in descending order.

Step 1. Compute the averages by occupation. (2 points)



In [ ]:

    
def compute_mean_by_occupation(ratings):
    # YOUR CODE HERE
    raise NotImplementedError()



In [ ]:

    
mean_by_occupation = compute_mean_by_occupation(ratings)
assert isinstance(mean_by_occupation, pd.Series)
# ratings are between 1 and 5
assert mean_by_occupation.min() > 1
assert mean_by_occupation.max() < 5

Step 2. Plot it on a bar chart. (2 points)

Make the bar chart wider and restrict the y-axis to (2.5, 4).



In [ ]:

    
# YOUR CODE HERE
raise NotImplementedError()

Q4.4* Plot the averages by occupation and gender on a multiple bar plot. (4 points)

Tip: there is an example of a multiple bar plot here

Tip 2: there are many ways to solve this problem, one is a one-liner using DataFrame.unstack.



In [ ]:

    
# YOUR CODE HERE
raise NotImplementedError()

Q4.5 How likely are different age groups to rate movies? Compute the number of ratings by age grouped into 10-19, 20-29, etc. Plot it on a bar chart.

Step 1. Number of ratings by age group (3 points)



In [ ]:

    
def count_ratings_by_age_group(ratings):
    # YOUR CODE HERE
    raise NotImplementedError()



In [ ]:

    
rating_by_age_group = count_ratings_by_age_group(ratings)

assert isinstance(rating_by_age_group, pd.Series)
assert 20 in rating_by_age_group
assert 25 not in rating_by_age_group

Step 2. Plot it on a bar chart. (2 points)



In [ ]:

    
# YOUR CODE HERE
raise NotImplementedError()

Q4.6 What hour of the day do different occupations rate? (3 points)

Create a function that computes the number of ratings per hour for a single occupation.



In [ ]:

    
def count_rating_by_hour_occupation(ratings, occupation):
    # YOUR CODE HERE
    raise NotImplementedError()



In [ ]:

    
marketing = count_rating_by_hour_occupation(ratings, "marketing")
assert isinstance(marketing, pd.Series)

# there are only 24 hours
assert len(marketing) < 25

Q4.7* Plot the rating hours of marketing employees and programmers on two pie charts. (4 points)

A two-subplot figure is created. ax is an array of the two subplots, use ax[0] for marketing employees and ax[1] for programmers. Set the titles of the subplots accordingly.



In [ ]:

    
fig, ax = plt.subplots(1, 2, figsize=(12, 6))
# YOUR CODE HERE
raise NotImplementedError()

Q4.8* Do older people prefer movies with longer titles? Compute the average title length by age group (0-10, 10-20).

Step1. compute mean length (4 points)

Tip: You should probably create a copy of some of the columns.



In [ ]:

    
def get_mean_length_by_age_group(ratings):
    # YOUR CODE HERE
    raise NotImplementedError()



In [ ]:

    
title_len_by_age = get_mean_length_by_age_group(ratings)
assert isinstance(title_len_by_age, pd.Series)
assert len(title_len_by_age) == 8
# titles are long
assert title_len_by_age.min() >= 20

# index should contain the lower bound of the age group
assert 0 in title_len_by_age.index
assert 20 in title_len_by_age.index
# the maximum age in the dataset is 73, there should be no 80-90 age group
assert 80 not in title_len_by_age.index

Step 2. Plot it on a bar chart. Choose a reasonable range for the y-axis. (2 points)



In [ ]:

    
# YOUR CODE HERE
raise NotImplementedError()

Problem Set 5. A simple recommendation system

Let's build a simple recommendation system that finds similar movies based on genre.

Q5.1. Extract genre information as a matrix. (2 points)

The .values attribute represents the underlying values as a Numpy ndarray.



In [ ]:

    
#X = ...
# YOUR CODE HERE
raise NotImplementedError()



In [ ]:

    
assert isinstance(X, np.ndarray)
# shape should be movies X genres
assert X.shape == (1682, 19)
assert list(np.unique(X)) == [0, 1]

Q5.2 Run the k-nearest neighbor algorithm on X. (3 points)

Find a usage example in the documentation of NearestNeighbors.

Store the indices in a variable names indices.

K is the number of nearest neighbors. It should be a parameter of your function.



In [ ]:

    
from sklearn.neighbors import NearestNeighbors
def run_knn(X, K):
    # YOUR CODE HERE
    raise NotImplementedError()



In [ ]:

    
assert run_knn(X, 2).shape == (1682, 2)
K = 4
indices = run_knn(X, K)
assert isinstance(indices, np.ndarray)
assert indices.shape[1] == K

indices is more convenient as a DataFrame



In [ ]:

    
ind = pd.DataFrame(indices)
ind.head()

Q5.3 Increment by one (3 points)

The index of this DataFrame refers to a particular movie and the rest of the rows are the indices of similar movies. The problem is that this matrix is zero-indexed, while the dataset (movies table) is indexed from 1.

Both the index and all values should be increased by one.



In [ ]:

    
def increment_table(df):
    # YOUR CODE HERE
    raise NotImplementedError()



In [ ]:

    
indices = increment_table(ind)
assert indices.shape[1] == 4
assert indices.index[0] == 1
assert indices.index[-1] == len(indices)



In [ ]:

    
indices.head()

Q5.4* Find the movies corresponding to these indices (5 points)

You'll need multiple merge operations.

Tip: the names of the columns in indices are not strings but integers, you can rename the columns of a dataframe:

df = df.rename(columns={'old': 'new', 'other old': 'other new'})

You can discard all other columns.



In [ ]:

    
def find_neighbor_titles(movies, indices):
    # YOUR CODE HERE
    raise NotImplementedError()



In [ ]:

    
neighbors = find_neighbor_titles(movies, indices)
assert isinstance(neighbors, pd.DataFrame)
assert neighbors.shape[1] == K



In [ ]:

    
neighbors.head()

Q5.5* Replace the index of the movie by its title. (2 points)



In [ ]:

    
def recover_titles(movies, neighbors):
    # YOUR CODE HERE
    raise NotImplementedError()



In [ ]:

    
most_similar = recover_titles(movies, neighbors)

assert type(most_similar) == pd.DataFrame
assert "Toy Story (1995)" in most_similar.index

Q5.6** Improve your recommedation system by adding other columns. (5 points)

Tips: you can add the average rating of a movie by occupation/age group/gender

Please fit your solution in one cell.



In [ ]:

    
# YOUR CODE HERE
raise NotImplementedError()

Q6* Extra (3 points)

Add any extra observations that you find interesting. Did you find any interesting patterns in the dataset? Are certain genres more appealing to particular demographic groups?

You can add multiple observations for extra points.

Please explain your answer in this field (double click on YOUR CODE GOES HERE):

YOUR ANSWER HERE

And the code here:



In [ ]:

    
# YOUR CODE HERE
raise NotImplementedError()

Submission

Please fill out this short survey after you completed the problems.

Table of Contents

YOUR NAME (NEPTUN)

Business Intelligence - Pandas

Important steps before starting anything

General information

Submission

Tips

Table of Contents

Credits

Feedback

Code quality

PEP8 style guide

Figure quality

Main imports

downloading the dataset

Loading the dataset

Normalizing the dataset

the .dt and the .str namespaces

.dt namespace

.str namespace

Let's extract the release year of a movie into a separate

the most common years are

video_release_date is always NaT (not a time), let's drop it

Basic analysis of the dataset

Basic queries

Which movies were released in 1956?

How many movies were released in the 80s?

When were the Die Hard movies released?

How many movies are both action and romance? What about action or romance?

Problem Set 1: simple queries

Q1.1 How many action movies were release before 1985 and in 1985 or later? (2 points)

Q1.2 Are there thrillers for children? Find an example. (2 points)

Q1.3 How many movies have title longer than 40 characters? (3 points)

pd.Series.apply: running arbitrary functions on each element

Q1.4* How many content words does the average title have? Which title has the most/least words?

Step 1. Count words in the title (2 points)

Step 2. Shortest and longest titles by word count (3 points)

Q1.5* How many movies have the word 'and' in their title? Write a function that counts movies with a particular word. (3 points)

Groupby and visualization

we can groupby on multiple columns

we can also group on arbitrary conditions

Problem Set 2: Groupby and visualization

Q2.1 Write a function that takes a genre and groups movies of that genre by year. Do not include movies older than 1985. (3 points)

Q2.2 Plot the number of adventure movies from 1985 to 1999 on a bar chart. Use your group_genre_by_year function. (2 points)

Q2.3 Plot the distribution of release day (day of month) on a pie chart.

Step 1. groupby (2 points)

Step 2. pie chart. Add percent values. (3 points)

Q2.4 We're building a traditional lexicon of the titles. What is the distribution of initial letters (i.e. how many titles start with S?)? Plot it on a bar chart.

Step 1. Compute frequencies. (3 points)

Step 2. Plot it on a bar chart in descending order. (3 points)

Problem Set 3. Handling multiple dataframes

Merging with the movies dataframe

How many ratings are timestamped before the release date?

Which movies were rated before the release date?

Q3.1. How many times was each movie rated?

Step 1. Compute the frequencies of movie ratings. Use titles instead of movie ids. (2 points)

Step 2. Plot the frequencies on a histogram. (2 points)

Q3.2. How many ratings were submitted by each day of the week? What were their averages?

Step 1. groupby (3 points)

Step 2. number of ratings per day (2 points)

Step 3. mean rating by day (2 points)

Q3.3** What is the mean of ratings by genre?

Step 1. Compute the mean scores. (5 points)

Step 2. Plot it on a bar chart in descending order by score. Set the limits of the y-axis to (2.5, 4). (3 points)

Problem Set 4. User demographics

Q4.1 Load the users table from the file u.user. (3 points)

Q4.2 Merge the users table with ratings. Do not discard any columns. (3 points)

Q 4.3 How strict are people by occupation? Compute the average of ratings by occupation. Plot it on a bar chart in descending order.

Step 1. Compute the averages by occupation. (2 points)

Step 2. Plot it on a bar chart. (2 points)

Q4.4* Plot the averages by occupation and gender on a multiple bar plot. (4 points)

Q4.5 How likely are different age groups to rate movies? Compute the number of ratings by age grouped into 10-19, 20-29, etc. Plot it on a bar chart.

Step 1. Number of ratings by age group (3 points)

Step 2. Plot it on a bar chart. (2 points)

Q4.6 What hour of the day do different occupations rate? (3 points)

Create a function that computes the number of ratings per hour for a single occupation.

Q4.7* Plot the rating hours of marketing employees and programmers on two pie charts. (4 points)

Q4.8* Do older people prefer movies with longer titles? Compute the average title length by age group (0-10, 10-20).

Step1. compute mean length (4 points)

Step 2. Plot it on a bar chart. Choose a reasonable range for the y-axis. (2 points)

the `.dt` and the `.str` namespaces

`.dt` namespace

`.str` namespace

`video_release_date` is always `NaT` (not a time), let's drop it

`pd.Series.apply`: running arbitrary functions on each element

Q2.2 Plot the number of adventure movies from 1985 to 1999 on a bar chart. Use your `group_genre_by_year` function. (2 points)

Q4.1 Load the users table from the file `u.user`. (3 points)

Q4.2 Merge the `users` table with `ratings`. Do not discard any columns. (3 points)