The goal of this notebook is to give a brief introduction to the pandas
library, a popular data manipulation and analysis tool for Python.
Problems are numbered from Q1 to Q5, with many subproblems such as Q1.1. Each problem is worth 2 to 5 points, depending on its difficulty. Grades are determined using this table (TODO update):
| score | grade |
|---|---|
| 80+ | 5 |
| 60+ | 4 |
| 40+ | 3 |
| 20+ | 2 |
| below 20 | 1 |
Your answer should go in place of `YOUR CODE HERE`. Please remove `raise NotImplementedError()`.
Most of the tasks are automatically graded using visible and hidden tests.
Visible tests are available in this version; hidden tests are not available to you. This means that passing all visible tests does not guarantee that your answer is correct, but failing any visible test means that your answer is incorrect.
Do not delete or copy cells, and do not edit the source code. You may add cells, but they will not be graded.
You will not find the hidden tests in the source code, but you can break the autograder if you edit it manually.
VERY IMPORTANT Do not edit the cell metadata or the raw `.ipynb` file. Autograding will fail if you change something carelessly.
Advanced exercises are marked with *; more advanced ones have more stars. Some problems build on each other (how should be obvious from the code), but advanced problems can safely be skipped. Completing all non-advanced exercises correctly is worth 63 points.
You only need to submit this notebook (no separate report).
Make sure that you save the last version of your notebook.
Please rename the notebook to your Neptun code (`ABC123.ipynb`), package it in a `.zip` archive, and upload it to the class website.
You are free to continue working on this problem set after class but make sure that you upload it by the end of the week (Sunday).
VERY IMPORTANT Run Kernel->Restart & Run All and make sure that it finishes without errors. If you skip exercises, you need to manually run the remaining cells. Skipping exercises won't affect the autograder.
You generally don't need to leave any DataFrames printed as cell outputs. You may do so for debugging; it won't affect the autograder, but please don't leave long tables in the output. Use `.head()` instead.
Be concise. All exercises can be solved in fewer than 5 lines of code.
Avoid for loops. Almost all tasks can be solved with efficient pandas operations.
Avoid overriding Python built-in functions with your own variables (e.g. `max = 2`); a short illustration follows.
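A minimal illustration of why shadowing a built-in hurts (a hypothetical cell, not part of the assignment):
In [ ]:
max = 2             # shadows the built-in max function
# max([1, 2, 3])    # would now raise TypeError: 'int' object is not callable
del max             # removes the variable and restores the built-in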
If you mess up, you can always do one of the following
If your notebook runs for longer than a minute, one or more of your solutions is very inefficient.
This assignment was created by Judit Ács using nbgrader.
Please fill out this short survey after you have completed the problems.
In [ ]:
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set_context('notebook')
In [ ]:
import os

# the dataset is stored next to the notebook unless the MOVIELENS directory is set
data_dir = os.getenv("MOVIELENS")
if data_dir is None:
    data_dir = ""
ml_path = os.path.join(data_dir, "ml.zip")
if os.path.exists(ml_path):
    print("File already exists, skipping download step.")
else:
    print("Downloading the MovieLens dataset")
    from urllib import request
    request.urlretrieve("http://files.grouplens.org/datasets/movielens/ml-100k.zip", ml_path)
In [ ]:
unzip_path = os.path.join(data_dir, "ml-100k")
if os.path.exists(unzip_path):
    print("Dataset already unpacked, skipping unpacking step.")
else:
    print("Unzipping the dataset.")
    from zipfile import ZipFile
    with ZipFile(ml_path) as myzip:
        myzip.extractall(data_dir)
data_dir = unzip_path
`pd.read_table` loads a tabular dataset. The full function signature is:

pandas.read_table(filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]], sep='\t', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal: str = '.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)

Let's try it with the defaults:
In [ ]:
# df = pd.read_table("ml-100k/u.item")  # we get a UnicodeDecodeError because the wrong decoder is used
df = pd.read_table(os.path.join(data_dir, "u.item"), encoding="latin1")
df.head()
A couple of improvements:

- The separator is `|` instead of `\t`.
- The column names are listed in the README; they can be specified with the `names` parameter.
- `read_table` added an index (0..N-1), but the dataset already has an index, so let's use that one (`index_col='movie_id'`).
- `release_date` and `video_release_date` are dates; pandas can parse them and create its own date type.
In [ ]:
column_names = [
    "movie_id", "title", "release_date", "video_release_date", "imdb_url", "unknown", "action", "adventure", "animation",
    "children", "comedy", "crime", "documentary", "drama", "fantasy", "film_noir", "horror", "musical", "mystery",
    "romance", "sci_fi", "thriller", "war", "western"]
df = pd.read_table(
    os.path.join(data_dir, "u.item"), sep="|",
    names=column_names, encoding="latin1", index_col='movie_id',
    parse_dates=['release_date', 'video_release_date']
)
df.head()
In [ ]:
print(", ".join([d for d in dir(df.release_date.dt) if not d.startswith('_')]))
In [ ]:
print(", ".join([d for d in dir(df.title.str) if not d.startswith('_')]))
In [ ]:
df['year'] = df.release_date.dt.year
In [ ]:
df.year.value_counts().head()
In [ ]:
df.video_release_date.isnull().value_counts()
In [ ]:
df = df.drop('video_release_date', axis=1)
In [ ]:
df.describe()
Only numeric columns are included by default. A single column (`pd.Series`) has a `describe` function too:
In [ ]:
df.release_date.describe()
Numeric statistics are available as separate functions too (`mean`, `std`, `min`, `max`, etc.):
In [ ]:
df.mean()
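Several statistics can also be computed in one call with `agg`; a minimal sketch on the `year` column:
In [ ]:
# mean, standard deviation and extrema of the release year in one call
df.year.agg(['mean', 'std', 'min', 'max'])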
In [ ]:
df[df.release_date.dt.year==1956]
In [ ]:
d = df[(df.release_date.dt.year >= 1980) & (df.release_date.dt.year < 1990)]
print(f"{len(d)} movies were released in the 80s.")
print("\nA few examples:")
print("\n".join(d.sample(5).title))
In [ ]:
df[df.title.str.contains('Die Hard')]
Die Hard 4 and 5 are missing. This is because the dataset only contains movies released between:
In [ ]:
df.release_date.min(), df.release_date.max()
and Die Hard 4 and 5 were released in 2007 and 2013 respectively.
In [ ]:
print("Action and romance:", len(df[(df.action==1) & (df.romance==1)]))
print("Action or romance:", len(df[(df.action==1) | (df.romance==1)]))
In [ ]:
def count_movies_before_1985(df):
    # YOUR CODE HERE
    raise NotImplementedError()

def count_movies_after_1984(df):
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
before = count_movies_before_1985(df)
assert isinstance(before, int) or isinstance(before, np.integer)
# action movies only, not all!
assert before != 272
after = count_movies_after_1984(df)
assert isinstance(after, int) or isinstance(after, np.integer)
In [ ]:
def child_thriller(df):
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
title = child_thriller(df)
assert isinstance(title, str)
In [ ]:
def long_titles(df):
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
title_cnt = long_titles(df)
assert isinstance(title_cnt, int) or isinstance(title_cnt, np.integer)
The `apply` function allows running arbitrary functions on a `pd.Series`:
In [ ]:
def number_of_words(field):
    return len(field.split(" "))

df.title.apply(number_of_words).value_counts().sort_index()
# or
# df.title.apply(lambda t: len(t.split(" "))).value_counts().sort_index()
In [ ]:
df.title.apply(number_of_words).value_counts().sort_index().plot(kind='bar')
Content words are capitalized. The opposite of content words are function words (and, a, an, the, more, etc.), which are not capitalized. We should not include the release year in the word count. A quick way to test capitalization is shown below.
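An illustration only (the counting logic is up to you): plain Python string methods can test whether a word is capitalized.
In [ ]:
# str.istitle is True only for capitalized words
print("Die".istitle(), "and".istitle(), "(1995)".istitle())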
In [ ]:
#df['title_word_cnt'] = ...
def count_content_words(title):
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
df['title_word_cnt'] = df.title.apply(count_content_words)
assert 'title_word_cnt' in df.columns
assert df['title_word_cnt'].dtype == int
assert df.loc[1424, 'title_word_cnt'] == 5
assert df.loc[1170, 'title_word_cnt'] == 2
In [ ]:
df.title_word_cnt.value_counts().sort_index().plot(kind='bar')
In [ ]:
#shortest_title = ...
#longest_title = ...
#shortest_title_len = ...
#longest_title_len = ...
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
assert isinstance(shortest_title, str)
assert isinstance(longest_title, str)
assert isinstance(shortest_title_len, np.int64)
assert isinstance(longest_title_len, np.int64)
assert shortest_title_len == 0
assert longest_title == 'Englishman Who Went Up a Hill, But Came Down a Mountain, The (1995)'
assert shortest_title == 'unknown'
assert longest_title_len == 10
In [ ]:
def movies_with_word(df, word):
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
and_movies = movies_with_word(df, "and")
assert isinstance(and_movies, pd.DataFrame)
assert and_movies.shape == (66, 24)
assert 'Mr. Holland\'s Opus (1995)' not in and_movies.title.values
In [ ]:
df.groupby('year').size().plot()
Another option is to use `pd.Series.value_counts`:
In [ ]:
df.year.value_counts().sort_index().plot()
Most movies were released in the late 80s and the 90s, so let's zoom in. Let's also change the figure size.
We create the figure with one subplot, then specify which axis pandas should use for plotting (`ax=ax`).
In [ ]:
fig, ax = plt.subplots(1, figsize=(10, 6))
d = df[df.year>1985]
d.groupby('year').size().plot(kind='bar', ax=ax)
In [ ]:
df.groupby(['action', 'romance']).size()
In [ ]:
df.groupby(df.year // 10 * 10).size()  # integer division rounds years down to the decade (1987 -> 1980)
In [ ]:
def group_genre_by_year(df, genre):
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
crime = group_genre_by_year(df, 'crime')
assert type(crime) == pd.core.groupby.DataFrameGroupBy
assert len(crime) <= 15 # movies between 1985-1999
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
def groupby_release_day(df):
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
by_day = groupby_release_day(df)
assert type(by_day) == pd.core.groupby.DataFrameGroupBy
# the longest month is 31 days
assert len(by_day) < 32
# shouldn't group on day of week
assert len(by_day) > 7
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
def compute_initial_letter_frequencies(df):
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
initial = compute_initial_letter_frequencies(df)
assert type(initial) == pd.Series
# frequency counts should be >= 1
assert initial.min() >= 1
# the largest one cannot be larger than the full dataframe
assert initial.max() <= len(df)
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
cols = ['user', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table(os.path.join(data_dir, "u.data"), names=cols)
ratings.head()
The `timestamp` column is a Unix timestamp; let's convert it to a pandas datetime:
In [ ]:
ratings['timestamp'] = pd.to_datetime(ratings.timestamp, unit='s')
ratings.head()
In [ ]:
movies = df
# join each rating with its movie: the movie_id column matches the index of `movies`
ratings = pd.merge(ratings, movies, left_on='movie_id', right_index=True)
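If `pd.merge` is new to you, here is a toy sketch (hypothetical mini-tables, not the real dataset) of how `left_on`/`right_index` alignment works:
In [ ]:
left = pd.DataFrame({'movie_id': [1, 2, 2], 'rating': [5, 3, 4]})
right = pd.DataFrame({'title': ['Toy Story', 'GoldenEye']}, index=[1, 2])
# each row of `left` is paired with the row of `right` whose index equals movie_id
pd.merge(left, right, left_on='movie_id', right_index=True)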
In [ ]:
(ratings.timestamp <= ratings.release_date).value_counts()
In [ ]:
ratings[ratings.timestamp <= ratings.release_date].title.value_counts()
In [ ]:
def count_greater_than_4(df):
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
greater = count_greater_than_4(ratings)
assert isinstance(greater, np.integer) or isinstance(greater, int)
In [ ]:
def compute_movie_rated_frequencies(df):
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
title_freq = compute_movie_rated_frequencies(ratings)
assert isinstance(title_freq, pd.Series)
# use titles
assert 'Lone Star (1996)' in title_freq.index
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
def groupby_day_of_week(ratings):
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
by_day = groupby_day_of_week(ratings)
assert isinstance(by_day, pd.core.groupby.generic.DataFrameGroupBy)
# there are 7 days
assert len(by_day) == 7
# use names of the days
assert 'Monday' in by_day.groups
In [ ]:
#number_of_ratings_per_day = ...
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
assert isinstance(number_of_ratings_per_day, pd.Series)
assert len(number_of_ratings_per_day) == 7
assert number_of_ratings_per_day.min() > 10000
In [ ]:
#mean_rating_by_day = ...
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
assert isinstance(mean_rating_by_day, pd.Series)
In [ ]:
genres = ['unknown', 'action', 'adventure', 'animation',
          'children', 'comedy', 'crime', 'documentary', 'drama', 'fantasy',
          'film_noir', 'horror', 'musical', 'mystery', 'romance', 'sci_fi',
          'thriller', 'war', 'western']

def compute_mean_rating_by_genre(ratings):
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
genre_rating = compute_mean_rating_by_genre(ratings)
assert len(genre_rating) == len(genres)
# all means are between 3 and 4
assert genre_rating.min() > 3.0
assert genre_rating.max() < 4.0
# film noir is rated highest
assert genre_rating.idxmax() == 'film_noir'
for g in genres:
    assert g in genre_rating.index
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
# users = ...
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
assert type(users) == pd.DataFrame
# users are indexed from 1, not 0
assert 0 not in users.index
In [ ]:
# ratings = ratings.merge...
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
assert type(ratings) == pd.DataFrame
# all movies have ratings (nunique returns the number of unique elements)
assert ratings.movie_id.nunique() == 1682
In [ ]:
def compute_mean_by_occupation(ratings):
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
mean_by_occupation = compute_mean_by_occupation(ratings)
assert isinstance(mean_by_occupation, pd.Series)
# ratings are between 1 and 5
assert mean_by_occupation.min() > 1
assert mean_by_occupation.max() < 5
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
Tip: there is an example of a multiple bar plot here.

Tip 2: there are many ways to solve this problem; one is a one-liner using `DataFrame.unstack` (a toy sketch follows below). It's a little longer if you make the figure nicer.
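If `unstack` is unfamiliar: grouping by two keys yields a Series with a two-level index, and `unstack` pivots the inner level into columns. A toy sketch with hypothetical data (not the ratings table):
In [ ]:
toy = pd.DataFrame({'genre': ['action', 'action', 'drama', 'drama'],
                    'sex': ['F', 'M', 'F', 'M'],
                    'rating': [3.5, 3.6, 3.9, 3.7]})
# rows: genre, columns: sex, values: mean rating; .plot(kind='bar') would draw grouped bars
toy.groupby(['genre', 'sex']).rating.mean().unstack()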
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
def count_ratings_by_age_group(ratings):
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
rating_by_age_group = count_ratings_by_age_group(ratings)
assert isinstance(rating_by_age_group, pd.Series)
assert 20 in rating_by_age_group
assert 25 not in rating_by_age_group
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
def count_rating_by_hour_occupation(ratings, occupation):
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
marketing = count_rating_by_hour_occupation(ratings, "marketing")
assert isinstance(marketing, pd.Series)
# there are only 24 hours
assert len(marketing) < 25
In [ ]:
fig, ax = plt.subplots(1, 2, figsize=(12, 6))
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
def get_mean_length_by_age_group(ratings):
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
title_len_by_age = get_mean_length_by_age_group(ratings)
assert isinstance(title_len_by_age, pd.Series)
assert len(title_len_by_age) == 8
# titles are long
assert title_len_by_age.min() >= 20
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
#X = ...
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
assert isinstance(X, np.ndarray)
assert X.shape[0] == len(movies)
Find a usage example in the documentation of `NearestNeighbors`.

Store the indices in a variable named `indices`.

`K` is the number of nearest neighbors. It should be a parameter of your function.
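For reference, the usage pattern from the scikit-learn documentation looks roughly like this (toy data; you still need to adapt it and make `K` a parameter of your function):
In [ ]:
from sklearn.neighbors import NearestNeighbors
toy = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
nn = NearestNeighbors(n_neighbors=2).fit(toy)
distances, indices = nn.kneighbors(toy)
indices  # row i holds the indices of the 2 rows nearest to row i (itself included)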
In [ ]:
from sklearn.neighbors import NearestNeighbors
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
assert run_knn(X, 2).shape == (1682, 2)
K = 4
indices = run_knn(X, K)
assert isinstance(indices, np.ndarray)
assert indices.shape[1] == K
`indices` is more convenient as a `DataFrame`:
In [ ]:
ind = pd.DataFrame(indices)
ind.head()
The index of this DataFrame refers to a particular movie, and the rest of the row contains the indices of similar movies. The problem is that this matrix is zero-indexed, while the dataset (the `movies` table) is indexed from 1.
Both the index and all values should be increased by one.
In [ ]:
def increment_table(df):
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
indices = increment_table(ind)
assert indices.shape[1] == 4
assert indices.index[0] == 1
assert indices.index[-1] == len(indices)
In [ ]:
indices.head()
You'll need multiple merge operations.

Tip: the names of the columns in `indices` are not strings but integers. You can rename the columns of a dataframe: `df = df.rename(columns={'old': 'new', 'other old': 'other new'})`. You can discard all other columns.
In [ ]:
def find_neighbor_titles(movies, indices):
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
neighbors = find_neighbor_titles(movies, indices)
assert isinstance(neighbors, pd.DataFrame)
assert neighbors.shape[1] == K
In [ ]:
neighbors.head()
In [ ]:
def recover_titles(movies, neighbors):
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
most_similar = recover_titles(movies, neighbors)
assert type(most_similar) == pd.DataFrame
assert "Toy Story (1995)" in most_similar.index
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
Please fill out this short survey after you have completed the problems.