Pandas is outdated on most lab computers. Upgrading can be done via Anaconda Prompt:
conda upgrade --all
conda upgrade pandas
Both commands ask for confirmation, just press Enter. If Windows asks for any permissions, you can deny it (allowing would require Administrator priviliges).
You should not see a warning here:
In [ ]:
import pandas as pd
if pd.__version__ < '1':
print("WARNING: Pandas version older than 1.0.0: {}".format(pd.__version__))
else:
print("Pandas version OK: {}".format(pd.__version__))
This goal of this notebook is to give a brief introduction to the pandas
library, a popular data manipulation and analysis tool for Python.
Problems are numbered from Q1 to Q5 with many subproblems such as Q1.1. The scores range from 2 to 5 based on the difficulty of the problem. Grades are determined using this table:
score | grade |
---|---|
80+ | 5 |
60+ | 4 |
40+ | 3 |
20+ | 2 |
20- | 1 |
Your answer should go in place of YOUR CODE HERE
. Please remove raise NotImplementedError
.
Most of the tasks are automatically graded using visible and hidden tests.
Visible tests are available in this version, hidden tests are not available to you.
This means that passing all visible tests does not ensure that your answer is correct.
Not passing the visible tests means that your answer is incorrect.
Do not delete or copy cells and do not edit the source code.
You may add cells but they will not be graded.
You will not find the hidden tests in the source code but you can mess up the autograder if you manually edit it.
VERY IMPORTANT Do not edit cell metadata or the raw .ipynb
. Autograde will fail if you change something carelessly.
Advanced exercises are marked with *. More advanced ones have more stars. Some problems build on each other - it should be obvious how from the code - but advanced problems can safely be skipped. Completing all non-advanced exercises correctly is worth 63 points.
You only need to submit this notebook (no separate report).
Make sure that you save the last version of your notebook.
Please rename the notebook to your neptun code (ABC123.ipynb
), package it in an archive (.zip
) and upload it to the class website.
You are free to continue working on this problem set after class but make sure that you upload it by the end of the week (Sunday).
VERY IMPORTANT Run Kernel->Restart & Run All and make sure that it finishes without errors. If you skip exercises, you need to manually run the remaining cells. You can run a single cell and step to the next cell by pressing Shift+Enter. Skipping exercises won't affect the autograder.
You generally don't need to leave any DataFrames printed as cell outputs. You can do it for debug purposes and it won't affect the autograder but please don't leave long tables in the output. Use .head()
instead.
Be concise. All exercises can be solved with less than 5 lines of code.
Avoid for loops. Almost all tasks can be solved with efficient pandas operations.
Avoid overriding Python built-in functions with your own variables (max = 2
).
If you mess up, you can always do one of the following
If your notebook runs for longer than a minute, one or more of your solutions is very inefficient.
It's easy to accidentally change the type of a cell from code to Markdown/raw text. When this happens, you lose syntax highlight in that cell. You can change it back in the toolbar or in the Cell menu.
Jupyter has an extension called Table of Contents (2). Table of Contents lists all headers in a separate frame and makes navigation much easier. You can install and enable it the following way in Anaconda Prompt:
conda install -c conda-forge jupyter_contrib_nbextensions
python -m jupyter nbextension enable toc2/main
Save and refresh this notebook and you should see a new button in the toolbar that looks like a bullet point list. That button turns on Table of Contents.
This assignment was created by Judit Ács using nbgrader.
Please fill out this short survey after you completed the problems.
You can get 5 extra points for the quality of your figures. Good figures have labeled axes with meaningful names, reasonable figure size and reasonable axes limits. Extra attention to details also helps.
Pie charts are ugly no matter what you do, do not sweat it too much, they are the Comic Sans of visualization.
In [ ]:
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set_context('notebook')
In [ ]:
import os
data_dir = os.getenv("MOVIELENS")
if data_dir is None:
data_dir = ""
ml_path = os.path.join(data_dir, "ml.zip")
if os.path.exists(ml_path):
print("File already exists, skipping download step.")
else:
print("Downloading the Movielens dataset")
import urllib
u = urllib.request.URLopener()
u.retrieve("http://files.grouplens.org/datasets/movielens/ml-100k.zip", ml_path)
In [ ]:
unzip_path = os.path.join(data_dir, "ml-100k")
if os.path.exists(unzip_path):
print("Dataset already unpacked, skipping unpacking step.")
else:
print("Unziping the dataset.")
from zipfile import ZipFile
with ZipFile(ml_path) as myzip:
myzip.extractall(data_dir)
data_dir = unzip_path
pd.read_table
loads a tabular dataset. The full function signature is:
pandas.read_table(filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]], sep='t', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal: str = '.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None
let's try it with defaults
In [ ]:
# df = pd.read_table("ml-100k/u.item") # UnicodeDecodeErrort kapunk, mert rossz dekódert használ
df = pd.read_table(os.path.join(data_dir, "u.item"), encoding="latin1")
df.head()
A couple improvements:
|
instead of \t
.README
, they can be specified with the names
parameters.read_table
added an index (0..N-1), but the dataset already has an index, let's use that one (index_col='movie_id'
)release_date
and video_release_date
are dates, pandas can parse them and create its own datetype.
In [ ]:
column_names = [
"movie_id", "title", "release_date", "video_release_date", "imdb_url", "unknown", "action", "adventure", "animation",
"children", "comedy", "crime", "documentary", "drama", "fantasy", "film_noir", "horror", "musical", "mystery",
"romance", "sci_fi", "thriller", "war", "western"]
df = pd.read_table(
os.path.join(data_dir, "u.item"), sep="|",
names=column_names, encoding="latin1", index_col='movie_id',
parse_dates=['release_date', 'video_release_date']
)
df.head()
In [ ]:
print(", ".join([d for d in dir(df.release_date.dt) if not d.startswith('_')]))
In [ ]:
print(", ".join([d for d in dir(df.title.str) if not d.startswith('_')]))
.str
can also be indexed like a string
In [ ]:
df.title.str[1].head()
In [ ]:
df.title.str[-5:].tail()
In [ ]:
df['year'] = df.release_date.dt.year
In [ ]:
df.year.value_counts().head()
In [ ]:
df.video_release_date.isnull().value_counts()
In [ ]:
df = df.drop('video_release_date', axis=1)
In [ ]:
df.describe()
Only numeric columns are included by default. A single column (pd.Series
) has a describe function too
In [ ]:
df.release_date.describe()
Numberic statistics are available as separate functions too:
etc.
In [ ]:
df.mean()
In [ ]:
df[df.release_date.dt.year==1956]
In [ ]:
d = df[(df.release_date.dt.year >= 1980) & (df.release_date.dt.year < 1990)]
print(f"{len(d)} movies were released in the 80s.")
print("\nA few examples:")
print("\n".join(d.sample(5).title))
In [ ]:
df[df.title.str.contains('Die Hard')]
Die Hard 4 and 5 are missing. This is because the dataset only contains movies released between:
In [ ]:
df.release_date.min(), df.release_date.max()
and Die Hard 4 and 5 were released in 2007 and 2013 respectively.
In [ ]:
print("Action and romance:", len(df[(df.action==1) & (df.romance==1)]))
print("Action or romance:", len(df[(df.action==1) | (df.romance==1)]))
In [ ]:
def count_movies_before_1985(df):
# YOUR CODE HERE
raise NotImplementedError()
def count_movies_after_1984(df):
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
before = count_movies_before_1985(df)
assert isinstance(before, int) or isinstance(before, np.integer)
# action movies only, not all!
assert before != 272
after = count_movies_after_1984(df)
assert isinstance(after, int) or isinstance(after, np.integer)
In [ ]:
def child_thriller(df):
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
title = child_thriller(df)
assert isinstance(title, str)
In [ ]:
def long_titles(df):
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
title_cnt = long_titles(df)
assert isinstance(title_cnt, int) or isinstance(title_cnt, np.integer)
The apply
function allow running arbitrary functions on pd.Series
:
In [ ]:
def number_of_words(field):
return len(field.split(" "))
df.title.apply(number_of_words).value_counts().sort_index()
# or
# df.title.apply(lambda t: len(t.split(" "))).value_counts().sort_index()
In [ ]:
df.title.apply(number_of_words).value_counts().sort_index().plot(kind='bar')
Content words are capitalized. The opposite of content words are function words (and, a, an, the, more etc.).
We should not include the release year in the word count. The release year is always the last word of the title.
In [ ]:
#df['title_word_cnt'] = ...
def count_content_words(title):
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
df['title_word_cnt'] = df.title.apply(count_content_words)
assert 'title_word_cnt' in df.columns
assert df.loc[1424, 'title_word_cnt'] == 5
assert df.loc[1170, 'title_word_cnt'] == 2
In [ ]:
df.title_word_cnt.value_counts().sort_index().plot(kind='bar')
In [ ]:
#shortest_title = ...
#longest_title = ...
#shortest_title_len = ...
#longest_title_len = ...
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
assert isinstance(shortest_title, str)
assert isinstance(longest_title, str)
assert isinstance(shortest_title_len, np.integer) or isinstance(shortest_title_len, int)
assert isinstance(longest_title_len, np.integer) or isinstance(longest_title_len, int)
assert shortest_title_len == 0
assert longest_title == 'Englishman Who Went Up a Hill, But Came Down a Mountain, The (1995)'
assert shortest_title == 'unknown'
assert longest_title_len == 10
In [ ]:
def movies_with_word(df, word):
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
and_movies = movies_with_word(df, "and")
assert isinstance(and_movies, pd.DataFrame)
assert and_movies.shape == (66, 24)
assert 'Mr. Holland\'s Opus (1995)' not in and_movies.title.values
assert movies_with_word(df, "The").shape == (465, 24)
assert movies_with_word(df, "the").shape == (465, 24)
In [ ]:
df.groupby('year').size().plot()
another option is the use pd.Series.value_counts
In [ ]:
df.year.value_counts().sort_index().plot()
most movies were released in the late 80s and 90s, let's zoom in. Let's also change the figure size.
We create the plot object with one subplot, we then specify which axis pandas should use for plotting (ax=ax
).
In [ ]:
fig, ax = plt.subplots(1, figsize=(10, 6))
d = df[df.year>1985]
d.groupby('year').size().plot(kind='bar', ax=ax)
In [ ]:
df.groupby(['action', 'romance']).size()
In [ ]:
df.groupby(df.year // 10 * 10).size()
In [ ]:
def group_genre_by_year(df, genre):
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
crime = group_genre_by_year(df, 'crime')
assert type(crime) == pd.core.groupby.DataFrameGroupBy
assert len(crime) <= 15 # movies between 1985-1999
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
def groupby_release_day(df):
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
by_day = groupby_release_day(df)
assert type(by_day) == pd.core.groupby.DataFrameGroupBy
# the longest month is 31 days
assert len(by_day) < 32
# not month but DAY of month
assert len(by_day) != 12
# shouldn't group on day of week
assert len(by_day) > 7
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
def compute_initial_letter_frequencies(df):
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
initial = compute_initial_letter_frequencies(df)
assert type(initial) == pd.Series
# frequency counts should be >= 1
assert initial.min() >= 1
# the largest one cannot be larger than the full dataframe
assert initial.max() <= len(df)
# there are 32 initial letters in the dataset
assert len(initial) == 32
assert 'B' in initial.index
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
cols = ['user', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table(os.path.join(data_dir, "u.data"), names=cols)
ratings.head()
The timestamp
column is a Unix timestamp, let's convert it to pd.DateTime
:
In [ ]:
ratings['timestamp'] = pd.to_datetime(ratings.timestamp, unit='s')
ratings.head()
In [ ]:
movies = df
ratings = pd.merge(ratings, movies, left_on='movie_id', right_index=True)
In [ ]:
(ratings.timestamp <= ratings.release_date).value_counts()
In [ ]:
ratings[ratings.timestamp <= ratings.release_date].title.value_counts()
In [ ]:
def compute_movie_rated_frequencies(df):
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
title_freq = compute_movie_rated_frequencies(ratings)
assert isinstance(title_freq, pd.Series)
# use titles
assert 'Lone Star (1996)' in title_freq.index
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
def groupby_day_of_week(ratings):
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
by_day = groupby_day_of_week(ratings)
assert isinstance(by_day, pd.core.groupby.generic.DataFrameGroupBy)
# there are 7 days
assert len(by_day) == 7
# use names of the days
assert 'Monday' in by_day.groups
In [ ]:
#number_of_ratings_per_day = ...
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
assert isinstance(number_of_ratings_per_day, pd.Series)
assert len(number_of_ratings_per_day) == 7
assert number_of_ratings_per_day.min() > 10000
In [ ]:
#mean_rating_by_day = ...
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
assert isinstance(mean_rating_by_day, pd.Series)
In [ ]:
genres = ['unknown', 'action', 'adventure', 'animation',
'children', 'comedy', 'crime', 'documentary', 'drama', 'fantasy',
'film_noir', 'horror', 'musical', 'mystery', 'romance', 'sci_fi',
'thriller', 'war', 'western']
def compute_mean_rating_by_genre(ratings):
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
genre_rating = compute_mean_rating_by_genre(ratings)
assert len(genre_rating) == len(genres)
# all means are between 3 and 4
assert genre_rating.min() > 3.0
assert genre_rating.max() < 4.0
# film noir is rated highest
assert genre_rating.idxmax() == 'film_noir'
for g in genres:
assert g in genre_rating.index
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
# users = ...
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
assert type(users) == pd.DataFrame
# users are indexed from 0
assert 0 not in users.index
assert users.shape == (943, 4)
# user_id should be the index
assert 'user_id' not in users.columns
In [ ]:
# ratings = ratings.merge...
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
assert type(ratings) == pd.DataFrame
# all movies have ratings (nunique return the number of unique elements)
assert ratings.movie_id.nunique() == 1682
In [ ]:
def compute_mean_by_occupation(ratings):
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
mean_by_occupation = compute_mean_by_occupation(ratings)
assert isinstance(mean_by_occupation, pd.Series)
# ratings are between 1 and 5
assert mean_by_occupation.min() > 1
assert mean_by_occupation.max() < 5
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
Tip: there is an example of a multiple bar plot here
Tip 2: there are many ways to solve this problem, one is a one-liner using DataFrame.unstack
.
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
def count_ratings_by_age_group(ratings):
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
rating_by_age_group = count_ratings_by_age_group(ratings)
assert isinstance(rating_by_age_group, pd.Series)
assert 20 in rating_by_age_group
assert 25 not in rating_by_age_group
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
def count_rating_by_hour_occupation(ratings, occupation):
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
marketing = count_rating_by_hour_occupation(ratings, "marketing")
assert isinstance(marketing, pd.Series)
# there are only 24 hours
assert len(marketing) < 25
In [ ]:
fig, ax = plt.subplots(1, 2, figsize=(12, 6))
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
def get_mean_length_by_age_group(ratings):
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
title_len_by_age = get_mean_length_by_age_group(ratings)
assert isinstance(title_len_by_age, pd.Series)
assert len(title_len_by_age) == 8
# titles are long
assert title_len_by_age.min() >= 20
# index should contain the lower bound of the age group
assert 0 in title_len_by_age.index
assert 20 in title_len_by_age.index
# the maximum age in the dataset is 73, there should be no 80-90 age group
assert 80 not in title_len_by_age.index
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
#X = ...
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
assert isinstance(X, np.ndarray)
# shape should be movies X genres
assert X.shape == (1682, 19)
assert list(np.unique(X)) == [0, 1]
Find a usage example in the documentation of NearestNeighbors
.
Store the indices in a variable names indices
.
K
is the number of nearest neighbors. It should be a parameter of your function.
In [ ]:
from sklearn.neighbors import NearestNeighbors
def run_knn(X, K):
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
assert run_knn(X, 2).shape == (1682, 2)
K = 4
indices = run_knn(X, K)
assert isinstance(indices, np.ndarray)
assert indices.shape[1] == K
indices
is more convenient as a DataFrame
In [ ]:
ind = pd.DataFrame(indices)
ind.head()
The index of this DataFrame refers to a particular movie and the rest of the rows are the indices of similar movies. The problem is that this matrix is zero-indexed, while the dataset (movies
table) is indexed from 1.
Both the index and all values should be increased by one.
In [ ]:
def increment_table(df):
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
indices = increment_table(ind)
assert indices.shape[1] == 4
assert indices.index[0] == 1
assert indices.index[-1] == len(indices)
In [ ]:
indices.head()
You'll need multiple merge operations.
Tip: the names of the columns in indices
are not strings but integers, you can rename the columns of a dataframe:
df = df.rename(columns={'old': 'new', 'other old': 'other new'})
You can discard all other columns.
In [ ]:
def find_neighbor_titles(movies, indices):
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
neighbors = find_neighbor_titles(movies, indices)
assert isinstance(neighbors, pd.DataFrame)
assert neighbors.shape[1] == K
In [ ]:
neighbors.head()
In [ ]:
def recover_titles(movies, neighbors):
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
most_similar = recover_titles(movies, neighbors)
assert type(most_similar) == pd.DataFrame
assert "Toy Story (1995)" in most_similar.index
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
Add any extra observations that you find interesting. Did you find any interesting patterns in the dataset? Are certain genres more appealing to particular demographic groups?
You can add multiple observations for extra points.
Please explain your answer in this field (double click on YOUR CODE GOES HERE):
YOUR ANSWER HERE
And the code here:
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
You only need to submit this notebook (no separate report).
Make sure that you save the last version of your notebook.
Please rename the notebook to your neptun code (ABC123.ipynb
), package it in an archive (.zip
) and upload it to the class website.
You are free to continue working on this problem set after class but make sure that you upload it by the end of the week (Sunday).
VERY IMPORTANT Run Kernel->Restart & Run All and make sure that it finishes without errors. If you skip exercises, you need to manually run the remaining cells. You can run a single cell and step to the next cell by pressing Shift+Enter. Skipping exercises won't affect the autograder.
Please fill out this short survey after you completed the problems.