We live in a world surrounded by recommendation systems - our shopping habits, our reading habits, even our political opinions are heavily influenced by recommendation algorithms. So let's take a closer look at how to build a basic recommendation system.
Simply put, a recommendation system learns from your previous behavior and tries to recommend items that are similar to your previous choices. While there are a multitude of approaches for building recommendation systems, we will take a simple approach that is easy to understand and performs reasonably well.
For this exercise we will build a recommendation system that predicts which talks you'll enjoy at a conference.
This project is still in its alpha stage. Bugs, typos, spelling, grammar, terminology - there is every chance of finding bugs. If you have found one, open an issue on GitHub. Pull requests with corrections, fixes, and enhancements will be received with open arms! Don't forget to add yourself to the list of contributors to this project.
With 32 tutorials, 12 sponsor workshops, 16 talks at the education summit, and 95 talks at the main conference, PyCon has a lot to offer. Reading through all the talk descriptions and filtering out the ones you should go to is a tedious process. Let's build a recommendation system that recommends talks from this year's PyCon based on the ones that you went to last year. This way you won't waste any time deciding which talk to go to and can spend more time making friends on the hallway track!
We will be using pandas and scikit-learn to build the recommendation system using the text descriptions of the talks.
In our example, the talk descriptions are the documents. We have two classes for our documents: a label of 0 means the user has chosen to watch the talk later, and a label of 1 means the user has chosen to watch it in person.
In supervised learning we inspect each observation in a given dataset and manually label it. This manually labeled data is used to construct a model that can predict the labels of new data. We will use a supervised learning technique called Support Vector Machines.
In unsupervised learning we do not need any manual labeling. The recommendation system finds patterns in the data to build a model that can be used for recommendation.
The dataset contains the talk descriptions and speaker details from PyCon 2017 and 2018. All the 2017 talk data has been labeled by a user who attended PyCon 2017.
In [3]:
import pandas as pd
import numpy as np

# Load the talk dataset and inspect the first few rows
df = pd.read_csv('talks.csv')
df.head()
Out[3]:
Here is a brief description of the interesting fields.
variable | description
---|---
title | Title of the talk
description | Description of the talk
year | Whether it is a 2017 or a 2018 talk
label | 1 indicates the user preferred seeing the talk in person, 0 indicates they would schedule it for later
Note that the labels of all 2018 talks are set to 1. However, they are only placeholders and are not used in training the model. We will use the 2017 data for training, and predict the labels of the 2018 talks.
Let's start by selecting the 2017 talk descriptions that were labeled by the user for watching in person.
df[(df.year == 2017) & (df.label == 1)]['description']
Print the descriptions of the talks that the user preferred watching in person. How many such talks are there?
In [ ]:
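One possible solution, a minimal sketch reusing the selection above (the variable name in_person_2017 is illustrative, not from the original):

In [ ]:
# 2017 talks that the user labeled for watching in person
in_person_2017 = df[(df.year == 2017) & (df.label == 1)]['description']
print(in_person_2017)

# Count how many such talks there are
print(len(in_person_2017))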
The 2017 talks will be used for training and the 2018 talks will be used for predicting. Set the values of year_labeled and year_predict to appropriate values and print out the values of description_labeled and description_predict.
In [ ]:
year_labeled=
year_predict=
description_labeled = df[df.year==year_labeled]['description']
description_predict = df[df.year==year_predict]['description']
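One possible completion, following the description above (the 2017 talks are labeled, the 2018 talks are to be predicted):

In [ ]:
# 2017 talks carry the user's labels; 2018 talks are what we predict on
year_labeled = 2017
year_predict = 2018
description_labeled = df[df.year == year_labeled]['description']
description_predict = df[df.year == year_predict]['description']
print(description_labeled)
print(description_predict)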
Let's have a quick overview of text analysis. Our end goal is to train a machine learning algorithm by making it go through enough documents from each class to recognize the distinguishing characteristics of documents from a particular class.
Using sklearn, we will build the feature set by tokenization, counting, and normalization of the bi-grams from the text descriptions of the talks. You can find more information on text feature extraction here and on TfidfVectorizer here.
We will use the fit_transform method to learn the vocabulary dictionary and return the document-term matrix.
For the data on which we will do our predictions, we will use the transform method to get the document-term matrix.
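Here is a minimal sketch of these two steps. The variable names are illustrative, and ngram_range=(1, 2), which includes unigrams as well as bi-grams, is an assumption rather than a value from the original:

In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Tokenize, count, and tf-idf normalize the talk descriptions
# (assumed n-gram range: unigrams and bi-grams)
vectorizer = TfidfVectorizer(ngram_range=(1, 2))

# Learn the vocabulary from the labeled 2017 descriptions and
# return their document-term matrix
X_labeled = vectorizer.fit_transform(description_labeled)

# Reuse the same vocabulary to vectorize the 2018 descriptions
X_predict = vectorizer.transform(description_predict)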
Next we will split our data into a training set and a testing set. This lets us evaluate the model on data it has not seen during training and guard against overfitting. Use the train_test_split method from sklearn.model_selection.
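For example, a sketch in which the test size and random seed are assumptions, not values from the original:

In [ ]:
from sklearn.model_selection import train_test_split

# Labels for the 2017 talks
y_labeled = df[df.year == year_labeled]['label']

# Hold out a quarter of the labeled data for testing (assumed split)
X_train, X_test, y_train, y_test = train_test_split(
    X_labeled, y_labeled, test_size=0.25, random_state=42)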
Finally we get to the stage of training the model. We are going to use a linear support vector classifier and check its precision and recall using classification_report.
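A sketch of this final step, under the same assumptions as the snippets above:

In [ ]:
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# Train a linear support vector classifier on the training split
model = LinearSVC()
model.fit(X_train, y_train)

# Check precision and recall on the held-out test split
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

# Predict which 2018 talks the user would want to see in person
predictions = model.predict(X_predict)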