Topic Based Recommender

  1. Represent articles in terms of Topic Vector
  2. Represent user in terms of Topic Vector of read articles
  3. Calculate cosine similarity between read and unread articles
  4. Get the recommended articles

Describing parameters:


In [1]:
PATH_ARTICLE_TOPIC_DISTRIBUTION = "/home/phoenix/Documents/HandsOn/Final/python/Topic Model/model/Article_Topic_Distribution.csv"
PATH_NEWS_ARTICLES = "/home/phoenix/Documents/HandsOn/Final/news_articles.csv"
NO_OF_TOPICS=150
ARTICLES_READ=[7,6,76,61,761]
NUM_RECOMMENDED_ARTICLES=5

In [2]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

1. Represent articles in terms of Topic Vector


In [3]:
article_topic_distribution = pd.read_csv(PATH_ARTICLE_TOPIC_DISTRIBUTION)
article_topic_distribution.shape


Out[3]:
(22186, 3)

In [4]:
article_topic_distribution.head()


Out[4]:
Article_Id Topic_Id Topic_Weight
0 0 25 0.324485
1 0 27 0.131476
2 0 127 0.535940
3 1 5 0.306691
4 1 47 0.277037

Generate Article-Topic Distribution matrix


In [5]:
#Pivot the dataframe
article_topic_pivot = article_topic_distribution.pivot(index='Article_Id', columns='Topic_Id', values='Topic_Weight')
#Fill NaN with 0
article_topic_pivot.fillna(value=0, inplace=True)
#Get the values in dataframe as matrix
articles_topic_matrix = article_topic_pivot.values
articles_topic_matrix.shape


Out[5]:
(4831, 150)

In [6]:
article_topic_pivot.head()


Out[6]:
Topic_Id 0 1 2 3 4 5 6 7 8 9 ... 140 141 142 143 144 145 146 147 148 149
Article_Id
0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.000000 0.0 0.306691 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.015589 0.0 0.077002 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.000000 0.0 0.396528 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 150 columns
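
The pivot step can be sketched on toy data as follows. The values and the NUM_TOPICS_TOY constant are made up for illustration. The reindex call is an addition that is not in the notebook: it is one way to guarantee a column for every topic even when a topic never appears in the CSV.

```python
import pandas as pd

# Toy long-format data: one row per (article, topic) pair with a non-zero weight.
long_df = pd.DataFrame({
    "Article_Id":   [0, 0, 1, 2],
    "Topic_Id":     [1, 3, 0, 3],
    "Topic_Weight": [0.6, 0.4, 1.0, 0.5],
})
NUM_TOPICS_TOY = 4  # hypothetical topic count for this toy example

wide = (
    long_df.pivot(index="Article_Id", columns="Topic_Id", values="Topic_Weight")
    .reindex(columns=range(NUM_TOPICS_TOY))  # topic 2 is unused, so add its column
    .fillna(0.0)                             # missing (article, topic) pairs become 0
)
toy_matrix = wide.to_numpy()
print(toy_matrix.shape)
```

Each row of toy_matrix is one article and each column one topic, which is the layout the later similarity step expects.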

2. Represent user in terms of Topic Vector of read articles

The user vector is the average of the topic vectors of the articles the user has read.


In [7]:
#Select the topic vectors of the read articles (this treats each Article_Id as a row position in the matrix)
row_idx = np.array(ARTICLES_READ)
read_articles_topic_matrix = articles_topic_matrix[row_idx[:, None]]   #shape (n_read, 1, n_topics)
#Average over the read articles to get the user vector
user_vector = np.mean(read_articles_topic_matrix, axis=0)
user_vector.shape


Out[7]:
(1, 150)
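
A small sketch of the averaging step on made-up data may help. The names toy_matrix and toy_read are hypothetical. It compares the indexing used above with a plain row selection that keeps the leading dimension; both should give the same (1, n_topics) vector.

```python
import numpy as np

# Toy article-topic matrix: 4 articles, 3 topics (made-up weights).
toy_matrix = np.array([
    [0.6, 0.4, 0.0],
    [0.0, 1.0, 0.0],
    [0.2, 0.0, 0.8],
    [0.0, 0.0, 1.0],
])
toy_read = [0, 2]  # hypothetical ids of read articles, used as row positions

# Indexing as in the notebook: shape (2, 1, 3), averaged over axis 0 -> (1, 3)
via_notebook = np.mean(toy_matrix[np.array(toy_read)[:, None]], axis=0)

# Plain row selection; keepdims=True keeps the result two-dimensional
via_keepdims = toy_matrix[toy_read].mean(axis=0, keepdims=True)

print(via_keepdims)
```

Both variants index the matrix by position, so they are only correct when row positions coincide with Article_Id values.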

In [8]:
user_vector


Out[8]:
array([[ 0.        ,  0.        ,  0.        ,  0.02488209,  0.06438433,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.02753025,
         0.        ,  0.18989699,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.04683422,
         0.        ,  0.06889868,  0.        ,  0.00411056,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.00662661,
         0.        ,  0.        ,  0.09912603,  0.        ,  0.        ,
         0.01028336,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.00661727,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.05856521,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.04954107,  0.01280254,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.00393764,  0.        ,
         0.        ,  0.03582032,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.07245383,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.08082968,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.12301701,  0.        ,  0.        ,  0.        ,  0.        ]])

3. Calculate cosine similarity between read and unread articles


In [9]:
def calculate_cosine_similarity(articles_topic_matrix, user_vector):
    #Similarity of every article to the user vector, shape (n_articles, 1)
    articles_similarity_score = cosine_similarity(articles_topic_matrix, user_vector)
    #Row indices of the articles, sorted by descending similarity
    recommended_articles_id = articles_similarity_score.flatten().argsort()[::-1]
    #Remove read articles from recommendations and keep the top NUM_RECOMMENDED_ARTICLES
    final_recommended_articles_id = [article_id for article_id in recommended_articles_id
                                     if article_id not in ARTICLES_READ][:NUM_RECOMMENDED_ARTICLES]
    return final_recommended_articles_id

In [10]:
recommended_articles_id = calculate_cosine_similarity(articles_topic_matrix, user_vector)
recommended_articles_id


Out[10]:
[864, 2150, 2450, 629, 3643]
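
The ranking step can also be sketched without the global variables. The version below is an illustration, not the notebook's function: rank_unread_by_cosine and the toy data are hypothetical, the cosine similarity is computed directly with NumPy instead of scikit-learn, and all-zero rows are given a score of 0 by assumption.

```python
import numpy as np

def rank_unread_by_cosine(article_matrix, user_vec, read_ids, top_n):
    """Return the row indices of the top_n unread articles most similar to user_vec."""
    article_matrix = np.asarray(article_matrix, dtype=float)
    user_vec = np.asarray(user_vec, dtype=float).reshape(-1)
    dots = article_matrix @ user_vec
    denom = np.linalg.norm(article_matrix, axis=1) * np.linalg.norm(user_vec)
    # Cosine similarity; rows with zero norm keep a score of 0 (an assumption of this sketch)
    scores = np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)
    order = np.argsort(-scores, kind="stable")  # descending similarity
    read = set(read_ids)                        # set membership is cheaper than list lookup
    return [int(i) for i in order if int(i) not in read][:top_n]

toy_articles = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.5, 0.5],
    [0.0, 0.0],
])
toy_user = np.array([[1.0, 0.0]])
toy_recs = rank_unread_by_cosine(toy_articles, toy_user, read_ids=[0], top_n=2)
print(toy_recs)
```

Passing the read ids and the number of recommendations as arguments makes the function reusable for a different user without changing notebook-level constants.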

4. Get the recommended articles


In [11]:
#Recommended Articles and their title
news_articles = pd.read_csv(PATH_NEWS_ARTICLES)
print('Articles Read')
print(news_articles.loc[news_articles['Article_Id'].isin(ARTICLES_READ)]['Title'])
print('\n')
print('Recommender ')
print(news_articles.loc[news_articles['Article_Id'].isin(recommended_articles_id)]['Title'])


Articles Read
6      Infosys shares likely to fall on Tuesday after...
7      Dialogue crucial in finding permanent solution...
61     Revathy to direct Queen s Tamil  Telugu remake...
76     When cricketer R Ashwin started fans club for ...
761     Baahubali  to have world television premiere ...
Name: Title, dtype: object


Recommender 
629      Dilwale  review roundup  What critics have to...
864     Shah Rukh Khan-Kajol appear on Vijay TV show  ...
2150    Year 2014 for Aamir Khan  Shah Rukh Khan and S...
2450    Times Celebex  Akshay  Katrina top the list  S...
3643    Will Aditya Chopra Bring Shah Rukh Khan and Ra...
Name: Title, dtype: object
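
Note that filtering with isin returns the titles in the order of the DataFrame, not in ranking order (compare the order of Out[10] with the listing above). If the ranking order matters, one possible approach is to index by Article_Id and select with the ranked list. The toy DataFrame and ids below are made up for illustration.

```python
import pandas as pd

# Toy stand-in for news_articles (hypothetical ids and titles).
toy_news = pd.DataFrame({
    "Article_Id": [10, 11, 12, 13],
    "Title": ["A", "B", "C", "D"],
})
toy_ranked_ids = [12, 10, 13]  # hypothetical ids, best match first

# .loc with a list of labels returns the rows in the order of that list
titles_in_rank_order = toy_news.set_index("Article_Id").loc[toy_ranked_ids, "Title"]
print(titles_in_rank_order.tolist())
```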

Topics + NER Recommender

Topic + NER Based Recommender

  1. Represent user in terms of -
     Alpha * [Topic Vector] + (1 - Alpha) * [NER Vector]

     where
     Alpha is a weight in the range [0, 1]
     [Topic Vector] => Average of the topic vectors of the read articles
     [NER Vector] => Topic vector representation of the NERs (named entities) extracted from the read articles and concatenated
  2. Calculate cosine similarity between user vector and articles Topic matrix
  3. Get the recommended articles
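
As a worked example of the weighting in step 1, here is a minimal sketch on made-up three-topic vectors. The function name blend_user_vector and the numbers are hypothetical.

```python
import numpy as np

def blend_user_vector(topic_vec, ner_vec, alpha):
    """Weighted sum: alpha * topic_vec + (1 - alpha) * ner_vec, with alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * np.asarray(topic_vec, dtype=float) + (1.0 - alpha) * np.asarray(ner_vec, dtype=float)

toy_topic_vec = np.array([[0.2, 0.8, 0.0]])  # made-up topic vector of the read articles
toy_ner_vec = np.array([[0.0, 0.4, 0.6]])    # made-up topic vector of their named entities

blended = blend_user_vector(toy_topic_vec, toy_ner_vec, alpha=0.5)
print(blended)
```

With alpha = 1 the result reduces to the topic vector alone, and with alpha = 0 to the NER vector alone, so alpha controls how much weight the named entities get.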

In [12]:
ALPHA = 0.5
DICTIONARY_PATH = "/home/phoenix/Documents/HandsOn/Final/python/Topic Model/model/dictionary_of_words.p"
LDA_MODEL_PATH = "/home/phoenix/Documents/HandsOn/Final/python/Topic Model/model/lda.model"

In [13]:
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.chunk import tree2conlltags
import re
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from nltk.stem.snowball import SnowballStemmer
import pickle
import gensim
from gensim import corpora, models

1. Represent User in terms of Topic Distribution and NER

  1. Represent user in terms of read article topic distribution
  2. Represent user in terms of NERs associated with read articles
     2.1 Get NERs of read articles
     2.2 Load LDA model
     2.3 Get topic distribution for the concatenated NERs
  3. Generate user vector

1.1. Represent user in terms of read article topic distribution


In [14]:
row_idx = np.array(ARTICLES_READ)
read_articles_topic_matrix=articles_topic_matrix[row_idx[:, None]]
#Calculate the average of read articles topic vector 
user_topic_vector = np.mean(read_articles_topic_matrix, axis=0)
user_topic_vector.shape


Out[14]:
(1, 150)

1.2. Represent user in terms of NERs associated with read articles


In [15]:
# Get NERs of read articles
def get_ner(article):
    ne_tree = ne_chunk(pos_tag(word_tokenize(article)))
    iob_tagged = tree2conlltags(ne_tree)
    #Keep only tokens inside a named entity; the IOB tag 'O' marks tokens outside any entity
    ner_token = ' '.join([token for token, pos, ner_tag in iob_tagged if ner_tag != 'O'])
    return ner_token

In [16]:
articles = news_articles['Content'].tolist()
user_articles_ner = ' '.join([get_ner(articles[i]) for i in ARTICLES_READ])
print("NERs of Read Article => " + user_articles_ner)


NERs of Read Article => Narendra Modi Kashmir Modi Jammu Kashmir Modi Burhan Wani Omar Abdullah Abdullah National Conference Congress PCC CPI Tarigami Valley Modi Kashmir Jammu Kashmir Infosys Royal Bank Scotland RBS Williams Glyn Infosys IBM Infosys Application Delivery India Royal Bank Scotland Williams Glyn RBS Infosys Infosys Infosys Bombay Stock Infosys FY2017 Infosys YoY Cricketer Ravichandran Trisha Krishnan Ashwin Tamil Trisha Tamil Lesa Lesa Veteran Revathy Bollywood Queen Actress Suhasini Mani Ratnam Vikas Bahl Queen Paris Kangana Queen Telugu Tamil Filmmaker Thiagarajan Queen Telugu Tamil Revathy Suhasini Mani Ratnam Revathy Suhasini Mani Ratnam Suhasini Revathy Suhasini Mani Ratnam South Indian Telugu Suhasini Rajamouli Baahubali Malayalam Mazhavil Manorama Malayalam Prabhas Rana Daggubati Anushka Shetty Tamannaah Bhatia Baahubali Anushka Tamannaah Rana Prabhas Rajamouli Mazhavil Manorma Manorama Music Baahubali Malayalam VCD DVD Telugu MAA Baahubali Dussehra

In [17]:
stop_words = set(stopwords.words('english'))
tknzr = TweetTokenizer()
stemmer = SnowballStemmer("english")

In [18]:
def clean_text(text):
    cleaned_text = re.sub(r'[^\w\s-]', ' ', text)                                           #replace punctuation marks and other symbols with spaces
    return cleaned_text

def tokenize(text):
    words = tknzr.tokenize(text)                                                            #tokenization
    filtered_words = [w for w in words if w.lower() not in stop_words]                      #removing stop words
    stemmed_tokens = [stemmer.stem(w) for w in filtered_words]                              #stemming
    tokens = [t for t in stemmed_tokens if t.isalpha() and len(t) > 1]                      #keep alphabetic tokens longer than one character
    return tokens

In [19]:
#Cleaning the article
cleaned_text = clean_text(user_articles_ner)
article_vocabulary = tokenize(cleaned_text)

In [20]:
#Load model dictionary
with open(DICTIONARY_PATH, "rb") as dictionary_file:
    model_dictionary = pickle.load(dictionary_file)
#Map the tokens to (token_id, count) pairs using the model vocabulary
corpus = [model_dictionary.doc2bow(text) for text in [article_vocabulary]]

In [21]:
#Load LDA Model
lda =  models.LdaModel.load(LDA_MODEL_PATH)

In [22]:
# Get topic distribution for the concatenated NERs
article_topic_distribution=lda.get_document_topics(corpus[0])
article_topic_distribution


Out[22]:
[(9, 0.016833535075786221),
 (16, 0.13360130412473772),
 (21, 0.011354918964910036),
 (29, 0.048363063836432151),
 (31, 0.18383978754651545),
 (44, 0.016568883655345965),
 (84, 0.017429078934066415),
 (93, 0.041131451368969535),
 (106, 0.12909919386972013),
 (119, 0.20018375535066402),
 (127, 0.10765513761665123),
 (128, 0.036829724632851543),
 (145, 0.041718476547869268)]

In [23]:
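#Expand the sparse (topic_id, weight) pairs into a dense vector with one entry per topic
ner_vector = [0]*NO_OF_TOPICS
for topic_id, topic_weight in article_topic_distribution:
    ner_vector[topic_id] = topic_weight
user_ner_vector = np.asarray(ner_vector).reshape(1, NO_OF_TOPICS)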
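
The LDA output in Out[22] lists only 13 of the 150 topics, so the remaining entries have to be filled with zeros before the vector can be combined with the topic vector. A minimal sketch of that expansion, with a hypothetical helper name and made-up pairs:

```python
import numpy as np

def sparse_topics_to_dense(topic_pairs, num_topics):
    """Turn [(topic_id, weight), ...] into a dense (1, num_topics) row vector."""
    dense = np.zeros((1, num_topics))
    for topic_id, weight in topic_pairs:
        dense[0, topic_id] = weight  # topics not listed keep a weight of 0
    return dense

toy_pairs = [(1, 0.25), (3, 0.75)]  # made-up sparse topic distribution
toy_dense = sparse_topics_to_dense(toy_pairs, num_topics=5)
print(toy_dense)
```

Passing the topic count as an argument avoids hard-coding 150 in the reshape.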

1.3. Generate user vector


In [24]:
alpha_topic_vector = ALPHA*user_topic_vector
alpha_ner_vector = (1-ALPHA) * user_ner_vector
user_vector = np.add(alpha_topic_vector,alpha_ner_vector)
user_vector


Out[24]:
array([[ 0.        ,  0.        ,  0.        ,  0.01244104,  0.03219216,
         0.        ,  0.        ,  0.        ,  0.        ,  0.00841677,
         0.        ,  0.        ,  0.        ,  0.        ,  0.01376513,
         0.        ,  0.16174915,  0.        ,  0.        ,  0.        ,
         0.        ,  0.00567746,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.04759864,
         0.        ,  0.12636923,  0.        ,  0.00205528,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.01159775,
         0.        ,  0.        ,  0.04956302,  0.        ,  0.        ,
         0.00514168,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.00330864,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.03799714,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.04533626,  0.00640127,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.00196882,  0.        ,
         0.        ,  0.08245976,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.13631879,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.09424241,  0.01841486,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.08236774,  0.        ,  0.        ,  0.        ,  0.        ]])

2. Calculate cosine similarity between user vector and articles Topic matrix


In [25]:
recommended_articles_id = calculate_cosine_similarity(articles_topic_matrix, user_vector)
recommended_articles_id
# [array([ 0.75807146]), array([ 0.74644157]), array([ 0.74440326]), array([ 0.7420562]), array([ 0.73966259])]


Out[25]:
[1913, 2003, 1995, 1997, 864]

3. Get recommended articles


In [26]:
#Recommended Articles and their title
news_articles = pd.read_csv(PATH_NEWS_ARTICLES)
print('Articles Read')
print(news_articles.loc[news_articles['Article_Id'].isin(ARTICLES_READ)]['Title'])
print('\n')
print('Recommender ')
print(news_articles.loc[news_articles['Article_Id'].isin(recommended_articles_id)]['Title'])


Articles Read
6      Infosys shares likely to fall on Tuesday after...
7      Dialogue crucial in finding permanent solution...
61     Revathy to direct Queen s Tamil  Telugu remake...
76     When cricketer R Ashwin started fans club for ...
761     Baahubali  to have world television premiere ...
Name: Title, dtype: object


Recommender 
864     Shah Rukh Khan-Kajol appear on Vijay TV show  ...
1913    AIB Roast Controversy and Stringent Censor Boa...
1995     Bajirao Mastani  Director Bhansali found AIB ...
1997    Twinkle Khanna s Blog on AIB Roast Goes Viral ...
2003    Deepika Padukone  Sonakshi Sinha  Alia Bhatt C...
Name: Title, dtype: object
