Describing parameters:
In [1]:
PATH_ARTICLE_TOPIC_DISTRIBUTION = "/home/phoenix/Documents/HandsOn/Final/python/Topic Model/model/Article_Topic_Distribution.csv"
PATH_NEWS_ARTICLES = "/home/phoenix/Documents/HandsOn/Final/news_articles.csv"
NO_OF_TOPICS = 150
ARTICLES_READ = [7, 6, 76, 61, 761]
NUM_RECOMMENDED_ARTICLES = 5
In [2]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
In [3]:
article_topic_distribution = pd.read_csv(PATH_ARTICLE_TOPIC_DISTRIBUTION)
article_topic_distribution.shape
Out[3]:
In [4]:
article_topic_distribution.head()
Out[4]:
Generate Article-Topic Distribution matrix
In [5]:
#Pivot the dataframe
article_topic_pivot = article_topic_distribution.pivot(index='Article_Id', columns='Topic_Id', values='Topic_Weight')
#Fill NaN with 0
article_topic_pivot.fillna(value=0, inplace=True)
#Get the values in dataframe as matrix
articles_topic_matrix = article_topic_pivot.values
articles_topic_matrix.shape
Out[5]:
In [6]:
article_topic_pivot.head()
Out[6]:
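As a quick, optional sanity check: each row of the pivot holds one article's topic distribution, so the weights in a row should sum to roughly 1 (slightly less wherever low-weight topics were dropped when the CSV was generated). A minimal check:
# Optional: verify each article's topic weights sum to ~1
row_sums = articles_topic_matrix.sum(axis=1)
print(row_sums.min(), row_sums.max())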
A user vector is represented as the average of the topic vectors of the articles the user has read
In [7]:
#Represent the user by the topic distributions of the read articles
row_idx = np.array(ARTICLES_READ)
read_articles_topic_matrix = articles_topic_matrix[row_idx[:, None]]  # shape: (n_read, 1, NO_OF_TOPICS)
#Averaging over the read articles keeps a 2-D (1, NO_OF_TOPICS) vector, as cosine_similarity expects
user_vector = np.mean(read_articles_topic_matrix, axis=0)
user_vector.shape
Out[7]:
In [8]:
user_vector
Out[8]:
In [9]:
def calculate_cosine_similarity(articles_topic_matrix, user_vector):
    articles_similarity_score = cosine_similarity(articles_topic_matrix, user_vector)
    recommended_articles_id = articles_similarity_score.flatten().argsort()[::-1]
    #Remove read articles from recommendations
    final_recommended_articles_id = [article_id for article_id in recommended_articles_id
                                     if article_id not in ARTICLES_READ][:NUM_RECOMMENDED_ARTICLES]
    return final_recommended_articles_id
In [10]:
recommended_articles_id = calculate_cosine_similarity(articles_topic_matrix, user_vector)
recommended_articles_id
Out[10]:
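If you also want to inspect how similar each recommendation is, a small variant of the function (a hypothetical helper, not part of the original flow) can return the scores alongside the IDs:
# Hypothetical variant returning (article_id, similarity_score) pairs
def recommend_with_scores(articles_topic_matrix, user_vector):
    scores = cosine_similarity(articles_topic_matrix, user_vector).flatten()
    ranked = scores.argsort()[::-1]
    return [(int(i), float(scores[i])) for i in ranked
            if i not in ARTICLES_READ][:NUM_RECOMMENDED_ARTICLES]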
In [11]:
#Recommended articles and their titles
news_articles = pd.read_csv(PATH_NEWS_ARTICLES)
print('Articles Read')
print(news_articles.loc[news_articles['Article_Id'].isin(ARTICLES_READ)]['Title'])
print('\n')
print('Recommended Articles')
print(news_articles.loc[news_articles['Article_Id'].isin(recommended_articles_id)]['Title'])
User Vector = (Alpha) * (Topic Vector) + (1 - Alpha) * (NER Vector)
where Alpha controls the relative weight of the topic vector and the named-entity (NER) vector (here Alpha = 0.5).
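For intuition, here is the blend on made-up 3-topic vectors (toy numbers, not from the model):
# Toy illustration of the Alpha blend (hypothetical 3-topic vectors)
topic_vec = np.array([0.6, 0.3, 0.1])
ner_vec = np.array([0.2, 0.2, 0.6])
0.5 * topic_vec + (1 - 0.5) * ner_vec   # -> array([0.4 , 0.25, 0.35])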
In [12]:
ALPHA = 0.5
DICTIONARY_PATH = "/home/phoenix/Documents/HandsOn/Final/python/Topic Model/model/dictionary_of_words.p"
LDA_MODEL_PATH = "/home/phoenix/Documents/HandsOn/Final/python/Topic Model/model/lda.model"
In [13]:
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.chunk import tree2conlltags
import re
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from nltk.stem.snowball import SnowballStemmer
import pickle
import gensim
from gensim import corpora, models
In [14]:
#Select the read articles and average their topic vectors (same as before)
row_idx = np.array(ARTICLES_READ)
read_articles_topic_matrix = articles_topic_matrix[row_idx[:, None]]
#Calculate the average of the read articles' topic vectors
user_topic_vector = np.mean(read_articles_topic_matrix, axis=0)
user_topic_vector.shape
Out[14]:
In [15]:
# Get NERs of read articles
def get_ner(article):
    ne_tree = ne_chunk(pos_tag(word_tokenize(article)))
    iob_tagged = tree2conlltags(ne_tree)
    #Discard tokens with the 'O' (Other) tag, i.e. tokens outside any named entity
    ner_token = ' '.join([token for token, pos, ner_tag in iob_tagged if ner_tag != u'O'])
    return ner_token
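A quick check on a made-up sentence (requires NLTK's punkt, averaged_perceptron_tagger, maxent_ne_chunker and words data packages; the exact output depends on the chunker):
# Only tokens inside named-entity chunks survive, e.g. person and place names
get_ner("Narendra Modi addressed a rally in New Delhi on Monday")
# -> something like 'Narendra Modi New Delhi'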
In [16]:
articles = news_articles['Content'].tolist()
user_articles_ner = ' '.join([get_ner(articles[i]) for i in ARTICLES_READ])
print "NERs of Read Article =>", user_articles_ner
In [17]:
stop_words = set(stopwords.words('english'))
tknzr = TweetTokenizer()
stemmer = SnowballStemmer("english")
In [18]:
def clean_text(text):
    #Remove punctuation marks and other symbols
    cleaned_text = re.sub(r'[^\w_\s-]', ' ', text)
    return cleaned_text

def tokenize(text):
    words = tknzr.tokenize(text)                                           #tokenization
    filtered_sentence = [w for w in words if not w.lower() in stop_words]  #removing stop words
    stemmed_filtered_tokens = [stemmer.stem(w) for w in filtered_sentence] #stemming
    #Keep alphabetic tokens at least two characters long
    tokens = [i for i in stemmed_filtered_tokens if i.isalpha() and len(i) > 1]
    return tokens
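For example (stemmed forms are approximate and depend on the Snowball stemmer):
# Punctuation stripped, stop words removed, remaining tokens stemmed
tokenize(clean_text("India's economy is growing, say economists!"))
# -> something like ['india', 'economi', 'grow', 'say', 'economist']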
In [19]:
#Clean and tokenize the concatenated NER string
cleaned_text = clean_text(user_articles_ner)
article_vocabulary = tokenize(cleaned_text)
In [20]:
#Load model dictionary
model_dictionary = pickle.load(open(DICTIONARY_PATH,"rb"))
#Generate the article's bag-of-words mapping using IDs associated with the vocabulary
corpus = [model_dictionary.doc2bow(text) for text in [article_vocabulary]]
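doc2bow maps each token to its dictionary ID and counts occurrences; tokens missing from the model dictionary are silently dropped. To peek at the result:
# Inspect the bag-of-words: a list of (token_id, count) pairs
print(corpus[0][:5])
# Map the IDs back to tokens for readability
print([(model_dictionary[token_id], count) for token_id, count in corpus[0][:5]])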
In [21]:
#Load LDA Model
lda = models.LdaModel.load(LDA_MODEL_PATH)
In [22]:
# Get topic distribution for the concatenated NERs
article_topic_distribution=lda.get_document_topics(corpus[0])
article_topic_distribution
Out[22]:
In [23]:
ner_vector = [0] * NO_OF_TOPICS
for topic_id, topic_weight in article_topic_distribution:
    ner_vector[topic_id] = topic_weight
user_ner_vector = np.asarray(ner_vector).reshape(1, NO_OF_TOPICS)
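As an aside, gensim ships a helper that performs the same sparse-to-dense conversion; an equivalent one-liner (same NO_OF_TOPICS assumption):
# Equivalent dense conversion using gensim's helper
from gensim import matutils
user_ner_vector = matutils.sparse2full(article_topic_distribution, NO_OF_TOPICS).reshape(1, NO_OF_TOPICS)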
In [24]:
alpha_topic_vector = ALPHA*user_topic_vector
alpha_ner_vector = (1-ALPHA) * user_ner_vector
user_vector = np.add(alpha_topic_vector,alpha_ner_vector)
user_vector
Out[24]:
In [25]:
recommended_articles_id = calculate_cosine_similarity(articles_topic_matrix, user_vector)
recommended_articles_id
# Cosine similarity scores of the five recommendations: [0.7581, 0.7464, 0.7444, 0.7421, 0.7397]
Out[25]:
In [26]:
#Recommended articles and their titles
news_articles = pd.read_csv(PATH_NEWS_ARTICLES)
print('Articles Read')
print(news_articles.loc[news_articles['Article_Id'].isin(ARTICLES_READ)]['Title'])
print('\n')
print('Recommended Articles')
print(news_articles.loc[news_articles['Article_Id'].isin(recommended_articles_id)]['Title'])