TF-IDF based Recommender System

A recommender system that uses TF-IDF as the vector representation of documents

TF-IDF Based Recommender

  1. Represent articles as bags of words
  2. Represent the user by the words of the articles they have read
  3. Generate TF-IDF vectors for the user's read articles and the unread articles
  4. Calculate the cosine similarity between the user's read articles and the unread articles
  5. Get the recommended articles

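Parameters:

  1. PATH_NEWS_ARTICLES: path to the CSV file containing the news articles
  2. ARTICLES_READ: IDs of the articles the user has already read
  3. NUM_RECOMMENDED_ARTICLES: number of articles to recommend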


In [1]:
PATH_NEWS_ARTICLES="/home/phoenix/Documents/HandsOn/Final/news_articles.csv"
ARTICLES_READ=[2,7]
NUM_RECOMMENDED_ARTICLES=5

In [2]:
try:
    import numpy
    import pandas as pd
    import pickle as pk
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    import re
    from nltk.stem.snowball import SnowballStemmer
    import nltk
    stemmer = SnowballStemmer("english")
except ImportError:
    print('You are missing some packages! ' \
          'We will try installing them before continuing!')
    !pip install "numpy" "pandas" "scikit-learn" "nltk"
    import numpy
    import pandas as pd
    import pickle as pk
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    import re
    from nltk.stem.snowball import SnowballStemmer
    import nltk
    stemmer = SnowballStemmer("english")
    print('Done!')

1. Represent articles as bags of words

  1. Read the CSV file to get the article ID, title and news content
  2. Remove punctuation marks and other symbols from each article
  3. Tokenize each article
  4. Stem every token of each article

In [3]:
news_articles = pd.read_csv(PATH_NEWS_ARTICLES)
news_articles.head()


Out[3]:
Article_Id Title Author Date Content URL
0 0 14 dead after bus falls into canal in Telangan... Devyani Sultania August 22, 2016 12:34 IST At least 14 people died and 17 others were inj... http://www.ibtimes.co.in/14-dead-after-bus-fal...
1 1 Pratibha Tiwari molested on busy road Saath ... Suparno Sarkar August 22, 2016 19:47 IST TV actress Pratibha Tiwari who is best known ... NaN
2 2 US South Korea begin joint military drill ami... Namrata Tripathi August 22, 2016 18:10 IST The United States and South Korea began a join... http://www.ibtimes.co.in/us-south-korea-begin-...
3 3 Illegal construction in Bengaluru Will my hou... S V Krishnamachari August 22, 2016 17:39 IST The relentless drive by Bengaluru s Bangalore... http://www.ibtimes.co.in/illegal-construction-...
4 4 Punjab Gau Rakshak Dal chief held for assaulti... Pranshu Rathee August 22, 2016 17:34 IST Punjab Gau Raksha Dal chief Satish Kumar and h... http://www.ibtimes.co.in/punjab-gau-rakshak-da...

In [4]:
#Select relevant columns and remove rows with missing values
news_articles = news_articles[['Article_Id','Title','Content']].dropna()
#articles is a list of all articles
articles = news_articles['Content'].tolist()
articles[0] #an uncleaned article


Out[4]:
'At least 14 people died and 17 others were injured after a bus travelling from Hyderabad to Kakinada plunged into a canal from a bridge on the accident-prone stretch of the Hyderabad-Khammam highway in Telangana early Monday morning \nThe injured were admitted to the Government General Hospital for treatment \n\n\nSeven people died on the spot and the others succumbed to injuries while undergoing treatment at the hospital  The passengers belonged to the East and West Godavari districts of Andhra Pradesh \nThe bus  owned by private operator Yatra Genie  commenced its journey from Hyderabad at 11 30 p m  on Sunday  Khammam Superintendent of Police Shah Nawaz Khan was quoted by the Hindustan Times as saying \nThe accident happened around 2 30 a m  when the driver slammed the brakes to avoid a collision with another vehicle coming from the opposite direction on a bridge over Nagarjunsagar project left canal at Nayankangudem village in Khammam district  the daily reported  The bus hit the parapet wall of the bridge and nose-dived into the canal \nThe driver of the bus was apparently driving at high speed due to which he lost control of the vehicle  following which it fell into the canal under Kusumanchi mandal  the Deccan Herald reported \nTravellers immediately informed the police who rushed to the accident scene and began the rescue operations '

In [5]:
def clean_tokenize(document):
    document = re.sub(r'[^\w\s-]', ' ', document)      #replace punctuation marks and other symbols with spaces (hyphens are kept)
    tokens = nltk.word_tokenize(document)              #split the text into word tokens (needs NLTK's Punkt tokenizer data)
    cleaned_article = ' '.join([stemmer.stem(item) for item in tokens])    #stem each token (the stemmer also lowercases it)
    return cleaned_article

In [6]:
#list() makes the result indexable under Python 3, where map() returns an iterator
cleaned_articles = list(map(clean_tokenize, articles))
cleaned_articles[0]  #a cleaned, tokenized and stemmed article


Out[6]:
u'at least 14 peopl die and 17 other were injur after a bus travel from hyderabad to kakinada plung into a canal from a bridg on the accident-pron stretch of the hyderabad-khammam highway in telangana earli monday morn the injur were admit to the govern general hospit for treatment seven peopl die on the spot and the other succumb to injuri while undergo treatment at the hospit the passeng belong to the east and west godavari district of andhra pradesh the bus own by privat oper yatra geni commenc it journey from hyderabad at 11 30 p m on sunday khammam superintend of polic shah nawaz khan was quot by the hindustan time as say the accid happen around 2 30 a m when the driver slam the brake to avoid a collis with anoth vehicl come from the opposit direct on a bridg over nagarjunsagar project left canal at nayankangudem villag in khammam district the daili report the bus hit the parapet wall of the bridg and nose-div into the canal the driver of the bus was appar drive at high speed due to which he lost control of the vehicl follow which it fell into the canal under kusumanchi mandal the deccan herald report travel immedi inform the polic who rush to the accid scene and began the rescu oper'
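
To see what the cleaning regex in `clean_tokenize` keeps and what it drops, here is a small sketch that uses only the standard library. It substitutes a plain whitespace split for `nltk.word_tokenize` and skips stemming, so it approximates the first cleaning step only. The sample string and the helper name `strip_symbols` are made up for illustration.

```python
import re

# Same character class as in clean_tokenize, written as a raw string:
# keep word characters (letters, digits, underscore), whitespace and hyphens.
PUNCTUATION = re.compile(r'[^\w\s-]')

def strip_symbols(text):
    """Replace every character outside the kept set with a space."""
    return PUNCTUATION.sub(' ', text)

sample = "accident-prone stretch, at 11:30 p.m."   # hypothetical input
stripped = strip_symbols(sample)
tokens = stripped.split()   # whitespace split, standing in for nltk.word_tokenize

print(tokens)
```

Hyphenated words such as `accident-prone` survive as single tokens, which is why stems like `accident-pron` appear in the cleaned article above, while times and abbreviations are broken into short fragments.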

2. Represent the user by the words of the articles they have read


In [7]:
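#Build the user representation by concatenating the cleaned text of the read articles
#Note: ARTICLES_READ is used as list positions here, which assumes Article_Id equals the row position after dropna()
user_articles = ' '.join(cleaned_articles[i] for i in ARTICLES_READ)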

In [8]:
user_articles


Out[8]:
u'the unit state and south korea began a joint militari drill on monday which prompt threat from north korea the latter has late receiv strong critic worldwid for defi sanction from the unit nation secur council unsc by launch sever ballist missil such action have led to tighter sanction for north korea by the un north korea consid the joint militari drill as prepar for invas and has threaten a pre-empt nuclear strike if the u s and south korea continu the oper it had also conduct a nuclear test in januari which further isol it the ulchi freedom guardian exercis will continu till sept 2 and around 25 000 u s troop are expect to join it the us-l un command militari armistic commiss said that it had notifi the north korean armi that the joint militari drill between the two nation was not provoc from this moment the first-strik combin unit of the korean peopl s armi keep themselv fulli readi to mount a preemptiv retaliatori strike at all enemi attack group involv in ulji freedom guardian a kpa spokesman said in a statement it was recent announc by south korea that the north s deputi ambassador in london had defect and arriv in seoul along with his famili the move dealt a blow to kim jong un the leader of the north korean regim prime minist narendra modi has express deep concern and pain at the unrest and unab violenc in kashmir modi has urg all polit parti to unanim support a perman and last solut within the framework of the constitut to the problem of jammu and kashmir prime minist modi highlight the need for dialogu for restor of normalci in the valley as the unrest that began sinc the kill of hizb-ul-mujahideen leader burhan wani on juli 8 enter the 45th day so far 68 peopl have been kill a 75-minute-long meet with a joint 20-member opposit deleg that was led by former j k chief minist omar abdullah and addit compris seven of abdullah s nation confer mlas along with congress legisl led by pcc chief g a mir and cpi m mla m y tarigami present a memorandum to prime 
minist modi they collect made an appeal for a polit approach to resolv the crisi in the valley and to ensur that the mistak of the past are not repeat modi appreci the construct suggest and reiter his govern s commit to the welfar and develop of the peopl of kashmir and said those who lost their live dure recent disturb are part of us our nation whether the live lost are of our youth secur personnel or polic it distress us govern and the nation stand with the state of jammu and kashmir 10 11 12 13 14 15'

3. Generate TF-IDF vectors for the user's read articles and the unread articles


In [9]:
#Fit a TF-IDF vectorizer on the entire corpus (drop English stop words and terms found in fewer than 2 articles)
tfidf_matrix = TfidfVectorizer(stop_words='english', min_df=2)
article_tfidf_matrix = tfidf_matrix.fit_transform(cleaned_articles)
article_tfidf_matrix #sparse matrix with one TF-IDF row per article


Out[9]:
<4831x16009 sparse matrix of type '<type 'numpy.float64'>'
	with 468648 stored elements in Compressed Sparse Row format>

In [10]:
#Transform the user's read-article text into a TF-IDF vector using the vocabulary fitted above
user_article_tfidf_vector = tfidf_matrix.transform([user_articles])
user_article_tfidf_vector


Out[10]:
<1x16009 sparse matrix of type '<type 'numpy.float64'>'
	with 188 stored elements in Compressed Sparse Row format>

In [11]:
user_article_tfidf_vector.toarray()


Out[11]:
array([[ 0.        ,  0.03127116,  0.        , ...,  0.        ,
         0.        ,  0.        ]])
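
The weights in a vector like the one above follow from the TF-IDF formula. The sketch below recomputes such weights in plain Python for a toy corpus, assuming scikit-learn's documented defaults for `TfidfVectorizer`: raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation. The corpus and the helper names `fit_idf` and `tfidf_vector` are hypothetical, and the `stop_words` and `min_df` filtering used in this notebook is left out.

```python
import math
from collections import Counter

def fit_idf(documents):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.split()))   # count each term once per document
    return {term: math.log((1.0 + n_docs) / (1.0 + df)) + 1.0
            for term, df in doc_freq.items()}

def tfidf_vector(document, idf):
    """L2-normalised TF-IDF weights; terms unseen during fitting are ignored."""
    counts = Counter(t for t in document.split() if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

# Toy corpus of already cleaned and stemmed "articles" (made up).
corpus = ["korea militari drill", "kashmir unrest modi", "korea sanction un"]
idf = fit_idf(corpus)

# "election" is not in the fitted vocabulary, so it gets no weight,
# just as transform() ignores words the vectorizer never saw.
vec = tfidf_vector("korea korea drill election", idf)
print(vec)
```

A term that occurs in many articles (`korea` here) gets a lower idf than a rarer one (`drill`), so rare terms shared between the user profile and an article contribute more to the similarity.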

4. Calculate the cosine similarity between the user's read articles and the unread articles


In [12]:
articles_similarity_score=cosine_similarity(article_tfidf_matrix, user_article_tfidf_vector)
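
`cosine_similarity` returns the cosine of the angle between each article vector and the user vector. Below is a minimal sketch of the same computation on sparse vectors stored as dictionaries. The vectors and the helper name `cosine` are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0   # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

user = {"korea": 0.8, "drill": 0.6}            # hypothetical user profile
article_a = {"korea": 0.6, "sanction": 0.8}    # shares one term with the user
article_b = {"kashmir": 1.0}                   # shares no term with the user

print(cosine(user, article_a))
print(cosine(user, article_b))
```

Because `TfidfVectorizer` L2-normalises its rows by default, both norms are 1 for the real vectors and the cosine similarity reduces to a dot product.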

In [13]:
recommended_articles_id = articles_similarity_score.flatten().argsort()[::-1]

In [14]:
recommended_articles_id


Out[14]:
array([   2,    7, 3326, ...,  210,  622,  262])

In [15]:
#Remove read articles from recommendations
final_recommended_articles_id = [article_id for article_id in recommended_articles_id 
                                 if article_id not in ARTICLES_READ ][:NUM_RECOMMENDED_ARTICLES]
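
Sorting all scores in descending order puts the read articles first, since their own words make up the user vector (see `Out[14]`), which is why they are filtered out before the top N are taken. Here is a sketch of the same selection in plain Python. The scores and the helper name `top_n_unread` are hypothetical.

```python
def top_n_unread(scores, read_ids, n):
    """Positions of the n highest scores, skipping already-read positions."""
    read = set(read_ids)   # set membership tests are fast
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if i not in read][:n]

# Made-up similarity scores; positions 2 and 7 are the read articles.
scores = [0.10, 0.05, 0.90, 0.40, 0.00, 0.35, 0.20, 0.85]
print(top_n_unread(scores, read_ids=[2, 7], n=3))   # prints [3, 5, 6]
```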

5. Get the recommended articles


In [16]:
final_recommended_articles_id


Out[16]:
[3326, 2862, 2808, 2724, 2950]

In [17]:
#Titles of the read and the recommended articles
#Note: isin() returns rows in DataFrame order, not in ranking order
print('Articles Read')
print(news_articles.loc[news_articles['Article_Id'].isin(ARTICLES_READ), 'Title'])
print('\n')
print('Recommender ')
print(news_articles.loc[news_articles['Article_Id'].isin(final_recommended_articles_id), 'Title'])


Articles Read
2    US  South Korea begin joint military drill ami...
7    Dialogue crucial in finding permanent solution...
Name: Title, dtype: object


Recommender 
2724    PM Modi says at all-party meeting that PoK is ...
2808    J K  CM Mufti blames  vested interests  for Ka...
2862    J K  PM Modi appeals for peace in Valley  assu...
2950    Kashmir  Death toll rises to 8 in protests ove...
3326    US  China to  fully implement  sanctions again...
Name: Title, dtype: object
