Topic Modelling using Latent-Dirichlet Allocation


In [0]:
## required installation for LDA visualization
!pip install pyLDAvis

## imports

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import pyLDAvis
from pyLDAvis import sklearn
pyLDAvis.enable_notebook()

Step 1: Loading and Understanding Data

As part of this step we will load an existing dataset and load it into a pandas dataframe. We will also try to on a brief understand the data. - Source of the dataset is from Kaggle [News category classifer](https://www.kaggle.com/hengzheng/news-category-classifier-val-acc-0-65).

In [0]:
### Reading the dataset from path

filename = 'News_Category_Dataset_v2.json'
data = pd.read_json(filename, lines=True)
data.head()


Out[0]:
category headline authors link short_description date
0 CRIME There Were 2 Mass Shootings In Texas Last Week... Melissa Jeltsen https://www.huffingtonpost.com/entry/texas-ama... She left her husband. He killed their children... 2018-05-26
1 ENTERTAINMENT Will Smith Joins Diplo And Nicky Jam For The 2... Andy McDonald https://www.huffingtonpost.com/entry/will-smit... Of course it has a song. 2018-05-26
2 ENTERTAINMENT Hugh Grant Marries For The First Time At Age 57 Ron Dicker https://www.huffingtonpost.com/entry/hugh-gran... The actor and his longtime girlfriend Anna Ebe... 2018-05-26
3 ENTERTAINMENT Jim Carrey Blasts 'Castrato' Adam Schiff And D... Ron Dicker https://www.huffingtonpost.com/entry/jim-carre... The actor gives Dems an ass-kicking for not fi... 2018-05-26
4 ENTERTAINMENT Julianna Margulies Uses Donald Trump Poop Bags... Ron Dicker https://www.huffingtonpost.com/entry/julianna-... The "Dietland" actress said using the bags is ... 2018-05-26

In [0]:
### data dimensions (rows, columns) of the dataset we are dealing with

data.shape


Out[0]:
(200853, 6)

In [0]:
### Total articles by category spread - viz

plt.figure(figsize=(20,5))
sns.set_style("whitegrid")
sns.countplot(x='category',data=data, orient='h', palette='husl')
plt.xticks(rotation=90)
plt.title("Category count of article")
plt.show()


As we can see in the above diagram, a lot of new relates to Politics and its related items. Also we can understand a total of 20 new categories are defined in this dataset. So as part of Topic modelling exercise we can try to categories the dataset into 20 topics.

Step 2: Transforming the dataset

For the purpose of this demo and blog, we will do the following: 1. **Combine** both the **Headline and Short Description** into one single column to bring more context to the news and corpus. Calling it as: ```Combined_Description``` 2. **Drop** rest of the attributes from the dataframe other than Combined_Description and Categories.

In [0]:
### tranform the dataset to fit the original requirement

data['Combined_Description'] = data['headline'] + data['short_description']

filtered_data = data[['category','Combined_Description']]

filtered_data.head()


Out[0]:
category Combined_Description
0 CRIME There Were 2 Mass Shootings In Texas Last Week...
1 ENTERTAINMENT Will Smith Joins Diplo And Nicky Jam For The 2...
2 ENTERTAINMENT Hugh Grant Marries For The First Time At Age 5...
3 ENTERTAINMENT Jim Carrey Blasts 'Castrato' Adam Schiff And D...
4 ENTERTAINMENT Julianna Margulies Uses Donald Trump Poop Bags...

In [0]:
## checking the dimensions of filtered data
filtered_data.shape


Out[0]:
(200853, 2)
Applying TFIDFVectorizer to pre-process the data into vectors. - max_df : Ignore the words that occurs more than 95% of the corpus. - min_df : Accept the words in preparation of vocab that occurs in atleast 2 of the documents in the corpus. - stop_words : Remove the stop words. We can do this in separate steps or in a single step.

In [0]:
df_tfidf = TfidfVectorizer(max_df=0.5, min_df=10, stop_words='english', lowercase=True)

In [0]:
df_tfidf_transformed = df_tfidf.fit_transform(filtered_data['Combined_Description'])

In [0]:
df_tfidf_transformed


Out[0]:
<200853x21893 sparse matrix of type '<class 'numpy.float64'>'
	with 2740307 stored elements in Compressed Sparse Row format>
Here you can notice that the transformed dataset holds a sparse matrix with a dimension of 200853x21893; where 200853 is the total number of rows and 21893 is the total word corpus.

Step 3: Building Latent-Dirichlet Algorithm using scikit-learn


In [0]:
### Define the LDA model and set the topic size to 20.

topic_clusters = 20

lda_model = LatentDirichletAllocation(n_components=topic_clusters, batch_size=128, random_state=42)

In [0]:
### Fit the filtered data to the model

lda_model.fit(df_tfidf_transformed)


Out[0]:
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=20, n_jobs=None,
                          perp_tol=0.1, random_state=42, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)
Note: Fitting the model to the dataset take a long time. You will see the output as model summary, if success.

Step 4: LDA Topic Cluster


In [0]:
topic_word_dict = {}

top_n_words_num = 10

for index, topic in enumerate(lda_model.components_):
  topic_id = index
  topic_words_max = [df_tfidf.get_feature_names()[i] for i in topic.argsort()[-top_n_words_num:]]
  topic_word_dict[topic_id] = topic_words_max

  print(f"Topic ID : {topic_id}; Top 10 Most Words : {topic_words_max}")


Topic ID : 0; Top 10 Most Words : ['gun', 'police', 'violence', 'health', 'people', 'school', 'new', 'study', 'sexual', 'women']
Topic ID : 1; Top 10 Most Words : ['stewart', 'flint', 'rico', 'storm', 'climate', 'new', 'jon', 'puerto', 'hurricane', 'water']
Topic ID : 2; Top 10 Most Words : ['crash', 'police', 'airlines', 'cup', 'driver', 'coffee', 'colbert', 'man', 'stephen', 'car']
Topic ID : 3; Top 10 Most Words : ['new', 'photo', 'hair', 'week', 'look', 'dress', 'style', 'wedding', 'fashion', 'photos']
Topic ID : 4; Top 10 Most Words : ['years', 'players', 'saudi', 'nfl', 'old', 'man', 'football', 'year', 'francis', 'pope']
Topic ID : 5; Top 10 Most Words : ['day', 'love', 'make', 'don', 'need', 'things', 'know', 'people', 'time', 'life']
Topic ID : 6; Top 10 Most Words : ['dead', 'olympic', 'killed', 'drag', 'israel', 'year', 'festival', 'new', 'police', 'film']
Topic ID : 7; Top 10 Most Words : ['movie', 'year', 'williams', 'trailer', 'isis', 'jennifer', 'netflix', 'new', 'wars', 'star']
Topic ID : 8; Top 10 Most Words : ['day', 'students', 'parents', 'dads', 'week', 'moms', 'women', 'tweets', 'funniest', 'college']
Topic ID : 9; Top 10 Most Words : ['kendall', 'new', 'tom', 'kylie', 'amy', 'west', 'kanye', 'kim', 'jenner', 'kardashian']
Topic ID : 10; Top 10 Most Words : ['man', 'alabama', 'halloween', 'box', 'kimmel', 'office', 'moore', 'trump', 'fallon', 'jimmy']
Topic ID : 11; Top 10 Most Words : ['game', 'james', 'president', 'obama', 'bernie', 'donald', 'sanders', 'trump', 'hillary', 'clinton']
Topic ID : 12; Top 10 Most Words : ['10', 'year', 'new', 'tips', 'family', 'best', 'day', 'time', 'holiday', 'travel']
Topic ID : 13; Top 10 Most Words : ['ban', 'judge', 'white', 'news', 'hill', 'president', 'donald', 'supreme', 'court', 'trump']
Topic ID : 14; Top 10 Most Words : ['like', 'best', 'dogs', 'city', 'new', 'world', 'dog', 'guide', 'photos', 'gps']
Topic ID : 15; Top 10 Most Words : ['chocolate', 'photos', 'foods', 'day', 'best', 'recipe', 'eat', 'make', 'recipes', 'food']
Topic ID : 16; Top 10 Most Words : ['style', 'photos', 'sure', 'tumblr', 'instagram', 'huffpost', 'pinterest', 'check', 'twitter', 'facebook']
Topic ID : 17; Top 10 Most Words : ['disease', 'loss', 'risk', 'pounds', 'study', 'women', 'ebola', 'cancer', 'lost', 'weight']
Topic ID : 18; Top 10 Most Words : ['election', 'new', 'republicans', 'house', 'republican', 'obama', 'president', 'gop', 'donald', 'trump']
Topic ID : 19; Top 10 Most Words : ['music', 'home', 'video', 'best', 'city', 'hotel', 'world', 'art', 'new', 'photos']

Transforming the existing dataframe and adding the content with a topic id and LDA generated topics


In [0]:
topic_output = lda_model.transform(df_tfidf_transformed)

In [0]:
filtered_data = filtered_data.copy()

filtered_data['LDA_Topic_ID'] = topic_output.argmax(axis=1)

filtered_data['Topic_word_categories'] = filtered_data['LDA_Topic_ID'].apply(lambda id: topic_word_dict[id])

In [0]:
filtered_data[['category','Combined_Description','LDA_Topic_ID','Topic_word_categories']].head()


Out[0]:
category Combined_Description LDA_Topic_ID Topic_word_categories
0 CRIME There Were 2 Mass Shootings In Texas Last Week... 0 [gun, police, violence, health, people, school...
1 ENTERTAINMENT Will Smith Joins Diplo And Nicky Jam For The 2... 2 [crash, police, airlines, cup, driver, coffee,...
2 ENTERTAINMENT Hugh Grant Marries For The First Time At Age 5... 7 [movie, year, williams, trailer, isis, jennife...
3 ENTERTAINMENT Jim Carrey Blasts 'Castrato' Adam Schiff And D... 13 [ban, judge, white, news, hill, president, don...
4 ENTERTAINMENT Julianna Margulies Uses Donald Trump Poop Bags... 13 [ban, judge, white, news, hill, president, don...

Step 6: Visualizing


In [0]:
viz = sklearn.prepare(lda_model=lda_model, dtm=df_tfidf_transformed, vectorizer=df_tfidf)

In [0]:
pyLDAvis.display(viz)


Out[0]:
The above chart visualizes the distribution of LDA topic and prints out a list of most used topics

In [0]: