In [117]:
## additional installations in colab
!pip install pyLDAvis
!python -m spacy download en_core_web_lg ## restart once download is complete.
## general imports
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import spacy
from spacy import displacy
import pyLDAvis
from pyLDAvis import sklearn
pyLDAvis.enable_notebook()
In this step we load and explore the data in order to get a general intuition of what it contains.
The source of this dataset is the BBC News Insight data.
About the data: all rights, including copyright, in the content of the original articles are owned by the BBC.
In [0]:
#from google.colab import drive
#drive.mount('/content/drive')
In [119]:
### Reading the dataset from path
filename = '/content/drive/My Drive/Colab Notebooks/data/bbc/bbc_raw_dict.json'
data = pd.read_json(filename)
data.head()
Out[119]:
In [120]:
### get the shape of the data
data.shape
Out[120]:
In [121]:
### Check for any null or na
data['content'].isna().value_counts() # no null records found, so the data quality is good
Out[121]:
In [122]:
### distribution of the dataset
data['category'].value_counts()
Out[122]:
In [123]:
### visualize the category - total article spread.
plt.figure(figsize=(10,5))
sns.set_style("whitegrid")
sns.countplot(x='category',data=data, orient='h', palette='Spectral_r')
plt.title("Article Counts by Category (BBC) - Actual Data Source")
plt.show()
The BBC news data is currently divided into 5 major categories: Entertainment, Business, Sport, Politics and Tech.
Note: our target is to understand the BBC news content and build topic clusters for it. We assume no prior knowledge of the above categories; the labels from the original dataset are kept only for cross-validation and testing.
In [0]:
# load spacy
nlp = spacy.load('en_core_web_lg')
In [125]:
# selecting and analysing the first item in the dataframe
doc_one = nlp(data['content'].iloc[0])
displacy.render(doc_one,style='ent',jupyter=True)
With the above display, we can notice the varied entities across the dataset. We would like to remove most of the entities such as PERSON, ORG, CARDINAL, GPE, LOC and others, which will help us narrow down the topic words.
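As an aside, here is a minimal sketch (not part of the pipeline below) of how tokens belonging to unwanted entity types could be dropped using spaCy's token.ent_type_ attribute; the entity labels listed are the ones highlighted above, and the function name is only illustrative.
In [0]:
# Hedged sketch: drop tokens that belong to entity types we are not interested in.
# Reuses the `doc_one` object parsed above; tokens outside any entity have ent_type_ == ''.
UNWANTED_ENTS = {'PERSON', 'ORG', 'CARDINAL', 'GPE', 'LOC'}

def drop_unwanted_entities(doc):
    '''Return the document text with tokens from unwanted entity types removed.'''
    kept = [token.text for token in doc if token.ent_type_ not in UNWANTED_ENTS]
    return ' '.join(kept)

print(drop_unwanted_entities(doc_one)[:500])  # preview the first 500 characters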
In [126]:
## Let's try to reduce the dataset down to selected parts of speech.
additional_stop_words = ['say','go','come','get','see','use','take','want','tell','need']

def parse_filter_document(doc):
    '''
    1. Remove stop words, punctuation and whitespace tokens
    2. Keep only selected POS (NOUN, VERB)
    3. Apply lemmatization on the tokens
    @returns: space-joined string of filtered lemmas
    '''
    filtered_doc = []
    for token in doc:
        # skip stop words, punctuation and whitespace tokens
        if not token.is_stop and not token.is_punct and not token.is_space:
            if token.pos_ in ['NOUN', 'VERB']:
                if token.lemma_ not in additional_stop_words:
                    filtered_doc.append(token.lemma_)
    return ' '.join(filtered_doc)

print(parse_filter_document(doc_one))
With the above method applied to a single document / news article, we can see that the article has been cleaned and reduced to a meaningful chunk of data. We will use this method when parsing the rest of the dataset.
Next we apply the parser method to the whole dataframe and store the result in a new column named: parsed_content.
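For a larger corpus, nlp.pipe processes texts in batches and is usually faster than calling nlp(x) row by row. The cell after this one uses the simpler apply form; a hedged alternative sketch, producing an equivalent result, could look like this:
In [0]:
# Hedged sketch: batch-parse the articles with nlp.pipe instead of a row-wise apply.
# Disabling the NER component (not needed by parse_filter_document) speeds things up further.
parsed = [parse_filter_document(doc)
          for doc in nlp.pipe(data['content'], batch_size=50, disable=['ner'])]
# df['parsed_content'] = parsed  # equivalent to the apply() call in the next cell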
In [127]:
df = data.copy()
df['parsed_content'] = data['content'].apply(lambda x: parse_filter_document(nlp(x)))
df[['category','content','parsed_content']].head(10)
Out[127]:
Next we apply CountVectorizer to the parsed dataframe. This vectorizes the dataframe column parsed_content; once the data is vectorized we can fit it into our model.
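Note that TfidfVectorizer is imported at the top but not used here: LDA is normally trained on raw term counts, which is why CountVectorizer is used below. Purely for comparison, a hedged sketch of a TF-IDF representation (better suited to models such as NMF than to the LDA model in this notebook) would be:
In [0]:
# Hedged sketch: a TF-IDF alternative to the count representation used below.
# LatentDirichletAllocation expects term counts, so this matrix would suit e.g. NMF instead.
tfidf_vec = TfidfVectorizer(max_df=0.95, min_df=1, lowercase=True)
# df_tfidf_vec = tfidf_vec.fit_transform(df['parsed_content'])  # uncomment to build the matrix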
In [0]:
cnt_vec = CountVectorizer(max_df=0.95, min_df=1, lowercase=True)
In [0]:
df_cnt_vec = cnt_vec.fit_transform(df['parsed_content'])
In [130]:
df_cnt_vec # a sparse matrix is generated.
Out[130]:
At this step, we are ready to fit our data into the LDA model: the data has now been cleaned, parsed and vectorised. In the next step, we feed the vectorized data into the model.
In the subsequent steps we fit the LDA model on the transformed and vectorised dataframe created in the previous step. Based on our initial data analysis, the target number of topic clusters will match the one in the original data, which should give a clearer understanding of the data.
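The choice of 5 topics below is an assumption borrowed from the original labels. A hedged sketch of how that choice could be validated with scikit-learn's GridSearchCV (which scores candidates by the model's approximate log-likelihood) is shown here; it is optional and can be slow, so the fit is left commented out.
In [0]:
# Hedged sketch: search over the number of topics instead of fixing it at 5.
# GridSearchCV uses LatentDirichletAllocation.score (approximate log-likelihood) as the default scorer.
from sklearn.model_selection import GridSearchCV

search_params = {'n_components': [3, 5, 7, 10]}
grid = GridSearchCV(LatentDirichletAllocation(random_state=42), search_params, cv=3)
# grid.fit(df_cnt_vec)        # uncomment to run (can take several minutes)
# print(grid.best_params_)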
In [0]:
### Define the LDA model and set the topic size to 5.
topic_clusters = 5  ## assumption is based on the original dataset.
lda_model = LatentDirichletAllocation(n_components=topic_clusters,
                                      learning_decay=0.7,
                                      batch_size=128,
                                      random_state=42)
In [132]:
### Fit the filtered data to the model
lda_model.fit(df_cnt_vec)
Out[132]:
Note: fitting the model to the dataset may take a long time. On success, the output shows the model summary.
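As a quick, hedged sanity check on the fitted model, scikit-learn exposes the approximate log-likelihood (score) and the perplexity (perplexity) on the training matrix; lower perplexity generally indicates a better fit.
In [0]:
# Hedged sketch: inspect the fit quality of the trained LDA model on the training matrix.
print("Approximate log-likelihood:", lda_model.score(df_cnt_vec))
print("Perplexity:", lda_model.perplexity(df_cnt_vec))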
In [0]:
### Transform the dataset with the generated model
result_df = lda_model.transform(df_cnt_vec)
Let us now generate some model outputs based on the dataset. Our model will generate the topics and the word distribution across topics.
In [134]:
### List the generated topics and their top words
# expected themes: sport ; business ; politics ; tech ; entertainment
topic_word_dict = {}
print("Topic ID| Word Distribution")
print("--------|----------------------------------------------------------------")
for index, topic in enumerate(lda_model.components_):
    topic_words_max = [cnt_vec.get_feature_names()[i] for i in topic.argsort()[-15:]]
    topic_word_dict[index] = topic_words_max
    print(f"Topic:{index:{2}}| {', '.join(topic_words_max)}")
Looking at the distribution of words in each of the 5 generated topics, we can already get a sense of the theme each topic represents.
Let's get more detail and generate the weight distribution of topics across the dataframe / documents.
In [135]:
topics = [ "Topic "+str(t) for t in range(lda_model.n_components)]
indexes = [ i for i in range(len(df))]
topic_dist_df = pd.DataFrame(data=np.round(result_df,decimals=2), columns=topics, index=indexes)
dominant_topic = np.argmax(topic_dist_df.values, axis=1)
topic_dist_df['dominant_topic'] = dominant_topic
df['Topic 0'] = topic_dist_df['Topic 0']
df['Topic 1'] = topic_dist_df['Topic 1']
df['Topic 2'] = topic_dist_df['Topic 2']
df['Topic 3'] = topic_dist_df['Topic 3']
df['Topic 4'] = topic_dist_df['Topic 4']
df['dominant_topic'] = topic_dist_df['dominant_topic']
df[['content','Topic 0','Topic 1','Topic 2','Topic 3','Topic 4','dominant_topic']].head(10)
Out[135]:
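Since the original categories were kept aside for cross-checking, a simple hedged way to see how the generated dominant topics line up with them is a cross-tabulation of the two columns:
In [0]:
# Hedged sketch: cross-check the generated dominant topics against the original BBC categories.
pd.crosstab(df['category'], df['dominant_topic'])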
In this step we use a specialised visualisation library called pyLDAvis, which generates an interactive visualization of an LDA model.
Read more on pyLDAvis in its official documentation.
In [0]:
viz = sklearn.prepare(lda_model=lda_model, dtm=df_cnt_vec, vectorizer=cnt_vec)
In [137]:
pyLDAvis.display(viz)
Out[137]:
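The interactive view can also be exported as a standalone HTML file with pyLDAvis.save_html, which is convenient for sharing outside the notebook; the file name below is only an example.
In [0]:
# Hedged sketch: save the interactive visualization to a standalone HTML file (example path).
pyLDAvis.save_html(viz, 'bbc_lda_topics.html')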
In [0]: