01. Topic Modelling using the Gensim Library

Usual imports come first.


In [ ]:
import pandas as pd
import glob
import os
import numpy as np
from time import time
import logging
import gensim
import bz2

1. Load the data from the Transcript files

At the moment, we only consider the entries whose LanguageOfText field is FR, i.e. the ones in French. We will consider the text in German later on. We show below one example of the text we consider.


In [ ]:
dataset = []

path = '../datas/treated_data/Transcript/'
#path = 'datas/Vote/'
allFiles = glob.glob(os.path.join(path, 'FR*.csv'))

for file_ in allFiles:
    data = pd.read_csv(file_)
    # The comparison data['Text'] == data['Text'] is False only for NaN entries,
    # so this keeps the rows whose Text field is not missing.
    dataset = dataset + list(data[(data['Text'] == data['Text'])]['Text'].values)
    #dataset = dataset + list(data[(data['BusinessTitle'] == data['BusinessTitle'])]['BusinessTitle'].values+' ')

    
print('Length of the dataset', len(dataset))
#print(dataset[0],'\n',dataset[1])
#data.head()
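
For illustration, the cell below prints one of the loaded transcripts (truncated to its first 500 characters; the cut-off is an arbitrary choice).

In [ ]:
# Print one example of the French transcripts we consider
# (truncated to 500 characters for readability).
print(dataset[0][:500])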

The length of the transcripts varies widely from one entry to another, but it reflects exactly what is discussed at the federal parliament. Processing them correctly will allow us to grasp the topics which are discussed at the parliament.

2. Format the data in order to use LDA with Gensim

First of all, we load the stop_words, a list of all the common words for French that we must not take into account when doing the topic modelling, as they do not convey any useful information. The pipeline we follow is the following:

  1. Load the stop_words (using the stop_words package): we do not load the package here, as we loaded it once and saved the resulting stop words into a .txt file. We do that in order to be able to add some stop words of our own (a sketch of how such a file can be generated is given after this list).

In [ ]:
def stop_words():
    """
    Load and concatenate the stop-word lists of both the French and the German languages
    (there are some German words in the French transcripts and vice versa).
    """
    # 1. Load the custom French stop words dictionary
    with open("../datas/stop_dictionaries/French_stop_words_changed.txt", "r") as myfile:
        stop_words_fr = myfile.read()
    stop_words_fr = stop_words_fr.split(',')

    # 2. Load the custom German stop words dictionary
    with open("../datas/stop_dictionaries/German_stop_words.txt", "r") as myfile:
        stop_words_de = myfile.read()
    stop_words_de = stop_words_de.split(', ')

    return stop_words_de + stop_words_fr

stop_words = stop_words()
  2. Remove those common words and tokenize our dataset (break it down into words).
  3. Count the frequency of the words and remove the ones that appear only once in total.
  4. (Optionally stem the data with a French stemming algorithm from the nltk library) -> disabled by default at the moment.
  5. Remove all the short words, as they are very unlikely to provide any information (the filter below keeps only words longer than 4 characters).

    N.B. THIS ALGORITHM IS VERY SLOW !!!!
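
As mentioned in step 1, the .txt dictionaries were generated once from the stop_words package and then edited by hand. A minimal sketch of that one-off step is shown below; the language codes follow the stop_words package and the output paths mirror the ones read by stop_words() above.

In [ ]:
# One-off generation of the custom stop-word files (sketch, assuming the
# `stop_words` package is installed). The files can then be edited by hand
# to add extra stop words.
from stop_words import get_stop_words

with open("../datas/stop_dictionaries/French_stop_words_changed.txt", "w") as out:
    out.write(','.join(get_stop_words('fr')))

with open("../datas/stop_dictionaries/German_stop_words.txt", "w") as out:
    out.write(', '.join(get_stop_words('de')))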


In [ ]:
import re
from collections import defaultdict
from nltk.stem.snowball import FrenchStemmer
from nltk.stem import WordNetLemmatizer

def format_text(dataset, stop_words, stemming = False):
    """
    Remove the common words in our document corpus and tokenize it (break it down into words),
    before dropping the words that appear only once in the whole corpus.
    """
    # The re.split pattern lists everything we split at. At the moment, this is
    # ' ' (space) - '\n' - '(' - ')' - ',' - ';' - ':' - '.' - '[' - ']' - '/'
    # and both kinds of apostrophe (' and ’).
    # We also filter out the short words (only words longer than 4 characters are kept),
    # as they are very unlikely to provide any information, and finally we remove the common words.

    texts = [[word for word in re.split(' |\'|\n|\(|\)|,|;|:|\.|\[|\]|\’|\/',
                                    document.lower()) if (len(word) > 4 and (word not in stop_words))]
             for document in dataset]

    # Next, we remove the words that appear only once in the whole corpus
    if stemming:
        #Consider the stemmed version
        FS = FrenchStemmer()

        frequency = defaultdict(int)
        for text in texts:
            for token in text:
                frequency[FS.stem(token)] += 1

        texts = [[FS.stem(token) for token in text if frequency[FS.stem(token)] > 1]
                 for text in texts]
    else:
        
        frequency = defaultdict(int)
        for text in texts:
            for token in text:
                frequency[token] += 1


        texts = [[token for token in text if frequency[token] > 1]
                 for text in texts]
    return texts

In [ ]:
texts = format_text(dataset,stop_words)
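
As a quick sanity check (the exact values printed are just a suggestion), we can look at the number of processed documents and at the first few tokens of one of them.

In [ ]:
# Quick sanity check on the tokenised corpus:
# number of documents and the first tokens of the first document.
print('Number of processed documents:', len(texts))
print('First tokens of the first document:', texts[0][:10])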

3. Perform the LDA topic modelling and print the results.

Formatting the data into a dictionary and a corpus, the two inputs required by Gensim's LdaModel function.


In [ ]:
dictionary = gensim.corpora.Dictionary(texts)
# Converts a collection of words to its bag-of-words representation (a list of (word_id, word_frequency) 2-tuples)
corpus = [dictionary.doc2bow(text) for text in texts]
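
To make the bag-of-words representation concrete, the cell below (purely illustrative) prints the first few (word_id, word_frequency) pairs of the first document and maps the ids back to words through the dictionary.

In [ ]:
# Inspect the bag-of-words representation of the first document:
# each entry is a (word_id, word_frequency) pair.
print(corpus[0][:5])
# Map the word ids back to the actual words.
print([(dictionary[word_id], freq) for word_id, freq in corpus[0][:5]])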

In [ ]:
if not os.path.exists("../datas/lda"):
    os.makedirs("../datas/lda")

In [ ]:
dictionary.save('../datas/lda/ldaDictionaryFR.dict')

Note that in the algorithm below, we need to choose the number of topics, i.e. the number of clusters of data that we want to find. The accuracy of our algorithm depends a lot on picking a good number of topics (a possible way of comparing different values is sketched after the training cell below).


In [ ]:
%%time
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=11, id2word=dictionary)  #, passes=1)
#ldamodel = gensim.models.hdpmodel.HdpModel(corpus, id2word=dictionary, T=50)
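
One possible way of comparing different values of num_topics is to train a model per candidate value and score it with Gensim's CoherenceModel. The sketch below does exactly that; the candidate values are an arbitrary choice, not the ones used in this notebook, and the loop is slow since it trains one model per value.

In [ ]:
# Sketch: compare a few candidate topic counts with the c_v coherence score.
# The candidate values are arbitrary; a higher coherence is usually better.
from gensim.models import CoherenceModel

for n in [5, 8, 11, 15]:
    model = gensim.models.ldamodel.LdaModel(corpus, num_topics=n, id2word=dictionary)
    coherence = CoherenceModel(model=model, texts=texts, dictionary=dictionary,
                               coherence='c_v').get_coherence()
    print('num_topics =', n, '-> c_v coherence =', coherence)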

In [ ]:
import pyLDAvis.gensim as gensimvis
import pyLDAvis

In [ ]:
vis_data = gensimvis.prepare(ldamodel, corpus, dictionary)
pyLDAvis.display(vis_data)
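
Besides the interactive visualisation, the topics can also be printed directly; the number of words per topic below is an arbitrary choice.

In [ ]:
# Print the most probable words of each of the 11 topics.
for topic_id, topic in ldamodel.print_topics(num_topics=11, num_words=10):
    print('Topic', topic_id, ':', topic)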

In [ ]:
ldamodel.save('../datas/lda/ldamodelFR.model')
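
The saved dictionary and model can later be loaded back as follows (a usage note rather than part of the pipeline above).

In [ ]:
# Reload the saved dictionary and LDA model in a later session.
dictionary = gensim.corpora.Dictionary.load('../datas/lda/ldaDictionaryFR.dict')
ldamodel = gensim.models.ldamodel.LdaModel.load('../datas/lda/ldamodelFR.model')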