02 - Transcript Parser

At the end of the previous notebook, we were able to retrieve the raw Transcript field from the Parliament website. However, those fields are formatted in a very inconvenient way, containing a lot of HTML tags (e.g. <pd_text> or <p>). This is why we need to parse them to get usable text.
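To see what this tag stripping looks like, here is a minimal sketch using BeautifulSoup, the library used below. The sample string is a made-up fragment for illustration only, not actual Parliament data:

```python
from bs4 import BeautifulSoup

# Hypothetical raw transcript fragment, for illustration only
raw = "<pd_text><p>Mr. President, I move to amend.</p></pd_text>"

# .text concatenates all the text nodes, dropping the tags;
# replacing newlines keeps each transcript on a single line
clean = BeautifulSoup(raw, "html.parser").text.replace("\n", " ")
print(clean)  # -> Mr. President, I move to amend.
```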

Before actually parsing, let us take care of the usual imports.

0. Usual Imports

To install:

pip install PyPDF2 beautifulsoup4

In [2]:
import pandas as pd
import glob
import os
import numpy as np
from time import time
import logging
import gensim
import bz2
import PyPDF2
from bs4 import BeautifulSoup

1. Parsing the Transcripts to a usable form

First of all, it is important to note that this code TAKES AN EXTREMELY LONG TIME TO RUN. We strongly advise you not to run it, as the results are already available in the path provided below. The raw Transcript data can be found in the Transcript Copy folder.


In [ ]:
path = '../datas/scrap/Transcript/'
allFiles = glob.glob(os.path.join(path, '*.csv'))

# Iterate over all the files in the Transcript folder.
# For each of them, load the CSV file, iterate over the rows and parse the Text field.
for file_ in allFiles:
    print(file_)
    data = pd.read_csv(file_, index_col=False)
    for index, row in data.iterrows():
        # Skip the NaN entries in the Text field;
        # otherwise, the parsing would crash on a float input.
        if pd.notna(data.Text[index]):
            # Strip the HTML tags and flatten the text onto a single line.
            # .loc avoids pandas' chained-assignment warning.
            data.loc[index, 'Text'] = BeautifulSoup(data.Text[index], 'html.parser').text.replace('\n', ' ')

    # At the end, overwrite the file with the parsed data.
    data.to_csv(file_, index=False)
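A note on the NaN check: a common idiom for detecting NaN is comparing a value to itself (`x == x`), because NaN is the only value that compares unequal to itself. `pd.notna` expresses the same test more readably. A minimal demonstration:

```python
import numpy as np
import pandas as pd

x = np.nan
print(x == x)            # False: NaN is the only value not equal to itself
print(pd.notna(x))       # False: the idiomatic pandas check agrees
print(pd.notna("text"))  # True: ordinary strings pass the check
```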

In [ ]:
data.head()