03. Processing - Transcript to Subject

This is no longer a purely scraping part. It sits at the interface between the end of the scraping and the beginning of the formatting of the data for Natural Language Processing. What we do here is reunite the whole discussion about a subject (same IdSubject) into a single field. We lose the information about who talked, but we get a much larger corpus of text about a subject that we know to be the same (the algorithm would not guess it!). Note that, at the moment, we only consider the data in French (i.e. those whose LanguageOfText attribute is FR).
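
To make the idea concrete, here is a minimal sketch on hypothetical toy data: all the rows sharing the same IdSubject are reunited into a single text. (The cells below do the same thing on the real transcripts, with an explicit loop and a language filter.)


In [ ]:
import pandas as pd

# Hypothetical toy data: three utterances from two subjects.
toy = pd.DataFrame({'IdSubject': [1, 1, 2],
                    'Text': ['Bonjour. ', 'Merci. ', 'Au revoir. ']})

# Reunite all the text of each subject into a single field.
# .sum() on a string column concatenates the strings within each group.
print(toy.groupby('IdSubject')['Text'].sum())
# IdSubject 1 -> 'Bonjour. Merci. ', IdSubject 2 -> 'Au revoir. '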

0. Imports


In [ ]:
import glob
import os

import pandas as pd

1. Processing of the data

First, we load all the transcripts (several files in the Transcript folder) and merge them into one large DataFrame. We then process it to merge all the entries sharing the same IdSubject into a single one. Note that the resulting DataFrame will contain a lot of empty entries, as the subject IDs are not contiguous.


In [ ]:
dataset = []

path = '../datas/scrap/Transcript/'
allFiles = glob.glob(os.path.join(path, '*.csv'))

# Load and concatenate all the files into one dataset.
for file_ in allFiles:
    data = pd.read_csv(file_)
    dataset += [data]
datas = pd.concat(dataset)

print('Number of columns:', len(datas.columns))
print('Length of the dataset:', len(datas))

The following cell is also LONG TO RUN. The resulting CSV file is located at

datas/treated_data/Transcript/FRTextfromsubject.csv

The operation we do is simple. For each subject, we filter the data and keep all of its texts in the requested language (French or German). We then aggregate all the text found into a single string, stored in a dictionary that maps 'Subject Id' and 'Text' to lists containing everything we need. Finally, we export the result to the file cited above.


In [ ]:
def filter_transcript(language):
    subjects = datas.IdSubject.unique()
    dict_ = {'Subject Id': [], 'Text': []}

    # Iterate over all the different subjects (list of IDs).
    for subject in subjects:
        data_tmp = datas[(datas.IdSubject == subject) & (datas.LanguageOfText == language)]
        # Keep only the non-NaN texts (NaN is the only value not equal to itself).
        texts = data_tmp[data_tmp.Text == data_tmp.Text]
        # .sum() on a Series of strings concatenates them; on an empty Series it
        # returns 0, so fall back to an empty string in that case.
        text = texts.Text.sum() if len(texts) > 0 else ''
        dict_['Subject Id'] += [subject]
        dict_['Text'] += [text]
    return dict_

def save_transcript(transcript, language):
    # Save to file, creating the output folder if it does not exist yet.
    if not os.path.exists("../datas/treated_data/Transcript"):
        os.makedirs("../datas/treated_data/Transcript")
    transcript.to_csv('../datas/treated_data/Transcript/' + language + 'Textfromsubject.csv', index=False)

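As an aside, the line data_tmp.Text == data_tmp.Text relies on NaN being the only value that is not equal to itself. A minimal sketch of this trick on hypothetical data (it is equivalent to dropna here):


In [ ]:
# Illustration of the NaN filtering trick used in filter_transcript above:
# NaN != NaN, so `s == s` is False exactly on the missing values.
s = pd.Series(['hello ', float('nan'), 'world'])
print(s == s)           # True, False, True
print(s[s == s].sum())  # 'hello world' -- same as s.dropna().sum()
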
Running the transcript filtering here.


In [ ]:
# Choose the language to process: 'FR' or 'DE'.
#language = 'FR'
language = 'DE'
dict_ = filter_transcript(language)
# Convert the result to a DataFrame and visualise it.
transcript = pd.DataFrame(dict_)
transcript.head(100)

In [ ]:
save_transcript(transcript, language)
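
Once saved, the file can be reloaded at the start of the NLP stage. A minimal sketch, assuming the FR export produced above; dropping the subjects with no text is an assumption about what the next step needs, not something the cells above prescribe:


In [ ]:
# Reload the exported transcript for the NLP step (path from the cells above).
transcript_fr = pd.read_csv('../datas/treated_data/Transcript/FRTextfromsubject.csv')
# Subjects without any text in the requested language come back as NaN after
# the round-trip through CSV; dropping them here is an assumption.
transcript_fr = transcript_fr.dropna(subset=['Text'])
transcript_fr.head()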