Cécile Alduy and I have been working with a team of undergraduate RAs and members of the Stanford University Libraries staff to build a corpus of the public discourse of French presidential candidates in advance of the 2017 elections. In this notebook, I will describe how we have been converting web pages into corpora for analysis and provide some sample Python code.
Collaborating with Nicholas Taylor, SUL's Web Archiving Service Manager, and Sarah Sussman, the curator of French and Italian Collections, we have identified several key websites and begun periodic crawls using Archive-It. One of the first challenges for any text-mining project that uses web archives as a source is, unsurprisingly, getting the correct text from each website. Although it is possible to simply extract all of the text from a web page, doing so pulls in a great deal of extraneous material, such as navigation menus and boilerplate, that we don't want in the corpus.
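To see why, here is a minimal sketch using made-up HTML and a hypothetical div.article-content selector (not taken from any of the actual candidate sites): a blanket get_text() call returns the menu along with the article, while a targeted select() isolates the part we care about.

from bs4 import BeautifulSoup

# Made-up page: a navigation menu plus the article body we actually want
html = '''
<html><body>
  <nav>Accueil | Programme | Contact</nav>
  <div class="article-content"><p>Discours du candidat...</p></div>
</body></html>
'''
soup = BeautifulSoup(html, 'html.parser')

print(soup.get_text())                                   # menu and article together
print(soup.select('div.article-content')[0].get_text())  # just the article body

The function below generalizes this idea: each site we crawl gets its own set of CSS selectors, stored in a CSV, so the same extraction code can handle different page layouts.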
import os
from bs4 import BeautifulSoup

# clean_string, generate_unique_filename, and get_original_url are project
# helper functions defined elsewhere in this notebook.
def extract_text(site_info, input_file, corpus_dir, word_count=0):
    '''
    Extract the actual text from the HTML.
    Write out a file with the text content.
    Return extracted metadata about the text.
    '''
    results = dict()
    try:
        soup = BeautifulSoup(open(input_file, encoding="utf-8"), 'html.parser')
    except UnicodeDecodeError:
        # The crawled file is not valid UTF-8; skip it
        return None
    if soup is None:
        return None
    # Skip the page if the site defines a filter selector and it isn't matched
    if len(site_info['filter']) and not len(soup.select(site_info['filter'])):
        return None
    # Each field in the CSV holds a BeautifulSoup select() (CSS) selector
    for item in ['title', 'date', 'author', 'content']:
        results[item] = ''
        if not len(site_info[item]):
            continue
        contents = soup.select(site_info[item])
        if contents is not None and len(contents):
            # BS4 returns a list of results even if only one is found;
            # assume only the first result is relevant
            results[item] = clean_string(contents[0].getText())
    results['word_count'] = len(results['content'].split())
    results['filename'] = generate_unique_filename(corpus_dir, site_info['name'], results)
    if os.path.isfile(results['filename']):
        return None
    # Save the original URL
    results['url'] = get_original_url(site_info, input_file)
    if len(results['title']) and results['word_count'] >= int(word_count):
        # Ensure the output path exists
        if not os.path.isdir(os.path.dirname(results['filename'])):
            os.makedirs(os.path.dirname(results['filename']))
        # Write as UTF-8 so French accented characters survive regardless of locale
        with open(results['filename'], 'w', encoding='utf-8') as content:
            content.write(results['content'])
        return results
    return None
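For reference, here is what a call might look like. The site name, selectors, and file paths below are invented to show the shape of a site_info record (one row of the sites CSV), and the call assumes the helper functions used above are defined elsewhere in the notebook.

site_info = {
    'name': 'example-candidate',        # hypothetical site name
    'filter': 'div.article-content',    # skip pages that don't match this selector
    'title': 'h1.entry-title',          # hypothetical selectors for each CSV field
    'date': 'span.published',
    'author': 'span.author',
    'content': 'div.article-content',
}
metadata = extract_text(site_info, 'crawl/page_0001.html', 'corpus', word_count=100)
if metadata:
    print(metadata['title'], metadata['word_count'], metadata['url'])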