This notebook expands on the OpenCalais code in the file article_coding.py, also in this folder. It includes additional sections on selecting the publications whose articles you want to submit to OpenCalais. It is intended to be copied and re-used.
In [ ]:
debug_flag = False
In [ ]:
import datetime
import glob
import logging
import lxml
import os
import six
import xml
import xmltodict
import zipfile
In [ ]:
# paper identifier
paper_identifier = "BostonGlobe"
archive_identifier = "BG_20171002210239_00001"
# source
source_paper_folder = "/mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data"
source_paper_path = "{}/{}".format( source_paper_folder, paper_identifier )
# uncompressed
uncompressed_paper_folder = "/mnt/hgfs/projects/phd/proquest_hnp/uncompressed"
uncompressed_paper_path = "{}/{}".format( uncompressed_paper_folder, paper_identifier )
# make sure an identifier is set before you make a path here.
# make sure an identifier is set before you make a path here.
if ( ( archive_identifier is not None ) and ( archive_identifier != "" ) ):

    # identifier is set.
    source_archive_file = "{}.zip".format( archive_identifier )
    source_archive_path = "{}/{}".format( source_paper_path, source_archive_file )
    uncompressed_archive_path = "{}/{}".format( uncompressed_paper_path, archive_identifier )

#-- END check to see if archive_identifier present. --#
In [ ]:
%pwd
In [ ]:
# current working folder
current_working_folder = "/home/jonathanmorgan/work/django/research/work/phd_work/data/article_loading/proquest_hnp/{}".format( paper_identifier )
current_datetime = datetime.datetime.now()
current_date_string = current_datetime.strftime( "%Y-%m-%d-%H-%M-%S" )
Configure logging for this notebook's kernel (if you do not run this cell, you'll get the django application's logging configuration).
In [ ]:
logging_file_name = "{}/research-data_load-{}-{}.log.txt".format( current_working_folder, paper_identifier, current_date_string )
logging.basicConfig(
    level = logging.DEBUG,
    format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
    filename = logging_file_name,
    filemode = 'w' # set to 'a' if you want to append, rather than overwrite each time.
)
Since I use a virtualenv, I need to get it activated somehow inside this notebook. One option is to run ../dev/wsgi.py
in this notebook, to configure the Python environment manually as if you had activated the sourcenet
virtualenv. To do this, you'd make a code cell that contains:
%run ../dev/wsgi.py
This is sketchy, however, because of the changes it makes to your Python environment within the context of whatever your current kernel is. I'd worry about collisions with the actual Python 3 kernel. A better option is to install your virtualenv as a separate kernel. Steps:
- activate your virtualenv:
workon research
- in your virtualenv, install the package ipykernel:
pip install ipykernel
- use the ipykernel python program to install the current environment as a kernel:
python -m ipykernel install --user --name <env_name> --display-name "<display_name>"
example, for a virtualenv named "sourcenet":
python -m ipykernel install --user --name sourcenet --display-name "research (Python 3)"
More details: http://ipython.readthedocs.io/en/stable/install/kernel_install.html
First, initialize my dev django project, so I can run code in this notebook that references my django models and can talk to the database using my project's settings.
In [ ]:
# init django
django_init_folder = "/home/jonathanmorgan/work/django/research/work/phd_work"
django_init_path = "django_init.py"
if ( ( django_init_folder is not None ) and ( django_init_folder != "" ) ):

    # add folder to front of path.
    django_init_path = "{}/{}".format( django_init_folder, django_init_path )

#-- END check to see if django_init folder set. --#
In [ ]:
%run $django_init_path
In [ ]:
# context_text imports
from context_text.article_coding.article_coding import ArticleCoder
from context_text.article_coding.article_coding import ArticleCoding
from context_text.article_coding.open_calais_v2.open_calais_v2_article_coder import OpenCalaisV2ArticleCoder
from context_text.collectors.newsbank.newspapers.GRPB import GRPB
from context_text.collectors.newsbank.newspapers.DTNB import DTNB
from context_text.models import Article
from context_text.models import Article_Subject
from context_text.models import Newspaper
from context_text.shared.context_text_base import ContextTextBase
# context_text_proquest_hnp
from context_text_proquest_hnp.proquest_hnp_newspaper_helper import ProquestHNPNewspaperHelper
Create a LoggingHelper instance to use to log debug messages and print them at the same time.
Preconditions: Must be run after Django is initialized, since python_utilities
is in the django path.
In [ ]:
# python_utilities
from python_utilities.logging.logging_helper import LoggingHelper
# init
my_logging_helper = LoggingHelper()
my_logging_helper.set_logger_name( "proquest_hnp-article-loading-{}".format( paper_identifier ) )
log_message = None
Create and initialize an instance of ProquestHNPNewspaperHelper for this paper.
In [ ]:
my_paper = ProquestHNPNewspaperHelper()
paper_instance = my_paper.initialize_from_database( paper_identifier )
my_paper.source_all_papers_folder = source_paper_folder
my_paper.destination_all_papers_folder = uncompressed_paper_folder
In [ ]:
print( my_paper )
print( paper_instance )
In [ ]:
my_paper = ProquestHNPNewspaperHelper()
my_paper.paper_identifier = paper_identifier
my_paper.source_all_papers_folder = source_paper_folder
my_paper.source_paper_path = source_paper_path
my_paper.destination_all_papers_folder = uncompressed_paper_folder
my_paper.destination_paper_path = uncompressed_paper_path
my_paper.paper_start_year = 1872
my_paper.paper_end_year = 1985
my_newspaper = Newspaper.objects.get( id = 6 )
my_paper.newspaper = my_newspaper
If desired, add to database.
In [ ]:
phnp_newspaper_instance = my_paper.create_PHNP_newspaper()
In [ ]:
print( phnp_newspaper_instance )
Specify which folder of XML files should be loaded into the system, then process all files within the folder.
The compressed archives from proquest_hnp just contain publication XML files, with no containing folder.
To process:
- uncompressed paper folder ( <paper_folder> ) - make a folder in /mnt/hgfs/projects/phd/proquest_hnp/uncompressed for the paper whose data you are working with, named the same as the paper's folder in /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data (example: "BostonGlobe").
- uncompressed archive folder ( <archive_folder> ) - inside a given paper's folder in uncompressed, for each archive file, create a folder named the same as the archive file, but with no ".zip" at the end (example: for "BG_20171002210239_00001.zip", make a folder named "BG_20171002210239_00001"), so: <paper_folder>/<archive_name_no_zip>.
- unzip the archive into this folder: unzip <path_to_zip> -d <archive_folder>
See if the uncompressed paper folder exists. If not, set flag and create it.
In [ ]:
# create folder to hold the results of decompressing paper's zip files.
did_uncomp_paper_folder_exist = my_paper.make_dest_paper_folder()
For each *.zip file in the paper's source folder, check if a folder named the same as the "archive identifier" (the file name minus ".zip") is present in the uncompressed paper folder:
- If no: create the folder, then unzip the archive into it.
- If yes: skip the archive (assume it has already been uncompressed).
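A minimal sketch of this check-then-unzip loop, using only the standard library. This is my guess at what the `uncompress_paper_zip_files()` helper does, not its actual implementation; the folder layout is taken from the steps above.

```python
import glob
import os
import zipfile

def uncompress_zip_files( source_paper_path, uncompressed_paper_path ):

    '''
    For each *.zip in source_paper_path, make a matching folder (named the
        archive file name minus ".zip") in uncompressed_paper_path and unzip
        the archive into it.  Skips archives whose folder already exists.
    '''

    for zip_path in glob.glob( "{}/*.zip".format( source_paper_path ) ):

        # archive identifier = file name minus ".zip"
        archive_identifier = os.path.basename( zip_path )[ : -len( ".zip" ) ]
        dest_folder = "{}/{}".format( uncompressed_paper_path, archive_identifier )

        if not os.path.exists( dest_folder ):

            # folder absent - create it, then unzip into it.
            os.makedirs( dest_folder )
            with zipfile.ZipFile( zip_path ) as zip_file:
                zip_file.extractall( dest_folder )

        #-- END check to see if destination folder already exists. --#

    #-- END loop over zip files. --#

#-- END function uncompress_zip_files() --#
```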
In [ ]:
# decompress the files
my_paper.uncompress_paper_zip_files()
Change working directories to the uncompressed paper path.
In [ ]:
%cd $uncompressed_paper_path
In [ ]:
%ls
Load one of the files into memory and see what we can do with it. Beautiful Soup?
Looks like the root element is "Record", then the high-level type of the article is "ObjectType".
ObjectType values:
Good options for an XML parser:
- lxml.etree - https://stackoverflow.com/questions/12290091/reading-xml-file-and-fetching-its-attributes-value-in-python
- xmltodict - https://docs.python-guide.org/scenarios/xml/
- BeautifulSoup
using lxml
In [ ]:
# loop over files in the current archive folder path.
object_type_to_count_map = my_paper.process_archive_object_types( uncompressed_archive_path )
Processing 5752 files in /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20171002210239_00001
----> XML file count: 5752
Counters:
- Processed 5752 files
- No Record: 0
- No ObjectType: 0
- No ObjectType value: 0
ObjectType values and occurrence counts:
- A|d|v|e|r|t|i|s|e|m|e|n|t: 1902
- Article|Feature: 1792
- N|e|w|s: 53
- Commentary|Editorial: 36
- G|e|n|e|r|a|l| |I|n|f|o|r|m|a|t|i|o|n: 488
- S|t|o|c|k| |Q|u|o|t|e: 185
- Advertisement|Classified Advertisement: 413
- E|d|i|t|o|r|i|a|l| |C|a|r|t|o|o|n|/|C|o|m|i|c: 31
- Correspondence|Letter to the Editor: 119
- Front Matter|Table of Contents: 193
- O|b|i|t|u|a|r|y: 72
- F|r|o|n|t| |P|a|g|e|/|C|o|v|e|r| |S|t|o|r|y: 107
- I|m|a|g|e|/|P|h|o|t|o|g|r|a|p|h: 84
- Marriage Announcement|News: 6
- I|l|l|u|s|t|r|a|t|i|o|n: 91
- R|e|v|i|e|w: 133
- C|r|e|d|i|t|/|A|c|k|n|o|w|l|e|d|g|e|m|e|n|t: 30
- News|Legal Notice: 17
Loop over all folders in the paper path. For each folder, grab all files in the folder. For each file, parse the XML, then get the ObjectType value; if it isn't already in the map of object types to counts, add it. Increment its count.
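The counting loop described above can be sketched for a single folder like this. It is a stdlib sketch using xml.etree.ElementTree rather than xmltodict, and is not the actual code inside `process_archive_object_types()` / `process_paper_object_types()`. Multiple ObjectType elements in one Record are joined with "|" (as in "Article|Feature").

```python
import glob
import xml.etree.ElementTree as ET

def count_object_types( archive_folder_path ):

    '''
    Parse each XML file in a folder and count occurrences of each ObjectType
        value.  Records with multiple ObjectType elements get a single
        pipe-joined key.
    '''

    object_type_to_count_map = {}
    for xml_path in glob.glob( "{}/*.xml".format( archive_folder_path ) ):

        # root element is "Record"; ObjectType elements are direct children.
        record_root = ET.parse( xml_path ).getroot()
        type_list = [ el.text for el in record_root.findall( "ObjectType" ) ]
        if not type_list:
            continue
        object_type = "|".join( type_list )
        object_type_to_count_map[ object_type ] = object_type_to_count_map.get( object_type, 0 ) + 1

    #-- END loop over XML files. --#

    return object_type_to_count_map

#-- END function count_object_types() --#
```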
From command line, in the uncompressed BostonGlobe folder:
find . -type f -iname "*.xml" | wc -l
resulted in 11,374,500 articles. That is quite a few.
In [ ]:
xml_folder_list = glob.glob( "{}/*".format( uncompressed_paper_path ) )
print( "folder_list: {}".format( xml_folder_list ) )
In [ ]:
# build map of all object types for a paper to the overall counts of each
paper_object_type_to_count_map = my_paper.process_paper_object_types()
Example output:
XML file count: 5752 Counters:
ObjectType values and occurrence counts:
Choose a directory, then loop over the files in the directory to build a map of types to lists of file names.
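A sketch of building that map of types to file-path lists, again with the stdlib rather than the `map_archive_folder_files_to_types()` helper (whose internals I have not reproduced here):

```python
import glob
import xml.etree.ElementTree as ET

def map_types_to_files( archive_folder_path ):

    '''
    Build a map of ObjectType value --> list of paths of XML files in the
        folder whose Record has that type (pipe-joined when multi-valued).
    '''

    object_type_to_file_path_map = {}
    for xml_path in glob.glob( "{}/*.xml".format( archive_folder_path ) ):

        record_root = ET.parse( xml_path ).getroot()
        object_type = "|".join( el.text for el in record_root.findall( "ObjectType" ) )
        object_type_to_file_path_map.setdefault( object_type, [] ).append( xml_path )

    #-- END loop over XML files. --#

    return object_type_to_file_path_map

#-- END function map_types_to_files() --#
```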
In [ ]:
# directory to work in.
uncompressed_archive_folder = "BG_20151211054235_00003"
uncompressed_archive_path = "{}/{}".format( uncompressed_paper_path, uncompressed_archive_folder )
In [ ]:
# build map of file types to lists of files of that type in specified folder.
object_type_to_file_path_map = my_paper.map_archive_folder_files_to_types( uncompressed_archive_path )
In [ ]:
# which types do we want to preview?
# NOTE: master_object_type_list must already be defined (e.g., a list of the
#     ObjectType values you care about, such as the keys of
#     object_type_to_file_path_map).
types_to_output = master_object_type_list
#types_to_output = [ 'Advertisement|Classified Advertisement' ]

# declare variables
xml_file_path_list = None
xml_file_path_example_list = None
xml_file_path = None
xml_file = None
xml_dict = None
xml_string = None

# loop over types
for object_type in types_to_output:

    # print type and up to 10 example file paths.
    xml_file_path_list = object_type_to_file_path_map.get( object_type, [] )
    xml_file_path_example_list = xml_file_path_list[ : 10 ]
    print( "\n- {}:".format( object_type ) )
    for xml_file_path in xml_file_path_example_list:

        print( "----> {}".format( xml_file_path ) )

        # try to parse the file
        with open( xml_file_path ) as xml_file:

            # parse XML
            xml_dict = xmltodict.parse( xml_file.read() )

        #-- END with open( xml_file_path ) as xml_file: --#

        # pretty-print
        xml_string = xmltodict.unparse( xml_dict, pretty = True )

        # output
        print( xml_string )

    #-- END loop over example file paths. --#

#-- END loop over object types. --#
IDs:
<RecordID>1821311973</RecordID>
<URLDocView>http://search.proquest.com/docview/1821311973/</URLDocView>
Object Types:
<ObjectType>Feature</ObjectType>
<ObjectType>Article</ObjectType>
Action code:
<ActionCode>change</ActionCode>
Publication Date:
<AlphaPubDate>Nov 20, 1985</AlphaPubDate>
<NumericPubDate>19851120</NumericPubDate>
Headline:
<RecordTitle>Ulster pact rapped in Irish Parliament</RecordTitle>
Author:
<Contributor>
<ContribRole>Author</ContribRole>
<OriginalForm>Bob O'Connor Special to the Globe</OriginalForm>
</Contributor>
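The fields above can be pulled out of a Record with the standard library's ElementTree. This is a sketch; the element names come from the samples above, and the function name is mine, not part of any helper.

```python
import datetime
import xml.etree.ElementTree as ET

def parse_record_metadata( record_xml_string ):

    '''
    Pull the ID, type, action code, date, and headline fields shown above out
        of a Record's XML.  (Contributor/author parsing is handled separately.)
    '''

    record_root = ET.fromstring( record_xml_string )
    numeric_pub_date = record_root.findtext( "NumericPubDate" )  # e.g. "19851120"
    return {
        "record_id" : record_root.findtext( "RecordID" ),
        "object_types" : [ el.text for el in record_root.findall( "ObjectType" ) ],
        "action_code" : record_root.findtext( "ActionCode" ),
        "pub_date" : datetime.datetime.strptime( numeric_pub_date, "%Y%m%d" ).date(),
        "title" : record_root.findtext( "RecordTitle" ),
    }

#-- END function parse_record_metadata() --#
```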
From /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210230044_00004/367105818.xml:
<Contributor>
<ContribRole>Author</ContribRole>
<LastName>McCain</LastName>
<FirstName>Nina</FirstName>
<PersonName>Nina McCain</PersonName>
<OriginalForm>Nina McCain</OriginalForm>
</Contributor>
("Globe Staff" is still in the body text).
Looks like you can count on the person's name being in the "Contributor" element, sometimes parsed into name parts, sometimes not. Looks like it will not parse if the author string includes a suffix. Example:
from /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/ChristianScienceMonitor/CSM_20170929191926_00001/513134635.xml:
<Contributor>
<ContribRole>Author</ContribRole>
<OriginalForm>John Dillin Staff writer of The Christian Science Monitor</OriginalForm>
</Contributor>
If no "Contributor" element, then they are asserting that there is no byline.
Shared bylines = Multiple Contributor elements:
from /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/Newsday/Newsday_20171006231925_00050/1000174750.xml
<Contributor>
<ContribRole>Author</ContribRole>
<LastName>Nash</LastName>
<MiddleName>M</MiddleName>
<FirstName>Bruce</FirstName>
<PersonName>Bruce M Nash</PersonName>
<OriginalForm>Bruce M Nash</OriginalForm>
</Contributor>
<Contributor>
<ContribRole>Author</ContribRole>
<LastName>Monchick</LastName>
<MiddleName>B</MiddleName>
<FirstName>Randolph</FirstName>
<PersonName>Randolph B Monchick</PersonName>
<OriginalForm>Randolph B Monchick</OriginalForm>
</Contributor>
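The Contributor patterns above (single author, shared bylines, name parts sometimes missing) can be handled with a sketch like this; names and structure are my own, based only on the samples shown:

```python
import xml.etree.ElementTree as ET

def get_authors( record_xml_string ):

    '''
    Return one dict per Contributor element with ContribRole "Author".  An
        empty list means the Record asserts there is no byline.  Name-part
        values are None when the OriginalForm did not parse (e.g., when the
        author string includes a suffix).
    '''

    author_list = []
    record_root = ET.fromstring( record_xml_string )
    for contributor in record_root.findall( "Contributor" ):

        if contributor.findtext( "ContribRole" ) != "Author":
            continue

        author_list.append( {
            "original_form" : contributor.findtext( "OriginalForm" ),
            "first_name" : contributor.findtext( "FirstName" ),
            "last_name" : contributor.findtext( "LastName" ),
            "person_name" : contributor.findtext( "PersonName" ),
        } )

    #-- END loop over Contributor elements. --#

    return author_list

#-- END function get_authors() --#
```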
Boston Globe
byline suffixes (loop and compile all from the "OriginalForm" strings):
- if no identifier?
- /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210230044_00004/367091933.xml - in this case, person is not a staff journalist.
Newsday
might be some suffixes (but not many):
- /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/Newsday/Newsday_20171006231925_00050/1000247974.xml
- /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/Newsday/Newsday_20171006231925_00050/1002490977.xml
Christian Science Monitor
looks like news bylines include a suffix. Example:
from /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/ChristianScienceMonitor/CSM_20170929191926_00001/513134635.xml:
<Contributor>
<ContribRole>Author</ContribRole>
<OriginalForm>John Dillin Staff writer of The Christian Science Monitor</OriginalForm>
</Contributor>
Body text:
<Abstract> is the first sentence/lead, and <FullText> is the full text (so we could look for the contents of the abstract in the full text to see where the article itself begins? In the first example I looked at, the abstract had a period at the end and the matching sentence in the full text did not. Hmmm.). Headline and byline are in the full text. For the Globe, it looks like the good way to split is on the "OriginalForm" of the "Contributor". That is what I'd try first.
TODO:
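The split-on-OriginalForm idea can be sketched as follows. The function name and return shape are my guesses at an approach, not an existing helper:

```python
def split_full_text_on_byline( full_text, original_form ):

    '''
    Split FullText at the first occurrence of the Contributor's OriginalForm
        string.  Returns ( headline_and_byline, body ); if the byline string
        is not found, returns ( None, full_text ) unchanged.
    '''

    index = full_text.find( original_form )
    if index == -1:
        return ( None, full_text )
    #-- END check to see if byline found. --#

    split_point = index + len( original_form )
    return ( full_text[ : split_point ].strip(), full_text[ split_point : ].strip() )

#-- END function split_full_text_on_byline() --#
```

Note that this splits on the first occurrence, so an author name that also appears later in the body is handled correctly, but a headline that happens to contain the author's name would not be.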