This notebook expands on the OpenCalais code in the file article_coding.py, also in this folder. It includes additional sections on selecting the publications whose articles you want to submit to OpenCalais. It is intended to be copied and re-used.
In [ ]:
debug_flag = False
In [ ]:
import datetime
import glob
import logging
import lxml
import os
import six
import xml
import xmltodict
import zipfile
In [ ]:
# paper identifier
paper_identifier = "BostonGlobe"
archive_identifier = "BG_20171002210239_00001"
# source
source_paper_folder = "/mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data"
source_paper_path = "{}/{}".format( source_paper_folder, paper_identifier )
# uncompressed
uncompressed_paper_folder = "/mnt/hgfs/projects/phd/proquest_hnp/uncompressed"
uncompressed_paper_path = "{}/{}".format( uncompressed_paper_folder, paper_identifier )
# make sure an identifier is set before you make a path here.
# make sure an identifier is set before you make a path here.
if ( ( archive_identifier is not None ) and ( archive_identifier != "" ) ):

    # identifier is set.
    source_archive_file = "{}.zip".format( archive_identifier )
    source_archive_path = "{}/{}".format( source_paper_path, source_archive_file )
    uncompressed_archive_path = "{}/{}".format( uncompressed_paper_path, archive_identifier )

#-- END check to see if archive_identifier present. --#
In [ ]:
%pwd
In [ ]:
# current working folder
current_working_folder = "/home/jonathanmorgan/work/django/research/work/phd_work/data/article_loading/proquest_hnp/{}".format( paper_identifier )
current_datetime = datetime.datetime.now()
current_date_string = current_datetime.strftime( "%Y-%m-%d-%H-%M-%S" )
Configure logging for this notebook's kernel (if you do not run this cell, you'll get the django application's logging configuration).
In [ ]:
logging_file_name = "{}/research-data_load-{}-{}.log.txt".format( current_working_folder, paper_identifier, current_date_string )
logging.basicConfig(
    level = logging.DEBUG,
    format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
    filename = logging_file_name,
    filemode = 'w' # set to 'a' if you want to append, rather than overwrite each time.
)
Since I use a virtualenv, I need to get it activated somehow inside this notebook. One option is to run ../dev/wsgi.py
in this notebook, to configure the Python environment manually as if you had activated the sourcenet
virtualenv. To do this, you'd make a code cell that contains:
%run ../dev/wsgi.py
This is sketchy, however, because of the changes it makes to your Python environment within the context of whatever your current kernel is. I'd worry about collisions with the actual Python 3 kernel. A better option is to install your virtualenv as a separate kernel. Steps:
- activate your virtualenv:
workon research
- in your virtualenv, install the package ipykernel:
pip install ipykernel
- use the ipykernel python program to install the current environment as a kernel:
python -m ipykernel install --user --name <env_name> --display-name "<display_name>"
example, for a virtualenv named "sourcenet":
python -m ipykernel install --user --name sourcenet --display-name "research (Python 3)"
More details: http://ipython.readthedocs.io/en/stable/install/kernel_install.html
First, initialize my dev django project, so I can run code in this notebook that references my django models and can talk to the database using my project's settings.
In [ ]:
# init django
django_init_folder = "/home/jonathanmorgan/work/django/research/work/phd_work"
django_init_path = "django_init.py"
if ( ( django_init_folder is not None ) and ( django_init_folder != "" ) ):

    # add folder to front of path.
    django_init_path = "{}/{}".format( django_init_folder, django_init_path )

#-- END check to see if django_init folder set. --#
In [ ]:
%run $django_init_path
In [ ]:
# context_text imports
from context_text.article_coding.article_coding import ArticleCoder
from context_text.article_coding.article_coding import ArticleCoding
from context_text.article_coding.open_calais_v2.open_calais_v2_article_coder import OpenCalaisV2ArticleCoder
from context_text.collectors.newsbank.newspapers.GRPB import GRPB
from context_text.collectors.newsbank.newspapers.DTNB import DTNB
from context_text.models import Article
from context_text.models import Article_Subject
from context_text.models import Newspaper
from context_text.shared.context_text_base import ContextTextBase
# context_text_proquest_hnp
from context_text_proquest_hnp.proquest_hnp_newspaper_helper import ProquestHNPNewspaperHelper
Create a LoggingHelper instance to use to log debug messages and print them at the same time.
Preconditions: Must be run after Django is initialized, since python_utilities
is in the django path.
In [ ]:
# python_utilities
from python_utilities.logging.logging_helper import LoggingHelper
# init
my_logging_helper = LoggingHelper()
my_logging_helper.set_logger_name( "proquest_hnp-article-loading-{}".format( paper_identifier ) )
log_message = None
Create and initialize an instance of ProquestHNPNewspaperHelper for this paper.
In [ ]:
my_paper = ProquestHNPNewspaperHelper()
paper_instance = my_paper.initialize_from_database( paper_identifier )
my_paper.source_all_papers_folder = source_paper_folder
my_paper.destination_all_papers_folder = uncompressed_paper_folder
In [ ]:
print( my_paper )
print( paper_instance )
In [ ]:
my_paper = ProquestHNPNewspaperHelper()
my_paper.paper_identifier = paper_identifier
my_paper.source_all_papers_folder = source_paper_folder
my_paper.source_paper_path = source_paper_path
my_paper.destination_all_papers_folder = uncompressed_paper_folder
my_paper.destination_paper_path = uncompressed_paper_path
my_paper.paper_start_year = 1872
my_paper.paper_end_year = 1985
my_newspaper = Newspaper.objects.get( id = 6 )
my_paper.newspaper = my_newspaper
If desired, add to database.
In [ ]:
phnp_newspaper_instance = my_paper.create_PHNP_newspaper()
In [ ]:
print( phnp_newspaper_instance )
Specify which folder of XML files should be loaded into the system, then process all files within the folder.
The compressed archives from proquest_hnp just contain publication XML files, with no containing folder.
To process:
- uncompressed paper folder ( <paper_folder> ) - make a folder in /mnt/hgfs/projects/phd/proquest_hnp/uncompressed for the paper whose data you are working with, named the same as the paper's folder in /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data (example: "BostonGlobe").
- uncompressed archive folder ( <archive_folder> ) - inside a given paper's folder in uncompressed, for each archive file, create a folder named the same as the archive file, but with no ".zip" at the end (example: for "BG_20171002210239_00001.zip", make a folder named "BG_20171002210239_00001"), so: <paper_folder>/<archive_name_no_zip>.
- unzip the archive into this folder: unzip <path_to_zip> -d <archive_folder>
See if the uncompressed paper folder exists. If not, set flag and create it.
In [ ]:
# create folder to hold the results of decompressing paper's zip files.
did_uncomp_paper_folder_exist = my_paper.make_dest_paper_folder()
For each *.zip file in the paper's source folder, check if a folder named the same as the "archive identifier" (the file name minus ".zip") is present in the uncompressed paper folder:
- If no: create the folder, then unzip the archive into it.
- If yes: skip the archive (assume it has already been uncompressed).
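A minimal sketch of this check-then-unzip loop, using only the standard library. This is my guess at what the `uncompress_paper_zip_files()` helper does, not its actual implementation; the folder layout is taken from the steps above.

```python
import glob
import os
import zipfile

def uncompress_zip_files( source_paper_path, uncompressed_paper_path ):

    '''
    For each *.zip in source_paper_path, make a matching folder (named the
        archive file name minus ".zip") in uncompressed_paper_path and unzip
        the archive into it.  Skips archives whose folder already exists.
    '''

    for zip_path in glob.glob( "{}/*.zip".format( source_paper_path ) ):

        # archive identifier = file name minus ".zip"
        archive_identifier = os.path.basename( zip_path )[ : -len( ".zip" ) ]
        dest_folder = "{}/{}".format( uncompressed_paper_path, archive_identifier )

        if not os.path.exists( dest_folder ):

            # folder absent - create it, then unzip into it.
            os.makedirs( dest_folder )
            with zipfile.ZipFile( zip_path ) as zip_file:
                zip_file.extractall( dest_folder )

        #-- END check to see if destination folder already exists. --#

    #-- END loop over zip files. --#

#-- END function uncompress_zip_files() --#
```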
In [ ]:
# decompress the files
my_paper.uncompress_paper_zip_files()
Change working directories to the uncompressed paper path.
In [ ]:
%cd $uncompressed_paper_path
In [ ]:
%ls
Load one of the files into memory and see what we can do with it. Beautiful Soup?
Looks like the root element is "Record", then the high-level type of the article is "ObjectType".
ObjectType values:
Good options for an XML parser:
- lxml.etree - https://stackoverflow.com/questions/12290091/reading-xml-file-and-fetching-its-attributes-value-in-python
- xmltodict - https://docs.python-guide.org/scenarios/xml/
- BeautifulSoup
using lxml
In [ ]:
# loop over files in the current archive folder path.
object_type_to_count_map = my_paper.process_archive_object_types( uncompressed_archive_path )
Processing 5752 files in /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20171002210239_00001
----> XML file count: 5752
Counters:
- Processed 5752 files
- No Record: 0
- No ObjectType: 0
- No ObjectType value: 0
ObjectType values and occurrence counts:
- A|d|v|e|r|t|i|s|e|m|e|n|t: 1902
- Article|Feature: 1792
- N|e|w|s: 53
- Commentary|Editorial: 36
- G|e|n|e|r|a|l| |I|n|f|o|r|m|a|t|i|o|n: 488
- S|t|o|c|k| |Q|u|o|t|e: 185
- Advertisement|Classified Advertisement: 413
- E|d|i|t|o|r|i|a|l| |C|a|r|t|o|o|n|/|C|o|m|i|c: 31
- Correspondence|Letter to the Editor: 119
- Front Matter|Table of Contents: 193
- O|b|i|t|u|a|r|y: 72
- F|r|o|n|t| |P|a|g|e|/|C|o|v|e|r| |S|t|o|r|y: 107
- I|m|a|g|e|/|P|h|o|t|o|g|r|a|p|h: 84
- Marriage Announcement|News: 6
- I|l|l|u|s|t|r|a|t|i|o|n: 91
- R|e|v|i|e|w: 133
- C|r|e|d|i|t|/|A|c|k|n|o|w|l|e|d|g|e|m|e|n|t: 30
- News|Legal Notice: 17
Loop over all folders in the paper path. For each folder, grab all files in the folder. For each file, parse the XML, then get the ObjectType value; if it isn't already in the map of object types to counts, add it. Increment its count.
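The counting loop described above can be sketched for a single folder like this. It is a stdlib sketch using xml.etree.ElementTree rather than xmltodict, and is not the actual code inside `process_archive_object_types()` / `process_paper_object_types()`. Multiple ObjectType elements in one Record are joined with "|" (as in "Article|Feature").

```python
import glob
import xml.etree.ElementTree as ET

def count_object_types( archive_folder_path ):

    '''
    Parse each XML file in a folder and count occurrences of each ObjectType
        value.  Records with multiple ObjectType elements get a single
        pipe-joined key.
    '''

    object_type_to_count_map = {}
    for xml_path in glob.glob( "{}/*.xml".format( archive_folder_path ) ):

        # root element is "Record"; ObjectType elements are direct children.
        record_root = ET.parse( xml_path ).getroot()
        type_list = [ el.text for el in record_root.findall( "ObjectType" ) ]
        if not type_list:
            continue
        object_type = "|".join( type_list )
        object_type_to_count_map[ object_type ] = object_type_to_count_map.get( object_type, 0 ) + 1

    #-- END loop over XML files. --#

    return object_type_to_count_map

#-- END function count_object_types() --#
```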
From command line, in the uncompressed BostonGlobe folder:
find . -type f -iname "*.xml" | wc -l
resulted in 11,374,500 articles. That is quite a few.
In [ ]:
xml_folder_list = glob.glob( "{}/*".format( uncompressed_paper_path ) )
print( "folder_list: {}".format( xml_folder_list ) )
In [ ]:
# build map of all object types for a paper to the overall counts of each
paper_object_type_to_count_map = my_paper.process_paper_object_types()
Example output:
XML file count: 5752 Counters:
ObjectType values and occurrence counts:
Choose a directory, then loop over the files in the directory to build a map of types to lists of file names.
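A sketch of building that map of types to file-path lists, again with the stdlib rather than the `map_archive_folder_files_to_types()` helper (whose internals I have not reproduced here):

```python
import glob
import xml.etree.ElementTree as ET

def map_types_to_files( archive_folder_path ):

    '''
    Build a map of ObjectType value --> list of paths of XML files in the
        folder whose Record has that type (pipe-joined when multi-valued).
    '''

    object_type_to_file_path_map = {}
    for xml_path in glob.glob( "{}/*.xml".format( archive_folder_path ) ):

        record_root = ET.parse( xml_path ).getroot()
        object_type = "|".join( el.text for el in record_root.findall( "ObjectType" ) )
        object_type_to_file_path_map.setdefault( object_type, [] ).append( xml_path )

    #-- END loop over XML files. --#

    return object_type_to_file_path_map

#-- END function map_types_to_files() --#
```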
In [ ]:
# directory to work in.
uncompressed_archive_folder = "BG_20151211054235_00003"
uncompressed_archive_path = "{}/{}".format( uncompressed_paper_path, uncompressed_archive_folder )
In [ ]:
# build map of file types to lists of files of that type in specified folder.
object_type_to_file_path_map = my_paper.map_archive_folder_files_to_types( uncompressed_archive_path )
In [ ]:
# which types do we want to preview?
# NOTE: master_object_type_list must already be defined (e.g., a list of the
#     ObjectType values you care about, such as the keys of
#     object_type_to_file_path_map).
types_to_output = master_object_type_list
#types_to_output = [ 'Advertisement|Classified Advertisement' ]

# declare variables
xml_file_path_list = None
xml_file_path_example_list = None
xml_file_path = None
xml_file = None
xml_dict = None
xml_string = None

# loop over types
for object_type in types_to_output:

    # print type and up to 10 example file paths.
    xml_file_path_list = object_type_to_file_path_map.get( object_type, [] )
    xml_file_path_example_list = xml_file_path_list[ : 10 ]
    print( "\n- {}:".format( object_type ) )
    for xml_file_path in xml_file_path_example_list:

        print( "----> {}".format( xml_file_path ) )

        # try to parse the file
        with open( xml_file_path ) as xml_file:

            # parse XML
            xml_dict = xmltodict.parse( xml_file.read() )

        #-- END with open( xml_file_path ) as xml_file: --#

        # pretty-print
        xml_string = xmltodict.unparse( xml_dict, pretty = True )

        # output
        print( xml_string )

    #-- END loop over example file paths. --#

#-- END loop over object types. --#
IDs:
<RecordID>1821311973</RecordID>
<URLDocView>http://search.proquest.com/docview/1821311973/</URLDocView>
Object Types:
<ObjectType>Feature</ObjectType>
<ObjectType>Article</ObjectType>
Action code:
<ActionCode>change</ActionCode>
Publication Date:
<AlphaPubDate>Nov 20, 1985</AlphaPubDate>
<NumericPubDate>19851120</NumericPubDate>
Headline:
<RecordTitle>Ulster pact rapped in Irish Parliament</RecordTitle>
Author:
<Contributor>
<ContribRole>Author</ContribRole>
<OriginalForm>Bob O'Connor Special to the Globe</OriginalForm>
</Contributor>
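The fields above can be pulled out of a Record with the standard library's ElementTree. This is a sketch; the element names come from the samples above, and the function name is mine, not part of any helper.

```python
import datetime
import xml.etree.ElementTree as ET

def parse_record_metadata( record_xml_string ):

    '''
    Pull the ID, type, action code, date, and headline fields shown above out
        of a Record's XML.  (Contributor/author parsing is handled separately.)
    '''

    record_root = ET.fromstring( record_xml_string )
    numeric_pub_date = record_root.findtext( "NumericPubDate" )  # e.g. "19851120"
    return {
        "record_id" : record_root.findtext( "RecordID" ),
        "object_types" : [ el.text for el in record_root.findall( "ObjectType" ) ],
        "action_code" : record_root.findtext( "ActionCode" ),
        "pub_date" : datetime.datetime.strptime( numeric_pub_date, "%Y%m%d" ).date(),
        "title" : record_root.findtext( "RecordTitle" ),
    }

#-- END function parse_record_metadata() --#
```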
From /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210230044_00004/367105818.xml:
<Contributor>
<ContribRole>Author</ContribRole>
<LastName>McCain</LastName>
<FirstName>Nina</FirstName>
<PersonName>Nina McCain</PersonName>
<OriginalForm>Nina McCain</OriginalForm>
</Contributor>
("Globe Staff" is still in the body text).
Looks like you can count on the person's name being in the "Contributor" element, sometimes parsed into name parts, sometimes not. Looks like it will not parse if the author string includes a suffix. Example:
from /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/ChristianScienceMonitor/CSM_20170929191926_00001/513134635.xml:
<Contributor>
<ContribRole>Author</ContribRole>
<OriginalForm>John Dillin Staff writer of The Christian Science Monitor</OriginalForm>
</Contributor>
If no "Contributor" element, then they are asserting that there is no byline.
Shared bylines = Multiple Contributor elements:
from /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/Newsday/Newsday_20171006231925_00050/1000174750.xml
<Contributor>
<ContribRole>Author</ContribRole>
<LastName>Nash</LastName>
<MiddleName>M</MiddleName>
<FirstName>Bruce</FirstName>
<PersonName>Bruce M Nash</PersonName>
<OriginalForm>Bruce M Nash</OriginalForm>
</Contributor>
<Contributor>
<ContribRole>Author</ContribRole>
<LastName>Monchick</LastName>
<MiddleName>B</MiddleName>
<FirstName>Randolph</FirstName>
<PersonName>Randolph B Monchick</PersonName>
<OriginalForm>Randolph B Monchick</OriginalForm>
</Contributor>
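The Contributor patterns above (single author, shared bylines, name parts sometimes missing) can be handled with a sketch like this; names and structure are my own, based only on the samples shown:

```python
import xml.etree.ElementTree as ET

def get_authors( record_xml_string ):

    '''
    Return one dict per Contributor element with ContribRole "Author".  An
        empty list means the Record asserts there is no byline.  Name-part
        values are None when the OriginalForm did not parse (e.g., when the
        author string includes a suffix).
    '''

    author_list = []
    record_root = ET.fromstring( record_xml_string )
    for contributor in record_root.findall( "Contributor" ):

        if contributor.findtext( "ContribRole" ) != "Author":
            continue

        author_list.append( {
            "original_form" : contributor.findtext( "OriginalForm" ),
            "first_name" : contributor.findtext( "FirstName" ),
            "last_name" : contributor.findtext( "LastName" ),
            "person_name" : contributor.findtext( "PersonName" ),
        } )

    #-- END loop over Contributor elements. --#

    return author_list

#-- END function get_authors() --#
```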
Boston Globe
byline suffixes (loop and compile all from the "OriginalForm" strings):
- if no identifier?
- /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210230044_00004/367091933.xml - in this case, person is not a staff journalist.
Newsday
might be some suffixes (but not many):
- /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/Newsday/Newsday_20171006231925_00050/1000247974.xml
- /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/Newsday/Newsday_20171006231925_00050/1002490977.xml
Christian Science Monitor
looks like news bylines include a suffix. Example:
from /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/ChristianScienceMonitor/CSM_20170929191926_00001/513134635.xml:
<Contributor>
<ContribRole>Author</ContribRole>
<OriginalForm>John Dillin Staff writer of The Christian Science Monitor</OriginalForm>
</Contributor>
Body text:
<Abstract> is the first sentence/lead, and <FullText> is the full text (so we could look for the contents of the abstract in the full text to see where the article itself begins? In the first example I looked at, the abstract had a period at the end and the matching sentence in the full text did not. Hmmm.). Headline and byline are in the full text. For the Globe, it looks like the good way to split is on the "OriginalForm" of the "Contributor". That is what I'd try first.
TODO:
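The split-on-OriginalForm idea can be sketched as follows. The function name and return shape are my guesses at an approach, not an existing helper:

```python
def split_full_text_on_byline( full_text, original_form ):

    '''
    Split FullText at the first occurrence of the Contributor's OriginalForm
        string.  Returns ( headline_and_byline, body ); if the byline string
        is not found, returns ( None, full_text ) unchanged.
    '''

    index = full_text.find( original_form )
    if index == -1:
        return ( None, full_text )
    #-- END check to see if byline found. --#

    split_point = index + len( original_form )
    return ( full_text[ : split_point ].strip(), full_text[ split_point : ].strip() )

#-- END function split_full_text_on_byline() --#
```

Note that this splits on the first occurrence, so an author name that also appears later in the body is handled correctly, but a headline that happens to contain the author's name would not be.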