Introduction

This notebook expands on the OpenCalais code in article_coding.py, also in this folder. It adds sections on selecting the publications you want to submit to OpenCalais, as an example. It is intended to be copied and re-used.

Setup

Setup - Debug


In [1]:
debug_flag = False

Setup - Imports


In [2]:
import datetime
from django.db.models import Avg, Max, Min
import logging
import six

Setup - working folder paths


In [3]:
%pwd


Out[3]:
'/home/jonathanmorgan/work/django/research/work/phd_work/data/article_coding'

In [ ]:
# current working folder
current_working_folder = "/home/jonathanmorgan/work/django/research/work/phd_work/analysis"
current_datetime = datetime.datetime.now()
current_date_string = current_datetime.strftime( "%Y-%m-%d-%H-%M-%S" )

Setup - logging

Configure logging for this notebook's kernel. (If you do not run this cell, you'll get the django application's logging configuration.)


In [4]:
# build file name
logging_file_name = "{}/article_coding-{}.log.txt".format( current_working_folder, current_date_string )

# set up logging.
logging.basicConfig(
    level = logging.DEBUG,
    format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
    filename = logging_file_name,
    filemode = 'w' # set to 'a' if you want to append, rather than overwrite each time.
)

Setup - virtualenv jupyter kernel

If you are using a virtualenv, make sure that you:

  • have installed your virtualenv as a kernel.
  • choose the kernel for your virtualenv as the kernel for your notebook (Kernel --> Change kernel).

Since I use a virtualenv, I need to get it activated somehow inside this notebook. One option is to run ../dev/wsgi.py in this notebook, to configure the python environment manually as if you had activated the sourcenet virtualenv. To do this, you'd make a code cell that contains:

%run ../dev/wsgi.py

This is sketchy, however, because of the changes it makes to your Python environment within the context of whatever your current kernel is. I'd worry about collisions with the actual Python 3 kernel. A better approach is to install your virtualenv as a separate kernel. Steps:

  • activate your virtualenv:

      workon research
  • in your virtualenv, install the package ipykernel.

      pip install ipykernel
  • use the ipykernel python program to install the current environment as a kernel:

      python -m ipykernel install --user --name <env_name> --display-name "<display_name>"
    
    

    sourcenet example:

      python -m ipykernel install --user --name sourcenet --display-name "research (Python 3)"

More details: http://ipython.readthedocs.io/en/stable/install/kernel_install.html

Setup - Initialize Django

First, initialize my dev django project, so I can run code in this notebook that references my django models and can talk to the database using my project's settings.


In [5]:
# init django
django_init_folder = "/home/jonathanmorgan/work/django/research/work/phd_work"
django_init_path = "django_init.py"
if( ( django_init_folder is not None ) and ( django_init_folder != "" ) ):
    
    # add folder to front of path.
    django_init_path = "{}/{}".format( django_init_folder, django_init_path )
    
#-- END check to see if django_init folder. --#

In [6]:
%run $django_init_path


django initialized at 2019-08-11 14:21:24.190799

In [7]:
# context_text imports
from context_text.article_coding.article_coding import ArticleCoder
from context_text.article_coding.article_coding import ArticleCoding
from context_text.article_coding.open_calais_v2.open_calais_v2_article_coder import OpenCalaisV2ArticleCoder
from context_text.collectors.newsbank.newspapers.GRPB import GRPB
from context_text.collectors.newsbank.newspapers.DTNB import DTNB
from context_text.models import Article
from context_text.models import Article_Subject
from context_text.models import Newspaper
from context_text.shared.context_text_base import ContextTextBase

Setup - Initialize LoggingHelper

Create a LoggingHelper instance to use to log debug messages and print them at the same time.

Preconditions: Must be run after Django is initialized, since python_utilities is in the django path.


In [8]:
# python_utilities
from python_utilities.logging.logging_helper import LoggingHelper

# init
my_logging_helper = LoggingHelper()
my_logging_helper.set_logger_name( "newsbank-article_coding" )
log_message = None

Find articles to be coded

Tag all locally implemented hard news articles in the database and all that have already been coded using Open Calais V2, then work through using OpenCalais to code all local hard news that hasn't already been coded, starting with those proximal to the coding sample for the methods paper.

which articles have already been coded?

More precisely, find all articles that have Article_Data coded by the automated coder with type "OpenCalais_REST_API_v2" and tag the articles as "coded-open_calais_v2" or something like that.

Then, for articles without that tag, use our criteria for local hard news to filter and tag publications in the year before and after the month used to evaluate the automated coder, in both the Grand Rapids Press and the Detroit News, so I can look at longer time frames. Then code all articles currently in the database.

Eventually, then, we'll code and examine before and after layoffs.


In [9]:
# look for publications that have article data:
# - coded by automated coder
# - with coder type of "OpenCalais_REST_API_v2"

# get automated coder
automated_coder_user = ArticleCoder.get_automated_coding_user()

print( "{} - Loaded automated user: {}, id = {}".format( datetime.datetime.now(), automated_coder_user, automated_coder_user.id ) )


2019-08-11 14:21:26.963496 - Loaded automated user: automated, id = 2

In [ ]:
# try aggregates
article_qs = Article.objects.all()
pub_date_info = article_qs.aggregate( Max( 'pub_date' ), Min( 'pub_date' ) )
print( pub_date_info )
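The aggregate() call returns a dict keyed by "&lt;field&gt;__&lt;function&gt;"; the same profile can be mimicked in plain Python. A standalone sketch with hypothetical dates:

```python
import datetime

# hypothetical pub_date values pulled from a queryset.
pub_dates = [
    datetime.date( 2009, 12, 1 ),
    datetime.date( 2010, 7, 31 ),
    datetime.date( 2009, 12, 6 ),
]

# mirrors aggregate( Max( 'pub_date' ), Min( 'pub_date' ) ).
pub_date_info = {
    "pub_date__max": max( pub_dates ),
    "pub_date__min": min( pub_dates ),
}
print( pub_date_info )
```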

In [ ]:
# find articles with Article_Data created by the automated user...
article_qs = Article.objects.filter( article_data__coder = automated_coder_user )

# ...and specifically coded using OpenCalais V2...
article_qs = article_qs.filter( article_data__coder_type = OpenCalaisV2ArticleCoder.CONFIG_APPLICATION )

# ...and finally, we just want the distinct articles by ID.
article_qs = article_qs.order_by( "id" ).distinct( "id" )

# count?
article_count = article_qs.count()
print( "Found {} articles".format( article_count ) )

Tag the coded articles

Removing the duplicates introduced by joining with Article_Data yields 579 articles that were coded by the automated coder.

Tag all the coded articles with OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME.


In [ ]:
# declare variables
current_article = None
tag_name_list = None
article_count = None
untagged_count = None
already_tagged_count = None
newly_tagged_count = None
count_sum = None
do_add_tag = False

# init
do_add_tag = True

# get article_count
article_count = article_qs.count()

# loop over articles.
untagged_count = 0
already_tagged_count = 0
newly_tagged_count = 0
for current_article in article_qs:
    
    # get list of tags for this publication
    tag_name_list = current_article.tags.names()
    
    # is the coded tag in the list?
    if ( OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME not in tag_name_list ):
        
        # are we adding tag?
        if ( do_add_tag == True ):

            # add tag.
            current_article.tags.add( OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME )
            newly_tagged_count += 1
            
        else:

            # for now, increment untagged count
            untagged_count += 1
            
        #-- END check to see if we are adding tag. --#
        
    else:
        
        # already tagged
        already_tagged_count += 1
        
    #-- END check to see if coded tag is set --#
    
#-- END loop over articles. --#

print( "Article counts:" )
print( "- total articles: {}".format( article_count ) )
print( "- untagged articles: {}".format( untagged_count ) )
print( "- already tagged: {}".format( already_tagged_count ) )
print( "- newly tagged: {}".format( newly_tagged_count ) )
count_sum = untagged_count + already_tagged_count + newly_tagged_count
print( "- count sum: {}".format( count_sum ) )
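The three-way bookkeeping in the loop above can be exercised without a database; a minimal sketch using hypothetical tag lists (the tag value is a stand-in for OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME):

```python
# hypothetical stand-in for OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME
CODED_TAG = "coded-open_calais_v2"

def count_tag_states( tag_name_lists, do_add_tag ):

    # same three counters as the loop over article_qs.
    untagged_count = 0
    already_tagged_count = 0
    newly_tagged_count = 0
    for tag_name_list in tag_name_lists:

        # is the coded tag in the list?
        if ( CODED_TAG not in tag_name_list ):

            if ( do_add_tag == True ):
                newly_tagged_count += 1
            else:
                untagged_count += 1

        else:
            already_tagged_count += 1

    return ( untagged_count, already_tagged_count, newly_tagged_count )

#-- END function count_tag_states() --#

# three articles: one already tagged, two not.
example_tags = [ [ CODED_TAG ], [], [ "grp_month" ] ]
print( count_tag_states( example_tags, True ) )   # tags the two untagged articles
print( count_tag_states( example_tags, False ) )  # only counts them
```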

Profile the coded articles

Look at range of pub dates.


In [ ]:
tags_in_list = []
tags_in_list.append( OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME )
article_qs = Article.objects.filter( tags__name__in = tags_in_list )
print( "Matching article count: {}".format( article_qs.count() ) )

  • Original: 579
  • after coding 10: 589 (tag is being set correctly by Open Calais V2 coder)
  • 2019.08.02 - after 5000 (minus a few errors because 2 seconds isn't quite enough for rate limit): 5518

In [ ]:
# profile these publications
min_pub_date = None
max_pub_date = None
current_pub_date = None
pub_date_count = None
date_to_count_map = {}
date_to_articles_map = {}
pub_date_article_dict = None

# try aggregates
pub_date_info = article_qs.aggregate( Max( 'pub_date' ), Min( 'pub_date' ) )
print( pub_date_info )

# counts of pubs by date
for current_article in article_qs:
    
    # get pub_date
    current_pub_date = current_article.pub_date
    current_article_id = current_article.id
    
    # get count, increment, and store.
    pub_date_count = date_to_count_map.get( current_pub_date, 0 )
    pub_date_count += 1
    date_to_count_map[ current_pub_date ] = pub_date_count
    
    # also, store up ids and instances
    
    # get dict of article ids to article instances for date
    pub_date_article_dict = date_to_articles_map.get( current_pub_date, {} )
    
    # article already there?
    if ( current_article_id not in pub_date_article_dict ):
        
        # no - add it.
        pub_date_article_dict[ current_article_id ] = current_article
        
    #-- END check to see if article already there. --#
    
    # put dict back.
    date_to_articles_map[ current_pub_date ] = pub_date_article_dict
    
#-- END loop over articles. --#

# output dates and counts.

# get list of keys from map
keys_list = list( six.viewkeys( date_to_count_map ) )
keys_list.sort()
for current_pub_date in keys_list:
    
    # get count
    pub_date_count = date_to_count_map.get( current_pub_date, 0 )
    print( "- {} ( {} ) count: {}".format( current_pub_date, type( current_pub_date ), pub_date_count ) )
    
#-- END loop over dates --#
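The get()/increment/store pattern above can also be written with collections.Counter; a standalone sketch with hypothetical pub dates:

```python
import datetime
from collections import Counter

# hypothetical pub_date values pulled from a queryset.
pub_dates = [
    datetime.date( 2010, 7, 30 ),
    datetime.date( 2010, 7, 31 ),
    datetime.date( 2010, 7, 31 ),
]

# Counter replaces the get()/increment/store pattern.
date_to_count_map = Counter( pub_dates )

# output dates and counts, sorted by date.
for current_pub_date in sorted( date_to_count_map ):

    pub_date_count = date_to_count_map[ current_pub_date ]
    print( "- {} count: {}".format( current_pub_date, pub_date_count ) )

#-- END loop over dates --#
```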

In [ ]:
# look at the 2010-07-31 date
pub_date = datetime.datetime.strptime( "2010-07-31", "%Y-%m-%d" ).date()
articles_for_date = date_to_articles_map.get( pub_date, {} )
print( articles_for_date )

# get the article and look at its tags.
article_instance = articles_for_date.get( 6065 )
print( article_instance.tags.all() )

# loop over associated Article_Data instances.
for article_data in article_instance.article_data_set.all():
    
    print( article_data )
    
#-- END loop over associated Article_Data instances --#

tag all local news

Definitions of local hard news by in-house implementors for the Grand Rapids Press and the Detroit News follow. For each, tag all articles in the database that match as "local_hard_news".

TODO

TODO:

  • make class for GRPB at NewsBank.

    • also, pull the things that are newspaper specific out of ArticleCoder.py and into the GRPB.py class.
  • refine "local news" and "locally created" regular expressions for Grand Rapids Press based on contents of author_string and author_affiliation.

  • do the same for TDN.
  • then, use the updated classes and definitions below to flag all local hard news in database for each publication.

DONE

DONE:

  • abstract out shared stuff from GRPB.py and DTNB.py into abstract parent class context_text/collectors/newsbank/newspapers/newsbank_newspaper.py

    • update DTNB.py to use the parent class.
  • make class for GRPB at NewsBank.

    • context_text/collectors/newsbank/newspapers/GRPB.py

Grand Rapids Press local news

Grand Rapids Press local hard news:

  • context_text/examples/articles/articles-GRP-local_news.py
  • local hard news sections (stored in Article.GRP_NEWS_SECTION_NAME_LIST):

    • "Business"
    • "City and Region"
    • "Front Page"
    • "Lakeshore"
    • "Religion"
    • "Special"
    • "State"
  • in-house implementor (based on byline patterns, stored in sourcenet.models.Article.Q_GRP_IN_HOUSE_AUTHOR):

    • Byline ends in "/ THE GRAND RAPIDS PRESS", ignore case.

      • Q( author_varchar__iregex = r'.* */ *THE GRAND RAPIDS PRESS$' )
    • Byline ends in "/ PRESS * EDITOR", ignore case.

      • Q( author_varchar__iregex = r'.* */ *PRESS .* EDITOR$' )
    • Byline ends in "/ GRAND RAPIDS PRESS * BUREAU", ignore case.

      • Q( author_varchar__iregex = r'.* */ *GRAND RAPIDS PRESS .* BUREAU$' )
    • Byline ends in "/ SPECIAL TO THE PRESS", ignore case.

      • Q( author_varchar__iregex = r'.* */ *SPECIAL TO THE PRESS$' )
  • can also exclude columns (I will not):

      grp_article_qs = grp_article_qs.exclude( index_terms__icontains = "Column" )

I need to refine this further.

Looking at affiliation strings:

SELECT author_affiliation, COUNT( author_affiliation ) as affiliation_count
FROM context_text_article
WHERE newspaper_id = 1
GROUP BY author_affiliation
ORDER BY COUNT( author_affiliation ) DESC;

And at author strings for collective bylines:

SELECT author_string, COUNT( author_string ) as author_count
FROM context_text_article
WHERE newspaper_id = 1
GROUP BY author_string
ORDER BY COUNT( author_string ) DESC
LIMIT 10;
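The same GROUP BY profile can be reproduced outside of the project database; a self-contained sqlite3 sketch with toy rows (table name and data are hypothetical):

```python
import sqlite3

# in-memory database with a minimal stand-in for context_text_article.
connection = sqlite3.connect( ":memory:" )
cursor = connection.cursor()
cursor.execute( "CREATE TABLE article ( newspaper_id INTEGER, author_affiliation TEXT )" )
cursor.executemany(
    "INSERT INTO article VALUES ( ?, ? )",
    [
        ( 1, "The Grand Rapids Press" ),
        ( 1, "The Grand Rapids Press" ),
        ( 1, "Special to The Press" ),
        ( 2, "The Detroit News" ),
    ]
)

# same shape as the profiling query above.
cursor.execute(
    """
    SELECT author_affiliation, COUNT( author_affiliation ) AS affiliation_count
    FROM article
    WHERE newspaper_id = 1
    GROUP BY author_affiliation
    ORDER BY COUNT( author_affiliation ) DESC
    """
)
rows = cursor.fetchall()
for row in rows:
    print( row )
```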

In [ ]:
# filter queryset to just locally created Grand Rapids Press (GRP) articles.
# imports
from context_text.models import Article
from context_text.models import Newspaper
from context_text.shared.context_text_base import ContextTextBase
from context_text.collectors.newsbank.newspapers.GRPB import GRPB

# declare variables - Grand Rapids Press
do_apply_tag = False
tag_to_apply = None
grp_local_news_sections = []
grp_newspaper = None
grp_article_qs = None
article_count = -1

# declare variables - filtering
include_opinion_columns = True
tags_in_list = []
tags_not_in_list = []
filter_out_prelim_tags = False
random_count = -1

# declare variables - make list of article IDs from QS.
article_id_list = []
article_counter = -1
current_article = None
article_tag_name_list = None
article_update_counter = -1

# ==> configure

# configure - size of random sample we want
#random_count = 60

# configure - also, apply tag?
do_apply_tag = True
tag_to_apply = ContextTextBase.TAG_LOCAL_HARD_NEWS

# set up "local, regional and state news" sections
grp_local_news_sections = GRPB.LOCAL_NEWS_SECTION_NAME_LIST

# Grand Rapids Press
# get newspaper instance for GRP.
grp_newspaper = Newspaper.objects.get( id = GRPB.NEWSPAPER_ID )

# start with all articles
#grp_article_qs = Article.objects.all()

# ==> filter to newspaper, local news section list, and in-house reporters.

# ----> manually

# now, need to find local news articles to test on.
#grp_article_qs = grp_article_qs.filter( newspaper = grp_newspaper )

# only the locally implemented sections
#grp_article_qs = grp_article_qs.filter( section__in = grp_local_news_sections )

# and, with an in-house author
#grp_article_qs = grp_article_qs.filter( Article.Q_GRP_IN_HOUSE_AUTHOR )

#print( "manual filter count: {}".format( grp_article_qs.count() ) )

# ----> using Article.filter_articles()
grp_article_qs = Article.filter_articles( qs_IN = grp_article_qs,
                                          newspaper = grp_newspaper,
                                          section_name_list = grp_local_news_sections,
                                          custom_article_q = GRPB.Q_IN_HOUSE_AUTHOR )

print( "Article.filter_articles count: {}".format( grp_article_qs.count() ) )

# and include opinion columns?
if ( include_opinion_columns == False ):
    
    # do not include columns
    grp_article_qs = grp_article_qs.exclude( index_terms__icontains = "Column" )
    
#-- END check to see if we include columns. --#

'''
# filter to newspaper, section list, and in-house reporters.
grp_article_qs = Article.filter_articles( qs_IN = grp_article_qs,
                                          start_date = "2009-12-01",
                                          end_date = "2009-12-31",
                                          newspaper = grp_newspaper,
                                          section_name_list = grp_local_news_sections,
                                          custom_article_q = Article.Q_GRP_IN_HOUSE_AUTHOR )
'''

# how many is that?
article_count = grp_article_qs.count()

print( "Article count before filtering on tags: " + str( article_count ) )

# ==> tags

# tags to exclude
tags_not_in_list = []

# Example: prelim-related tags
#tags_not_in_list.append( "prelim_reliability" )
#tags_not_in_list.append( "prelim_network" )
#tags_not_in_list.append( "minnesota1-20160328" )
#tags_not_in_list.append( "minnesota2-20160328" )

# for later - exclude articles already coded.
#tags_not_in_list.append( OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME )

# exclude any already tagged with tag_to_apply
tags_not_in_list.append( tag_to_apply )

if ( ( tags_not_in_list is not None ) and ( len( tags_not_in_list ) > 0 ) ):

    # exclude those in a list
    print( "filtering out articles with tags: " + str( tags_not_in_list ) )
    grp_article_qs = grp_article_qs.exclude( tags__name__in = tags_not_in_list )

#-- END check to see if we have a specific list of tags we want to exclude --#

# include only those with certain tags.
tags_in_list = []

# Examples

# Examples: prelim-related tags
#tags_in_list.append( "prelim_unit_test_001" )
#tags_in_list.append( "prelim_unit_test_002" )
#tags_in_list.append( "prelim_unit_test_003" )
#tags_in_list.append( "prelim_unit_test_004" )
#tags_in_list.append( "prelim_unit_test_005" )
#tags_in_list.append( "prelim_unit_test_006" )
#tags_in_list.append( "prelim_unit_test_007" )

# Example: grp_month
#tags_in_list.append( "grp_month" )

if ( ( tags_in_list is not None ) and ( len( tags_in_list ) > 0 ) ):

    # filter
    print( "filtering to just articles with tags: " + str( tags_in_list ) )
    grp_article_qs = grp_article_qs.filter( tags__name__in = tags_in_list )
    
#-- END check to see if we have a specific list of tags we want to include --#

# filter out "*prelim*" tags?
#filter_out_prelim_tags = True
if ( filter_out_prelim_tags == True ):

    # filter out all articles with any tag whose name contains "prelim".
    print( "filtering out articles with tags that contain \"prelim\"" )
    grp_article_qs = grp_article_qs.exclude( tags__name__icontains = "prelim" )
    
#-- END check to see if we filter out "prelim_*" tags --#

# how many is that?
article_count = grp_article_qs.count()

print( "Article count after tag filtering: " + str( article_count ) )

# do we want a random sample?
if ( random_count > 0 ):

    # to get random, order them by "?", then use slicing to retrieve requested
    #     number.
    grp_article_qs = grp_article_qs.order_by( "?" )[ : random_count ]
    
#-- END check to see if we want random sample --#

# this is a nice algorithm, also:
# - http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/

# make ID list, tag articles if configured to.
article_id_list = []
article_counter = 0
article_update_counter = 0
for current_article in grp_article_qs:

    # increment article_counter
    article_counter += 1

    # add IDs to article_id_list
    article_id_list.append( str( current_article.id ) )
    
    # apply a tag while we are at it?
    if ( ( do_apply_tag == True ) and ( tag_to_apply is not None ) and ( tag_to_apply != "" ) ):
    
        # yes, please.  Tag already present?
        article_tag_name_list = current_article.tags.names()
        if ( tag_to_apply not in article_tag_name_list ):

            # Add tag.
            current_article.tags.add( tag_to_apply )
            
            # increment counter
            article_update_counter += 1
            
        #-- END check to see if tag already present. --#
        
    #-- END check to see if we apply tag. --#

    # output the tags.
    if ( debug_flag == True ):
        print( "- Tags for article " + str( current_article.id ) + " : " + str( current_article.tags.all() ) )
    #-- END DEBUG --#

#-- END loop over articles --#

# output the list.
print( "grp_article_qs count: {}".format( grp_article_qs.count() ) )
print( "Found " + str( article_counter ) + " articles ( " + str( article_count ) + " )." )
print( "- Updated {} articles to add tag {}.".format( article_update_counter, tag_to_apply ) )
if ( debug_flag == True ):
    print( "List of " + str( len( article_id_list ) ) + " local GRP staff article IDs: " + ", ".join( article_id_list ) )
#-- END DEBUG --#
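order_by( "?" ) pushes the shuffle into the database; once IDs are in hand, the same draw can be made in Python with random.sample, which selects without replacement like the slice above. A standalone sketch with hypothetical article IDs:

```python
import random

# hypothetical article IDs already pulled from a queryset.
article_id_list = [ 21579, 21609, 21627, 21653, 21756, 22630 ]

# seed for repeatability; drop this for a true random draw.
random.seed( 20190811 )

# draw random_count IDs without replacement.
random_count = 3
sample_id_list = random.sample( article_id_list, random_count )
print( sample_id_list )
```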

Detroit News local news

Detroit News local news:

  • context_text/examples/articles/articles-TDN-local_news.py
  • local hard news sections (stored in DTNB.NEWS_SECTION_NAME_LIST, imported from context_text.collectors.newsbank.newspapers.DTNB):

    • "Business"
    • "Metro"
    • "Nation" - because of auto industry stories
  • in-house implementor (based on byline patterns, stored in DTNB.Q_IN_HOUSE_AUTHOR):

    • Byline ends in "/ The Detroit News", ignore case.

      • Q( author_varchar__iregex = r'.*\s*/\s*the\s*detroit\s*news$' )
    • Byline ends in "Special to The Detroit News", ignore case.

      • Q( author_varchar__iregex = r'.*\s*/\s*special\s*to\s*the\s*detroit\s*news$' )
    • Byline ends in "Detroit News * Bureau", ignore case.

      • Q( author_varchar__iregex = r'.*\s*/\s*detroit\s*news\s*.*\s*bureau$' )

In [ ]:
# filter queryset to just locally created Detroit News (TDN) articles.
# imports
from context_text.models import Article
from context_text.models import Newspaper
from context_text.shared.context_text_base import ContextTextBase
from context_text.collectors.newsbank.newspapers.DTNB import DTNB

# declare variables - Detroit News
do_apply_tag = False
tag_to_apply = None
tdn_local_news_sections = []
tdn_newspaper = None
tdn_article_qs = None
article_count = -1

# declare variables - filtering
include_opinion_columns = True
tags_in_list = []
tags_not_in_list = []
filter_out_prelim_tags = False
random_count = -1

# declare variables - make list of article IDs from QS.
article_id_list = []
article_counter = -1
current_article = None

# ==> configure

# configure - size of random sample we want
#random_count = 60

# configure - also, apply tag?
do_apply_tag = False
tag_to_apply = ContextTextBase.TAG_LOCAL_HARD_NEWS

# set up "local, regional and state news" sections
tdn_local_news_sections = DTNB.LOCAL_NEWS_SECTION_NAME_LIST

# Detroit News
# get newspaper instance for TDN.
tdn_newspaper = Newspaper.objects.get( id = DTNB.NEWSPAPER_ID )

# start with all articles
#tdn_article_qs = Article.objects.all()

# ==> filter to newspaper, local news section list, and in-house reporters.

# ----> manually

# now, need to find local news articles to test on.
#tdn_article_qs = tdn_article_qs.filter( newspaper = tdn_newspaper )

# only the locally implemented sections
#tdn_article_qs = tdn_article_qs.filter( section__in = tdn_local_news_sections )

# and, with an in-house author
#tdn_article_qs = tdn_article_qs.filter( DTNB.Q_IN_HOUSE_AUTHOR )

#print( "manual filter count: {}".format( tdn_article_qs.count() ) )

# ----> using Article.filter_articles()
tdn_article_qs = Article.filter_articles( qs_IN = tdn_article_qs,
                                          newspaper = tdn_newspaper,
                                          section_name_list = tdn_local_news_sections,
                                          custom_article_q = DTNB.Q_IN_HOUSE_AUTHOR )

print( "Article.filter_articles count: {}".format( tdn_article_qs.count() ) )

# and include opinion columns?
if ( include_opinion_columns == False ):
    
    # do not include columns
    tdn_article_qs = tdn_article_qs.exclude( author_string__in = DTNB.COLUMNIST_NAME_LIST )
    
#-- END check to see if we include columns. --#

'''
# filter to newspaper, section list, and in-house reporters.
tdn_article_qs = Article.filter_articles( qs_IN = tdn_article_qs,
                                          start_date = "2009-12-01",
                                          end_date = "2009-12-31",
                                          newspaper = tdn_newspaper,
                                          section_name_list = tdn_local_news_sections,
                                          custom_article_q = DTNB.Q_IN_HOUSE_AUTHOR )
'''

# how many is that?
article_count = tdn_article_qs.count()

print( "Article count before filtering on tags: " + str( article_count ) )

# ==> tags

# tags to exclude
#tags_not_in_list = [ "prelim_reliability", "prelim_network" ]
#tags_not_in_list = [ "minnesota1-20160328", "minnesota2-20160328", ]

# for later - exclude articles already coded.
#tags_not_in_list = [ OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME ]

tags_not_in_list = None
if ( ( tags_not_in_list is not None ) and ( len( tags_not_in_list ) > 0 ) ):

    # exclude those in a list
    print( "filtering out articles with tags: " + str( tags_not_in_list ) )
    tdn_article_qs = tdn_article_qs.exclude( tags__name__in = tags_not_in_list )

#-- END check to see if we have a specific list of tags we want to exclude --#

# include only those with certain tags.
#tags_in_list = [ "prelim_unit_test_001", "prelim_unit_test_002", "prelim_unit_test_003", "prelim_unit_test_004", "prelim_unit_test_005", "prelim_unit_test_006", "prelim_unit_test_007" ]
#tags_in_list = [ "tdn_month", ]
tags_in_list = None
if ( ( tags_in_list is not None ) and ( len( tags_in_list ) > 0 ) ):

    # filter
    print( "filtering to just articles with tags: " + str( tags_in_list ) )
    tdn_article_qs = tdn_article_qs.filter( tags__name__in = tags_in_list )
    
#-- END check to see if we have a specific list of tags we want to include --#

# filter out "*prelim*" tags?
#filter_out_prelim_tags = True
if ( filter_out_prelim_tags == True ):

    # filter out all articles with any tag whose name contains "prelim".
    print( "filtering out articles with tags that contain \"prelim\"" )
    tdn_article_qs = tdn_article_qs.exclude( tags__name__icontains = "prelim" )
    
#-- END check to see if we filter out "prelim_*" tags --#

# how many is that?
article_count = tdn_article_qs.count()

print( "Article count after tag filtering: " + str( article_count ) )

# do we want a random sample?
if ( random_count > 0 ):

    # to get random, order them by "?", then use slicing to retrieve requested
    #     number.
    tdn_article_qs = tdn_article_qs.order_by( "?" )[ : random_count ]
    
#-- END check to see if we want random sample --#

# this is a nice algorithm, also:
# - http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/

# make ID list, tag articles if configured to.
article_id_list = []
article_counter = 0
for current_article in tdn_article_qs:

    # increment article_counter
    article_counter += 1

    # add IDs to article_id_list
    article_id_list.append( str( current_article.id ) )
    
    # apply a tag while we are at it?
    if ( ( do_apply_tag == True ) and ( tag_to_apply is not None ) and ( tag_to_apply != "" ) ):
    
        # yes, please.  Add tag.
        current_article.tags.add( tag_to_apply )
        
    #-- END check to see if we apply tag. --#

    # output the tags.
    if ( debug_flag == True ):
        print( "- Tags for article " + str( current_article.id ) + " : " + str( current_article.tags.all() ) )
    #-- END DEBUG --#

#-- END loop over articles --#

# output the list.
print( "tdn_article_qs count: {}".format( tdn_article_qs.count() ) )
print( "Found " + str( article_counter ) + " articles ( " + str( article_count ) + " )." )
if ( debug_flag == True ):
    print( "List of " + str( len( article_id_list ) ) + " local TDN staff article IDs: " + ", ".join( article_id_list ) )
#-- END DEBUG --#

Code Articles

Retrieve just publications that are tagged as being local hard news and that also are not tagged as having been coded by OpenCalaisV2.


In [10]:
# declare variables

# declare variables - article filter parameters
start_pub_date = None # should be datetime instance
end_pub_date = None # should be datetime instance
tags_in_list = []
tags_not_in_list = []
paper_id_in_list = []
section_list = []
article_id_in_list = []
params = {}

# declare variables - processing
do_i_print_updates = True
my_article_coding = None
article_qs = None
article_count = -1
coding_status = ""
limit_to = -1
do_coding = True

# declare variables - results
success_count = -1
success_list = None
got_errors = False
error_count = -1
error_dictionary = None
error_article_id = -1
error_status_list = None
error_status = ""
error_status_counter = -1

# first, get a list of articles to code.

# ! Set param values.

# ==> start and end dates
#start_pub_date = "2009-12-06"
#end_pub_date = "2009-12-12"

# ==> tagged articles

# Examples:
#tag_in_list = "prelim_reliability"
#tag_in_list = "prelim_network"
#tag_in_list = "prelim_unit_test_007"
#tag_in_list = [ "prelim_reliability", "prelim_network" ]
#tag_in_list = [ "prelim_reliability_test" ] # 60 articles - Grand Rapids only.
#tag_in_list = [ "prelim_reliability_combined" ] # 87 articles, Grand Rapids and Detroit.
#tag_in_list = [ "prelim_training_001" ]
#tag_in_list = [ "grp_month" ]

# ----> include articles when these tags are present.
#tags_in_list = None
tags_in_list = []
tags_in_list.append( ContextTextBase.TAG_LOCAL_HARD_NEWS )

# ----> exclude articles when these tags are present.
#tags_not_in_list = None
tags_not_in_list = []
tags_not_in_list.append( OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME )

# ==> IDs of newspapers to include.
#paper_id_in_list = "1"

# ==> names of sections to include.
#section_list = "Lakeshore,Front Page,City and Region,Business"

# ==> just limit to specific articles by ID.
article_id_in_list = []
#article_id_in_list = [ 360962 ]
#article_id_in_list = [ 28598 ]
#article_id_in_list = [ 21653, 21756 ]
#article_id_in_list = [ 90948 ]
#article_id_in_list = [ 21627, 21609, 21579 ]
#article_id_in_list = [ 48778 ]
#article_id_in_list = [ 6065 ]
#article_id_in_list = [ 221858 ]
#article_id_in_list = [ 23804, 22630 ]
#article_id_in_list = [ 23804 ]

# debugging exception
#article_id_in_list.append( 402670 )
#article_id_in_list.append( 408735 )

# filter parameters
params[ ArticleCoding.PARAM_START_DATE ] = start_pub_date
params[ ArticleCoding.PARAM_END_DATE ] = end_pub_date
params[ ArticleCoding.PARAM_TAGS_IN_LIST ] = tags_in_list
params[ ArticleCoding.PARAM_TAGS_NOT_IN_LIST ] = tags_not_in_list
params[ ArticleCoding.PARAM_PUBLICATION_LIST ] = paper_id_in_list
params[ ArticleCoding.PARAM_SECTION_LIST ] = section_list
params[ ArticleCoding.PARAM_ARTICLE_ID_LIST ] = article_id_in_list

# set coder you want to use.

# OpenCalais REST API v.2
params[ ArticleCoding.PARAM_CODER_TYPE ] = ArticleCoding.ARTICLE_CODING_IMPL_OPEN_CALAIS_API_V2

# get instance of ArticleCoding
my_article_coding = ArticleCoding()
my_article_coding.do_print_updates = do_i_print_updates

# to adjust timing, you need to update the ArticleCoder class for your
#     coder.  That overrides the value set here (so we respect limits
#     if they are coded into a particular coder):
my_article_coding.rate_limit_in_seconds = 3

# set params
my_article_coding.store_parameters( params )

print( "Query Parameters: {}".format( params ) )

# create query set - ArticleCoding does the filtering for you.
article_qs = my_article_coding.create_article_query_set()

print( "After my_article_coding.create_article_query_set(), count: {}".format( article_qs.count() ) )
if ( article_qs._result_cache is None ):
    
    print( "article_qs evaluated: NO ( {} )".format( article_qs._result_cache ) )
    
else:
    
    print( "article_qs evaluated: YES" )

#-- END check to see if _result_cache --#

# order by pub_date DESC, so we do most recent first.
article_qs = article_qs.order_by( "-pub_date" )

# limit for an initial test?
limit_to = 5000
# limit_to = 5
if ( ( limit_to is not None ) and ( isinstance( limit_to, int ) == True ) and ( limit_to > 0 ) ):

    # yes.
    article_qs = article_qs[ : limit_to ]

#-- END check to see if limit --#

# get article count
if ( isinstance( article_qs, list ) == True ):

    # list - call len()
    article_list = article_qs
    article_count = len( article_list )
    
else:

    # not a list - call count()
    article_count = article_qs.count()
    
#-- END figure out how to get count --#

print( "Matching article count: " + str( article_count ) )

# Do coding?
if ( do_coding == True ):

    print( "do_coding == True - it's on!" )

    # yes - make sure we have at least one article:
    if ( article_count > 0 ):

        # invoke the code_article_data( self, query_set_IN ) method.
        coding_status = my_article_coding.code_article_data( article_qs )
    
        # output status
        print( "\n\n==============================\n\nCoding status: \"" + coding_status + "\"" )
        
        # get success count
        success_count = my_article_coding.get_success_count()
        print( "\n\n====> Count of articles successfully processed: " + str( success_count ) )    
        
        # if successes, list out IDs.
        if ( success_count > 0 ):
        
            # there were successes.
            success_list = my_article_coding.get_success_list()
            print( "- list of successfully processed articles: " + str( success_list ) )
        
        #-- END check to see if successes. --#
        
        # got errors?
        got_errors = my_article_coding.has_errors()
        if ( got_errors == True ):
        
            # get error dictionary
            error_dictionary = my_article_coding.get_error_dictionary()
            
            # get error count
            error_count = len( error_dictionary )
            print( "\n\n====> Count of articles with errors: " + str( error_count ) )
            
            # loop...
            for error_article_id, error_status_list in six.iteritems( error_dictionary ):
            
                # output errors for this article.
                print( "- errors for article ID " + str( error_article_id ) + ":" )
                
                # loop over status messages.
                error_status_counter = 0
                for error_status in error_status_list:
                
                    # increment status
                    error_status_counter += 1

                    # print status
                    print( "----> status #" + str( error_status_counter ) + ": " + error_status )
                    
                #-- END loop over status messages. --#
            
            #-- END loop over articles. --#
   
        #-- END check to see if errors --#
    
    #-- END check to see if article count. --#
    
else:
    
    # output matching article count.
    print( "do_coding == False, so dry run" )
    
#-- END check to see if we do_coding --#


Query Parameters: {'start_date': None, 'end_date': None, 'tags_in_list_IN': ['local_hard_news'], 'tags_not_in_list_IN': ['coded-OpenCalaisV2ArticleCoder'], 'publications': [], 'section_list': [], 'article_id_list': [], 'coder_type': 'open_calais_api_v2'}
After my_article_coding.create_article_query_set(), count: 0
article_qs evaluated: NO ( None )
Matching article count: 0
do_coding == True - it's on!
  • 2019.07.31 - 5000 - started: execution queued 22:42:01 2019-07-31 --> executed in 3h 47m 55s, finished 02:29:55 2019-08-01
  • 2019.08.03 - 4990 - started: execution queued 00:38:05 2019-08-03 -->
  • 2019.08.04 - 5000 - started: execution queued 22:28:45 2019-08-04 --> executed in 4h 45m 21s, finished 03:14:07 2019-08-05
  • 2019.08.05 - 5000 - started: execution queued 23:04:50 2019-08-05 -->
  • 2019.08.06 - 5000 - started: execution queued 22:27:34 2019-08-06 --> executed in 5h 21m 21s, finished 03:48:55 2019-08-07
  • 2019.08.07 - 5000 - started: execution queued 00:11:32 2019-08-08 --> executed in 4h 51m 22s, finished 05:02:54 2019-08-08
  • 2019.08.08 - 5000 - started: execution queued 00:00:00 2019-08-09 --> executed in 4h 54m 50s, finished 03:04:21 2019-08-10
  • 2019.08.10 - 3819 - started: execution queued 22:09:20 2019-08-10 --> finished 02:52:51.48118 2019-08-11
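As a quick sanity check on these runs, elapsed time divided by article count should land near the 3-second rate limit set above. A minimal sketch, with durations and counts transcribed from a few of the completed runs in the list:

```python
import datetime

# ( label, elapsed time, article count ) for completed runs logged above.
runs = [
    ( "2019.07.31", datetime.timedelta( hours = 3, minutes = 47, seconds = 55 ), 5000 ),
    ( "2019.08.04", datetime.timedelta( hours = 4, minutes = 45, seconds = 21 ), 5000 ),
    ( "2019.08.06", datetime.timedelta( hours = 5, minutes = 21, seconds = 21 ), 5000 ),
]

for label, elapsed, article_count in runs:

    # average wall-clock seconds spent per article in this run.
    seconds_per_article = elapsed.total_seconds() / article_count
    print( "{}: {:.2f} seconds per article".format( label, seconds_per_article ) )

#-- END loop over runs. --#
```

A per-article time well above 3 seconds suggests most of the elapsed time is API latency rather than throttling; treat this as a rough check, not a precise measurement.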

Optional Validation


In [11]:
# get automated coder
automated_coder_user = ArticleCoder.get_automated_coding_user()

print( "{} - Loaded automated user: {}, id = {}".format( datetime.datetime.now(), automated_coder_user, automated_coder_user.id ) )


2019-08-11 14:17:18.589179 - Loaded automated user: automated, id = 2

Validate success publications

Loop over all successful records and verify:

  • that they have the OpenCalais coded-by-me tag (OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME).
  • that they have an ArticleData record for the automated coding user.
  • that the coded data isn't all just 0 sources; perhaps collect and average source and subject counts.

In [12]:
# declare variables
success_count = -1
success_list = None
article_id = None
has_coded_tag = None
has_coded_tag_counter = None
has_article_data_counter = None
article_instance = None

# declare variables - tag validation
tag_name_list = None
coded_by_tag_name = None
has_coded_by_tag = None

# declare variables - ArticleData validation
article_id_to_data_map = None
article_data_qs = None
article_data_count = None
article_data_instance = None
article_data_id = None
automated_coder_type = None
article_data_map = None
article_author_qs = None
author_count = None
article_subject_qs = None
subject_qs = None
subject_count = None
source_qs = None
source_count = None
has_data_count = None
has_people_count = None
has_subjects_count = None
has_sources_count = None
article_counter = None
start_time = None
previous_time = None
current_time = None
time_since_start = None
time_since_previous = None

# validation

# init
coded_by_tag_name = OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME
#automated_coder_user = ArticleCoder.get_automated_coding_user()
automated_coder_type = OpenCalaisV2ArticleCoder.CONFIG_APPLICATION
article_id_to_data_map = {}

# get success count
success_count = my_article_coding.get_success_count()
log_message = "\n\n====> Count of articles successfully processed: {}".format( success_count )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )

# if successes, list out IDs.
if ( success_count > 0 ):

    # there were successes.
    success_list = my_article_coding.get_success_list()
    #print( "- list of successfully processed articles: " + str( success_list ) )
    
    # loop over success articles
    article_counter = 0
    has_coded_tag_counter = 0
    has_article_data_counter = 0
    has_data_count = 0
    has_people_count = 0
    has_subjects_count = 0
    has_sources_count = 0
    start_time = datetime.datetime.now()
    current_time = start_time
    for article_id in success_list:
        
        article_counter += 1
        
        # load article
        article_instance = Article.objects.get( pk = article_id )
        
        # get tag name list
        tag_name_list = article_instance.tags.names()
        
        # is coded-by tag name present?
        if ( coded_by_tag_name in tag_name_list ):
            
            # it is there, as it should be.
            has_coded_by_tag = True
            has_coded_tag_counter += 1
            
        else:
            
            # not there.  Error.
            has_coded_by_tag = False
            log_message = "ERROR in article {}: coded-by tag ( {} ) not in tag list: {}".format( article_id, coded_by_tag_name, tag_name_list )
            my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
        
        #-- END check for coded-by tag name in tag list. --#
        
        # is there an ArticleData instance by automated coder for OpenCalais V.2?
        article_data_qs = article_instance.article_data_set.filter( coder = automated_coder_user )
        article_data_qs = article_data_qs.filter( coder_type = automated_coder_type )
        article_data_count = article_data_qs.count()
        if ( article_data_count == 1 ):
            
            # got one.  Increment counter.
            has_article_data_counter += 1
            
            # TODO - check how many sources, subjects.
            article_data_instance = article_data_qs.get()
            article_data_id = article_data_instance.id
            
            # create article data map
            article_data_map = {}
            article_data_map[ "article_id" ] = article_id
            article_data_map[ "article_instance" ] = article_instance
            article_data_map[ "article_data_instance" ] = article_data_instance
            article_data_map[ "article_data_id" ] = article_data_id
            
            # get count of authors
            article_author_qs = article_data_instance.article_author_set.all()
            author_count = article_author_qs.count()
            article_data_map[ "author_count" ] = author_count
            
            # get count of subjects
            article_subject_qs = article_data_instance.article_subject_set.all()
            article_subject_total_count = article_subject_qs.count()
            article_data_map[ "article_subject_total_count" ] = article_subject_total_count
            if ( article_subject_total_count > 0 ):
                
                has_people_count += 1
                
            #-- END check to see if any people found at all --#
            
            # just subjects
            subject_qs = article_subject_qs.filter( subject_type = Article_Subject.SUBJECT_TYPE_MENTIONED )
            subject_count = subject_qs.count()
            article_data_map[ "subject_count" ] = subject_count
            if ( subject_count > 0 ):
                
                has_subjects_count += 1
                
            #-- END check to see if any subjects found --#
            
            # get count of sources
            source_qs = article_subject_qs.filter( subject_type = Article_Subject.SUBJECT_TYPE_QUOTED )
            source_count = source_qs.count()
            article_data_map[ "source_count" ] = source_count
            if ( source_count > 0 ):
                
                has_sources_count += 1
                
            #-- END check to see if any sources found --#
            
            # store information for article.
            article_id_to_data_map[ article_id ] = article_data_map
            
            if ( ( author_count == 0 ) and ( article_subject_total_count == 0 ) ):
                
                # get current time and time elapsed since start
                log_message = "No authors or sources in article {}".format( article_id )
                my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
                
            else:
                
                # increment populated data count
                has_data_count += 1
                
            #-- END sanity check for empty data (won't be zero, shouldn't be many) --#
            
        elif ( article_data_count > 1 ):
            
            # more than one?
            log_message = "ERROR in article {}: more than one ArticleData instance ( {} ) for automated coder ( {} ), coder type: {}.".format( article_id, article_data_count, automated_coder_user, automated_coder_type )
            my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
            
        else:
            
            # error - no ArticleData.
            log_message = "ERROR in article {}: no ArticleData instances for automated coder ( {} ), coder type: {}.".format( article_id, automated_coder_user, automated_coder_type )
            my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
            
        #-- END check to see if ArticleData by automated coder, Open Calais v.2 --#
        
        # progress output
        if ( ( article_counter % 100 ) == 0 ):
            
            log_message = "----> article counter: {}".format( article_counter )
            my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
            
            # get current time and time elapsed since start
            previous_time = current_time
            current_time = datetime.datetime.now()
            time_since_start = current_time - start_time
            time_since_previous = current_time - previous_time
            log_message = "         @ {} - time since previous: {}; time since start: {}".format( current_time, time_since_previous, time_since_start )
            my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )

        #-- END progress output. --#
        
    #-- END loop over IDs of successfully processed articles. --#

#-- END check to see if successes. --#
        
log_message = "- Tagged article count: {}".format( has_coded_tag_counter )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
log_message = "- Correct ArticleData count: {}".format( has_article_data_counter )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
log_message = "- Has data count: {}".format( has_data_count )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
log_message = "- Has people count: {}".format( has_people_count )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
log_message = "- Has subjects count: {}".format( has_subjects_count )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
log_message = "- Has sources count: {}".format( has_sources_count )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )



====> Count of articles successfully processed: 46
- Tagged article count: 46
- Correct ArticleData count: 46
- Has data count: 46
- Has people count: 45
- Has subjects count: 30
- Has sources count: 45
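The success-validation checklist above suggests collecting and averaging source and subject counts, but the cell only collects them in article_id_to_data_map without aggregating. A minimal sketch of that step, using the same per-article map shape (the sample data here is hypothetical; in the notebook you would pass article_id_to_data_map itself):

```python
def average_counts( id_to_data_map ):

    # compute ( average subject count, average source count ) across all
    #     articles in a map of article ID to per-article data map.
    total = len( id_to_data_map )
    if ( total == 0 ):
        return ( 0.0, 0.0 )
    #-- END empty-map check --#

    avg_subjects = sum( m[ "subject_count" ] for m in id_to_data_map.values() ) / float( total )
    avg_sources = sum( m[ "source_count" ] for m in id_to_data_map.values() ) / float( total )
    return ( avg_subjects, avg_sources )

#-- END function average_counts() --#

# hypothetical sample data, shaped like article_id_to_data_map above.
sample_map = {
    101: { "subject_count": 2, "source_count": 1 },
    102: { "subject_count": 4, "source_count": 3 },
}
print( average_counts( sample_map ) )  # ( 3.0, 2.0 )
```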

Validate error publications

Loop over all error records and verify:

  • that they do not have the OpenCalais coded-by-me tag (OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME).
  • check on the status of their ArticleData. Do they have any? If so, what to do?

In [13]:
# declare variables
got_errors = None
error_dictionary = None
error_count = None
error_article_id = None
error_status_list = None
error_status_counter = None
article_instance = None
tag_name_list = None
coded_by_tag_name = None
has_coded_by_tag = None

# declare variables - ArticleData validation
error_article_id_to_data_map = None
article_data_qs = None
article_data_count = None
article_data_instance = None
article_data_id = None
automated_coder_type = None
article_data_map = None
article_author_qs = None
author_count = None
article_subject_qs = None
subject_qs = None
subject_count = None
source_qs = None
source_count = None
has_data_count = None
has_people_count = None
has_subjects_count = None
has_sources_count = None

# init
coded_by_tag_name = OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME
#automated_coder_user = ArticleCoder.get_automated_coding_user()
automated_coder_type = OpenCalaisV2ArticleCoder.CONFIG_APPLICATION
error_article_id_to_data_map = {}

# got errors?
got_errors = my_article_coding.has_errors()
if ( got_errors == True ):

    # get error dictionary
    error_dictionary = my_article_coding.get_error_dictionary()

    # get error count
    error_count = len( error_dictionary )
    log_message = "\n\n====> Count of articles with errors: {}".format( error_count )
    my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )

    # loop...
    has_coded_tag_counter = 0
    has_article_data_counter = 0
    has_data_count = 0
    has_people_count = 0
    has_subjects_count = 0
    has_sources_count = 0
    for error_article_id, error_status_list in six.iteritems( error_dictionary ):

        log_message = "\nError article ID: {}".format( error_article_id )
        my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )

        # output errors for this article.
        log_message = "- errors for article ID {}:".format( error_article_id )
        my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )

        # loop over status messages.
        error_status_counter = 0
        for error_status in error_status_list:

            # increment status
            error_status_counter += 1

            # print status
            log_message = "----> status #{}: {}".format( error_status_counter, error_status )
            my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
            
        #-- END loop over status messages. --#

        # load article
        article_instance = Article.objects.get( pk = error_article_id )
        
        # get tag name list
        tag_name_list = article_instance.tags.names()
        
        # is coded-by tag name present?
        if ( coded_by_tag_name in tag_name_list ):
            
            # tag is present - unexpected for an article that errored.
            has_coded_by_tag = True
            has_coded_tag_counter += 1
            
        else:
            
            # not there, as expected for an article that errored.
            has_coded_by_tag = False
            #print( "ERROR in article {}: coded-by tag ( {} ) not in tag list: {}".format( error_article_id, coded_by_tag_name, tag_name_list ) )
        
        #-- END check for coded-by tag name in tag list. --#
        
        # is there an ArticleData instance by automated coder for OpenCalais V.2?
        article_data_qs = article_instance.article_data_set.filter( coder = automated_coder_user )
        article_data_qs = article_data_qs.filter( coder_type = automated_coder_type )
        article_data_count = article_data_qs.count()
        if ( article_data_count == 1 ):
            
            # got one.  Increment counter.
            has_article_data_counter += 1
            
            # TODO - check how many sources, subjects.
            article_data_instance = article_data_qs.get()
            article_data_id = article_data_instance.id
            
            # create article data map
            article_data_map = {}
            article_data_map[ "article_id" ] = error_article_id
            article_data_map[ "article_instance" ] = article_instance
            article_data_map[ "article_data_instance" ] = article_data_instance
            article_data_map[ "article_data_id" ] = article_data_id
            
            # get count of authors
            article_author_qs = article_data_instance.article_author_set.all()
            author_count = article_author_qs.count()
            article_data_map[ "author_count" ] = author_count
            
            # get count of subjects
            article_subject_qs = article_data_instance.article_subject_set.all()
            article_subject_total_count = article_subject_qs.count()
            article_data_map[ "article_subject_total_count" ] = article_subject_total_count
            if ( article_subject_total_count > 0 ):
                
                has_people_count += 1
                
            #-- END check to see if any people found at all --#
            
            # just subjects
            subject_qs = article_subject_qs.filter( subject_type = Article_Subject.SUBJECT_TYPE_MENTIONED )
            subject_count = subject_qs.count()
            article_data_map[ "subject_count" ] = subject_count
            if ( subject_count > 0 ):
                
                has_subjects_count += 1
                
            #-- END check to see if any subjects found --#
            
            # get count of sources
            source_qs = article_subject_qs.filter( subject_type = Article_Subject.SUBJECT_TYPE_QUOTED )
            source_count = source_qs.count()
            article_data_map[ "source_count" ] = source_count
            if ( source_count > 0 ):
                
                has_sources_count += 1
                
            #-- END check to see if any sources found --#
            
            # store information for article.
            error_article_id_to_data_map[ error_article_id ] = article_data_map
            
            if ( ( author_count == 0 ) and ( article_subject_total_count == 0 ) ):
                
                pass
                #print( "- No authors or sources in article {}".format( error_article_id ) )
                
            else:
                
                # increment populated data count
                has_data_count += 1
                log_message = "- Found data in article {}: person = {}; subject = {}; source = {}".format( error_article_id, article_subject_total_count, subject_count, source_count )
                my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
                
            #-- END sanity check for empty data (won't be zero, shouldn't be many) --#
            
        elif ( article_data_count > 1 ):
            
            # more than one?
            log_message = "ERROR in article {}: more than one ArticleData instance ( {} ) for automated coder ( {} ), coder type: {}.".format( error_article_id, article_data_count, automated_coder_user, automated_coder_type )
            my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
            
        else:
            
            # no ArticleData.
            pass
            
        #-- END check to see if ArticleData by automated coder, Open Calais v.2 --#

    #-- END loop over articles. --#

    log_message = "- Tagged article count: {}".format( has_coded_tag_counter )
    my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
    log_message = "- Correct ArticleData count: {}".format( has_article_data_counter )
    my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
    log_message = "- Has data count: {}".format( has_data_count )
    my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
    log_message = "- Has people count: {}".format( has_people_count )
    my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
    log_message = "- Has subjects count: {}".format( has_subjects_count )
    my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
    log_message = "- Has sources count: {}".format( has_sources_count )
    my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
    
else:

    log_message = "NO ERRORS!  YAY!"
    my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
    
#-- END check to see if errors --#


NO ERRORS!  YAY!

NOTE: It looks like articles where an OpenCalais network error occurred do not get the coded-by tag applied, so they remain in the pool to be re-coded in subsequent runs.
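The re-coding behavior described in this note can be illustrated with plain sets (hypothetical article IDs; in the real query, the tags_not_in_list_IN filter on the coded-OpenCalaisV2ArticleCoder tag does the exclusion):

```python
# hypothetical article IDs matching the local_hard_news tag.
local_hard_news_ids = { 101, 102, 103, 104, 105 }

# articles that coded successfully receive the coded-by tag...
coded_tag_ids = { 101, 102, 104 }

# ...so articles that hit a network error (and thus have no tag) stay in
#     the pool and are picked up again by tags_not_in_list_IN next run.
next_run_pool = local_hard_news_ids - coded_tag_ids
print( next_run_pool )  # contains 103 and 105
```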


In [14]:
# get list of error IDs from map.
if ( error_dictionary is not None ):

    error_article_id_list = list( six.viewkeys( error_dictionary ) )
    log_message = "IDs of articles with errors: {}".format( error_article_id_list )
    my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
    
else:

    log_message = "STILL NO ERRORS!  YAY!"
    my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
    
#-- END check to see if None --#


STILL NO ERRORS!  YAY!

TODO

TODO:

  • make sure that I am including author-to-author based on shared byline (different tie type).
  • figure out the naive date-time error in coding.
  • test change to rate limiting values being in static variables in OpenCalaisv.2 coder.
  • start loading data from XML
  • move data from Article_Data into context.
  • make the network data creator work against context, then generalize it for tie and node types.
  • think about how we specify which class to use for author strings - it needs to be spec'ed to an interface, not just a newsbank-specific one, so the abstraction should live higher up (in shared?).

DONE:

  • // Save log of coding first 4990 of next round of data.
  • // for next round of coding, sort on publication date, descending, so we fill in the year before and after the layoffs first.
  • // adjust django logging to output DEBUG, then test Article.filter_articles() to see where QuerySet is evaluated (DISTINCT check?).