This notebook expands on the OpenCalais code in the file article_coding.py, also in this folder. As an example, it includes additional sections on selecting the publications you want to submit to OpenCalais. It is intended to be copied and reused.
In [1]:
debug_flag = False
In [2]:
import datetime
from django.db.models import Avg, Max, Min
import logging
import six
In [3]:
%pwd
Out[3]:
In [ ]:
# current working folder
current_working_folder = "/home/jonathanmorgan/work/django/research/work/phd_work/analysis"
current_datetime = datetime.datetime.now()
current_date_string = current_datetime.strftime( "%Y-%m-%d-%H-%M-%S" )
Configure logging for this notebook's kernel. (If you do not run this cell, you'll get the django application's logging configuration.)
In [4]:
# build file name
logging_file_name = "{}/article_coding-{}.log.txt".format( current_working_folder, current_date_string )
# set up logging.
logging.basicConfig(
level = logging.DEBUG,
format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
filename = logging_file_name,
filemode = 'w' # set to 'a' if you want to append, rather than overwrite each time.
)
If you are using a virtualenv, you need to get it activated somehow inside this notebook. One option is to run ../dev/wsgi.py in this notebook, to configure the Python environment manually as if you had activated the sourcenet virtualenv. To do this, you'd make a code cell that contains:
%run ../dev/wsgi.py
This is sketchy, however, because of the changes it makes to your Python environment within the context of whatever your current kernel is. I'd worry about collisions with the actual Python 3 kernel. Better, you can install your virtualenv as a separate kernel. Steps:
activate your virtualenv:
workon research
in your virtualenv, install the package ipykernel:
pip install ipykernel
use the ipykernel python program to install the current environment as a kernel:
python -m ipykernel install --user --name <env_name> --display-name "<display_name>"
example:
python -m ipykernel install --user --name sourcenet --display-name "research (Python 3)"
More details: http://ipython.readthedocs.io/en/stable/install/kernel_install.html
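Once the notebook is running in the new kernel, a quick standard-library check confirms which interpreter is active (the exact paths shown will depend on where your virtualenv lives):

```python
import sys

# if the kernel was installed from the virtualenv, both of these
# should point inside the virtualenv's folder, not the system python.
print( sys.executable )
print( sys.prefix )
```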
First, initialize my dev django project, so I can run code in this notebook that references my django models and can talk to the database using my project's settings.
In [5]:
# init django
django_init_folder = "/home/jonathanmorgan/work/django/research/work/phd_work"
django_init_path = "django_init.py"
if( ( django_init_folder is not None ) and ( django_init_folder != "" ) ):
# add folder to front of path.
django_init_path = "{}/{}".format( django_init_folder, django_init_path )
#-- END check to see if django_init folder. --#
In [6]:
%run $django_init_path
In [7]:
# context_text imports
from context_text.article_coding.article_coding import ArticleCoder
from context_text.article_coding.article_coding import ArticleCoding
from context_text.article_coding.open_calais_v2.open_calais_v2_article_coder import OpenCalaisV2ArticleCoder
from context_text.collectors.newsbank.newspapers.GRPB import GRPB
from context_text.collectors.newsbank.newspapers.DTNB import DTNB
from context_text.models import Article
from context_text.models import Article_Subject
from context_text.models import Newspaper
from context_text.shared.context_text_base import ContextTextBase
Create a LoggingHelper instance to use to log debug and also print at the same time.
Preconditions: Must be run after Django is initialized, since python_utilities is in the django path.
In [8]:
# python_utilities
from python_utilities.logging.logging_helper import LoggingHelper
# init
my_logging_helper = LoggingHelper()
my_logging_helper.set_logger_name( "newsbank-article_coding" )
log_message = None
Tag all locally implemented hard news articles in the database, and all that have already been coded using OpenCalais V2. Then work through using OpenCalais to code all local hard news that hasn't already been coded, starting with articles proximal to the coding sample for the methods paper.
More precisely, find all articles that have Article_Data coded by the automated coder with type "OpenCalais_REST_API_v2" and tag the articles as "coded-open_calais_v2" or something like that.
Then, for articles without that tag, use our criteria for local hard news to filter and tag publications in the year before and after the month used to evaluate the automated coder, in both the Grand Rapids Press and the Detroit News, so I can look at longer time frames, then code all articles currently in the database.
Eventually, then, we'll code and examine before and after layoffs.
In [9]:
# look for publications that have article data:
# - coded by automated coder
# - with coder type of "OpenCalais_REST_API_v2"
# get automated coder
automated_coder_user = ArticleCoder.get_automated_coding_user()
print( "{} - Loaded automated user: {}, id = {}".format( datetime.datetime.now(), automated_coder_user, automated_coder_user.id ) )
In [ ]:
# try aggregates
article_qs = Article.objects.all()
pub_date_info = article_qs.aggregate( Max( 'pub_date' ), Min( 'pub_date' ) )
print( pub_date_info )
In [ ]:
# find articles with Article_Data created by the automated user...
article_qs = Article.objects.filter( article_data__coder = automated_coder_user )
# ...and specifically coded using OpenCalais V2...
article_qs = article_qs.filter( article_data__coder_type = OpenCalaisV2ArticleCoder.CONFIG_APPLICATION )
# ...and finally, we just want the distinct articles by ID.
article_qs = article_qs.order_by( "id" ).distinct( "id" )
# count?
article_count = article_qs.count()
print( "Found {} articles".format( article_count ) )
Removing the duplicates introduced by joining with Article_Data yields 579 articles that were coded by the automated coder.
Tag all the coded articles with OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME.
In [ ]:
# declare variables
current_article = None
tag_name_list = None
article_count = None
untagged_count = None
already_tagged_count = None
newly_tagged_count = None
count_sum = None
do_add_tag = False
# init
do_add_tag = True
# get article_count
article_count = article_qs.count()
# loop over articles.
untagged_count = 0
already_tagged_count = 0
newly_tagged_count = 0
for current_article in article_qs:
# get list of tags for this publication
tag_name_list = current_article.tags.names()
# is the coded tag in the list?
if ( OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME not in tag_name_list ):
# are we adding tag?
if ( do_add_tag == True ):
# add tag.
current_article.tags.add( OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME )
newly_tagged_count += 1
else:
# for now, increment untagged count
untagged_count += 1
#-- END check to see if we are adding tag. --#
else:
# already tagged
already_tagged_count += 1
#-- END check to see if coded tag is set --#
#-- END loop over articles. --#
print( "Article counts:" )
print( "- total articles: {}".format( article_count ) )
print( "- untagged articles: {}".format( untagged_count ) )
print( "- already tagged: {}".format( already_tagged_count ) )
print( "- newly tagged: {}".format( newly_tagged_count ) )
count_sum = untagged_count + already_tagged_count + newly_tagged_count
print( "- count sum: {}".format( count_sum ) )
In [ ]:
tags_in_list = []
tags_in_list.append( OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME )
article_qs = Article.objects.filter( tags__name__in = tags_in_list )
print( "Matching article count: {}".format( article_qs.count() ) )
In [ ]:
# profile these publications
min_pub_date = None
max_pub_date = None
current_pub_date = None
pub_date_count = None
date_to_count_map = {}
date_to_articles_map = {}
pub_date_article_dict = None
# try aggregates
pub_date_info = article_qs.aggregate( Max( 'pub_date' ), Min( 'pub_date' ) )
print( pub_date_info )
# counts of pubs by date
for current_article in article_qs:
# get pub_date
current_pub_date = current_article.pub_date
current_article_id = current_article.id
# get count, increment, and store.
pub_date_count = date_to_count_map.get( current_pub_date, 0 )
pub_date_count += 1
date_to_count_map[ current_pub_date ] = pub_date_count
# also, store up ids and instances
# get dict of article ids to article instances for date
pub_date_article_dict = date_to_articles_map.get( current_pub_date, {} )
# article already there?
if ( current_article_id not in pub_date_article_dict ):
# no - add it.
pub_date_article_dict[ current_article_id ] = current_article
#-- END check to see if article already there. --#
# put dict back.
date_to_articles_map[ current_pub_date ] = pub_date_article_dict
#-- END loop over articles. --#
# output dates and counts.
# get list of keys from map
keys_list = list( six.viewkeys( date_to_count_map ) )
keys_list.sort()
for current_pub_date in keys_list:
# get count
pub_date_count = date_to_count_map.get( current_pub_date, 0 )
print( "- {} ( {} ) count: {}".format( current_pub_date, type( current_pub_date ), pub_date_count ) )
#-- END loop over dates --#
In [ ]:
# look at the 2010-07-31 date
pub_date = datetime.datetime.strptime( "2010-07-31", "%Y-%m-%d" ).date()
articles_for_date = date_to_articles_map.get( pub_date, {} )
print( articles_for_date )
# get the article and look at its tags.
article_instance = articles_for_date.get( 6065 )
print( article_instance.tags.all() )
# loop over associated Article_Data instances.
for article_data in article_instance.article_data_set.all():
print( article_data )
#-- END loop over associated Article_Data instances --#
Definitions of local hard news by in-house implementors for the Grand Rapids Press and the Detroit News follow. For each, tag all matching articles in the database as "local_hard_news".
TODO:
make class for GRPB at NewsBank.
refine "local news" and "locally created" regular expressions for Grand Rapids Press based on contents of author_string and author_affiliation.
DONE:
abstract out shared stuff from GRPB.py and DTNB.py into abstract parent class context_text/collectors/newsbank/newspapers/newsbank_newspaper.py
make class for GRPB at NewsBank.
Grand Rapids Press local hard news:
context_text/examples/articles/articles-GRP-local_news.py
local hard news sections (stored in Article.GRP_NEWS_SECTION_NAME_LIST):
in-house implementor (based on byline patterns, stored in sourcenet.models.Article.Q_GRP_IN_HOUSE_AUTHOR):
Byline ends in "/ THE GRAND RAPIDS PRESS", ignore case.
Q( author_varchar__iregex = r'.* */ *THE GRAND RAPIDS PRESS$' )
Byline ends in "/ PRESS * EDITOR", ignore case.
Q( author_varchar__iregex = r'.* */ *PRESS .* EDITOR$' )
Byline ends in "/ GRAND RAPIDS PRESS * BUREAU", ignore case.
Q( author_varchar__iregex = r'.* */ *GRAND RAPIDS PRESS .* BUREAU$' )
Byline ends in "/ SPECIAL TO THE PRESS", ignore case.
Q( author_varchar__iregex = r'.* */ *SPECIAL TO THE PRESS$' )
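These Q patterns can be spot-checked outside Django with Python's re module (Django's __iregex lookup corresponds to a case-insensitive regex match; the sample bylines below are made-up illustrations, not values from the database):

```python
import re

# byline patterns from the Q() objects above.
GRP_IN_HOUSE_PATTERNS = [
    r'.* */ *THE GRAND RAPIDS PRESS$',
    r'.* */ *PRESS .* EDITOR$',
    r'.* */ *GRAND RAPIDS PRESS .* BUREAU$',
    r'.* */ *SPECIAL TO THE PRESS$',
]

def is_grp_in_house( byline ):
    '''Return True if byline matches any in-house pattern, ignoring case.'''
    return any( re.match( p, byline, re.IGNORECASE ) for p in GRP_IN_HOUSE_PATTERNS )

print( is_grp_in_house( "Jane Doe / The Grand Rapids Press" ) )  # True
print( is_grp_in_house( "Bob Lee / Press Business Editor" ) )    # True
print( is_grp_in_house( "Wire reports" ) )                       # False
```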
can also exclude columns (I will not):
grp_article_qs = grp_article_qs.exclude( index_terms__icontains = "Column" )
Need to work to further refine this.
Looking at affiliation strings:
SELECT author_affiliation, COUNT( author_affiliation ) as affiliation_count
FROM context_text_article
WHERE newspaper_id = 1
GROUP BY author_affiliation
ORDER BY COUNT( author_affiliation ) DESC;
And at author strings for collective bylines:
SELECT author_string, COUNT( author_string ) as author_count
FROM context_text_article
WHERE newspaper_id = 1
GROUP BY author_string
ORDER BY COUNT( author_string ) DESC
LIMIT 10;
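Away from the database, the same GROUP BY / ORDER BY count DESC / LIMIT pattern can be sketched in plain Python with collections.Counter (the author strings below are hypothetical stand-ins for values pulled from context_text_article):

```python
from collections import Counter

# hypothetical author_string values, standing in for a database column.
author_strings = [
    "Jane Doe / The Grand Rapids Press",
    "Jane Doe / The Grand Rapids Press",
    "John Smith / Press Business Editor",
    "Wire reports",
]

# Counter.most_common() mirrors GROUP BY + ORDER BY count DESC + LIMIT.
author_counts = Counter( author_strings )
for author_string, author_count in author_counts.most_common( 10 ):
    print( "{} - {}".format( author_string, author_count ) )
```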
In [ ]:
# filter queryset to just locally created Grand Rapids Press (GRP) articles.
# imports
from context_text.models import Article
from context_text.models import Newspaper
from context_text.shared.context_text_base import ContextTextBase
from context_text.collectors.newsbank.newspapers.GRPB import GRPB
# declare variables - Grand Rapids Press
do_apply_tag = False
tag_to_apply = None
grp_local_news_sections = []
grp_newspaper = None
grp_article_qs = None
article_count = -1
# declare variables - filtering
include_opinion_columns = True
tags_in_list = []
tags_not_in_list = []
filter_out_prelim_tags = False
random_count = -1
# declare variables - make list of article IDs from QS.
article_id_list = []
article_counter = -1
current_article = None
article_tag_name_list = None
article_update_counter = -1
# ==> configure
# configure - size of random sample we want
#random_count = 60
# configure - also, apply tag?
do_apply_tag = True
tag_to_apply = ContextTextBase.TAG_LOCAL_HARD_NEWS
# set up "local, regional and state news" sections
grp_local_news_sections = GRPB.LOCAL_NEWS_SECTION_NAME_LIST
# Grand Rapids Press
# get newspaper instance for GRP.
grp_newspaper = Newspaper.objects.get( id = GRPB.NEWSPAPER_ID )
# start with all articles
#grp_article_qs = Article.objects.all()
# ==> filter to newspaper, local news section list, and in-house reporters.
# ----> manually
# now, need to find local news articles to test on.
#grp_article_qs = grp_article_qs.filter( newspaper = grp_newspaper )
# only the locally implemented sections
#grp_article_qs = grp_article_qs.filter( section__in = grp_local_news_sections )
# and, with an in-house author
#grp_article_qs = grp_article_qs.filter( Article.Q_GRP_IN_HOUSE_AUTHOR )
#print( "manual filter count: {}".format( grp_article_qs.count() ) )
# ----> using Article.filter_articles()
grp_article_qs = Article.filter_articles( qs_IN = grp_article_qs,
newspaper = grp_newspaper,
section_name_list = grp_local_news_sections,
custom_article_q = GRPB.Q_IN_HOUSE_AUTHOR )
print( "Article.filter_articles count: {}".format( grp_article_qs.count() ) )
# and include opinion columns?
if ( include_opinion_columns == False ):
# do not include columns
grp_article_qs = grp_article_qs.exclude( index_terms__icontains = "Column" )
#-- END check to see if we include columns. --#
'''
# filter to newspaper, section list, and in-house reporters.
grp_article_qs = Article.filter_articles( qs_IN = grp_article_qs,
start_date = "2009-12-01",
end_date = "2009-12-31",
newspaper = grp_newspaper,
section_name_list = grp_local_news_sections,
custom_article_q = Article.Q_GRP_IN_HOUSE_AUTHOR )
'''
# how many is that?
article_count = grp_article_qs.count()
print( "Article count before filtering on tags: " + str( article_count ) )
# ==> tags
# tags to exclude
tags_not_in_list = []
# Example: prelim-related tags
#tags_not_in_list.append( "prelim_reliability" )
#tags_not_in_list.append( "prelim_network" )
#tags_not_in_list.append( "minnesota1-20160328" )
#tags_not_in_list.append( "minnesota2-20160328" )
# for later - exclude articles already coded.
#tags_not_in_list.append( OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME )
# exclude any already tagged with tag_to_apply
tags_not_in_list.append( tag_to_apply )
if ( ( tags_not_in_list is not None ) and ( len( tags_not_in_list ) > 0 ) ):
# exclude those in a list
print( "filtering out articles with tags: " + str( tags_not_in_list ) )
grp_article_qs = grp_article_qs.exclude( tags__name__in = tags_not_in_list )
#-- END check to see if we have a specific list of tags we want to exclude --#
# include only those with certain tags.
tags_in_list = []
# Examples
# Examples: prelim-related tags
#tags_in_list.append( "prelim_unit_test_001" )
#tags_in_list.append( "prelim_unit_test_002" )
#tags_in_list.append( "prelim_unit_test_003" )
#tags_in_list.append( "prelim_unit_test_004" )
#tags_in_list.append( "prelim_unit_test_005" )
#tags_in_list.append( "prelim_unit_test_006" )
#tags_in_list.append( "prelim_unit_test_007" )
# Example: grp_month
#tags_in_list.append( "grp_month" )
if ( ( tags_in_list is not None ) and ( len( tags_in_list ) > 0 ) ):
# filter
print( "filtering to just articles with tags: " + str( tags_in_list ) )
grp_article_qs = grp_article_qs.filter( tags__name__in = tags_in_list )
#-- END check to see if we have a specific list of tags we want to include --#
# filter out "*prelim*" tags?
#filter_out_prelim_tags = True
if ( filter_out_prelim_tags == True ):
# filter out all articles with any tag whose name contains "prelim".
print( "filtering out articles with tags that contain \"prelim\"" )
grp_article_qs = grp_article_qs.exclude( tags__name__icontains = "prelim" )
#-- END check to see if we filter out "prelim_*" tags --#
# how many is that?
article_count = grp_article_qs.count()
print( "Article count after tag filtering: " + str( article_count ) )
# do we want a random sample?
if ( random_count > 0 ):
# to get random, order them by "?", then use slicing to retrieve requested
# number.
grp_article_qs = grp_article_qs.order_by( "?" )[ : random_count ]
#-- END check to see if we want random sample --#
# this is a nice algorithm, also:
# - http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/
# make ID list, tag articles if configured to.
article_id_list = []
article_counter = 0
article_update_counter = 0
for current_article in grp_article_qs:
# increment article_counter
article_counter += 1
# add IDs to article_id_list
article_id_list.append( str( current_article.id ) )
# apply a tag while we are at it?
if ( ( do_apply_tag == True ) and ( tag_to_apply is not None ) and ( tag_to_apply != "" ) ):
# yes, please. Tag already present?
article_tag_name_list = current_article.tags.names()
if ( tag_to_apply not in article_tag_name_list ):
# Add tag.
current_article.tags.add( tag_to_apply )
# increment counter
article_update_counter += 1
#-- END check to see if tag already present. --#
#-- END check to see if we apply tag. --#
# output the tags.
if ( debug_flag == True ):
print( "- Tags for article " + str( current_article.id ) + " : " + str( current_article.tags.all() ) )
#-- END DEBUG --#
#-- END loop over articles --#
# output the list.
print( "grp_article_qs count: {}".format( grp_article_qs.count() ) )
print( "Found " + str( article_counter ) + " articles ( " + str( article_count ) + " )." )
print( "- Updated {} articles to add tag {}.".format( article_update_counter, tag_to_apply ) )
if ( debug_flag == True ):
print( "List of " + str( len( article_id_list ) ) + " local GRP staff article IDs: " + ", ".join( article_id_list ) )
#-- END DEBUG --#
Detroit News local news:
context_text/examples/articles/articles-TDN-local_news.py
local hard news sections (stored in DTNB.LOCAL_NEWS_SECTION_NAME_LIST, imported via from context_text.collectors.newsbank.newspapers.DTNB import DTNB):
in-house implementor (based on byline patterns, stored in DTNB.Q_IN_HOUSE_AUTHOR):
Byline ends in "/ The Detroit News", ignore case.
Q( author_varchar__iregex = r'.*\s*/\s*the\s*detroit\s*news$' )
Byline ends in "Special to The Detroit News", ignore case.
Q( author_varchar__iregex = r'.*\s*/\s*special\s*to\s*the\s*detroit\s*news$' )
Byline ends in "Detroit News * Bureau", ignore case.
Q( author_varchar__iregex = r'.*\s*/\s*detroit\s*news\s*.*\s*bureau$' )
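As with the Grand Rapids Press patterns, these can be spot-checked standalone with Python's re module (__iregex corresponds to case-insensitive matching; the sample bylines are hypothetical):

```python
import re

# byline patterns from DTNB.Q_IN_HOUSE_AUTHOR, per the list above.
DTN_IN_HOUSE_PATTERNS = [
    r'.*\s*/\s*the\s*detroit\s*news$',
    r'.*\s*/\s*special\s*to\s*the\s*detroit\s*news$',
    r'.*\s*/\s*detroit\s*news\s*.*\s*bureau$',
]

def is_tdn_in_house( byline ):
    '''Return True if byline matches any in-house pattern, ignoring case.'''
    return any( re.match( p, byline, re.IGNORECASE ) for p in DTN_IN_HOUSE_PATTERNS )

print( is_tdn_in_house( "Jane Doe / The Detroit News" ) )              # True
print( is_tdn_in_house( "John Smith / Detroit News Lansing Bureau" ) ) # True
print( is_tdn_in_house( "Associated Press" ) )                         # False
```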
In [ ]:
# filter queryset to just locally created Detroit News (TDN) articles.
# imports
from context_text.models import Article
from context_text.models import Newspaper
from context_text.shared.context_text_base import ContextTextBase
from context_text.collectors.newsbank.newspapers.DTNB import DTNB
# declare variables - Detroit News
do_apply_tag = False
tag_to_apply = None
tdn_local_news_sections = []
tdn_newspaper = None
tdn_article_qs = None
article_count = -1
# declare variables - filtering
include_opinion_columns = True
tags_in_list = []
tags_not_in_list = []
filter_out_prelim_tags = False
random_count = -1
# declare variables - make list of article IDs from QS.
article_id_list = []
article_counter = -1
current_article = None
# ==> configure
# configure - size of random sample we want
#random_count = 60
# configure - also, apply tag?
do_apply_tag = False
tag_to_apply = ContextTextBase.TAG_LOCAL_HARD_NEWS
# set up "local, regional and state news" sections
tdn_local_news_sections = DTNB.LOCAL_NEWS_SECTION_NAME_LIST
# Detroit News
# get newspaper instance for TDN.
tdn_newspaper = Newspaper.objects.get( id = DTNB.NEWSPAPER_ID )
# start with all articles
#tdn_article_qs = Article.objects.all()
# ==> filter to newspaper, local news section list, and in-house reporters.
# ----> manually
# now, need to find local news articles to test on.
#tdn_article_qs = tdn_article_qs.filter( newspaper = tdn_newspaper )
# only the locally implemented sections
#tdn_article_qs = tdn_article_qs.filter( section__in = tdn_local_news_sections )
# and, with an in-house author
#tdn_article_qs = tdn_article_qs.filter( DTNB.Q_IN_HOUSE_AUTHOR )
#print( "manual filter count: {}".format( tdn_article_qs.count() ) )
# ----> using Article.filter_articles()
tdn_article_qs = Article.filter_articles( qs_IN = tdn_article_qs,
newspaper = tdn_newspaper,
section_name_list = tdn_local_news_sections,
custom_article_q = DTNB.Q_IN_HOUSE_AUTHOR )
print( "Article.filter_articles count: {}".format( tdn_article_qs.count() ) )
# and include opinion columns?
if ( include_opinion_columns == False ):
# do not include columns
tdn_article_qs = tdn_article_qs.exclude( author_string__in = DTNB.COLUMNIST_NAME_LIST )
#-- END check to see if we include columns. --#
'''
# filter to newspaper, section list, and in-house reporters.
tdn_article_qs = Article.filter_articles( qs_IN = tdn_article_qs,
start_date = "2009-12-01",
end_date = "2009-12-31",
newspaper = tdn_newspaper,
section_name_list = tdn_local_news_sections,
custom_article_q = DTNB.Q_IN_HOUSE_AUTHOR )
'''
# how many is that?
article_count = tdn_article_qs.count()
print( "Article count before filtering on tags: " + str( article_count ) )
# ==> tags
# tags to exclude
#tags_not_in_list = [ "prelim_reliability", "prelim_network" ]
#tags_not_in_list = [ "minnesota1-20160328", "minnesota2-20160328", ]
# for later - exclude articles already coded.
#tags_not_in_list = [ OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME ]
tags_not_in_list = None
if ( ( tags_not_in_list is not None ) and ( len( tags_not_in_list ) > 0 ) ):
# exclude those in a list
print( "filtering out articles with tags: " + str( tags_not_in_list ) )
tdn_article_qs = tdn_article_qs.exclude( tags__name__in = tags_not_in_list )
#-- END check to see if we have a specific list of tags we want to exclude --#
# include only those with certain tags.
#tags_in_list = [ "prelim_unit_test_001", "prelim_unit_test_002", "prelim_unit_test_003", "prelim_unit_test_004", "prelim_unit_test_005", "prelim_unit_test_006", "prelim_unit_test_007" ]
#tags_in_list = [ "tdn_month", ]
tags_in_list = None
if ( ( tags_in_list is not None ) and ( len( tags_in_list ) > 0 ) ):
# filter
print( "filtering to just articles with tags: " + str( tags_in_list ) )
tdn_article_qs = tdn_article_qs.filter( tags__name__in = tags_in_list )
#-- END check to see if we have a specific list of tags we want to include --#
# filter out "*prelim*" tags?
#filter_out_prelim_tags = True
if ( filter_out_prelim_tags == True ):
# filter out all articles with any tag whose name contains "prelim".
print( "filtering out articles with tags that contain \"prelim\"" )
tdn_article_qs = tdn_article_qs.exclude( tags__name__icontains = "prelim" )
#-- END check to see if we filter out "prelim_*" tags --#
# how many is that?
article_count = tdn_article_qs.count()
print( "Article count after tag filtering: " + str( article_count ) )
# do we want a random sample?
if ( random_count > 0 ):
# to get random, order them by "?", then use slicing to retrieve requested
# number.
tdn_article_qs = tdn_article_qs.order_by( "?" )[ : random_count ]
#-- END check to see if we want random sample --#
# this is a nice algorithm, also:
# - http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/
# make ID list, tag articles if configured to.
article_id_list = []
article_counter = 0
for current_article in tdn_article_qs:
# increment article_counter
article_counter += 1
# add IDs to article_id_list
article_id_list.append( str( current_article.id ) )
# apply a tag while we are at it?
if ( ( do_apply_tag == True ) and ( tag_to_apply is not None ) and ( tag_to_apply != "" ) ):
# yes, please. Add tag.
current_article.tags.add( tag_to_apply )
#-- END check to see if we apply tag. --#
# output the tags.
if ( debug_flag == True ):
print( "- Tags for article " + str( current_article.id ) + " : " + str( current_article.tags.all() ) )
#-- END DEBUG --#
#-- END loop over articles --#
# output the list.
print( "tdn_article_qs count: {}".format( tdn_article_qs.count() ) )
print( "Found " + str( article_counter ) + " articles ( " + str( article_count ) + " )." )
if ( debug_flag == True ):
print( "List of " + str( len( article_id_list ) ) + " local TDN staff article IDs: " + ", ".join( article_id_list ) )
#-- END DEBUG --#
Retrieve just the articles that are tagged as local hard news and that are not tagged as having been coded by OpenCalais V2.
In [10]:
# declare variables
# declare variables - article filter parameters
start_pub_date = None # should be datetime instance
end_pub_date = None # should be datetime instance
tags_in_list = []
tags_not_in_list = []
paper_id_in_list = []
section_list = []
article_id_in_list = []
params = {}
# declare variables - processing
do_i_print_updates = True
my_article_coding = None
article_qs = None
article_count = -1
coding_status = ""
limit_to = -1
do_coding = True
# declare variables - results
success_count = -1
success_list = None
got_errors = False
error_count = -1
error_dictionary = None
error_article_id = -1
error_status_list = None
error_status = ""
error_status_counter = -1
# first, get a list of articles to code.
# ! Set param values.
# ==> start and end dates
#start_pub_date = "2009-12-06"
#end_pub_date = "2009-12-12"
# ==> tagged articles
# Examples:
#tag_in_list = "prelim_reliability"
#tag_in_list = "prelim_network"
#tag_in_list = "prelim_unit_test_007"
#tag_in_list = [ "prelim_reliability", "prelim_network" ]
#tag_in_list = [ "prelim_reliability_test" ] # 60 articles - Grand Rapids only.
#tag_in_list = [ "prelim_reliability_combined" ] # 87 articles, Grand Rapids and Detroit.
#tag_in_list = [ "prelim_training_001" ]
#tag_in_list = [ "grp_month" ]
# ----> include articles when these tags are present.
#tags_in_list = None
tags_in_list = []
tags_in_list.append( ContextTextBase.TAG_LOCAL_HARD_NEWS )
# ---> exclude articles when these tags are present.
#tags_not_in_list = None
tags_not_in_list = []
tags_not_in_list.append( OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME )
# ==> IDs of newspapers to include.
#paper_id_in_list = "1"
# ==> names of sections to include.
#section_list = "Lakeshore,Front Page,City and Region,Business"
# ==> just limit to specific articles by ID.
article_id_in_list = []
#article_id_in_list = [ 360962 ]
#article_id_in_list = [ 28598 ]
#article_id_in_list = [ 21653, 21756 ]
#article_id_in_list = [ 90948 ]
#article_id_in_list = [ 21627, 21609, 21579 ]
#article_id_in_list = [ 48778 ]
#article_id_in_list = [ 6065 ]
#article_id_in_list = [ 221858 ]
#article_id_in_list = [ 23804, 22630 ]
#article_id_in_list = [ 23804 ]
# debugging exception
#article_id_in_list.append( 402670 )
#article_id_in_list.append( 408735 )
# filter parameters
params[ ArticleCoding.PARAM_START_DATE ] = start_pub_date
params[ ArticleCoding.PARAM_END_DATE ] = end_pub_date
params[ ArticleCoding.PARAM_TAGS_IN_LIST ] = tags_in_list
params[ ArticleCoding.PARAM_TAGS_NOT_IN_LIST ] = tags_not_in_list
params[ ArticleCoding.PARAM_PUBLICATION_LIST ] = paper_id_in_list
params[ ArticleCoding.PARAM_SECTION_LIST ] = section_list
params[ ArticleCoding.PARAM_ARTICLE_ID_LIST ] = article_id_in_list
# set coder you want to use.
# OpenCalais REST API v.2
params[ ArticleCoding.PARAM_CODER_TYPE ] = ArticleCoding.ARTICLE_CODING_IMPL_OPEN_CALAIS_API_V2
# get instance of ArticleCoding
my_article_coding = ArticleCoding()
my_article_coding.do_print_updates = do_i_print_updates
# to adjust timing, you need to update the ArticleCoder class for your
# coder. That overrides the value set here (so we respect limits
# if they are coded into a particular coder):
my_article_coding.rate_limit_in_seconds = 3
# set params
my_article_coding.store_parameters( params )
print( "Query Parameters: {}".format( params ) )
# create query set - ArticleCoding does the filtering for you.
article_qs = my_article_coding.create_article_query_set()
print( "After my_article_coding.create_article_query_set(), count: {}".format( article_qs.count() ) )
if ( article_qs._result_cache is None ):
print( "article_qs evaluated: NO ( {} )".format( article_qs._result_cache ) )
else:
print( "article_qs evaluated: YES" )
#-- END check to see if _result_cache --#
# order by pub_date DESC, so we do most recent first.
article_qs = article_qs.order_by( "-pub_date" )
# limit for an initial test?
limit_to = 5000
# limit_to = 5
if ( ( limit_to is not None ) and ( isinstance( limit_to, int ) == True ) and ( limit_to > 0 ) ):
# yes.
article_qs = article_qs[ : limit_to ]
#-- END check to see if limit --#
# get article count
if ( isinstance( article_qs, list ) == True ):
# list - call len()
article_list = article_qs
article_count = len( article_list )
else:
# not a list - call count()
article_count = article_qs.count()
#-- END figure out how to get count --#
print( "Matching article count: " + str( article_count ) )
# Do coding?
if ( do_coding == True ):
print( "do_coding == True - it's on!" )
# yes - make sure we have at least one article:
if ( article_count > 0 ):
# invoke the code_article_data( self, query_set_IN ) method.
coding_status = my_article_coding.code_article_data( article_qs )
# output status
print( "\n\n==============================\n\nCoding status: \"" + coding_status + "\"" )
# get success count
success_count = my_article_coding.get_success_count()
print( "\n\n====> Count of articles successfully processed: " + str( success_count ) )
# if successes, list out IDs.
if ( success_count > 0 ):
# there were successes.
success_list = my_article_coding.get_success_list()
print( "- list of successfully processed articles: " + str( success_list ) )
#-- END check to see if successes. --#
# got errors?
got_errors = my_article_coding.has_errors()
if ( got_errors == True ):
# get error dictionary
error_dictionary = my_article_coding.get_error_dictionary()
# get error count
error_count = len( error_dictionary )
print( "\n\n====> Count of articles with errors: " + str( error_count ) )
# loop...
for error_article_id, error_status_list in six.iteritems( error_dictionary ):
# output errors for this article.
print( "- errors for article ID " + str( error_article_id ) + ":" )
# loop over status messages.
error_status_counter = 0
for error_status in error_status_list:
# increment status
error_status_counter += 1
# print status
print( "----> status #" + str( error_status_counter ) + ": " + error_status )
#-- END loop over status messages. --#
#-- END loop over articles. --#
#-- END check to see if errors --#
#-- END check to see if article count. --#
else:
# output matching article count.
print( "do_coding == False, so dry run" )
#-- END check to see if we do_coding --#
In [11]:
# get automated coder
automated_coder_user = ArticleCoder.get_automated_coding_user()
print( "{} - Loaded automated user: {}, id = {}".format( datetime.datetime.now(), automated_coder_user, automated_coder_user.id ) )
Loop over all successfully coded articles and verify that each:
is tagged with the coded-by tag ( OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME ).
has at least one Article_Data record created by the automated coder with coder type "OpenCalais_REST_API_v2".
In [12]:
# declare variables
success_count = -1
success_list = None
article_id = None
has_coded_tag = None
has_coded_tag_counter = None
has_article_data_counter = None
article_instance = None
# declare variables - tag validation
tag_name_list = None
coded_by_tag_name = None
has_coded_by_tag = None
# declare variables - ArticleData validation
article_id_to_data_map = None
article_data_qs = None
article_data_count = None
article_data_instance = None
article_data_id = None
automated_coder_type = None
article_data_map = None
article_author_qs = None
author_count = None
article_subject_qs = None
subject_qs = None
subject_count = None
source_qs = None
source_count = None
has_data_count = None
has_people_count = None
has_subjects_count = None
has_sources_count = None
article_counter = None
start_time = None
previous_time = None
current_time = None
time_since_start = None
time_since_previous = None
# validation
# init
coded_by_tag_name = OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME
#automated_coder_user = ArticleCoder.get_automated_coding_user()
automated_coder_type = OpenCalaisV2ArticleCoder.CONFIG_APPLICATION
article_id_to_data_map = {}
# get success count
success_count = my_article_coding.get_success_count()
log_message = "\n\n====> Count of articles successfully processed: {}".format( success_count )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
# if successes, list out IDs.
if ( success_count > 0 ):
# there were successes.
success_list = my_article_coding.get_success_list()
#print( "- list of successfully processed articles: " + str( success_list ) )
# loop over success articles
article_counter = 0
has_coded_tag_counter = 0
has_article_data_counter = 0
has_data_count = 0
has_people_count = 0
has_subjects_count = 0
has_sources_count = 0
start_time = datetime.datetime.now()
current_time = start_time
for article_id in success_list:
article_counter += 1
# load article
article_instance = Article.objects.get( pk = article_id )
# get tag name list
tag_name_list = article_instance.tags.names()
# is coded-by tag name present?
if ( coded_by_tag_name in tag_name_list ):
# it is there, as it should be.
has_coded_by_tag = True
has_coded_tag_counter += 1
else:
# not there. Error.
has_coded_by_tag = False
log_message = "ERROR in article {}: coded-by tag ( {} ) not in tag list: {}".format( article_id, coded_by_tag_name, tag_name_list )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
#-- END check for coded-by tag name in tag list. --#
# is there an ArticleData instance by automated coder for OpenCalais V.2?
article_data_qs = article_instance.article_data_set.filter( coder = automated_coder_user )
article_data_qs = article_data_qs.filter( coder_type = automated_coder_type )
article_data_count = article_data_qs.count()
if ( article_data_count == 1 ):
# got one. Increment counter.
has_article_data_counter += 1
# TODO - check how many sources, subjects.
article_data_instance = article_data_qs.get()
article_data_id = article_data_instance.id
# create article data map
article_data_map = {}
article_data_map[ "article_id" ] = article_id
article_data_map[ "article_instance" ] = article_instance
article_data_map[ "article_data_instance" ] = article_data_instance
article_data_map[ "article_data_id" ] = article_data_id
# get count of authors
article_author_qs = article_data_instance.article_author_set.all()
author_count = article_author_qs.count()
article_data_map[ "author_count" ] = author_count
# get count of subjects
article_subject_qs = article_data_instance.article_subject_set.all()
article_subject_total_count = article_subject_qs.count()
article_data_map[ "article_subject_total_count" ] = article_subject_total_count
if ( article_subject_total_count > 0 ):
has_people_count += 1
#-- END check to see if any people found at all --#
# just subjects
subject_qs = article_subject_qs.filter( subject_type = Article_Subject.SUBJECT_TYPE_MENTIONED )
subject_count = subject_qs.count()
article_data_map[ "subject_count" ] = subject_count
if ( subject_count > 0 ):
has_subjects_count += 1
#-- END check to see if any subjects found --#
# get count of sources
source_qs = article_subject_qs.filter( subject_type = Article_Subject.SUBJECT_TYPE_QUOTED )
source_count = source_qs.count()
article_data_map[ "source_count" ] = source_count
if ( source_count > 0 ):
has_sources_count += 1
#-- END check to see if any sources found --#
# store information for article.
article_id_to_data_map[ article_id ] = article_data_map
if ( ( author_count == 0 ) and ( article_subject_total_count == 0 ) ):
# get current time and time elapsed since start
log_message = "No authors or sources in article {}".format( article_id )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
else:
# increment populated data count
has_data_count += 1
#-- END sanity check for empty data (won't be zero, shouldn't be many) --#
elif ( article_data_count > 1 ):
# more than one?
log_message = "ERROR in article {}: more than one ArticleData instance ( {} ) for automated coder ( {} ), coder type: {}.".format( article_id, article_data_count, automated_coder_user, automated_coder_type )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
else:
# error - no ArticleData.
log_message = "ERROR in article {}: no ArticleData instances for automated coder ( {} ), coder type: {}.".format( article_id, automated_coder_user, automated_coder_type )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
#-- END check to see if ArticleData by automated coder, Open Calais v.2 --#
# progress output
if ( ( article_counter % 100 ) == 0 ):
log_message = "----> article counter: {}".format( article_counter )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
# get current time and time elapsed since start
previous_time = current_time
current_time = datetime.datetime.now()
time_since_start = current_time - start_time
time_since_previous = current_time - previous_time
log_message = " @ {} - time since previous: {}; time since start: {}".format( current_time, time_since_previous, time_since_start )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
#-- END progress output. --#
#-- END loop over IDs of successfully processed articles. --#
#-- END check to see if successes. --#
log_message = "- Tagged article count: {}".format( has_coded_tag_counter )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
log_message = "- Correct ArticleData count: {}".format( has_article_data_counter )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
log_message = "- Has data count: {}".format( has_data_count )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
log_message = "- Has people count: {}".format( has_people_count )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
log_message = "- Has subjects count: {}".format( has_subjects_count )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
log_message = "- Has sources count: {}".format( has_sources_count )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
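Beyond the per-article checks above, the article_id_to_data_map built in that cell can also be rolled up into overall totals. A hedged sketch of such a roll-up, run here against dummy data with the same keys the cell stores ( "author_count", "subject_count", "source_count" ):

```python
from collections import Counter

def summarize_data_map( article_id_to_data_map_IN ):

    # sum the per-article counts stored in each article_data_map.
    totals = Counter()
    for article_data_map in article_id_to_data_map_IN.values():
        totals[ "authors" ] += article_data_map.get( "author_count", 0 )
        totals[ "subjects" ] += article_data_map.get( "subject_count", 0 )
        totals[ "sources" ] += article_data_map.get( "source_count", 0 )
    #-- END loop over article data maps. --#
    return dict( totals )

#-- END function summarize_data_map() --#

# dummy data in the same shape as the map built above (IDs are illustrative).
example_map = {
    21: { "author_count": 1, "subject_count": 3, "source_count": 2 },
    22: { "author_count": 0, "subject_count": 1, "source_count": 0 },
}
print( summarize_data_map( example_map ) )  # {'authors': 1, 'subjects': 4, 'sources': 2}
```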
Loop over all error records and check:
- whether each is tagged with the coded-by tag ( OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME ).
- whether each has an ArticleData instance from the automated coder, and if so, what author, subject, and source counts it holds.
In [13]:
# declare variables
got_errors = None
error_dictionary = None
error_count = None
error_article_id = None
error_status_list = None
error_status_counter = None
article_instance = None
tag_name_list = None
coded_by_tag_name = None
has_coded_by_tag = None
# declare variables - ArticleData validation
error_article_id_to_data_map = None
article_data_qs = None
article_data_count = None
article_data_instance = None
article_data_id = None
automated_coder_type = None
article_data_map = None
article_author_qs = None
author_count = None
article_subject_qs = None
subject_qs = None
subject_count = None
source_qs = None
source_count = None
has_data_count = None
has_people_count = None
has_subjects_count = None
has_sources_count = None
# init
coded_by_tag_name = OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME
#automated_coder_user = ArticleCoder.get_automated_coding_user()
automated_coder_type = OpenCalaisV2ArticleCoder.CONFIG_APPLICATION
error_article_id_to_data_map = {}
# got errors?
got_errors = my_article_coding.has_errors()
if ( got_errors == True ):
# get error dictionary
error_dictionary = my_article_coding.get_error_dictionary()
# get error count
error_count = len( error_dictionary )
log_message = "\n\n====> Count of articles with errors: {}".format( error_count )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
# loop...
has_coded_tag_counter = 0
has_article_data_counter = 0
has_data_count = 0
has_people_count = 0
has_subjects_count = 0
has_sources_count = 0
for error_article_id, error_status_list in six.iteritems( error_dictionary ):
log_message = "\nError article ID: {}".format( error_article_id )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
# output errors for this article.
log_message = "- errors for article ID {}:".format( error_article_id )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
# loop over status messages.
error_status_counter = 0
for error_status in error_status_list:
# increment status
error_status_counter += 1
# print status
log_message = "----> status #{}: {}".format( error_status_counter, error_status )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
#-- END loop over status messages. --#
# load article
article_instance = Article.objects.get( pk = error_article_id )
# get tag name list
tag_name_list = article_instance.tags.names()
# is coded-by tag name present?
if ( coded_by_tag_name in tag_name_list ):
# it is there, as it should be.
has_coded_by_tag = True
has_coded_tag_counter += 1
else:
# not there. Error.
has_coded_by_tag = False
#print( "ERROR in article {}: coded-by tag ( {} ) not in tag list: {}".format( error_article_id, coded_by_tag_name, tag_name_list ) )
#-- END check for coded-by tag name in tag list. --#
# is there an ArticleData instance by automated coder for OpenCalais V.2?
article_data_qs = article_instance.article_data_set.filter( coder = automated_coder_user )
article_data_qs = article_data_qs.filter( coder_type = automated_coder_type )
article_data_count = article_data_qs.count()
if ( article_data_count == 1 ):
# got one. Increment counter.
has_article_data_counter += 1
# TODO - check how many sources, subjects.
article_data_instance = article_data_qs.get()
article_data_id = article_data_instance.id
# create article data map
article_data_map = {}
article_data_map[ "article_id" ] = error_article_id
article_data_map[ "article_instance" ] = article_instance
article_data_map[ "article_data_instance" ] = article_data_instance
article_data_map[ "article_data_id" ] = article_data_id
# get count of authors
article_author_qs = article_data_instance.article_author_set.all()
author_count = article_author_qs.count()
article_data_map[ "author_count" ] = author_count
# get count of subjects
article_subject_qs = article_data_instance.article_subject_set.all()
article_subject_total_count = article_subject_qs.count()
article_data_map[ "article_subject_total_count" ] = article_subject_total_count
if ( article_subject_total_count > 0 ):
has_people_count += 1
#-- END check to see if any people found at all --#
# just subjects
subject_qs = article_subject_qs.filter( subject_type = Article_Subject.SUBJECT_TYPE_MENTIONED )
subject_count = subject_qs.count()
article_data_map[ "subject_count" ] = subject_count
if ( subject_count > 0 ):
has_subjects_count += 1
#-- END check to see if any subjects found --#
# get count of sources
source_qs = article_subject_qs.filter( subject_type = Article_Subject.SUBJECT_TYPE_QUOTED )
source_count = source_qs.count()
article_data_map[ "source_count" ] = source_count
if ( source_count > 0 ):
has_sources_count += 1
#-- END check to see if any sources found --#
# store information for article.
error_article_id_to_data_map[ error_article_id ] = article_data_map
if ( ( author_count == 0 ) and ( article_subject_total_count == 0 ) ):
pass
#print( "- No authors or sources in article {}".format( error_article_id ) )
else:
# increment populated data count
has_data_count += 1
log_message = "- Found data in article {}: person = {}; subject = {}; source = {}".format( error_article_id, article_subject_total_count, subject_count, source_count )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
#-- END sanity check for empty data (won't be zero, shouldn't be many) --#
elif ( article_data_count > 1 ):
# more than one?
log_message = "ERROR in article {}: more than one ArticleData instance ( {} ) for automated coder ( {} ), coder type: {}.".format( error_article_id, article_data_count, automated_coder_user, automated_coder_type )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
else:
# no ArticleData.
pass
#-- END check to see if ArticleData by automated coder, Open Calais v.2 --#
#-- END loop over articles. --#
log_message = "- Tagged article count: {}".format( has_coded_tag_counter )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
log_message = "- Correct ArticleData count: {}".format( has_article_data_counter )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
log_message = "- Has data count: {}".format( has_data_count )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
log_message = "- Has people count: {}".format( has_people_count )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
log_message = "- Has subjects count: {}".format( has_subjects_count )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
log_message = "- Has sources count: {}".format( has_sources_count )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
else:
log_message = "NO ERRORS! YAY!"
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
#-- END check to see if errors --#
NOTE: It looks like articles for which OpenCalais returned a network error do not get the coded tag applied, so they will remain in the pool to be re-coded in subsequent runs.
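The NOTE above implies a simple selection rule for subsequent runs: anything that never received the coded tag is still in the pool. A sketch with illustrative integer IDs (in the actual notebook this selection would be a Django QuerySet filter on tags, not Python sets):

```python
# illustrative article IDs only - the real selection filters on taggit tags.
all_article_ids = { 101, 102, 103, 104, 105 }
coded_article_ids = { 101, 102, 104 }  # successfully coded, so tagged.

# articles that errored out never got the tag, so they remain in the pool.
recode_pool = all_article_ids - coded_article_ids
print( sorted( recode_pool ) )  # [103, 105]
```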
In [14]:
# get list of error IDs from map.
if ( error_dictionary is not None ):
error_article_id_list = list( six.viewkeys( error_dictionary ) )
log_message = "IDs of articles with errors: {}".format( error_article_id_list )
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
else:
log_message = "STILL NO ERRORS! YAY!"
my_logging_helper.output_message( log_message, do_print_IN = True, log_level_code_IN = logging.INFO )
#-- END check to see if None --#
TODO:
DONE: