2017.07.01 - work log - prelim_month - evaluate disagreements

Setup

Setup - Imports


In [ ]:
import datetime
import json
import six

print( "packages imported at " + str( datetime.datetime.now() ) )

In [ ]:
%pwd

Setup - Initialize Django

First, initialize my dev django project, so code in this notebook can reference my django models and talk to the database using my project's settings.

You need to have installed your virtualenv (the one with django) as a Jupyter kernel, then selected that kernel for this notebook.


In [ ]:
%run django_init.py
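
For reference, a minimal django_init.py for standalone django setup usually looks something like the sketch below (the settings module path here is a placeholder, not necessarily this project's actual settings):

    # minimal sketch of a django_init.py - settings module path is a placeholder.
    import os
    import django

    # point django at the project's settings, then initialize it.
    os.environ.setdefault( "DJANGO_SETTINGS_MODULE", "research.settings" )
    django.setup()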

Import any sourcenet or context_analysis models or classes.


In [ ]:
# django imports
from django.contrib.auth.models import User

# sourcenet shared
from context_text.shared.person_details import PersonDetails

# sourcenet models.
from context_text.models import Article
from context_text.models import Article_Data
from context_text.models import Article_Subject
from context_text.models import Person
from context_text.shared.context_text_base import ContextTextBase
from context_text.tests.models.test_Article_Data_model import Article_Data_Copy_Tester

# sourcenet article_coding
from context_text.article_coding.article_coding import ArticleCoder
from context_text.article_coding.manual_coding.manual_article_coder import ManualArticleCoder

# context_analysis models.
from context_analysis.models import Reliability_Names
from context_analysis.models import Reliability_Names_Evaluation
from context_analysis.reliability.reliability_names_builder import ReliabilityNamesBuilder

print( "sourcenet and context_analysis packages imported at " + str( datetime.datetime.now() ) )

Setup - Tools

Tool - copy Article_Data to user ground_truth

Retrieve the ground truth user, then make a deep copy of an Article_Data record, assigning it to the ground truth user.


In [ ]:
def copy_to_ground_truth_user( source_article_data_id_IN ):

    '''
    Accepts ID of Article_Data instance to copy to ground_truth user,
        for correcting coding error made by human coder.  Performs a deep
        copy of Article_Data instance, then assigns it to the ground_truth
        user.  Prints any validation errors, returns the new Article_Data.
    '''
    
    # return reference
    new_article_data_instance_OUT = -1
    
    # declare variables
    ground_truth_user = None
    ground_truth_user_id = -1
    id_of_article_data_to_copy = -1
    new_article_data = None
    new_article_data_id = -1
    validation_error_list = None
    validation_error_count = -1
    validation_error = None

    # set ID of article data we want to copy.
    id_of_article_data_to_copy = source_article_data_id_IN

    # get the ground_truth user's ID.
    ground_truth_user = ContextTextBase.get_ground_truth_coding_user()
    ground_truth_user_id = ground_truth_user.id

    # make the copy
    new_article_data = Article_Data.make_deep_copy( id_of_article_data_to_copy,
                                                    new_coder_user_id_IN = ground_truth_user_id )
    new_article_data_id = new_article_data.id

    # validate it.
    validation_error_list = Article_Data_Copy_Tester.validate_article_data_deep_copy( original_article_data_id_IN = id_of_article_data_to_copy,
                                                                                      copy_article_data_id_IN = new_article_data_id,
                                                                                      copy_coder_user_id_IN = ground_truth_user_id )

    # get error count:
    validation_error_count = len( validation_error_list )
    if ( validation_error_count > 0 ):

        # loop and output messages
        for validation_error in validation_error_list:

            print( "- Validation erorr: " + str( validation_error ) )

        #-- END loop over validation errors. --#

    else:

        # no errors - success!
        print( "Record copy a success (as far as we know)!" )

    #-- END check to see if validation errors --#

    print( "copied Article_Data id " + str( id_of_article_data_to_copy ) + " INTO Article_Data id " + str( new_article_data_id ) + " at " + str( datetime.datetime.now() ) )
    
    new_article_data_instance_OUT = new_article_data
    
    return new_article_data_instance_OUT

#-- END function copy_to_ground_truth_user() --#

print( "function copy_to_ground_truth_user() defined at " + str( datetime.datetime.now() ) )

In [ ]:
# Example: set ID of article data we want to copy.
#copy_to_ground_truth_user( 2342 )

Tool - delete Article_Data

Delete the Article_Data whose ID you specify (intended only for cleaning up when you accidentally create an extra "ground_truth" copy).


In [ ]:
def delete_article_data( article_data_id_IN, do_delete_IN = False ):

    # declare variables
    article_data_id = -1
    article_data = None
    do_delete = False

    # set do_delete from parameter.
    do_delete = do_delete_IN
    
    # set ID.
    article_data_id = article_data_id_IN

    # get model instance ( .filter().first() returns None when there is no
    #     match, where .get() would raise DoesNotExist ).
    article_data = Article_Data.objects.filter( id = article_data_id ).first()

    # got something?
    if ( article_data is not None ):

        # yes.  Delete?
        if ( do_delete == True ):

            # delete.
            print( "Deleting Article_Data: " + str( article_data ) )
            article_data.delete()

        else:

            # no delete.
            print( "Found Article_Data: " + str( article_data ) + ", but not deleting." )

        #-- END check to see if we delete --#

    else:

        # no match.
        print( "No Article_Data found for ID " + str( article_data_id ) + "." )

    #-- END check to see if Article_Data match. --#
    
#-- END function delete_article_data() --#

print( "function delete_article_data() defined at " + str( datetime.datetime.now() ) )

Tool - update Reliability_Names labels for an article

Steps:

  • retrieve the Reliability_Names row(s) for the article with a particular ID, filtering on label if one is provided.
  • update the selected Reliability_Names row(s) with the new label you pass in.

In [ ]:
def update_reliability_names_label_for_article( article_id_IN, new_label_IN ):

    # declare variables
    article_id = -1
    label = ""
    row_string_list = None

    # first, get existing Reliability_Names rows for article and label.
    article_id = article_id_IN
    label = "prelim_month"

    # Do the update
    row_string_list = Reliability_Names.update_reliabilty_names_for_article( article_id,
                                                                             filter_label_IN = label,
                                                                             new_label_IN = new_label_IN,
                                                                             do_delete_IN = False )

    # print the strings.
    for row_string in row_string_list:

        # print it.
        print( row_string )

    #-- END loop over row strings --#

#-- END function update_reliability_names_label_for_article() --#

print( "function update_reliability_names_label_for_article() defined at " + str( datetime.datetime.now() ) )

Tool - rebuild Reliability_Names for an article

Steps:

  • retrieve the Reliability_Names row(s) for the article with a particular ID, filtering on label if one is provided.
  • delete the selected Reliability_Names row(s).
  • set up a call to the Reliability_Names program that just generates data for:

    • the article in question
    • users in a desired order.
    • etc.

Delete existing Reliability_Names for article


In [ ]:
def delete_reliability_names_for_article( article_id_IN ):

    # declare variables
    article_id = -1
    label = ""
    do_delete = False
    row_string_list = None

    # first, get existing Reliability_Names rows for article and label.
    article_id = article_id_IN
    label = "prelim_month"
    do_delete = True

    # Do the delete
    row_string_list = Reliability_Names.delete_reliabilty_names_for_article( article_id,
                                                                             label_IN = label,
                                                                             do_delete_IN = do_delete )

    # print the strings.
    for row_string in row_string_list:

        # print it.
        print( row_string )

    #-- END loop over row strings --#

#-- END function delete_reliability_names_for_article() --#

print( "function delete_reliability_names_for_article() defined at " + str( datetime.datetime.now() ) )

Make new Reliability_Names


In [ ]:
def rebuild_reliability_names_for_article( article_id_IN, delete_existing_first_IN = True ):
    
    '''
    Remove existing Reliability_Names records for article, then rebuild them
        from related Article_Data that matches any specified criteria.
        
    Detailed logic:
    - remove old Reliability_Names for that article ( [Delete existing `Reliability_Names` for article](#Delete-existing-Reliability_Names-for-article) ).  Make sure to specify both label and Article ID, so you don't delete more than you intend.
    - re-run Reliability_Names creation for the article ( [Make new `Reliability_Names`](#Make-new-Reliability_Names) ).  Specify:

        - Article ID list (just put the ID of the article you want to reprocess in the list).
        - label: make sure this is the same as the label of the rest of your Reliability_Names records ("prelim_month").
        - Tag list: If you want to make even more certain that you don't do something unexpected, also specify the article tags that make up your current data set, so if you accidentally specify the ID of an article not in your data set, it won't process.  Current tag is "grp_month".
        - Coders to assign to which index in the Reliability_Names record, and in what priority.  You can assign multiple coders to a given index, for example, when multiple coders coded subsets of a data set, and you want their combined coding to be used as "coder 1" or "coder 2", for example.  See the cell for an example.
        - Automated coder type: You can specify the particular automated coding type you want for automated coder, to filter out coding done by other automated methods.  See the cell for an example for "OpenCalais v2".
    '''
    
    # django imports
    #from django.contrib.auth.models import User

    # sourcenet imports
    #from context_text.shared.context_text_base import ContextTextBase

    # context_analysis imports
    #from context_analysis.reliability.reliability_names_builder import ReliabilityNamesBuilder

    # declare variables
    my_reliability_instance = None
    tag_in_list = []
    article_id_in_list = []
    label = ""

    # declare variables - user setup
    current_coder = None
    current_coder_id = -1
    current_index = -1
    current_priority = -1

    # declare variables - Article_Data filtering.
    coder_type = ""

    # delete old Reliability_Names?
    if ( delete_existing_first_IN == True ):
        
        # delete first
        delete_reliability_names_for_article( article_id_IN )
        
    #-- END check to see if we delete first --#
    
    # make reliability instance
    my_reliability_instance = ReliabilityNamesBuilder()

    #===============================================================================
    # configure
    #===============================================================================

    # list of tags of articles we want to process.
    tag_in_list = [ "grp_month", ]

    # list of IDs of articles we want to process:
    article_id_in_list = [ article_id_IN, ]

    # label to associate with results, for subsequent lookup.
    label = "prelim_month"

    # ! ====> map coders to indices

    # set it up so that...

    # ...the ground truth user has highest priority (4) for index 1...
    current_coder = ContextTextBase.get_ground_truth_coding_user()
    current_coder_id = current_coder.id
    current_index = 1
    current_priority = 4
    my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )

    # ...coder ID 8 is priority 3 for index 1...
    current_coder_id = 8
    current_index = 1
    current_priority = 3
    my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )

    # ...coder ID 9 is priority 2 for index 1...
    current_coder_id = 9
    current_index = 1
    current_priority = 2
    my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )

    # ...coder ID 10 is priority 1 for index 1...
    current_coder_id = 10
    current_index = 1
    current_priority = 1
    my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )

    # ...and automated coder (2) is index 2
    current_coder = ContextTextBase.get_automated_coding_user()
    current_coder_id = current_coder.id
    current_index = 2
    current_priority = 1
    my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )

    # and only look at coding by those users.  And...

    # configure so that it limits to automated coder_type of OpenCalais_REST_API_v2.
    coder_type = "OpenCalais_REST_API_v2"
    #my_reliability_instance.limit_to_automated_coder_type = "OpenCalais_REST_API_v2"
    my_reliability_instance.automated_coder_type_include_list.append( coder_type )

    # output debug JSON to file
    #my_reliability_instance.debug_output_json_file_path = "/home/jonathanmorgan/" + label + ".json"

    #===============================================================================
    # process
    #===============================================================================

    # process articles
    my_reliability_instance.process_articles( tag_in_list,
                                              article_id_in_list_IN = article_id_in_list )

    # output to database.
    my_reliability_instance.output_reliability_data( label )

#-- END function rebuild_reliability_names_for_article() --#

print( "function rebuild_reliability_names_for_article() defined at " + str( datetime.datetime.now() ) )

Tag disagreements as TODO

First, assign "TODO" tag to all disagreements using the "View reliability name information" screen:

To do this:

  • First, enter the following in the fields there:

    • Label: "prelim_month"
    • Coders to compare (1 through ==>): 2
    • Reliability names filter type: Select "Disagree (only rows with disagreement between coders)"
  • Click the "Submit Query" button. This should load all the disagreement rows (424 after removing single-word names).

  • Click the "(all)" link in the "select" column header to check the checkbox next to all of the records.
  • In the "Reliability names action:" field, select "Add tag(s) to selected".
  • In the "Tag(s) - (comma-delimited):" field, enter "TODO" (without the quotes).
  • Click the "Do Action" button.

Evaluate disagreements

Need to go through each disagreement and make sure that the ground truth is correct. In the interest of accuracy/precision/recall, my human coding serves as ground truth to compare the computer against. So, will look at all the disagreements and make sure that the human coding is right. This isn't perfect - the error where both incorrectly agree is still unaddressed, and addressing it would effectively require me to re-code all the articles (which I could do...). But, better than not checking.

Evaluate disagreements using the "View reliability name information" screen:

To start, enter the following in fields there:

  • Label: "prelim_month"
  • Coders to compare (1 through ==>): 2
  • Reliability names filter type: Select "Lookup"
  • [Lookup] - Reliability_Names tags (comma-delimited): Enter "TODO" (without the quotes).

Then click the "Submit Query" button.

You should see all the records with disagreements that still need to be evaluated (we remove "TODO" from records as we go, to keep track of which we have evaluated). To start, the same 424 records that had disagreements after removing single names should be assigned the "TODO" tag.

Article assessment

First, need to make sure that the article in question is actually a news article. Some lists and columns are written in such a different style from traditional news articles that they really shouldn't be included in this study. Others might be OK to include. For now, try to fix ground truth on them all, and I'll come back - probably will give the non-news Article_Data a separate label, then do the analysis on only prelim_month, then on the combination of news and non-news to see the difference.

Excluded articles (label "prelim_month_exclude"):

  • 22869 - Dec 20, 2009, Lakeshore ( N4 ), UID: 12CBF3D33570A618 - Faces in the crowd - Lakeshore sports provides fascinating stories ( Grand Rapids Press, The )

In [ ]:
article_id = 22869
new_label = "prelim_month_exclude"
#update_reliability_names_label_for_article( article_id, new_label )
print( "==> Updated labels for article " + str( article_id ) + " to " + str( new_label ) + " at " + str( datetime.datetime.now() ) )

Disagreement evaluation

Need to look at each instance where there is a disagreement and make sure the human coding is correct.

Most are probably instances where the computer screwed up, but since we are calling this human coding "ground truth", want to winnow out as much human error as possible.

For each disagreement, to check for coder error (like only capturing a name part for a person whose full name was in the story), click the link in the "Article ID" column. It will take you to a view of the article where all the people who coded the article are included, with each detected mention or quotation displayed next to the paragraph where the person was first detected.

If the disagreement deals with mentions only (the person was not, and should not have been, quoted), it is OK to skip fixing a human coder error, since mentions are not included in this work. It is also OK to fix it if you want.

Disagreement resolution

For each disagreement, click on the article ID link in the row to go to the article and check to see if the human coding for the disagreement in question is correct ( http://research.local/research/context/text/article/article_data/view_with_text/ ).

Once you've evaluated and verified the human coding, remove the "TODO" tag from the current record (either from the single-article view above if you've removed all disagreements, or from the disagreement view if not):

  • Click the checkbox in the "select" column next to the record whose evaluation is complete.
  • In the "Reliability names action:" field, select "Remove tag(s) from selected".
  • In the "Tag(s) - (comma-delimited):" field, enter "TODO" (without the quotes).
  • Click the "Do Action" button.
  • This will also place information on the Reliability_Names record into a Reliability_Names_Evaluation record in the database. The message that results from this action completing will include a link to the record (the first number in the output). Click the link to open the record and update it with additional details. Specifically:

    • status - status of human coder's coding:

      • If the human coder got it right, status is "CORRECT", even if OpenCalais had an egregious error.
      • If this is because the coding screen couldn't capture compound names initially, set status to "INCOMPLETE", set the status message to "SKIPPED because screen couldn't deal with compound names", put the compound name string in notes, and then add tag "compound_names".
      • if the OC coder had an issue because we had to smoosh all paragraphs together (OpenCalais didn't deal well with HTML markup in the body of the text it processed), set status to "CORRECT", set status message to "OC ERROR because of formatting", and then explain the problem in the notes. If the article is a list or column with odd formatting, consider flagging the article for removal from the study.
      • if you have to update ground truth, set "status" to "ERROR".
      • else, use your best judgment.
    • if problems caused by automated coder error, click the "is_automated_error" checkbox.

    • update the "status_message" so it contains a brief description of what exactly happened (should have been mentioned, should have been quoted, missed the person entirely, etc.).
    • update "Notes" with more details.

      • If should have been quoted or mentioned, note the graf # and paragraph text of the paragraph that indicates this.
    • add "Tags" if appropriate (for sports articles, for example, add "sports" tag).

NOTE: Always remove TODO tag first, so you have a record of status for each Reliability Names record. Then, once you've done that, you can merge, delete, etc.
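
For reference, the same Reliability_Names_Evaluation updates can also be made via the ORM instead of the admin screen. A minimal sketch (the record ID and field values are hypothetical examples, and taggit-style tags are assumed based on the tags__name__in query used later in this notebook):

    # sketch - update an evaluation record programmatically.  ID and values are examples.
    rne_instance = Reliability_Names_Evaluation.objects.get( id = 1234 )
    rne_instance.status = "CORRECT"
    rne_instance.status_message = "OC ERROR because of formatting"
    rne_instance.notes = "OpenCalais missed the quote in graf 7 because of flattened paragraphs."
    rne_instance.is_automated_error = True
    rne_instance.save()

    # add a tag ( assuming django-taggit )
    rne_instance.tags.add( "no_html" )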

Standard tags

  • ambiguous - something that is ambiguous because of the article's implementation.
  • complex - something genuinely complicated, ambiguous, or confusing.
  • complex_titles - issues with long or complex titles.
  • compound_attribution - single statement attributed to two or more people ("... Williams and Helder said.").
  • compound_names - issue with compound names ("Dave and Krista Mason"), later fixed in admin.
  • contributed_to - problems because of reporters credited in the last paragraph.
  • dictionary_error - issues with name parts that appear to be in a dictionary.
  • disambiguation - the specific topic of ambiguity when matching name text to stored named entities.
  • editing_error - issues with editing errors.
  • error - particularly interesting OpenCalais errors.
  • follow_on_attribution - problems with pronoun attribution ("he said" in the paragraph after a person is introduced - follow-on attribution).
  • foreign_names - issues related to foreign names.
  • gender_confusion - issues with gender confusion (names that can refer to either gender - Dominique, etc.).
  • initials - issues with initials (R.J. Smith).
  • interesting - something interesting (for examples for the paper).
  • layout_or_design - issues with article layout/design.
  • list - issues with lists within an article.
  • lookup - issues with looking up a person based on a name string.
  • no_html - problems because the OpenCalais API doesn't deal well with HTML, so I passed flattened text.
  • non_news - non-news articles, for example sports or book reviews.
  • pronouns / pronoun_attribution - a pronoun reference is ambiguous/indeterminate, or other "pronoun" chicanery.
  • proper_nouns - issues with proper nouns and names referring to other than people.
  • quote_distance - problem with distance between intro of person and quote (a guess...).
  • said_verb - problem with the said verb.
  • second_hand - second-hand attribution fooling OpenCalais.
  • spanish - issues related to the Spanish language.
  • sports - sports articles.
  • straightforward - a decision that seems very straightforward, but OC errored.
  • short_n-gram - issues with a short n-gram.
  • title_prefix - issues with titles that precede a name.

Human coder error

Order of operations:

  • 1) look at all disagreements for the article.
  • 2) remove all TODO tags from all disagreements, and fill in details for each.
  • 3) follow steps below to create ground_truth copy and fix it.
  • 4) rebuild Reliability_Names for article and cleanup.
  • 5) then, do any deletes or merges you need to do, so you only do them once.

If human coder did not detect person or made some other kind of error:

Remove TODO tag

  • In the Reliability_Names disagreement view ( http://research.local/research/context/analysis/reliability/names/disagreement/view ), remove the "TODO" tag from any items related to this disagreement and save:

    • Click the checkbox in the "select" column next to the record whose evaluation is complete.
    • In the "Reliability names action:" field, select "Remove tag(s) from selected".
    • In the "Tag(s) - (comma-delimited):" field, enter "TODO" (without the quotes).
    • Click the "Do Action" button.

Add details to Reliability_Names_Evaluation record

  • This will also place information on the Reliability_Names record into a Reliability_Names_Evaluation record in the database. The message that results from this action completing will include a link to the record (the first number in the output). Click the link to open the record and update it with additional details. Specifically:

    • status:

      • If this is because the coding screen couldn't capture compound names initially, set status to "INCOMPLETE", set the status message to "SKIPPED because screen couldn't deal with compound names", put the compound name string in notes, and then add tag "compound_names".
      • else, since we had to update ground truth, set "status" to "ERROR".
    • update the "status_message" so it contains a brief description of what exactly happened (should have been mentioned, should have been quoted, missed the person entirely, etc.).

    • update "Notes" with more details, including the text in question.
    • check the "is_ground_truth_fixed" checkbox.
    • if problems caused by automated coder error, click the "is_automated_error" checkbox.
    • if the disagreement was over text that is ambiguous, then click the "is_ambiguous" checkbox.
    • add "Tags" if appropriate.

Store Article ID and article data ID

Set "resolve_article_id" and "human_article_data_id" variables values, then run the cell.


In [ ]:
# Setup variables of interest.
resolve_article_id = 24132
human_article_data_id = 2801

print( "SET variables:" )
print( "- resolve_article_id = " + str( resolve_article_id ) )
print( "- human_article_data_id = " + str( human_article_data_id ) )
print( "at " + str( datetime.datetime.now() ) )

Copy to ground_truth user

  • use the function "copy_to_ground_truth_user()" defined in section Tool - copy Article_Data to user ground_truth to create a copy of the person's Article_Data and assign it to coder "ground_truth". Make a code cell and set up a call to "copy_to_ground_truth_user()", passing it the ID of the Article_Data you want to copy to ground_truth. Example:

      # copy Article_Data 12345 to ground_truth user.
      copy_to_ground_truth_user( 12345 )

In [ ]:
# copy Article_Data to ground_truth user.
print( "==> copy_to_ground_truth_user() (article: " + str( resolve_article_id ) + ") run at " + str( datetime.datetime.now() ) )
print( "" )
copy_to_ground_truth_user( human_article_data_id )

In [ ]:
# if you screw up and create two, you can delete one:
delete_article_data( 3388, do_delete_IN = True )

Fix ground_truth coding

  • If you want to stay logged in as your normal user while processing an error, do the following in a separate browser (I like Opera).
  • if this is the first time you've used the "ground_truth" user, log into the django admin ( http://research.local/research/admin/ ) and:

    • set or reset the "ground_truth" user's password.
    • give it "staff status".
  • log in to the coding tool ( http://research.local/research/context/text/article/code/ ) as the "ground_truth" user and fix the coding for the article in question, then save.

Rebuild Reliability_Names for article

  • make a code cell and call "rebuild_reliability_names_for_article()", passing it the ID of the article whose Reliability_Names records you want to rebuild. It will automatically delete existing and then rebuild, using all the right parameters. Example:

      # rebuild Reliability_Names for article 12345
      rebuild_reliability_names_for_article( 12345 )

In [ ]:
# rebuild Reliability_Names for article
print( "==> rebuild_reliability_names_for_article() (article: " + str( resolve_article_id ) + ") run at " + str( datetime.datetime.now() ) )
print( "" )
rebuild_reliability_names_for_article( resolve_article_id )

Re-fix article problems

  • Then, you'll need to re-fix any other problems with the article. Specifically:

    • load just the Reliability_Names for this article - http://research.local/research/context/analysis/reliability/names/disagreement/view:

      • Label: "prelim_month"
      • Coders to compare (1 through ==>): 2
      • Reliability names filter type: Select "Lookup"
      • [Lookup] - Associated Article IDs (comma-delimited): Enter "<article_id>," (without the quotes).
    • check for single names, either to remove, or to tie an erroneously parsed name to the correct person (forgot to capture first name, for example).

    • If two people that should be tied together are not (if human detects a person, and then if computer coder misses a name part but detects the person, just with a truncated name, for example), you'll need to merge the two rows. See Merging computer and human Reliability_Names records below for more details.
    • add again the "TODO" tag to any rows with disagreement that haven't already been evaluated.

      • Click the checkbox in the "select" column next to any records that are either disagreements or the person who initiated this work.
      • In the "Reliability names action:" field, select "Add tag(s) to selected".
      • In the "Tag(s) - (comma-delimited):" field, enter "TODO" (without the quotes).
      • Click the "Do Action" button.
    • Note: as you re-process, check whether any of the steps already has a Reliability_Names_Evaluation row; if so, remove the newer one so you don't duplicate any of the recorded actions.


Merging computer and human Reliability_Names records

If there is a problem where human and computer coding of same person are so different they split into different rows, merge the computer row into the human row, then remove the computer row.

  • First, merge the computer row into the human row:

    • Click the checkbox in the "select" column in the computer coding row.
    • Click the checkbox in the "merge INTO" column in the human coding row.
    • In the field "Reliability names action" at the top of the list, select "Merge Coding --> FROM 'select' TO 'merge INTO'".
    • Click "Do Action" to perform the merge.
    • Once the merge is complete, you can click the link in the resulting output to go to the Reliability_Names_Evaluation record created to track the merge and update any of the boolean flags (likely you'll want to set "is_automated_error" to checked, for example), add notes, adjust status message, etc.
  • Second, delete the computer-only row:

    • As a sanity check, make sure that the row that originally only contained human coding now has both human and computer (coder ID of 2).
    • Then, click the checkbox in the "select" column in the computer-only coding row.
    • In the field "Reliability names action" at the top of the list of Reliability_Names records, select "Delete Selected".
    • Click "Do Action" to perform the delete.

Resolution logs

Table of Reliability_Names records with disagreements, then separate tables of those where:

  • human coding had to be fixed.
  • records for the same person needed to be merged together.
  • coding had to be deleted.

Evaluation log

Track each Reliability_Names that we evaluate:

Ground truth coding fixed

For some, the error will be on the part of the human coder. For human error, we create a new "ground_truth" record that we will correct, so we preserve original coding (and evidence of errors) in case we want or need that information later. Below, we have a table of the articles where we had to fix ground truth. To find the original coding, click the Article link.

Reliability_Names records merged

For some, we need to merge a single-name detection by OpenCalais with the full-name detection by ground_truth (an OpenCalais error - it did not detect the full name - combined with a lookup error - it didn't look up the right person, since it missed part of his or her name). In these cases, we will usually also have subsequently deleted one or more duplicate rows.

Deleted Reliability_Names records

Some records are just broken, need to be deleted.

Notes

Notes and questions

Notes and questions:

  • the thing with complex preceding titles is, I think, more straightforward - it looks like proper nouns that start with a title aren't recognized as names, probably because the title words aren't in some name dictionary or lookup table.

Errors

Errors of note in automated coding:

TODO

TODO:

  • add field to article table for non-news or is_hard_news.
  • Article 22181 - Why is the incorrect person "Christian Reformed Church" tagged as being mentioned in paragraph 14 rather than 18 where that string is?
  • Want a way to limit to disagreements where quoted? Might not - this is a start to assessing erroneous agreement. If yes, 1 < coding time < 4 hours.

    • problem - Reliability_Names.person_type only has three values - "author", "subject", "source" - might need a row-level measure of "has_mention", "has_quote" to more readily capture rows where disagreement is over quoted-or-not.

TODO - filter articles that are not news

TODO:

  • Article 22705 - Book roundup - probably should just remove from study, and see if there is meta-data about articles that could be used to automatically filter these types of articles out in the future. Leaving in for now, but should flag these so I can compare numbers with and without.
  • Use keywords for Lakeshore section stories to try to filter out sports stories ("Basketball"). Maybe try this for all articles in the month?
  • sports

TODO - Update protocol

TODO:

  • TK

Coding to look into

Coding decisions to look at more closely:

  • TK

Debugging

Issues to debug:

  • TK

DONE

DONE:

  • Make an admin for Reliability_Names, so I can filter and sort and try deleting to see if removing Reliability_Names causes removal of related Reliability_Names_Evaluation.
  • Update sections of code that output table markdown to also insert that information into the database in Reliability_Names_Evaluation.

    • // debug admin pages.
    • // import all of the existing rows from pipe delimited string.

      • // base list
      • // fixed ground truth
      • // deleted
      • // merged
    • // update the places where it outputs the pipe-delimited lists to write also to the database.

  • merge didn't populate merge fields in evaluation record. Need to backup VM, then debug.

  • for Article 21644: I updated ground truth, rebuilt Reliability_Names, merged Anna K. Simon with Lou Anna K. Simon, and deleted the old Anna K. Simon. Still need to clean up (there is one single name).
  • what to do about someone mentioned in the article but who is also an author? For now, can only be one or the other, so make them an author.
  • commit and push sourcenet, context_analysis, and work/msu...
  • Look at protocol to see what it says about a person quoted from a letter.

RNE not hard news flag

// Add field to evaluation table for non-news (and probably need a way to denote this in articles themselves, also...).

Need to set this flag for any Reliability_Names_Evaluation row with a tag of "sports", "list", or "non_news".


In [ ]:
# declare variables
tags_in_list = None
rne_qs = None
rne_instance = None

# include only those with certain tags.
tags_in_list = [ "sports", "list", "non_news" ]
if ( len( tags_in_list ) > 0 ):

    # filter ( distinct() prevents duplicate rows when a record has more
    #     than one matching tag )
    print( "filtering to just Reliability_Names_Evaluation records with tags: " + str( tags_in_list ) )
    rne_qs = Reliability_Names_Evaluation.objects.filter( tags__name__in = tags_in_list ).distinct()
    
    # loop
    for rne_instance in rne_qs:
        
        # set is_not_hard_news, then save.
        print( "==> Updating " + str( rne_instance ) )
        rne_instance.is_not_hard_news = True
        rne_instance.save()
        
    #-- END loop over Reliability_Names_Evaluation instances --#

#-- END check to see if we have a specific list of tags we want to include --#

quotes that contain paragraph break

Quotes with newlines in them (not sure how that is captured on the way to the server, in the database, etc.) break the article coder: http://research.local/research/context/text/article/code/.

When you load JSON that contains quote text that spans lines, the newlines within the text cause the JSON parsing to break. It looks like the text is read and parsed correctly when submitted to the server (except for the graf number - it evaluates to -1 - so that is a bug, too, since there are no newlines in any of the text we are looking at, just paragraph breaks).

How to fix?:

  • First try stripping out any stretches of multiple whitespace characters and substituting a single space. This should work with all of the rest of the code on the server. Can implement in javascript, and as a sanity check also in the Python that processes received JSON (see the sketch after this list).
  • If rest of code doesn't play nice with reformatting, then maybe figure out how to escape the carriage returns and line feeds, and might need to update the "find in text" functions, too.
  • turns out that fixing this in cases where the quotation spans paragraphs might then break things when there are extra spaces within a paragraph. So, leaving it as is for now - need to fix the offending paragraph in the article instead.
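
A minimal sketch of the whitespace-collapsing idea in Python (the function name here is just for illustration):

    # collapse any run of whitespace ( including newlines ) down to a single space.
    import re

    def collapse_whitespace( text_IN ):

        return re.sub( r'\s+', ' ', text_IN ).strip()

    #-- END function collapse_whitespace() --#

    # example - quote text that spans a paragraph break:
    print( collapse_whitespace( "first line.\n\nsecond line." ) )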

Examples:

partial name lookup

In the ajax-selects lookup filter for person - need to match on first and last name, excluding middle name (see the sketch below).
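
A minimal sketch of what such a filter might do (assuming the Person model has first_name and last_name fields; the function name and the token-based matching strategy are illustrative, not the actual ajax-selects implementation):

    # sketch: match on first and last name only, ignoring any middle name.
    # assumes first token is the first name and last token is the last name.
    def lookup_person_ignore_middle( name_string_IN ):

        name_part_list = name_string_IN.split()
        first_name = name_part_list[ 0 ]
        last_name = name_part_list[ -1 ]

        return Person.objects.filter( first_name__iexact = first_name,
                                      last_name__iexact = last_name )

    #-- END function lookup_person_ignore_middle() --#

    # example call - middle name "Allen" is ignored in the match.
    #lookup_person_ignore_middle( "Dave Allen Mason" )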

Update protocol

done:

  • add example of Kevin Matthews and Jack Doles from Article 23356: ""There are two hockey sequences in the movie, and we just had Kevin Matthews of WLAV and Jack Doles from Channel 8 out to record the play-by-play," Zandstra said."

    • for subjects, being mentioned inside another quote still counts. These two are both subjects.
  • add example of Sam Olivo (talked to sources for the story, quoted second-hand based on the sources' statements) in Article 21627: "Slowly, Sam Olivo, a 55-year-old from St. Johns and a 22-year state prison system employee, revealed he was jumped by at least one and up to five inmates assigned to a nearby work detail."

    • We care about the reporter talking to a source - if indirect, not a source.
  • what to do about sources who are quoted from a letter or document? Not a source - a subject.

  • what to do about someone mentioned in the article but who is also an author? For now, can only be one or the other, so make them an author.

NEXT