prelim_month_human-create_Reliability_Names-ground_truth_vs_human

old name 2017.10.20 - work log - prelim_month_human - create Reliability_Names

Create Reliability_Names data where coder 1 is ground truth, coder 2 is human coding without corrections for ground truth.

Setup

Setup - Imports


In [1]:
import datetime
import six

print( "packages imported at " + str( datetime.datetime.now() ) )


packages imported at 2017-10-25 23:50:10.961295

Setup - virtualenv jupyter kernel

If you are using a virtualenv, make sure that you:

  • have installed your virtualenv as a kernel.
  • choose the kernel for your virtualenv as the kernel for your notebook (Kernel --> Change kernel).

Since I use a virtualenv, I need to activate it inside this notebook. One option is to run ../dev/wsgi.py in this notebook to configure the Python environment manually, as if you had activated the sourcenet virtualenv. To do this, make a code cell that contains:

%run ../dev/wsgi.py

This is sketchy, however, because it changes the Python environment of whatever kernel is currently running. I'd worry about collisions with the actual Python 3 kernel. A better option is to install the virtualenv as a separate kernel. Steps:

  • activate your virtualenv:

      workon sourcenet
  • in your virtualenv, install the package ipykernel.

      pip install ipykernel
  • use the ipykernel python program to install the current environment as a kernel:

      python -m ipykernel install --user --name <env_name> --display-name "<display_name>"
    
    

    sourcenet example:

      python -m ipykernel install --user --name sourcenet --display-name "sourcenet (Python 3)"

More details: http://ipython.readthedocs.io/en/stable/install/kernel_install.html
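
Once the kernel is installed and selected, a quick check from inside the notebook can confirm that the interpreter really belongs to the virtualenv. This is a minimal sketch using only the standard library; the helper name describe_interpreter is made up for illustration.

```python
import sys


def describe_interpreter():
    """Return paths that identify which Python environment is running."""
    return {
        # full path to the python binary the kernel launched
        "executable": sys.executable,
        # root of the active environment (the virtualenv directory, if any)
        "prefix": sys.prefix,
        # root of the base installation; differs from prefix inside a venv
        # (older virtualenv releases set sys.real_prefix instead)
        "base_prefix": getattr(sys, "base_prefix", sys.prefix),
    }


info = describe_interpreter()
for key in sorted(info):
    print("{key}: {value}".format(key=key, value=info[key]))
```

If "executable" points into the sourcenet virtualenv directory, the notebook is using the intended kernel.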


In [2]:
%pwd


Out[2]:
'/home/jonathanmorgan/work/django/research/work/phd_work'

Setup - Initialize Django

First, initialize my dev django project, so I can run code in this notebook that references my django models and can talk to the database using my project's settings.


In [3]:
%run ../django_init.py


django initialized at 2017-10-26 03:50:14.391144

In [4]:
# django imports
from context_text.models import Article
from context_text.shared.context_text_base import ContextTextBase
from context_analysis.models import Reliability_Names
from context_analysis.reliability.coder_index_info import CoderIndexInfo
from context_analysis.reliability.index_info import IndexInfo
from context_analysis.reliability.index_helper import IndexHelper

Testing

I made big changes to support calculating precision and recall for human coders. Time to test...


In [5]:
# first, test the CoderIndexInfo
test_info = CoderIndexInfo( 4, None, 1, 5 )

# try getting user instance.
test_user_instance = test_info.get_coder_user_instance()

print( "Coder user: " + str( test_user_instance ) )
print( "test_info = " + str( test_info ) )


Coder user: jonathanmorgan
test_info = user ID: 4, user instance: jonathanmorgan, index: 1, priority: 5

In [6]:
# create an IndexInfo instance for index 1
test_index_info = IndexInfo()
test_index_info.set_index( 1 )

# configure as below:

# ...the ground truth user has highest priority (4) for index 1...
current_coder = ContextTextBase.get_ground_truth_coding_user()
current_coder_id = current_coder.id
current_priority = 4
add_status = test_index_info.add_coder( current_coder_id, priority_IN = current_priority )
print( ">>>> status = \"{status}\"".format( status = add_status ) )

# ...coder ID 8 is priority 3 for index 1...
current_coder_id = 8
current_index = 1
current_priority = 3
add_status = test_index_info.add_coder( current_coder_id, priority_IN = current_priority )
print( ">>>> status = \"{status}\"".format( status = add_status ) )

# ...coder ID 9 is priority 2 for index 1...
current_coder_id = 9
current_index = 1
current_priority = 2
add_status = test_index_info.add_coder( current_coder_id, priority_IN = current_priority )
print( ">>>> status = \"{status}\"".format( status = add_status ) )

# ...coder ID 10 is priority 1 for index 1...
current_coder_id = 10
current_index = 1
current_priority = 1
add_status = test_index_info.add_coder( current_coder_id, priority_IN = current_priority )
print( ">>>> status = \"{status}\"".format( status = add_status ) )

print( "index info: " + str( test_index_info ) )

test_id_to_info_map = test_index_info.get_coder_id_to_info_map()
for coder_user_id, coder_info in six.iteritems( test_id_to_info_map ):
    
    print( "--> coder id: " + str( coder_user_id ) + " = " + str( coder_info ) )
    
#-- END loop over coders in index. --#


>>>> status = ""
>>>> status = ""
>>>> status = ""
>>>> status = ""
index info: index: 1
====>coder info: {8: <context_analysis.reliability.coder_index_info.CoderIndexInfo object at 0x7f686fd17dd8>, 9: <context_analysis.reliability.coder_index_info.CoderIndexInfo object at 0x7f686fd17ef0>, 10: <context_analysis.reliability.coder_index_info.CoderIndexInfo object at 0x7f686fd17da0>, 13: <context_analysis.reliability.coder_index_info.CoderIndexInfo object at 0x7f686fd17c50>}
--> coder id: 8 = user ID: 8, user instance: minnesota1, index: 1, priority: 3
--> coder id: 9 = user ID: 9, user instance: minnesota2, index: 1, priority: 2
--> coder id: 10 = user ID: 10, user instance: minnesota3, index: 1, priority: 1
--> coder id: 13 = user ID: 13, user instance: ground_truth, index: 1, priority: 4

In [7]:
# create index helper.
test_index_helper = IndexHelper()

# is valid index?
print( test_index_helper.is_index_valid( 0 ) )
print( test_index_helper.is_index_valid( 1 ) )
print( test_index_helper.is_index_valid( 5 ) )
print( test_index_helper.is_index_valid( 10 ) )
print( test_index_helper.is_index_valid( 15 ) )


False
True
True
True
False
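
The output above is consistent with an inclusive range check in which indexes 1 through 10 are valid. The sketch below only illustrates that behavior; the bounds and the function body are assumptions inferred from the five results, not the actual IndexHelper.is_index_valid implementation.

```python
MIN_INDEX = 1   # assumed lower bound (0 was rejected above)
MAX_INDEX = 10  # assumed upper bound (10 was accepted, 15 rejected above)


def is_index_valid(index, min_index=MIN_INDEX, max_index=MAX_INDEX):
    """Return True when index is an integer inside the inclusive range."""
    return isinstance(index, int) and min_index <= index <= max_index


for candidate in (0, 1, 5, 10, 15):
    print(candidate, is_index_valid(candidate))
```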

In [8]:
# create index helper.
#test_index_helper = IndexHelper()

# ==> Index 1: human plus ground truth - set it up so that...

# configure as below:

# ...the ground truth user has highest priority (4) for index 1...
current_coder = ContextTextBase.get_ground_truth_coding_user()
current_coder_id = current_coder.id
current_index = 1
current_priority = 4
add_status = test_index_helper.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
print( ">>>> status = \"{status}\"\n\n".format( status = add_status ) )

# ...coder ID 8 is priority 3 for index 1...
current_coder_id = 8
current_index = 1
current_priority = 3
add_status = test_index_helper.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
print( ">>>> status = \"{status}\"\n\n".format( status = add_status ) )

# ...coder ID 9 is priority 2 for index 1...
current_coder_id = 9
current_index = 1
current_priority = 2
add_status = test_index_helper.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
print( ">>>> status = \"{status}\"\n\n".format( status = add_status ) )

# ...coder ID 10 is priority 1 for index 1...
current_coder_id = 10
current_index = 1
current_priority = 1
add_status = test_index_helper.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
print( ">>>> status = \"{status}\"\n\n".format( status = add_status ) )

# ==> Index 2: human (not ground truth) - set it up so that...

# coder ID 8 is priority 3 for index 2...
current_coder_id = 8
current_index = 2
current_priority = 3
add_status = test_index_helper.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
print( ">>>> status = \"{status}\"\n\n".format( status = add_status ) )

# ...coder ID 9 is priority 2 for index 2...
current_coder_id = 9
current_index = 2
current_priority = 2
add_status = test_index_helper.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
print( ">>>> status = \"{status}\"\n\n".format( status = add_status ) )

# ...coder ID 10 is priority 1 for index 2...
current_coder_id = 10
current_index = 2
current_priority = 1
add_status = test_index_helper.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
print( ">>>> status = \"{status}\"\n\n".format( status = add_status ) )


print( "\n>>>> index helper: " + str( test_index_helper ) )
print( "\n>>>> index info map: " + str( test_index_helper.m_index_to_info_map ) )

test_index_to_info_map = test_index_helper.get_index_to_info_map()

print( "\n>>>> index info map: " + str( test_index_to_info_map ) )
print( "\n" )

for current_index, index_info in six.iteritems( test_index_to_info_map ):
    
    print( "--> we'll see...? : " + str( current_index ) + " = " + str( index_info ) )
    
#-- END loop over indexes in helper. --#


>>>> status = ""


>>>> status = ""


>>>> status = ""


>>>> status = ""


>>>> status = ""


>>>> status = ""


>>>> status = ""



>>>> index helper: index-info-map: {1: <context_analysis.reliability.index_info.IndexInfo object at 0x7f686fd14a58>, 2: <context_analysis.reliability.index_info.IndexInfo object at 0x7f686fd14e10>}

>>>> index info map: {1: <context_analysis.reliability.index_info.IndexInfo object at 0x7f686fd14a58>, 2: <context_analysis.reliability.index_info.IndexInfo object at 0x7f686fd14e10>}

>>>> index info map: {1: <context_analysis.reliability.index_info.IndexInfo object at 0x7f686fd14a58>, 2: <context_analysis.reliability.index_info.IndexInfo object at 0x7f686fd14e10>}


--> we;ll see...? : 1 = index: 1
====>coder info: {8: <context_analysis.reliability.coder_index_info.CoderIndexInfo object at 0x7f686fd14f28>, 9: <context_analysis.reliability.coder_index_info.CoderIndexInfo object at 0x7f686fd14b38>, 10: <context_analysis.reliability.coder_index_info.CoderIndexInfo object at 0x7f686fd14cf8>, 13: <context_analysis.reliability.coder_index_info.CoderIndexInfo object at 0x7f686fd14b00>}
--> we;ll see...? : 2 = index: 2
====>coder info: {8: <context_analysis.reliability.coder_index_info.CoderIndexInfo object at 0x7f686fd14c88>, 9: <context_analysis.reliability.coder_index_info.CoderIndexInfo object at 0x7f686fd14b70>, 10: <context_analysis.reliability.coder_index_info.CoderIndexInfo object at 0x7f686fd14cc0>}

In [9]:
index_1_coder = test_index_helper.get_coder_for_index( 1 )
index_2_coder = test_index_helper.get_coder_for_index( 2 )

print( "index 1 coder: {coder1}".format( coder1 = index_1_coder ) )
print( "index 2 coder: {coder2}".format( coder2 = index_2_coder ) )


++++ found User: ground_truth
++++ found User: minnesota1
index 1 coder: ground_truth
index 2 coder: minnesota1

In [10]:
#article_id = 20813
article_id = 20722
article_instance = Article.objects.get( id = article_id )
coder_map = test_index_helper.map_index_to_coder_for_article( article_instance )
print( "Coder map: {coder_map}".format( coder_map = str( coder_map ) ) )


Coder map: {1: <User: minnesota1>, 2: <User: minnesota1>}
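
Both indexes map to minnesota1 for this article, which suggests that each index goes to its highest-priority coder among those who actually coded the article (ground_truth apparently has no coding for article 20722). The sketch below illustrates that selection rule with plain dictionaries. It is an assumption about the behavior, not the code in IndexHelper; pick_coders and the coded_by set are hypothetical.

```python
def pick_coders(index_to_priorities, coded_by):
    """Map each index to its highest-priority coder who coded the article.

    index_to_priorities: {index: {coder_name: priority}}
    coded_by: set of coder names that have coding for the article
    """
    result = {}
    for index, priorities in index_to_priorities.items():
        # keep only coders that have data for this article
        candidates = [name for name in priorities if name in coded_by]
        if candidates:
            # larger number = higher priority, as in the setup above
            result[index] = max(candidates, key=lambda name: priorities[name])
    return result


# priorities as configured in the test cells above
config = {
    1: {"ground_truth": 4, "minnesota1": 3, "minnesota2": 2, "minnesota3": 1},
    2: {"minnesota1": 3, "minnesota2": 2, "minnesota3": 1},
}

# hypothetical article coded only by minnesota1
print(pick_coders(config, {"minnesota1"}))
# hypothetical article coded by ground_truth and minnesota3
print(pick_coders(config, {"ground_truth", "minnesota3"}))
```

An index with no eligible coder is simply left out of the result in this sketch; the real helper may handle that case differently.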

In [11]:
# get all Reliability_Names with label = "prelim_month_human".
reliability_names_qs = Reliability_Names.objects.filter( label = "prelim_month_human" )
item_count = reliability_names_qs.count()
print( "prelim_month_human count = " + str( item_count ) )
do_delete = False
if ( ( item_count > 0 ) and ( do_delete == True ) ):
    
    for instance in reliability_names_qs:

        # delete.
        instance.delete()

    #-- END loop --#
    
#-- END check to see if anything to delete. --#


prelim_month_human count = 0

Reliability data creation - prelim_month_human

Plan:

  • create Reliability_Names data for ground_truth plus human coders in index 1, human coders without ground_truth in index 2.
  • remove single names.
  • calculate precision and recall on humans against ground truth.

Create Reliability_Names for prelim_month_human

First, check to see if the label "prelim_month_human" is in use:

SELECT DISTINCT label
FROM context_analysis_reliability_names
ORDER BY label ASC;

Results:

name_data_test_combined_human
prelim_month
prelim_month_exclude
prelim_network
prelim_network_combined
prelim_reliability
prelim_reliability_combined_all
prelim_reliability_combined_all_final
prelim_reliability_combined_human
prelim_reliability_combined_human_final
prelim_reliability_test
prelim_reliability_test_all
prelim_reliability_test_human
prelim_reliability_v2
prelim_training_002
prelim_training_003

Not in use.
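
The same check can be scripted. The sketch below runs a count query through a Python DB-API connection and is demonstrated against an in-memory SQLite table that stands in for the real one. The function name label_in_use is hypothetical, and the "?" placeholder is SQLite's style (psycopg2 for PostgreSQL uses "%s" instead).

```python
import sqlite3


def label_in_use(connection, label):
    """Return True if any reliability row carries the given label."""
    cursor = connection.cursor()
    cursor.execute(
        "SELECT COUNT( * ) FROM context_analysis_reliability_names WHERE label = ?",
        (label,),
    )
    return cursor.fetchone()[0] > 0


# in-memory stand-in for the real table, with two labels from the list above
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE context_analysis_reliability_names ( label TEXT )")
demo.executemany(
    "INSERT INTO context_analysis_reliability_names ( label ) VALUES ( ? )",
    [("prelim_month",), ("prelim_network",)],
)

print(label_in_use(demo, "prelim_month"))
print(label_in_use(demo, "prelim_month_human"))
```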

Now, run code to actually build the Reliability_Names.


In [12]:
from __future__ import unicode_literals

# django imports
from django.contrib.auth.models import User

# sourcenet imports
from context_text.shared.context_text_base import ContextTextBase

# context_analysis imports
from context_analysis.reliability.reliability_names_builder import ReliabilityNamesBuilder

# declare variables
my_reliability_instance = None
tag_list = None
label = ""
do_work = True

# declare variables - user setup
current_coder = None
current_coder_id = -1
current_index = -1
current_priority = -1

# declare variables - Article_Data filtering.
coder_type = ""

# make reliability instance
my_reliability_instance = ReliabilityNamesBuilder()

#===============================================================================
# configure
#===============================================================================

# list of tags of articles we want to process.
tag_list = [ "grp_month", ]

# label to associate with results, for subsequent lookup.
label = "prelim_month_human"

# ! ====> map coders to indices

# ==> Index 1: set it up so that...

# ...the ground truth user has highest priority (4) for index 1...
current_coder = ContextTextBase.get_ground_truth_coding_user()
current_coder_id = current_coder.id
current_index = 1
current_priority = 4
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )

# ...coder ID 8 is priority 3 for index 1...
current_coder_id = 8
current_index = 1
current_priority = 3
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )

# ...coder ID 9 is priority 2 for index 1...
current_coder_id = 9
current_index = 1
current_priority = 2
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )

# ...coder ID 10 is priority 1 for index 1...
current_coder_id = 10
current_index = 1
current_priority = 1
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )

# ==> Index 2: human (not ground truth) - set it up so that...

# coder ID 8 is priority 3 for index 2...
current_coder_id = 8
current_index = 2
current_priority = 3
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )

# ...coder ID 9 is priority 2 for index 2...
current_coder_id = 9
current_index = 2
current_priority = 2
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )

# ...coder ID 10 is priority 1 for index 2...
current_coder_id = 10
current_index = 2
current_priority = 1
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )

# and only look at coding by those users.

# output debug JSON to file
#my_reliability_instance.debug_output_json_file_path = "/home/jonathanmorgan/" + label + ".json"


Out[12]:
''
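
The seven near-identical configuration stanzas above could also be driven from a table of ( index, coder ID, priority ) values. This is a sketch of that refactoring, not what the notebook ran: configure_coders and RecordingBuilder are hypothetical names, and the recorder only stands in for ReliabilityNamesBuilder so the example runs without Django. The add_coder_at_index call signature is the one used above.

```python
# ( index, coder ID, priority ) for the human coders, as configured above
HUMAN_CODERS = [
    (1, 8, 3), (1, 9, 2), (1, 10, 1),
    (2, 8, 3), (2, 9, 2), (2, 10, 1),
]


def configure_coders(builder, ground_truth_id):
    """Register ground truth at index 1, then the human coders."""
    statuses = []
    # ground truth has the highest priority (4) for index 1
    statuses.append(builder.add_coder_at_index(ground_truth_id, 1, priority_IN=4))
    for index, coder_id, priority in HUMAN_CODERS:
        statuses.append(
            builder.add_coder_at_index(coder_id, index, priority_IN=priority)
        )
    return statuses


class RecordingBuilder(object):
    """Hypothetical stand-in that records calls instead of touching Django."""

    def __init__(self):
        self.calls = []

    def add_coder_at_index(self, coder_id, index, priority_IN=None):
        self.calls.append((coder_id, index, priority_IN))
        return ""  # the real calls above also returned an empty status


recorder = RecordingBuilder()
configure_coders(recorder, ground_truth_id=13)
print(len(recorder.calls))
```

In the notebook itself, the ground truth ID would come from ContextTextBase.get_ground_truth_coding_user().id rather than the literal 13 used here (13 is the ID shown in the earlier test output).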

In [13]:
print( "index_helper: {helper_instance}".format( helper_instance = str( my_reliability_instance.get_index_helper() ) ) )


index_helper: index-info-map: {1: <context_analysis.reliability.index_info.IndexInfo object at 0x7f686f48d898>, 2: <context_analysis.reliability.index_info.IndexInfo object at 0x7f686f48def0>}

In [14]:
article_id = 20813
#article_id = 20722
article_instance = Article.objects.get( id = article_id )
coder_map = my_reliability_instance.map_index_to_coder_for_article( article_instance )
print( "Coder map: {coder_map}".format( coder_map = str( coder_map ) ) )


Coder map: {1: <User: ground_truth>, 2: <User: minnesota3>}

In [ ]:
#===============================================================================
# process articles
#===============================================================================

do_work = True
if ( do_work == True ):

    # process articles
    my_reliability_instance.process_articles( tag_list )

    # output to database.
    #my_reliability_instance.output_reliability_data( label )

    print( "articles processed at " + str( datetime.datetime.now() ) )
    
#-- END check to see if we do work. --#

In [ ]:
#===============================================================================
# output data
#===============================================================================

do_work = True
if ( do_work == True ):

    # process articles
    #my_reliability_instance.process_articles( tag_list )

    # output to database.
    my_reliability_instance.output_reliability_data( label )

    print( "reliability data created at " + str( datetime.datetime.now() ) )
    
#-- END check to see if we do work. --#

Compare row counts for the two labels:

SELECT COUNT( * )
FROM context_analysis_reliability_names
WHERE label = 'prelim_month';

-- 2446

SELECT COUNT( * )
FROM context_analysis_reliability_names
WHERE label = 'prelim_month_human';

-- 2429

Database backup - sourcenet-2017.10.20.pg.sql.gz

First, make a backup of the database.

  • File name: sourcenet-2017.10.20.pg.sql.gz
  • command (logged in as postgres user):

      pg_dump -O -c --if-exists -C sourcenet | gzip -c > sourcenet-2017.10.20.pg.sql.gz
  • All articles in tag "grp_month" are coded by OpenCalais.

  • Reliability data generated with label "prelim_month" and single name cleanup and disagreement evaluation completed.
  • Reliability data generated with label "prelim_month_human", no cleanup completed yet.
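
The backup file name follows a sourcenet-YYYY.MM.DD.pg.sql.gz pattern, so the file name and the command can be generated for any date. The sketch below only builds the strings and does not run pg_dump; build_backup_command is a hypothetical helper, not part of the project.

```python
import datetime


def build_backup_command(database="sourcenet", on_date=None):
    """Return ( file_name, shell_command ) for a dated pg_dump backup."""
    if on_date is None:
        on_date = datetime.date.today()
    file_name = "{db}-{stamp}.pg.sql.gz".format(
        db=database, stamp=on_date.strftime("%Y.%m.%d")
    )
    # same flags as the command above, with the dated file name filled in
    command = "pg_dump -O -c --if-exists -C {db} | gzip -c > {file}".format(
        db=database, file=file_name
    )
    return file_name, command


name, command = build_backup_command(on_date=datetime.date(2017, 10, 20))
print(name)
print(command)
```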

Data cleanup

Remove single-name reliability data

Next, remove all reliability data that refers to a single name using the "View reliability name information" screen:

To start, enter the following in fields there:

  • Label: - "prelim_month_human"
  • Coders to compare (1 through ==>): - 2
  • Reliability names filter type: - Select "Lookup"
  • [Lookup] - Person has first name, no other name parts. - CHECK the checkbox

You should see entries where a coder detected people who were mentioned only by their first name.
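
The lookup filter amounts to a simple predicate: a first name is present and no other name part is. The sketch below shows that predicate on plain dictionaries. The keys "first", "middle" and "last" and the sample rows are hypothetical, not the actual Reliability_Names fields or data.

```python
def is_single_first_name(name_parts):
    """True when only the first name is present among the name parts.

    name_parts: dict with hypothetical keys "first", "middle", "last".
    """
    def has_text(key):
        value = name_parts.get(key)
        return bool(value and value.strip())

    return has_text("first") and not (has_text("middle") or has_text("last"))


# made-up rows for illustration only
rows = [
    {"first": "Ann", "middle": "", "last": ""},
    {"first": "Ann", "middle": None, "last": "Smith"},
    {"first": "", "middle": "", "last": "Smith"},
]
print([is_single_first_name(row) for row in rows])
```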

For each entry, work through the assessment below.

Single-name data assessment

Each instance where a person has a single name part needs to be reviewed.

Most are probably instances where the computer correctly detected the name part, but there is not enough of the name to match it to a person, so the human coding protocol directed coders not to capture the name fragment.

However, there might be some where a coder made a mistake and captured only a name part for a person whose full name was in the story. To check, click the link in the "Article ID" column. It takes you to a view of the article that includes every coder who coded it, with each detected mention or quotation displayed next to the paragraph where the person was first detected.

So for each instance of a single name part:

  • click on the article ID link in the row to go to the article and check whether there is a person whose name the fragment is part of ( https://research.local/research/context/text/article/article_data/view_with_text/ ).

    • If there is a person with a full name to which the name fragment is a reference, check to see if the coder has data for the full person.

      • if not, merge:

        • go to the disagreement view page: https://research.local/research/context/analysis/reliability/names/disagreement/view
        • Configure:

          • Label: - "prelim_month_human"
          • Coders to compare (1 through ==>): - 2
          • Reliability names filter type: - Select "Lookup"
          • [Lookup] - Associated Article IDs (comma-delimited): - Enter the ID of the article the coding belonged to.
        • this will bring up all coding for the article whose ID you entered.

        • In the "select" column, click the checkbox in the row where there is a single name part that needs to be merged.
        • In the "merge INTO" column, click the checkbox in the row with the full name for that person.
        • In "Reliability Names Action", choose "Merge Coding --> FROM 1 SELECTED / INTO 1"
        • Click "Do Action" button.
    • Remove the Reliability_Names row with the name fragment from reliability data.

      • In the "select" column, click the checkbox in the row where there is a single name part that needs to be removed.
      • In "Reliability Names Action", choose "Delete selected".
      • Click "Do Action" button.

Delete selected single-name data

To see the 8 single-name instances that were addressed: https://research.local/research/admin/context_analysis/reliability_names_evaluation/?label=prelim_month_human

Calculate precision and recall

Now, we use the code we created for assessing OpenCalais to calculate precision and recall of human coders compared to ground truth (corrected human coders).
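
For reference, the standard definitions can be sketched over sets of detections. This is not the OpenCalais assessment code mentioned above, only a minimal illustration of the two measures; the function name and the sample data are hypothetical.

```python
def precision_recall(truth, predicted):
    """Compute precision and recall of predicted items against truth.

    Both arguments are sets of hashable detections, for example
    ( article ID, person name ) tuples.
    """
    true_positives = len(truth & predicted)
    # precision: share of predicted items that are in the ground truth
    precision = true_positives / float(len(predicted)) if predicted else 0.0
    # recall: share of ground truth items that were predicted
    recall = true_positives / float(len(truth)) if truth else 0.0
    return precision, recall


# hypothetical detections, not real study data
ground_truth = {(1, "A"), (1, "B"), (2, "C"), (2, "D")}
human = {(1, "A"), (1, "B"), (2, "C"), (2, "E"), (2, "F")}
print(precision_recall(ground_truth, human))
```

Here three of the five human detections are in the ground truth (precision 0.6), and three of the four ground truth detections were found (recall 0.75).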

TODO

TODO:

  • write unit tests for IndexHelper, IndexInfo, and CoderIndexInfo.
  • update reliability_names_builder to use IndexHelper.