prelim_month_human-create_Reliability_Names-ground_truth_vs_human
old name 2017.10.20 - work log - prelim_month_human - create Reliability_Names
Create Reliability_Names data where coder 1 is ground truth, coder 2 is human coding without corrections for ground truth.
In [1]:
import datetime
import six
print( "packages imported at " + str( datetime.datetime.now() ) )
If you are using a virtualenv, make sure that you:
Since I use a virtualenv, I need to get it activated inside this notebook. One option is to run ../dev/wsgi.py in this notebook, to configure the Python environment manually as if you had activated the sourcenet virtualenv. To do this, you'd make a code cell that contains:
%run ../dev/wsgi.py
This is sketchy, however, because of the changes it makes to your Python environment within the context of whatever your current kernel is. I'd worry about collisions with the actual Python 3 kernel. Better, one can install their virtualenv as a separate kernel. Steps:
activate your virtualenv:
workon sourcenet
in your virtualenv, install the package ipykernel.
pip install ipykernel
use the ipykernel python program to install the current environment as a kernel:
python -m ipykernel install --user --name <env_name> --display-name "<display_name>"
sourcenet example:
python -m ipykernel install --user --name sourcenet --display-name "sourcenet (Python 3)"
More details: http://ipython.readthedocs.io/en/stable/install/kernel_install.html
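Once the notebook is running under the new kernel, a quick check from a code cell can confirm that the interpreter belongs to the virtualenv rather than the system Python. This is a minimal sketch using only the standard library. The helper name in_virtualenv is made up for illustration, and it assumes the environment either sets sys.prefix differently from sys.base_prefix (venv and recent virtualenv) or sets sys.real_prefix (older virtualenv).

```python
import sys


def in_virtualenv():
    """Return True if the running interpreter appears to be inside a virtualenv."""
    # older virtualenv releases set sys.real_prefix.
    has_real_prefix = hasattr( sys, "real_prefix" )

    # venv and newer virtualenv make sys.prefix differ from sys.base_prefix.
    base_prefix = getattr( sys, "base_prefix", sys.prefix )

    return has_real_prefix or ( sys.prefix != base_prefix )


print( "interpreter: " + str( sys.executable ) )
print( "in virtualenv: " + str( in_virtualenv() ) )
```

If the interpreter path does not point inside the sourcenet virtualenv, the notebook is probably still attached to the default kernel.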
In [2]:
%pwd
Out[2]:
First, initialize my dev django project, so I can run code in this notebook that references my django models and can talk to the database using my project's settings.
In [3]:
%run ../django_init.py
In [4]:
# django imports
from context_text.models import Article
from context_text.shared.context_text_base import ContextTextBase
from context_analysis.models import Reliability_Names
from context_analysis.reliability.coder_index_info import CoderIndexInfo
from context_analysis.reliability.index_info import IndexInfo
from context_analysis.reliability.index_helper import IndexHelper
I made big changes to support my human precision and recall. Time to test...
In [5]:
# first, test the CoderIndexInfo
test_info = CoderIndexInfo( 4, None, 1, 5 )
# try getting user instance.
test_user_instance = test_info.get_coder_user_instance()
print( "Coder user: " + str( test_user_instance ) )
print( "test_info = " + str( test_info ) )
In [6]:
# create an index 1
test_index_info = IndexInfo()
test_index_info.set_index( 1 )
# configure as below:
# ...the ground truth user has highest priority (4) for index 1...
current_coder = ContextTextBase.get_ground_truth_coding_user()
current_coder_id = current_coder.id
current_priority = 4
add_status = test_index_info.add_coder( current_coder_id, priority_IN = current_priority )
print( ">>>> status = \"{status}\"".format( status = add_status ) )
# ...coder ID 8 is priority 3 for index 1...
current_coder_id = 8
current_index = 1
current_priority = 3
add_status = test_index_info.add_coder( current_coder_id, priority_IN = current_priority )
print( ">>>> status = \"{status}\"".format( status = add_status ) )
# ...coder ID 9 is priority 2 for index 1...
current_coder_id = 9
current_index = 1
current_priority = 2
add_status = test_index_info.add_coder( current_coder_id, priority_IN = current_priority )
print( ">>>> status = \"{status}\"".format( status = add_status ) )
# ...coder ID 10 is priority 1 for index 1...
current_coder_id = 10
current_index = 1
current_priority = 1
add_status = test_index_info.add_coder( current_coder_id, priority_IN = current_priority )
print( ">>>> status = \"{status}\"".format( status = add_status ) )
print( "index info: " + str( test_index_info ) )
test_id_to_info_map = test_index_info.get_coder_id_to_info_map()
for coder_user_id, coder_info in six.iteritems( test_id_to_info_map ):
    print( "--> coder id: " + str( coder_user_id ) + " = " + str( coder_info ) )
#-- END loop over coders in index. --#
In [7]:
# create index helper.
test_index_helper = IndexHelper()
# is valid index?
print( test_index_helper.is_index_valid( 0 ) )
print( test_index_helper.is_index_valid( 1 ) )
print( test_index_helper.is_index_valid( 5 ) )
print( test_index_helper.is_index_valid( 10 ) )
print( test_index_helper.is_index_valid( 15 ) )
In [8]:
# create index helper.
#test_index_helper = IndexHelper()
# ==> Index 1: human plus ground truth - set it up so that...
# configure as below:
# ...the ground truth user has highest priority (4) for index 1...
current_coder = ContextTextBase.get_ground_truth_coding_user()
current_coder_id = current_coder.id
current_index = 1
current_priority = 4
add_status = test_index_helper.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
print( ">>>> status = \"{status}\"\n\n".format( status = add_status ) )
# ...coder ID 8 is priority 3 for index 1...
current_coder_id = 8
current_index = 1
current_priority = 3
add_status = test_index_helper.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
print( ">>>> status = \"{status}\"\n\n".format( status = add_status ) )
# ...coder ID 9 is priority 2 for index 1...
current_coder_id = 9
current_index = 1
current_priority = 2
add_status = test_index_helper.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
print( ">>>> status = \"{status}\"\n\n".format( status = add_status ) )
# ...coder ID 10 is priority 1 for index 1...
current_coder_id = 10
current_index = 1
current_priority = 1
add_status = test_index_helper.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
print( ">>>> status = \"{status}\"\n\n".format( status = add_status ) )
# ==> Index 2: human (not ground truth) - set it up so that...
# coder ID 8 is priority 3 for index 2...
current_coder_id = 8
current_index = 2
current_priority = 3
add_status = test_index_helper.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
print( ">>>> status = \"{status}\"\n\n".format( status = add_status ) )
# ...coder ID 9 is priority 2 for index 2...
current_coder_id = 9
current_index = 2
current_priority = 2
add_status = test_index_helper.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
print( ">>>> status = \"{status}\"\n\n".format( status = add_status ) )
# ...coder ID 10 is priority 1 for index 2...
current_coder_id = 10
current_index = 2
current_priority = 1
add_status = test_index_helper.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
print( ">>>> status = \"{status}\"\n\n".format( status = add_status ) )
print( "\n>>>> index helper: " + str( test_index_helper ) )
print( "\n>>>> index info map: " + str( test_index_helper.m_index_to_info_map ) )
test_index_to_info_map = test_index_helper.get_index_to_info_map()
print( "\n>>>> index info map: " + str( test_index_to_info_map ) )
print( "\n" )
for current_index, index_info in six.iteritems( test_index_to_info_map ):
    print( "--> we'll see...? : " + str( current_index ) + " = " + str( index_info ) )
#-- END loop over coders in index. --#
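The index setup above repeats the same four lines for every coder. As a sketch of one way to shorten it, the coder-to-index assignments could be kept in a table and applied in a loop. The function add_coders_from_table and the class RecordingHelper are hypothetical names introduced here. RecordingHelper only records calls so the sketch runs without Django; in the notebook the target would be test_index_helper. The ground truth ID of 0 is a placeholder for ContextTextBase.get_ground_truth_coding_user().id.

```python
def add_coders_from_table( target_IN, index_to_coders_IN ):
    """
    For each index, calls target_IN.add_coder_at_index() once per
    ( coder_id, priority ) pair. Returns the list of returned statuses.
    """
    status_list = []
    for index, coder_list in sorted( index_to_coders_IN.items() ):
        for coder_id, priority in coder_list:
            status = target_IN.add_coder_at_index( coder_id, index, priority_IN = priority )
            status_list.append( status )
        #-- END loop over coders for index. --#
    #-- END loop over indexes. --#
    return status_list


class RecordingHelper( object ):
    """Hypothetical stand-in for IndexHelper; it only records the calls it receives."""

    def __init__( self ):
        self.calls = []

    def add_coder_at_index( self, coder_id_IN, index_IN, priority_IN = None ):
        self.calls.append( ( coder_id_IN, index_IN, priority_IN ) )
        return "added"


# placeholder for ContextTextBase.get_ground_truth_coding_user().id
GROUND_TRUTH_ID = 0

# index --> list of ( coder ID, priority ), same values as the cell above.
coder_table = {
    1: [ ( GROUND_TRUTH_ID, 4 ), ( 8, 3 ), ( 9, 2 ), ( 10, 1 ) ],
    2: [ ( 8, 3 ), ( 9, 2 ), ( 10, 1 ) ],
}

demo_helper = RecordingHelper()
demo_status_list = add_coders_from_table( demo_helper, coder_table )
print( "calls made: " + str( len( demo_helper.calls ) ) )
```

Keeping the assignments in one table also makes it easier to see that index 2 is index 1 minus the ground truth user.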
In [9]:
index_1_coder = test_index_helper.get_coder_for_index( 1 )
index_2_coder = test_index_helper.get_coder_for_index( 2 )
print( "index 1 coder: {coder1}".format( coder1 = index_1_coder ) )
print( "index 2 coder: {coder2}".format( coder2 = index_2_coder ) )
In [10]:
#article_id = 20813
article_id = 20722
article_instance = Article.objects.get( id = article_id )
coder_map = test_index_helper.map_index_to_coder_for_article( article_instance )
print( "Coder map: {coder_map}".format( coder_map = str( coder_map ) ) )
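The source of IndexHelper is not shown in this notebook, so the following is only a sketch of what priority-based selection could look like, under the assumption that, for each index, the coder chosen for an article is the highest-priority coder assigned to that index who actually coded the article. The function name pick_coder_for_index and the example data are hypothetical, and coder ID 0 again stands in for the ground truth user.

```python
def pick_coder_for_index( coder_to_priority_IN, coders_with_coding_IN ):
    """
    Among the coders assigned to an index, returns the ID of the
    highest-priority coder who has coded the article, or None if
    none of them has.
    """
    candidate_list = [
        ( priority, coder_id )
        for coder_id, priority in coder_to_priority_IN.items()
        if coder_id in coders_with_coding_IN
    ]
    if not candidate_list:
        return None
    #-- END check to see if any candidates. --#

    # tuples compare on priority first, so max() returns the highest priority.
    return max( candidate_list )[ 1 ]


# coder ID --> priority, matching the configuration used in this notebook.
index_1_priorities = { 0: 4, 8: 3, 9: 2, 10: 1 }
index_2_priorities = { 8: 3, 9: 2, 10: 1 }

# hypothetical article coded by the ground truth user and by coder 9.
coders_for_article = set( [ 0, 9 ] )

print( pick_coder_for_index( index_1_priorities, coders_for_article ) )
print( pick_coder_for_index( index_2_priorities, coders_for_article ) )
```

Under that assumption, index 1 resolves to the ground truth user whenever ground truth coding exists, and index 2 resolves to the best available human coder.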
In [11]:
# get all Reliability_Names with label = "prelim_month_human".
reliability_names_qs = Reliability_Names.objects.filter( label = "prelim_month_human" )
item_count = reliability_names_qs.count()
print( "prelim_month_human count = " + str( item_count ) )
do_delete = False
if ( ( item_count > 0 ) and ( do_delete == True ) ):
    for instance in reliability_names_qs:
        # delete.
        instance.delete()
    #-- END loop --#
#-- END check to see if anything to delete. --#
prelim_month_human
Plan:
Reliability_Names for prelim_month_human
First, check to see if the label "prelim_month_human" is in use:
SELECT DISTINCT label
FROM context_analysis_reliability_names
ORDER BY label ASC;
Results:
name_data_test_combined_human
prelim_month
prelim_month_exclude
prelim_network
prelim_network_combined
prelim_reliability
prelim_reliability_combined_all
prelim_reliability_combined_all_final
prelim_reliability_combined_human
prelim_reliability_combined_human_final
prelim_reliability_test
prelim_reliability_test_all
prelim_reliability_test_human
prelim_reliability_v2
prelim_training_002
prelim_training_003
Not in use.
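The same check can be scripted rather than read off the list by eye. The sketch below uses an in-memory SQLite table as a stand-in so it runs anywhere. The function name is_label_in_use is made up for illustration, only three of the labels listed above are loaded, and the "?" placeholder is SQLite's parameter style (a PostgreSQL driver such as psycopg2 uses "%s" instead).

```python
import sqlite3


def is_label_in_use( connection_IN, label_IN ):
    """Return True if any row in context_analysis_reliability_names has the label."""
    cursor = connection_IN.cursor()
    cursor.execute(
        "SELECT COUNT( * ) FROM context_analysis_reliability_names WHERE label = ?",
        ( label_IN, )
    )
    row_count = cursor.fetchone()[ 0 ]
    return row_count > 0


# in-memory stand-in for the real table, with a few of the labels listed above.
demo_connection = sqlite3.connect( ":memory:" )
demo_connection.execute( "CREATE TABLE context_analysis_reliability_names ( label TEXT )" )
demo_connection.executemany(
    "INSERT INTO context_analysis_reliability_names ( label ) VALUES ( ? )",
    [ ( "prelim_month", ), ( "prelim_month_exclude", ), ( "prelim_network", ) ]
)

print( is_label_in_use( demo_connection, "prelim_month" ) )
print( is_label_in_use( demo_connection, "prelim_month_human" ) )
```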
Now, run code to actually build the Reliability_Names.
In [12]:
from __future__ import unicode_literals
# django imports
from django.contrib.auth.models import User
# sourcenet imports
from context_text.shared.context_text_base import ContextTextBase
# context_analysis imports
from context_analysis.reliability.reliability_names_builder import ReliabilityNamesBuilder
# declare variables
my_reliability_instance = None
tag_list = None
label = ""
do_work = True
# declare variables - user setup
current_coder = None
current_coder_id = -1
current_index = -1
current_priority = -1
# declare variables - Article_Data filtering.
coder_type = ""
# make reliability instance
my_reliability_instance = ReliabilityNamesBuilder()
#===============================================================================
# configure
#===============================================================================
# list of tags of articles we want to process.
tag_list = [ "grp_month", ]
# label to associate with results, for subsequent lookup.
label = "prelim_month_human"
# ! ====> map coders to indices
# ==> Index 1: set it up so that...
# ...the ground truth user has highest priority (4) for index 1...
current_coder = ContextTextBase.get_ground_truth_coding_user()
current_coder_id = current_coder.id
current_index = 1
current_priority = 4
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
# ...coder ID 8 is priority 3 for index 1...
current_coder_id = 8
current_index = 1
current_priority = 3
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
# ...coder ID 9 is priority 2 for index 1...
current_coder_id = 9
current_index = 1
current_priority = 2
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
# ...coder ID 10 is priority 1 for index 1...
current_coder_id = 10
current_index = 1
current_priority = 1
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
# ==> Index 2: human (not ground truth) - set it up so that...
# coder ID 8 is priority 3 for index 2...
current_coder_id = 8
current_index = 2
current_priority = 3
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
# ...coder ID 9 is priority 2 for index 2...
current_coder_id = 9
current_index = 2
current_priority = 2
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
# ...coder ID 10 is priority 1 for index 2...
current_coder_id = 10
current_index = 2
current_priority = 1
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
# and only look at coding by those users.
# output debug JSON to file
#my_reliability_instance.debug_output_json_file_path = "/home/jonathanmorgan/" + label + ".json"
Out[12]:
In [13]:
print( "index_helper: {helper_instance}".format( helper_instance = str( my_reliability_instance.get_index_helper() ) ) )
In [14]:
article_id = 20813
#article_id = 20722
article_instance = Article.objects.get( id = article_id )
coder_map = my_reliability_instance.map_index_to_coder_for_article( article_instance )
print( "Coder map: {coder_map}".format( coder_map = str( coder_map ) ) )
In [ ]:
#===============================================================================
# process articles
#===============================================================================
do_work = True
if ( do_work == True ):
    # process articles
    my_reliability_instance.process_articles( tag_list )
    # output to database.
    #my_reliability_instance.output_reliability_data( label )
    print( "articles processed at " + str( datetime.datetime.now() ) )
#-- END check to see if we do work. --#
In [ ]:
#===============================================================================
# output data
#===============================================================================
do_work = True
if ( do_work == True ):
    # process articles
    #my_reliability_instance.process_articles( tag_list )
    # output to database.
    my_reliability_instance.output_reliability_data( label )
    print( "reliability data created at " + str( datetime.datetime.now() ) )
#-- END check to see if we do work. --#
SELECT COUNT( * )
FROM context_analysis_reliability_names
WHERE label = 'prelim_month';
-- 2446
SELECT COUNT( * )
FROM context_analysis_reliability_names
WHERE label = 'prelim_month_human';
-- 2429
First, make a backup of the database:
sourcenet-2017.10.20.pg.sql.gz
command (logged in as postgres user):
pg_dump -O -c --if-exists -C sourcenet | gzip -c > sourcenet-2017.10.20.pg.sql.gz
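The backup file name follows a database-date pattern, so the command can be generated for any date. This is a small sketch that only builds the file name and the command string and does not run anything. The function names make_backup_file_name and make_backup_command are hypothetical, and the pg_dump flags are copied from the command above.

```python
import datetime


def make_backup_file_name( database_name_IN, backup_date_IN ):
    """Build a file name like sourcenet-2017.10.20.pg.sql.gz."""
    return "{db}-{date}.pg.sql.gz".format(
        db = database_name_IN,
        date = backup_date_IN.strftime( "%Y.%m.%d" )
    )


def make_backup_command( database_name_IN, backup_date_IN ):
    """Return the shell pipeline as a string; it is built here, not executed."""
    file_name = make_backup_file_name( database_name_IN, backup_date_IN )
    return "pg_dump -O -c --if-exists -C {db} | gzip -c > {file}".format(
        db = database_name_IN,
        file = file_name
    )


print( make_backup_command( "sourcenet", datetime.date( 2017, 10, 20 ) ) )
```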
All articles in tag "grp_month" are coded by OpenCalais.
Next, remove all reliability data that refers to a single name using the "View reliability name information" screen:
To start, enter the following in fields there:
You should see entries where a coder detected people who were mentioned only by their first name.
For each:
Need to look at each instance where a person has a single name part.
Most are probably instances where the computer correctly detected the name part, but where there is not enough of the name to match it to a person, so the human coding protocol directed coders not to capture the name fragment.
However, there might be some where a coder made a mistake and just captured a name part for a person whose full name was in the story. To check, click the "Article ID" in the column that has a link to article ID. It will take you to a view of the article where all the people who coded the article are included, with each detection of a mention or quotation displayed next to the paragraph where the person was originally first detected.
So for each instance of a single name part:
click on the article ID link in the row to go to the article and check to see if there is a person whose name the fragment is a part of ( https://research.local/research/context/text/article/article_data/view_with_text/ ).
If there is a person with a full name to which the name fragment is a reference, check to see if the coder has data for the full person.
if not, merge:
Configure:
this will bring up all coding for the article whose ID you entered.
Remove the Reliability_Names row with the name fragment from reliability data.
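The manual review above starts from names that have only one name part. As a rough illustration of that filter, and not a description of how the "View reliability name information" screen works internally, a name can be treated as a single name part when it contains exactly one whitespace-separated token. The function name is_single_name_part and the example names are made up.

```python
def is_single_name_part( name_IN ):
    """Return True if the name has exactly one whitespace-separated part."""
    return len( name_IN.split() ) == 1


# made-up example names, not taken from the data.
example_name_list = [ "Jane Doe", "Jane", "  Doe  ", "Mary Ann Smith" ]

single_part_list = [ name for name in example_name_list if is_single_name_part( name ) ]
print( single_part_list )
```

Each name flagged this way still needs the manual check described above, since a fragment may or may not refer to a fully named person in the article.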
To see the 8 single-name instances that were addressed: https://research.local/research/admin/context_analysis/reliability_names_evaluation/?label=prelim_month_human
Now, we use the code we created for assessing OpenCalais to calculate precision and recall of human coders compared to ground truth (corrected human coders).
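For reference, here is a sketch of the standard definitions, treating ground truth as correct: a person found by both the human coder and ground truth is a true positive, a person found only by the human coder is a false positive, and a person found only in ground truth is a false negative. This is not the assessment code referred to above, only the textbook formulas, and the counts in the example are invented.

```python
def precision_and_recall( true_positive_count_IN, false_positive_count_IN, false_negative_count_IN ):
    """
    Returns ( precision, recall ). Either value is None when its
    denominator is zero.
    """
    detected_count = true_positive_count_IN + false_positive_count_IN
    actual_count = true_positive_count_IN + false_negative_count_IN

    precision = None
    if detected_count > 0:
        precision = float( true_positive_count_IN ) / detected_count
    #-- END check for precision denominator --#

    recall = None
    if actual_count > 0:
        recall = float( true_positive_count_IN ) / actual_count
    #-- END check for recall denominator --#

    return ( precision, recall )


# invented counts: 8 matches, 2 extra detections, 4 misses.
example_precision, example_recall = precision_and_recall( 8, 2, 4 )
print( "precision = " + str( example_precision ) )
print( "recall = " + str( example_recall ) )
```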
TODO: