2018.02.09 - prelim - disagreement analysis
In [1]:
import datetime
import six
print( "packages imported at " + str( datetime.datetime.now() ) )
Since I use a virtualenv, I need to get it activated somehow inside this notebook. One option is to run ../dev/wsgi.py in this notebook, to configure the Python environment manually as if you had activated the sourcenet virtualenv. To do this, you'd make a code cell that contains:
%run ../dev/wsgi.py
This is sketchy, however, because of the changes it makes to your Python environment within the context of whatever your current kernel is. I'd worry about collisions with the actual Python 3 kernel. Better, you can install your virtualenv as a separate kernel. Steps:
activate your virtualenv:
workon sourcenet
in your virtualenv, install the package ipykernel:
pip install ipykernel
use the ipykernel python program to install the current environment as a kernel:
python -m ipykernel install --user --name <env_name> --display-name "<display_name>"
example, for the sourcenet virtualenv:
python -m ipykernel install --user --name sourcenet --display-name "sourcenet (Python 3)"
More details: http://ipython.readthedocs.io/en/stable/install/kernel_install.html
In [2]:
%pwd
Out[2]:
First, initialize my dev django project, so I can run code in this notebook that references my django models and can talk to the database using my project's settings.
In [4]:
%run ../django_init.py
In [5]:
# django imports
from context_analysis.models import Reliability_Names_Evaluation
In [6]:
# CONSTANTS-ish - names of boolean model fields.
FIELD_NAME_IS_AMBIGUOUS = "is_ambiguous"
FIELD_NAME_IS_ATTRIBUTION_COMPOUND = "is_attribution_compound"
FIELD_NAME_IS_ATTRIBUTION_FOLLOW_ON = "is_attribution_follow_on"
FIELD_NAME_IS_ATTRIBUTION_PRONOUN = "is_attribution_pronoun"
FIELD_NAME_IS_ATTRIBUTION_SECOND_HAND = "is_attribution_second_hand"
FIELD_NAME_IS_COMPLEX = "is_complex"
FIELD_NAME_IS_COMPOUND_NAMES = "is_compound_names"
FIELD_NAME_IS_CONTRIBUTED_TO = "is_contributed_to"
FIELD_NAME_IS_DICTIONARY_ERROR = "is_dictionary_error"
FIELD_NAME_IS_DISAMBIGUATION = "is_disambiguation"
FIELD_NAME_IS_EDITING_ERROR = "is_editing_error"
FIELD_NAME_IS_ERROR = "is_error"
FIELD_NAME_IS_FOREIGN_NAMES = "is_foreign_names"
FIELD_NAME_IS_GENDER_CONFUSION = "is_gender_confusion"
FIELD_NAME_IS_INITIALS_ERROR = "is_initials_error"
FIELD_NAME_IS_INTERESTING = "is_interesting"
FIELD_NAME_IS_LAYOUT_OR_DESIGN = "is_layout_or_design"
FIELD_NAME_IS_LIST = "is_list"
FIELD_NAME_IS_LOOKUP_ERROR = "is_lookup_error"
FIELD_NAME_IS_NO_HTML = "is_no_html"
FIELD_NAME_IS_NOT_HARD_NEWS = "is_not_hard_news"
FIELD_NAME_IS_POSSESSIVE = "is_possessive"
FIELD_NAME_IS_PRONOUNS = "is_pronouns"
FIELD_NAME_IS_PROPER_NOUN = "is_proper_noun"
FIELD_NAME_IS_QUOTE_DISTANCE = "is_quote_distance"
FIELD_NAME_IS_SAID_VERB = "is_said_verb"
FIELD_NAME_IS_SHORT_N_GRAM = "is_short_n_gram"
FIELD_NAME_IS_SOFTWARE_ERROR = "is_software_error"
FIELD_NAME_IS_SPANISH = "is_spanish"
FIELD_NAME_IS_SPORTS = "is_sports"
FIELD_NAME_IS_STRAIGHTFORWARD = "is_straightforward"
FIELD_NAME_IS_TITLE = "is_title"
FIELD_NAME_IS_TITLE_COMPLEX = "is_title_complex"
FIELD_NAME_IS_TITLE_PREFIX = "is_title_prefix"
In [7]:
# CONSTANTS-ish - other related field names.
FIELD_NAME_IS_SUBJECT_SHB_AUTHOR = "is_subject_shb_author"
FIELD_NAME_IS_NOT_A_PERSON = "is_not_a_person"
In [8]:
# CONSTANTS-ish - names of properties per field.
PROP_NAME = "name"
PROP_TAG_LIST = "tag_list"
PROP_IS_ERROR = "is_error"
PROP_ASSOCIATED_FIELDS = "associated_fields"
# CONSTANTS-ish - map of field names to field traits.
FIELD_NAME_TO_TRAITS_MAP = {}
# CONSTANTS-ish - map tag values to field names.
TAG_TO_FIELD_NAME_MAP = {}
In [9]:
# set up mapping of field names to traits.
temp_traits_map = {}
# FIELD_NAME_IS_AMBIGUOUS = "is_ambiguous"
temp_traits_map = {}
field_name = FIELD_NAME_IS_AMBIGUOUS
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'ambiguous', 'ambiguity' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = []
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_ATTRIBUTION_COMPOUND = "is_attribution_compound"
temp_traits_map = {}
field_name = FIELD_NAME_IS_ATTRIBUTION_COMPOUND
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'compound_attribution' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_ATTRIBUTION_FOLLOW_ON = "is_attribution_follow_on"
temp_traits_map = {}
field_name = FIELD_NAME_IS_ATTRIBUTION_FOLLOW_ON
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'follow_on_attribution' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_ATTRIBUTION_PRONOUN = "is_attribution_pronoun"
temp_traits_map = {}
field_name = FIELD_NAME_IS_ATTRIBUTION_PRONOUN
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'pronoun_attribution' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_ATTRIBUTION_SECOND_HAND = "is_attribution_second_hand"
temp_traits_map = {}
field_name = FIELD_NAME_IS_ATTRIBUTION_SECOND_HAND
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'second_hand' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_COMPLEX = "is_complex"
temp_traits_map = {}
field_name = FIELD_NAME_IS_COMPLEX
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'complex' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = []
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_COMPOUND_NAMES = "is_compound_names"
temp_traits_map = {}
field_name = FIELD_NAME_IS_COMPOUND_NAMES
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'compound_names' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_CONTRIBUTED_TO = "is_contributed_to"
temp_traits_map = {}
field_name = FIELD_NAME_IS_CONTRIBUTED_TO
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'contributed_to' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR, FIELD_NAME_IS_SUBJECT_SHB_AUTHOR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_DICTIONARY_ERROR = "is_dictionary_error"
temp_traits_map = {}
field_name = FIELD_NAME_IS_DICTIONARY_ERROR
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'dictionary_error' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_DISAMBIGUATION = "is_disambiguation"
temp_traits_map = {}
field_name = FIELD_NAME_IS_DISAMBIGUATION
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'disambiguation' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_EDITING_ERROR = "is_editing_error"
temp_traits_map = {}
field_name = FIELD_NAME_IS_EDITING_ERROR
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'editing_error' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_ERROR = "is_error"
temp_traits_map = {}
field_name = FIELD_NAME_IS_ERROR
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'error' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = []
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_FOREIGN_NAMES = "is_foreign_names"
temp_traits_map = {}
field_name = FIELD_NAME_IS_FOREIGN_NAMES
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'foreign_names' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_GENDER_CONFUSION = "is_gender_confusion"
temp_traits_map = {}
field_name = FIELD_NAME_IS_GENDER_CONFUSION
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'gender_confusion' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_INITIALS_ERROR = "is_initials_error"
temp_traits_map = {}
field_name = FIELD_NAME_IS_INITIALS_ERROR
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'initials' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_INTERESTING = "is_interesting"
temp_traits_map = {}
field_name = FIELD_NAME_IS_INTERESTING
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'interesting' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = []
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_LAYOUT_OR_DESIGN = "is_layout_or_design"
temp_traits_map = {}
field_name = FIELD_NAME_IS_LAYOUT_OR_DESIGN
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'layout_or_design' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_LIST = "is_list"
temp_traits_map = {}
field_name = FIELD_NAME_IS_LIST
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'list', 'lists' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = []
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_LOOKUP_ERROR = "is_lookup_error"
temp_traits_map = {}
field_name = FIELD_NAME_IS_LOOKUP_ERROR
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'lookup' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_NO_HTML = "is_no_html"
temp_traits_map = {}
field_name = FIELD_NAME_IS_NO_HTML
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'no_html' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = []
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_NOT_HARD_NEWS = "is_not_hard_news"
temp_traits_map = {}
field_name = FIELD_NAME_IS_NOT_HARD_NEWS
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'non_news' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = []
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_POSSESSIVE = "is_possessive"
temp_traits_map = {}
field_name = FIELD_NAME_IS_POSSESSIVE
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'possessive' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_PRONOUNS = "is_pronouns"
temp_traits_map = {}
field_name = FIELD_NAME_IS_PRONOUNS
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'pronouns' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_PROPER_NOUN = "is_proper_noun"
temp_traits_map = {}
field_name = FIELD_NAME_IS_PROPER_NOUN
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'proper_noun', 'proper_nouns' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR, FIELD_NAME_IS_NOT_A_PERSON ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_QUOTE_DISTANCE = "is_quote_distance"
temp_traits_map = {}
field_name = FIELD_NAME_IS_QUOTE_DISTANCE
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'quote_distance' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_SAID_VERB = "is_said_verb"
temp_traits_map = {}
field_name = FIELD_NAME_IS_SAID_VERB
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'said_verb' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_SHORT_N_GRAM = "is_short_n_gram"
temp_traits_map = {}
field_name = FIELD_NAME_IS_SHORT_N_GRAM
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'short_n-gram' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_SOFTWARE_ERROR = "is_software_error"
temp_traits_map = {}
field_name = FIELD_NAME_IS_SOFTWARE_ERROR
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'software_error' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_SPANISH = "is_spanish"
temp_traits_map = {}
field_name = FIELD_NAME_IS_SPANISH
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'spanish' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_SPORTS = "is_sports"
temp_traits_map = {}
field_name = FIELD_NAME_IS_SPORTS
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'sports' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = []
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_STRAIGHTFORWARD = "is_straightforward"
temp_traits_map = {}
field_name = FIELD_NAME_IS_STRAIGHTFORWARD
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'straightforward' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = []
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_TITLE = "is_title"
temp_traits_map = {}
field_name = FIELD_NAME_IS_TITLE
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'complex_title', 'complex_titles', 'title_prefix' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_TITLE_COMPLEX = "is_title_complex"
temp_traits_map = {}
field_name = FIELD_NAME_IS_TITLE_COMPLEX
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'complex_title', 'complex_titles' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
# FIELD_NAME_IS_TITLE_PREFIX = "is_title_prefix"
temp_traits_map = {}
field_name = FIELD_NAME_IS_TITLE_PREFIX
temp_traits_map[ PROP_NAME ] = field_name
temp_traits_map[ PROP_TAG_LIST ] = [ 'title_prefix' ]
temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = [ FIELD_NAME_IS_ERROR ]
FIELD_NAME_TO_TRAITS_MAP[ field_name ] = temp_traits_map
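The per-field setup above repeats the same four lines for every field. A small helper could populate the same structure in one call per field. This is a sketch, not the notebook's actual code: the function name `add_field_traits` is my invention, and it is shown self-contained with its own `traits_map` so running it does not touch the notebook's real FIELD_NAME_TO_TRAITS_MAP.

```python
# Hypothetical helper to cut the per-field boilerplate above.
# Property names mirror the PROP_* constants defined earlier in the notebook.
PROP_NAME = "name"
PROP_TAG_LIST = "tag_list"
PROP_ASSOCIATED_FIELDS = "associated_fields"

traits_map = {}

def add_field_traits( field_name, tag_list, associated_fields = None ):

    # build the traits dictionary for this field and store it in the map.
    temp_traits_map = {}
    temp_traits_map[ PROP_NAME ] = field_name
    temp_traits_map[ PROP_TAG_LIST ] = tag_list
    temp_traits_map[ PROP_ASSOCIATED_FIELDS ] = list( associated_fields ) if associated_fields else []
    traits_map[ field_name ] = temp_traits_map

#-- END function add_field_traits() --#

# the first two field blocks above, restated:
add_field_traits( "is_ambiguous", [ 'ambiguous', 'ambiguity' ] )
add_field_traits( "is_attribution_compound", [ 'compound_attribution' ], [ "is_error" ] )
```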
In [10]:
# set up mapping of tag values to field names in TAG_TO_FIELD_NAME_MAP.

# declare variables
current_field_name = None
current_traits = None
tag_list = None
current_tag = None

# loop over things in FIELD_NAME_TO_TRAITS_MAP.
for current_field_name in six.iterkeys( FIELD_NAME_TO_TRAITS_MAP ):

    # get traits dictionary for field name.
    current_traits = FIELD_NAME_TO_TRAITS_MAP.get( current_field_name, None )

    # retrieve tag list for field.
    tag_list = current_traits.get( PROP_TAG_LIST )
    for current_tag in tag_list:

        # get list of fields for tag
        tag_field_list = TAG_TO_FIELD_NAME_MAP.get( current_tag, [] )

        # append current field.
        tag_field_list.append( current_field_name )

        # put list back.
        TAG_TO_FIELD_NAME_MAP[ current_tag ] = tag_field_list

    #-- END loop over tags for a given field --#

#-- END loop over field names. --#

print( "Map of tags to field names in TAG_TO_FIELD_NAME_MAP: {}".format( str( TAG_TO_FIELD_NAME_MAP ) ) )
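The cell above builds an inverted index (tag value to list of field names). With `collections.defaultdict`, the get/append/put-back dance collapses to a single `append`. A self-contained sketch with toy data (the variable names and sample fields here are illustrative, not the notebook's):

```python
from collections import defaultdict

# toy stand-in for FIELD_NAME_TO_TRAITS_MAP: field name -> traits with a "tag_list".
field_to_traits = {
    "is_ambiguous": { "tag_list": [ "ambiguous", "ambiguity" ] },
    "is_list": { "tag_list": [ "list", "lists" ] },
    "is_title": { "tag_list": [ "title_prefix" ] },
    "is_title_prefix": { "tag_list": [ "title_prefix" ] },
}

# invert: tag value -> list of field names that claim it.
tag_to_fields = defaultdict( list )
for field_name, traits in field_to_traits.items():

    for tag in traits[ "tag_list" ]:

        # defaultdict creates the empty list on first access.
        tag_to_fields[ tag ].append( field_name )

    #-- END loop over tags --#

#-- END loop over fields --#

# a tag shared by two fields maps to both.
print( sorted( tag_to_fields[ "title_prefix" ] ) )  # ['is_title', 'is_title_prefix']
```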
Look at stats for disagreements and evaluation, including human and computer errors.
Process: Look at each instance where there is a disagreement and make sure the human coding is correct. Most are probably instances where the computer screwed up, but since we are calling this human coding "ground truth", we want to winnow out as much human error as possible.
For each disagreement, to check for coder error (like just capturing a name part for a person whose full name was in the story), click the "Article ID" link in that row. It will take you to a view of the article where all the people who coded the article are included, with each detected mention or quotation displayed next to the paragraph where the person was first detected.
If not human error, remove the TODO tag, filling in details on the disagreement in the Reliability_Names_Evaluation record created for the removal of the tag (details: Disagreement tracking process).
If human error:
Pull together some numbers and analysis from disagreement work:
From 2017.07.01-work_log-prelim_month-evaluate_disagreements.ipynb - Disagreement resolution:
For each disagreement, click on the article ID link in the row to go to the article and check to see if the human coding for the disagreement in question is correct ( http://research.local/research/context/text/article/article_data/view_with_text/ ).
Once you've evaluated and verified the human coding, remove the "TODO" tag (without the quotes) from the current record, either from the single-article view above if you've removed all disagreements, or from the disagreement view if not. This will also place information on the Reliability_Names record into a Reliability_Names_Evaluation record in the database. The message that results from this action completing will include a link to the record (the first number in the output). Click the link to open the record and update it with additional details. Specifically:
- status - status of human coder's coding: if the human coding was in error, set "status" to "ERROR". If problems were caused by automated coder error, also click the "is_automated_error" checkbox.
- update "status_message" so it contains a brief description of what exactly happened (should have been mentioned, should have been quoted, missed the person entirely, etc.).
- update "Notes" with more details.
- add "Tags" if appropriate (for sports articles, for example, add the "sports" tag).
NOTE: Always remove the TODO tag first, so you have a record of status for each Reliability_Names record. Then, once you've done that, you can merge, delete, etc.
Reliability_Names_Evaluation table in django: http://research.local/research/admin/context_analysis/reliability_names_evaluation/?label=prelim_month&o=-1.7.8.3.5
Of those, 13 are the same person and article, but a different Reliability_Names record, so disagreements that had to be corrected twice because of rebuilding Reliability_Names for the article (either human error, or something weird). SQL:
SELECT sarne1.person_name,
sarne1.id,
sarne1.status,
sarne1.original_reliability_names_id,
sarne1.is_duplicate,
sarne2.is_duplicate,
sarne2.id,
sarne2.status,
sarne2.original_reliability_names_id
FROM context_analysis_reliability_names_evaluation AS sarne1,
context_analysis_reliability_names_evaluation AS sarne2
WHERE sarne1.id != sarne2.id
AND sarne1.label = 'prelim_month'
AND sarne2.label = 'prelim_month'
AND sarne1.event_type = 'remove_tags'
AND sarne2.event_type = 'remove_tags'
AND sarne1.article_id = sarne2.article_id
AND sarne1.person_name = sarne2.person_name
AND sarne1.original_reliability_names_id != sarne2.original_reliability_names_id
AND sarne2.original_reliability_names_id > sarne1.original_reliability_names_id
ORDER BY sarne1.id ASC;
So, 428 - 13 = 415 unique disagreements.
Could regenerate Reliability_Names without ground_truth
to look at original counts? Should be able to... Just need to make sure I remember all steps...
TODO:
duplicates
Step 1: get IDs of records with duplicates
SELECT DISTINCT ( sarne1.id )
FROM context_analysis_reliability_names_evaluation AS sarne1,
context_analysis_reliability_names_evaluation AS sarne2
WHERE sarne1.id != sarne2.id
AND sarne1.label = 'prelim_month'
AND sarne2.label = 'prelim_month'
AND sarne1.event_type = 'remove_tags'
AND sarne2.event_type = 'remove_tags'
AND sarne1.article_id = sarne2.article_id
AND sarne1.person_name = sarne2.person_name
AND sarne1.original_reliability_names_id != sarne2.original_reliability_names_id
AND sarne2.original_reliability_names_id > sarne1.original_reliability_names_id
ORDER BY sarne1.id ASC;
and
SELECT DISTINCT ( sarne2.id )
FROM context_analysis_reliability_names_evaluation AS sarne1,
context_analysis_reliability_names_evaluation AS sarne2
WHERE sarne1.id != sarne2.id
AND sarne1.label = 'prelim_month'
AND sarne2.label = 'prelim_month'
AND sarne1.event_type = 'remove_tags'
AND sarne2.event_type = 'remove_tags'
AND sarne1.article_id = sarne2.article_id
AND sarne1.person_name = sarne2.person_name
AND sarne1.original_reliability_names_id != sarne2.original_reliability_names_id
AND sarne2.original_reliability_names_id > sarne1.original_reliability_names_id
ORDER BY sarne2.id ASC;
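The self-joins above find pairs of evaluation records that share an article and person name but point at different Reliability_Names records. The same grouping logic can be sketched in plain Python: group by (article_id, person_name), then flag any group with more than one distinct Reliability_Names ID. The rows below are toy data, not the real records:

```python
from collections import defaultdict

# toy rows: ( article_id, person_name, original_reliability_names_id )
rows = [
    ( 101, "Jane Doe", 5001 ),
    ( 101, "Jane Doe", 5002 ),  # same article + person, different Reliability_Names ID
    ( 102, "John Roe", 5003 ),
]

# group by ( article_id, person_name ), collecting distinct Reliability_Names IDs.
groups = defaultdict( set )
for article_id, person_name, rn_id in rows:

    groups[ ( article_id, person_name ) ].add( rn_id )

#-- END loop over rows --#

# keep only groups corrected more than once (the SQL's sarne1/sarne2 pairs).
duplicates = { key : ids for key, ids in groups.items() if len( ids ) > 1 }
print( { key : sorted( ids ) for key, ids in duplicates.items() } )  # {(101, 'Jane Doe'): [5001, 5002]}
```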
In [ ]:
# Got IDs that contain duplicates, now tag them as todo
ids_to_process = [ 15, 33, 75, 405, 435, 512, 556, 586, 610, 620, 635, 646,
                   16, 34, 76, 407, 432, 513, 517, 558, 596, 611, 619, 637, 651 ]

do_save = False

# retrieve model instances.
# get all evaluation records with label = "prelim_month" and IDs in our list.
evaluation_qs = Reliability_Names_Evaluation.objects.filter( label = "prelim_month" )
evaluation_qs = evaluation_qs.filter( pk__in = ids_to_process )

# count?
eval_count = evaluation_qs.count()
print( "record count: {}".format( str( eval_count ) ) )

# loop, setting "is_to_do" to True on each and saving.
for current_eval in evaluation_qs:

    # set is_to_do to True and set work_status to "duplicate_processing".
    current_eval.is_to_do = True
    current_eval.work_status = "duplicate_processing"

    # save?
    if ( do_save == True ):

        # save! save! save!
        current_eval.save()

    #-- END check to see if we save(). --#

#-- END loop over QuerySet. --#
Turns out, many of these were a person missed by the human coder that, once fixed, revealed a problem with the automated coding. So many were actually not duplicates; they were two separate issues with the same person.
TODO:
In [ ]:
'''
django-taggit documentation: https://github.com/alex/django-taggit
Adding tags to a model:
from django.db import models
from taggit.managers import TaggableManager
class Food(models.Model):
    # ... fields here
    tags = TaggableManager()
Interacting with a model that has tags:
>>> apple = Food.objects.create(name="apple")
>>> apple.tags.add("red", "green", "delicious")
>>> apple.tags.all()
[<Tag: red>, <Tag: green>, <Tag: delicious>]
>>> apple.tags.remove("green")
>>> apple.tags.all()
[<Tag: red>, <Tag: delicious>]
>>> Food.objects.filter(tags__name__in=["red"])
[<Food: apple>, <Food: cherry>]
# include only those with certain tags.
#tags_in_list = [ "prelim_unit_test_001", "prelim_unit_test_002", "prelim_unit_test_003", "prelim_unit_test_004", "prelim_unit_test_005", "prelim_unit_test_006", "prelim_unit_test_007" ]
tags_in_list = [ "grp_month", ]
if ( len( tags_in_list ) > 0 ):

    # filter
    print( "filtering to just articles with tags: " + str( tags_in_list ) )
    grp_article_qs = grp_article_qs.filter( tags__name__in = tags_in_list )

#-- END check to see if we have a specific list of tags we want to include --#
'''
# imports
from context_analysis.models import Reliability_Names_Evaluation
# declare variables
evaluation_qs = None
record_count = -1
record_counter = -1
current_record = None
tag_to_count_map = {}
tag_qs = None
tag_list = None
current_tag = ""
cleaned_tag = ""
current_count = -1
tag_count = -1
no_tags_list = []
# get all evaluation records with label = "prelim_month" and event_type = "remove_tags".
evaluation_qs = Reliability_Names_Evaluation.objects.filter( label = "prelim_month" )
evaluation_qs = evaluation_qs.filter( event_type = "remove_tags" )
# first, just make sure that worked.
record_count = evaluation_qs.count()
# Check count of articles retrieved.
print( "Got " + str( record_count ) + " evaluations records." )
# loop over evaluations.
no_tags_count = 0
for current_record in evaluation_qs:

    # get tags.
    # current_article.tags.add( tag_value )
    tag_qs = current_record.tags.all()

    # output the tags.
    #print( "- Tags for record " + str( current_record.id ) + " : " + str( tag_qs ) )

    # count tags
    tag_count = tag_qs.count()

    # got tags?
    if ( tag_count > 0 ):

        # loop over tags.
        for current_tag in tag_qs:

            # standardize
            cleaned_tag = str( current_tag )

            # to lower case
            cleaned_tag = cleaned_tag.lower()

            # strip()
            cleaned_tag = cleaned_tag.strip()

            # in map? Get current count.
            current_count = 0
            if ( cleaned_tag in tag_to_count_map ):

                # It is in map - get counter for it.
                current_count = tag_to_count_map.get( cleaned_tag, None )

            #-- END check to see if tag in map --#

            # increment count and store.
            current_count += 1
            tag_to_count_map[ cleaned_tag ] = current_count

        #-- END loop over tags --#

    else:

        # no tags - add record to no_tags_list.
        no_tags_list.append( current_record )

    #-- END check to see if tags or not --#

#-- END loop over records --#
# output number of tagless evaluations
no_tags_count = len( no_tags_list )
print( "--> Count of articles with no tags: {}".format( str( no_tags_count ) ) )
# output tags
key_view = six.viewkeys( tag_to_count_map )
tag_list = list( key_view )
tag_list.sort()
for tag_string in tag_list:

    # print each tag and its count.
    current_count = tag_to_count_map.get( tag_string, -1 )
    print( "- {} ( {} )".format( str( tag_string ), str( current_count ) ) )

#-- END print tags. --#
Output:
Got 428 evaluations records.
--> Count of articles with no tags: 188
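The tag-counting loop above is a hand-rolled frequency count; `collections.Counter` does the same thing in a couple of lines. A self-contained sketch with toy tag strings (the sample values here are illustrative, not pulled from the real records):

```python
from collections import Counter

# toy stand-in for the tag strings pulled off each evaluation record.
raw_tags = [ " Error ", "error", "interesting", "SPORTS", "sports", "error" ]

# standardize the same way as the loop above: str(), lower case, strip().
cleaned = [ str( tag ).lower().strip() for tag in raw_tags ]

# Counter maps each cleaned tag to its number of occurrences.
tag_counts = Counter( cleaned )

# print sorted tags with counts, mirroring the notebook's output format.
for tag, count in sorted( tag_counts.items() ):
    print( "- {} ( {} )".format( tag, count ) )
```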
tags to create:
Remember:
Notes:
admin:
Need to automatically set the flags based on the tag values.
Set up meta-data on fields, tags, and how they relate.
Now, we set booleans based on tags - first, see if tag is mapped to a field, then, if so, look up field traits to figure out what booleans to set.
In [28]:
# imports
from context_analysis.models import Reliability_Names_Evaluation
# declare variables
evaluation_qs = None
record_count = -1
record_counter = -1
current_record = None
tag_set = set()
tag_qs = None
tag_list = None
current_tag = ""
cleaned_tag = ""
field_name_list = None
current_field_name = None
current_traits = None
related_field_name_list = None
related_field_name = None
do_save = True
# get all evaluation records with label = "prelim_month" and event_type = "remove_tags".
evaluation_qs = Reliability_Names_Evaluation.objects.filter( label = "prelim_month" )
evaluation_qs = evaluation_qs.filter( event_type = "remove_tags" )
# first, just make sure that worked.
record_count = evaluation_qs.count()
# Check count of articles retrieved.
print( "Got " + str( record_count ) + " evaluations records." )
# loop over evaluations.
for current_record in evaluation_qs:

    # get tags.
    # current_article.tags.add( tag_value )
    tag_qs = current_record.tags.all()

    # output the tags.
    #print( "- Tags for record " + str( current_record.id ) + " : " + str( tag_qs ) )

    # loop over tags.
    for current_tag in tag_qs:

        # standardize
        cleaned_tag = str( current_tag )

        # to lower case
        cleaned_tag = cleaned_tag.lower()

        # strip()
        cleaned_tag = cleaned_tag.strip()

        # First, try to retrieve field name list for current tag.
        field_name_list = TAG_TO_FIELD_NAME_MAP.get( cleaned_tag, None )

        # got a field name list?
        if ( field_name_list is not None ):

            # loop over items in list.
            for current_field_name in field_name_list:

                # set field to True.
                setattr( current_record, current_field_name, True )

                # retrieve field's traits.
                current_traits = FIELD_NAME_TO_TRAITS_MAP.get( current_field_name, None )

                # get list of related fields
                related_field_name_list = current_traits.get( PROP_ASSOCIATED_FIELDS, None )

                # got anything?
                if ( ( related_field_name_list is not None ) and ( len( related_field_name_list ) > 0 ) ):

                    # yes - set related fields to True, also.
                    for related_field_name in related_field_name_list:

                        # set field to True.
                        setattr( current_record, related_field_name, True )

                    #-- END loop over related field names. --#

                #-- END check to see if any related fields. --#

            #-- END loop over field names for tag. --#

        else:

            # Unknown tag!
            print( "!!!! Unknown tag: {}".format( cleaned_tag ) )

        #-- END check to see what tag we are processing. --#

    #-- END loop over tags --#

    # are we saving the results of this grand endeavour?
    if ( do_save == True ):

        # do save.
        current_record.save()

    #-- END check to see if saving. --#

#-- END loop over records --#
# output
print( "Completed at {}".format( str( datetime.datetime.now() ) ) )
In [29]:
# Generate counts for each field.
# declare variables
field_names_view = None
field_names_list = None
current_field_name = None
my_kwargs = None
kwarg_name = None
kwarg_value = None
evaluation_qs = None
filtered_qs = None
filtered_count = -1
# first, get all evaluation instances with label = "prelim_month" and event type "remove_tags".
evaluation_qs = Reliability_Names_Evaluation.objects.filter( label = "prelim_month" )
evaluation_qs = evaluation_qs.filter( event_type = "remove_tags" )
# get view of keys.
field_names_view = six.viewkeys( FIELD_NAME_TO_TRAITS_MAP )
# convert to sorted list
field_names_list = list( field_names_view )
field_names_list.sort()
# loop over things in FIELD_NAME_TO_TRAITS_MAP.
for current_field_name in field_names_list:

    #print( "current field name: {}".format( current_field_name ) )

    # filter and count records where the current field is True.
    my_kwargs = {}
    kwarg_name = current_field_name
    kwarg_value = True
    my_kwargs[ kwarg_name ] = kwarg_value
    #print( my_kwargs )

    # filter.
    filtered_qs = evaluation_qs.filter( **my_kwargs )

    # count
    filtered_count = filtered_qs.count()

    # output
    print( "- field {} count: {}".format( current_field_name, str( filtered_count ) ) )

#-- END loop over field names. --#
field name | field count | tag count |
---|---|---|
is_ambiguous | 67 | 12 |
is_attribution_compound | 1 | 1 |
is_attribution_follow_on | 24 | 24 |
is_attribution_pronoun | 2 | 2 |
is_attribution_second_hand | 10 | 10 |
is_complex | 23 | 23 |
is_compound_names | 9 | 9 |
is_contributed_to | 2 | 2 |
is_dictionary_error | 20 | 20 |
is_disambiguation | 4 | 4 |
is_editing_error | 3 | 3 |
is_error | 221 | 204 |
is_foreign_names | 1 | 1 |
is_gender_confusion | 2 | 2 |
is_initials_error | 5 | 5 |
is_interesting | 198 | 198 |
is_layout_or_design | 3 | 3 |
is_list | 15 | 15 |
is_lookup_error | 5 | 5 |
is_no_html | 5 | 5 |
is_not_hard_news | 16 | 12 |
is_possessive | 1 | 1 |
is_pronouns | 4 | 4 |
is_proper_noun | 40 | 40 |
is_quote_distance | 12 | 12 |
is_said_verb | 29 | 29 |
is_short_n_gram | 5 | 5 |
is_software_error | 1 | 1 |
is_spanish | 1 | 1 |
is_sports | 6 | 6 |
is_straightforward | 74 | 74 |
is_title | 33 | 34 |
is_title_complex | 26 | 26 |
is_title_prefix | 8 | 8 |
ground_truth:

- We make a copy of each "ground_truth" record that we will correct, so we preserve original coding (and evidence of errors) in case we want or need that information later. Below, we have a table of the articles where we had to fix ground truth. To find the original coding, click the Article link.
- Corrections are denoted by records with "is_ground_truth_fixed" set to True in the Reliability_Names_Evaluation table in django: http://research.local/research/admin/context_analysis/reliability_names_evaluation/?is_ground_truth_fixed__exact=1&label=prelim_month&o=-1.7.8.3.5
- 130 total (130/415 = 31.3% - this is a lot - is this right?)

prelim_month_human:

- For the "prelim_month_human" Reliability_Names tag, 135 disagreements between original and corrected coding. Probably some merging needed here?
- Of those, 4 are same person and article, but different Reliability_Names record, so disagreements that had to be corrected twice because of rebuilding Reliability_Names for the article (either human error, or something else weird). SQL:
SELECT sarne1.person_name,
sarne1.id,
sarne1.status,
sarne1.original_reliability_names_id,
sarne1.article_id,
sarne1.is_duplicate,
sarne2.is_duplicate,
sarne2.id,
sarne2.status,
sarne2.original_reliability_names_id,
sarne2.article_id
FROM context_analysis_reliability_names_evaluation AS sarne1,
context_analysis_reliability_names_evaluation AS sarne2
WHERE sarne1.id != sarne2.id
AND sarne1.label = 'prelim_month'
AND sarne2.label = 'prelim_month'
AND sarne1.event_type = 'remove_tags'
AND sarne2.event_type = 'remove_tags'
AND sarne1.is_ground_truth_fixed = TRUE
AND sarne2.is_ground_truth_fixed = TRUE
AND sarne1.article_id = sarne2.article_id
AND sarne1.person_name = sarne2.person_name
AND sarne1.original_reliability_names_id != sarne2.original_reliability_names_id
AND sarne2.original_reliability_names_id > sarne1.original_reliability_names_id
ORDER BY sarne1.id ASC;
Results (looks like these are ones that had to be merged, so ... minimize - where there is ambiguity, assume error in creating data and treat as duplicates, reducing the count by the number of duplicates):
person_name | sarne1 id | sarne1 status | sarne1 original_reliability_names_id | sarne1 article_id | sarne2 id | sarne2 status | sarne2 original_reliability_names_id | sarne2 article_id
---|---|---|---|---|---|---|---|---
Jeff Hawkins | 33 | ERROR | 9408 | 21007 | 34 | ERROR | 9414 | 21007
Fritz Wahlfield | 405 | ERROR | 10330 | 22415 | 407 | CORRECT | 10997 | 22415
John Agar | 610 | ERROR | 8917 | 23904 | 611 | ERROR | 8918 | 23904
Rachael Recker | 620 | ERROR | 8968 | 23920 | 619 | ERROR | 8976 | 23920
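The self-join logic of the SQL above can be sketched in plain Python over in-memory stand-ins for the evaluation rows (values copied from the results table above); this is illustrative only, not the actual ORM query:

```python
# hypothetical in-memory stand-ins for a few Reliability_Names_Evaluation
# rows (ids and values copied from the results table above).
rows = [
    { "id": 33, "person_name": "Jeff Hawkins", "article_id": 21007, "original_reliability_names_id": 9408 },
    { "id": 34, "person_name": "Jeff Hawkins", "article_id": 21007, "original_reliability_names_id": 9414 },
    { "id": 610, "person_name": "John Agar", "article_id": 23904, "original_reliability_names_id": 8917 },
    { "id": 611, "person_name": "John Agar", "article_id": 23904, "original_reliability_names_id": 8918 },
]

# mirror the SQL self-join: same article and person, different underlying
# Reliability_Names record, keep each pair once (sarne2's orig ID greater).
pairs = [
    ( r1[ "id" ], r2[ "id" ] )
    for r1 in rows
    for r2 in rows
    if r1[ "id" ] != r2[ "id" ]
    and r1[ "article_id" ] == r2[ "article_id" ]
    and r1[ "person_name" ] == r2[ "person_name" ]
    and r1[ "original_reliability_names_id" ] != r2[ "original_reliability_names_id" ]
    and r2[ "original_reliability_names_id" ] > r1[ "original_reliability_names_id" ]
]

print( pairs )  # [(33, 34), (610, 611)]
```

The `> ` comparison on `original_reliability_names_id` plays the same role as in the SQL: it keeps each unordered pair exactly once.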
work:

- For records with "is_ground_truth_fixed" = True, set "is_to_do" = True and "work_status" = "metadata_review".
- For records with "is_ground_truth_updated" = True, set "is_human_error" = True.
In [ ]:
# get all evaluation records with label = "prelim_month", is_ground_truth_fixed = True, and event_type = "remove_tags".
evaluation_qs = Reliability_Names_Evaluation.objects.filter( label = "prelim_month" )
evaluation_qs = evaluation_qs.filter( is_ground_truth_fixed = True )
evaluation_qs = evaluation_qs.filter( event_type = "remove_tags" )
# count?
eval_count = evaluation_qs.count()
print( "record count: {}".format( str( eval_count ) ) )
# loop, setting "is_to_do" to True on each and saving.
for current_eval in evaluation_qs:

    # set is_to_do to True and set work_status to "metadata_review".
    current_eval.is_to_do = True
    current_eval.work_status = "metadata_review"
    current_eval.save()

#-- END loop over QuerySet. --#
In [ ]:
# get all evaluation records with label = "prelim_month", is_ground_truth_fixed = True, and event_type = "remove_tags".
evaluation_qs = Reliability_Names_Evaluation.objects.filter( label = "prelim_month" )
evaluation_qs = evaluation_qs.filter( is_ground_truth_fixed = True )
evaluation_qs = evaluation_qs.filter( event_type = "remove_tags" )
# count?
eval_count = evaluation_qs.count()
print( "record count: {}".format( str( eval_count ) ) )
# loop, setting "is_human_error" to True on each and saving.
for current_eval in evaluation_qs:

    # set is_human_error to True.
    current_eval.is_human_error = True
    current_eval.save()

#-- END loop over QuerySet. --#
Task 1: go through all "is_to_do" and update the metadata booleans.

- For duplicate pairs, set "is_duplicate" to True on one of the two records. If there are two records for a given person because of the computer finding a person and the human missing them, mark the computer coder's evaluation record as the duplicate.
In [ ]:
# get all evaluation records with:
# - label = "prelim_month"
# - is_ground_truth_fixed = True
# - event_type = "remove_tags"
# - is_duplicate = False
evaluation_qs = Reliability_Names_Evaluation.objects.filter( label = "prelim_month" )
evaluation_qs = evaluation_qs.filter( is_ground_truth_fixed = True )
evaluation_qs = evaluation_qs.filter( event_type = "remove_tags" )
# evaluation_qs = evaluation_qs.filter( is_to_do = True ) # 130 originally
evaluation_qs = evaluation_qs.filter( is_duplicate = False ) # now 123!
# count?
eval_count = evaluation_qs.count()
print( "record count: {}".format( str( eval_count ) ) )
# 123
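The Task 1 rule for choosing which side of a duplicate pair to mark can be sketched as below. This is a plain-Python illustration, not the actual model code, and the `coder_type` key is a hypothetical stand-in for however coder identity is recorded:

```python
# hypothetical pair of evaluation records for the same person and article:
# the human missed the person, the computer found them.
human_eval = { "id": 405, "coder_type": "human", "is_duplicate": False }
computer_eval = { "id": 407, "coder_type": "computer", "is_duplicate": False }

def mark_computer_as_duplicate( eval_1, eval_2 ):

    # per Task 1: mark the computer coder's evaluation record as the duplicate.
    for current_eval in ( eval_1, eval_2 ):

        if ( current_eval[ "coder_type" ] == "computer" ):

            current_eval[ "is_duplicate" ] = True

        #-- END check for computer coder --#

    #-- END loop over pair --#

#-- END function mark_computer_as_duplicate() --#

mark_computer_as_duplicate( human_eval, computer_eval )
print( computer_eval[ "is_duplicate" ] )  # True
print( human_eval[ "is_duplicate" ] )  # False
```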
Task 2: update counts of characterizations above. When filtering for counts:

- exclude duplicates ("is_duplicate" = False).
- exclude skipped records ("is_skipped" = False).
In [ ]:
# get all evaluation records with:
# - label = "prelim_month"
# - is_ground_truth_fixed = True
# - is_human_error = True
# - event_type = "remove_tags"
# - is_duplicate = False
evaluation_qs = Reliability_Names_Evaluation.objects.filter( label = "prelim_month" )
evaluation_qs = evaluation_qs.filter( is_ground_truth_fixed = True )
evaluation_qs = evaluation_qs.filter( is_human_error = True )
evaluation_qs = evaluation_qs.filter( event_type = "remove_tags" )
# evaluation_qs = evaluation_qs.filter( is_to_do = True ) # 130 originally
evaluation_qs = evaluation_qs.filter( is_duplicate = False ) # now 123!
# count?
eval_count = evaluation_qs.count()
print( "record count: {}".format( str( eval_count ) ) )
# 102
Analysis:
characterization of the problems:
note:
TODO:
work:

- Some records needed to be merged (in addition to the "remove_tags" events). Denoted by records with "event_type" set to "merge" in the Reliability_Names_Evaluation table in django: http://research.local/research/admin/context_analysis/reliability_names_evaluation/?event_type__exact=merge&label=prelim_month&o=-1.7.8.3.5
- Some records need to be deleted from Reliability_Names (single names removed once again, etc.). Denoted by records with "event_type" set to "deleted" in the Reliability_Names_Evaluation table in django: http://research.local/research/admin/context_analysis/reliability_names_evaluation/?event_type__exact=delete&label=prelim_month&o=-1.7.8.3.5
In summary:

after correction:

- 2,446 overall coding decisions on people in "prelim_month".
- 294 disagreements in "prelim_month".
- 135 disagreements between humans and corrected coding ("prelim_month_human").
- to more readily explore the distribution of problems, added tag "prelim_month_human_disagree" to disagreements in "prelim_month_human".
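These counts can be sanity-checked with quick arithmetic (figures taken from this summary and from the ground-truth note earlier):

```python
# figures from the summary above and the earlier ground-truth note.
total_decisions = 2446
disagreements = 294

# share of "prelim_month" coding decisions that were disagreements
print( "{:.1%}".format( disagreements / total_decisions ) )  # 12.0%

# ground-truth fixes: 130 of 415, as noted earlier
print( "{:.1%}".format( 130 / 415 ) )  # 31.3%
```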
TODO:
- look over human errors (disagreement between humans and corrected data).
- figure out types of error, and counts of each type. In SQL, filter on tag, then on other traits.
Human error:
In [20]:
# declare variables
eval_record = None
current_article = None
article_id = None
article_id_to_count_map = None
article_id_to_instance_map = None
article_count = None
average_count = None
# get all evaluation records with:
# - label = "prelim_month"
# - is_ground_truth_fixed = True
# - is_human_error = True
# - event_type = "remove_tags"
# - is_duplicate = False
evaluation_qs = Reliability_Names_Evaluation.objects.filter( label = "prelim_month" )
evaluation_qs = evaluation_qs.filter( is_ground_truth_fixed = True )
evaluation_qs = evaluation_qs.filter( is_human_error = True )
evaluation_qs = evaluation_qs.filter( event_type = "remove_tags" )
# evaluation_qs = evaluation_qs.filter( is_to_do = True ) # 130 originally
evaluation_qs = evaluation_qs.filter( is_duplicate = False ) # now 123!
# count?
eval_count = evaluation_qs.count()
print( "record count: {}".format( str( eval_count ) ) )
# loop
article_id_to_count_map = {}
article_id_to_instance_map = {}
article_count = 0
for eval_record in evaluation_qs:

    # get article ID
    current_article = eval_record.article
    article_id = current_article.id

    # save instance
    if article_id not in article_id_to_instance_map:

        # add it
        article_id_to_instance_map[ article_id ] = current_article

    #-- END check if already saved --#

    # update count
    article_count = article_id_to_count_map.get( article_id, 0 )
    article_count += 1
    article_id_to_count_map[ article_id ] = article_count

#-- END loop over records --#
# output
for article_id, article_count in six.iteritems( article_id_to_count_map ):

    # print results
    print( "- article {}: {}".format( article_id, article_count ) )

#-- END loop over articles --#
# now, get and output count of articles and average per article.
article_count = len( article_id_to_count_map )
average_count = eval_count / article_count
print( "" )
print( "Article-level info:" )
print( "- article count: {}".format( article_count ) )
print( "- average per article: {}".format( average_count ) )
In [18]:
# Generate counts for each field.
# declare variables
field_names_view = None
field_names_list = None
current_field_name = None
my_kwargs = None
kwarg_name = None
kwarg_value = None
evaluation_qs = None
total_count = None
filtered_qs = None
filtered_count = -1
# first, get all evaluation instances with label = "prelim_month", event type "remove_tags", ground truth fixed, human error, and not duplicates.
evaluation_qs = Reliability_Names_Evaluation.objects.filter( label = "prelim_month" )
evaluation_qs = evaluation_qs.filter( is_ground_truth_fixed = True )
evaluation_qs = evaluation_qs.filter( is_human_error = True )
evaluation_qs = evaluation_qs.filter( event_type = "remove_tags" )
evaluation_qs = evaluation_qs.filter( is_duplicate = False ) # now 123!
total_count = evaluation_qs.count()
# get view of keys.
field_names_view = six.viewkeys( FIELD_NAME_TO_TRAITS_MAP )
# convert to sorted list
field_names_list = list( field_names_view )
field_names_list.sort()
# loop over things in FIELD_NAME_TO_TRAITS_MAP.
field_string = ""
zero_list = []
non_zero_list = []
for current_field_name in field_names_list:

    #print( "current field name: {}".format( current_field_name ) )

    # filter and count records where the current field is True.
    my_kwargs = {}
    kwarg_name = current_field_name
    kwarg_value = True
    my_kwargs[ kwarg_name ] = kwarg_value
    #print( my_kwargs )

    # filter.
    filtered_qs = evaluation_qs.filter( **my_kwargs )

    # count
    filtered_count = filtered_qs.count()

    # how many?
    field_string = "- field {} count: {}".format( current_field_name, str( filtered_count ) )
    if ( filtered_count > 0 ):

        # add to non-zero list
        non_zero_list.append( field_string )

    else:

        # add to zero list
        zero_list.append( field_string )

    #-- END count check --#

#-- END loop over field names. --#
print( "Total records: {}\n".format( total_count ) )
print( "tags found:" )
print( "\n".join( non_zero_list ) )
print( "\ntags not found:" )
print( "\n".join( zero_list ) )
Computer error - look over classes of error for trends (systemic error) and interesting.
In [15]:
# Generate counts for each field.
# declare variables
field_names_view = None
field_names_list = None
current_field_name = None
my_kwargs = None
kwarg_name = None
kwarg_value = None
evaluation_qs = None
total_count = None
filtered_qs = None
filtered_count = -1
# first, get all evaluation instances with label = "prelim_month" and event type "remove_tags", excluding ground-truth fixes and human errors.
evaluation_qs = Reliability_Names_Evaluation.objects.filter( label = "prelim_month" )
evaluation_qs = evaluation_qs.filter( event_type = "remove_tags" )
evaluation_qs = evaluation_qs.exclude( is_ground_truth_fixed = True )
evaluation_qs = evaluation_qs.exclude( is_human_error = True )
total_count = evaluation_qs.count()
# get view of keys.
field_names_view = six.viewkeys( FIELD_NAME_TO_TRAITS_MAP )
# convert to sorted list
field_names_list = list( field_names_view )
field_names_list.sort()
# loop over things in FIELD_NAME_TO_TRAITS_MAP.
field_string = ""
zero_list = []
non_zero_list = []
for current_field_name in field_names_list:

    #print( "current field name: {}".format( current_field_name ) )

    # filter and count records where the current field is True.
    my_kwargs = {}
    kwarg_name = current_field_name
    kwarg_value = True
    my_kwargs[ kwarg_name ] = kwarg_value
    #print( my_kwargs )

    # filter.
    filtered_qs = evaluation_qs.filter( **my_kwargs )

    # count
    filtered_count = filtered_qs.count()

    # how many?
    field_string = "- field {} count: {}".format( current_field_name, str( filtered_count ) )
    if ( filtered_count > 0 ):

        # add to non-zero list
        non_zero_list.append( field_string )

    else:

        # add to zero list
        zero_list.append( field_string )

    #-- END count check --#

#-- END loop over field names. --#
print( "Total records: {}\n".format( total_count ) )
print( "tags found:" )
print( "\n".join( non_zero_list ) )
print( "\ntags not found:" )
print( "\n".join( zero_list ) )