prelim_month_human - reliability

  • old "title": 2017.10.26 - work log - prelim_month_human - Reliability_Names reliability
  • old file name: 2017.10.25-work_log-prelim_month_human-Reliability_Names_reliability.ipynb

Examines agreement between corrected human coding (coder 1, treated as ground truth) and uncorrected human coding (coder 2).
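As a rough illustration of the two measures reported for every column below, here is a minimal sketch of percentage agreement and nominal Krippendorff's alpha for two coders with no missing values. The function names and the data are hypothetical (the notebook itself gets alpha from R's irr::kripp.alpha). The data are chosen to show how a nearly constant column can score high on percentage agreement while alpha lands near zero:

```python
from collections import Counter


def percentage_agreement(coder_1, coder_2):
    """Share of units on which the two coders assigned the same value."""
    matches = sum(1 for a, b in zip(coder_1, coder_2) if a == b)
    return matches / len(coder_1)


def krippendorff_alpha_nominal(coder_1, coder_2):
    """Nominal Krippendorff's alpha for two coders, no missing values."""
    value_counts = Counter()
    observed = 0  # off-diagonal total of the coincidence matrix
    for a, b in zip(coder_1, coder_2):
        value_counts[a] += 1
        value_counts[b] += 1
        if a != b:
            observed += 2  # each disagreeing unit adds (a, b) and (b, a)
    n_values = sum(value_counts.values())
    expected = sum(
        value_counts[c] * value_counts[k]
        for c in value_counts
        for k in value_counts
        if c != k
    )
    if expected == 0:
        return float("nan")  # only one value ever used: alpha is undefined
    return 1.0 - (n_values - 1) * observed / expected


# hypothetical data: coder 1 always says 1, coder 2 differs on 6 of 459 units
coder_1 = [1] * 459
coder_2 = [0] * 6 + [1] * 453
agreement = percentage_agreement(coder_1, coder_2)
alpha = krippendorff_alpha_nominal(coder_1, coder_2)
print("percentage_agreement = {:.4f}".format(agreement))
print("alpha = {:.4f}".format(alpha))
```

Because expected disagreement is tiny when almost every value is the same, a handful of disagreements is enough to pull alpha to roughly zero even at about 99% agreement.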

Setup

Setup - Imports


In [1]:
import datetime
import six

print( "packages imported at " + str( datetime.datetime.now() ) )


packages imported at 2017-10-26 00:50:05.338122

Setup - virtualenv jupyter kernel

If you are using a virtualenv, make sure that you:

  • have installed your virtualenv as a kernel.
  • choose the kernel for your virtualenv as the kernel for your notebook (Kernel --> Change kernel).

Since I use a virtualenv, I need to activate it inside this notebook. One option is to run ../dev/wsgi.py in this notebook, which configures the Python environment manually as if you had activated the sourcenet virtualenv. To do this, make a code cell that contains:

%run ../dev/wsgi.py

This is risky, however, because it changes the Python environment inside whatever kernel is currently running, and it could collide with the actual Python 3 kernel. A better option is to install the virtualenv as a separate kernel. Steps:

  • activate your virtualenv:

      workon sourcenet
  • in your virtualenv, install the package ipykernel.

      pip install ipykernel
  • use the ipykernel python program to install the current environment as a kernel:

      python -m ipykernel install --user --name <env_name> --display-name "<display_name>"

    sourcenet example:

      python -m ipykernel install --user --name sourcenet --display-name "sourcenet (Python 3)"

More details: http://ipython.readthedocs.io/en/stable/install/kernel_install.html
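
After choosing the kernel, a quick sanity check is to ask the running interpreter where it lives. A minimal sketch using only the standard library (the helper name in_virtualenv is hypothetical). When the virtualenv kernel is active, sys.prefix should point inside the virtualenv directory:

```python
import sys


def in_virtualenv():
    """Return True if this interpreter is running inside a virtualenv or venv."""
    # classic virtualenv sets sys.real_prefix; venv and newer virtualenv
    #     make sys.prefix differ from sys.base_prefix.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or base_prefix != sys.prefix


print("python executable: " + str(sys.executable))
print("sys.prefix: " + str(sys.prefix))
print("in virtualenv: " + str(in_virtualenv()))
```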


In [2]:
%pwd


Out[2]:
'/home/jonathanmorgan/work/django/research/work/phd_work'

Setup - Initialize Django

First, initialize my dev django project, so I can run code in this notebook that references my django models and can talk to the database using my project's settings.


In [3]:
%run ../django_init.py


django initialized at 2017-10-26 04:50:10.826004

Setup R

To allow Python to talk to R, run the following at the R prompt:

# install packages
install.packages( "Rserve" )
install.packages( "irr" )

# load Rserve
library( Rserve )

# start server
Rserve( args="--no-save" )
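
Before running the analysis, it can help to confirm that Rserve is actually accepting connections. A minimal sketch using only the standard library, assuming Rserve is on localhost at its default port 6311 (the helper name is hypothetical; adjust host and port if your Rserve is configured differently). It only shows that something is listening on the port, not that the listener is Rserve:

```python
import socket


def is_port_open(host="localhost", port=6311, timeout=2.0):
    """Return True if something accepts TCP connections at host:port."""
    try:
        connection = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # refused, timed out, or host could not be resolved
        return False
    connection.close()
    return True


print("Rserve reachable: " + str(is_port_open()))
```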

Setup database

The names analyzer below also needs database connection information: either set it directly on the analyzer instance, or store it in Django_Config:

# database connection information - 2 options...  Enter it here:
#my_analysis_instance.db_username = ""
#my_analysis_instance.db_password = ""
#my_analysis_instance.db_host = "localhost"
#my_analysis_instance.db_name = "sourcenet"

# Or set up the following properties in Django_Config, in the Django admin.
#     All have application of: "sourcenet-db-admin":
#     - db_username
#     - db_password
#     - db_host
#     - db_port
#     - db_name

Reliability data assessment - prelim_month_human

Generate reliability analysis for label "prelim_month_human".


In [4]:
# start to support python 3:
from __future__ import unicode_literals
from __future__ import division

#==============================================================================#
# ! imports
#==============================================================================#

# grouped by functional area, then alphabetical order by package, then
#     alphabetical order by name of thing being imported.

# context_analysis imports
from context_analysis.reliability.reliability_names_analyzer import ReliabilityNamesAnalyzer

#==============================================================================#
# ! logic
#==============================================================================#

# declare variables
my_analysis_instance = None
label = ""
indices_to_process = -1
result_status = ""

# make reliability instance
my_analysis_instance = ReliabilityNamesAnalyzer()

# database connection information - 2 options...  Enter it here:
#my_analysis_instance.db_username = ""
#my_analysis_instance.db_password = ""
#my_analysis_instance.db_host = "localhost"
#my_analysis_instance.db_name = "sourcenet"

# Or set up the following properties in Django_Config, in the Django admin.
#     All have application of: "sourcenet-db-admin":
#     - db_username
#     - db_password
#     - db_host
#     - db_port
#     - db_name

# run the analyze method
label = "prelim_month_human"
indices_to_process = 2
result_status = my_analysis_instance.analyze_reliability_names( label, indices_to_process )


==> current index = 1
====> comparison index = 2
====> current_column_name_suffix = _detected
====> current_column_result_name = detect
======> compare_column_name_1 = coder1_detected
======> compare_column_name_2 = coder2_detected
========> percentage_agreement = 0.986928104575
========> R irr::kripp.alpha = -0.005482456140350811
========> Potter's Pi = 0.97385620915
====> current_column_name_suffix = _person_id
====> current_column_result_name = lookup
======> compare_column_name_1 = coder1_person_id
======> compare_column_name_2 = coder2_person_id
========> percentage_agreement = 0.986928104575
========> R irr::kripp.alpha = 0.9864845279533468
====> current_column_name_suffix = _person_type_int
====> current_column_result_name = type
======> compare_column_name_1 = coder1_person_type_int
======> compare_column_name_2 = coder2_person_type_int
========> percentage_agreement = 0.986928104575
========> R irr::kripp.alpha = -0.005482456140350811
========> Potter's Pi = 0.9825708061
====> current_column_name_suffix = _first_quote_graf
====> current_column_result_name = first_quote_graf
======> compare_column_name_1 = coder1_first_quote_graf
======> compare_column_name_2 = coder2_first_quote_graf
========> percentage_agreement = 0.0
========> R irr::kripp.alpha = 1.0
====> current_column_name_suffix = _first_quote_index
====> current_column_result_name = first_quote_index
======> compare_column_name_1 = coder1_first_quote_index
======> compare_column_name_2 = coder2_first_quote_index
========> percentage_agreement = 0.0
========> R irr::kripp.alpha = 1.0
====> current_column_name_suffix = _organization_hash
====> current_column_result_name = organization_hash
======> compare_column_name_1 = coder1_organization_hash
======> compare_column_name_2 = coder2_organization_hash
/home/jonathanmorgan/work/django/research/context_analysis/reliability/reliability_names_analyzer.py:442: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  compare_values_1[ compare_values_1.isnull() ] = "-1"
/home/jonathanmorgan/.virtualenvs/sourcenet/lib/python3.5/site-packages/pandas/core/generic.py:5233: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)
/home/jonathanmorgan/work/django/research/context_analysis/reliability/reliability_names_analyzer.py:698: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  column_name_prefix )
/home/jonathanmorgan/work/django/research/context_analysis/reliability/reliability_names_analyzer.py:444: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  compare_values_2[ compare_values_2.isnull() ] = "-1"
========> percentage_agreement = 0.986928104575
========> R irr::kripp.alpha = 0.7101922570450356
====> current_column_name_suffix = _person_id
====> current_column_result_name = lookup_non_zero
======> compare_column_name_1 = coder1_person_id
======> compare_column_name_2 = coder2_person_id
========> percentage_agreement = 1.0
========> R irr::kripp.alpha = 1.0
====> current_column_name_suffix = _person_type_int
====> current_column_result_name = type_non_zero
======> compare_column_name_1 = coder1_person_type_int
======> compare_column_name_2 = coder2_person_type_int
========> percentage_agreement = 1.0
========> R irr::kripp.alpha = 1.0
========> Potter's Pi = 1.0
====> current_column_name_suffix = _detected
====> current_column_result_name = detect
======> compare_column_name_1 = coder1_detected
======> compare_column_name_2 = coder2_detected
========> percentage_agreement = 0.945973496432
========> R irr::kripp.alpha = -0.027501309586170697
========> Potter's Pi = 0.891946992864
====> current_column_name_suffix = _person_id
====> current_column_result_name = lookup
======> compare_column_name_1 = coder1_person_id
======> compare_column_name_2 = coder2_person_id
========> percentage_agreement = 0.945463812436
========> R irr::kripp.alpha = 0.9453873170490539
====> current_column_name_suffix = _person_type_int
====> current_column_result_name = type
======> compare_column_name_1 = coder1_person_type_int
======> compare_column_name_2 = coder2_person_type_int
========> percentage_agreement = 0.934760448522
========> R irr::kripp.alpha = 0.8674336064509112
========> Potter's Pi = 0.913013931363
====> current_column_name_suffix = _first_quote_graf
====> current_column_result_name = first_quote_graf
======> compare_column_name_1 = coder1_first_quote_graf
======> compare_column_name_2 = coder2_first_quote_graf
========> percentage_agreement = 0.612130479103
========> R irr::kripp.alpha = 1.0
====> current_column_name_suffix = _first_quote_index
====> current_column_result_name = first_quote_index
======> compare_column_name_1 = coder1_first_quote_index
======> compare_column_name_2 = coder2_first_quote_index
========> percentage_agreement = 0.611620795107
========> R irr::kripp.alpha = 0.999166703975839
====> current_column_name_suffix = _organization_hash
====> current_column_result_name = organization_hash
======> compare_column_name_1 = coder1_organization_hash
======> compare_column_name_2 = coder2_organization_hash
========> percentage_agreement = 0.957696228338
========> R irr::kripp.alpha = 0.9550945519033192
====> current_column_name_suffix = _person_id
====> current_column_result_name = lookup_non_zero
======> compare_column_name_1 = coder1_person_id
======> compare_column_name_2 = coder2_person_id
========> percentage_agreement = 0.999461206897
========> R irr::kripp.alpha = 0.999460812111982
====> current_column_name_suffix = _person_type_int
====> current_column_result_name = type_non_zero
======> compare_column_name_1 = coder1_person_type_int
======> compare_column_name_2 = coder2_person_type_int
========> percentage_agreement = 0.988146551724
========> R irr::kripp.alpha = 0.9740898998785137
========> Potter's Pi = 0.982219827586
==> current index = 2
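
The Potter's Pi values above are consistent with a chance-corrected agreement that assumes chance agreement of 1/k for k categories: Pi = (P_o - 1/k) / (1 - 1/k). This is inferred from the printed numbers, not taken from the analyzer's source, and the category counts below (2 for detect, 4 for type, 3 for type_non_zero) are likewise assumptions that happen to reproduce the output. A minimal sketch, with a hypothetical function name:

```python
def potters_pi(percentage_agreement, category_count):
    """Chance-corrected agreement assuming chance agreement of 1 / category_count."""
    if category_count < 2:
        raise ValueError("category_count must be at least 2")
    chance_agreement = 1.0 / category_count
    return (percentage_agreement - chance_agreement) / (1.0 - chance_agreement)


# percentage_agreement values copied from the output above; category counts assumed.
print(potters_pi(0.986928104575, 2))  # detect, first block of output
print(potters_pi(0.986928104575, 4))  # type, first block of output
print(potters_pi(0.988146551724, 3))  # type_non_zero, second block of output
```

Under this formula the correction shrinks as the number of categories grows, which would explain why the same percentage agreement (0.9869) yields a higher Pi for type than for detect.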

In [5]:
print( "result status: {status_string}".format( status_string = result_status ) )


result status: 
  • results are in Dropbox/academia/MSU/program_stuff/prelim_paper/analysis/reliability/2016-data/prelim_month_human-reliability_results.pdf.