prelim_month - create Reliability_Names data

2016.12.04 - work log - prelim_month - create Reliability_Names

original file name: 2016.12.04-work_log-prelim_month-create_Reliability_Names.ipynb

This is the notebook where the underlying name comparison data was created: one row per person per article, with columns for how each of up to ten different coders captured that person from the text.

Setup

Setup - Imports


In [1]:
import datetime

print( "packages imported at " + str( datetime.datetime.now() ) )


packages imported at 2018-08-15 19:37:09.208903

Setup - virtualenv jupyter kernel

If you are using a virtualenv, make sure that you:

  • have installed your virtualenv as a kernel.
  • choose the kernel for your virtualenv as the kernel for your notebook (Kernel --> Change kernel).

Since I use a virtualenv, I need to activate it inside this notebook. One option is to run ../dev/wsgi.py in this notebook, which configures the Python environment manually as if you had activated the sourcenet virtualenv. To do this, make a code cell that contains:

%run ../dev/wsgi.py

This is sketchy, however, because it changes the Python environment of whatever kernel is currently running. I'd worry about collisions with the actual Python 3 kernel. A better option is to install your virtualenv as a separate kernel. Steps:

  • activate your virtualenv:

      workon sourcenet
  • in your virtualenv, install the package ipykernel.

      pip install ipykernel
  • use the ipykernel python program to install the current environment as a kernel:

      python -m ipykernel install --user --name <env_name> --display-name "<display_name>"
    
    

    sourcenet example:

      python -m ipykernel install --user --name sourcenet --display-name "sourcenet (Python 3)"

More details: http://ipython.readthedocs.io/en/stable/install/kernel_install.html
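Once the kernel is installed and selected, a quick check in a code cell can confirm that the notebook is running inside the virtualenv. This is a minimal sketch using only the standard library; the helper name running_in_virtualenv is mine, not part of the project.

```python
import sys


def running_in_virtualenv():
    """Return True if the current interpreter belongs to a virtual environment."""
    # venv and recent virtualenv releases point sys.base_prefix at the system
    # Python and sys.prefix at the environment; older virtualenv releases
    # set sys.real_prefix instead.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or base_prefix != sys.prefix


print("interpreter: " + sys.executable)
print("prefix:      " + sys.prefix)
print("virtualenv:  " + str(running_in_virtualenv()))
```

If the interpreter path does not point into the virtualenv's directory, the notebook is most likely still using the default kernel.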

Setup - Initialize Django

First, initialize my dev django project, so I can run code in this notebook that references my django models and can talk to the database using my project's settings.


In [2]:
%pwd


Out[2]:
'/home/jonathanmorgan/work/django/research/work/phd_work/methods/data_creation'

In [3]:
%ls


2016.12.09-work_log-prelim_month-no_single_names.ipynb
2016.12.10-work_log-prelim_month-single_name_match_error.ipynb
2016.12.11-work_log-prelim_month-remove_single_names.ipynb
2017.06.01-work_log-prelim_month-remove_single_names.ipynb
prelim_month-create_Reliability_Names_data.ipynb
reliability-build_name_data.py

In [4]:
%run ../django_init.py


/home/jonathanmorgan/.virtualenvs/research/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
  """)
/home/jonathanmorgan/.virtualenvs/research/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
django initialized at 2018-08-15 19:38:10.518822
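The contents of ../django_init.py are not shown in this notebook. As a rough sketch of what a script like it typically needs to do (an assumption, not the actual file), the code below puts the project on sys.path, sets DJANGO_SETTINGS_MODULE, and calls django.setup(). The function names, the project directory, and the settings module name in the usage note are all placeholders.

```python
import os
import sys


def configure_django_env(project_dir, settings_module):
    """Make the project importable and point Django at its settings module."""
    project_dir = os.path.abspath(project_dir)
    if project_dir not in sys.path:
        sys.path.insert(0, project_dir)
    # setdefault leaves an already-exported DJANGO_SETTINGS_MODULE untouched.
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", settings_module)
    return project_dir


def init_django(project_dir, settings_module):
    """Configure the environment, then load settings and the app registry."""
    configure_django_env(project_dir, settings_module)
    import django  # imported late so the path and settings are in place first

    django.setup()
```

In a notebook this would be called once before any model import, for example init_django( "/path/to/project", "research.settings" ), where both arguments are hypothetical values.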

Data characterization

Description of data, for paper.

  • grp_month article count = 441

In [5]:
from context_text.models import Article

In [6]:
# how many articles in "grp_month"?
article_qs = Article.objects.filter( tags__name__in = [ "grp_month" ] )
grp_month_count = article_qs.count()

print( "grp_month count = {}".format( grp_month_count ) )


grp_month count = 441

Reliability data creation - prelim_month

Create the data.

Initialize from file:


In [ ]:
%run ../config-coder_index-prelim_month.py

Example snapshot of configuration in this file:

'''
You must create an index-able instance and place it in my_index_instance before
    you run this code.  The index configuration in this file will be applied to
    the instance stored in "my_index_instance".

Objects you can pass in this instance:

from context_analysis.reliability.reliability_names_builder import ReliabilityNamesBuilder
from context_analysis.network.network_person_info import NetworkPersonInfo
'''

# imports
import datetime

# sourcenet imports
from context_text.shared.context_text_base import ContextTextBase

# context_analysis imports
from context_analysis.reliability.reliability_names_builder import ReliabilityNamesBuilder
from context_analysis.network.network_person_info import NetworkPersonInfo

# return reference
index_helper_OUT = None

# declare variables
tag_list = None
label = ""

# declare variables - user setup
my_info_instance = None
my_reliability_instance = None
current_coder = None
current_coder_id = -1
current_priority = -1

# declare variables - Article_Data filtering.
coder_type = ""

#===============================================================================
# configure
#===============================================================================

# list of tags of articles we want to process.
tag_list = [ "grp_month", ]

# label to associate with results, for subsequent lookup.
label = "prelim_month"

# create index instances
my_info_instance = NetworkPersonInfo()
my_reliability_instance = ReliabilityNamesBuilder()

# ! ====> map coders to indices

# set it up so that...

# ...the ground truth user has highest priority (4) for index 1...
current_coder = ContextTextBase.get_ground_truth_coding_user()
current_coder_id = current_coder.id
current_index = 1
current_priority = 4
my_info_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )

# ...coder ID 8 is priority 3 for index 1...
current_coder_id = 8
current_index = 1
current_priority = 3
my_info_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )

# ...coder ID 9 is priority 2 for index 1...
current_coder_id = 9
current_index = 1
current_priority = 2
my_info_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )

# ...coder ID 10 is priority 1 for index 1...
current_coder_id = 10
current_index = 1
current_priority = 1
my_info_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )

# ...and automated coder (2) is index 2
current_coder = ContextTextBase.get_automated_coding_user()
current_coder_id = current_coder.id
current_index = 2
current_priority = 1
my_info_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )

# and only look at coding by those users.  And...

# configure so that it limits to automated coder_type of OpenCalais_REST_API_v2.
coder_type = "OpenCalais_REST_API_v2"
#my_reliability_instance.limit_to_automated_coder_type = "OpenCalais_REST_API_v2"
my_info_instance.automated_coder_type_include_list.append( coder_type )
my_reliability_instance.automated_coder_type_include_list.append( coder_type )

index_helper_OUT = my_info_instance.get_index_helper()

print( "indexing for grp_month/prelim_month initialized at " + str( datetime.datetime.now() ) )
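The snapshot above repeats the same pair of add_coder_at_index() calls for every coder. One possible way to express the same mapping more compactly is a table of (coder ID, index, priority) rows applied to both instances in a loop. The sketch below uses a stub class in place of NetworkPersonInfo and ReliabilityNamesBuilder so that it runs on its own; StubIndexer, add_coders, and the ground truth ID value are illustrative assumptions, and only the add_coder_at_index( coder_id, index, priority_IN = ... ) call shape is taken from the snapshot.

```python
class StubIndexer:
    """Hypothetical stand-in that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def add_coder_at_index(self, coder_id, index, priority_IN=None):
        self.calls.append((coder_id, index, priority_IN))


def add_coders(instance_list, coder_rows):
    """Apply each (coder_id, index, priority) row to every instance."""
    for coder_id, index, priority in coder_rows:
        for instance in instance_list:
            instance.add_coder_at_index(coder_id, index, priority_IN=priority)


# Placeholder IDs: the real file looks both of these users up at run time.
GROUND_TRUTH_ID = 13  # hypothetical value, not taken from the source
AUTOMATED_ID = 2  # per the comment in the snapshot

coder_rows = [
    (GROUND_TRUTH_ID, 1, 4),  # ground truth: highest priority for index 1
    (8, 1, 3),
    (9, 1, 2),
    (10, 1, 1),
    (AUTOMATED_ID, 2, 1),  # automated coder gets index 2 to itself
]

info_stub = StubIndexer()
reliability_stub = StubIndexer()
add_coders([info_stub, reliability_stub], coder_rows)
print(info_stub.calls)
```

Keeping the mapping in one list makes it harder for the two instances to drift apart when a coder is added or a priority changes.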

In [ ]:
# output debug JSON to file
my_reliability_instance.debug_output_json_file_path = "/home/jonathanmorgan/" + label + ".json"

#===============================================================================
# process
#===============================================================================

# process articles
#my_reliability_instance.process_articles( tag_list )

# output to database.
#my_reliability_instance.output_reliability_data( label )

print( "reliability data created at " + str( datetime.datetime.now() ) )

Database backup - sourcenet-2016.12.04.pgsql.gz

First, make a backup of the database.

  • File name: sourcenet-2016.12.04.pgsql.gz
  • All articles in tag "grp_month" are coded by OpenCalais.
  • Reliability data generated with label "prelim_month", no cleanup done yet.
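The command used to make the backup is not recorded here. A minimal sketch of one way to produce a file with this naming pattern from Python follows, assuming a PostgreSQL database named "sourcenet" (inferred from the file name) and pg_dump available on the PATH. Both function names are mine.

```python
import datetime
import gzip
import os
import shutil
import subprocess


def backup_file_name(db_name, backup_date):
    """Build a name that follows the <db>-YYYY.MM.DD.pgsql.gz pattern."""
    return "{}-{}.pgsql.gz".format(db_name, backup_date.strftime("%Y.%m.%d"))


def backup_database(db_name, backup_date, output_dir="."):
    """Stream a plain-SQL pg_dump of db_name into a gzip file; return its path."""
    output_path = os.path.join(output_dir, backup_file_name(db_name, backup_date))
    dump = subprocess.Popen(["pg_dump", db_name], stdout=subprocess.PIPE)
    with gzip.open(output_path, "wb") as gz_file:
        shutil.copyfileobj(dump.stdout, gz_file)
    dump.stdout.close()
    if dump.wait() != 0:
        raise RuntimeError("pg_dump exited with status {}".format(dump.returncode))
    return output_path


print(backup_file_name("sourcenet", datetime.date(2016, 12, 4)))
```

Calling backup_database( "sourcenet", datetime.date.today() ) would then write the dump into the current directory, provided the current user can connect to that database.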

Data cleanup

Remove single-name reliability data

Next, remove all reliability data that refers to a single name, using the "View reliability name information" screen.

To start, enter the following in fields there:

  • Label: "prelim_month"
  • Coders to compare (1 through ==>): 2
  • Reliability names filter type: Select "Lookup"
  • [Lookup] Person has first name, no other name parts: CHECK the checkbox

You should see lots of entries where the automated coder detected people who were mentioned only by their first name.
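The lookup itself runs inside the application, and its implementation is not shown in this notebook. To make the criterion concrete, here is a small sketch of the "first name, no other name parts" test applied to plain dictionaries. The field names in NAME_PARTS are assumptions for illustration and may not match the real Person model.

```python
# Hypothetical name-part fields; the real model may use different names.
NAME_PARTS = ("prefix", "first_name", "middle_name", "last_name", "suffix")


def is_first_name_only(person):
    """Return True if first_name is the only non-empty name part."""
    first = (person.get("first_name") or "").strip()
    others = [
        (person.get(part) or "").strip()
        for part in NAME_PARTS
        if part != "first_name"
    ]
    return bool(first) and not any(others)


people = [
    {"first_name": "Mary"},
    {"first_name": "Mary", "last_name": "Smith"},
    {"last_name": "Smith"},
]
single_name_people = [p for p in people if is_first_name_only(p)]
print(single_name_people)
```

Only the first record passes the test, which is the kind of row the screen is expected to list for deletion.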

Delete selected single-name data

See 2016.12.09-work_log-prelim_month-no_single_names.ipynb