prelim_month - create Reliability_Names data
2016.12.04 - work log - prelim_month - create Reliability_Names
original file name: 2016.12.04-work_log-prelim_month-create_Reliability_Names.ipynb
This is the notebook where the underlying name comparison data was created - one row per person per article, columns for the ways up to ten different coders captured that person from the text.
In [1]:
import datetime
print( "packages imported at " + str( datetime.datetime.now() ) )
If you are using a virtualenv, make sure that it is active inside this notebook. One option is to run ../dev/wsgi.py in this notebook, which configures the Python environment manually as if you had activated the sourcenet virtualenv. To do this, you'd make a code cell that contains:
%run ../dev/wsgi.py
This is sketchy, however, because it changes the Python environment of whatever kernel is currently running, and I'd worry about collisions with the actual Python 3 kernel. A better option is to install the virtualenv as a separate kernel. Steps:
activate your virtualenv:
workon sourcenet
in your virtualenv, install the ipykernel package:
pip install ipykernel
use the ipykernel python program to install the current environment as a kernel:
python -m ipykernel install --user --name <env_name> --display-name "<display_name>"
sourcenet example:
python -m ipykernel install --user --name sourcenet --display-name "sourcenet (Python 3)"
More details: http://ipython.readthedocs.io/en/stable/install/kernel_install.html
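Once the new kernel is selected for this notebook, a quick check like the one below can confirm that the kernel really is running inside the virtualenv. This is a minimal sketch using only the standard library; in_virtualenv is a hypothetical helper name, not something from the sourcenet project.

```python
import sys

def in_virtualenv():
    """Return True if the current interpreter appears to run inside a virtualenv or venv."""
    # venv and recent virtualenv releases set sys.base_prefix to the
    # underlying Python install; older virtualenv releases set sys.real_prefix.
    base_prefix = getattr( sys, "base_prefix", sys.prefix )
    return hasattr( sys, "real_prefix" ) or ( sys.prefix != base_prefix )

print( "interpreter: " + sys.executable )
print( "in virtualenv: " + str( in_virtualenv() ) )
```

If the interpreter path does not point into the sourcenet virtualenv, the notebook is probably still using the default kernel.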
First, initialize my dev django project, so I can run code in this notebook that references my django models and can talk to the database using my project's settings.
In [2]:
%pwd
Out[2]:
In [3]:
%ls
In [4]:
%run ../django_init.py
Description of data, for paper.
In [5]:
from context_text.models import Article
In [6]:
# how many articles in "grp_month"?
article_qs = Article.objects.filter( tags__name__in = [ "grp_month" ] )
grp_month_count = article_qs.count()
print( "grp_month count = {}".format( grp_month_count ) )
prelim_month
Create the data.
Initialize from file:
In [ ]:
%run ../config-coder_index-prelim_month.py
Example snapshot of configuration in this file:
'''
You must create an index-able instance and place it in my_index_instance before
you run this code. The index configuration in this file will be applied to
the instance stored in "my_index_instance".
Objects you can pass in this instance:
from context_analysis.reliability.reliability_names_builder import ReliabilityNamesBuilder
from context_analysis.network.network_person_info import NetworkPersonInfo
'''
# imports
import datetime
# sourcenet imports
from context_text.shared.context_text_base import ContextTextBase
# context_analysis imports
from context_analysis.reliability.reliability_names_builder import ReliabilityNamesBuilder
from context_analysis.network.network_person_info import NetworkPersonInfo
# return reference
index_helper_OUT = None
# declare variables
tag_list = None
label = ""
# declare variables - user setup
my_info_instance = None
my_reliability_instance = None
current_coder = None
current_coder_id = -1
current_priority = -1
# declare variables - Article_Data filtering.
coder_type = ""
#===============================================================================
# configure
#===============================================================================
# list of tags of articles we want to process.
tag_list = [ "grp_month", ]
# label to associate with results, for subsequent lookup.
label = "prelim_month"
# create index instances
my_info_instance = NetworkPersonInfo()
my_reliability_instance = ReliabilityNamesBuilder()
# ! ====> map coders to indices
# set it up so that...
# ...the ground truth user has highest priority (4) for index 1...
current_coder = ContextTextBase.get_ground_truth_coding_user()
current_coder_id = current_coder.id
current_index = 1
current_priority = 4
my_info_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
# ...coder ID 8 is priority 3 for index 1...
current_coder_id = 8
current_index = 1
current_priority = 3
my_info_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
# ...coder ID 9 is priority 2 for index 1...
current_coder_id = 9
current_index = 1
current_priority = 2
my_info_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
# ...coder ID 10 is priority 1 for index 1...
current_coder_id = 10
current_index = 1
current_priority = 1
my_info_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
# ...and automated coder (2) is index 2
current_coder = ContextTextBase.get_automated_coding_user()
current_coder_id = current_coder.id
current_index = 2
current_priority = 1
my_info_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
my_reliability_instance.add_coder_at_index( current_coder_id, current_index, priority_IN = current_priority )
# and only look at coding by those users. And...
# configure so that it limits to automated coder_type of OpenCalais_REST_API_v2.
coder_type = "OpenCalais_REST_API_v2"
#my_reliability_instance.limit_to_automated_coder_type = "OpenCalais_REST_API_v2"
my_info_instance.automated_coder_type_include_list.append( coder_type )
my_reliability_instance.automated_coder_type_include_list.append( coder_type )
index_helper_OUT = my_info_instance.get_index_helper()
print( "indexing for grp_month/prelim_month initialized at " + str( datetime.datetime.now() ) )
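The configuration above maps several coders to index 1 with different priorities. My reading of the comments is that when more than one of those coders coded the same article, the coder with the highest priority is the one used for that index. The standalone sketch below illustrates that reading only. It does not use ReliabilityNamesBuilder or NetworkPersonInfo, the function name is hypothetical, and the ground truth user ID is a made-up placeholder (the real configuration looks it up at runtime).

```python
# Hypothetical stand-in for the ground truth user's ID.
GROUND_TRUTH_ID = 100

# index --> { coder ID : priority }, mirroring the configuration above.
index_to_coder_priorities = {
    1 : { GROUND_TRUTH_ID : 4, 8 : 3, 9 : 2, 10 : 1 },
    2 : { 2 : 1 },  # automated coder
}

def pick_coder_for_index( index_IN, coders_who_coded_IN ):
    """Return the highest-priority coder mapped to index_IN among those
    who coded the article, or None if none of them is mapped to it."""
    priorities = index_to_coder_priorities.get( index_IN, {} )
    candidates = [ coder_id for coder_id in coders_who_coded_IN if coder_id in priorities ]
    if not candidates:
        return None
    return max( candidates, key = lambda coder_id : priorities[ coder_id ] )

# article coded by coders 9 and 10 plus the automated coder (2)
coders = { 9, 10, 2 }
print( "index 1 coder: " + str( pick_coder_for_index( 1, coders ) ) )
print( "index 2 coder: " + str( pick_coder_for_index( 2, coders ) ) )
```

Under this assumption, index 1 holds one human coding per article (ground truth when available, then coders 8, 9, 10 in that order) and index 2 always holds the automated coding.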
In [ ]:
# output debug JSON to file
my_reliability_instance.debug_output_json_file_path = "/home/jonathanmorgan/" + label + ".json"
#===============================================================================
# process
#===============================================================================
# process articles
#my_reliability_instance.process_articles( tag_list )
# output to database.
#my_reliability_instance.output_reliability_data( label )
print( "reliability data created at " + str( datetime.datetime.now() ) )
First, make a backup of the database:
sourcenet-2016.12.04.pgsql.gz
Next, remove all reliability data that refers to a single name using the "View reliability name information" screen:
To start, enter the following in fields there:
You should see lots of entries where the automated coder detected people who were mentioned only by their first name.
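For reference, the kind of check involved can be sketched in plain Python. This is only an illustration of what "a single name" means here, assuming it means one whitespace-separated token. is_single_name is a hypothetical helper, not the logic behind the "View reliability name information" screen, and the sample names are made up.

```python
def is_single_name( name_IN ):
    """Return True if the name consists of exactly one whitespace-separated part."""
    return len( name_IN.split() ) == 1

# made-up sample names
sample_names = [ "Jane Doe", "Jane", "John Q. Public", "Bob" ]
single_names = [ name for name in sample_names if is_single_name( name ) ]
print( "single names: " + str( single_names ) )
```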