Methods - network analysis - create network data
2017.11.14 - work log - prelim - network analysis
NOTE: The work captured here is outdated. See methods_paper_planning.ipynb --> Network Analysis for the up-to-date network analysis summary.
In [1]:
from __future__ import unicode_literals
from __future__ import division
# python base imports
import datetime
import six
print( "packages imported at " + str( datetime.datetime.now() ) )
First, initialize my dev django project, so I can run code in this notebook that references my django models and can talk to the database using my project's settings.
In [2]:
%run ../django_init.py
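For reference, a `django_init.py` bootstrap script of this sort typically just points Django at the project settings module and calls `django.setup()`. The sketch below is an assumption about what `../django_init.py` contains, not its actual contents; the settings module name is a placeholder:

```python
# hypothetical sketch of a django_init.py bootstrap script.
# "research.settings" is a placeholder, not the real settings module.
import os
import django

os.environ.setdefault( "DJANGO_SETTINGS_MODULE", "research.settings" )
django.setup()
```

After this runs, model imports and database queries work from inside the notebook.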
In [3]:
# django imports
from django.contrib.auth.models import User

# context_text imports
from context_text.shared.context_text_base import ContextTextBase
from context_text.models import Article
from context_text.models import Article_Data
from context_text.models import Person

# context_analysis imports
from context_analysis.network.network_person_info import NetworkPersonInfo
Generate basic network statistics from the ground truth and automated attribution data, then characterize and compare the networks using QAP (including an explanation of the substantial limitations of QAP here, given how sparse the networks are).
examine traits of ground_truth and automated networks
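The QAP comparison itself boils down to correlating the two adjacency matrices cell-by-cell, then building a null distribution by permuting the rows and columns of one matrix in tandem. A minimal pure-Python sketch of that idea (the matrices below are toy data, not the prelim networks, and this is not the statnet implementation):

```python
import random

def matrix_correlation( m1, m2 ):
    """Pearson correlation over the off-diagonal cells of two square matrices."""
    n = len( m1 )
    x = [ m1[ i ][ j ] for i in range( n ) for j in range( n ) if i != j ]
    y = [ m2[ i ][ j ] for i in range( n ) for j in range( n ) if i != j ]
    mean_x = sum( x ) / float( len( x ) )
    mean_y = sum( y ) / float( len( y ) )
    cov = sum( ( a - mean_x ) * ( b - mean_y ) for a, b in zip( x, y ) )
    var_x = sum( ( a - mean_x ) ** 2 for a in x )
    var_y = sum( ( b - mean_y ) ** 2 for b in y )
    return cov / ( ( var_x * var_y ) ** 0.5 )

def qap_p_value( m1, m2, permutations = 1000 ):
    """QAP-style test: permute rows and columns of m2 in tandem, count how
    often the permuted correlation is >= the observed correlation."""
    n = len( m1 )
    observed = matrix_correlation( m1, m2 )
    count = 0
    for _ in range( permutations ):
        perm = list( range( n ) )
        random.shuffle( perm )
        permuted = [ [ m2[ perm[ i ] ][ perm[ j ] ] for j in range( n ) ] for i in range( n ) ]
        if matrix_correlation( m1, permuted ) >= observed:
            count += 1
    return observed, count / float( permutations )
```

With very sparse matrices, most permutations leave the (near-empty) overlap unchanged, which is exactly why QAP significance is of limited use on these networks.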
Notes:
First, we configure as we did before, use Network Builder to render the human and automated networks, and compare the files to the previous run - they should be identical.
Configuration of Network Builder:
Configuration to generate network files for prelim:
Config of "Select Articles" - use defaults, except:
Coders:
coder_type filter
- only for coder "automated", for now.
"coder_type 'Value In' List (comma-delimited):" - Enter the coder types you want included. Examples:
Article Tag List: "prelim_network"
Configure "Network Settings" - use defaults, except:
Config of "Select People" - use defaults, except:
coder_type filter
"coder_type 'Value In' List (comma-delimited):" - Enter the coder types you want included. Examples:
Article Tag List: "prelim_network"
Resulting files stored in Dropbox (Dropbox/academia/MSU/program_stuff/prelim_paper/data/network_analysis/2017.11.14/network/original_coders/prelim_network):
sourcenet_data-20171114-182817-prelim_network-week-human-original-unordered.tab
sourcenet_data-20171114-183930-prelim_network-week-human-original-ordered-64.tab
sourcenet_data-20171115-024247-prelim_network-week-human-original-ordered-64-264.tab
sourcenet_data-20171114-182942-prelim_network-week-automated.tab
Notes:
Next, we start with the original configuration in Network Builder, update it to reflect the new coder users and ground_truth, render human and automated networks for the same week, and compare the files to those from the old coders.
Configuration to test with new coders:
in Article:
Resulting files stored in Dropbox (Dropbox/academia/MSU/program_stuff/prelim_paper/data/network_analysis/2017.11.14/network/new_coders/prelim_network):
sourcenet_data-20171115-033413-new-prelim_network-week-automated.tab
sourcenet_data-20171115-041030-new-prelim_network-week-automated-all_ordered-person-2.13.8.9.10.tab
sourcenet_data-20171115-035924-new-prelim_network-week-human-unordered.tab
sourcenet_data-20171114-183930-new-prelim_network-week-human-new-ordered-13.8.9.10.tab
Notes:
Next, we use Network Builder to create prelim_month data.
Configuration of Network Builder:
Configuration to generate network files for prelim:
Config of "Select Articles" - fields in bold need to be changed from default values:
Start date (YYYY-MM-DD): - 2009-12-01
End date (YYYY-MM-DD): - 2009-12-31
Fancy date range: - Empty.
Publications: - "Grand Rapids Press, The"
Coders: - None selected.
Coder IDs to include, in order of highest to lowest priority:
if automated: Article_Data coder_type Filter Type and coder_type 'Value In' List (comma-delimited): use the coder_type filter fields to filter automatically coded Article_Data on coder type if you have tried different automated coder types:
Article_Data coder_type Filter Type: - Just automated
coder_type 'Value In' List (comma-delimited): - Enter the coder types you want included. Examples:
Topics: - None selected.
Article Tag List (comma-delimited): - "grp_month"
Unique Identifier List (comma-delimited): - Empty.
Allow duplicate articles: - "No"
Configure "Network Settings" - fields in bold need to be changed from default values:
relations - Include source contact types: - All selected.
relations - Include source capacities: - None selected.
relations - Exclude source capacities: - None selected.
Download as File? - "Yes"
Include render details? - "No"
Data Format: - "Tab-Delimited Matrix"
Data Output Type: - "Network + Attribute Columns"
Network Label: - Empty.
Include Headers: - "Yes"
Config of "Select People" - fields in bold need to be changed from default values:
Person Query Type: - "Custom, defined below"
People from (YYYY-MM-DD): - 2009-12-01
People to (YYYY-MM-DD): - 2009-12-31
Fancy person date range: - Empty.
Person publications: - "Grand Rapids Press, The"
Person coders: - "automated", "minnesota1", "minnesota2", "minnesota3", "ground_truth"
Coder IDs to include, in order of highest to lowest priority: - Empty.
Article_Data coder_type Filter Type and coder_type 'Value In' List (comma-delimited): use the coder_type filter fields to filter automatically coded Article_Data on coder type if you have tried different automated coder types:
Article_Data coder_type Filter Type: - Just automated
coder_type 'Value In' List (comma-delimited): - Enter the coder types you want included. Examples:
Person Topics: - None
Article Tag List (comma-delimited): - "grp_month"
Unique Identifier List (comma-delimited): - Empty.
Person allow duplicate articles: - "Yes"
Resulting files stored in Dropbox (Dropbox/academia/MSU/program_stuff/prelim_paper/data/network_analysis/2017.11.14/network/new_coders/grp_month):
sourcenet_data-20171115-043102-grp_month-human.tab
sourcenet_data-20171205-022551-grp_month-automated.tab
NOTE: To test configuration, render network file for either human or automated, then check it against the corresponding file in Dropbox. If configured correctly, files will be the same.
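That regression check is just a line-by-line diff of the newly rendered file against the archived copy. A small sketch (the file paths in the usage comment are placeholders):

```python
def tab_files_match( path_a, path_b ):
    """Compare two tab-delimited network files line by line, ignoring
    trailing whitespace.  Returns ( True, None ) on a match, or
    ( False, first_differing_line_number ) on a mismatch."""
    with open( path_a ) as file_a, open( path_b ) as file_b:
        lines_a = [ line.rstrip() for line in file_a ]
        lines_b = [ line.rstrip() for line in file_b ]
    if lines_a == lines_b:
        return ( True, None )
    # find the first differing line for debugging.
    for index, ( a, b ) in enumerate( zip( lines_a, lines_b ) ):
        if a != b:
            return ( False, index + 1 )
    # same prefix, different lengths - report the line after the shorter file.
    return ( False, min( len( lines_a ), len( lines_b ) ) + 1 )

# usage (placeholder paths):
# tab_files_match( "new_render.tab", "archived_copy.tab" )
```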
Then, alter the configuration to output just the first week of articles, but with all the people from the entire month (to see how the same matrix compares when populated with a week's worth of data versus a month's).
Configuration to generate network files for prelim - use the above grp_month config, except:
Config of "Select Articles" - use fields as configured for full month of ties, except:
Start date (YYYY-MM-DD): - 2009-12-06
End date (YYYY-MM-DD): - 2009-12-12
Resulting files stored in Dropbox (Dropbox/academia/MSU/program_stuff/prelim_paper/data/network_analysis/2017.11.14/network/prelim_network/new_coders/grp_month):
sourcenet_data-20171206-031319-grp_month-human-week1_subset.tab
sourcenet_data-20171206-031358-grp_month-automated-week1_subset.tab
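Because the week subsets keep the full month's person list, the week and month matrices are the same size, so simple network-level measures can be compared directly. For example, density (toy matrices below, not the rendered files; we would expect the week subset's density to be well below the full month's):

```python
def network_density( matrix ):
    """Density of a directed network stored as a square adjacency matrix:
    ties present divided by possible ties, excluding self-ties."""
    n = len( matrix )
    if n < 2:
        return 0.0
    tie_count = sum(
        1
        for i in range( n )
        for j in range( n )
        if i != j and matrix[ i ][ j ] > 0
    )
    return tie_count / float( n * ( n - 1 ) )
```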
Then do the same for the second week of articles, again with all the people from the entire month.
Configuration to generate network files for prelim - use the above grp_month config, except:
Config of "Select Articles" - use fields as configured for full month of ties, except:
Start date (YYYY-MM-DD): - 2009-12-13
End date (YYYY-MM-DD): - 2009-12-19
Resulting files stored in Dropbox (Dropbox/academia/MSU/program_stuff/prelim_paper/data/network_analysis/2017.11.14/network/prelim_network/new_coders/grp_month):
sourcenet_data-20180326-034401-grp_month-human-week2_subset.tab
sourcenet_data-20180326-040445-grp_month-automated-week2_subset.tab
Then do the same for the third week of articles, again with all the people from the entire month.
Configuration to generate network files for prelim - use the above grp_month config, except:
Config of "Select Articles" - use fields as configured for full month of ties, except:
Start date (YYYY-MM-DD): - 2009-12-20
End date (YYYY-MM-DD): - 2009-12-26
Resulting files stored in Dropbox (Dropbox/academia/MSU/program_stuff/prelim_paper/data/network_analysis/2017.11.14/network/prelim_network/new_coders/grp_month):
sourcenet_data-20180326-034548-grp_month-human-week3_subset.tab
sourcenet_data-20180326-040736-grp_month-automated-week3_subset.tab
Notes on original network analysis are in Evernote: MSU PhD - prelim - analysis - Network Analysis notes
network descriptives
Dropbox/academia/MSU/program_stuff/prelim_paper/analysis/analysis_summary.xlsx
network-level
files
python script:
context_text/examples/analysis/analysis-person_info.py - calculates per-author information: shared sources, article counts per author, etc.
R scripts:
context_text/R/db_connect.r
context_text/R/sna/functions-sna.r
context_text/R/sna/sna-load_data.r
context_text/R/sna/igraph/*
context_text/R/sna/statnet/*
statnet/sna
sna::gden() - graph density
igraph
igraph::transitivity() - vector of transitivity scores for each node in a graph, plus network-level transitivity score.
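For a rough sense of what the network-level number computes, here is a pure-Python sketch of global transitivity (closed triads over connected triads) for an undirected 0/1 adjacency matrix. This mirrors the network-level score only, not the per-node vector, and is a sanity-check sketch rather than the igraph implementation:

```python
def global_transitivity( matrix ):
    """Global transitivity of an undirected 0/1 adjacency matrix:
    fraction of connected ordered triples i - j - k (center j) that
    are closed by an i - k tie."""
    n = len( matrix )
    triangles = 0
    triples = 0
    for i in range( n ):
        for j in range( n ):
            for k in range( n ):
                if i == j or j == k or i == k:
                    continue
                # j is the center of a potential open triple i - j - k.
                if matrix[ i ][ j ] and matrix[ j ][ k ]:
                    triples += 1
                    if matrix[ i ][ k ]:
                        triangles += 1
    return float( triangles ) / triples if triples > 0 else 0.0
```

A triangle scores 1.0; a bare two-edge path scores 0.0.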
First, need to figure out context_text/examples/analysis/reliability-build_relations.py
Original file: context_text/examples/analysis/analysis-person_info.py
Moved to:
context_analysis/network/network_person_info.py
context_analysis/examples/network/network-person_info.py
Try reproducing below with new class - Configure:
In [4]:
%run ../config-coder_index-prelim_month.py
In [5]:
%run ../config-coder_index-prelim_week.py
And then run the code:
In [7]:
#===============================================================================
# process articles
#===============================================================================

# process articles
my_info_instance.process_articles( tag_list )

# output lists of counts of sources and shared sources by author

# declare variables - looking at data
coder_index_to_data_dict = None
coder_index = -1
coder_data_dict = None
coder_author_id_list = None
coder_author_source_count_list = None
coder_author_shared_count_list = None
coder_author_article_count_list = None
mean_source_count = -1
mean_shared_count = -1
mean_article_count = -1
author_index = -1
shared_count = -1
temp_author_id_list = []
temp_source_count_list = []
temp_shared_count_list = []
temp_article_count_list = []

# for each index, get authors.
coder_index_to_data_dict = my_info_instance.coder_index_to_data_map

# loop over the dictionary to process each index.
for coder_index, coder_data_dict in six.iteritems( coder_index_to_data_dict ):

    # get data for coder
    coder_author_id_list = coder_data_dict.get( NetworkPersonInfo.PROP_CODER_AUTHOR_ID_LIST, None )
    coder_author_source_count_list = coder_data_dict.get( NetworkPersonInfo.PROP_CODER_AUTHOR_SOURCE_COUNT_LIST, None )
    coder_author_shared_count_list = coder_data_dict.get( NetworkPersonInfo.PROP_CODER_AUTHOR_SHARED_COUNT_LIST, None )
    coder_author_article_count_list = coder_data_dict.get( NetworkPersonInfo.PROP_CODER_AUTHOR_ARTICLE_COUNT_LIST, None )

    # output
    print( "" )
    print( "================================================================================" )
    print( "Data for Coder index " + str( coder_index ) + ":" )
    print( "" )
    print( "==> All authors" )
    print( "- author ID list = " + str( coder_author_id_list ) )
    print( "- author source count list = " + str( coder_author_source_count_list ) )
    print( "- author shared count list = " + str( coder_author_shared_count_list ) )
    print( "- author article count list = " + str( coder_author_article_count_list ) )

    # and some computations

    # author count
    print( "- author count = " + str( len( coder_author_id_list ) ) )

    # mean source count per author
    mean_source_count = float( sum( coder_author_source_count_list ) ) / len( coder_author_source_count_list )
    print( "- mean source count per author = " + str( mean_source_count ) )

    # mean shared count per author
    mean_shared_count = float( sum( coder_author_shared_count_list ) ) / len( coder_author_shared_count_list )
    print( "- mean shared count per author = " + str( mean_shared_count ) )

    # mean article count per author
    mean_article_count = float( sum( coder_author_article_count_list ) ) / len( coder_author_article_count_list )
    print( "- mean article count per author = " + str( mean_article_count ) )

    # the same, but just for those with shared sources.
    author_index = -1
    temp_author_id_list = []
    temp_source_count_list = []
    temp_shared_count_list = []
    temp_article_count_list = []
    for shared_count in coder_author_shared_count_list:

        # increment index
        author_index += 1

        # greater than 0?
        if ( shared_count > 0 ):

            # yes, add info to temp lists.
            temp_author_id_list.append( coder_author_id_list[ author_index ] )
            temp_source_count_list.append( coder_author_source_count_list[ author_index ] )
            temp_shared_count_list.append( coder_author_shared_count_list[ author_index ] )
            temp_article_count_list.append( coder_author_article_count_list[ author_index ] )

        #-- END check to see if shared count > 0 --#

    #-- END loop over shared_count_list --#

    print( "" )
    print( "==> Authors with shared sources" )
    print( "- author ID list = " + str( temp_author_id_list ) )
    print( "- author source count list = " + str( temp_source_count_list ) )
    print( "- author shared count list = " + str( temp_shared_count_list ) )
    print( "- author article count list = " + str( temp_article_count_list ) )

    # and some computations

    # author count
    print( "- author count = " + str( len( temp_author_id_list ) ) )

    # mean source count per author with shared sources
    mean_source_count = float( sum( temp_source_count_list ) ) / len( temp_source_count_list )
    print( "- mean source count per author with shared sources = " + str( mean_source_count ) )

    # mean shared count per author with shared sources
    mean_shared_count = float( sum( temp_shared_count_list ) ) / len( temp_shared_count_list )
    print( "- mean shared count per author with shared sources = " + str( mean_shared_count ) )

    # mean article count per author
    mean_article_count = float( sum( temp_article_count_list ) ) / len( temp_article_count_list )
    print( "- mean article count per author = " + str( mean_article_count ) )

#-- END loop over coders. --#
Results for:
grp_month/prelim_month: phd_work/results/network_person_info-grp_month.txt
Processed 441 Articles.
Processed 882 Article_Data records.
================================================================================
Data for Coder index 1:
==> All authors
- author ID list = [387, 2310, 2567, 394, 652, 13, 654, 3, 46, 23, 2004, 29, 30, 417, 36, 425, 2614, 302, 178, 437, 566, 1082, 443, 377, 66, 69, 161, 73, 74, 460, 482, 591, 336, 84, 598, 599, 217, 223, 736, 2018, 743, 937, 1782, 1655, 332, 505, 703, 637]
- author source count list = [18, 2, 0, 33, 9, 36, 3, 27, 57, 31, 4, 50, 28, 4, 31, 30, 5, 31, 41, 45, 4, 13, 43, 36, 92, 43, 37, 30, 46, 3, 1, 76, 9, 64, 21, 50, 46, 18, 2, 5, 2, 7, 4, 6, 7, 18, 2, 13]
- author shared count list = [7, 2, 0, 2, 0, 1, 0, 9, 22, 12, 0, 2, 2, 0, 9, 2, 0, 3, 6, 13, 0, 5, 9, 10, 37, 19, 12, 10, 5, 1, 0, 6, 2, 19, 5, 4, 13, 9, 0, 0, 0, 7, 0, 6, 1, 1, 0, 0]
- author article count list = [7, 1, 1, 8, 5, 17, 1, 13, 21, 15, 2, 18, 13, 4, 11, 10, 1, 12, 13, 15, 1, 8, 16, 17, 30, 15, 14, 12, 19, 4, 1, 25, 4, 27, 9, 17, 18, 6, 1, 1, 1, 1, 4, 1, 4, 8, 2, 4]
- author count = 48
- mean source count per author = 24.645833333333332
- mean shared count per author = 5.6875
- mean article count per author = 9.541666666666666
==> Authors with shared sources
- author ID list = [387, 2310, 394, 13, 3, 46, 23, 29, 30, 36, 425, 302, 178, 437, 1082, 443, 377, 66, 69, 161, 73, 74, 460, 591, 336, 84, 598, 599, 217, 223, 937, 1655, 332, 505]
- author source count list = [18, 2, 33, 36, 27, 57, 31, 50, 28, 31, 30, 31, 41, 45, 13, 43, 36, 92, 43, 37, 30, 46, 3, 76, 9, 64, 21, 50, 46, 18, 7, 6, 7, 18]
- author shared count list = [7, 2, 2, 1, 9, 22, 12, 2, 2, 9, 2, 3, 6, 13, 5, 9, 10, 37, 19, 12, 10, 5, 1, 6, 2, 19, 5, 4, 13, 9, 7, 6, 1, 1]
- author article count list = [7, 1, 8, 17, 13, 21, 15, 18, 13, 11, 10, 12, 13, 15, 8, 16, 17, 30, 15, 14, 12, 19, 4, 25, 4, 27, 9, 17, 18, 6, 1, 1, 4, 8]
- author count = 34
- mean source count per author with shared sources = 33.088235294117645
- mean shared count per author with shared sources = 8.029411764705882
- mean article count per author = 12.617647058823529
================================================================================
Data for Coder index 2:
==> All authors
- author ID list = [387, 2310, 2567, 394, 652, 13, 654, 3, 46, 23, 2004, 29, 30, 417, 36, 425, 2614, 302, 178, 437, 566, 1082, 443, 377, 66, 69, 161, 73, 74, 460, 482, 591, 336, 84, 598, 599, 217, 223, 736, 2018, 743, 1782, 1655, 332, 505, 703, 637]
- author source count list = [18, 2, 0, 27, 8, 39, 2, 29, 46, 33, 4, 50, 26, 4, 28, 31, 6, 31, 42, 49, 2, 15, 43, 34, 88, 45, 34, 28, 46, 4, 1, 72, 9, 69, 22, 46, 43, 13, 2, 5, 2, 4, 6, 7, 14, 2, 10]
- author shared count list = [7, 2, 0, 2, 0, 1, 0, 12, 13, 11, 0, 0, 2, 0, 7, 3, 1, 4, 8, 10, 0, 7, 8, 9, 35, 19, 11, 10, 4, 1, 0, 6, 1, 20, 7, 3, 11, 8, 0, 0, 0, 0, 6, 1, 1, 0, 0]
- author article count list = [7, 1, 1, 8, 5, 17, 1, 13, 20, 15, 2, 18, 13, 4, 11, 10, 1, 12, 13, 15, 1, 8, 16, 17, 30, 15, 14, 12, 19, 4, 1, 25, 4, 27, 9, 17, 18, 6, 1, 1, 1, 4, 1, 4, 8, 2, 4]
- author count = 47
- mean source count per author = 24.27659574468085
- mean shared count per author = 5.340425531914893
- mean article count per author = 9.702127659574469
==> Authors with shared sources
- author ID list = [387, 2310, 394, 13, 3, 46, 23, 30, 36, 425, 2614, 302, 178, 437, 1082, 443, 377, 66, 69, 161, 73, 74, 460, 591, 336, 84, 598, 599, 217, 223, 1655, 332, 505]
- author source count list = [18, 2, 27, 39, 29, 46, 33, 26, 28, 31, 6, 31, 42, 49, 15, 43, 34, 88, 45, 34, 28, 46, 4, 72, 9, 69, 22, 46, 43, 13, 6, 7, 14]
- author shared count list = [7, 2, 2, 1, 12, 13, 11, 2, 7, 3, 1, 4, 8, 10, 7, 8, 9, 35, 19, 11, 10, 4, 1, 6, 1, 20, 7, 3, 11, 8, 6, 1, 1]
- author article count list = [7, 1, 8, 17, 13, 20, 15, 13, 11, 10, 1, 12, 13, 15, 8, 16, 17, 30, 15, 14, 12, 19, 4, 25, 4, 27, 9, 17, 18, 6, 1, 4, 8]
- author count = 33
- mean source count per author with shared sources = 31.666666666666668
- mean shared count per author with shared sources = 7.606060606060606
- mean article count per author = 12.424242424242424
prelim_network (1 week): phd_work/results/network_person_info-prelim_network-1week.txt
Processed 109 Articles.
Processed 214 Article_Data records.
================================================================================
Data for Coder index 1:
==> All authors
- author ID list = [66, 387, 69, 73, 74, 23, 332, 13, 591, 336, 3, 84, 302, 599, 217, 29, 30, 223, 161, 36, 425, 46, 178, 937, 505, 443, 460, 394, 377]
- author source count list = [17, 5, 23, 6, 16, 10, 3, 14, 23, 9, 6, 19, 15, 8, 11, 11, 10, 7, 16, 17, 5, 32, 13, 7, 4, 4, 0, 7, 2]
- author shared count list = [12, 0, 6, 1, 1, 7, 0, 0, 1, 0, 0, 7, 2, 0, 1, 1, 0, 7, 9, 7, 2, 16, 0, 7, 0, 0, 0, 0, 0]
- author article count list = [4, 2, 9, 2, 8, 3, 1, 5, 6, 3, 4, 6, 6, 4, 6, 4, 3, 1, 5, 5, 2, 11, 6, 1, 1, 1, 1, 1, 2]
- author count = 29
- mean source count per author = 11.03448275862069
- mean shared count per author = 3.0
- mean article count per author = 3.896551724137931
==> Authors with shared sources
- author ID list = [66, 69, 73, 74, 23, 591, 84, 302, 217, 29, 223, 161, 36, 425, 46, 937]
- author source count list = [17, 23, 6, 16, 10, 23, 19, 15, 11, 11, 7, 16, 17, 5, 32, 7]
- author shared count list = [12, 6, 1, 1, 7, 1, 7, 2, 1, 1, 7, 9, 7, 2, 16, 7]
- author article count list = [4, 9, 2, 8, 3, 6, 6, 6, 6, 4, 1, 5, 5, 2, 11, 1]
- author count = 16
- mean source count per author with shared sources = 14.6875
- mean shared count per author with shared sources = 5.4375
- mean article count per author = 4.9375
================================================================================
Data for Coder index 2:
==> All authors
- author ID list = [66, 387, 69, 73, 74, 23, 332, 13, 591, 336, 505, 3, 340, 46, 599, 217, 29, 30, 223, 84, 161, 36, 425, 566, 302, 178, 350, 758, 377, 443, 460, 394]
- author source count list = [15, 5, 23, 6, 15, 10, 3, 14, 21, 9, 3, 5, 2, 23, 8, 10, 10, 9, 5, 17, 14, 14, 5, 1, 16, 12, 1, 1, 2, 4, 1, 6]
- author shared count list = [10, 0, 5, 0, 0, 6, 0, 0, 1, 0, 0, 0, 0, 7, 0, 0, 0, 0, 5, 6, 7, 5, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0]
- author article count list = [4, 2, 9, 2, 8, 3, 1, 5, 6, 3, 1, 4, 1, 10, 4, 6, 4, 3, 1, 6, 5, 5, 2, 1, 6, 6, 1, 1, 2, 1, 1, 1]
- author count = 32
- mean source count per author = 9.0625
- mean shared count per author = 1.75
- mean article count per author = 3.59375
==> Authors with shared sources
- author ID list = [66, 69, 23, 591, 46, 223, 84, 161, 36, 425, 302]
- author source count list = [15, 23, 10, 21, 23, 5, 17, 14, 14, 5, 16]
- author shared count list = [10, 5, 6, 1, 7, 5, 6, 7, 5, 2, 2]
- author article count list = [4, 9, 3, 6, 10, 1, 6, 5, 5, 2, 6]
- author count = 11
- mean source count per author with shared sources = 14.818181818181818
- mean shared count per author with shared sources = 5.090909090909091
- mean article count per author = 5.181818181818182
Notebooks:
TODO:
DONE:
updated ArticleSelectForm and PersonSelectForm to include fields for "coder_id_priority_list"/"person_coder_id_priority_list".
created method NetworkOutput.get_coder_id_list() that:
if prioritized list is present:
updated NetworkOutput.create_query_set() to use get_coder_id_list() method.
need to update NetworkOutput.remove_duplicate_article_data() - it is where we choose which Article_Data to omit per article when there are duplicates. Need to go in order of the priority list. Might already do this... Nope.
Need to test
person-coded articles:
look for differences in:
automated coder:
as long as the tests above check out, then try out the whole month, with prioritized coder list.
need to update NetworkDataOutput and children? Looks like no - all comes down to the remove_duplicate_article_data().
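The de-duplication rule described above (keep one Article_Data per article, preferring the coder highest in the priority list) can be sketched independently of the Django models. Names here are hypothetical illustrations, not the actual NetworkOutput API:

```python
def pick_by_coder_priority( article_data_list, coder_id_priority_list ):
    """From a list of ( coder_id, article_data ) pairs for a single article,
    keep the pair whose coder_id appears earliest in coder_id_priority_list.
    Pairs whose coder is not in the priority list lose to any pair whose
    coder is.  Hypothetical sketch, not NetworkOutput.remove_duplicate_article_data()."""
    def priority_rank( pair ):
        coder_id = pair[ 0 ]
        if coder_id in coder_id_priority_list:
            return coder_id_priority_list.index( coder_id )
        # unknown coders sort after all prioritized ones.
        return len( coder_id_priority_list )
    return min( article_data_list, key = priority_rank )
```

For example, with priority list [ 9, 13, 8 ], an article coded by coders 8 and 13 keeps the Article_Data from coder 13.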
figure out the old network analysis stuff.
context_text/examples/analysis/analysis-person_info.py
moved into context_analysis and working with new index ordering code (might be enough to just extend Reliability_Names_Builder for init and for index specification, then override process_articles()).