2017.12.07 - work log - prelim - R - grp_month - sna-author_info.r

Setup

Setup - working directories

Store important directories and file names in variables:


In [3]:
getwd()


'/home/jonathanmorgan/work/django/research/work/phd_work'

In [4]:
# code files (in particular SNA function library, modest though it may be)
code_directory <- "/home/jonathanmorgan/work/django/research/context_analysis/R/sna"
sna_function_file_path <- paste( code_directory, "/", 'functions-sna.r', sep = "" )

# home directory - default to the current working directory, then override
#     with the explicit phd_work/methods path.
home_directory <- getwd()
home_directory <- "/home/jonathanmorgan/work/django/research/work/phd_work/methods"

# data directories
data_directory <- paste( home_directory, "/data", sep = "" )
workspace_file_name <- "statnet-grp_month.RData"
workspace_file_path <- paste( data_directory, "/", workspace_file_name, sep = "" )

# output workspace
output_workspace_file_name <- workspace_file_name
output_workspace_file_path <- paste( data_directory, "/", output_workspace_file_name, sep = "" )
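
As an aside, R's built-in file.path() assembles these same paths without any need to manage separators by hand. A minimal sketch, using the variables defined above:

# equivalent path construction with file.path() (no sep argument needed)
workspace_file_path <- file.path( data_directory, workspace_file_name )
output_workspace_file_path <- file.path( data_directory, output_workspace_file_name )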

In [5]:
# set working directory to data directory for now.
setwd( data_directory )
getwd()


'/home/jonathanmorgan/work/django/research/work/phd_work/data'

Setup - load workspace

The original script assumed you would source it after running the prerequisite scripts. Here, that other work was done in a separate notebook, so we need to reload the workspace in which it was saved:
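
A slightly more defensive variant (just a sketch, using the workspace_file_path built above) checks that the file exists before loading, to catch working-directory mix-ups early:

# guard against a missing or mis-built workspace path before loading
if ( file.exists( workspace_file_path ) ) {
    load( workspace_file_path )
} else {
    stop( paste( "Workspace file not found: ", workspace_file_path, sep = "" ) )
}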


In [6]:
# assumes that you've already set the working directory above to the
#     data directory.
setwd( data_directory )
message( paste( "Loading workspace : ", workspace_file_name, sep = "" ) )
load( workspace_file_name )


Loading workspace : statnet-grp_month.RData

Setup - put data in expected variables

Load original network data dataframes into humanNetworkDataDF and automatedNetworkDataDF.


In [7]:
# in the statnet files, the original automated data was loaded into gmAutomatedDataDF
automatedNetworkDataDF <- gmAutomatedDataDF

# original human data was loaded into gmHumanDataDF
humanNetworkDataDF <- gmHumanDataDF

Calculate Author Info.

This notebook is based on the original file context_text/R/sna/sna-author_info.r.

Notes:

  • humanNetworkDataDF is the original human-coded network data frame (gmHumanDataDF, etc.).
  • automatedNetworkDataDF is the original automated network data frame (gmAutomatedDataDF, etc.).
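
The human and automated blocks in the next cell each compute the same six summary quantities. A small helper like the following could compute them in one place; this is only a sketch, and it assumes the degree, meanTieWeightGE0, meanTieWeightGE1, and maxTieWeight columns created by the network-stats scripts are present:

# summarize one subset of the network data frame (e.g., all authors, or only
#     authors with shared sources).
summarizeAuthors <- function( networkDataDF ) {
    list(
        count = nrow( networkDataDF ),
        meanDegree = mean( networkDataDF$degree ),
        maxDegree = max( networkDataDF$degree ),
        meanTieWeightGE0 = mean( networkDataDF$meanTieWeightGE0 ),
        meanTieWeightGE1 = mean( networkDataDF$meanTieWeightGE1 ),
        maxTieWeight = max( networkDataDF$maxTieWeight )
    )
}

# example - same values as the humanAuthors* variables below:
# summarizeAuthors( humanNetworkDataDF[ humanNetworkDataDF$person_type %in% c( 2, 4 ), ] )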

In [8]:
# For this to work, you'll need to have run either of the following, including
#    all of the prerequisite files listed in each file:
#    - context_text/R/igraph/sna-igraph-network_stats.r
#    - context_text/R/statnet/sna-statnet-network_stats.r
# Also assumes that you haven't re-ordered the <type>NetworkData data frames.

#==============================================================================#
# information for all authors - person_type = 2 (reporter) or 4 (both source and reporter)
#==============================================================================#

# person_type = 2 (reporter) or 4 (both source and reporter)

# human - all authors
humanAuthorsNetworkData <- humanNetworkDataDF[ humanNetworkDataDF$person_type == 2 | humanNetworkDataDF$person_type == 4, ]
humanAuthorsCount <- nrow( humanAuthorsNetworkData )
humanAuthorsMeanDegree <- mean( humanAuthorsNetworkData$degree )
humanAuthorsMaxDegree <- max( humanAuthorsNetworkData$degree )
humanAuthorsMeanTieWeightGE0 <- mean( humanAuthorsNetworkData$meanTieWeightGE0 )
humanAuthorsMeanTieWeightGE1 <- mean( humanAuthorsNetworkData$meanTieWeightGE1 )
humanAuthorsMaxTieWeight <- max( humanAuthorsNetworkData$maxTieWeight )

# automated - all authors
automatedAuthorsNetworkData <- automatedNetworkDataDF[ automatedNetworkDataDF$person_type == 2 | automatedNetworkDataDF$person_type == 4, ]
automatedAuthorsCount <- nrow( automatedAuthorsNetworkData )
automatedAuthorsMeanDegree <- mean( automatedAuthorsNetworkData$degree )
automatedAuthorsMaxDegree <- max( automatedAuthorsNetworkData$degree )
automatedAuthorsMeanTieWeightGE0 <- mean( automatedAuthorsNetworkData$meanTieWeightGE0 )
automatedAuthorsMeanTieWeightGE1 <- mean( automatedAuthorsNetworkData$meanTieWeightGE1 )
automatedAuthorsMaxTieWeight <- max( automatedAuthorsNetworkData$maxTieWeight )

#==============================================================================#
# Generate information on individual reporters who have shared sources (subset
#    of all authors).
#==============================================================================#

# human - subsetting based on position of authors who had shared sources.
#humanAuthorsSharedNetworkData <- humanNetworkDataDF[ c( 3, 6, 9, 11, 12, 13, 14, 16, 21, 43, 44, 63, 169, 310 ), ]

# subsetting based on person IDs.
humanAuthorsSharedIDs <- c( 387, 2310, 394, 13, 3, 46, 23, 29, 30, 36, 425, 302, 178, 437, 1082, 443, 377, 66, 69, 161, 73, 74, 460, 591, 336, 84, 598, 599, 217, 223, 937, 1655, 332, 505 )
humanAuthorsSharedNetworkData <- humanNetworkDataDF[ humanNetworkDataDF$person_id %in% humanAuthorsSharedIDs , ]

# human - make data
humanAuthorsSharedCount <- nrow( humanAuthorsSharedNetworkData )
humanAuthorsSharedMeanDegree <- mean( humanAuthorsSharedNetworkData$degree )
humanAuthorsSharedMaxDegree <- max( humanAuthorsSharedNetworkData$degree )
humanAuthorsSharedMeanTieWeightGE0 <- mean( humanAuthorsSharedNetworkData$meanTieWeightGE0 )
humanAuthorsSharedMeanTieWeightGE1 <- mean( humanAuthorsSharedNetworkData$meanTieWeightGE1 )
humanAuthorsSharedMaxTieWeight <- max( humanAuthorsSharedNetworkData$maxTieWeight )

# automated - subsetting based on position of authors who had shared sources.
#automatedAuthorsSharedNetworkData <- automatedNetworkDataDF[ c( 3, 6, 9, 11, 12, 13, 16, 21, 44, 63, 169, 310 ), ]

# subsetting based on person IDs.
automatedAuthorsSharedIDs <- c( 387, 2310, 394, 13, 3, 46, 23, 30, 36, 425, 2614, 302, 178, 437, 1082, 443, 377, 66, 69, 161, 73, 74, 460, 591, 336, 84, 598, 599, 217, 223, 1655, 332, 505 )
automatedAuthorsSharedNetworkData <- automatedNetworkDataDF[ automatedNetworkDataDF$person_id %in% automatedAuthorsSharedIDs , ]

# automated - make data
automatedAuthorsSharedCount <- nrow( automatedAuthorsSharedNetworkData )
automatedAuthorsSharedMeanDegree <- mean( automatedAuthorsSharedNetworkData$degree )
automatedAuthorsSharedMaxDegree <- max( automatedAuthorsSharedNetworkData$degree )
automatedAuthorsSharedMeanTieWeightGE0 <- mean( automatedAuthorsSharedNetworkData$meanTieWeightGE0 )
automatedAuthorsSharedMeanTieWeightGE1 <- mean( automatedAuthorsSharedNetworkData$meanTieWeightGE1 )
automatedAuthorsSharedMaxTieWeight <- max( automatedAuthorsSharedNetworkData$maxTieWeight )
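
#------------------------------------------------------------------------------#
# sanity check (a sketch, not part of the original script): confirm that every
#    hand-entered person ID above matched a row in the network data.  Each of
#    these vectors should come back empty (length 0).
#------------------------------------------------------------------------------#

missingHumanSharedIDs <- setdiff( humanAuthorsSharedIDs, humanAuthorsSharedNetworkData$person_id )
missingAutomatedSharedIDs <- setdiff( automatedAuthorsSharedIDs, automatedAuthorsSharedNetworkData$person_id )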

#==============================================================================#
# Do some regression to see if article or source count predict source sharing.
#==============================================================================#

#------------------------------------------------------------------------------#
# first, set up data frames (from results of running python script:
#    context_text/examples/analysis/analysis-person_info.py)
#------------------------------------------------------------------------------#

# human coder (index 1), all authors.
humanIdVector <- c( 387, 2310, 2567, 394, 652, 13, 654, 3, 46, 23, 2004, 29, 30, 417, 36, 425, 2614, 302, 178, 437, 566, 1082, 443, 377, 66, 69, 161, 73, 74, 460, 482, 591, 336, 84, 598, 599, 217, 223, 736, 2018, 743, 937, 1782, 1655, 332, 505, 703, 637 )
humanSourceCountsVector <- c( 18, 2, 0, 33, 9, 36, 3, 27, 57, 31, 4, 50, 28, 4, 31, 30, 5, 31, 41, 45, 4, 13, 43, 36, 92, 43, 37, 30, 46, 3, 1, 76, 9, 64, 21, 50, 46, 18, 2, 5, 2, 7, 4, 6, 7, 18, 2, 13 )
humanSharedCountsVector <- c( 7, 2, 0, 2, 0, 1, 0, 9, 22, 12, 0, 2, 2, 0, 9, 2, 0, 3, 6, 13, 0, 5, 9, 10, 37, 19, 12, 10, 5, 1, 0, 6, 2, 19, 5, 4, 13, 9, 0, 0, 0, 7, 0, 6, 1, 1, 0, 0 )
humanArticleCountsVector <- c( 7, 1, 1, 8, 5, 17, 1, 13, 21, 15, 2, 18, 13, 4, 11, 10, 1, 12, 13, 15, 1, 8, 16, 17, 30, 15, 14, 12, 19, 4, 1, 25, 4, 27, 9, 17, 18, 6, 1, 1, 1, 1, 4, 1, 4, 8, 2, 4 )
humanAuthorsDF <- data.frame( humanIdVector, humanSourceCountsVector, humanSharedCountsVector, humanArticleCountsVector )
names( humanAuthorsDF ) <- c( "authorID", "sourceCount", "sharedCount", "articleCount" )

# human coder, only authors with shared sources.
humanSharedIdVector <- c( 387, 2310, 394, 13, 3, 46, 23, 29, 30, 36, 425, 302, 178, 437, 1082, 443, 377, 66, 69, 161, 73, 74, 460, 591, 336, 84, 598, 599, 217, 223, 937, 1655, 332, 505 )
humanSharedSourceCountsVector <- c( 18, 2, 33, 36, 27, 57, 31, 50, 28, 31, 30, 31, 41, 45, 13, 43, 36, 92, 43, 37, 30, 46, 3, 76, 9, 64, 21, 50, 46, 18, 7, 6, 7, 18 )
humanSharedSharedCountsVector <- c( 7, 2, 2, 1, 9, 22, 12, 2, 2, 9, 2, 3, 6, 13, 5, 9, 10, 37, 19, 12, 10, 5, 1, 6, 2, 19, 5, 4, 13, 9, 7, 6, 1, 1 )
humanSharedArticleCountsVector <- c( 7, 1, 8, 17, 13, 21, 15, 18, 13, 11, 10, 12, 13, 15, 8, 16, 17, 30, 15, 14, 12, 19, 4, 25, 4, 27, 9, 17, 18, 6, 1, 1, 4, 8 )
humanSharedDF <- data.frame( humanSharedIdVector, humanSharedSourceCountsVector, humanSharedSharedCountsVector, humanSharedArticleCountsVector )
names( humanSharedDF ) <- c( "authorID", "sourceCount", "sharedCount", "articleCount" )

# computer coder, all authors.
automatedIdVector <- c( 387, 2310, 2567, 394, 652, 13, 654, 3, 46, 23, 2004, 29, 30, 417, 36, 425, 2614, 302, 178, 437, 566, 1082, 443, 377, 66, 69, 161, 73, 74, 460, 482, 591, 336, 84, 598, 599, 217, 223, 736, 2018, 743, 1782, 1655, 332, 505, 703, 637 )
automatedSourceCountsVector <- c( 18, 2, 0, 27, 8, 39, 2, 29, 46, 33, 4, 50, 26, 4, 28, 31, 6, 31, 42, 49, 2, 15, 43, 34, 88, 45, 34, 28, 46, 4, 1, 72, 9, 69, 22, 46, 43, 13, 2, 5, 2, 4, 6, 7, 14, 2, 10 )
automatedSharedCountsVector <- c( 7, 2, 0, 2, 0, 1, 0, 12, 13, 11, 0, 0, 2, 0, 7, 3, 1, 4, 8, 10, 0, 7, 8, 9, 35, 19, 11, 10, 4, 1, 0, 6, 1, 20, 7, 3, 11, 8, 0, 0, 0, 0, 6, 1, 1, 0, 0 )
automatedArticleCountsVector <- c( 7, 1, 1, 8, 5, 17, 1, 13, 20, 15, 2, 18, 13, 4, 11, 10, 1, 12, 13, 15, 1, 8, 16, 17, 30, 15, 14, 12, 19, 4, 1, 25, 4, 27, 9, 17, 18, 6, 1, 1, 1, 4, 1, 4, 8, 2, 4 )
automatedAuthorsDF <- data.frame( automatedIdVector, automatedSourceCountsVector, automatedSharedCountsVector, automatedArticleCountsVector )
names( automatedAuthorsDF ) <- c( "authorID", "sourceCount", "sharedCount", "articleCount" )

# computer coder, only authors with shared sources.
automatedSharedIdVector <- c( 387, 2310, 394, 13, 3, 46, 23, 30, 36, 425, 2614, 302, 178, 437, 1082, 443, 377, 66, 69, 161, 73, 74, 460, 591, 336, 84, 598, 599, 217, 223, 1655, 332, 505 )
automatedSharedSourceCountsVector <- c( 18, 2, 27, 39, 29, 46, 33, 26, 28, 31, 6, 31, 42, 49, 15, 43, 34, 88, 45, 34, 28, 46, 4, 72, 9, 69, 22, 46, 43, 13, 6, 7, 14 )
automatedSharedSharedCountsVector <- c( 7, 2, 2, 1, 12, 13, 11, 2, 7, 3, 1, 4, 8, 10, 7, 8, 9, 35, 19, 11, 10, 4, 1, 6, 1, 20, 7, 3, 11, 8, 6, 1, 1 )
automatedSharedArticleCountsVector <- c( 7, 1, 8, 17, 13, 20, 15, 13, 11, 10, 1, 12, 13, 15, 8, 16, 17, 30, 15, 14, 12, 19, 4, 25, 4, 27, 9, 17, 18, 6, 1, 4, 8 )
automatedSharedDF <- data.frame( automatedSharedIdVector, automatedSharedSourceCountsVector, automatedSharedSharedCountsVector, automatedSharedArticleCountsVector )
names( automatedSharedDF ) <- c( "authorID", "sourceCount", "sharedCount", "articleCount" )
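
# sanity check (a sketch, not part of the original script): the hand-pasted
#    vectors from the python output must all be the same length, or the
#    data.frame() calls above would fail or silently recycle values.
stopifnot(
    length( humanSourceCountsVector ) == length( humanIdVector ),
    length( humanSharedCountsVector ) == length( humanIdVector ),
    length( humanArticleCountsVector ) == length( humanIdVector ),
    length( automatedSourceCountsVector ) == length( automatedIdVector ),
    length( automatedSharedCountsVector ) == length( automatedIdVector ),
    length( automatedArticleCountsVector ) == length( automatedIdVector )
)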

#------------------------------------------------------------------------------#
# regression
#------------------------------------------------------------------------------#

# all human-coded authors:
humanLmResults <- lm( sharedCount ~ sourceCount + articleCount, data = humanAuthorsDF )
humanLmResults

# all computer-coded authors:
automatedLmResults <- lm( sharedCount ~ sourceCount + articleCount, data = automatedAuthorsDF )
automatedLmResults

#------------------------------------------------------------------------------#
# means of counts from python file
#------------------------------------------------------------------------------#

# Article Count
humanAuthorsMeanArticleCount <- mean( humanAuthorsDF$articleCount )
humanAuthorsSharedMeanArticleCount <- mean( humanSharedDF$articleCount )
automatedAuthorsMeanArticleCount <- mean( automatedAuthorsDF$articleCount )
automatedAuthorsSharedMeanArticleCount <- mean( automatedSharedDF$articleCount )

# Source Count
humanAuthorsMeanSourceCount <- mean( humanAuthorsDF$sourceCount )
humanAuthorsSharedMeanSourceCount <- mean( humanSharedDF$sourceCount )
automatedAuthorsMeanSourceCount <- mean( automatedAuthorsDF$sourceCount )
automatedAuthorsSharedMeanSourceCount <- mean( automatedSharedDF$sourceCount )

# Shared Count
humanAuthorsMeanSharedCount <- mean( humanAuthorsDF$sharedCount )
humanAuthorsSharedMeanSharedCount <- mean( humanSharedDF$sharedCount )
automatedAuthorsMeanSharedCount <- mean( automatedAuthorsDF$sharedCount )
automatedAuthorsSharedMeanSharedCount <- mean( automatedSharedDF$sharedCount )


Call:
lm(formula = sharedCount ~ sourceCount + articleCount, data = humanAuthorsDF)

Coefficients:
 (Intercept)   sourceCount  articleCount  
    -0.54250       0.24345       0.02411  
Call:
lm(formula = sharedCount ~ sourceCount + articleCount, data = automatedAuthorsDF)

Coefficients:
 (Intercept)   sourceCount  articleCount  
    -0.42119       0.22519       0.03037  
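
Printing an lm object only shows the coefficients. To judge whether source count and article count actually predict source sharing, it helps to look at the full fit (standard errors, p-values, R-squared) via summary(); a minimal follow-up sketch:

# full regression output for both coders
summary( humanLmResults )
summary( automatedLmResults )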

In [9]:
#------------------------------------------------------------------------------#
# output
#------------------------------------------------------------------------------#

message( "====> HUMAN - all authors" )
message( paste( "human author count = ", humanAuthorsCount, sep = "" ) )
message( paste( "human author mean degree = ", humanAuthorsMeanDegree, sep = "" ) )
message( paste( "human author max degree = ", humanAuthorsMaxDegree, sep = "" ) )
message( paste( "human author mean tie weight GE0 = ", humanAuthorsMeanTieWeightGE0, sep = "" ) )
message( paste( "human author mean tie weight GE1 = ", humanAuthorsMeanTieWeightGE1, sep = "" ) )
message( paste( "human author max tie weight = ", humanAuthorsMaxTieWeight, sep = "" ) )
message( paste( "human author mean article count = ", humanAuthorsMeanArticleCount, sep = "" ) )
message( paste( "human author mean source count = ", humanAuthorsMeanSourceCount, sep = "" ) )
message( paste( "human author mean shared count = ", humanAuthorsMeanSharedCount, sep = "" ) )
message( "" )
message( "" )

message( "====> HUMAN - authors with shared sources" )
message( paste( "human shared count = ", humanAuthorsSharedCount, sep = "" ) )
message( paste( "human shared mean degree = ", humanAuthorsSharedMeanDegree, sep = "" ) )
message( paste( "human shared max degree = ", humanAuthorsSharedMaxDegree, sep = "" ) )
message( paste( "human shared mean tie weight GE0 = ", humanAuthorsSharedMeanTieWeightGE0, sep = "" ) )
message( paste( "human shared mean tie weight GE1 = ", humanAuthorsSharedMeanTieWeightGE1, sep = "" ) )
message( paste( "human shared max tie weight = ", humanAuthorsSharedMaxTieWeight, sep = "" ) )
message( paste( "human shared mean article count = ", humanAuthorsSharedMeanArticleCount, sep = "" ) )
message( paste( "human shared mean source count = ", humanAuthorsSharedMeanSourceCount, sep = "" ) )
message( paste( "human shared mean shared count = ", humanAuthorsSharedMeanSharedCount, sep = "" ) )
message( "regression results:" )
print( humanLmResults )
message( "" )
message( "" )

message( "====> AUTOMATED - all authors" )
message( paste( "automated author count = ", automatedAuthorsCount, sep = "" ) )
message( paste( "automated author mean degree = ", automatedAuthorsMeanDegree, sep = "" ) )
message( paste( "automated author max degree = ", automatedAuthorsMaxDegree, sep = "" ) )
message( paste( "automated author mean tie weight GE0 = ", automatedAuthorsMeanTieWeightGE0, sep = "" ) )
message( paste( "automated author mean tie weight GE1 = ", automatedAuthorsMeanTieWeightGE1, sep = "" ) )
message( paste( "automated author max tie weight = ", automatedAuthorsMaxTieWeight, sep = "" ) )
message( paste( "automated author mean article count = ", automatedAuthorsMeanArticleCount, sep = "" ) )
message( paste( "automated author mean source count = ", automatedAuthorsMeanSourceCount, sep = "" ) )
message( paste( "automated author mean shared count = ", automatedAuthorsMeanSharedCount, sep = "" ) )
message( "" )
message( "" )

message( "====> AUTOMATED - authors with shared sources" )
message( paste( "automated shared count = ", automatedAuthorsSharedCount, sep = "" ) )
message( paste( "automated shared mean degree = ", automatedAuthorsSharedMeanDegree, sep = "" ) )
message( paste( "automated shared max degree = ", automatedAuthorsSharedMaxDegree, sep = "" ) )
message( paste( "automated shared mean tie weight GE0 = ", automatedAuthorsSharedMeanTieWeightGE0, sep = "" ) )
message( paste( "automated shared mean tie weight GE1 = ", automatedAuthorsSharedMeanTieWeightGE1, sep = "" ) )
message( paste( "automated shared max tie weight = ", automatedAuthorsSharedMaxTieWeight, sep = "" ) )
message( paste( "automated shared mean article count = ", automatedAuthorsSharedMeanArticleCount, sep = "" ) )
message( paste( "automated shared mean source count = ", automatedAuthorsSharedMeanSourceCount, sep = "" ) )
message( paste( "automated shared mean shared count = ", automatedAuthorsSharedMeanSharedCount, sep = "" ) )
message( "regression results:" )
print( automatedLmResults )
message( "" )
message( "" )


====> HUMAN - all authors
human author count = 48
human author mean degree = 25.3958333333333
human author max degree = 99
human author mean tie weight GE0 = 0.0238860325621251
human author mean tie weight GE1 = 1.06601763268533
human author max tie weight = 8
human author mean article count = 9.54166666666667
human author mean source count = 24.6458333333333
human author mean shared count = 5.6875


====> HUMAN - authors with shared sources
human shared count = 34
human shared mean degree = 34.1470588235294
human shared max degree = 99
human shared mean tie weight GE0 = 0.0322092847421745
human shared mean tie weight GE1 = 1.10463927228779
human shared max tie weight = 8
human shared mean article count = 12.6176470588235
human shared mean source count = 33.0882352941176
human shared mean shared count = 8.02941176470588
regression results:
Call:
lm(formula = sharedCount ~ sourceCount + articleCount, data = humanAuthorsDF)

Coefficients:
 (Intercept)   sourceCount  articleCount  
    -0.54250       0.24345       0.02411  


====> AUTOMATED - all authors
automated author count = 47
automated author mean degree = 24.7872340425532
automated author max degree = 93
automated author mean tie weight GE0 = 0.0230997830407118
automated author mean tie weight GE1 = 1.06408773176287
automated author max tie weight = 8
automated author mean article count = 9.70212765957447
automated author mean source count = 24.2765957446809
automated author mean shared count = 5.34042553191489


====> AUTOMATED - authors with shared sources
automated shared count = 33
automated shared mean degree = 32.3939393939394
automated shared max degree = 93
automated shared mean tie weight GE0 = 0.0301472306613695
automated shared mean tie weight GE1 = 1.0977916179653
automated shared max tie weight = 8
automated shared mean article count = 12.4242424242424
automated shared mean source count = 31.6666666666667
automated shared mean shared count = 7.60606060606061
regression results:
Call:
lm(formula = sharedCount ~ sourceCount + articleCount, data = automatedAuthorsDF)

Coefficients:
 (Intercept)   sourceCount  articleCount  
    -0.42119       0.22519       0.03037  
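
Since every quantity above is computed for both coders, a small comparison table makes the human-versus-automated differences easier to scan. A sketch, assuming the variables computed in the cells above are still in the workspace:

# side-by-side comparison of the all-author summaries
authorComparisonDF <- data.frame(
    statistic = c( "count", "mean degree", "max degree",
                   "mean tie weight GE0", "mean tie weight GE1",
                   "max tie weight", "mean article count",
                   "mean source count", "mean shared count" ),
    human = c( humanAuthorsCount, humanAuthorsMeanDegree, humanAuthorsMaxDegree,
               humanAuthorsMeanTieWeightGE0, humanAuthorsMeanTieWeightGE1,
               humanAuthorsMaxTieWeight, humanAuthorsMeanArticleCount,
               humanAuthorsMeanSourceCount, humanAuthorsMeanSharedCount ),
    automated = c( automatedAuthorsCount, automatedAuthorsMeanDegree, automatedAuthorsMaxDegree,
                   automatedAuthorsMeanTieWeightGE0, automatedAuthorsMeanTieWeightGE1,
                   automatedAuthorsMaxTieWeight, automatedAuthorsMeanArticleCount,
                   automatedAuthorsMeanSourceCount, automatedAuthorsMeanSharedCount )
)
print( authorComparisonDF )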


Save workspace image

Save everything in the current workspace image, in case we need or want it later. Note that the output file name is the same as the input workspace file name, so this overwrites the workspace loaded at the top of this notebook.


In [10]:
# help( save.image )
message( paste( "Output workspace to: ", output_workspace_file_name, sep = "" ) )
save.image( file = output_workspace_file_name )


Output workspace to: statnet-grp_month.RData
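
If you would rather leave the input workspace untouched, one option (only a sketch; the pattern and file name below are examples, not part of the original script) is to save just the author-info objects to their own file:

# save only the author-info results, leaving the input workspace as-is.
author_info_objects <- grep( "Authors|Shared|LmResults", ls(), value = TRUE )
save( list = author_info_objects, file = "statnet-grp_month-author_info.RData" )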