1 Overview

1.1 TITLE ‒ Introduction to social networking analysis of the Full Disclosure email network

1.2 DESCRIPTION ‒ A security email list includes useful information for predictive analysis of emerging threats, but this information may be lost alongside irrelevant discussion threads. Social network analysis provides one method to identify and filter out unwanted messages. This introductory notebook develops techniques to concentrate on the core content of the email network studied.

1.3 ABSTRACT ‒ This notebook begins with preliminary analysis and exploration of the Full Disclosure email network, developing a general sense of the network's structure over time and within a single year. The general overview reveals that the email list changed in character over time, growing in activity in early year but declining sharply in participation in recent years. Important features of the network, particularly its "firework" structures, are noted and discussed, leading to a working hypothesis: that the firework clusters represent prolific authors whose emails do not generate much list discussion, and that these threads that are not useful for the PERCEIVE project's study of emerging threats, as they represent content like security advisories. Manual inspection of nodes confirms this hypothesis.

In order to concentrate on actual discussion topics within the list, programmatic methods are developed to filter out these undesired firework structures. Once filtered, the network is amenable to community detection, which in turn points to the need for further filtering. This general process -- filtering the network, identifying subcommunities, filtering again, and identifying subcommunites again -- emerges as a useful iterative procedure to find genuine discussion communities.

2 Introduction

The PERCEIVE project seeks to proactively identify upcoming cybersecurity threats through textual similarity. Social network analysis can be used to partition a network and evaluate its textual content, and hence provide word-selection criteria for defining corpus documents.

The subject of this notebook, the Full Disclosure (FD) mailing list is a "public, vendor-neutral forum for detailed discussion of vulnerabilities and exploitation techniques, as well as tools, papers, news, and events of interest to the community."

2.1 Problem Statement

Although cybersecurity email mailing lists provide a rich source to identify emerging concepts, they contain a large amount of content that is irrelevant to the project's purpose, including but not limited to:

conference invitations[1]
vendor announcements[2]
extensive conversations on security topics[3]
a significant amount of trolling[4]
and nonsense[5] [6].

As some of this irrelevant content can be strictly tied to its producer, i.e., the authors of the e-mail replies, social network analysis provides an interesting opportunity for the isolation of relevant discussion topics in order to filter non-related vulnerability content.

2.2 Network Definition

Earlier in this project, the FD email lists were developed into networks of edges and nodes, divided by year. These original csv files are available online.

In the FD graphs, nodes represent either documents (i.e., emails) or authors, as identified by a nodeType attribute. Edges are directed, representing authorship and replies; edge weight indicates an increasing number of replies.

The following screenshots show sample data from the edge and node lists for 2008. The edge lists provide three attributes: source (an author who posted an email), target (the document written), and weight (the number of times the author wrote back to the resulting thread). The node list attributes include an id (the author or the email document), a label (email address of an author or email subject for a document), Label1 (response dates if applicable), Label2 (URLs for the the documents), hexadecimal Color of red (#FF00000) or black (#000000) based on nodeType of author or document.

2008 edge list sample	2008 node list sample

2.3 Tools used

2.3.1 Gephi

While Gephi proved useful for data exploration, some difficulties arose. In particular, many of the core functions[7] and plugins[8] once used for social network analysis are not compatible with current (0.9) software versions. Gephi specifications note that the 0.9.0 version (December 2015) "Removed [the] Clustering module"[9] that these plugins relied upon.

2.3.2 Python-igraph

Python-igraph (igraph hereafter) allows for programmatic manipulation of network data. Igraph can generate images via a plot function, but this is a slow process for large graphs. Igraph has a wide variety of import and export options for network data, including straightforward export of subgraphs.

2.3.3 Interoperability of Gephi and igraph

While Gephi can export graphs, its export implementations include temporary node properties like position and color attributes. Igraph can import Gephi-exported GraphML files (and those are available here. However, it was more useful in the long run to generate graphs in igraph directly from the provided edge and node lists. Igraph's exported GraphML files functioned as expected in Gephi.

2.4 Generating graphs from edge and node lists

In Gephi, graphs were generated via the "import spreadsheet" function, and separate Gephi projects saved for each year. Exported versions of these graphs are available in the supplementary zip file.

In igraph, graphs are generated via importing csv data into lists, then using igraph's DictReader function.



In [1]:

    
from igraph import Graph, summary
from csv import DictReader

# 2008 will be the sample year for this introductory analysis, but let's define it as a variable
# for future development of functions that will work for all years
year = '2008'  

# import edge list and vertex list as lists of dictionaries:
e = []
with open('data/input/author_threadID_edgelist_' + year + '.csv') as csvfile:
    reader = DictReader(csvfile, dialect='excel')
    reader.fieldnames = [name.lower() for name in reader.fieldnames]  # igraph expects lowercase attribute names
    for row in reader:
        row['weight'] = int(row['weight'])  # convert weight from string to int
        e.append(row)
v = []
with open('data/input/nodelist_aut_doc_' + year +'.csv') as csvfile:
    reader = DictReader(csvfile, dialect='excel')
    reader.fieldnames = [name.lower() for name in reader.fieldnames]  # igraph expects lowercase attribute names
    for row in reader:
        v.append(row)

# build a graph from the lists; see http://igraph.org/python/doc/igraph.Graph-class.html#DictList
ml = Graph.DictList(vertices=v, edges=e, directed=True, vertex_name_attr='id')  # specify the 'id' attribute here
                                                                                # because igraph anticipates a 'name' instead

ml['name'] = year + ' Full Disclosure network'  # provide a name for this graph

summary(ml)  # list properties and attributes









    



IGRAPH D-W- 9793 12073 -- 2008 Full Disclosure network
+ attr: name (g), color (v), id (v), label (v), label1 (v), label2 (v), nodetype (v), source (e), target (e), weight (e)

The summary command providing the output seen above is explained in the igraph documentation[10]. In this example, the four-character code "D-W-" indicates that the graph is directed and weighted. The graph for 2008 has 9793 vertices (nodes) and 12073 edges.

The list of attributes in the summary ("name", "color", etc.) are those for the graph (g), vertices (v), or edges (e).

In the remainder of this notebook, 2008 will be used for single-year analysis.

3 Preliminary overview

3.1 Visual assessment via Gephi-generated images

Importing the entire graphs for each year into Gephi allowed for early visual inspection and exploration of the FD lists.

The following table includes full-network Gephi graphics for each year of the email list's existence; blue nodes represent authors and red nodes represent documents. Click through for full-size renderings.

3.2 The expansion and contraction of Full Disclosure

It is clear from a quick look at the graphs that the overall pattern of the Full Disclosure network has shifted over time. In the early years of the list, there were tightly clustered conversations, particularly from about 2003-2006. As the years went by, the level of discussion seems to have dropped. Some of this pattern can be inferred simply from the file sizes of the edge and node lists, or by counting the nodes.

The visual impression is further confirmed by a documented change in the FD network in March, 2014, when one of the original managers of the list decided to halt the list entirely[11] but another team soon revived it[12].

Following the 2014 format change, the number of edges drops significantly, and the list displays a different network structure from the earlier years. Currently, the list remains active but few members respond to published posts.

3.3 "Firework" patterns

Visual analysis of the graphs for each year showed distinct fireworks patterns throughout the lifecycle of the list. These are cases where one author is posting to the list many times, with little or no response from other people on the list, resulting in a hub-and-spoke graph form. The working hypothesis is that these firework patterns generally represent security advisories. If so, these are discussions known issues, and not useful for the project's overall predictive goals.

These fireworks were analyzed both visually, in Gephi, and programmatically, in igraph, to test the hypothesis.

In Gephi, manual examination included use of the Linkfluence plugin to launch URLs in a web browser directly from nodes studied. This node-by-node analysis demonstrated that firework nodes were security advisories[13][14][15].

In igraph, it was informative to assess each author node and examine the in-degree of its connected (document) nodes. In cases where all (or most) of an author's neighbors have an in-degree of one, the author is the hub of a firework.



In [2]:

    
print('List of authors and their connected (document) vertices, among first 45 in total list,\nlimited to those with neighbors with max in-degree < 2')
for node in ml.vs(nodetype_eq='author')[0:45]:
    indegrees = [ml.degree(_, mode='in') for _ in ml.neighbors(node)]
    if indegrees:  # a few authors have 0 neighbors... 
        if max(indegrees) < 2:
            print()
            print('-' * 55)
            print(node['label'], '\n')
            print('  In-degree\t Label')
            print('  ---------\t -----')
            for _ in ml.neighbors(node):
                print(' ', ml.degree(_, mode='in'), '\t\t', ml.vs[_]['label'][0:60] )
            print('  MAX:', max(indegrees))
            print('  AVERAGE:', sum(indegrees) / max(len(indegrees), 1))









    



List of authors and their connected (document) vertices, among first 45 in total list,
limited to those with neighbors with max in-degree < 2

-------------------------------------------------------
andur matrix <andurmatrix () gmail com> 

  In-degree	 Label
  ---------	 -----
  1 		 Re: [OOT] Thesis for master degree
  MAX: 1
  AVERAGE: 1.0

-------------------------------------------------------
Matousec - Transparent security Research <research () matousec com> 

  In-degree	 Label
  ---------	 -----
  1 		 Kerio Fake 'iphlpapi' DLL injection Vulnerability
  1 		 Outpost Bypassing Self-Protection using file	links Vulnerabi
  1 		 Comodo Multiple insufficient argument validation of hooked S
  1 		 Comodo DLL injection via weak hash function	exploitation Vul
  1 		 Comodo Bypassing settings protection using magic	pipe Vulner
  1 		 Norton Insufficient validation of 'SymTDI' driver	input buff
  1 		 Norton Multiple insufficient argument validation of hooked S
  1 		 ZoneAlarm Multiple insufficient argument validation of hooke
  1 		 ZoneAlarm Insufficient validation of 'vsdatant' driver input
  1 		 Bypassing PFW/HIPS open process control with	uncommon identi
  1 		 Outpost Enforcing system reboot with	'outpost_ipc_hdr' mutex
  1 		 Kaspersky Multiple insufficient argument validation of hooke
  1 		 Plague in (security) software drivers & BSDOhook	utility
  MAX: 1
  AVERAGE: 1.0

-------------------------------------------------------
Sebastian Wolfgarten <sebastian.wolfgarten () gmx net> 

  In-degree	 Label
  ---------	 -----
  1 		 Security contact at TrendMicro
  1 		 Buffer overflow in VSAPI library of Trend Micro	VirusWall 3.
  MAX: 1
  AVERAGE: 1.0

-------------------------------------------------------
xploitzz <xploitzz () gmail com> 

  In-degree	 Label
  ---------	 -----
  1 		 Vista Reduced Function mode	triggered&In-Reply-To=AA4FD01470
  MAX: 1
  AVERAGE: 1.0

-------------------------------------------------------
Tal Argoni <moskito () 012 net il> 

  In-degree	 Label
  ---------	 -----
  1 		 Inforamtion Discloser Vulnerabilities in	"phpMyAdmin"
  MAX: 1
  AVERAGE: 1.0

-------------------------------------------------------
<sftsi () hushmail com> 

  In-degree	 Label
  ---------	 -----
  1 		 It's all in the details, sapheal
  MAX: 1
  AVERAGE: 1.0

-------------------------------------------------------
Vic Vandal <vvandal () well com> 

  In-degree	 Label
  ---------	 -----
  1 		 CarolinaCon 2007 - Call for Speakers/Papers
  1 		 Re: Call For Participants For A Research Study Of	Hacker Cul
  1 		 CarolinaCon-2008, March 28th-30th
  1 		 CarolinaCon 2007 Announcement/Press Release
  1 		 CarolinaCon-2008, March 28th-30th,	full agenda posted
  1 		 CarolinaCon presentation drafts
  1 		 CarolinaCon 2008 - Call For Papers/Speakers
  MAX: 1
  AVERAGE: 1.0

In the above code example, it is revealed that authors "Matousec" and "Vic Vandal" are centers of fireworks. Email subjects suggest that these are security advisories and conference announcements, respectively.

A few authors in the 2008 network have zero neighboring nodes. These are responses to email threads from earlier years. In general, we are not interested in these delayed responses. Security literature[16] suggests that vulnerabilities are only of interest in the short term, and discussions within a single year are a reasonable target for our analysis.

4 Filtering the network

Igraph allows for removal of vertices from the graph based on useful criteria. Since the "fireworks" are not desired in future analysis, they can be removed.

4.1 Saving some attributes before filtering

Before filtering, it may be useful to store some graph statistics in the node attributes.



In [3]:

    
pr = ml.pagerank(directed=True, weights='weight')
for idx, _ in enumerate(ml.vs):
    _['original_pagerank'] = pr[idx]
    _['original_enumeration'] = idx - 1  # this can link us back to the node list if needed

A simple function to list total numbers of nodes and authors will be useful as we attempt to shrink the analyzed graph:



In [4]:

    
def nodeCount(g):
    print('Total vertices for', g['name'], ':\n    ', len(g.vs), 'including', len(g.vs(nodetype_eq='author')), 'authors')

nodeCount(ml)









    



Total vertices for 2008 Full Disclosure network :
     9793 including 2438 authors

4.2 Removing fireworks

The earlier analysis of the in-degree of authors' neighbors suggested that those authors whose neighbors' in-degree was mostly 1 are centers of fireworks. A fuller examination of the data revealed that these types of list postings occasionally gain responses[16]. Therefore, it was determined that a firework including documents with average (or mean) in-degrees of < 1.2 can be removed from the graph.



In [5]:

    
# now let's start filtering...
# first remove authors with neighbors mostly of in-degree 1
print('Filtering out authors with neighbors mostly of in-degree 1...')

for node in ml.vs:
    if node['nodetype'] == 'author':
        indegrees = [ml.degree(_, mode='in') for _ in ml.neighbors(node)]
        if indegrees:  # a few authors have 0 neighbors... this is something to investigate later
            if sum(indegrees) / max(len(indegrees), 1) < 1.2: #  there are examples where most of the indegrees are 1 but
                                                     #  we have a random response, so let's use the mean
                ml.delete_vertices(node) #  deleting vertices also deletes all of its edges
ml['name'] = ml['name'] + ' (fireworks removed)'

nodeCount(ml)









    



Filtering out authors with neighbors mostly of in-degree 1...
Total vertices for 2008 Full Disclosure network (fireworks removed) :
     9253 including 1898 authors

The above reduction in network size is helpful, but the removal of firework hubs is likely to leave a number of isolated documents in the graph. Removing all nodes with degree 0 can complete the firework cleanup.



In [6]:

    
print('Filtering out documents or authors with degree 0...')
for node in ml.vs:
    if ml.degree(node) == 0: 
        ml.delete_vertices(node)
ml['name'] = ml['name'] + ' (isolated nodes removed)'
nodeCount(ml)









    



Filtering out documents or authors with degree 0...
Total vertices for 2008 Full Disclosure network (fireworks removed) (isolated nodes removed) :
     7179 including 1898 authors

The numbers after firework-filtering are still large. In order to focus more narrowly on relevant portions of this network, some deeper social network analysis techniques are required.

The network structure varies considerably across the years. This provides an opportunity to partition a network for a given year into several clusters and investigate if the visualized structure has any association to the textual content of a given e-mail thread discussion.

If such association exists, then we can leverage the social network structure in order to simplify the identification of emerging concepts through text alone. For instance, identifying a group of individuals who prefer certain subjects, or are spammers or trolls, may become a trivial pre-processing stage before textual content is analyzed for emerging concepts.

We begin by considering 2 methods to more precisely define how to partition a network at any given year:

Community Detection
Betweenness Centrality

5.1 Community Detection

In real-world social networks, there is a tendency towards clustering of nodes with strong ties: if I have a close friend who has another close friend, I am likely to make a connection with that person. There are mathematical methods for identifying clusters based on the existing edges of their neighbors in comparison to the possible connections among them.

Both Gephi and igraph have clustering functions, but igraph’s are more robust and allow us to easily filter out subgraphs.

Igraph's community measures do not support directed graphs, as seen in the warning message below, but the directionality can be removed. It is worth noting that directionality is implied in the structure of the graph (authors write or respond to documents).



In [7]:

    
com = ml.community_leading_eigenvector(weights='weight', clusters=3)









    



C:\Anaconda3\lib\site-packages\igraph\__init__.py:1072: RuntimeWarning: This method was developed for undirected graphs at src\community.c:1565
  membership, _, q = GraphBase.community_leading_eigenvector(self, clusters, **kwds)



In [8]:

    
ml.to_undirected(combine_edges='max') # eliminate the directionality

# Attempt to identify communities

com = ml.community_leading_eigenvector(weights='weight', clusters=3)
print('clustering attempt, leading eigenvector:')
summary(com)









    



clustering attempt, leading eigenvector:
Clustering with 7179 elements and 1092 clusters

Igraph allows a suggested number of communities generated, but on the entire network, the applied settings fail (the program cannot divide the network into 3 clusters, in this case).

It will be useful to define a method of saving cluster information in vertex attributes.



In [9]:

    
def saveClusterInfo(g, com, attr):
    '''add to graph 'g' a cluster number from community 'com' as 'attr' attribute'''
    for idx, c in enumerate(com):
        for _ in c:
            g.vs[_][attr] = idx

# Apply above function to our overall graph
saveClusterInfo(ml, com, 'filtered_clustering')

5.2 Centrality measures

Igraph allows us to calculate betweenness centrality and pagerank. These are related measures for identifying the most central vertices in a graph.

We can add a function to calculate and store these measures as attributes.



In [10]:

    
# add betweenness centrality, pagerank, and clustering info to original graph
def saveCentrality(g, name):
    bc = g.betweenness()
    for idx, node in enumerate(g.vs):
        g.vs[idx][name + '_betweenness'] = bc[idx]

    pr = g.pagerank(directed=False)
    for idx, node in enumerate(g.vs):
        g.vs[idx][name + '_pagerank'] = pr[idx]
    return bc, pr

bc, pr = saveCentrality(ml, 'filtered')

5.2.1 Community and centrality analysis

Examining the results in igraph can reveal some useful information. Beginning with the clusters:



In [11]:

    
# First, we will define a function to summarize the clustering attempt.
def summarizeClusters(com, n=5):
    print(len(com), 'clusters.')
    print('maximum size:', len(max(com, key=len)))
    print('minimum size:', len(min(com, key=len)))

    print('\nSummary of first', n, 'clusters:')
    for i in range(n):
        etc = ''
        if len(com[i]) > 5:
            etc += '(and ' + str(len(com[i]) - 5) + ' more)'
        print('[{}] has size of {}. Vertices: {} {}'.format(i, len(com[i]), com[i][0:5], etc))

summarizeClusters(com, n=10)









    



1092 clusters.
maximum size: 5594
minimum size: 1

Summary of first 10 clusters:
[0] has size of 5594. Vertices: [0, 1, 2, 3, 4] (and 5589 more)
[1] has size of 8. Vertices: [37, 1922, 1970, 2242, 2730] (and 3 more)
[2] has size of 4. Vertices: [44, 1931, 2847, 6026] 
[3] has size of 2. Vertices: [80, 1973] 
[4] has size of 4. Vertices: [81, 83, 1974, 4715] 
[5] has size of 2. Vertices: [101, 1997] 
[6] has size of 3. Vertices: [107, 2009, 2686] 
[7] has size of 5. Vertices: [118, 136, 2017, 2020, 2021] 
[8] has size of 5. Vertices: [130, 2033, 2186, 2649, 3428] 
[9] has size of 4. Vertices: [133, 2036, 2037, 2094]

The above output immediately demonstrates that one cluster is by far the largest in the network -- consisting here of 5594 nodes (out of 7179 in the total network). Most of the other clusters are tiny in comparison.

The initial impressions from betweenness measures are also interesting.



In [12]:

    
# sort the betweenness and then list nodes in order of centrality
bc.sort(reverse=True)
max_bc = max(bc)
bc_normalized = [x / max_bc for x in bc]

for idx,val in enumerate(bc_normalized[0:15]):
    print('Node', idx, ':', '{:.0%}'.format(bc_normalized[idx]))









    



Node 0 : 100%
Node 1 : 62%
Node 2 : 50%
Node 3 : 48%
Node 4 : 48%
Node 5 : 48%
Node 6 : 29%
Node 7 : 27%
Node 8 : 19%
Node 9 : 18%
Node 10 : 17%
Node 11 : 16%
Node 12 : 15%
Node 13 : 15%
Node 14 : 14%

We can observe that the betweenness centrality quickly drops, which may indicate nodes that often discuss different subjects for contributing e-mail discussions for different "community of threats".

At this point it is helpful to look at some visualizations to assess the 2008 graph in its current state.

5.2.2 Visualizing the current graph



In [13]:

    
# export the ml graph
filename = 'data/output/out_' + year + '_filtered.graphml'
ml.save(filename, format='graphml') # export graph

When the resulting graph is opened in Gephi, and colored by cluster, it is apparent that one large "blob" still comprises the bulk of the network.

There are still too many communities for easy analysis. The next objective is to isolate the central blob.

5.3 Breaking down the network into subgraphs

5.3.1 The blob

In the case of year 2008, the blob is easily identifiable, taking up majority of the graph. In other years -- particularly later years -- this will not be the case. More sophisticated selection criteria will have to be developed based on the initial clustering results.



In [14]:

    
# The blob has been manually identified as cluster 0

blob = ml.induced_subgraph(com[0])
blob['name'] = year + ' subgraph for cluster 0 (blob)'
summary(blob)









    



IGRAPH U-W- 5594 8576 -- 2008 subgraph for cluster 0 (blob)
+ attr: name (g), color (v), filtered_betweenness (v), filtered_clustering (v), filtered_pagerank (v), id (v), label (v), label1 (v), label2 (v), nodetype (v), original_enumeration (v), original_pagerank (v), source (e), target (e), weight (e)



In [15]:

    
# Now working with the blob, we can re-attempt clustering and bc!

com = blob.community_leading_eigenvector(weights='weight', clusters=8)
print('clustering attempt, leading eigenvector:')
summary(com)
saveClusterInfo(blob, com, 'blob_clustering')

bc, pr = saveCentrality(blob, 'blob')









    



clustering attempt, leading eigenvector:
Clustering with 5594 elements and 8 clusters

Igraph's clustering algorithms are far more effective on the blob than they were on the whole graph. Here, we can specify the number of clusters we are seeking. Experimenting with results, though, demonstrated that there was little significant difference among different numbers of communities; in each case the next steps are the same.

5.3.2 The "blob" colored by subcommunity

8-cluster visualization	15-cluster visualization
[]img/blob8clusters.png?raw=true)

The blob displays distinctive subcommunities. A few of these new clusters still include fireworks (green in these visualizations), but others (colored purple here) are strongly suggestive of discussion communities in which multiple authors interact with each others' messages.

Examining these subcommunities in turn provides an opportunity to discover real communities and begin to develop a corpus from the Full Disclosure discussion.

5.4 Saving the blob subcommunities

Using 8 clusters as a baseline for analysis, the relevant subgraphs can include all useful attributes and be exported in turn.



In [16]:

    
# now let's try each cluster separately

subs = {}  # build all the clusters into a dictionary
for idx, c in enumerate(com):
    subs[idx] = blob.induced_subgraph(com[idx])
    subs[idx]['name'] = year + ' blob subgraph ' + str(idx) + ' of ' + str(len(com))
    # rerun the centrality, clustering, pagerank analyses
    subcom = subs[idx].community_leading_eigenvector(weights='weight', clusters=5)
    saveClusterInfo(subs[idx], subcom, 'local_clustering')
    bc, pr = saveCentrality(subs[idx], 'local')    
    filename = 'output/' + subs[idx]['name'].replace(' ', '_') + '_out.graphml'
    # subs[idx].save(filename, format='graphml')

The last line above, commented out here, will generate subgraph GraphML file for each subcommunity. Output files from the above code snippet are available in the supplementary zip and were used to generate the images below.

Loading each subcommunity into Gephi, and retaining the original coloring (red nodes for authors, black for documents), the different graph structures for each subcommunity become clear.

5.4.1 Each subcommunity of the 2008 blob

Sub-cluster 0	Sub-cluster 1	Sub-cluster 2	Sub-cluster 3

Sub-cluster 4	Sub-cluster 5	Sub-cluster 6	Sub-cluster 7

Sub-groups 2 and 6 are too small for analysis, but the remaining groups represent real communities in dialogue. Thus, the sub-clusters represent opportunities to identify individual authors and attempt to identify conversational topics. Two problems remain to discuss:

Do these groupings have distinctive characteristics?
How can we assess the remaining firework features (e.g. in subcluster 7 above)? Is it reasonable to generalize the original filtering process at the subgraph level?

5.4.2 Cluster features and manual textual analysis

Cluster 0 (161 nodes, 183 edges) features a number of discussions of exploit auctions or sales[17], [18], [19 and includes a potential firework clustering around user Nick FitzGerald <nick () virus-l demon co uk>. A researcher more familiar with security matters should assess whether or not the Nick FitzGerald messages (for example [20], [21], [22]) are valuable for further study; they are primarily general discussions of security topics, in conversation with a wider group of list members. Potential keywords: exploits, sales, auction, zero-day
Cluster 1 (488 nodes, 983 edges) is heavily dominated by three authors, Ureleet <ureleet () gmail com>, Valdis.Kletnieks () vt edu, and above all n3td3v <xploitable () gmail com>, an infamous list member discussed in several forums even outside of Full Disclosure[23], [24]. n3td3v responds to a wide range of topics including over 300 separate emails in the overall 2008 graph. The fact that these users are not typical makes analysis difficult. The best practice is to treat them as trolls (n3td3v is referred to as a troll in the cited links, with some caveats.) Conveniently, the blob clustering has linked together all of these trolling-related postings into one giant subgraph of argumentation and flame wars. (Examples of such posts include [25], [26], [27].) This cluster can be dropped from further discussion. Potential keywords: n3td3v
Cluster 2, as mentioned earlier, can also be ignored for its small size of 26 nodes.
Cluster 3 is the largest in the blob (4151 nodes, 4039 edges) and corresponds to the large purple communities in the blob overviews above. This cluster is so large that it is a candidate for further subdivision, treating it as yet another 'blob.' It also includes fireworks within its graph. Fireworks within this cluster includ, for example, author zdi-disclosures () 3com com whose postings are in fact security advisories (e.g. [28]. These authors' advisories garnered some responses in the original full-year network and might have been removed with more robust filtering in an earlier stage. For non-firework postings, a manual overview suggests a wide variety of topics in this cluster, including many that touch on the mailing list's central topic, full-disclosure security postings[29]. [30], [31].
Cluster 4 is a small subgraph (118 nodes, 138 edges) notable for Windows-focused discussions[31, [32], including then-current conversations about Windows Vista in comparison to Windows XP[31. There are some additional general discussion of security concerns, but little to make this small group distinct from the overall list. Potential keywords: Microsoft, Windows
Cluster 5 is another small subgraph (145 nodes, 164 edges) with a few dominant authors, similar in some ways to cluster 1. These authors seem to treat the Full Disclosure list as a social group as well as a security forum, with email posts featuring titles like "Came across this site"[32], "Anyone have a Lindows/Linspire contact"[33, and "Southwest Airlines Ticket Silliness"[34]. Although the members of this clustering discuss some security measures, their emphasis on general discussions remove this group from consideration.
Cluster 6 (21 nodes, 0 edges) is again too small to analyze.
Cluster 7 (484 nodes, 601 edges) presents a problem for future analysis: the first examples of truly belligerent trolling. In cluster 1 above, n3td3v and other users were extremely active and annoyed others, but did not display overtly antisocial behavior. At least one member of cluster 7 posted offensive[35][36] content on a regular basis, while other users here display more aggression[37] or attacks on personal lives[38]. The difficulty in this cluster is assessing the level of discussion of actual security concerns vs. trolling and attacks. This subcluster is thus a strong candidate for further clustering analysis which is likely to separate out the authors who engage in malbehavior from the rest of the group.

6 Conclusions

The techniques developed above can be summarized as follows:

Filter out "firework" nodes (and the subsequently orphaned nodes that the firework authors originally posted).
Identify subcommunities within the graph that deserve deeper analysis (e.g., the "blob")
Subdivide into communities again.
Manually analyze the resulting subgraphs.

This strategy was successful, though incomplete. Two of the resulting subgraphs (numbers 3 and 7 above) merit further subdivision.

The results suggest that a more fully-developed programming framework could automate much of this work, as a researcher could discard clusters based on graph structure or content, narrowing in on relevant subclusters of subclusters. Early stages of such a framework in Python are suggested above.

The preliminary results based on the subclusters of the initial "blob" are encouraging, demonstrating that social-focused clusters (like cluster 5) and flame-war clusters (like cluster 1) can be isolated from the surrounding graphs through community detection.

6.2 Assessing tools used

Gephi and igraph were adequate for this research, but both tools are flawed. Alternatives worthy of consideration for future iterations include:

Cytoscape -- comparable to Gephi, but with the potentially useful feature of allowing for publication of graph data as an interactive web page[38].
NetworkX -- comparable to igraph. The main reason to suggest NetworkX as a potential replacement for igraph is that NetworkX currently has a better-functioning integration with Jupyter notebooks, whereas igraph faces bugs when attempting to plot images within a notebook[40.

6.3 Suggestions for next steps

Next steps for this analysis should include building out the skeleton Python program above to adhere to "Don't Repeat Yourself" principles and allow for automatic analysis of multiple years of the Full Disclosure list, breaking down large networks into manageable subgraphs.
Rather than relying on static csv edge and node lists, it is possible to pull graph data directly from a database into igraph[42]. As the PERCEIVE project moves in this direction, the social network analysis segment will be able to keep up.
Igraph's community-generation algorithms allow a user to provide a desired number of subcommunities. A Jupyter notebook with interactive features could allow a user to modify this number and visually assess the results, potentially identifying a number that could most readily identify targeted subcommunities.
The research conclusions fail to develop a model of an ideal subcommunity size for analysis. Cluster 6 with 21 nodes was too small to have meaningful discussion, while Cluster 3 with 4151 nodes was too large to break down. A useful middle ground could be identified and then program parameters automatically adjusted to attempt to generate subgroups of preferred size.
Manual examination of dozens of nodes is prone to error. Future researchers should utilize multiple human analysts to identify and confirm commonalities among subcommunity members. This task could be facilitated by services like Mechanical Turk.



In [ ]: