1.1 TITLE ‒ Introduction to social networking analysis of the Full Disclosure email network
1.2 DESCRIPTION ‒ A security email list includes useful information for predictive analysis of emerging threats, but this information may be lost alongside irrelevant discussion threads. Social network analysis provides one method to identify and filter out unwanted messages. This introductory notebook develops techniques to concentrate on the core content of the email network studied.
1.3 ABSTRACT ‒ This notebook begins with preliminary analysis and exploration of the Full Disclosure email network, developing a general sense of the network's structure over time and within a single year. The general overview reveals that the email list changed in character over time, growing in activity in early year but declining sharply in participation in recent years. Important features of the network, particularly its "firework" structures, are noted and discussed, leading to a working hypothesis: that the firework clusters represent prolific authors whose emails do not generate much list discussion, and that these threads that are not useful for the PERCEIVE project's study of emerging threats, as they represent content like security advisories. Manual inspection of nodes confirms this hypothesis.
In order to concentrate on actual discussion topics within the list, programmatic methods are developed to filter out these undesired firework structures. Once filtered, the network is amenable to community detection, which in turn points to the need for further filtering. This general process -- filtering the network, identifying subcommunities, filtering again, and identifying subcommunites again -- emerges as a useful iterative procedure to find genuine discussion communities.
The PERCEIVE project seeks to proactively identify upcoming cybersecurity threats through textual similarity. Social network analysis can be used to partition a network and evaluate its textual content, and hence provide word-selection criteria for defining corpus documents.
The subject of this notebook, the Full Disclosure (FD) mailing list is a "public, vendor-neutral forum for detailed discussion of vulnerabilities and exploitation techniques, as well as tools, papers, news, and events of interest to the community."
Although cybersecurity email mailing lists provide a rich source to identify emerging concepts, they contain a large amount of content that is irrelevant to the project's purpose, including but not limited to:
As some of this irrelevant content can be strictly tied to its producer, i.e., the authors of the e-mail replies, social network analysis provides an interesting opportunity for the isolation of relevant discussion topics in order to filter non-related vulnerability content.
Earlier in this project, the FD email lists were developed into networks of edges and nodes, divided by year. These original csv files are available online.
In the FD graphs, nodes represent either documents (i.e., emails) or authors, as identified by a nodeType attribute. Edges are directed, representing authorship and replies; edge weight indicates an increasing number of replies.
The following screenshots show sample data from the edge and node lists for 2008. The edge lists provide three attributes: source (an author who posted an email), target (the document written), and weight (the number of times the author wrote back to the resulting thread). The node list attributes include an id (the author or the email document), a label (email address of an author or email subject for a document), Label1 (response dates if applicable), Label2 (URLs for the the documents), hexadecimal Color of red (#FF00000) or black (#000000) based on nodeType of author or document.
| 2008 edge list sample | 2008 node list sample |
|---|---|
While Gephi proved useful for data exploration, some difficulties arose. In particular, many of the core functions[7] and plugins[8] once used for social network analysis are not compatible with current (0.9) software versions. Gephi specifications note that the 0.9.0 version (December 2015) "Removed [the] Clustering module"[9] that these plugins relied upon.
Python-igraph (igraph hereafter) allows for programmatic manipulation of network data. Igraph can generate images via a plot function, but this is a slow process for large graphs. Igraph has a wide variety of import and export options for network data, including straightforward export of subgraphs.
While Gephi can export graphs, its export implementations include temporary node properties like position and color attributes. Igraph can import Gephi-exported GraphML files (and those are available here. However, it was more useful in the long run to generate graphs in igraph directly from the provided edge and node lists. Igraph's exported GraphML files functioned as expected in Gephi.
In Gephi, graphs were generated via the "import spreadsheet" function, and separate Gephi projects saved for each year. Exported versions of these graphs are available in the supplementary zip file.
In igraph, graphs are generated via importing csv data into lists, then using igraph's DictReader function.
In [1]:
from igraph import Graph, summary
from csv import DictReader
# 2008 will be the sample year for this introductory analysis, but let's define it as a variable
# for future development of functions that will work for all years
year = '2008'
# import edge list and vertex list as lists of dictionaries:
e = []
with open('data/input/author_threadID_edgelist_' + year + '.csv') as csvfile:
reader = DictReader(csvfile, dialect='excel')
reader.fieldnames = [name.lower() for name in reader.fieldnames] # igraph expects lowercase attribute names
for row in reader:
row['weight'] = int(row['weight']) # convert weight from string to int
e.append(row)
v = []
with open('data/input/nodelist_aut_doc_' + year +'.csv') as csvfile:
reader = DictReader(csvfile, dialect='excel')
reader.fieldnames = [name.lower() for name in reader.fieldnames] # igraph expects lowercase attribute names
for row in reader:
v.append(row)
# build a graph from the lists; see http://igraph.org/python/doc/igraph.Graph-class.html#DictList
ml = Graph.DictList(vertices=v, edges=e, directed=True, vertex_name_attr='id') # specify the 'id' attribute here
# because igraph anticipates a 'name' instead
ml['name'] = year + ' Full Disclosure network' # provide a name for this graph
summary(ml) # list properties and attributes
The summary command providing the output seen above is explained in the igraph documentation[10]. In this example, the four-character code "D-W-" indicates that the graph is directed and weighted. The graph for 2008 has 9793 vertices (nodes) and 12073 edges.
The list of attributes in the summary ("name", "color", etc.) are those for the graph (g), vertices (v), or edges (e).
In the remainder of this notebook, 2008 will be used for single-year analysis.
Importing the entire graphs for each year into Gephi allowed for early visual inspection and exploration of the FD lists.
The following table includes full-network Gephi graphics for each year of the email list's existence; blue nodes represent authors and red nodes represent documents. Click through for full-size renderings.
It is clear from a quick look at the graphs that the overall pattern of the Full Disclosure network has shifted over time. In the early years of the list, there were tightly clustered conversations, particularly from about 2003-2006. As the years went by, the level of discussion seems to have dropped. Some of this pattern can be inferred simply from the file sizes of the edge and node lists, or by counting the nodes.
The visual impression is further confirmed by a documented change in the FD network in March, 2014, when one of the original managers of the list decided to halt the list entirely[11] but another team soon revived it[12].
Following the 2014 format change, the number of edges drops significantly, and the list displays a different network structure from the earlier years. Currently, the list remains active but few members respond to published posts.
Visual analysis of the graphs for each year showed distinct fireworks patterns throughout the lifecycle of the list. These are cases where one author is posting to the list many times, with little or no response from other people on the list, resulting in a hub-and-spoke graph form. The working hypothesis is that these firework patterns generally represent security advisories. If so, these are discussions known issues, and not useful for the project's overall predictive goals.
These fireworks were analyzed both visually, in Gephi, and programmatically, in igraph, to test the hypothesis.
In Gephi, manual examination included use of the Linkfluence plugin to launch URLs in a web browser directly from nodes studied. This node-by-node analysis demonstrated that firework nodes were security advisories[13][14][15].
In igraph, it was informative to assess each author node and examine the in-degree of its connected (document) nodes. In cases where all (or most) of an author's neighbors have an in-degree of one, the author is the hub of a firework.
In [2]:
print('List of authors and their connected (document) vertices, among first 45 in total list,\nlimited to those with neighbors with max in-degree < 2')
for node in ml.vs(nodetype_eq='author')[0:45]:
indegrees = [ml.degree(_, mode='in') for _ in ml.neighbors(node)]
if indegrees: # a few authors have 0 neighbors...
if max(indegrees) < 2:
print()
print('-' * 55)
print(node['label'], '\n')
print(' In-degree\t Label')
print(' ---------\t -----')
for _ in ml.neighbors(node):
print(' ', ml.degree(_, mode='in'), '\t\t', ml.vs[_]['label'][0:60] )
print(' MAX:', max(indegrees))
print(' AVERAGE:', sum(indegrees) / max(len(indegrees), 1))
In the above code example, it is revealed that authors "Matousec" and "Vic Vandal" are centers of fireworks. Email subjects suggest that these are security advisories and conference announcements, respectively.
A few authors in the 2008 network have zero neighboring nodes. These are responses to email threads from earlier years. In general, we are not interested in these delayed responses. Security literature[16] suggests that vulnerabilities are only of interest in the short term, and discussions within a single year are a reasonable target for our analysis.
Igraph allows for removal of vertices from the graph based on useful criteria. Since the "fireworks" are not desired in future analysis, they can be removed.
Before filtering, it may be useful to store some graph statistics in the node attributes.
In [3]:
pr = ml.pagerank(directed=True, weights='weight')
for idx, _ in enumerate(ml.vs):
_['original_pagerank'] = pr[idx]
_['original_enumeration'] = idx - 1 # this can link us back to the node list if needed
A simple function to list total numbers of nodes and authors will be useful as we attempt to shrink the analyzed graph:
In [4]:
def nodeCount(g):
print('Total vertices for', g['name'], ':\n ', len(g.vs), 'including', len(g.vs(nodetype_eq='author')), 'authors')
nodeCount(ml)
The earlier analysis of the in-degree of authors' neighbors suggested that those authors whose neighbors' in-degree was mostly 1 are centers of fireworks. A fuller examination of the data revealed that these types of list postings occasionally gain responses[16]. Therefore, it was determined that a firework including documents with average (or mean) in-degrees of < 1.2 can be removed from the graph.
In [5]:
# now let's start filtering...
# first remove authors with neighbors mostly of in-degree 1
print('Filtering out authors with neighbors mostly of in-degree 1...')
for node in ml.vs:
if node['nodetype'] == 'author':
indegrees = [ml.degree(_, mode='in') for _ in ml.neighbors(node)]
if indegrees: # a few authors have 0 neighbors... this is something to investigate later
if sum(indegrees) / max(len(indegrees), 1) < 1.2: # there are examples where most of the indegrees are 1 but
# we have a random response, so let's use the mean
ml.delete_vertices(node) # deleting vertices also deletes all of its edges
ml['name'] = ml['name'] + ' (fireworks removed)'
nodeCount(ml)
The above reduction in network size is helpful, but the removal of firework hubs is likely to leave a number of isolated documents in the graph. Removing all nodes with degree 0 can complete the firework cleanup.
In [6]:
print('Filtering out documents or authors with degree 0...')
for node in ml.vs:
if ml.degree(node) == 0:
ml.delete_vertices(node)
ml['name'] = ml['name'] + ' (isolated nodes removed)'
nodeCount(ml)
The numbers after firework-filtering are still large. In order to focus more narrowly on relevant portions of this network, some deeper social network analysis techniques are required.
The network structure varies considerably across the years. This provides an opportunity to partition a network for a given year into several clusters and investigate if the visualized structure has any association to the textual content of a given e-mail thread discussion.
If such association exists, then we can leverage the social network structure in order to simplify the identification of emerging concepts through text alone. For instance, identifying a group of individuals who prefer certain subjects, or are spammers or trolls, may become a trivial pre-processing stage before textual content is analyzed for emerging concepts.
We begin by considering 2 methods to more precisely define how to partition a network at any given year:
In real-world social networks, there is a tendency towards clustering of nodes with strong ties: if I have a close friend who has another close friend, I am likely to make a connection with that person. There are mathematical methods for identifying clusters based on the existing edges of their neighbors in comparison to the possible connections among them.
Both Gephi and igraph have clustering functions, but igraph’s are more robust and allow us to easily filter out subgraphs.
Igraph's community measures do not support directed graphs, as seen in the warning message below, but the directionality can be removed. It is worth noting that directionality is implied in the structure of the graph (authors write or respond to documents).
In [7]:
com = ml.community_leading_eigenvector(weights='weight', clusters=3)
In [8]:
ml.to_undirected(combine_edges='max') # eliminate the directionality
# Attempt to identify communities
com = ml.community_leading_eigenvector(weights='weight', clusters=3)
print('clustering attempt, leading eigenvector:')
summary(com)
Igraph allows a suggested number of communities generated, but on the entire network, the applied settings fail (the program cannot divide the network into 3 clusters, in this case).
It will be useful to define a method of saving cluster information in vertex attributes.
In [9]:
def saveClusterInfo(g, com, attr):
'''add to graph 'g' a cluster number from community 'com' as 'attr' attribute'''
for idx, c in enumerate(com):
for _ in c:
g.vs[_][attr] = idx
# Apply above function to our overall graph
saveClusterInfo(ml, com, 'filtered_clustering')
In [10]:
# add betweenness centrality, pagerank, and clustering info to original graph
def saveCentrality(g, name):
bc = g.betweenness()
for idx, node in enumerate(g.vs):
g.vs[idx][name + '_betweenness'] = bc[idx]
pr = g.pagerank(directed=False)
for idx, node in enumerate(g.vs):
g.vs[idx][name + '_pagerank'] = pr[idx]
return bc, pr
bc, pr = saveCentrality(ml, 'filtered')
In [11]:
# First, we will define a function to summarize the clustering attempt.
def summarizeClusters(com, n=5):
print(len(com), 'clusters.')
print('maximum size:', len(max(com, key=len)))
print('minimum size:', len(min(com, key=len)))
print('\nSummary of first', n, 'clusters:')
for i in range(n):
etc = ''
if len(com[i]) > 5:
etc += '(and ' + str(len(com[i]) - 5) + ' more)'
print('[{}] has size of {}. Vertices: {} {}'.format(i, len(com[i]), com[i][0:5], etc))
summarizeClusters(com, n=10)
The above output immediately demonstrates that one cluster is by far the largest in the network -- consisting here of 5594 nodes (out of 7179 in the total network). Most of the other clusters are tiny in comparison.
The initial impressions from betweenness measures are also interesting.
In [12]:
# sort the betweenness and then list nodes in order of centrality
bc.sort(reverse=True)
max_bc = max(bc)
bc_normalized = [x / max_bc for x in bc]
for idx,val in enumerate(bc_normalized[0:15]):
print('Node', idx, ':', '{:.0%}'.format(bc_normalized[idx]))
We can observe that the betweenness centrality quickly drops, which may indicate nodes that often discuss different subjects for contributing e-mail discussions for different "community of threats".
At this point it is helpful to look at some visualizations to assess the 2008 graph in its current state.
In [13]:
# export the ml graph
filename = 'data/output/out_' + year + '_filtered.graphml'
ml.save(filename, format='graphml') # export graph
When the resulting graph is opened in Gephi, and colored by cluster, it is apparent that one large "blob" still comprises the bulk of the network.
There are still too many communities for easy analysis. The next objective is to isolate the central blob.
In the case of year 2008, the blob is easily identifiable, taking up majority of the graph. In other years -- particularly later years -- this will not be the case. More sophisticated selection criteria will have to be developed based on the initial clustering results.
In [14]:
# The blob has been manually identified as cluster 0
blob = ml.induced_subgraph(com[0])
blob['name'] = year + ' subgraph for cluster 0 (blob)'
summary(blob)
In [15]:
# Now working with the blob, we can re-attempt clustering and bc!
com = blob.community_leading_eigenvector(weights='weight', clusters=8)
print('clustering attempt, leading eigenvector:')
summary(com)
saveClusterInfo(blob, com, 'blob_clustering')
bc, pr = saveCentrality(blob, 'blob')
Igraph's clustering algorithms are far more effective on the blob than they were on the whole graph. Here, we can specify the number of clusters we are seeking. Experimenting with results, though, demonstrated that there was little significant difference among different numbers of communities; in each case the next steps are the same.
| 8-cluster visualization | 15-cluster visualization |
|---|---|
| [ |
The blob displays distinctive subcommunities. A few of these new clusters still include fireworks (green in these visualizations), but others (colored purple here) are strongly suggestive of discussion communities in which multiple authors interact with each others' messages.
Examining these subcommunities in turn provides an opportunity to discover real communities and begin to develop a corpus from the Full Disclosure discussion.
Using 8 clusters as a baseline for analysis, the relevant subgraphs can include all useful attributes and be exported in turn.
In [16]:
# now let's try each cluster separately
subs = {} # build all the clusters into a dictionary
for idx, c in enumerate(com):
subs[idx] = blob.induced_subgraph(com[idx])
subs[idx]['name'] = year + ' blob subgraph ' + str(idx) + ' of ' + str(len(com))
# rerun the centrality, clustering, pagerank analyses
subcom = subs[idx].community_leading_eigenvector(weights='weight', clusters=5)
saveClusterInfo(subs[idx], subcom, 'local_clustering')
bc, pr = saveCentrality(subs[idx], 'local')
filename = 'output/' + subs[idx]['name'].replace(' ', '_') + '_out.graphml'
# subs[idx].save(filename, format='graphml')
The last line above, commented out here, will generate subgraph GraphML file for each subcommunity. Output files from the above code snippet are available in the supplementary zip and were used to generate the images below.
Loading each subcommunity into Gephi, and retaining the original coloring (red nodes for authors, black for documents), the different graph structures for each subcommunity become clear.
| Sub-cluster 0 | Sub-cluster 1 | Sub-cluster 2 | Sub-cluster 3 |
|---|---|---|---|
| Sub-cluster 4 | Sub-cluster 5 | Sub-cluster 6 | Sub-cluster 7 |
Sub-groups 2 and 6 are too small for analysis, but the remaining groups represent real communities in dialogue. Thus, the sub-clusters represent opportunities to identify individual authors and attempt to identify conversational topics. Two problems remain to discuss:
Nick FitzGerald <nick () virus-l demon co uk>. A researcher more familiar with security matters should assess whether or not the Nick FitzGerald messages (for example [20], [21], [22]) are valuable for further study; they are primarily general discussions of security topics, in conversation with a wider group of list members. Potential keywords: exploits, sales, auction, zero-dayUreleet <ureleet () gmail com>, Valdis.Kletnieks () vt edu, and above all n3td3v <xploitable () gmail com>, an infamous list member discussed in several forums even outside of Full Disclosure[23], [24]. n3td3v responds to a wide range of topics including over 300 separate emails in the overall 2008 graph. The fact that these users are not typical makes analysis difficult. The best practice is to treat them as trolls (n3td3v is referred to as a troll in the cited links, with some caveats.) Conveniently, the blob clustering has linked together all of these trolling-related postings into one giant subgraph of argumentation and flame wars. (Examples of such posts include [25], [26], [27].) This cluster can be dropped from further discussion. Potential keywords: n3td3vzdi-disclosures () 3com com whose postings are in fact security advisories (e.g. [28]. These authors' advisories garnered some responses in the original full-year network and might have been removed with more robust filtering in an earlier stage. For non-firework postings, a manual overview suggests a wide variety of topics in this cluster, including many that touch on the mailing list's central topic, full-disclosure security postings[29]. [30], [31].The techniques developed above can be summarized as follows:
This strategy was successful, though incomplete. Two of the resulting subgraphs (numbers 3 and 7 above) merit further subdivision.
The results suggest that a more fully-developed programming framework could automate much of this work, as a researcher could discard clusters based on graph structure or content, narrowing in on relevant subclusters of subclusters. Early stages of such a framework in Python are suggested above.
The preliminary results based on the subclusters of the initial "blob" are encouraging, demonstrating that social-focused clusters (like cluster 5) and flame-war clusters (like cluster 1) can be isolated from the surrounding graphs through community detection.
Gephi and igraph were adequate for this research, but both tools are flawed. Alternatives worthy of consideration for future iterations include:
In [ ]: