We want to analyze participants and patterns of participation across IETF groups. How many people participate, in which groups, how does affiliation, gender, RFC authorship or other characteristics relate to levels of participation, and a variety of other related questions. How do groups relate to one another? Which participants provide important connections between groups?
Start by importing the necessary libraries.
In [1]:
%matplotlib inline
import bigbang.mailman as mailman
import bigbang.graph as graph
import bigbang.process as process
from bigbang.parse import get_date
from bigbang.archive import Archive
import bigbang.utils as utils
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import numpy as np
import math
import pytz
import pickle
import os
import csv
import re
import scipy
import scipy.cluster.hierarchy as sch
import email
In [2]:
#pd.options.display.mpl_style = 'default' # pandas has a set of preferred graph formatting options
plt.rcParams['axes.facecolor'] = 'white'
import seaborn as sns
sns.set()
sns.set_style("white")
Let's start with a single IETF mailing list. (Later, we can expand to all current groups, or all IETF lists ever.)
In [3]:
list_url = '6lo' # perpass happens to be one that I subscribe to
ietf_archives_dir = '../archives' # relative location of the ietf-archives directory/repo
list_archive = mailman.open_list_archives(list_url, ietf_archives_dir)
activity = Archive(list_archive).get_activity()
In [4]:
people = None
people = pd.DataFrame(activity.sum(0), columns=['6lo']) # sum the message count, rather than by date
In [5]:
people.describe()
Out[5]:
Now repeat, parsing the archives and collecting the activities for all the mailing lists in the corpus. To make this faster, we try to open pre-created -activity.csv
files which contain the activity summary for the full list archive. These files are created with bin/mail_to_activity.py
or might be included in the mailing list archive repository.
In [6]:
f = open('../examples/mm.ietf.org.txt', 'r')
ietf_lists = set(f.readlines()) # remove duplicates, which is a bug in list maintenance
In [7]:
list_activities = []
for list_url in ietf_lists:
try:
activity_summary = mailman.open_activity_summary(list_url, ietf_archives_dir)
if activity_summary is not None:
list_activities.append((list_url, activity_summary))
except Exception as e:
print(str(e))
In [8]:
len(list_activities)
Out[8]:
Merge all of the activity summaries together, so that every row is a "From" field, with a column for every mailing list and a cell that includes the number of messages sent to that list. This will be a very sparse, 2-d table. This operation is a little slow. Don't repeat this operation without recreating people
from the cells above.
In [9]:
list_columns = []
for (list_url, activity_summary) in list_activities:
list_name = mailman.get_list_name(list_url)
activity_summary.rename(columns={'Message Count': list_name}, inplace=True) # name the message count column for the list
people = pd.merge(people, activity_summary, how='outer', left_index=True, right_index=True)
list_columns.append(list_name) # keep a list of the columns that specifically represent mailing list message counts
In [10]:
# the original message column was duplicated during the merge process, so we remove it here
people = people.drop(columns=['6lo_y'])
people = people.rename(columns={'6lo_x':'6lo'})
In [12]:
people.describe()
Out[12]:
In [46]:
# not sure how the index ended up with NaN values, but need to change them to strings here so additional steps will work
new_index = people.index.fillna('missing')
people.index = new_index
Split out the email address and header name from the From header we started with.
In [47]:
froms = pd.Series(people.index)
emails = froms.apply(lambda x: email.utils.parseaddr(x)[1])
emails.index = people.index
names = froms.apply(lambda x: email.utils.parseaddr(x)[0])
names.index = people.index
people['email'] = emails
people['name'] = names
Let's create some summary statistical columns.
In [48]:
people['Total Messages'] = people[list_columns].sum(axis=1)
people['Number of Groups'] = people[list_columns].count(axis=1)
people['Median Messages per Group'] = people[list_columns].median(axis=1)
In [49]:
people['Total Messages'].sum()
Out[49]:
In this corpus, 101,510 "people" sent a combined total of 1.2 million messages. Most people sent only 1 message.
The vast majority of people send only a few messages, and to only a couple of lists. (These histograms use a log axis for Y, without which you couldn't even see the columns besides the first.)
In [22]:
people[['Total Messages']].plot(kind='hist', bins=100, logy=True, logx=False)
people[['Number of Groups']].plot(kind='hist', bins=100, logy=True, logx=False)
Out[22]:
Let's limit our analysis for now to people who have sent at least 5 messages. We will also create log base 10 versions of our summary columns for easier graphing later.
In [23]:
working = people[people['Total Messages'] > 5]
working['Total Messages (log)'] = np.log10(working['Total Messages'])
working['Number of Groups (log)'] = np.log10(working['Number of Groups'])
The median number of messages that a user sends to a group is also heavily weighted towards a small number, but the curve doesn't seem to drop off in the same extreme manner. There is a non-random tendency to send some messages to a group?
In [24]:
working[['Median Messages per Group']].plot(kind='hist', bins=100, logy=True)
Out[24]:
Is there a relationship between the number of groups that a user has sent messages to and the number of messages that user has sent (total, or the median number to groups)?
In [25]:
working.plot.scatter('Number of Groups','Total Messages', xlim=(1,300), ylim=(1,20000), logx=False, logy=True)
Out[25]:
It appears that there are interesting outliers here. Some who send a couple messages each to a large number of groups, but then a separate group of outliers that sends lots of messages and to lots of groups. That might be an elite component worthy of separate analysis.
A density graph will show, however, that while there are people who send many messages to a small number of groups, still, most people are clustered around sending few messages, to few groups.
In [26]:
sns.jointplot(x='Number of Groups',y='Total Messages (log)', data=working, kind="kde", xlim=(0,50), ylim=(0,3));
Can we learn implicit relationships between groups based on the messaging patterns of participants?
We want to work with just the data of people and how many messages they sent to each group.
In [27]:
df = people[people['Total Messages'] > 5]
df = df.drop(columns=['email','name','Total Messages','Number of Groups','Median Messages per Group'])
df = df.fillna(0)
Principal Component Analysis (PCA) will seek to explain the most variance in the samples (participants) based on the features (messages sent to different lists). Let's try with two components and see what PCA sees as the most distinguishing dimensions of IETF participation.
In [28]:
import sklearn
from sklearn.decomposition import PCA
scaled = sklearn.preprocessing.maxabs_scale(df)
pca = PCA(n_components=2, whiten=True)
pca.fit(scaled)
Out[28]:
In [29]:
components_frame = pd.DataFrame(pca.components_)
components_frame.columns = df.columns
components_frame
Out[29]:
In [30]:
for i, row in components_frame.iterrows():
print('\nComponent %d' % i)
r = row.sort_values(ascending=False)
print('Most positive correlation:\n %s' % r[:5].index.values)
print('Most negative correlation:\n %s' % r[-5:].index.values)
Component 0 is mostly routing (Layer 3 and Layer 2 VPNs, the routing area working group, interdomain routing. (IP Performance/Measurement seems different -- is it related?)
Component 1 is all Internet area groups, mostly related to IPv6, and specifically different groups working on mobility-related extensions to IPv6.
When data was unscaled, PCA components seemed to connect to ops and ipv6, a significantly different result.
For our two components, we can see which features are most positively correlated and which are most negatively correlated. On positive correlation, looking up these groups, it seems like there is some meaningful coherence here. On Component 0, we see groups in the "ops" area: groups related to the management, configuration and measurement of networks. On the other component, we see groups in the Internet and transport areas: groups related to IPv6, the transport area and PSTN transport.
That we see such different results when the data is first scaled by each feature perhaps suggests that the initial analysis was just picking up on the largest groups.
In [31]:
pca.explained_variance_
Out[31]:
The explained variance by our components seems extremely tiny.
With two components (or the two most significant components), we can attempt a basic visualization as a scatter plot.
In [32]:
component_df = pd.DataFrame(pca.transform(df), columns=['PCA%i' % i for i in range(2)], index=df.index)
component_df.plot.scatter(x='PCA0',y='PCA1')
Out[32]:
And with a larger number of components?
In [33]:
pca = PCA(n_components=10, whiten=True)
pca.fit(scaled)
components_frame = pd.DataFrame(pca.components_)
components_frame.columns = df.columns
for i, row in components_frame.iterrows():
print('\nComponent %d' % i)
r = row.sort_values(ascending=False)
print('Most positive correlation:\n %s' % r[:5].index.values)
print('Most negative correlation:\n %s' % r[-5:].index.values)
There are definitely subject domain areas in these lists (the last one, for example, on groups related to phone calls and emergency services). Also interesting is the presence of some meta-topics, like mtgvenue
or policy
or iasa20
(an IETF governance topic).
Future work: we might be able to use this sparse matrix of participation in different lists to provide recommendations of similarity. "People who send messages to the same mix of groups you send to also like this other list" or "People who like this list, also often like this list".
Because we have people and the groups they send to, we can construct a bipartite graph.
We'll use just the top 5000 people, in order to make complicated calculations run faster.
In [34]:
df = people.sort_values(by="Total Messages",ascending=False)[:5000]
df = df.drop(columns=['email','name','Total Messages','Number of Groups','Median Messages per Group'])
df = df.fillna(0)
In [35]:
import networkx as nx
G = nx.Graph()
for group in df.columns:
G.add_node(group,type="group")
for name, data in df.iterrows():
G.add_node(name,type="person")
for group, weight in data.items():
if weight > 0:
G.add_edge(name,group,weight=weight)
In [36]:
nx.is_bipartite(G)
Out[36]:
Yep, it is bipartite! Now, we can export a graph file for use in visualization software Gephi.
In [37]:
nx.write_gexf(G,'ietf-participation-bipartite.gexf')
In [38]:
people_nodes, group_nodes = nx.algorithms.bipartite.sets(G)
We can calculate the "PageRank" of each person and group, using the weights (number of messages) between groups and people to distribute a kind of influence.
In [39]:
pr = nx.pagerank(G, weight="weight")
In [40]:
nx.set_node_attributes(G, "pagerank", pr)
In [ ]:
sorted([node for node in list(G.nodes(data=True))
if node[1]['type'] == 'group'],
key=lambda x: x[1]['pagerank'],
reverse =True)[:10]
In [41]:
sorted([node for node in list(G.nodes(data=True))
if node[1]['type'] == 'person'],
key=lambda x: x[1]['pagerank'],
reverse =True)[:10]
However, PageRank is probably less informative than usual here, because this is a bipartite, non-directed graph. Instead, let's calculate a normalized, closeness centrality specific to bipartite graphs.
In [42]:
person_nodes = [node[0] for node in G.nodes(data=True) if node[1]['type'] == 'person']
NB: Slow operation for large graphs.
In [43]:
cc = nx.algorithms.bipartite.centrality.closeness_centrality(G, person_nodes, normalized=True)
In [44]:
for node, value in list(cc.items()):
if type(node) not in [str, str]:
print(node)
print(value)
In [45]:
del cc[14350.0] # remove a spurious node value
In [ ]:
nx.set_node_attributes(G, "closeness", cc)
In [ ]:
sorted([node for node in list(G.nodes(data=True))
if node[1]['type'] == 'person'],
key=lambda x: x[1]['closeness'],
reverse=True)[:25]
The people with the highest closeness centrality are the ones that have the most co-affiliation with every other person, or the shortest path to every other person. Automated accounts are, as we might expect, extremely high on this measure -- they're used to send announcements of publications and do so to basically every group. The individual people highest ranked on this measure include Stephen Farrell, Jari Arkko, Ben Campbell -- long-time participants with leadership roles. The highest ranked woman is Alissa Cooper, current Chair of the IETF.
TODO: calculating bi-cliques (the people who all are connected to the same group) and then measuring correlation in bi-cliques (people who belong to many of the same groups) could allow for analysis of cohesive subgroups and a different network analysis/visualization. See Borgatti, S.P. and Halgin, D. In press. “Analyzing Affiliation Networks”. In Carrington, P. and Scott, J. (eds) The Sage Handbook of Social Network Analysis. Sage Publications. http://www.steveborgatti.com/papers/bhaffiliations.pdf
In [ ]: