We want to analyze participants and patterns of participation across IETF groups. How many people participate, and in which groups? How do affiliation, gender, RFC authorship, and other characteristics relate to levels of participation? How do groups relate to one another, and which participants provide important connections between groups?

Setup and gather data

Start by importing the necessary libraries.


In [1]:
%matplotlib inline
import bigbang.mailman as mailman
import bigbang.graph as graph
import bigbang.process as process
from bigbang.parse import get_date
from bigbang.archive import Archive
import bigbang.utils as utils
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import numpy as np
import math
import pytz
import pickle
import os
import csv
import re
import scipy
import scipy.cluster.hierarchy as sch
import email

In [2]:
#pd.options.display.mpl_style = 'default' # pandas has a set of preferred graph formatting options
plt.rcParams['axes.facecolor'] = 'white'
import seaborn as sns
sns.set()
sns.set_style("white")

Let's start with a single IETF mailing list. (Later, we can expand to all current groups, or all IETF lists ever.)


In [3]:
list_url = '6lo' # a single IETF working group list to start with

ietf_archives_dir = '../archives' # relative location of the ietf-archives directory/repo

list_archive = mailman.open_list_archives(list_url, ietf_archives_dir)
activity = Archive(list_archive).get_activity()


/home/lem/Data/bigbang/bigbang/mailman.py:138: UserWarning: No mailing list name found at 6lo
  warnings.warn("No mailing list name found at %s" % url)
59
Opening 59 archive files

In [4]:
people = None
people = pd.DataFrame(activity.sum(0), columns=['6lo']) # sum message counts across dates, giving one total per sender

In [5]:
people.describe()


Out[5]:
6lo
count 248.000000
mean 10.479839
std 29.060171
min 1.000000
25% 1.000000
50% 2.000000
75% 8.000000
max 278.000000

Now repeat, parsing the archives and collecting the activities for all the mailing lists in the corpus. To make this faster, we try to open pre-created -activity.csv files which contain the activity summary for the full list archive. These files are created with bin/mail_to_activity.py or might be included in the mailing list archive repository.
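
For reference, here is roughly how an activity summary could be produced for a single list. This is a minimal sketch: the exact filename and location that open_activity_summary expects may differ, and bin/mail_to_activity.py handles this in bulk.

In [ ]:
# Sketch: write a per-sender message count summary for one list
# (assumes the "<listname>-activity.csv" naming convention).
arx = Archive(mailman.open_list_archives('6lo', ietf_archives_dir))
arx.get_activity().sum(0).to_frame('Message Count').to_csv('6lo-activity.csv')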


In [6]:
with open('../examples/mm.ietf.org.txt', 'r') as f:
    ietf_lists = set(f.readlines()) # use a set to drop duplicate entries (a bug in how the list file is maintained)

In [7]:
list_activities = []

for list_url in ietf_lists:
    try:
        activity_summary = mailman.open_activity_summary(list_url, ietf_archives_dir)
        if activity_summary is not None:
            list_activities.append((list_url, activity_summary))
    except Exception as e:
        print(str(e))


/home/lem/Data/bigbang/bigbang/mailman.py:138: UserWarning: No mailing list name found at 
  warnings.warn("No mailing list name found at %s" % url)

In [8]:
len(list_activities)


Out[8]:
955

Merge all of the activity summaries together, so that every row is a "From" field and every column is a mailing list, with each cell holding the number of messages that sender posted to that list. The result is a very sparse, 2-d table. This operation is a little slow; don't repeat it without first recreating people from the cells above.


In [9]:
list_columns = []
for (list_url, activity_summary) in list_activities:
    list_name = mailman.get_list_name(list_url)
    activity_summary.rename(columns={'Message Count': list_name}, inplace=True) # name the message count column for the list
    people = pd.merge(people, activity_summary, how='outer', left_index=True, right_index=True)
    list_columns.append(list_name) # keep a list of the columns that specifically represent mailing list message counts

In [10]:
# the original message column was duplicated during the merge process, so we remove it here
people = people.drop(columns=['6lo_y'])
people = people.rename(columns={'6lo_x':'6lo'})

In [12]:
people.describe()


Out[12]:
6lo caldav 3gv6 martini hip lmap 90all casm 97-newcomers aaa-doctors ... ietf-announce raven supa imapext 77attendees ieee-ietf-coord meta-model 92all openpgp arp222
count 248.000000 24.000000 64.000000 68.000000 159.000000 175.000000 11.000000 28.000000 4.000000 107.000000 ... 207.000000 136.000000 139.000000 97.000000 160.000000 119.000000 6.000000 13.000000 1451.000000 11.000000
mean 10.479839 4.041667 6.171875 31.044118 11.081761 16.788571 2.272727 5.714286 2.000000 11.598131 ... 85.183575 8.235294 11.978417 14.587629 2.543750 8.949580 2.333333 2.384615 5.337698 3.272727
std 29.060171 4.037640 8.233795 55.723313 33.983236 50.301915 1.737292 7.402988 2.160247 48.594949 ... 702.130008 27.255237 27.795179 29.020378 2.799027 32.589342 2.250926 2.501282 26.098838 4.173510
min 1.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 1.000000 ... 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000 1.000000 1.000000 1.000000
25% 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.750000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
50% 2.000000 2.000000 2.000000 5.000000 2.000000 3.000000 1.000000 2.500000 1.500000 2.000000 ... 2.000000 2.000000 3.000000 3.000000 1.000000 2.000000 1.500000 1.000000 1.000000 2.000000
75% 8.000000 6.500000 8.250000 28.750000 5.000000 7.500000 3.500000 7.250000 2.750000 4.500000 ... 5.500000 6.000000 7.000000 12.000000 3.000000 6.000000 3.500000 3.000000 2.000000 3.000000
max 278.000000 15.000000 34.000000 250.000000 289.000000 462.000000 6.000000 27.000000 5.000000 392.000000 ... 8951.000000 282.000000 193.000000 185.000000 16.000000 345.000000 6.000000 10.000000 524.000000 15.000000

8 rows × 955 columns


In [46]:
# not sure how the index ended up with NaN values, but need to change them to strings here so additional steps will work
new_index = people.index.fillna('missing')
people.index = new_index

Split out the email address and display name from the From header we started with.


In [47]:
froms = pd.Series(people.index)
emails = froms.apply(lambda x: email.utils.parseaddr(x)[1])
emails.index = people.index
names = froms.apply(lambda x: email.utils.parseaddr(x)[0])
names.index = people.index
people['email'] = emails
people['name'] = names

Let's create some summary statistical columns.


In [48]:
people['Total Messages'] = people[list_columns].sum(axis=1)
people['Number of Groups'] = people[list_columns].count(axis=1)
people['Median Messages per Group'] = people[list_columns].median(axis=1)

In [49]:
people['Total Messages'].sum()


Out[49]:
1944019.0

In this corpus, roughly 100,000 "people" (distinct From fields) sent a combined total of about 1.9 million messages. Most people sent only a single message.

Participation patterns

The vast majority of people send only a few messages, and to only a couple of lists. (These histograms use a log scale on the Y axis; without it, no column besides the first would even be visible.)


In [22]:
people[['Total Messages']].plot(kind='hist', bins=100, logy=True, logx=False)
people[['Number of Groups']].plot(kind='hist', bins=100, logy=True, logx=False)


Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fee281d4590>

Let's limit our analysis for now to people who have sent at least 5 messages. We will also create log base 10 versions of our summary columns for easier graphing later.


In [23]:
working = people[people['Total Messages'] > 5].copy() # .copy() avoids pandas' SettingWithCopyWarning when adding columns below

working['Total Messages (log)'] = np.log10(working['Total Messages'])
working['Number of Groups (log)'] = np.log10(working['Number of Groups'])



The median number of messages that a user sends to a group is also heavily weighted towards small values, but the curve doesn't seem to drop off in the same extreme manner. Perhaps there is a non-random tendency to keep sending messages to a group once someone has posted there at all?


In [24]:
working[['Median Messages per Group']].plot(kind='hist', bins=100, logy=True)


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fee2b12d450>

Is there a relationship between the number of groups that a user has sent messages to and the number of messages that user has sent (in total, or the median number per group)?


In [25]:
working.plot.scatter('Number of Groups','Total Messages', xlim=(1,300), ylim=(1,20000), logx=False, logy=True)


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fee2b76f150>

It appears that there are interesting outliers here: some people send a couple of messages each to a large number of groups, while a separate set of outliers sends many messages to many groups. That latter set might be an elite component worthy of separate analysis (a rough cut at selecting it is sketched below).
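
A rough, hypothetical way to pull out that elite component for a closer look (the thresholds here are arbitrary and purely illustrative):

In [ ]:
# Sketch: people who post to many groups and also send many messages overall.
elite = working[(working['Number of Groups'] > 50) & (working['Total Messages'] > 1000)]
elite[['name', 'Total Messages', 'Number of Groups']].head()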

A density graph will show, however, that while there are people who send many messages to a small number of groups, most people are clustered around sending few messages to few groups.


In [26]:
sns.jointplot(x='Number of Groups',y='Total Messages (log)', data=working, kind="kde", xlim=(0,50), ylim=(0,3));


Relationships between groups and participants

Can we learn implicit relationships between groups based on the messaging patterns of participants?

PCA

We want to work with just the data of people and how many messages they sent to each group.


In [27]:
df = people[people['Total Messages'] > 5]

df = df.drop(columns=['email','name','Total Messages','Number of Groups','Median Messages per Group'])
df = df.fillna(0)

Principal Component Analysis (PCA) will seek to explain the most variance in the samples (participants) based on the features (messages sent to different lists). Let's try with two components and see what PCA sees as the most distinguishing dimensions of IETF participation.


In [28]:
import sklearn.preprocessing
from sklearn.decomposition import PCA

scaled = sklearn.preprocessing.maxabs_scale(df)

pca = PCA(n_components=2, whiten=True)
pca.fit(scaled)


Out[28]:
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=True)

In [29]:
components_frame = pd.DataFrame(pca.components_)
components_frame.columns = df.columns
components_frame


Out[29]:
6lo caldav 3gv6 martini hip lmap 90all casm 97-newcomers aaa-doctors ... ietf-announce raven supa imapext 77attendees ieee-ietf-coord meta-model 92all openpgp arp222
0 0.046593 0.007496 0.034607 0.029775 0.012952 0.024247 0.062401 0.047514 0.012421 0.017786 ... 0.013914 0.000477 0.024968 0.022945 0.149100 0.020810 0.014023 0.049329 0.004796 0.002724
1 -0.028740 -0.002030 -0.012836 -0.018841 -0.015708 -0.019847 0.169830 -0.010719 0.011904 -0.011855 ... -0.009133 0.000118 -0.018471 -0.017098 -0.028839 -0.006234 -0.009671 0.128938 -0.005413 -0.000900

2 rows × 955 columns


In [30]:
for i, row in components_frame.iterrows():
    print('\nComponent %d' % i)
    r = row.sort_values(ascending=False)
    print('Most positive correlation:\n %s' % r[:5].index.values)
    print('Most negative correlation:\n %s' % r[-5:].index.values)


Component 0
Most positive correlation:
 ['93attendees' '88attendees' '77attendees' '87attendees' 'bofchairs']
Most negative correlation:
 ['tap' 'eos' 'dmarc-report' 'web' 'spam']

Component 1
Most positive correlation:
 ['89all' '90all' '91all' '82all' '94all']
Most negative correlation:
 ['ippm' 'rtgwg' 'i-d-announce' 'l2vpn' 'l3vpn']

In the scaled run shown above, the strongest positive loadings for both components are on the per-meeting lists: Component 0 loads most heavily on the "NNattendees" lists (and bofchairs), while Component 1 loads on the "NNall" meeting-wide lists, with routing-area groups (l2vpn, l3vpn, rtgwg), ippm and i-d-announce at its negative end.

When the data was unscaled, the components looked significantly different: they seemed to connect to routing (Layer 3 and Layer 2 VPNs, the routing area working group, interdomain routing), to "ops" groups concerned with the management, configuration and measurement of networks, and to Internet- and transport-area groups, mostly related to IPv6 (including mobility-related extensions) and PSTN transport. Looking up these groups, there is some meaningful coherence within each component.

That we see such different results when the data is first scaled by each feature perhaps suggests that the unscaled analysis was mostly just picking up on the largest groups.
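
For reference, a sketch of how the unscaled comparison mentioned above could be reproduced, fitting PCA directly on the raw counts in df:

In [ ]:
# Sketch: PCA on the unscaled counts, for comparison with the scaled result above.
pca_unscaled = PCA(n_components=2, whiten=True).fit(df)
pd.DataFrame(pca_unscaled.components_, columns=df.columns)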


In [31]:
pca.explained_variance_


Out[31]:
array([0.00354831, 0.0025176 ])

The variance explained by our components seems extremely small.
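
The ratio form may be easier to read: the fraction of total variance each component captures (a sketch using the same fitted pca object as above).

In [ ]:
# Sketch: fraction of total variance captured by each of the two components.
pca.explained_variance_ratio_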

With two components (or the two most significant components), we can attempt a basic visualization as a scatter plot.


In [32]:
component_df = pd.DataFrame(pca.transform(scaled), columns=['PCA%i' % i for i in range(2)], index=df.index) # project the same scaled matrix the PCA was fit on
component_df.plot.scatter(x='PCA0',y='PCA1')


Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fee2b0e9a10>

And with a larger number of components?


In [33]:
pca = PCA(n_components=10, whiten=True)
pca.fit(scaled)
components_frame = pd.DataFrame(pca.components_)
components_frame.columns = df.columns
for i, row in components_frame.iterrows():
    print('\nComponent %d' % i)
    r = row.sort_values(ascending=False)
    print('Most positive correlation:\n %s' % r[:5].index.values)
    print('Most negative correlation:\n %s' % r[-5:].index.values)


Component 0
Most positive correlation:
 ['93attendees' '88attendees' '77attendees' '87attendees' 'bofchairs']
Most negative correlation:
 ['tap' 'eos' 'dmarc-report' 'web' 'spam']

Component 1
Most positive correlation:
 ['89all' '90all' '91all' '82all' '94all']
Most negative correlation:
 ['ippm' 'rtgwg' 'i-d-announce' 'l2vpn' 'l3vpn']

Component 2
Most positive correlation:
 ['l3vpn' 'l2vpn' 'adslmib' 'i-d-announce' 'psamp-text']
Most negative correlation:
 ['100attendees' '96attendees' '88attendees' '97attendees' '93attendees']

Component 3
Most positive correlation:
 ['88attendees' 'ngtrans' '94attendees' '96attendees' '93attendees']
Most negative correlation:
 ['websec' 'happiana' 'art' 'http-auth' 'apps-discuss']

Component 4
Most positive correlation:
 ['97attendees' '96attendees' 'rtgwg' '99attendees' 'rtg-yang-coord']
Most negative correlation:
 ['monami6' '68attendees' 'mip6' '77attendees' '72attendees']

Component 5
Most positive correlation:
 ['ianaplan' 'iasa20' 'v6ops' 'mtgvenue' 'ipv6']
Most negative correlation:
 ['martini' '87attendees' '81attendees' 'rai' 'dispatch']

Component 6
Most positive correlation:
 ['72attendees' 'opsawg' 'netconf' 'mib-doctors' 'supa']
Most negative correlation:
 ['94attendees' '99attendees' '96attendees' '100attendees' '97attendees']

Component 7
Most positive correlation:
 ['dispatch' 'rai' 'p2psip' 'martini' 'avtext']
Most negative correlation:
 ['ietf-message-headers' 'hubmib' 'happiana' 'psamp-text' 'apps-discuss']

Component 8
Most positive correlation:
 ['72attendees' 'idr' '81attendees' '74attendees' '75attendees']
Most negative correlation:
 ['bofchairs' 'sipcore' 'martini' 'rai' 'dispatch']

Component 9
Most positive correlation:
 ['tools-development' 'ietf-sow' 'agenda-tool' 'ccg' 'iola-wgcharter-tool']
Most negative correlation:
 ['mcic' 'vpn4dc' 'wgguide' 'apps-discuss' '77attendees']

There are definitely subject domain areas in these lists: Component 7, for example, picks out groups from the real-time applications area related to SIP and telephony (dispatch, rai, p2psip, martini, avtext), while Component 9 picks out the IETF tools lists. Also interesting is the presence of some meta-topics in Component 5, like mtgvenue, ianaplan and iasa20 (IETF governance topics).

Future work: we might be able to use this sparse matrix of participation in different lists to provide recommendations of similarity, as in "people who send messages to the same mix of groups you send to also like this other list" or "people who like this list also often like that list".
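
One way to sketch that idea (an illustration only, not part of the analysis above): compute cosine similarity between the list columns of the participation matrix and look up the lists most similar to a given one. The list name '6lo' is just an example.

In [ ]:
# Sketch: item-item similarity between mailing lists based on who posts to them.
# This can be memory-heavy; it could be restricted to the more active senders.
from sklearn.metrics.pairwise import cosine_similarity

list_matrix = people[list_columns].fillna(0)
sim = pd.DataFrame(cosine_similarity(list_matrix.T), index=list_columns, columns=list_columns)
sim['6lo'].sort_values(ascending=False)[1:6]  # the five lists most similar to 6lo, excluding itself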

Betweenness, PageRank and graph visualization

Because we have people and the groups they send messages to, we can construct a bipartite graph.

We'll use just the top 5,000 people by total messages, in order to make the more complicated calculations run faster.


In [34]:
df = people.sort_values(by="Total Messages",ascending=False)[:5000]
df = df.drop(columns=['email','name','Total Messages','Number of Groups','Median Messages per Group'])
df = df.fillna(0)

In [35]:
import networkx as nx

G = nx.Graph()

for group in df.columns:
    G.add_node(group,type="group")
    
for name, data in df.iterrows():
    G.add_node(name,type="person")
    
    for group, weight in data.items():
        if weight > 0:
            G.add_edge(name,group,weight=weight)

In [36]:
nx.is_bipartite(G)


Out[36]:
True

Yep, it is bipartite! Now we can export a graph file for use in Gephi, a graph visualization tool.


In [37]:
nx.write_gexf(G,'ietf-participation-bipartite.gexf')

In [38]:
people_nodes, group_nodes = nx.algorithms.bipartite.sets(G)


---------------------------------------------------------------------------
AmbiguousSolution                         Traceback (most recent call last)
<ipython-input-38-9df133334611> in <module>()
----> 1 people_nodes, group_nodes = nx.algorithms.bipartite.sets(G)

/home/lem/.local/lib/python2.7/site-packages/networkx/algorithms/bipartite/basic.pyc in sets(G, top_nodes)
    204         if not is_connected(G):
    205             msg = 'Disconnected graph: Ambiguous solution for bipartite sets.'
--> 206             raise nx.AmbiguousSolution(msg)
    207         c = color(G)
    208         X = {n for n, is_top in c.items() if is_top}

AmbiguousSolution: Disconnected graph: Ambiguous solution for bipartite sets.
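
The graph is disconnected, so NetworkX cannot infer the two sides on its own. Since every node carries a type attribute, one workaround (a sketch) is to pass one side explicitly via the top_nodes argument:

In [ ]:
# Sketch: supply the person nodes explicitly so the bipartite sets are
# unambiguous even though the graph is disconnected.
person_side = {n for n, d in G.nodes(data=True) if d['type'] == 'person'}
people_nodes, group_nodes = nx.algorithms.bipartite.sets(G, top_nodes=person_side)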

We can calculate the "PageRank" of each person and group, using the weights (number of messages) between groups and people to distribute a kind of influence.


In [39]:
pr = nx.pagerank(G, weight="weight")

In [40]:
nx.set_node_attributes(G, "pagerank", pr)


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-40-76db8672f77d> in <module>()
----> 1 nx.set_node_attributes(G, "pagerank", pr)

/home/lem/.local/lib/python2.7/site-packages/networkx/classes/function.pyc in set_node_attributes(G, values, name)
    652         except AttributeError:  # `values` is a constant
    653             for n in G:
--> 654                 G.nodes[n][name] = values
    655     else:  # `values` must be dict of dict
    656         for n, d in values.items():

TypeError: unhashable type: 'dict'
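
The call fails because the argument order changed: as the signature in the traceback shows, NetworkX 2.x expects set_node_attributes(G, values, name). A sketch of the corrected call, without which the sorting cells below raise KeyError: 'pagerank':

In [ ]:
# Sketch: (G, values, name) argument order for NetworkX 2.x.
nx.set_node_attributes(G, pr, "pagerank")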

In [ ]:
sorted([node for node in list(G.nodes(data=True)) 
        if node[1]['type'] == 'group'], 
       key=lambda x: x[1]['pagerank'], 
       reverse =True)[:10]

In [41]:
sorted([node for node in list(G.nodes(data=True)) 
        if node[1]['type'] == 'person'], 
       key=lambda x: x[1]['pagerank'], 
       reverse =True)[:10]


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-41-0b71f2221c5b> in <module>()
      2         if node[1]['type'] == 'person'], 
      3        key=lambda x: x[1]['pagerank'],
----> 4        reverse =True)[:10]

<ipython-input-41-0b71f2221c5b> in <lambda>(x)
      1 sorted([node for node in list(G.nodes(data=True)) 
      2         if node[1]['type'] == 'person'], 
----> 3        key=lambda x: x[1]['pagerank'],
      4        reverse =True)[:10]

KeyError: 'pagerank'

However, PageRank is probably less informative than usual here, because this is a bipartite, non-directed graph. Instead, let's calculate a normalized closeness centrality specific to bipartite graphs.


In [42]:
person_nodes = [node[0] for node in G.nodes(data=True) if node[1]['type'] == 'person']

NB: Slow operation for large graphs.


In [43]:
cc = nx.algorithms.bipartite.centrality.closeness_centrality(G, person_nodes, normalized=True)

In [44]:
for node, value in list(cc.items()):
    if not isinstance(node, str): # look for any spurious non-string node keys
        print(node)
        print(value)


5000.0
0.0

In [45]:
del cc[14350.0] # remove a spurious node value


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-45-00ab5d6aabb5> in <module>()
----> 1 del cc[14350.0] # remove a spurious node value

KeyError: 14350.0
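
The key 14350.0 appears to be left over from an earlier run; in this run the spurious non-string node found above is 5000.0. A sketch that removes whatever non-string keys are present, rather than hard-coding one:

In [ ]:
# Sketch: drop any non-string node keys from the centrality dict.
for spurious in [k for k in cc if not isinstance(k, str)]:
    del cc[spurious]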

In [ ]:
nx.set_node_attributes(G, cc, "closeness") # (values, name) argument order for NetworkX 2.x

In [ ]:
sorted([node for node in list(G.nodes(data=True)) 
        if node[1]['type'] == 'person'], 
       key=lambda x: x[1]['closeness'], 
       reverse=True)[:25]

The people with the highest closeness centrality are the ones that have the most co-affiliation with every other person, or the shortest path to every other person. Automated accounts are, as we might expect, extremely high on this measure -- they're used to send announcements of publications and do so to basically every group. The individual people highest ranked on this measure include Stephen Farrell, Jari Arkko, Ben Campbell -- long-time participants with leadership roles. The highest ranked woman is Alissa Cooper, current Chair of the IETF.

TODO: calculating bi-cliques (the people who all are connected to the same group) and then measuring correlation in bi-cliques (people who belong to many of the same groups) could allow for analysis of cohesive subgroups and a different network analysis/visualization. See Borgatti, S.P. and Halgin, D. In press. “Analyzing Affiliation Networks”. In Carrington, P. and Scott, J. (eds) The Sage Handbook of Social Network Analysis. Sage Publications. http://www.steveborgatti.com/papers/bhaffiliations.pdf
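
One concrete step toward that analysis (a sketch, reusing the person_nodes list defined above): project the bipartite graph onto the person nodes, so that edge weights count how many groups two people have both posted to.

In [ ]:
# Sketch: one-mode projection onto people; edge weight = number of shared groups.
# This can be slow and produce a dense graph for thousands of person nodes.
P = nx.algorithms.bipartite.weighted_projected_graph(G, person_nodes)
sorted(P.edges(data='weight'), key=lambda e: e[2], reverse=True)[:10]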

