We want to analyze participants and patterns of participation across IETF groups. How many people participate, and in which groups? How do affiliation, gender, RFC authorship or other characteristics relate to levels of participation? How do groups relate to one another? Which participants provide important connections between groups? These are a few of a variety of related questions.

Setup and gather data

Start by importing the necessary libraries.



In [1]:

    
%matplotlib inline
import bigbang.mailman as mailman
import bigbang.graph as graph
import bigbang.process as process
from bigbang.parse import get_date
from bigbang.archive import Archive
import bigbang.utils as utils
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import numpy as np
import math
import pytz
import pickle
import os
import csv
import re
import scipy
import scipy.cluster.hierarchy as sch
import email



In [2]:

    
#pd.options.display.mpl_style = 'default' # pandas has a set of preferred graph formatting options
plt.rcParams['axes.facecolor'] = 'white'
import seaborn as sns
sns.set()
sns.set_style("white")

Let's start with a single IETF mailing list. (Later, we can expand to all current groups, or all IETF lists ever.)



In [3]:

    
list_url = '6lo' # start with the 6lo list (perpass is another option; it happens to be one that I subscribe to)

ietf_archives_dir = '../archives' # relative location of the ietf-archives directory/repo

list_archive = mailman.open_list_archives(list_url, ietf_archives_dir)
activity = Archive(list_archive).get_activity()









    



/home/lem/Data/bigbang/bigbang/mailman.py:138: UserWarning: No mailing list name found at 6lo
  warnings.warn("No mailing list name found at %s" % url)






    



59
Opening 59 archive files



In [4]:

    
people = None
people = pd.DataFrame(activity.sum(0), columns=['6lo']) # sum the message count, rather than by date



In [5]:

    
people.describe()

Now repeat, parsing the archives and collecting the activities for all the mailing lists in the corpus. To make this faster, we try to open pre-created -activity.csv files which contain the activity summary for the full list archive. These files are created with bin/mail_to_activity.py or might be included in the mailing list archive repository.



In [6]:

    
with open('../examples/mm.ietf.org.txt', 'r') as f: # close the file once the lines are read
    ietf_lists = set(f.readlines()) # remove duplicates, which is a bug in list maintenance



In [7]:

    
list_activities = []

for list_url in ietf_lists:
    try:
        activity_summary = mailman.open_activity_summary(list_url, ietf_archives_dir)
        if activity_summary is not None:
            list_activities.append((list_url, activity_summary))
    except Exception as e:
        print(str(e))









    



/home/lem/Data/bigbang/bigbang/mailman.py:138: UserWarning: No mailing list name found at 
  warnings.warn("No mailing list name found at %s" % url)



In [8]:

    
len(list_activities)









    Out[8]:





955

Merge all of the activity summaries together, so that every row is a "From" field, with a column for every mailing list and a cell that holds the number of messages sent to that list. This will be a very sparse, 2-d table. The operation is a little slow. Don't repeat it without first recreating people from the cells above: running the merge again on an already-merged people adds every list column a second time.



In [9]:

    
list_columns = []
for (list_url, activity_summary) in list_activities:
    list_name = mailman.get_list_name(list_url)
    activity_summary.rename(columns={'Message Count': list_name}, inplace=True) # name the message count column for the list
    people = pd.merge(people, activity_summary, how='outer', left_index=True, right_index=True)
    list_columns.append(list_name) # keep a list of the columns that specifically represent mailing list message counts



In [10]:

    
# the original message column was duplicated during the merge process, so we remove it here
people = people.drop(columns=['6lo_y'])
people = people.rename(columns={'6lo_x':'6lo'})
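
As a small illustration of what the outer merge does, the sketch below joins two made-up activity summaries. The addresses and the list names list-a and list-b are hypothetical, not taken from the corpus. It also shows where columns like 6lo_x and 6lo_y come from: merging in a column whose name already exists makes pandas append its default suffixes.

```python
import pandas as pd

# Hypothetical activity summaries, indexed by sender (not real corpus data).
list_a = pd.DataFrame({'list-a': [3, 1]},
                      index=['alice@example.com', 'bob@example.com'])
list_b = pd.DataFrame({'list-b': [7, 2]},
                      index=['bob@example.com', 'carol@example.com'])

# An outer join on the index keeps every sender; missing counts become NaN.
merged = pd.merge(list_a, list_b, how='outer', left_index=True, right_index=True)
print(merged)

# Merging list_a a second time duplicates its column, so pandas renames the
# two copies with its default '_x' and '_y' suffixes.
duplicated = pd.merge(merged, list_a, how='outer', left_index=True, right_index=True)
print(list(duplicated.columns))
```

Every sender appears once in the merged table, and the NaN cells are what make the full people table so sparse.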



In [12]:

    
people.describe()









    Out[12]:







  
    
      
	6lo	caldav	3gv6	martini	hip	lmap	90all	casm	97-newcomers	aaa-doctors	...	ietf-announce	raven	supa	imapext	77attendees	ieee-ietf-coord	meta-model	92all	openpgp	arp222
count	248.000000	24.000000	64.000000	68.000000	159.000000	175.000000	11.000000	28.000000	4.000000	107.000000	...	207.000000	136.000000	139.000000	97.000000	160.000000	119.000000	6.000000	13.000000	1451.000000	11.000000
mean	10.479839	4.041667	6.171875	31.044118	11.081761	16.788571	2.272727	5.714286	2.000000	11.598131	...	85.183575	8.235294	11.978417	14.587629	2.543750	8.949580	2.333333	2.384615	5.337698	3.272727
std	29.060171	4.037640	8.233795	55.723313	33.983236	50.301915	1.737292	7.402988	2.160247	48.594949	...	702.130008	27.255237	27.795179	29.020378	2.799027	32.589342	2.250926	2.501282	26.098838	4.173510
min	1.000000	1.000000	0.000000	1.000000	1.000000	1.000000	1.000000	1.000000	0.000000	1.000000	...	1.000000	0.000000	1.000000	1.000000	1.000000	1.000000	0.000000	1.000000	1.000000	1.000000
25%	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	0.750000	1.000000	...	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000
50%	2.000000	2.000000	2.000000	5.000000	2.000000	3.000000	1.000000	2.500000	1.500000	2.000000	...	2.000000	2.000000	3.000000	3.000000	1.000000	2.000000	1.500000	1.000000	1.000000	2.000000
75%	8.000000	6.500000	8.250000	28.750000	5.000000	7.500000	3.500000	7.250000	2.750000	4.500000	...	5.500000	6.000000	7.000000	12.000000	3.000000	6.000000	3.500000	3.000000	2.000000	3.000000
max	278.000000	15.000000	34.000000	250.000000	289.000000	462.000000	6.000000	27.000000	5.000000	392.000000	...	8951.000000	282.000000	193.000000	185.000000	16.000000	345.000000	6.000000	10.000000	524.000000	15.000000
    
  

8 rows × 955 columns



In [46]:

    
# not sure how the index ended up with NaN values, but need to change them to strings here so additional steps will work
new_index = people.index.fillna('missing')
people.index = new_index

Split the From header we started with into the email address and the display name.



In [47]:

    
import email.utils # parseaddr lives in the email.utils submodule

# index the Series by the From header so the results align with people
froms = pd.Series(people.index, index=people.index)
people['email'] = froms.apply(lambda x: email.utils.parseaddr(x)[1])
people['name'] = froms.apply(lambda x: email.utils.parseaddr(x)[0])
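
To make the split concrete, here is a minimal standalone sketch of what parseaddr returns for a single header value. The address used is a made-up example.

```python
from email.utils import parseaddr

# parseaddr returns a (display name, address) tuple.
name, address = parseaddr('Jane Doe <jane@example.com>')
print(name)     # Jane Doe
print(address)  # jane@example.com

# A bare address has no display name, so the name part is an empty string.
bare_name, bare_address = parseaddr('jane@example.com')
print(repr(bare_name), bare_address)
```

Because the name part can be empty, the same person may show up under several distinct From values, which is one reason the row count overstates the number of actual people.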

Let's create some summary statistical columns.



In [48]:

    
people['Total Messages'] = people[list_columns].sum(axis=1)
people['Number of Groups'] = people[list_columns].count(axis=1)
people['Median Messages per Group'] = people[list_columns].median(axis=1)
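
The following sketch shows how these three row-wise summaries behave on a tiny, made-up sparse table (the names and counts are hypothetical). Note that count tallies every non-missing cell, so an explicit 0 would still count as a group.

```python
import numpy as np
import pandas as pd

# Hypothetical message counts; NaN means the person never posted to that list.
toy = pd.DataFrame({'list-a': [3, np.nan, 1],
                    'list-b': [7, 2, np.nan],
                    'list-c': [np.nan, np.nan, 5]},
                   index=['alice', 'bob', 'carol'])

summary = pd.DataFrame({
    'Total Messages': toy.sum(axis=1),                # NaN cells are skipped
    'Number of Groups': toy.count(axis=1),            # counts non-missing cells
    'Median Messages per Group': toy.median(axis=1),  # median over non-missing cells
})
print(summary)
```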



In [49]:

    
people['Total Messages'].sum()









    Out[49]:





1944019.0

In this corpus, 101,510 "people" sent a combined total of about 1.9 million messages. Most people sent only 1 message.

Participation patterns

The vast majority of people send only a few messages, and to only a couple of lists. (These histograms use a log axis for Y, without which you couldn't even see the columns besides the first.)



In [22]:

    
people[['Total Messages']].plot(kind='hist', bins=100, logy=True, logx=False)
people[['Number of Groups']].plot(kind='hist', bins=100, logy=True, logx=False)









    Out[22]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fee281d4590>

Let's limit our analysis for now to people who have sent more than 5 messages. We will also create log base 10 versions of our summary columns for easier graphing later.



In [23]:

    
working = people[people['Total Messages'] > 5].copy() # copy, so the new columns go on an independent DataFrame rather than a slice of people

working['Total Messages (log)'] = np.log10(working['Total Messages'])
working['Number of Groups (log)'] = np.log10(working['Number of Groups'])

The median number of messages that a user sends to a group is also heavily weighted towards a small number, but the curve doesn't seem to drop off in the same extreme manner. Is there a non-random tendency to send more than a single message to a group?



In [24]:

    
working[['Median Messages per Group']].plot(kind='hist', bins=100, logy=True)









    Out[24]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fee2b12d450>

Is there a relationship between the number of groups that a user has sent messages to and the number of messages that user has sent (total, or the median number to groups)?



In [25]:

    
working.plot.scatter('Number of Groups','Total Messages', xlim=(1,300), ylim=(1,20000), logx=False, logy=True)









    Out[25]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fee2b76f150>

It appears that there are interesting outliers here. Some people send a couple of messages each to a large number of groups, while a separate set of outliers sends lots of messages to lots of groups. That might be an elite component worthy of separate analysis.

A density graph will show, however, that although some people send many messages to a small number of groups, most people are clustered around sending few messages to few groups.



In [26]:

    
sns.jointplot(x='Number of Groups',y='Total Messages (log)', data=working, kind="kde", xlim=(0,50), ylim=(0,3));

Relationships between groups and participants

Can we learn implicit relationships between groups based on the messaging patterns of participants?

PCA

We want to work with just the data of people and how many messages they sent to each group.



In [27]:

    
df = people[people['Total Messages'] > 5]

df = df.drop(columns=['email','name','Total Messages','Number of Groups','Median Messages per Group'])
df = df.fillna(0)

Principal Component Analysis (PCA) will seek to explain the most variance in the samples (participants) based on the features (messages sent to different lists). Let's try with two components and see what PCA sees as the most distinguishing dimensions of IETF participation.



In [28]:

    
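from sklearn.decomposition import PCA
from sklearn.preprocessing import maxabs_scale # import the submodule function explicitly

scaled = maxabs_scale(df) # scale each list's column by its maximum absolute value

pca = PCA(n_components=2, whiten=True)
pca.fit(scaled)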









    Out[28]:





PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=True)
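
To see what the scaling step does before PCA, the sketch below applies maxabs_scale to a tiny made-up count matrix and compares it with the equivalent manual computation. The numbers are hypothetical; the point is that a very large list and a very small list end up on the same 0 to 1 range.

```python
import numpy as np
from sklearn.preprocessing import maxabs_scale

# Hypothetical counts: rows are participants, columns are a large and a small list.
counts = np.array([[100.0, 2.0],
                   [50.0, 0.0],
                   [10.0, 4.0]])

# maxabs_scale works column by column (axis=0 is the default).
scaled = maxabs_scale(counts)

# The same result by hand: divide each column by its largest absolute value.
manual = counts / np.abs(counts).max(axis=0)
print(scaled)
```

Zero entries stay zero under this scaling, which keeps the sparse structure of the participation table intact.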



In [29]:

    
components_frame = pd.DataFrame(pca.components_)
components_frame.columns = df.columns
components_frame









    Out[29]:







  
    
      
	6lo	caldav	3gv6	martini	hip	lmap	90all	casm	97-newcomers	aaa-doctors	...	ietf-announce	raven	supa	imapext	77attendees	ieee-ietf-coord	meta-model	92all	openpgp	arp222
0	0.046593	0.007496	0.034607	0.029775	0.012952	0.024247	0.062401	0.047514	0.012421	0.017786	...	0.013914	0.000477	0.024968	0.022945	0.149100	0.020810	0.014023	0.049329	0.004796	0.002724
1	-0.028740	-0.002030	-0.012836	-0.018841	-0.015708	-0.019847	0.169830	-0.010719	0.011904	-0.011855	...	-0.009133	0.000118	-0.018471	-0.017098	-0.028839	-0.006234	-0.009671	0.128938	-0.005413	-0.000900
    
  

2 rows × 955 columns



In [30]:

    
for i, row in components_frame.iterrows():
    print('\nComponent %d' % i)
    r = row.sort_values(ascending=False)
    print('Most positive correlation:\n %s' % r[:5].index.values)
    print('Most negative correlation:\n %s' % r[-5:].index.values)









    



Component 0
Most positive correlation:
 ['93attendees' '88attendees' '77attendees' '87attendees' 'bofchairs']
Most negative correlation:
 ['tap' 'eos' 'dmarc-report' 'web' 'spam']

Component 1
Most positive correlation:
 ['89all' '90all' '91all' '82all' '94all']
Most negative correlation:
 ['ippm' 'rtgwg' 'i-d-announce' 'l2vpn' 'l3vpn']

For our two components, we can see which features are most positively correlated and which are most negatively correlated. In the output above, Component 0 loads most positively on meeting attendee lists (93attendees, 88attendees, 77attendees, 87attendees) plus bofchairs, which seems to track participation around in-person meetings. Component 1 loads most positively on the per-meeting "all" lists (89all, 90all, 91all, 82all, 94all); its most negative loadings are mostly routing: Layer 3 and Layer 2 VPNs (l3vpn, l2vpn) and the routing area working group (rtgwg), alongside i-d-announce. (IP Performance/Measurement seems different -- is it related?)

When data was unscaled, PCA components seemed to connect to ops and ipv6, a significantly different result. Looking up those groups, there seemed to be some meaningful coherence: on one component, groups in the "ops" area (groups related to the management, configuration and measurement of networks); on the other, groups in the Internet and transport areas (groups related to IPv6, the transport area and PSTN transport).

That we see such different results when the data is first scaled by each feature perhaps suggests that the unscaled analysis was just picking up on the largest groups.



In [31]:

    
pca.explained_variance_









    Out[31]:





array([0.00354831, 0.0025176 ])

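The variance explained by our components seems extremely tiny. Note, though, that explained_variance_ reports absolute variance (here, of the max-abs-scaled data), not a fraction of the total; pca.explained_variance_ratio_ would show what share of the overall variance each component accounts for.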
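
As a rough check on how to read these numbers, the sketch below fits a PCA to random, made-up data and compares the absolute explained variance with the explained variance ratio. The data here is purely synthetic (an assumption for illustration); only the relationship between the two attributes matters.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a scaled participation matrix:
# 200 hypothetical participants by 6 lists, values between 0 and 1.
rng = np.random.default_rng(0)
X = rng.random((200, 6))

pca_demo = PCA(n_components=2).fit(X)

# Absolute variance along each component, in the units of the (scaled) data.
print(pca_demo.explained_variance_)

# Share of the total variance captured by each component.
print(pca_demo.explained_variance_ratio_)

# The ratio is the absolute value divided by the summed per-feature variance.
total_variance = X.var(axis=0, ddof=1).sum()
ratio_by_hand = pca_demo.explained_variance_ / total_variance
```

Small absolute values are expected when every feature has been scaled into a narrow range, so the ratio is usually the more informative figure.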

With two components (or the two most significant components), we can attempt a basic visualization as a scatter plot.



In [32]:

    
component_df = pd.DataFrame(pca.transform(scaled), columns=['PCA%i' % i for i in range(2)], index=df.index) # project the same scaled data the PCA was fit on
component_df.plot.scatter(x='PCA0',y='PCA1')









    Out[32]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fee2b0e9a10>

And with a larger number of components?



In [33]:

    
pca = PCA(n_components=10, whiten=True)
pca.fit(scaled)
components_frame = pd.DataFrame(pca.components_)
components_frame.columns = df.columns
for i, row in components_frame.iterrows():
    print('\nComponent %d' % i)
    r = row.sort_values(ascending=False)
    print('Most positive correlation:\n %s' % r[:5].index.values)
    print('Most negative correlation:\n %s' % r[-5:].index.values)









    



Component 0
Most positive correlation:
 ['93attendees' '88attendees' '77attendees' '87attendees' 'bofchairs']
Most negative correlation:
 ['tap' 'eos' 'dmarc-report' 'web' 'spam']

Component 1
Most positive correlation:
 ['89all' '90all' '91all' '82all' '94all']
Most negative correlation:
 ['ippm' 'rtgwg' 'i-d-announce' 'l2vpn' 'l3vpn']

Component 2
Most positive correlation:
 ['l3vpn' 'l2vpn' 'adslmib' 'i-d-announce' 'psamp-text']
Most negative correlation:
 ['100attendees' '96attendees' '88attendees' '97attendees' '93attendees']

Component 3
Most positive correlation:
 ['88attendees' 'ngtrans' '94attendees' '96attendees' '93attendees']
Most negative correlation:
 ['websec' 'happiana' 'art' 'http-auth' 'apps-discuss']

Component 4
Most positive correlation:
 ['97attendees' '96attendees' 'rtgwg' '99attendees' 'rtg-yang-coord']
Most negative correlation:
 ['monami6' '68attendees' 'mip6' '77attendees' '72attendees']

Component 5
Most positive correlation:
 ['ianaplan' 'iasa20' 'v6ops' 'mtgvenue' 'ipv6']
Most negative correlation:
 ['martini' '87attendees' '81attendees' 'rai' 'dispatch']

Component 6
Most positive correlation:
 ['72attendees' 'opsawg' 'netconf' 'mib-doctors' 'supa']
Most negative correlation:
 ['94attendees' '99attendees' '96attendees' '100attendees' '97attendees']

Component 7
Most positive correlation:
 ['dispatch' 'rai' 'p2psip' 'martini' 'avtext']
Most negative correlation:
 ['ietf-message-headers' 'hubmib' 'happiana' 'psamp-text' 'apps-discuss']

Component 8
Most positive correlation:
 ['72attendees' 'idr' '81attendees' '74attendees' '75attendees']
Most negative correlation:
 ['bofchairs' 'sipcore' 'martini' 'rai' 'dispatch']

Component 9
Most positive correlation:
 ['tools-development' 'ietf-sow' 'agenda-tool' 'ccg' 'iola-wgcharter-tool']
Most negative correlation:
 ['mcic' 'vpn4dc' 'wgguide' 'apps-discuss' '77attendees']

There are definitely subject domain areas in these lists: Component 7, for example, groups lists from the real-time applications and infrastructure area (dispatch, rai, p2psip, martini, avtext), and Component 9 groups lists about IETF tooling. Also interesting is the presence of some meta-topics, like mtgvenue or iasa20 (an IETF governance topic), in Component 5.

Future work: we might be able to use this sparse matrix of participation in different lists to provide recommendations of similarity. "People who send messages to the same mix of groups you send to also like this other list" or "People who like this list, also often like this list".
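
One possible starting point for such recommendations is cosine similarity between list columns. This is only a sketch of the idea on made-up counts; the list names, the numbers and the helper most_similar_list are all hypothetical, and cosine similarity is an assumption here, not something the analysis above has evaluated.

```python
import numpy as np

lists = ['list-a', 'list-b', 'list-c']

# Hypothetical counts: rows are participants, columns are messages per list.
counts = np.array([[5.0, 4.0, 0.0],
                   [2.0, 3.0, 0.0],
                   [0.0, 0.0, 6.0],
                   [1.0, 0.0, 2.0]])


def most_similar_list(counts, lists, name):
    """Return the list whose participation pattern is closest to `name`."""
    # Normalize each column to unit length (assumes no all-zero columns).
    columns = counts / np.linalg.norm(counts, axis=0)
    # Dot products of unit vectors are cosine similarities between lists.
    similarity = columns.T @ columns
    i = lists.index(name)
    scores = similarity[i].copy()
    scores[i] = -np.inf  # never recommend the list itself
    return lists[int(np.argmax(scores))]


print(most_similar_list(counts, lists, 'list-a'))
```

On the real table, the same computation would run over the filled (NaN replaced by 0) participation matrix, probably with a sparse representation given its size.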

Betweenness, PageRank and graph visualization

Because we have people and the groups they send to, we can construct a bipartite graph.

We'll use just the 5000 people who sent the most messages, in order to make complicated calculations run faster.



In [34]:

    
df = people.sort_values(by="Total Messages",ascending=False)[:5000]
df = df.drop(columns=['email','name','Total Messages','Number of Groups','Median Messages per Group'])
df = df.fillna(0)



In [35]:

    
import networkx as nx

G = nx.Graph()

for group in df.columns:
    G.add_node(group,type="group")
    
for name, data in df.iterrows():
    G.add_node(name,type="person")
    
    for group, weight in data.items():
        if weight > 0:
            G.add_edge(name,group,weight=weight)



In [36]:

    
nx.is_bipartite(G)









    Out[36]:





True

Yep, it is bipartite! Now, we can export a graph file for use in the visualization software Gephi.



In [37]:

    
nx.write_gexf(G,'ietf-participation-bipartite.gexf')



In [38]:

    
# bipartite.sets(G) raises AmbiguousSolution when the graph is disconnected,
# so split the nodes using the 'type' attribute we set when building the graph
people_nodes = {n for n, d in G.nodes(data=True) if d['type'] == 'person'}
group_nodes = {n for n, d in G.nodes(data=True) if d['type'] == 'group'}
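
The sketch below reproduces the situation on a tiny made-up graph: it is bipartite but not connected, so the two node sets cannot be inferred by coloring alone. Passing the known people as top_nodes (or simply filtering on the type attribute) sidesteps the ambiguity. The node names are hypothetical.

```python
import networkx as nx
from networkx.algorithms import bipartite

G_demo = nx.Graph()
G_demo.add_nodes_from(['alice', 'bob'], type='person')
G_demo.add_nodes_from(['list-a', 'list-b'], type='group')
G_demo.add_edge('alice', 'list-a', weight=3)
G_demo.add_edge('bob', 'list-b', weight=1)  # a second, separate component

print(nx.is_bipartite(G_demo), nx.is_connected(G_demo))  # True False

# We already know which nodes are people, so tell networkx explicitly.
known_people = {n for n, d in G_demo.nodes(data=True) if d['type'] == 'person'}
people_side, group_side = bipartite.sets(G_demo, top_nodes=known_people)
print(sorted(people_side), sorted(group_side))
```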

We can calculate the "PageRank" of each person and group, using the weights (number of messages) between groups and people to distribute a kind of influence.



In [39]:

    
pr = nx.pagerank(G, weight="weight")



In [40]:

    
nx.set_node_attributes(G, pr, "pagerank") # networkx 2.x argument order: (graph, values, attribute name)
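
For reference, here is a minimal sketch of the same two steps on a made-up weighted graph: compute PageRank, then store each score on its node. It assumes networkx 2.x or later, where set_node_attributes takes the values before the attribute name; the node names and weights are hypothetical.

```python
import networkx as nx

G_demo = nx.Graph()
G_demo.add_edge('alice', 'list-a', weight=5)
G_demo.add_edge('bob', 'list-a', weight=1)
G_demo.add_edge('bob', 'list-b', weight=1)

# Weighted PageRank: heavier edges pass along more of a node's score.
pr_demo = nx.pagerank(G_demo, weight='weight')

# Argument order is (graph, values, attribute name).
nx.set_node_attributes(G_demo, pr_demo, 'pagerank')

for node, data in G_demo.nodes(data=True):
    print(node, round(data['pagerank'], 3))
```

Once the attribute is set, sorting nodes by x[1]['pagerank'] works as in the cells that follow.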



In [ ]:

    
sorted([node for node in list(G.nodes(data=True))
        if node[1]['type'] == 'group'],
       key=lambda x: x[1]['pagerank'],
       reverse=True)[:10]



In [41]:

    
sorted([node for node in list(G.nodes(data=True))
        if node[1]['type'] == 'person'],
       key=lambda x: x[1]['pagerank'],
       reverse=True)[:10]

However, PageRank is probably less informative than usual here, because this is a bipartite, undirected graph. Instead, let's calculate a normalized closeness centrality specific to bipartite graphs.



In [42]:

    
person_nodes = [node[0] for node in G.nodes(data=True) if node[1]['type'] == 'person']

NB: Slow operation for large graphs.



In [43]:

    
cc = nx.algorithms.bipartite.centrality.closeness_centrality(G, person_nodes, normalized=True)
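
To get a feel for what this measure rewards, the sketch below runs the same function on a tiny made-up graph in which one person posts to both lists and the other two post to one list each. The names are hypothetical.

```python
import networkx as nx
from networkx.algorithms import bipartite

G_demo = nx.Graph()
demo_people = ['alice', 'bob', 'carol']
G_demo.add_nodes_from(demo_people, type='person')
G_demo.add_nodes_from(['list-a', 'list-b'], type='group')
G_demo.add_edges_from([('alice', 'list-a'), ('bob', 'list-a'),
                       ('bob', 'list-b'), ('carol', 'list-b')])

# The second argument names one side of the bipartition (here, the people).
cc_demo = bipartite.closeness_centrality(G_demo, demo_people, normalized=True)

# The result covers every node, people and groups alike.
best_person = max(demo_people, key=lambda n: cc_demo[n])
print(best_person)
```

The person who bridges both lists comes out on top, which matches the intuition that closeness favors participants who connect otherwise separate groups.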



In [44]:

    
# look for spurious node keys that are not strings
for node, value in list(cc.items()):
    if not isinstance(node, str):
        print(node)
        print(value)



In [45]:

    
cc.pop(14350.0, None) # remove a spurious node value, if it is present



In [ ]:

    
nx.set_node_attributes(G, cc, "closeness") # networkx 2.x argument order: (graph, values, attribute name)



In [ ]:

    
sorted([node for node in list(G.nodes(data=True)) 
        if node[1]['type'] == 'person'], 
       key=lambda x: x[1]['closeness'], 
       reverse=True)[:25]

The people with the highest closeness centrality are the ones that have the most co-affiliation with every other person, or the shortest path to every other person. Automated accounts are, as we might expect, extremely high on this measure -- they're used to send announcements of publications and do so to basically every group. The individual people highest ranked on this measure include Stephen Farrell, Jari Arkko, Ben Campbell -- long-time participants with leadership roles. The highest ranked woman is Alissa Cooper, current Chair of the IETF.

TODO: calculating bi-cliques (the people who all are connected to the same group) and then measuring correlation in bi-cliques (people who belong to many of the same groups) could allow for analysis of cohesive subgroups and a different network analysis/visualization. See Borgatti, S.P. and Halgin, D. In press. “Analyzing Affiliation Networks”. In Carrington, P. and Scott, J. (eds) The Sage Handbook of Social Network Analysis. Sage Publications. http://www.steveborgatti.com/papers/bhaffiliations.pdf


