Everything is a network. Assortativity is an interesting property of networks. It is the tendency of nodes in a network to be attached to other nodes that are similar in some way. In social networks, this is sometimes called "homophily."
One kind of assortativity that is particularly descriptive of network topology is degree assortativity. This is what it sounds like: the assortativity (tendency of nodes to attach to other nodes that are similar) of degree (the number of edges a node has).
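To make the definition concrete, here is a minimal sketch that computes degree assortativity for a small undirected, unweighted graph as the Pearson correlation between the degrees found at the two ends of each edge. The function name `degree_assortativity` and the three toy edge lists are illustrative assumptions. The NetworkX call used later in this notebook is applied to weighted reply graphs, so its numbers will not necessarily match this simplified version.

```python
from collections import Counter


def degree_assortativity(edges):
    """Pearson correlation of the degrees at the two ends of each edge.

    `edges` is an iterable of (u, v) pairs describing an undirected,
    unweighted graph without self-loops.
    """
    edges = list(edges)
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # Count every edge in both directions so the measure is symmetric.
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]

    n = float(len(xs))
    mean = sum(xs) / n  # xs and ys hold the same values, so they share a mean
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0:
        return float("nan")  # undefined when every node has the same degree
    return cov / var


# Toy graphs (hypothetical data, chosen only to show the sign of the measure).
star = [("hub", leaf) for leaf in "abcd"]                    # hub with 4 leaves
path = [(1, 2), (2, 3), (3, 4)]                              # 4-node path
cliques = [("a", "b"), ("x", "y"), ("y", "z"), ("z", "x")]   # edge + triangle

print(degree_assortativity(star))     # -1.0
print(degree_assortativity(path))     # approximately -0.5
print(degree_assortativity(cliques))  # 1.0
```

The star is as disassortative as a graph can be, because every edge joins the high-degree hub to a degree-one leaf. The disjoint edge and triangle are perfectly assortative, because every node is attached only to nodes of its own degree.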
A suggestive observation by Newman (2002) is that social networks such as academic coauthorship networks and film collaborations tend to have positive degree assortativity, while technical and biological networks tend to have negative degree assortativity; that is, the latter are disassortatively mixed. This has implications for how we model the formation of these networks, as well as for their robustness to the removal of nodes.
Looking at open source software collaboration as a sociotechnical system, we can ask whether and to what extent the networks of activity are assortatively mixed. Are these networks more like social networks or technical networks? Or are they something in between?
One kind of network that we can extract from open source project data is the network of email replies on a public mailing list. Mailing lists and discussion forums are often the first point of contact for new community members and can be the site of non-technical social processes that are necessary for the maintenance of the community. Of all the communications media used in coordinating the cooperative work of open source development, mailing lists are the most "social."
We are going to look at the mailing lists associated with a number of open source and on-line collaborative projects. For each list we will construct a network in which nodes are email senders (identified by their email address) and edges are weighted by the number of times a sender has replied directly to another participant on the list. Keep in mind that these are public discussions, so in a sense every reply is sent to everybody.
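As a rough illustration of that construction (not BigBang's actual implementation), the sketch below counts direct replies between senders and treats each count as an edge weight. The addresses and the `replies` list are hypothetical, and the edge direction (from replier to the person replied to) is an assumption made for the example.

```python
from collections import Counter

# Hypothetical (replier, original sender) pairs, one per direct reply.
replies = [
    ("alice@example.org", "bob@example.org"),
    ("alice@example.org", "bob@example.org"),
    ("bob@example.org", "alice@example.org"),
    ("carol@example.org", "bob@example.org"),
]

# Edge (u, v) -> number of times u replied directly to v.
edge_weights = Counter(replies)

# Nodes are every address that appears at either end of a reply.
nodes = sorted({address for pair in replies for address in pair})

for (replier, recipient), weight in sorted(edge_weights.items()):
    print("%s -> %s (weight %d)" % (replier, recipient, weight))
```

Senders who post without ever replying or being replied to would not appear in this toy version; whether to keep them as isolated nodes is a modeling choice.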
In [24]:
from bigbang.archive import Archive
urls = [#"analytics",
        "conferences",
        "design",
        "education",
        "gendergap",
        "historic",
        "hot",
        "ietf-privacy",
        "ipython-dev",
        "ipython-user",
        "languages",
        "maps-l",
        "numpy-discussion",
        "playground",
        "potlatch-dev",
        "python-committers",
        "python-dev",
        "scipy-dev",
        "scipy-user",
        "social-media",
        "spambayes",
        #"wikien-l",
        "wikimedia-l"]
# Load each preprocessed archive and key it by list name
archives = [(url, Archive(url, archive_dir="../archives")) for url in urls]
archives = dict(archives)
The above code reads in preprocessed email archive data. These mailing lists come from a variety of sources (analytics and wikien-l are commented out in the code above, so they are not loaded):
List name | Project | Description |
---|---|---|
analytics | Wikimedia | |
conferences | Python | |
design | Wikimedia | |
education | Wikimedia | |
gendergap | Wikimedia | |
historic | OpenStreetMap | |
hot | OpenStreetMap | Humanitarian OpenStreetMap Team |
ietf-privacy | IETF | |
ipython-dev | IPython | Developer's list |
ipython-user | IPython | User's list |
languages | Wikimedia | |
maps-l | Wikimedia | |
numpy-discussion | Numpy | |
playground | Python | |
potlatch-dev | OpenStreetMap | |
python-committers | Python | |
python-dev | Python | |
scipy-dev | SciPy | Developer's list |
scipy-user | SciPy | User's list |
social-media | Wikimedia | |
spambayes | Python | |
wikien-l | Wikimedia | English language Wikipedia |
wikimedia-l | Wikimedia | |
In [25]:
import bigbang.graph as graph
igs = dict([(k,graph.messages_to_interaction_graph(v.data)) for (k,v) in archives.items()])
In [26]:
igs
Now we have processed the mailing lists into interaction graphs based on replies. This is what those graphs look like:
In [20]:
import networkx as nx
def draw_interaction_graph(ig):
    # Requires Graphviz; newer NetworkX releases expose this layout as
    # nx.nx_agraph.graphviz_layout instead.
    pos = nx.graphviz_layout(ig,prog='neato')
    # node size is scaled by each node's 'sent' attribute
    node_size = [data['sent'] * 4 for name,data in ig.nodes(data=True)]
    nx.draw(ig,
            pos,
            node_size = node_size,
            node_color = 'b',
            alpha = 0.4,
            font_size=18,
            font_weight='bold'
    )
    # edge width is proportional to replies sent
    edgewidth=[d['weight'] for (u,v,d) in ig.edges(data=True)]
    #overlay edges with width based on weight
    nx.draw_networkx_edges(ig,pos,alpha=0.5,width=edgewidth,edge_color='r')
In [21]:
%matplotlib inline
In [22]:
import matplotlib.pyplot as plt
In [27]:
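plt.figure(550,figsize=(12.5, 7.5))
for ln,ig in igs.items():
    print(ln)
    try:
        # NOTE: `i` is not defined anywhere in this notebook as shown, so this
        # call raises NameError and the bare `except` reports a plotting failure.
        plt.subplot(550 + i)
        #print nx.degree_assortativity_coefficient(ig)
        draw_interaction_graph(ig)
    except:
        print('plotting failure')
plt.show()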
Well, that didn't work out so well...
I guess I should just go on to compute the assortativity directly.
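For what it's worth, one likely culprit in the cell above is the subplot index: `i` is never assigned, and the three-digit `plt.subplot` shorthand can only address positions 1 through 9. A minimal sketch of a grid that avoids both problems follows. It is not a drop-in fix for the original cell: the helper name `draw_graph_grid` and the toy graphs are hypothetical, and `nx.spring_layout` is swapped in for the Graphviz layout so that the example has no Graphviz dependency.

```python
import math

import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import networkx as nx


def draw_graph_grid(graphs, cols=5):
    """Draw each graph in `graphs` (a dict of name -> graph) on its own axes."""
    rows = int(math.ceil(len(graphs) / float(cols)))
    fig, axes = plt.subplots(rows, cols, figsize=(12.5, 7.5), squeeze=False)
    flat_axes = list(axes.flat)
    for ax, (name, g) in zip(flat_axes, sorted(graphs.items())):
        pos = nx.spring_layout(g, seed=0)  # stand-in for the Graphviz layout
        nx.draw(g, pos, ax=ax, node_size=20, alpha=0.4)
        ax.set_title(name)
    # Blank out any grid cells left over after the last graph.
    for ax in flat_axes[len(graphs):]:
        ax.axis("off")
    return fig


# Hypothetical stand-ins for the interaction graphs in `igs`.
toy_graphs = {"star": nx.star_graph(5), "path": nx.path_graph(6)}
fig = draw_graph_grid(toy_graphs, cols=2)
```

Passing an explicit `ax` to `nx.draw` keeps each graph on its own panel instead of relying on whichever subplot happens to be current.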
This is every mailing list, with the total number of nodes and its degree assortativity computed.
In [31]:
for ln,ig in igs.items():
    # list name, number of nodes, weighted degree assortativity
    print("%s %d %s" % (ln, len(ig.nodes()), nx.degree_assortativity_coefficient(ig,weight='weight')))
It may be helpful to compare these values to those reported by Newman (2002).
On the whole, with a few exceptions, these reply networks wind up looking much more like technical or biological networks than the social networks of coauthorship and collaboration. Why is this the case?
One explanation is that the mechanism at work in creating these kinds of "interaction" networks over time is very different from the mechanism for creating collaboration or coauthorship networks. These networks are derived from real communications over time in projects actively geared towards encouraging new members and getting the most out of collaborations. Perhaps these kinds of assortativity numbers are typical in projects with leaders who have inclusivity as a priority.
Another possible explanation is that these interaction networks are mirroring the structures of the technical systems that these communities are built around. There is a theory of institutional isomorphism that can be tested in this case, where social and technical institutions are paired.
Look at each project domain (IPython, Wikimedia, OSM, etc.) separately, but include multiple lists from each, and measure assortativity within each list as well as across lists. This would get at how the topology of the cyberinfrastructure affects the social topology of the communities that use it.
Use a more systematic sampling of email lists to build a typology of lists with high and low assortativity. Figure out qualitatively what the differences in structure might mean (we can always go in and read the emails).
Build a generative graph model that with high probability creates networks with this kind of structure (apparently the existing models don't do this well). Test its fit across many interaction graphs, and declare victory for the science of modeling on-line collaboration.
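As a possible starting point for the modeling item above, the sketch below grows a network by simple preferential attachment and measures its degree assortativity, which gives a baseline to compare the reply graphs against. Everything here is an assumption made for illustration: the function names, the one-edge-per-new-node growth rule, and the unweighted Pearson-correlation measure. It is not one of the existing models the text refers to, and it makes no claim about how well such a model would fit.

```python
import random
from collections import Counter


def preferential_attachment(n_nodes, seed=0):
    """Grow a tree: each new node links to one existing node, chosen with
    probability proportional to that node's current degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    endpoints = [0, 1]  # each node appears once per incident edge
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints += [new, target]
    return edges


def degree_assortativity(edges):
    """Pearson correlation of the degrees at the two ends of each edge."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Count every edge in both directions so the measure is symmetric.
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    n = float(len(xs))
    mean = sum(xs) / n
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var if var else float("nan")


model_edges = preferential_attachment(200, seed=0)
model_r = degree_assortativity(model_edges)
print("edges: %d, degree assortativity: %.3f" % (len(model_edges), model_r))
```

Repeating this over many seeds and sizes would give a distribution of baseline values against which the observed coefficients could be compared.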