This work was done by Harsh Gupta as part of his internship at The Center for Internet & Society India
In [1]:
import bigbang.mailman as mailman
import bigbang.process as process
from bigbang.archive import Archive
import pandas as pd
import datetime
from commonregex import CommonRegex
import matplotlib.pyplot as plt
%matplotlib inline
Encrypted Media Extensions (EME) is a controversial draft standard at the W3C that aims to prevent copyright infringement in digital video but opens the door to many issues regarding security, accessibility, privacy and interoperability. This notebook tries to analyze whether the interests of the important stakeholders were well represented in the debate that took place on the public-html mailing list of the W3C.
Any email with EME, Encrypted Media or Digital Rights Management in the subject line is considered to be about EME. Each participant is then categorized by the region of the world they belong to and by their employer's interest in the debate. Notes about the participants can be found here.
The region of a participant is determined as follows:
Look up their personal website and social media accounts (Twitter, LinkedIn, GitHub) and see if they mention the country the person lives in. (This works in most cases.)
If the person's email address uses a country-specific top-level domain, assume that country.
If a GitHub profile is available, look up the timezone of the last 5 commits.
For people who have moved away from their home country, use the country where they live now.
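The top-level-domain step above can be sketched in a few lines. This is only an illustration: `guess_country_from_email` and the `COUNTRY_TLDS` table are hypothetical names that are not part of the notebook, and the mapping is deliberately tiny. Generic domains such as `.com` or `.org` give no answer, which is why this step is only one of several.

```python
# Hypothetical helper, not part of the original analysis.
# Illustrative (and incomplete) mapping from country-code TLDs to countries.
COUNTRY_TLDS = {
    "in": "India",
    "de": "Germany",
    "jp": "Japan",
    "br": "Brazil",
    "au": "Australia",
}


def guess_country_from_email(email):
    """Return a country for a known country-code TLD, or None otherwise."""
    domain = email.rsplit("@", 1)[-1].lower()  # part after the last "@"
    tld = domain.rsplit(".", 1)[-1]            # part after the last "."
    return COUNTRY_TLDS.get(tld)


print(guess_country_from_email("someone@example.de"))   # Germany
print(guess_country_from_email("someone@example.com"))  # None
```

A guess produced this way should still be checked against the other steps, since people often keep an address from a country they no longer live in.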
The work category of a participant is determined as follows:
Look up their personal website and social media accounts (Twitter, LinkedIn, GitHub), see if they mention the employer, and categorize accordingly.
People who work on Accessibility, Privacy or Security but also fit into one of the first three categories are placed in that category. For example, someone who works on privacy at Google is placed in "DRM platform provider" instead of "Privacy".
If no other category can be assigned, assign "None of the above".
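The priority rule above can be written down as a small function. This is a sketch, not the procedure actually used (the tagging was done by hand): `pick_work_category` is a hypothetical name, the numeric codes follow the "work" mapping defined later in this notebook, and the tie-breaking by lowest code is an assumption made only to keep the example deterministic.

```python
# Numeric codes follow the "work" mapping used in this notebook.
# Codes 1-3: FOSS browser developer, content provider, DRM platform provider.
PRIORITY_CODES = (1, 2, 3)
NONE_OF_THE_ABOVE = 8


def pick_work_category(candidate_codes):
    """Pick one work category code from all the codes that could apply."""
    candidates = set(candidate_codes)
    # One of the first three categories always wins over the topical ones.
    priority = [code for code in PRIORITY_CODES if code in candidates]
    if priority:
        return priority[0]  # assumption: the lowest code wins a tie
    remaining = sorted(candidates - {NONE_OF_THE_ABOVE})
    if remaining:
        return remaining[0]  # assumption: the lowest code wins a tie
    return NONE_OF_THE_ABOVE


print(pick_work_category([7, 3]))  # 3: privacy work at a DRM platform provider
print(pick_work_category([7]))     # 7: privacy only
print(pick_work_category([]))      # 8: nothing applies
```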
In [2]:
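from functools import reduce  # built in on Python 2, must be imported on Python 3

def filter_messages(df, column, keywords):
    """Return the rows of df whose column contains any of the keywords (case-insensitive)."""
    filters = []
    for keyword in keywords:
        filters.append(df[column].str.contains(keyword, case=False))
    # OR the per-keyword boolean masks together
    return df[reduce(lambda p, q: p | q, filters)]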
In [3]:
# Get the archives
pd.options.display.mpl_style = 'default' # pandas has a set of preferred graph formatting options
mlist = mailman.open_list_archives("https://lists.w3.org/Archives/Public/public-html/", archive_dir="./archives")
# The spaces around EME are **very** important; otherwise it can catch things like "emerging", "implement" etc.
eme_messages = filter_messages(mlist, 'Subject', [' EME ', 'Encrypted Media', 'Digital Rights Management'])
eme_activites = Archive.get_activity(Archive(eme_messages))
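The literal ' EME ' keyword only matches when a space sits on both sides, so a subject that ends in "EME" or writes "[EME]" is skipped. One possible alternative, shown here as a standalone sketch using only the standard `re` module, is a word-boundary pattern. It is not what the analysis above uses, and `EME_PATTERN` and the sample subjects are made up for illustration.

```python
import re

# \b matches at a word boundary, so "EME" is found at the start or end of a
# subject or next to punctuation, but not inside "emerging" or "implement".
EME_PATTERN = re.compile(
    r"\bEME\b|Encrypted Media|Digital Rights Management",
    re.IGNORECASE,
)

# Made-up subject lines for illustration only.
subjects = [
    "Re: EME",
    "[EME] Key system requirements",
    "Emerging standards for video",
    "How to implement the video element",
    "Encrypted Media proposal",
]

for subject in subjects:
    print("%-5s %s" % (bool(EME_PATTERN.search(subject)), subject))
```

Since `str.contains` in pandas treats its pattern as a regular expression by default, a keyword such as `r'\bEME\b'` could in principle be passed to `filter_messages` directly. Doing so would change which messages are selected, so the counts would need to be re-checked.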
In [4]:
eme_activites.sum(0).sum()
Out[4]:
In [5]:
# XXX: Bugzilla might also contain discussions
eme_activites.drop("bugzilla@jessica.w3.org", axis=1, inplace=True)
In [6]:
# Remove duplicate senders
levdf = process.sorted_matrix(eme_activites)
consolidates = []
# gather pairs of names which have a distance of less than 10
for col in levdf.columns:
    for index, value in levdf.loc[levdf[col] < 10, col].iteritems():
        if index != col:  # the name shouldn't be a pair for itself
            consolidates.append((col, index))
# Handpick special cases which aren't covered by string matching
consolidates.extend([(u'Kornel Lesi\u0144ski <kornel@geekhood.net>',
                      u'wrong string <kornel@geekhood.net>'),
                     (u'Charles McCathie Nevile <chaals@yandex-team.ru>',
                      u'Charles McCathieNevile <chaals@opera.com>')])
eme_activites = process.consolidate_senders_activity(eme_activites, consolidates)
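The variable name `levdf` suggests that the matrix holds Levenshtein (edit) distances between sender strings. Assuming that is the case, the following standalone sketch shows how such a distance is computed and why a small threshold pairs up near-identical senders. The `levenshtein` function and the sample senders are illustrative and are not taken from bigbang.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions
    needed to turn string a into string b (classic dynamic programming)."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3

# Made-up senders: same person, two slightly different addresses.
first = "Jane Doe <jane@example.org>"
second = "Jane Doe <jdoe@example.org>"
print(levenshtein(first, second) < 10)  # True, so the pair would be merged
```

A fixed threshold on the full "Name <email>" string can also pair up two different people with short, similar names, so the resulting pairs are worth a manual look.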
In [7]:
sender_categories = pd.read_csv('people_tag.csv', delimiter=',', encoding="utf-8-sig")
# match sender using email only
sender_categories['email'] = [CommonRegex(x).emails[0].lower() for x in sender_categories['name_email']]
sender_categories.index = sender_categories['email']
# numeric codes used in people_tag.csv and their human-readable labels
cat_dicts = {
    "region": {
        1: "Asia",
        2: "Australia and New Zealand",
        3: "Europe",
        4: "Africa",
        5: "North America",
        6: "South America"
    },
    "work": {
        1: "Foss Browser Developer",
        2: "Content Provider",
        3: "DRM platform provider",
        4: "Accessibility",
        5: "Security Researcher",
        6: "Other W3C Employee",
        7: "Privacy",
        8: "None of the above"
    }
}
In [8]:
def get_cat_val_func(cat):
    """
    Given a category type, returns a function which gives the category value for a sender.
    """
    def _get_cat_val(sender):
        try:
            sender_email = CommonRegex(sender).emails[0].lower()
            return cat_dicts[cat][sender_categories.loc[sender_email][cat]]
        except KeyError:
            # sender is not tagged in people_tag.csv
            return "Unknown"
    return _get_cat_val
In [9]:
grouped = eme_activites.groupby(get_cat_val_func("region"), axis=1)
print("Emails sent per region\n")
print(grouped.sum().sum())
print("Total emails: %s" % grouped.sum().sum().sum())
In [10]:
print("Participants per region")
for group in grouped.groups:
    print("%s: %s" % (group, len(grouped.get_group(group).sum())))
print("Total participants: %s" % len(eme_activites.columns))
Notice that there is no one at all from Asia, Africa or South America. This is important because DRM laws and attitudes towards intellectual property vary considerably across the world.
In [11]:
grouped = eme_activites.groupby(get_cat_val_func("work"), axis=1)
print("Emails sent per work category")
print(grouped.sum().sum())
In [12]:
print("Participants per work category")
for group in grouped.groups:
    print("%s: %s" % (group, len(grouped.get_group(group).sum())))