This work was done by Harsh Gupta as part of his internship at The Center for Internet & Society India



In [1]:

    
import bigbang.mailman as mailman
import bigbang.process as process
from bigbang.archive import Archive


import pandas as pd
import datetime

from commonregex import CommonRegex

import matplotlib.pyplot as plt
%matplotlib inline

Encrypted Media Extensions Diversity Analysis

Encrypted Media Extensions (EME) is a controversial draft standard at the W3C that aims to prevent copyright infringement in digital video but raises many issues regarding security, accessibility, privacy and interoperability. This notebook analyzes whether the interests of the important stakeholders were well represented in the debate that took place on the W3C's public-html mailing list.

Methodology

Any email with EME, Encrypted Media or Digital Rights Management in the subject line is considered to be about EME. Each participant is then categorized by the region of the world they belong to and by their employer's interest in the debate. Notes about the participants can be found here.

Region Methodology:

Look up their personal website and social media accounts (Twitter, LinkedIn, GitHub) and see if they mention the country the person lives in. (This works in most cases.)
If the person's email address uses a country-specific top-level domain, assume that country.
If a GitHub profile is available, look up the timezone on the last 5 commits.
For people who have moved from their home country, consider the country where they live now.
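
The top-level-domain step of the region methodology can be sketched as follows. This is only an illustration, not the procedure actually used to build the participant notes: the `CCTLD_TO_REGION` table and the `region_from_email` helper are hypothetical names, and the table covers just a handful of country codes.

```python
# Hypothetical lookup table: a few country-code TLDs mapped to the
# region names used later in this notebook. Deliberately incomplete.
CCTLD_TO_REGION = {
    "in": "Asia", "jp": "Asia",
    "au": "Australia and New Zealand", "nz": "Australia and New Zealand",
    "de": "Europe", "fr": "Europe", "uk": "Europe",
    "za": "Africa",
    "ca": "North America", "us": "North America",
    "br": "South America",
}


def region_from_email(address):
    """Guess a region from the address's TLD; return None for generic TLDs."""
    # Take everything after the last "@", dropping a trailing ">" if the
    # input looks like "Name <user@host>".
    domain = address.rsplit("@", 1)[-1].strip("<> ").lower()
    tld = domain.rsplit(".", 1)[-1]
    return CCTLD_TO_REGION.get(tld)


print(region_from_email("Someone <someone@example.de>"))
print(region_from_email("someone@example.com"))
```

A generic domain such as `.com` or `.org` yields `None`, which is why the methodology falls back on websites, social media profiles and commit timezones.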

Work Methodology

Look up their personal website and social media accounts (Twitter, LinkedIn, GitHub), see if they mention the employer, and categorize accordingly.
People who work on Accessibility, Privacy or Security but also fit into one of the first three categories (Foss Browser Developer, Content Provider, DRM platform provider) are placed in that category. For example, someone who works on privacy at Google is placed in "DRM platform provider" instead of "Privacy".
If no other category can be assigned, assign "None of the above".
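
The precedence rule above can be sketched in a few lines. This is a hypothetical helper, not code from the original analysis, which was tagged by hand. The category labels are copied verbatim from the notebook's category table (including its spelling), and the order among the categories after the first three is an assumption, since the methodology does not specify it.

```python
# Categories in precedence order: the first three win over the rest.
# The ordering after the first three is an assumption.
WORK_CATEGORIES = [
    "Foss Browser Developer",
    "Content Provider",
    "DRM platform provider",
    "Accessibility",
    "Security Researcher",
    "Other W3C Empoyee",
    "Privacy",
]


def assign_work_category(fits):
    """Return the highest-precedence category among those a person fits."""
    for category in WORK_CATEGORIES:
        if category in fits:
            return category
    return "None of the above"


# Someone working on privacy at a DRM platform provider.
print(assign_work_category({"Privacy", "DRM platform provider"}))
```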

Other Notes

Google's position is interesting: it is a DRM provider as a browser manufacturer, but also a content provider through YouTube, and a fair number of Google employees are against EME because of other concerns. I've categorized Christian as a content provider because he works on YouTube, and I've placed everyone else at Google as a DRM provider.



In [2]:

    
from functools import reduce  # a builtin on Python 2; must be imported on Python 3


def filter_messages(df, column, keywords):
    filters = []
    for keyword in keywords:
        filters.append(df[column].str.contains(keyword, case=False))

    return df[reduce(lambda p, q: p | q, filters)]
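
The keyword match is a case-insensitive substring test, so the exact keyword string matters. The standalone sketch below mimics that test without pandas (`subject_matches` and the sample subjects are made up for illustration) and shows the trade-off of padding a short keyword with spaces. Note that pandas' `str.contains` treats the keyword as a regular expression by default; for plain keywords like these the result is the same.

```python
def subject_matches(subject, keywords):
    """Case-insensitive substring test against any of the keywords."""
    lowered = subject.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


# Made-up subject lines for illustration.
subjects = [
    "Re: EME and accessibility",
    "Emerging use cases for video",
    "How to implement captions",
    "EME: formal objection",
]

padded = [s for s in subjects if subject_matches(s, [" EME "])]
bare = [s for s in subjects if subject_matches(s, ["EME"])]

print(padded)  # only the first subject
print(bare)    # all four subjects
```

The padded keyword avoids false positives such as "Emerging" and "implement", but it also misses subjects where EME starts the line or is followed by punctuation, so the counts in this notebook are best read as a lower bound.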



In [3]:

    
# Get the archives
pd.options.display.mpl_style = 'default'  # pandas has a set of preferred graph formatting options

mlist = mailman.open_list_archives("https://lists.w3.org/Archives/Public/public-html/", archive_dir="./archives") 

# The spaces around EME are **very** important; otherwise the filter also catches words like "emerging" and "implement"
eme_messages = filter_messages(mlist, 'Subject', [' EME ', 'Encrypted Media', 'Digital Rights Managagement'])
eme_activites = Archive.get_activity(Archive(eme_messages))









    



Opening 69 archive files






    



/home/hargup/code/bigbang/bigbang/archive.py:74: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  self.data.sort(columns='Date', inplace=True)



In [4]:

    
eme_activites.sum(0).sum()









    Out[4]:





474.0



In [5]:

    
# XXX: Bugzilla might also contain discussions
eme_activites.drop("bugzilla@jessica.w3.org", axis=1, inplace=True)



In [6]:

    
# Remove duplicate senders
levdf = process.sorted_matrix(eme_activites)

consolidates = []
# gather pairs of names which have a distance of less than 10
for col in levdf.columns:
    for index, value in levdf.loc[levdf[col] < 10, col].iteritems():
        if index != col:  # the name shouldn't be a pair for itself
            consolidates.append((col, index))

# Handpick special cases which aren't covered with string matching
consolidates.extend([(u'Kornel Lesi\u0144ski <kornel@geekhood.net>',
                      u'wrong string <kornel@geekhood.net>'),
                     (u'Charles McCathie Nevile <chaals@yandex-team.ru>',
                      u'Charles McCathieNevile <chaals@opera.com>')])

eme_activites = process.consolidate_senders_activity(eme_activites, consolidates)
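
To make the threshold in the consolidation step concrete, here is a pure-Python sketch of pairing senders by edit distance. It assumes the distance matrix holds plain Levenshtein distances between full sender strings. `levenshtein` and `close_pairs` are hypothetical helpers, not bigbang functions, and the sender strings are made up.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def close_pairs(senders, threshold=10):
    """All ordered pairs of distinct senders closer than the threshold."""
    return [(a, b) for a in senders for b in senders
            if a != b and levenshtein(a, b) < threshold]


senders = ["Jane Doe <jane@example.org>", "Jane M. Doe <jane@example.org>"]
print(levenshtein(senders[0], senders[1]))  # 3
print(close_pairs(senders))
```

A threshold of 10 is generous for strings of this length, so short sender strings belonging to different people could also fall under it. Reviewing the resulting pairs by hand is a sensible precaution.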



In [7]:

    
sender_categories = pd.read_csv('people_tag.csv', delimiter=',', encoding="utf-8-sig")

# match sender using email only
sender_categories['email'] = [CommonRegex(x).emails[0].lower() for x in sender_categories['name_email']]

sender_categories.index = sender_categories['email']
cat_dicts = {
    "region":{
        1: "Asia",
        2: "Australia and New Zealand",
        3: "Europe",
        4: "Africa",
        5: "North America",
        6: "South America"
    },
    "work":{
        1: "Foss Browser Developer",
        2: "Content Provider",
        3: "DRM platform provider",
        4: "Accessibility",
        5: "Security Researcher",
        6: "Other W3C Empoyee",
        7: "Privacy",
        8: "None of the above"

    }
}



In [8]:

    
def get_cat_val_func(cat):
    """
    Given category type, returns a function which gives the category value for a sender.
    """
    def _get_cat_val(sender):
        try:
            sender_email = CommonRegex(sender).emails[0].lower()
            return cat_dicts[cat][sender_categories.loc[sender_email][cat]]
        except (KeyError, IndexError):
            # no email found in the sender string, or the sender is not in people_tag.csv
            return "Unknown"
    return _get_cat_val



In [9]:

    
grouped = eme_activites.groupby(get_cat_val_func("region"), axis=1)
print("Emails sent per region\n")
print(grouped.sum().sum())
print("Total emails: %s" % grouped.sum().sum().sum())









    



Emails sent per region

Australia and New Zealand     16
Europe                       146
North America                310
dtype: float64
Total emails: 472.0



In [10]:

    
print("Participants per region")
for group in grouped.groups:
    print("%s: %s" % (group, len(grouped.get_group(group).sum())))
print("Total participants: %s" % len(eme_activites.columns))









    



Participants per region
Europe: 13
North America: 30
Australia and New Zealand: 5
Total participants: 48

Notice that there is no one at all from Asia, Africa or South America. This matters because DRM laws and attitudes towards intellectual property vary considerably across the world.
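
For a sense of proportion, the regional counts printed above can be converted into percentage shares. The sketch below simply re-enters the printed numbers by hand. `shares` is a hypothetical helper, not part of the original notebook.

```python
# Counts copied from the outputs above.
emails = {"Australia and New Zealand": 16, "Europe": 146, "North America": 310}
participants = {"Australia and New Zealand": 5, "Europe": 13, "North America": 30}


def shares(counts):
    """Percentage share of each key, rounded to one decimal place."""
    total = sum(counts.values())
    return {key: round(100.0 * value / total, 1) for key, value in counts.items()}


email_shares = shares(emails)
participant_shares = shares(participants)
print(email_shares)        # North America: 65.7, Europe: 30.9, Australia and New Zealand: 3.4
print(participant_shares)  # North America: 62.5, Europe: 27.1, Australia and New Zealand: 10.4
```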



In [11]:

    
grouped = eme_activites.groupby(get_cat_val_func("work"), axis=1)
print("Emails sent per work category")
print(grouped.sum().sum())









    



Emails sent per work category
Accessibility              47
Content Provider          186
DRM platform provider     100
Foss Browser Developer     56
None of the above          71
Other W3C Empoyee          10
Privacy                     2
dtype: float64



In [12]:

    
print("Participants per work category")
for group in grouped.groups:
    print("%s: %s" % (group, len(grouped.get_group(group).sum())))









    



Participants per work category
Privacy: 2
Foss Browser Developer: 5
Accessibility: 4
Other W3C Empoyee: 3
DRM platform provider: 15
Content Provider: 9
None of the above: 10
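
Combining the two work-category tables gives a rough measure of how active each group was. The sketch below re-enters the printed counts by hand and divides emails by participants. It is a back-of-the-envelope reading of the outputs above, not part of the original analysis.

```python
# Counts copied from the two work-category outputs above
# (labels kept exactly as printed).
emails = {
    "Accessibility": 47, "Content Provider": 186, "DRM platform provider": 100,
    "Foss Browser Developer": 56, "None of the above": 71,
    "Other W3C Empoyee": 10, "Privacy": 2,
}
participants = {
    "Accessibility": 4, "Content Provider": 9, "DRM platform provider": 15,
    "Foss Browser Developer": 5, "None of the above": 10,
    "Other W3C Empoyee": 3, "Privacy": 2,
}

# Average number of emails sent per participant in each category.
emails_per_participant = {
    category: round(emails[category] / float(participants[category]), 2)
    for category in emails
}

for category in sorted(emails_per_participant, key=emails_per_participant.get, reverse=True):
    print("%s: %s" % (category, emails_per_participant[category]))
```

By this measure content providers sent about 20.67 emails per participant, against 6.67 for DRM platform providers and 1.0 for privacy advocates.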