In [1]:
# NOTE: The RuntimeWarnings (if any) are harmless. See ContinuumIO/anaconda-issues#6678.
from pandas.io import gbq
import pandas as pd
In [2]:
import getpass
import subprocess

# Configuration Variables. Modify as desired.
# Read the active GCP project from the local gcloud CLI configuration.
_gcloud_cmd = ["gcloud", "config", "get-value", "project"]
PROJECT = subprocess.check_output(_gcloud_cmd).decode().strip()
In [26]:
import datetime

# Build BigQuery _TABLE_SUFFIX literals ("YYYYMM", double-quoted) for the
# current month and the preceding months, newest first.
_now = datetime.datetime.now()
month = _now.month
year = _now.year
num_months = 12
months = []
for _ in range(num_months):
    months.append('"{0}{1:02}"'.format(year, month))
    # Step one month back, rolling over the year at January.
    if month > 1:
        month -= 1
    else:
        month = 12
        year -= 1
In [27]:
import json
import os
import requests

# Download the Kubeflow devstats GitHub-users dump once and cache it locally,
# so re-running the notebook does not re-fetch it.
if not os.path.exists(".cache"):
    os.makedirs(".cache")
users_file = os.path.join(".cache", "github_users.json")
if not os.path.exists(users_file):
    url = "https://github.com/kubeflow/community/blob/master/devstats/data/github_users.json?raw=true"
    r = requests.get(url, allow_redirects=True)
    # Fail loudly on HTTP errors; otherwise an error page would be written
    # to the cache and silently poison every later run.
    r.raise_for_status()
    with open(users_file, "wb") as hf:
        hf.write(r.content)

with open(users_file) as hf:
    data = json.load(hf)

# Keep only the columns needed to map a GitHub login to a company.
users = pd.DataFrame(data)
users = users[["login", "company"]]
In [28]:
# Dedupe companies: canonical names, keyed by substrings we match against
# a user's (lowercased) company field.
c = ["cisco", "datawire", "google", "ibm", "intel", "teradata", "red hat"]
known_companies = {name: name for name in c}
# Common alternate spelling.
known_companies["redhat"] = "red hat"


def normalize_company(name):
    """Map a free-form company string onto a canonical company name.

    Lowercases and trims the input, strips leading/trailing "!" and "@"
    characters, then returns the canonical name of the first known
    company whose key occurs as a substring. Unknown companies are
    returned in cleaned form; None becomes the string "None".
    """
    if name is None:
        return "None"
    cleaned = name.strip().lower().strip("!").strip("@")
    for alias, canonical in known_companies.items():
        if alias in cleaned:
            return canonical
    return cleaned
# Replace the raw free-form company strings with canonical names.
users["company"] = users["company"].apply(normalize_company)
In [29]:
def combine_company(names):
    """Return the first entry in ``names`` that is not "None".

    Used to collapse a user's multiple company records into a single
    value; returns None when every entry is the placeholder "None".
    """
    real_names = (n for n in names if n != "None")
    return next(real_names, None)
# Collapse duplicate login rows into a single company per user,
# yielding a Series indexed by GitHub login.
user_map= users.groupby("login")["company"].apply(combine_company)
# You can now look up users as user_map[actor]
# Spot-check a single login (cell output).
user_map["jlewi"]
Out[29]:
In [30]:
# Query the public GitHub Archive BigQuery dataset for every PR "opened"
# event in the kubeflow org. _TABLE_SUFFIX restricts the wildcard table
# to the months computed earlier.
query = """
SELECT
  DATE(created_at) AS pr_date,
  actor.id,
  actor.login
FROM `githubarchive.month.*`
WHERE
  _TABLE_SUFFIX IN ({0})
  AND type = 'PullRequestEvent'
  AND org.login = 'kubeflow'
  AND JSON_EXTRACT(payload, '$.action') IN ('"opened"')
""".format(",".join(months))
# query is already a str, so the previous str(query) wrapper was redundant.
prs = gbq.read_gbq(query, dialect='standard', project_id=PROJECT)
In [31]:
# Series of PR actor ids indexed by creation date, in chronological order.
p = pd.Series(prs["id"].values, index=prs["pr_date"]).sort_index()
In [32]:
# Inspect the fetched PR events (cell output).
# NOTE(review): this renders the entire DataFrame; consider prs.head()
# to keep the saved notebook output small.
prs
Out[32]:
In [33]:
# Attach each PR author's (normalized) company.
# Series.map leaves NaN for logins missing from user_map instead of
# raising KeyError the way user_map[prs["login"]] indexing would; for
# logins present in the map the result is identical.
prs["company"] = prs["login"].map(user_map)
In [34]:
# Work on an explicit copy so adding the "count" column below does not
# mutate a view of prs (avoids pandas' SettingWithCopyWarning and the
# ambiguity of writing through a chained slice).
d = prs[["pr_date", "company"]].copy()
# One row per PR; summing this column during the pivot yields PR counts.
d["count"] = 1
In [35]:
# Daily PR counts: one row per date, one column per company; dates with
# no PRs from a company are filled with 0.
pr_counts = d.pivot_table(
    values="count",
    index="pr_date",
    columns="company",
    aggfunc="sum",
    fill_value=0,
)
In [46]:
# Some solutions here: https://stackoverflow.com/questions/46470743/how-to-efficiently-compute-a-rolling-unique-count-in-a-pandas-time-series
# Need to figure out how to do a time based window
# 28-day rolling sum of the daily per-company PR counts.
# NOTE(review): rolling('28d') requires a datetime-like index — pr_date
# comes back from BigQuery DATE(); confirm the index dtype is datetime64.
counts = pr_counts.rolling('28d').sum()
In [128]:
# Altair is used below but was never imported in the visible notebook,
# which makes this cell fail with NameError on a fresh kernel.
import altair as alt

# Companies to chart.
names = ["google", "cisco", "microsoft", "arrikto", "ibm", "seldon"]

# .copy() so adding the "day" column does not mutate a view of counts
# (avoids pandas' SettingWithCopyWarning).
companies_df = counts[names].copy()
companies_df["day"] = companies_df.index

# Long format — one (day, company, value) row per point — as Altair expects.
companies = pd.melt(companies_df, value_vars=names, id_vars=["day"])

chart = alt.Chart(companies, title="PRs")
point = chart.mark_point().encode(
    x=alt.X('day', title="Day"),
    y=alt.Y("value", title="# PRs"),
    color="company",
)
point.interactive()
Out[128]: