issues_loader.ipynb is a very similar notebook
Resource specs
If the kernel dies while computing embeddings, it is most likely because you ran out of memory; one mitigation is to compute the embeddings in chunks (see the sketch after the embedding cell below).
Tesla V100 GPU, 32 vCPUs, 244 GB of memory
In [1]:
import logging
import os
from pathlib import Path
import sys
logging.basicConfig(format='%(message)s')
logging.getLogger().setLevel(logging.INFO)
home = str(Path.home())
# Installing the python packages locally doesn't appear to have them automatically
# added to the path, so we need to manually add the directory
local_py_path = os.path.join(home, ".local/lib/python3.6/site-packages")
for p in [local_py_path, os.path.abspath("../../py")]:
    if p not in sys.path:
        logging.info("Adding %s to python path", p)
        # Insert at front because we want to override any installed packages
        sys.path.insert(0, p)
In [2]:
!pip3 install --user --upgrade -r ../requirements.txt
In [3]:
from bs4 import BeautifulSoup
import requests
from fastai.core import parallel, partial
from collections import Counter
from tqdm import tqdm_notebook
import torch
from code_intelligence import embeddings
from code_intelligence import graphql
from code_intelligence import gcs_util
from google.cloud import storage
In [194]:
if not os.getenv("GITHUB_TOKEN"):
    logging.warning("No GitHub token set; defaulting to a hardcoded list of Kubeflow repositories")
    # The list of repos can be updated using the else block
    repo_names = ['arena', 'batch-predict', 'caffe2-operator', 'chainer-operator', 'code-intelligence', 'common', 'community', 'crd-validation', 'example-seldon', 'examples', 'fairing', 'features', 'frontend', 'homebrew-cask', 'homebrew-core', 'internal-acls', 'katib', 'kfctl', 'kfp-tekton', 'kfserving', 'kubebench', 'kubeflow', 'manifests', 'marketing-materials', 'metadata', 'mpi-operator', 'mxnet-operator', 'pipelines', 'pytorch-operator', 'reporting', 'testing', 'tf-operator', 'triage-issues', 'website', 'xgboost-operator']
else:
    gh_client = graphql.GraphQLClient()
    # Note: first:100 caps the results at 100 repositories; see the pagination sketch after this cell
    repo_query = """query repoQuery($org: String!) {
      organization(login: $org) {
        repositories(first:100) {
          totalCount
          edges {
            node {
              name
            }
          }
        }
      }
    }
    """
    variables = {
        "org": "kubeflow",
    }
    results = gh_client.run_query(repo_query, variables)
    repo_nodes = graphql.unpack_and_split_nodes(results, ["data", "organization", "repositories", "edges"])
    repo_names = [n["name"] for n in repo_nodes]

names_str = ", ".join([f"'{n}'" for n in sorted(repo_names)])
print(f"[{names_str}]")
In [5]:
import pandas as pd
from inference import InferenceWrapper
In [6]:
from pathlib import Path
from urllib import request as request_url
def pass_through(x):
    return x
model_url = 'https://storage.googleapis.com/issue_label_bot/model/lang_model/models_22zkdqlr/trained_model_22zkdqlr.pkl'
inference_wrapper = embeddings.load_model_artifact(model_url)
In [7]:
from pandas.io import gbq
import subprocess
# TODO(jlewi): Get the project using fairing?
PROJECT = subprocess.check_output(["gcloud", "config", "get-value", "project"]).strip().decode()
In [166]:
# TODO(jlewi): This code should now be a function in embeddings/github_bigquery.py
query = """SELECT
JSON_EXTRACT(payload, '$.issue.html_url') as html_url,
JSON_EXTRACT(payload, '$.issue.title') as title,
JSON_EXTRACT(payload, '$.issue.body') as body,
JSON_EXTRACT(payload, "$.issue.labels") as labels,
JSON_EXTRACT(payload, "$.issue.updated_at") as updated_at,
org.login,
  type
FROM `githubarchive.month.20*`
WHERE (type="IssuesEvent" or type="IssueCommentEvent") and org.login = 'kubeflow'"""
issues_and_pulls = gbq.read_gbq(query, dialect='standard', project_id=PROJECT)
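Note that JSON_EXTRACT returns JSON-encoded values, so the string columns above keep their surrounding double quotes; that is why the spot check below wraps the URL in quotes and why the columns are run through json.loads further down. If only the bare strings are needed, BigQuery's JSON_EXTRACT_SCALAR returns them unquoted, e.g. JSON_EXTRACT_SCALAR(payload, '$.issue.html_url') as html_url.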
In [167]:
import re
# Filter out pull requests; issue URLs end in /issues/<number> rather than /pull/<number>
pattern = re.compile(r".*issues/[\d]+")
issues_index = issues_and_pulls["html_url"].apply(lambda x: pattern.match(x) is not None)
issues = issues_and_pulls[issues_index]
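A quick optional sanity check on the split, using only names defined above:
n_issues = int(issues_index.sum())
print(f"Kept {n_issues} issue events; dropped {len(issues_and_pulls) - n_issues} pull request events")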
In [168]:
# Keep only the most recent event for each issue
latest_issues = issues.groupby("html_url", as_index=False).apply(lambda x: x.sort_values(["updated_at"]).iloc[-1])
In [169]:
# Example of fetching a specific issue
# This allows easy spot checking of the data
some_issue = "https://github.com/kubeflow/kubeflow/issues/4916"
test_issue = latest_issues.loc[latest_issues["html_url"]==f'"{some_issue}"']
test_issue
Out[169]:
In [170]:
import json
def get_labels(x):
    # Parse the JSON-encoded labels column into a list of label names
    d = json.loads(x)
    return [i["name"] for i in d]
latest_issues["parsed_labels"] = latest_issues["labels"].apply(get_labels)
In [171]:
# Decode the JSON-encoded string columns into plain strings
for f in ["html_url", "title", "body"]:
    latest_issues[f] = latest_issues[f].apply(json.loads)
In [230]:
input_data = latest_issues[["title", "body"]]
In [231]:
issue_embeddings = inference_wrapper.df_to_embedding(input_data)
In [232]:
issue_embeddings.shape
Out[232]:
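If the kernel dies on the cell above, a chunked variant keeps peak memory bounded. A minimal sketch; embed_in_chunks and the chunk size of 512 are hypothetical, and it assumes df_to_embedding returns a NumPy array for any row slice:
import numpy as np

def embed_in_chunks(wrapper, df, chunk_size=512):
    # Embed one slice at a time so peak GPU/host memory stays bounded
    pieces = []
    for start in range(0, len(df), chunk_size):
        pieces.append(wrapper.df_to_embedding(df.iloc[start:start + chunk_size]))
    return np.concatenate(pieces, axis=0)

# issue_embeddings = embed_in_chunks(inference_wrapper, input_data)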
This calls the /text endpoint on the embeddings microservice.
TODO(https://github.com/kubeflow/code-intelligence/issues/126): the label bot microservice needs to be updated to actually use the GraphQL API to match this code. Hopefully, in the interim, the model is robust to the slight deviations caused by differences in whitespace; the cosine-similarity sketch after the cells below is one way to check.
In [233]:
from code_intelligence import util as code_intelligence_util
In [234]:
issue_index = 1020
logging.info(f"Fetching issue {latest_issues.iloc[issue_index]['html_url']}")
issue_owner, issue_repo, issue_num = code_intelligence_util.parse_issue_url(latest_issues.iloc[issue_index]["html_url"].strip("\""))
In [235]:
some_issue_data = embeddings.get_issue(latest_issues.iloc[issue_index]["html_url"], gh_client)
In [224]:
some_issue_data
Out[224]:
In [236]:
print(latest_issues.iloc[issue_index]["title"])
print(some_issue_data["title"])
print(latest_issues.iloc[issue_index]["body"])
print(some_issue_data["body"])
some_issue_data["title"] == latest_issues.iloc[issue_index]["title"]
some_issue_data["body"] == latest_issues.iloc[issue_index]["body"]
Out[236]:
In [237]:
dict_for_embeddings = inference_wrapper.process_dict(some_issue_data)
In [238]:
inference_wrapper.get_pooled_features(dict_for_embeddings['text']).detach().cpu().numpy()
Out[238]:
In [239]:
issue_embeddings[issue_index,:]
Out[239]:
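To put a number on "robust to slight deviations", one option is the cosine similarity between the two vectors above. A minimal sketch, assuming get_pooled_features returns a single pooled vector (possibly with a leading batch dimension, hence the squeeze) and that issue_embeddings is a NumPy array as the indexing above suggests:
import numpy as np

api_vec = inference_wrapper.get_pooled_features(dict_for_embeddings['text']).detach().cpu().numpy().squeeze()
bq_vec = issue_embeddings[issue_index, :]
cos_sim = float(np.dot(api_vec, bq_vec) / (np.linalg.norm(api_vec) * np.linalg.norm(bq_vec)))
print(f"cosine similarity = {cos_sim:.4f}")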
In [263]:
import h5py
import datetime
now = code_intelligence_util.now().isoformat()
In [268]:
git_tag = subprocess.check_output(["git", "describe", "--tags", "--always", "--dirty"]).decode().strip()
file_name = f"kubeflow_issue_embeddings_{now}.hdf5"
local_file = os.path.join(home, file_name)
In [269]:
latest_issues.to_hdf(local_file, "issues", mode="a")
In [270]:
h5_file = h5py.File(local_file, mode="a")
In [271]:
h5_file.create_dataset("issue_embeddings", data=issue_embeddings)
Out[271]:
In [272]:
# store some metadata
h5_file.attrs["file"] = "Get-GitHub-Issues.ipynb"
h5_file.attrs["git-tag"] = git_tag
In [273]:
h5_file.close()
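Before uploading, it can be worth verifying that the file round-trips; a quick check using the key names from the cells above:
reloaded_issues = pd.read_hdf(local_file, "issues")
with h5py.File(local_file, mode="r") as f:
    reloaded_embeddings = f["issue_embeddings"][:]
assert len(reloaded_issues) == reloaded_embeddings.shape[0]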
In [274]:
# NOTE: embeddings_dir isn't defined earlier in this notebook; set it to the GCS
# directory the embeddings should be uploaded to, e.g. embeddings_dir = "gs://<your-bucket>/embeddings"
embeddings_file = os.path.join(embeddings_dir, file_name)
if gcs_util.check_gcs_object(embeddings_file):
    logging.info(f"File {embeddings_file} exists")
else:
    logging.info(f"Copying {local_file} to {embeddings_file}")
    gcs_util.copy_to_gcs(local_file, embeddings_file)
In [67]:
embeddings_file
Out[67]: