This notebook implements end-to-end Semantic Code Search on top of Kubeflow: given an input query string, it returns a list of code snippets semantically similar to the query.
NOTE: If you haven't already, see kubeflow/examples/code_search for instructions on how to get this notebook.
In [ ]:
%%bash
echo "Pip Version Info: " && python2 --version && python2 -m pip --version && echo
echo "Google Cloud SDK Info: " && gcloud --version && echo
echo "Ksonnet Version Info: " && ks version && echo
echo "Kubectl Version Info: " && kubectl version
In [ ]:
! python2 -m pip install -U pip
In [ ]:
# Code Search dependencies
! python2 -m pip install --user https://github.com/kubeflow/batch-predict/tarball/master
! python2 -m pip install --user -r src/requirements.txt
In [ ]:
# BigQuery Cell Dependencies
! python2 -m pip install --user pandas-gbq
In [ ]:
# NOTE: The RuntimeWarnings (if any) are harmless. See ContinuumIO/anaconda-issues#6678.
from pandas.io import gbq
Set the following variables
In [ ]:
import getpass
import subprocess
# Configuration Variables. Modify as desired.
PROJECT = subprocess.check_output(["gcloud", "config", "get-value", "project"]).strip()
# Dataflow Related Variables.
TARGET_DATASET = 'code_search'
WORKING_DIR = 'gs://{0}_code_search/workingDir'.format(PROJECT)
KS_ENV=getpass.getuser()
# DO NOT MODIFY. These are environment variables to be used in a bash shell.
%env PROJECT $PROJECT
%env TARGET_DATASET $TARGET_DATASET
%env WORKING_DIR $WORKING_DIR
In [ ]:
%%bash
# Activate Service Account provided by Kubeflow.
gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}
Additionally, to interact with the underlying cluster, we configure kubectl.
In [ ]:
%%bash
kubectl config set-cluster kubeflow --server=https://kubernetes.default --certificate-authority=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
kubectl config set-credentials jupyter --token "$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
kubectl config set-context kubeflow --cluster kubeflow --user jupyter
kubectl config use-context kubeflow
Collectively, these allow us to interact with Google Cloud Services as well as the Kubernetes Cluster directly to submit TFJobs and execute Dataflow pipelines.
In [ ]:
%%bash
cd kubeflow
# Update Ksonnet to point to the Kubernetes Cluster
ks env add code-search
# Update Ksonnet to use the namespace where kubeflow is deployed. By default it's 'kubeflow'
ks env set code-search --namespace=kubeflow
# Update the Working Directory of the application
sed -i'' "s,gs://example/prefix,${WORKING_DIR}," components/params.libsonnet
# FIXME(sanyamkapoor): This command completely replaces previous configurations.
# Hence, using string replacement in file.
# ks param set t2t-code-search workingDir ${WORKING_DIR}
The following query is run as the first step of the pre-processing pipeline, and its results are sent through a set of transformations. It is illustrative of the rows processed by the pipeline we trigger next.
WARNING: The table is large and the query can take a few minutes to complete.
In [ ]:
query = """
SELECT
MAX(CONCAT(f.repo_name, ' ', f.path)) AS repo_path,
c.content
FROM
`bigquery-public-data.github_repos.files` AS f
JOIN
`bigquery-public-data.github_repos.contents` AS c
ON
f.id = c.id
JOIN (
--this part of the query makes sure repo is watched at least twice since 2017
SELECT
repo
FROM (
SELECT
repo.name AS repo
FROM
`githubarchive.year.2017`
WHERE
type="WatchEvent"
UNION ALL
SELECT
repo.name AS repo
FROM
`githubarchive.month.2018*`
WHERE
type="WatchEvent" )
GROUP BY
1
HAVING
COUNT(*) >= 2 ) AS r
ON
f.repo_name = r.repo
WHERE
f.path LIKE '%.py' AND --with python extension
c.size < 15000 AND --get rid of ridiculously long files
REGEXP_CONTAINS(c.content, r'def ') --contains function definition
GROUP BY
c.content
LIMIT
10
"""
gbq.read_gbq(query, dialect='standard', project_id=PROJECT)
To get started, define your experiment.
Set the following values:
hparams_set: the set of hyperparameters to use; see the kubeflow/examples/code_search documentation for suggestions.
Next, configure your ksonnet environment to use your experiment by defining the following global parameters. Here's an example of what they should look like:
workingDir: "gs://code-search-demo/20181104",
dataDir: "gs://code-search-demo/20181104/data",
project: "code-search-demo",
experiment: "demo-trainer-11-07-dist-sync-gpu",
In this step, we use Google Cloud Dataflow to preprocess the data. The submit-preprocess-job ksonnet component runs code_search.dataflow.cli.preprocess_github_dataset, which submits the Dataflow job.
In [ ]:
%%bash
bq mk ${PROJECT}:${TARGET_DATASET}
In [ ]:
%%bash
cd kubeflow
ks param set --env=code-search submit-preprocess-job targetDataset ${TARGET_DATASET}
ks apply code-search -c submit-preprocess-job
When completed successfully, this should create a dataset in BigQuery named after TARGET_DATASET. Additionally, it dumps CSV files into the data directory under WORKING_DIR which contain training samples (pairs of functions and docstrings) for our TensorFlow model. A representative set of results can be viewed using the following query.
In [ ]:
query = """
SELECT *
FROM
{}.token_pairs
LIMIT
10
""".format(TARGET_DATASET)
gbq.read_gbq(query, dialect='standard', project_id=PROJECT)
This pipeline also writes a set of CSV files which contain function and docstring pairs delimited by a comma. Here, we list a subset of them.
In [ ]:
%%bash
LIMIT=10
gsutil ls ${WORKING_DIR}/data/*.csv | head -n ${LIMIT}
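If you want to peek at one of these files from within the notebook, the cell below is a minimal sketch that streams a single shard with gsutil and loads it into pandas; the exact file names and column layout depend on your pipeline run, so treat the header-less read as an assumption.
In [ ]:
# A sketch only: stream one CSV shard with gsutil and inspect a few rows with pandas.
# Assumes the shards are comma-delimited function/docstring pairs as described above
# and have no header row; adjust if your pipeline version differs.
import subprocess
from StringIO import StringIO  # Python 2
import pandas as pd

csv_files = subprocess.check_output(
    ["gsutil", "ls", "{0}/data/*.csv".format(WORKING_DIR)]).strip().split("\n")
# Files can be large; we only parse the first few rows of the first shard.
sample = pd.read_csv(StringIO(subprocess.check_output(["gsutil", "cat", csv_files[0]])),
                     header=None, nrows=10)
sample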
In [ ]:
%%bash
cd kubeflow
ks show code-search -c t2t-code-search-datagen
In [ ]:
%%bash
cd kubeflow
ks apply code-search -c t2t-code-search-datagen
Once this job finishes, the data directory should contain a vocabulary file and a set of TFRecord files prefixed by the problem name, which in our case is github_function_docstring_extended. Here, we list a subset of them.
In [ ]:
%%bash
LIMIT=10
gsutil ls ${WORKING_DIR}/data/vocab*
gsutil ls ${WORKING_DIR}/data/*train* | head -n ${LIMIT}
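As an optional sanity check, the sketch below counts the examples in a single training shard using the TensorFlow 1.x record iterator (TensorFlow is pulled in by the dependencies installed above); this is an illustration, not part of the original pipeline.
In [ ]:
# A sketch only: count the records in one TFRecord training shard.
import tensorflow as tf

train_files = tf.gfile.Glob('{0}/data/*train*'.format(WORKING_DIR))
num_records = sum(1 for _ in tf.python_io.tf_record_iterator(train_files[0]))
print('{0}: {1} examples'.format(train_files[0], num_records))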
In [ ]:
%%bash
cd kubeflow
ks show code-search -c t2t-code-search-trainer
In [ ]:
%%bash
cd kubeflow
ks apply code-search -c t2t-code-search-trainer
This will generate TensorFlow model checkpoints, which are illustrated below.
In [ ]:
%%bash
gsutil ls ${WORKING_DIR}/output/*ckpt*
In [ ]:
%%bash
cd kubeflow
ks show code-search -c t2t-code-search-exporter
In [ ]:
%%bash
cd kubeflow
ks apply code-search -c t2t-code-search-exporter
Once completed, this will generate a TensorFlow SavedModel which we will use for both online inference (via TF Serving) and offline inference (via Kubeflow Batch Prediction).
In [ ]:
%%bash
gsutil ls ${WORKING_DIR}/output/export/Servo
In [ ]:
%%bash --out EXPORT_DIR_LS
gsutil ls ${WORKING_DIR}/output/export/Servo | grep -oE "([0-9]+)/$"
In [ ]:
# WARNING: This routine will fail if no export has been completed successfully.
MODEL_VERSION = max([int(ts[:-1]) for ts in EXPORT_DIR_LS.split('\n') if ts])
# DO NOT MODIFY. These are environment variables to be used in a bash shell.
%env MODEL_VERSION $MODEL_VERSION
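As an optional check, the sketch below loads the exported SavedModel with TensorFlow 1.x and prints its serving signatures; it assumes the export path layout listed above.
In [ ]:
# A sketch only: inspect the serving signatures of the exported SavedModel.
import tensorflow as tf

export_path = '{0}/output/export/Servo/{1}'.format(WORKING_DIR, MODEL_VERSION)
with tf.Session(graph=tf.Graph()) as sess:
    meta_graph = tf.saved_model.loader.load(
        sess, [tf.saved_model.tag_constants.SERVING], export_path)
    print(meta_graph.signature_def.keys())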
In [ ]:
%%bash
cd kubeflow
ks apply code-search -c submit-code-embeddings-job
When completed successfully, this should create another table in the same BigQuery dataset containing the function embeddings for each data sample produced by the previous Dataflow job. Additionally, it dumps CSV files containing metadata for each function and its embedding. A representative query result is shown below.
In [ ]:
query = """
SELECT *
FROM
{}.function_embeddings
LIMIT
10
""".format(TARGET_DATASET)
gbq.read_gbq(query, dialect='standard', project_id=PROJECT)
The pipeline also generates a set of CSV files which will be used to build the search index.
In [ ]:
%%bash
LIMIT=10
gsutil ls ${WORKING_DIR}/data/*index*.csv | head -n ${LIMIT}
In [ ]:
%%bash
cd kubeflow
ks show code-search -c search-index-creator
In [ ]:
%%bash
cd kubeflow
ks apply code-search -c search-index-creator
Using the CSV files generated in the previous step, this creates an index using NMSLib. It also creates a unified CSV file containing all the code examples, which is used for a human-readable reverse lookup at query time.
In [ ]:
%%bash
gsutil ls ${WORKING_DIR}/code_search_index*
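For illustration, the cell below is a minimal sketch of how such an index is built and queried with NMSLib. The embeddings and query vector here are random placeholders standing in for the function embeddings parsed from the CSV files above and an embedding produced by the model; the actual index is built by the search-index-creator job.
In [ ]:
# A sketch only: build and query an NMSLib index over placeholder embeddings.
import nmslib
import numpy as np

embeddings = np.random.rand(1000, 128).astype(np.float32)  # placeholder function embeddings
query_embedding = np.random.rand(128).astype(np.float32)   # placeholder query embedding

index = nmslib.init(method='hnsw', space='cosinesimil')
index.addDataPointBatch(embeddings)
index.createIndex({'post': 2}, print_progress=False)

# Return the ids and distances of the 5 nearest neighbors; the ids map back to rows
# of the unified lookup CSV for human-readable results.
ids, distances = index.knnQuery(query_embedding, k=5)
print(ids, distances)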
The web app includes two pieces:
- A Flask app that serves a simple UI for sending queries
- A TF Serving instance to compute the embeddings for search queries
The ksonnet components for the web app are in a separate ksonnet application, ks-web-app.
We've seen offline inference during the computation of embeddings. For online inference, we deploy the exported TensorFlow model from above using TensorFlow Serving.
Here are sample contents of an export directory:
gs://code-search-demo/models/20181107-dist-sync-gpu/export/:
gs://code-search-demo/models/20181107-dist-sync-gpu/export/
gs://code-search-demo/models/20181107-dist-sync-gpu/export/1541712907/:
gs://code-search-demo/models/20181107-dist-sync-gpu/export/1541712907/
gs://code-search-demo/models/20181107-dist-sync-gpu/export/1541712907/saved_model.pbtxt
gs://code-search-demo/models/20181107-dist-sync-gpu/export/1541712907/variables/:
gs://code-search-demo/models/20181107-dist-sync-gpu/export/1541712907/variables/
gs://code-search-demo/models/20181107-dist-sync-gpu/export/1541712907/variables/variables.data-00000-of-00001
gs://code-search-demo/models/20181107-dist-sync-gpu/export/1541712907/variables/variables.index
TF Serving expects modelBasePath to contain numeric subdirectories, each corresponding to a different version of the model.
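The next cell references MODEL_BASE_PATH, which has not been set yet in this notebook. The cell below is a minimal sketch, assuming you want to serve the SavedModel exported above (its export directory already contains the numeric version subdirectories).
In [ ]:
# Assumption: serve the model exported earlier in this notebook.
MODEL_BASE_PATH = '{0}/output/export/Servo'.format(WORKING_DIR)
# DO NOT MODIFY. Environment variable to be used in a bash shell.
%env MODEL_BASE_PATH $MODEL_BASE_PATH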
In [ ]:
%%bash
cd ks-web-app
ks param set --env=code-search modelBasePath ${MODEL_BASE_PATH}
ks show code-search -c query-embed-server
In [ ]:
%%bash
cd ks-web-app
ks apply code-search -c query-embed-server
We finally deploy the Search UI which allows the user to input arbitrary strings and see a list of results corresponding to semantically similar Python functions. This internally uses the inference server we just deployed.
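The next cell references LOOKUP_FILE and INDEX_FILE, which have not been set yet in this notebook. The cell below is a sketch that assumes the default file names written by search-index-creator; confirm the exact names against the gsutil listing of ${WORKING_DIR}/code_search_index* above.
In [ ]:
# Assumption: default output names from search-index-creator; verify with gsutil ls above.
INDEX_FILE = '{0}/code_search_index.nmslib'.format(WORKING_DIR)
LOOKUP_FILE = '{0}/code_search_index.csv'.format(WORKING_DIR)
%env INDEX_FILE $INDEX_FILE
%env LOOKUP_FILE $LOOKUP_FILE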
In [ ]:
%%bash
cd ks-web-app
ks param set --env=code-search search-index-server lookupFile ${LOOKUP_FILE}
ks param set --env=code-search search-index-server indexFile ${INDEX_FILE}
ks show code-search -c search-index-server
In [ ]:
%%bash
cd kubeflow
ks apply code-search -c search-index-server
The service should now be available at the FQDN of the Kubeflow cluster under the path /code-search/.
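As a quick check, the sketch below requests the UI and prints the HTTP status code; KUBEFLOW_FQDN is a placeholder for your cluster's fully qualified domain name.
In [ ]:
# A sketch only: KUBEFLOW_FQDN is a placeholder; replace it with your cluster's FQDN.
import requests

KUBEFLOW_FQDN = 'kubeflow.example.com'
resp = requests.get('https://{0}/code-search/'.format(KUBEFLOW_FQDN))
print(resp.status_code)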