Code Search on Kubeflow

This notebook implements end-to-end semantic code search on top of Kubeflow: given an input query string, it returns a list of code snippets semantically similar to the query.

NOTE: If you haven't already, see kubeflow/examples/code_search for instructions on how to get this notebook.

Install dependencies

Let us install all the Python dependencies. Note that everything must be run with Python 2. This will take a while the first time.

Verify Version Information


In [ ]:
%%bash

echo "Pip Version Info: " && python2 --version && python2 -m pip --version && echo
echo "Google Cloud SDK Info: " && gcloud --version && echo
echo "Ksonnet Version Info: " && ks version && echo
echo "Kubectl Version Info: " && kubectl version

Install Pip Packages


In [ ]:
! python2 -m pip install -U pip

In [ ]:
# Code Search dependencies
! python2 -m pip install --user https://github.com/kubeflow/batch-predict/tarball/master
! python2 -m pip install --user -r src/requirements.txt

In [ ]:
# BigQuery Cell Dependencies
! python2 -m pip install --user pandas-gbq

In [ ]:
# NOTE: The RuntimeWarnings (if any) are harmless. See ContinuumIO/anaconda-issues#6678.
from pandas.io import gbq

Configure Variables

  • This involves setting up the Ksonnet application as well as utility environment variables for various CLI steps.
  • Set the following variables

    • PROJECT: Set this to the GCP project you want to use
      • If gcloud has a project set, it will be used by default
      • To use a different project, or if gcloud doesn't have a project set, you will need to configure one explicitly
    • WORKING_DIR: Override this if you don't want to use the default configured below
    • KS_ENV: Set this to the name of the ksonnet environment you want to create

In [ ]:
import getpass
import subprocess
# Configuration Variables. Modify as desired.

PROJECT = subprocess.check_output(["gcloud", "config", "get-value", "project"]).strip()

# Dataflow Related Variables.
TARGET_DATASET = 'code_search'
WORKING_DIR = 'gs://{0}_code_search/workingDir'.format(PROJECT)
KS_ENV=getpass.getuser()

# DO NOT MODIFY. These are environment variables to be used in a bash shell.
%env PROJECT $PROJECT
%env TARGET_DATASET $TARGET_DATASET
%env WORKING_DIR $WORKING_DIR

Setup Authorization

In a Kubeflow cluster on GKE, we already have the Google Application Credentials mounted onto each Pod. We can simply point gcloud to activate that service account.


In [ ]:
%%bash

# Activate Service Account provided by Kubeflow.
gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}

Additionally, to interact with the underlying cluster, we configure kubectl.


In [ ]:
%%bash

kubectl config set-cluster kubeflow --server=https://kubernetes.default --certificate-authority=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
kubectl config set-credentials jupyter --token "$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
kubectl config set-context kubeflow --cluster kubeflow --user jupyter
kubectl config use-context kubeflow

Collectively, these allow us to interact with Google Cloud Services as well as the Kubernetes Cluster directly to submit TFJobs and execute Dataflow pipelines.
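
As a quick sanity check, the optional cell below exercises both paths: it lists the active gcloud account and the Pods in the Kubernetes namespace. It assumes Kubeflow is deployed in the default kubeflow namespace.


In [ ]:
# Optional sanity check: verify gcloud credentials and Kubernetes API access.
# Assumes Kubeflow is deployed in the default 'kubeflow' namespace.
! gcloud auth list
! kubectl get pods --namespace=kubeflow | head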

Setup Ksonnet Application

We now point the Ksonnet application to the underlying Kubernetes cluster.


In [ ]:
%%bash

cd kubeflow

# Update Ksonnet to point to the Kubernetes Cluster
ks env add code-search

# Update Ksonnet to use the namespace where kubeflow is deployed. By default it's 'kubeflow'
ks env set code-search --namespace=kubeflow

# Update the Working Directory of the application
sed -i'' "s,gs://example/prefix,${WORKING_DIR}," components/params.libsonnet

# FIXME(sanyamkapoor): This command completely replaces previous configurations.
# Hence, using string replacement in file.
# ks param set t2t-code-search workingDir ${WORKING_DIR}

View Github Files

This is the query that runs as the first step of the pre-processing pipeline; its results are then sent through a set of transformations. It illustrates the rows that will be processed by the pipeline we trigger next.

WARNING: The table is large and the query can take a few minutes to complete.


In [ ]:
query = """
  SELECT
    MAX(CONCAT(f.repo_name, ' ', f.path)) AS repo_path,
    c.content
  FROM
    `bigquery-public-data.github_repos.files` AS f
  JOIN
    `bigquery-public-data.github_repos.contents` AS c
  ON
    f.id = c.id
  JOIN (
      --this part of the query makes sure repo is watched at least twice since 2017
    SELECT
      repo
    FROM (
      SELECT
        repo.name AS repo
      FROM
        `githubarchive.year.2017`
      WHERE
        type="WatchEvent"
      UNION ALL
      SELECT
        repo.name AS repo
      FROM
        `githubarchive.month.2018*`
      WHERE
        type="WatchEvent" )
    GROUP BY
      1
    HAVING
      COUNT(*) >= 2 ) AS r
  ON
    f.repo_name = r.repo
  WHERE
    f.path LIKE '%.py' AND --with python extension
    c.size < 15000 AND --get rid of ridiculously long files
    REGEXP_CONTAINS(c.content, r'def ') --contains function definition
  GROUP BY
    c.content
  LIMIT
    10
"""

gbq.read_gbq(query, dialect='standard', project_id=PROJECT)

Define an experiment

  • This solution consists of multiple jobs and servers that need to share parameters
  • To facilitate this we use experiments.libsonnet to define sets of parameters
  • Each set of parameters has a name corresponding to a key in the dictionary defined in experiments.libsonnet
  • You configure an experiment by defining a set of experiments in experiments.libsonnet
  • You then set the global parameter experiment to the name of the experiment defining the parameters you want to use.

To get started, define your experiment; an illustrative sketch of an entry is shown after the list below.

  • Create a new entry containing a set of values to be used for your experiment
  • Pick a suitable name for your experiment.
  • Set the following values

    • outputDir: The GCS directory where the output should be written
    • train_steps: Number of training steps
    • eval_steps: Number of steps to be used for eval
    • hparams_set: The set of hyperparameters to use; see some suggestions here.

      • transformer_tiny can be used to train a very small model suitable for ensuring the code works.
    • project: Set this to the GCP project you want to use
    • modelDir:
      • After training your model set this to a GCS directory containing the export model
      • e.g. gs://code-search-demo/models/20181107-dist-sync-gpu/export/1541712907/
    • problem: Set this to "kf_github_function_docstring"
    • model: Set this to "kf_similarity_transformer"
    • lookupFile: Set this to the GCS location of the CSV produced by the job that creates the nmslib index of the embeddings for all GitHub data
    • indexFile: Set this to the GCS location of the nmslib index for all the data in GitHub
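
For illustration, here is a sketch of what an entry in experiments.libsonnet might look like, using the field names from the list above. The experiment name, GCS paths, step counts, and file names are hypothetical placeholders; consult experiments.libsonnet in the repository for the exact structure and replace the values with your own.

    "my-experiment": {
      outputDir: "gs://my-bucket/code-search/output",
      train_steps: 100,
      eval_steps: 10,
      hparams_set: "transformer_tiny",
      project: "my-gcp-project",
      modelDir: "gs://my-bucket/code-search/output/export/Servo/1541712907/",
      problem: "kf_github_function_docstring",
      model: "kf_similarity_transformer",
      lookupFile: "gs://my-bucket/code-search/code_search_index.csv",
      indexFile: "gs://my-bucket/code-search/code_search_index.nmslib",
    },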

Configure your ksonnet environment to use your experiment

  • Open kubeflow/environments/${ENVIRONMENT}/globals.libsonnet
  • Define the following global parameters

    • experiment: Name of your experiment; should correspond to a key defined in experiments.libsonnet
    • project: Set this to the GCP project you want to use
    • dataDir: The data directory to be used by T2T
    • workingDir: Working directory
  • Here's an example of what the contents should look like

    workingDir: "gs://code-search-demo/20181104",
    dataDir: "gs://code-search-demo/20181104/data",
    project: "code-search-demo",
    experiment: "demo-trainer-11-07-dist-sync-gpu",

Pre-Processing Github Files

In this step, we use Google Cloud Dataflow to preprocess the data.

  • We use a K8s Job to run a python program code_search.dataflow.cli.preprocess_github_dataset that submits the Dataflow job
  • Once the job has been created it can be monitored using the Dataflow console (or from this notebook, as shown after the submit step below)
  • The parameter target_dataset specifies a BigQuery dataset to write the data to

Create the BigQuery dataset


In [ ]:
%%bash
bq mk ${PROJECT}:${TARGET_DATASET}

Submit the Dataflow Job


In [ ]:
%%bash

cd kubeflow
ks param set --env=code-search submit-preprocess-job targetDataset ${TARGET_DATASET}
ks apply code-search -c submit-preprocess-job
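
Optionally, you can check on the submitted job from the notebook instead of the consoles. This is a convenience sketch; it assumes the submit job runs in the kubeflow namespace and that the gcloud Dataflow commands are available in this environment.


In [ ]:
# Optional: check the Kubernetes job that submits the pipeline, then list recent Dataflow jobs.
# Assumes the default 'kubeflow' namespace; adjust if Kubeflow is deployed elsewhere.
! kubectl get jobs --namespace=kubeflow
! gcloud dataflow jobs list --project={PROJECT} --limit=5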

When completed successfully, this writes its results to the BigQuery dataset specified by targetDataset. Additionally, it dumps CSV files into ${WORKING_DIR}/data which contain training samples (pairs of functions and docstrings) for our TensorFlow model. A representative set of results can be viewed using the following query.


In [ ]:
query = """
  SELECT * 
  FROM 
    {}.token_pairs
  LIMIT
    10
""".format(TARGET_DATASET)

gbq.read_gbq(query, dialect='standard', project_id=PROJECT)

This pipeline also writes a set of CSV files which contain function and docstring pairs delimited by a comma. Here, we list a subset of them.


In [ ]:
%%bash

LIMIT=10

gsutil ls ${WORKING_DIR}/data/*.csv | head -n ${LIMIT}
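
To see what the training samples look like, the optional cell below prints the first kilobyte of the first CSV file. It assumes the preprocessing job has written at least one CSV under ${WORKING_DIR}/data.


In [ ]:
# Optional: preview the beginning of the first function/docstring CSV file.
# Assumes at least one CSV exists under ${WORKING_DIR}/data.
import subprocess

csv_files = subprocess.check_output(
    ["gsutil", "ls", "{}/data/*.csv".format(WORKING_DIR)]).strip().split("\n")
print(subprocess.check_output(["gsutil", "cat", "-r", "0-1024", csv_files[0]]))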

Prepare Dataset for Training

We will use t2t-datagen to convert the transformed data above into the TFRecord format.

TIP: Use ks show to view the Resource Spec submitted.


In [ ]:
%%bash

cd kubeflow

ks show code-search -c t2t-code-search-datagen

In [ ]:
%%bash

cd kubeflow

ks apply code-search -c t2t-code-search-datagen

Once this job finishes, the data directory should have a vocabulary file and a list of TFRecords prefixed by the problem name, which in our case is github_function_docstring_extended. Here, we list a subset of them.


In [ ]:
%%bash

LIMIT=10

gsutil ls ${WORKING_DIR}/data/vocab*
gsutil ls ${WORKING_DIR}/data/*train* | head -n ${LIMIT}

Execute Tensorflow Training

Once the TFRecords are generated, we use t2t-trainer to execute the training.


In [ ]:
%%bash

cd kubeflow

ks show code-search -c t2t-code-search-trainer

In [ ]:
%%bash

cd kubeflow

ks apply code-search -c t2t-code-search-trainer
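
Training runs as a TFJob on the cluster. The optional cell below checks its status; it assumes the TFJob CRD is installed, which is the case on a Kubeflow cluster, and that the job name matches the component name.


In [ ]:
# Optional: check the status of TFJobs and their pods.
# Assumes the TFJob CRD is installed (standard on Kubeflow); the name prefix below is an assumption.
! kubectl get tfjobs --namespace=kubeflow
! kubectl get pods --namespace=kubeflow | grep t2t-code-search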

This will generate TensorFlow model checkpoints, as illustrated below.


In [ ]:
%%bash

gsutil ls ${WORKING_DIR}/output/*ckpt*

Export Tensorflow Model

We now use t2t-exporter to export the TFModel.


In [ ]:
%%bash

cd kubeflow

ks show code-search -c t2t-code-search-exporter

In [ ]:
%%bash

cd kubeflow

ks apply code-search -c t2t-code-search-exporter

Once completed, this will generate a TensorFlow SavedModel, which we will use for both online inference (via TF Serving) and offline inference (via Kubeflow Batch Prediction).


In [ ]:
%%bash

gsutil ls ${WORKING_DIR}/output/export/Servo
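
Optionally, you can inspect the serving signature of the latest export with saved_model_cli, which ships with TensorFlow. This is a sketch: it assumes TensorFlow is installed in this notebook environment and that at least one timestamped export exists under ${WORKING_DIR}/output/export/Servo.


In [ ]:
# Optional: inspect the signatures of the most recent export.
# Assumes TensorFlow (which provides saved_model_cli) is installed and at least one export exists.
import subprocess

servo_dir = "{}/output/export/Servo".format(WORKING_DIR)
latest_export = subprocess.check_output(["gsutil", "ls", servo_dir]).strip().split("\n")[-1]
! saved_model_cli show --dir {latest_export} --all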

Compute Function Embeddings

In this step, we use the model exported above to compute function embeddings via another Dataflow pipeline. The Python 2 module code_search.dataflow.cli.create_function_embeddings is provided for this purpose.

Configuration

First, select an exported model version from ${WORKING_DIR}/output/export/Servo as listed above. This should be the name of a folder whose name is a Unix timestamp in seconds, like 1533685294. Below, we do this automatically by selecting the folder with the latest timestamp.


In [ ]:
%%bash --out EXPORT_DIR_LS

gsutil ls ${WORKING_DIR}/output/export/Servo | grep -oE "([0-9]+)/$"

In [ ]:
# WARNING: This routine will fail if no export has been completed successfully.
MODEL_VERSION = max([int(ts[:-1]) for ts in EXPORT_DIR_LS.split('\n') if ts])

# DO NOT MODIFY. These are environment variables to be used in a bash shell.
%env MODEL_VERSION $MODEL_VERSION

  • Modify experiments.libsonnet and set modelDir to the export directory for this model version; the cell below prints the value to use.
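
The cell below prints the corresponding modelDir value; it assumes you want to use the export produced under this notebook's working directory.


In [ ]:
# Convenience: print the export directory for the selected model version, to copy into
# experiments.libsonnet as modelDir. Assumes the exports listed above under
# ${WORKING_DIR}/output/export/Servo are the ones you want to use.
print("{}/output/export/Servo/{}/".format(WORKING_DIR, MODEL_VERSION))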

Run the Dataflow Job for Function Embeddings


In [ ]:
%%bash

cd kubeflow
ks apply code-search -c submit-code-embeddings-job

When completed successfully, this creates another table in the same BigQuery dataset containing the function embeddings for each data sample produced by the previous Dataflow job. Additionally, it dumps CSV files containing metadata for each function and its embedding. A representative query result is shown below.


In [ ]:
query = """
  SELECT * 
  FROM 
    {}.function_embeddings
  LIMIT
    10
""".format(TARGET_DATASET)

gbq.read_gbq(query, dialect='standard', project_id=PROJECT)

The pipeline also generates a set of CSV files which will be useful to generate the search index.


In [ ]:
%%bash

LIMIT=10

gsutil ls ${WORKING_DIR}/data/*index*.csv | head -n ${LIMIT}

Create Search Index

We now create the search index from the computed embeddings. This facilitates k-nearest-neighbor search for semantically similar results.


In [ ]:
%%bash

cd kubeflow

ks show code-search -c search-index-creator

In [ ]:
%%bash

cd kubeflow

ks apply code-search -c search-index-creator

Using the CSV files generated in the previous step, this creates an index using nmslib. A unified CSV file containing all the code examples, used for human-readable reverse lookup at query time, is also created.


In [ ]:
%%bash

gsutil ls ${WORKING_DIR}/code_search_index*
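
For intuition, here is a minimal, self-contained sketch of how an nmslib index over embedding vectors supports k-nearest-neighbor lookup. It is illustrative only and not the actual search-index-creator code; the vectors and query are random placeholders, and it assumes numpy and nmslib are installed (both are among the code search dependencies).


In [ ]:
# Illustrative only: build a tiny nmslib index over random "embeddings" and query it.
# This mirrors the idea behind the search index, not the search-index-creator implementation.
import numpy as np
import nmslib

embeddings = np.random.rand(1000, 128).astype(np.float32)  # placeholder function embeddings

index = nmslib.init(method='hnsw', space='cosinesimil')
index.addDataPointBatch(embeddings)
index.createIndex({'post': 2}, print_progress=False)

query = np.random.rand(128).astype(np.float32)  # placeholder query embedding
ids, distances = index.knnQuery(query, k=5)
print(ids, distances)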

Deploy the Web App

  • The included web app provides a simple way for users to submit queries
  • The web app includes two pieces

    • A Flask app that serves a simple UI for sending queries

      • The Flask app also uses nmslib to provide fast lookups
    • A TF Serving instance to compute the embeddings for search queries

  • The ksonnet components for the web app are in a separate ksonnet application ks-web-app

    • A separate application is used so that we can optionally use ArgoCD to keep the serving components up to date.

Deploy an Inference Server

We've already seen offline inference during the computation of embeddings. For online inference, we deploy the TensorFlow model exported above using TensorFlow Serving.

  • You need to set the parameter modelBasePath to the GCS directory where the model was exported; we set a MODEL_BASE_PATH variable for this in the cell below
  • This will be a directory produced by the export model step
    • e.g. gs://code-search-demo/models/20181107-dist-sync-gpu/export/
  • Here are sample contents

    gs://code-search-demo/models/20181107-dist-sync-gpu/export/:
    gs://code-search-demo/models/20181107-dist-sync-gpu/export/
    
    gs://code-search-demo/models/20181107-dist-sync-gpu/export/1541712907/:
    gs://code-search-demo/models/20181107-dist-sync-gpu/export/1541712907/
    gs://code-search-demo/models/20181107-dist-sync-gpu/export/1541712907/saved_model.pbtxt
    
    gs://code-search-demo/models/20181107-dist-sync-gpu/export/1541712907/variables/:
    gs://code-search-demo/models/20181107-dist-sync-gpu/export/1541712907/variables/
    gs://code-search-demo/models/20181107-dist-sync-gpu/export/1541712907/variables/variables.data-00000-of-00001
    gs://code-search-demo/models/20181107-dist-sync-gpu/export/1541712907/variables/variables.index
  • TF Serving expects modelBasePath to contain numeric subdirectories corresponding to different versions of the model.

  • Each subdirectory contains the saved model in protocol buffer format along with its weights.
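
The cell below sets MODEL_BASE_PATH for the ksonnet commands that follow. This is a sketch that assumes you want to serve the exports produced by the t2t-exporter step above, which live under ${WORKING_DIR}/output/export/Servo; adjust the path if your model lives elsewhere.


In [ ]:
# Base path containing the numeric export directories for TF Serving.
# Assumes we are serving the exports written by the t2t-exporter step above;
# change this if your exported model is in a different location.
MODEL_BASE_PATH = "{}/output/export/Servo".format(WORKING_DIR)

# DO NOT MODIFY. This is an environment variable to be used in a bash shell.
%env MODEL_BASE_PATH $MODEL_BASE_PATH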

In [ ]:
%%bash

cd ks-web-app

ks param set --env=code-search query-embed-server modelBasePath ${MODEL_BASE_PATH}
ks show code-search -c query-embed-server

In [ ]:
%%bash

cd ks-web-app

ks apply code-search -c query-embed-server

Deploy Search UI

We finally deploy the Search UI which allows the user to input arbitrary strings and see a list of results corresponding to semantically similar Python functions. This internally uses the inference server we just deployed.

  • We need to configure the index server to use the nmslib index and the lookup CSV produced by the create-search-index step above
  • The values are the lookupFile and indexFile parameters that you set in experiments.libsonnet; matching environment variables are set in the cell below
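
The cell below sets LOOKUP_FILE and INDEX_FILE as environment variables for the ksonnet commands that follow. The file names used here are assumptions; check the output of the search-index-creator job (listed under ${WORKING_DIR}/code_search_index* above) and adjust them to match your run.


In [ ]:
# Lookup CSV and nmslib index produced by search-index-creator.
# NOTE: these file names are assumptions; verify them against the
# `gsutil ls ${WORKING_DIR}/code_search_index*` listing above and adjust as needed.
LOOKUP_FILE = "{}/code_search_index.csv".format(WORKING_DIR)
INDEX_FILE = "{}/code_search_index.nmslib".format(WORKING_DIR)

# DO NOT MODIFY. These are environment variables to be used in a bash shell.
%env LOOKUP_FILE $LOOKUP_FILE
%env INDEX_FILE $INDEX_FILE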

In [ ]:
%%bash

cd ks-web-app
ks param set --env=code-search search-index-server lookupFile ${LOOKUP_FILE}
ks param set --env=code-search search-index-server indexFile ${INDEX_FILE}
ks show code-search -c search-index-server

In [ ]:
%%bash

cd ks-web-app

ks apply code-search -c search-index-server

The service should now be available at the FQDN of the Kubeflow cluster under the path /code-search/.