Sample for Kubeflow TFJob SDK

This is a sample for the Kubeflow TFJob SDK (kubeflow-tfjob).

The notebook shows how to use the Kubeflow TFJob SDK to create a TFJob, get it, wait for it to finish, check its result, read its logs, and delete it.


In [1]:
from kubernetes.client import V1PodTemplateSpec
from kubernetes.client import V1ObjectMeta
from kubernetes.client import V1PodSpec
from kubernetes.client import V1Container

from kubeflow.tfjob import constants
from kubeflow.tfjob import utils
from kubeflow.tfjob import V1ReplicaSpec
from kubeflow.tfjob import V1TFJob
from kubeflow.tfjob import V1TFJobSpec
from kubeflow.tfjob import TFJobClient

Define the namespace in which the TFJob will be created. When the SDK runs inside the cluster, the function below returns the namespace it is running in; otherwise it returns the default namespace.


In [2]:
namespace = utils.get_default_target_namespace()
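
For context, a pod can usually discover its own namespace from the service account files that Kubernetes mounts into it. The sketch below illustrates that kind of fallback logic under that assumption. `resolve_namespace` and its `sa_dir` parameter are hypothetical names used for illustration, not the SDK's implementation.

```python
import os

# Standard mount point for service account credentials inside a pod.
SERVICE_ACCOUNT_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"


def resolve_namespace(sa_dir=SERVICE_ACCOUNT_DIR, fallback="default"):
    """Return the pod's namespace when running in a cluster, else `fallback`."""
    namespace_file = os.path.join(sa_dir, "namespace")
    try:
        with open(namespace_file) as f:
            # An empty file is treated the same as a missing one.
            return f.read().strip() or fallback
    except OSError:
        # Not running inside a pod, or no service account is mounted.
        return fallback


print(resolve_namespace())
```

Outside a cluster the mounted file does not exist, so the sketch falls back to `default`, which matches the namespace shown in the outputs later in this notebook.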

Define TFJob

The demo defines a TFJob with one chief, one parameter server (PS), and two workers to run the MNIST sample.


In [3]:
container = V1Container(
    name="tensorflow",
    image="gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0",
    command=[
        "python",
        "/var/tf_mnist/mnist_with_summaries.py",
        "--log_dir=/train/logs",
        "--learning_rate=0.01",
        "--batch_size=150",
    ],
)

worker = V1ReplicaSpec(
    replicas=2,
    restart_policy="Never",
    template=V1PodTemplateSpec(
        spec=V1PodSpec(
            containers=[container]
        )
    )
)

chief = V1ReplicaSpec(
    replicas=1,
    restart_policy="Never",
    template=V1PodTemplateSpec(
        spec=V1PodSpec(
            containers=[container]
        )
    )
)

ps = V1ReplicaSpec(
    replicas=1,
    restart_policy="Never",
    template=V1PodTemplateSpec(
        spec=V1PodSpec(
            containers=[container]
        )
    )
)

tfjob = V1TFJob(
    api_version="kubeflow.org/v1",
    kind="TFJob",
    metadata=V1ObjectMeta(name="mnist", namespace=namespace),
    spec=V1TFJobSpec(
        clean_pod_policy="None",
        tf_replica_specs={"Worker": worker,
                          "Chief": chief,
                          "PS": ps}
    )
)
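
The three replica specs above differ only in their replica count. As a rough illustration of the structure they describe, the sketch below builds the same manifest as plain dictionaries, using the camelCase field names that appear in the output of the create call. The `replica_spec` helper is a hypothetical name, and the `default` namespace is assumed here only because it is what the outputs in this notebook show.

```python
import copy


def replica_spec(replicas, container, restart_policy="Never"):
    """Build one entry of `tfReplicaSpecs` as a plain dictionary."""
    return {
        "replicas": replicas,
        "restartPolicy": restart_policy,
        # Deep-copy so each replica type can be customized independently later.
        "template": {"spec": {"containers": [copy.deepcopy(container)]}},
    }


container_dict = {
    "name": "tensorflow",
    "image": "gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0",
    "command": [
        "python",
        "/var/tf_mnist/mnist_with_summaries.py",
        "--log_dir=/train/logs",
        "--learning_rate=0.01",
        "--batch_size=150",
    ],
}

manifest = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "mnist", "namespace": "default"},  # assumed namespace
    "spec": {
        "cleanPodPolicy": "None",
        "tfReplicaSpecs": {
            "Chief": replica_spec(1, container_dict),
            "PS": replica_spec(1, container_dict),
            "Worker": replica_spec(2, container_dict),
        },
    },
}

# Total number of pods the job asks for: 1 chief + 1 PS + 2 workers.
total_pods = sum(r["replicas"] for r in manifest["spec"]["tfReplicaSpecs"].values())
print(total_pods)  # 4
```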

Create TFJob


In [4]:
tfjob_client = TFJobClient()
tfjob_client.create(tfjob, namespace=namespace)


Out[4]:
{'apiVersion': 'kubeflow.org/v1',
 'kind': 'TFJob',
 'metadata': {'creationTimestamp': '2020-01-10T06:05:17Z',
  'generation': 1,
  'name': 'mnist',
  'namespace': 'default',
  'resourceVersion': '24815779',
  'selfLink': '/apis/kubeflow.org/v1/namespaces/default/tfjobs/mnist',
  'uid': '2d0ad671-336f-11ea-b6a8-00000a1001ee'},
 'spec': {'cleanPodPolicy': 'None',
  'tfReplicaSpecs': {'Chief': {'replicas': 1,
    'restartPolicy': 'Never',
    'template': {'spec': {'containers': [{'command': ['python',
         '/var/tf_mnist/mnist_with_summaries.py',
         '--log_dir=/train/logs',
         '--learning_rate=0.01',
         '--batch_size=150'],
        'image': 'gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0',
        'name': 'tensorflow'}]}}},
   'PS': {'replicas': 1,
    'restartPolicy': 'Never',
    'template': {'spec': {'containers': [{'command': ['python',
         '/var/tf_mnist/mnist_with_summaries.py',
         '--log_dir=/train/logs',
         '--learning_rate=0.01',
         '--batch_size=150'],
        'image': 'gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0',
        'name': 'tensorflow'}]}}},
   'Worker': {'replicas': 2,
    'restartPolicy': 'Never',
    'template': {'spec': {'containers': [{'command': ['python',
         '/var/tf_mnist/mnist_with_summaries.py',
         '--log_dir=/train/logs',
         '--learning_rate=0.01',
         '--batch_size=150'],
        'image': 'gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0',
        'name': 'tensorflow'}]}}}}}}

Get the created TFJob


In [5]:
tfjob_client.get('mnist', namespace=namespace)


Out[5]:
{'apiVersion': 'kubeflow.org/v1',
 'kind': 'TFJob',
 'metadata': {'creationTimestamp': '2020-01-10T06:05:17Z',
  'generation': 1,
  'name': 'mnist',
  'namespace': 'default',
  'resourceVersion': '24815814',
  'selfLink': '/apis/kubeflow.org/v1/namespaces/default/tfjobs/mnist',
  'uid': '2d0ad671-336f-11ea-b6a8-00000a1001ee'},
 'spec': {'cleanPodPolicy': 'None',
  'tfReplicaSpecs': {'Chief': {'replicas': 1,
    'restartPolicy': 'Never',
    'template': {'spec': {'containers': [{'command': ['python',
         '/var/tf_mnist/mnist_with_summaries.py',
         '--log_dir=/train/logs',
         '--learning_rate=0.01',
         '--batch_size=150'],
        'image': 'gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0',
        'name': 'tensorflow'}]}}},
   'PS': {'replicas': 1,
    'restartPolicy': 'Never',
    'template': {'spec': {'containers': [{'command': ['python',
         '/var/tf_mnist/mnist_with_summaries.py',
         '--log_dir=/train/logs',
         '--learning_rate=0.01',
         '--batch_size=150'],
        'image': 'gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0',
        'name': 'tensorflow'}]}}},
   'Worker': {'replicas': 2,
    'restartPolicy': 'Never',
    'template': {'spec': {'containers': [{'command': ['python',
         '/var/tf_mnist/mnist_with_summaries.py',
         '--log_dir=/train/logs',
         '--learning_rate=0.01',
         '--batch_size=150'],
        'image': 'gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0',
        'name': 'tensorflow'}]}}}}},
 'status': {'conditions': [{'lastTransitionTime': '2020-01-10T06:05:17Z',
    'lastUpdateTime': '2020-01-10T06:05:17Z',
    'message': 'TFJob mnist is created.',
    'reason': 'TFJobCreated',
    'status': 'True',
    'type': 'Created'}],
  'replicaStatuses': {'Chief': {}, 'PS': {}, 'Worker': {}},
  'startTime': '2020-01-10T06:05:18Z'}}

Get the TFJob status to check whether the TFJob has started.


In [6]:
tfjob_client.get_job_status('mnist', namespace=namespace)


Out[6]:
'Created'
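
The returned value matches the `type` of the single condition in the `status.conditions` list shown in the previous output. A minimal sketch of reading that field from a job dictionary follows. It assumes the last entry in the list is the most recent condition, and `latest_condition_type` is a hypothetical helper, not an SDK function.

```python
def latest_condition_type(job):
    """Return the type of the last listed condition, or "" if there is none."""
    conditions = job.get("status", {}).get("conditions") or []
    return conditions[-1].get("type", "") if conditions else ""


# Trimmed copy of the job dictionary returned by the get call above.
job = {
    "status": {
        "conditions": [
            {
                "lastTransitionTime": "2020-01-10T06:05:17Z",
                "reason": "TFJobCreated",
                "status": "True",
                "type": "Created",
            }
        ]
    }
}

print(latest_condition_type(job))  # Created
```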

Wait for the specified job to finish


In [7]:
tfjob_client.wait_for_job('mnist', namespace=namespace, watch=True)


NAME                           STATE                TIME                          
mnist                          Created              2020-01-10T06:05:17Z          
mnist                          Running              2020-01-10T06:05:29Z          
mnist                          Running              2020-01-10T06:05:29Z          
mnist                          Running              2020-01-10T06:05:29Z          
mnist                          Running              2020-01-10T06:05:29Z          
mnist                          Running              2020-01-10T06:05:29Z          
mnist                          Running              2020-01-10T06:05:29Z          
mnist                          Succeeded            2020-01-10T06:07:49Z          
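
Waiting for a job amounts to checking its status repeatedly until it reaches a terminal state. The sketch below shows one way to write such a polling loop by hand. It is an illustration only, not how the SDK implements waiting: `wait_until_finished`, its parameters, and the choice of `Succeeded` and `Failed` as terminal states are assumptions made for this example.

```python
import time


def wait_until_finished(get_status, timeout_seconds=600, polling_interval=30,
                        sleep=time.sleep):
    """Poll `get_status()` until it reports a terminal state or time runs out."""
    terminal_states = {"Succeeded", "Failed"}  # assumed terminal states
    waited = 0
    while True:
        status = get_status()
        if status in terminal_states:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {status!r} after {waited} seconds")
        sleep(polling_interval)
        waited += polling_interval


# Simulated status sequence standing in for real API calls.
states = iter(["Created", "Running", "Succeeded"])
print(wait_until_finished(lambda: next(states), sleep=lambda seconds: None))
```

Against a real cluster, the callable could wrap the status call used earlier, for example `lambda: tfjob_client.get_job_status('mnist', namespace=namespace)`.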

Check if the TFJob succeeded


In [8]:
tfjob_client.is_job_succeeded('mnist', namespace=namespace)


Out[8]:
True

Get the TFJob training logs.


In [9]:
tfjob_client.get_logs('mnist', namespace=namespace)


The logs of Pod mnist-chief-0:
 WARNING:tensorflow:From /var/tf_mnist/mnist_with_summaries.py:39: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please write your own downloading logic.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:252: wrapped_fn (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please use urllib or similar directly.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: __init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
2020-01-10 06:05:42.333504: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/tensorflow/mnist/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/mnist/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/mnist/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz
Accuracy at step 0: 0.147
Accuracy at step 10: 0.7369
Accuracy at step 20: 0.8666
Accuracy at step 30: 0.9027
Accuracy at step 40: 0.9117
Accuracy at step 50: 0.9221
Accuracy at step 60: 0.9214
Accuracy at step 70: 0.9266
Accuracy at step 80: 0.934
Accuracy at step 90: 0.9322
Adding run metadata for 99
Accuracy at step 100: 0.9389
Accuracy at step 110: 0.9356
Accuracy at step 120: 0.9429
Accuracy at step 130: 0.9481
Accuracy at step 140: 0.9526
Accuracy at step 150: 0.9476
Accuracy at step 160: 0.9509
Accuracy at step 170: 0.9483
Accuracy at step 180: 0.9491
Accuracy at step 190: 0.9533
Adding run metadata for 199
Accuracy at step 200: 0.9536
Accuracy at step 210: 0.9456
Accuracy at step 220: 0.9542
Accuracy at step 230: 0.957
Accuracy at step 240: 0.9548
Accuracy at step 250: 0.951
Accuracy at step 260: 0.9529
Accuracy at step 270: 0.9567
Accuracy at step 280: 0.9558
Accuracy at step 290: 0.9573
Adding run metadata for 299
Accuracy at step 300: 0.9496
Accuracy at step 310: 0.9596
Accuracy at step 320: 0.9551
Accuracy at step 330: 0.9539
Accuracy at step 340: 0.9639
Accuracy at step 350: 0.9616
Accuracy at step 360: 0.9574
Accuracy at step 370: 0.9579
Accuracy at step 380: 0.9644
Accuracy at step 390: 0.965
Adding run metadata for 399
Accuracy at step 400: 0.9637
Accuracy at step 410: 0.9655
Accuracy at step 420: 0.9654
Accuracy at step 430: 0.9668
Accuracy at step 440: 0.9698
Accuracy at step 450: 0.9649
Accuracy at step 460: 0.965
Accuracy at step 470: 0.9617
Accuracy at step 480: 0.9674
Accuracy at step 490: 0.9686
Adding run metadata for 499
Accuracy at step 500: 0.9684
Accuracy at step 510: 0.965
Accuracy at step 520: 0.9665
Accuracy at step 530: 0.9682
Accuracy at step 540: 0.9607
Accuracy at step 550: 0.967
Accuracy at step 560: 0.9641
Accuracy at step 570: 0.9706
Accuracy at step 580: 0.9675
Accuracy at step 590: 0.9691
Adding run metadata for 599
Accuracy at step 600: 0.9668
Accuracy at step 610: 0.964
Accuracy at step 620: 0.9665
Accuracy at step 630: 0.9713
Accuracy at step 640: 0.9673
Accuracy at step 650: 0.9635
Accuracy at step 660: 0.9643
Accuracy at step 670: 0.9632
Accuracy at step 680: 0.9602
Accuracy at step 690: 0.9621
Adding run metadata for 699
Accuracy at step 700: 0.9592
Accuracy at step 710: 0.9618
Accuracy at step 720: 0.965
Accuracy at step 730: 0.9658
Accuracy at step 740: 0.9611
Accuracy at step 750: 0.961
Accuracy at step 760: 0.9677
Accuracy at step 770: 0.9651
Accuracy at step 780: 0.9659
Accuracy at step 790: 0.9655
Adding run metadata for 799
Accuracy at step 800: 0.9637
Accuracy at step 810: 0.9662
Accuracy at step 820: 0.9687
Accuracy at step 830: 0.9705
Accuracy at step 840: 0.9694
Accuracy at step 850: 0.9712
Accuracy at step 860: 0.9684
Accuracy at step 870: 0.9698
Accuracy at step 880: 0.9723
Accuracy at step 890: 0.9699
Adding run metadata for 899
Accuracy at step 900: 0.9699
Accuracy at step 910: 0.9681
Accuracy at step 920: 0.97
Accuracy at step 930: 0.9719
Accuracy at step 940: 0.9724
Accuracy at step 950: 0.9673
Accuracy at step 960: 0.9684
Accuracy at step 970: 0.9693
Accuracy at step 980: 0.9712
Accuracy at step 990: 0.9719
Adding run metadata for 999
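
The accuracy lines in these logs follow a regular pattern, so they are easy to turn into data for plotting or comparison. The sketch below assumes the log text is already available as a string (for example, copied from the cell output above); how to capture it programmatically is not covered here. `parse_accuracy` is a hypothetical helper.

```python
import re

# Matches lines such as "Accuracy at step 10: 0.7369".
ACCURACY_LINE = re.compile(r"^Accuracy at step (\d+): ([0-9.]+)$", re.MULTILINE)


def parse_accuracy(log_text):
    """Return a list of (step, accuracy) pairs found in the log text."""
    return [(int(step), float(acc)) for step, acc in ACCURACY_LINE.findall(log_text)]


# A few lines taken from the log output above.
sample = """Accuracy at step 0: 0.147
Accuracy at step 10: 0.7369
Adding run metadata for 99
Accuracy at step 990: 0.9719
"""

points = parse_accuracy(sample)
best_step, best_accuracy = max(points, key=lambda p: p[1])
print(best_step, best_accuracy)
```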

Delete the TFJob


In [10]:
tfjob_client.delete('mnist', namespace=namespace)


Out[10]:
{'kind': 'Status',
 'apiVersion': 'v1',
 'metadata': {},
 'status': 'Success',
 'details': {'name': 'mnist',
  'group': 'kubeflow.org',
  'kind': 'tfjobs',
  'uid': '2d0ad671-336f-11ea-b6a8-00000a1001ee'}}
