Sample for Kubeflow PyTorchJob SDK

This is a sample for the Kubeflow PyTorchJob SDK, kubeflow-pytorchjob.

The notebook shows how to use the Kubeflow PyTorchJob SDK to create, get, wait for, check, and delete a PyTorchJob.


In [1]:
from kubernetes.client import V1PodTemplateSpec
from kubernetes.client import V1ObjectMeta
from kubernetes.client import V1PodSpec
from kubernetes.client import V1Container
from kubernetes.client import V1ResourceRequirements

from kubeflow.pytorchjob import constants
from kubeflow.pytorchjob import utils
from kubeflow.pytorchjob import V1ReplicaSpec
from kubeflow.pytorchjob import V1PyTorchJob
from kubeflow.pytorchjob import V1PyTorchJobSpec
from kubeflow.pytorchjob import PyTorchJobClient

Define the namespace in which the PyTorchJob will be created. The function below returns the namespace the SDK is currently running in when it runs inside the cluster; otherwise it returns the default namespace.


In [2]:
namespace = utils.get_default_target_namespace()
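For orientation, the kind of lookup described above can be sketched without the SDK. This is an assumption about the behaviour, not the SDK's actual code: inside a pod, Kubernetes mounts the service account's namespace at a well-known path, and outside the cluster that file does not exist. The helper name resolve_target_namespace and its parameters are hypothetical.

```python
from pathlib import Path

# Standard location where Kubernetes mounts the pod's namespace.
SERVICE_ACCOUNT_NAMESPACE_FILE = (
    "/var/run/secrets/kubernetes.io/serviceaccount/namespace"
)


def resolve_target_namespace(namespace_file=SERVICE_ACCOUNT_NAMESPACE_FILE,
                             fallback="default"):
    """Return the namespace stored in `namespace_file`, or `fallback`.

    Hypothetical helper for illustration; it is not part of the SDK.
    """
    path = Path(namespace_file)
    if path.is_file():
        content = path.read_text().strip()
        if content:
            return content
    # Not running inside a cluster (or the file is empty).
    return fallback
```

On a machine outside the cluster, resolve_target_namespace() falls back to "default", which is consistent with the 'namespace': 'default' field in the outputs of this notebook.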

Define PyTorchJob

The demo creates a PyTorchJob with one master replica and one worker replica to run the MNIST sample.


In [3]:
container = V1Container(
    name="pytorch",
    image="gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0",
    args=["--backend","gloo"]
)

master = V1ReplicaSpec(
    replicas=1,
    restart_policy="OnFailure",
    template=V1PodTemplateSpec(
        spec=V1PodSpec(
            containers=[container]
        )
    )
)

worker = V1ReplicaSpec(
    replicas=1,
    restart_policy="OnFailure",
    template=V1PodTemplateSpec(
        spec=V1PodSpec(
            containers=[container]
        )
    )
)

pytorchjob = V1PyTorchJob(
    api_version="kubeflow.org/v1",
    kind="PyTorchJob",
    metadata=V1ObjectMeta(name="pytorch-dist-mnist-gloo",namespace=namespace),
    spec=V1PyTorchJobSpec(
        clean_pod_policy="None",
        pytorch_replica_specs={"Master": master,
                               "Worker": worker}
    )
)
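The SDK objects above end up as a custom resource with camelCase field names, as the output of the create call shows. As a rough sketch, the same structure can be written as a plain dictionary. The helper build_pytorchjob_manifest is hypothetical and not part of the SDK; the field names are copied from the outputs in this notebook, and the namespace is hard-coded to "default" for illustration.

```python
def build_pytorchjob_manifest(name, namespace, image, args, replicas_by_role):
    """Build a PyTorchJob custom resource as a plain dict (illustrative only)."""

    def replica_spec(replicas):
        # Every role in this demo uses the same container and restart policy.
        return {
            "replicas": replicas,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [
                        {"name": "pytorch", "image": image, "args": list(args)}
                    ]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "cleanPodPolicy": "None",
            "pytorchReplicaSpecs": {
                role: replica_spec(count)
                for role, count in replicas_by_role.items()
            },
        },
    }


manifest = build_pytorchjob_manifest(
    name="pytorch-dist-mnist-gloo",
    namespace="default",
    image="gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0",
    args=["--backend", "gloo"],
    replicas_by_role={"Master": 1, "Worker": 1},
)
```

Comparing such a dictionary with the API response can help when checking that the typed objects were filled in as intended.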

Create PyTorchJob


In [4]:
pytorchjob_client = PyTorchJobClient()
pytorchjob_client.create(pytorchjob)


Out[4]:
{'apiVersion': 'kubeflow.org/v1',
 'kind': 'PyTorchJob',
 'metadata': {'creationTimestamp': '2020-01-14T09:21:44Z',
  'generation': 1,
  'name': 'pytorch-dist-mnist-gloo',
  'namespace': 'default',
  'resourceVersion': '949015',
  'selfLink': '/apis/kubeflow.org/v1/namespaces/default/pytorchjobs/pytorch-dist-mnist-gloo',
  'uid': '47f9dc9a-36af-11ea-beb5-00163e01f7d2'},
 'spec': {'cleanPodPolicy': 'None',
  'pytorchReplicaSpecs': {'Master': {'replicas': 1,
    'restartPolicy': 'OnFailure',
    'template': {'spec': {'containers': [{'args': ['--backend', 'gloo'],
        'image': 'gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0',
        'name': 'pytorch'}]}}},
   'Worker': {'replicas': 1,
    'restartPolicy': 'OnFailure',
    'template': {'spec': {'containers': [{'args': ['--backend', 'gloo'],
        'image': 'gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0',
        'name': 'pytorch'}]}}}}}}

Get the created PyTorchJob


In [5]:
pytorchjob_client.get('pytorch-dist-mnist-gloo')


Out[5]:
{'apiVersion': 'kubeflow.org/v1',
 'kind': 'PyTorchJob',
 'metadata': {'creationTimestamp': '2020-01-14T09:21:44Z',
  'generation': 1,
  'name': 'pytorch-dist-mnist-gloo',
  'namespace': 'default',
  'resourceVersion': '949030',
  'selfLink': '/apis/kubeflow.org/v1/namespaces/default/pytorchjobs/pytorch-dist-mnist-gloo',
  'uid': '47f9dc9a-36af-11ea-beb5-00163e01f7d2'},
 'spec': {'cleanPodPolicy': 'None',
  'pytorchReplicaSpecs': {'Master': {'replicas': 1,
    'restartPolicy': 'OnFailure',
    'template': {'spec': {'containers': [{'args': ['--backend', 'gloo'],
        'image': 'gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0',
        'name': 'pytorch'}]}}},
   'Worker': {'replicas': 1,
    'restartPolicy': 'OnFailure',
    'template': {'spec': {'containers': [{'args': ['--backend', 'gloo'],
        'image': 'gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0',
        'name': 'pytorch'}]}}}}},
 'status': {'conditions': [{'lastTransitionTime': '2020-01-14T09:21:44Z',
    'lastUpdateTime': '2020-01-14T09:21:44Z',
    'message': 'PyTorchJob pytorch-dist-mnist-gloo is created.',
    'reason': 'PyTorchJobCreated',
    'status': 'True',
    'type': 'Created'}],
  'replicaStatuses': {'Master': {}, 'Worker': {}},
  'startTime': '2020-01-14T09:21:44Z'}}

Get the PyTorchJob status to check whether the PyTorchJob has started.


In [6]:
pytorchjob_client.get_job_status('pytorch-dist-mnist-gloo', namespace=namespace)


Out[6]:
'Created'
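The status string matches the type of the single entry in status.conditions in the job returned by get. A minimal sketch of that relationship follows, assuming the status corresponds to the most recent condition; this is consistent with the outputs here but is not confirmed by this notebook. The helper latest_condition_type is hypothetical.

```python
def latest_condition_type(job):
    """Return the type of the last entry in status.conditions, or ''.

    `job` is a dict shaped like the one returned by the get call.
    """
    conditions = job.get("status", {}).get("conditions", [])
    if not conditions:
        return ""
    return conditions[-1].get("type", "")


# Trimmed copy of the job returned by the get call in this notebook.
example_job = {
    "kind": "PyTorchJob",
    "status": {
        "conditions": [
            {
                "reason": "PyTorchJobCreated",
                "status": "True",
                "type": "Created",
            }
        ]
    },
}

print(latest_condition_type(example_job))
```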

Wait for the specified PyTorchJob to finish


In [7]:
pytorchjob_client.wait_for_job('pytorch-dist-mnist-gloo', namespace=namespace, watch=True)


NAME                           STATE                TIME                          
pytorch-dist-mnist-gloo        Created              2020-01-14T09:21:44Z          
pytorch-dist-mnist-gloo        Running              2020-01-14T09:23:45Z          
pytorch-dist-mnist-gloo        Running              2020-01-14T09:23:45Z          
pytorch-dist-mnist-gloo        Running              2020-01-14T09:23:45Z          
pytorch-dist-mnist-gloo        Succeeded            2020-01-14T09:28:31Z          
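Waiting for a job amounts to checking its status repeatedly until it reaches a terminal state. The loop below is a generic polling sketch, not the SDK's implementation: the function poll_until_finished, its parameters, and the set of terminal states are assumptions made for illustration.

```python
import time

# Assumed terminal states; adjust if the operator reports others.
TERMINAL_STATES = {"Succeeded", "Failed"}


def poll_until_finished(get_status, timeout_seconds=600, polling_interval=30,
                        sleep=time.sleep):
    """Call `get_status()` until it returns a terminal state.

    Returns the list of distinct consecutive states that were observed.
    Raises TimeoutError if no terminal state is seen in time.
    """
    elapsed = 0
    history = []
    while True:
        status = get_status()
        # Record a state only when it differs from the previous one.
        if not history or history[-1] != status:
            history.append(status)
        if status in TERMINAL_STATES:
            return history
        if elapsed >= timeout_seconds:
            raise TimeoutError(
                "job did not finish within %d seconds" % timeout_seconds
            )
        sleep(polling_interval)
        elapsed += polling_interval
```

With the client from this notebook, get_status could be a lambda that calls pytorchjob_client.get_job_status('pytorch-dist-mnist-gloo', namespace=namespace).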

Check if the PyTorchJob succeeded


In [8]:
pytorchjob_client.is_job_succeeded('pytorch-dist-mnist-gloo', namespace=namespace)


Out[8]:
True

Get the PyTorchJob training logs.


In [9]:
pytorchjob_client.get_logs('pytorch-dist-mnist-gloo', namespace=namespace)


The logs of Pod pytorch-dist-mnist-gloo-master-0:
 Using distributed PyTorch with gloo backend
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!
Train Epoch: 1 [0/60000 (0%)]	loss=2.3000
Train Epoch: 1 [640/60000 (1%)]	loss=2.2135
Train Epoch: 1 [1280/60000 (2%)]	loss=2.1704
Train Epoch: 1 [1920/60000 (3%)]	loss=2.0766
Train Epoch: 1 [2560/60000 (4%)]	loss=1.8679
Train Epoch: 1 [3200/60000 (5%)]	loss=1.4135
Train Epoch: 1 [3840/60000 (6%)]	loss=1.0003
Train Epoch: 1 [4480/60000 (7%)]	loss=0.7762
Train Epoch: 1 [5120/60000 (9%)]	loss=0.4598
Train Epoch: 1 [5760/60000 (10%)]	loss=0.4860
Train Epoch: 1 [6400/60000 (11%)]	loss=0.4389
Train Epoch: 1 [7040/60000 (12%)]	loss=0.4084
Train Epoch: 1 [7680/60000 (13%)]	loss=0.4602
Train Epoch: 1 [8320/60000 (14%)]	loss=0.4289
Train Epoch: 1 [8960/60000 (15%)]	loss=0.3990
Train Epoch: 1 [9600/60000 (16%)]	loss=0.3850
Train Epoch: 1 [10240/60000 (17%)]	loss=0.2985
Train Epoch: 1 [10880/60000 (18%)]	loss=0.5031
Train Epoch: 1 [11520/60000 (19%)]	loss=0.5235
Train Epoch: 1 [12160/60000 (20%)]	loss=0.3379
Train Epoch: 1 [12800/60000 (21%)]	loss=0.3667
Train Epoch: 1 [13440/60000 (22%)]	loss=0.4503
Train Epoch: 1 [14080/60000 (23%)]	loss=0.3043
Train Epoch: 1 [14720/60000 (25%)]	loss=0.3589
Train Epoch: 1 [15360/60000 (26%)]	loss=0.3320
Train Epoch: 1 [16000/60000 (27%)]	loss=0.4406
Train Epoch: 1 [16640/60000 (28%)]	loss=0.3641
Train Epoch: 1 [17280/60000 (29%)]	loss=0.3170
Train Epoch: 1 [17920/60000 (30%)]	loss=0.2014
Train Epoch: 1 [18560/60000 (31%)]	loss=0.4985
Train Epoch: 1 [19200/60000 (32%)]	loss=0.3264
Train Epoch: 1 [19840/60000 (33%)]	loss=0.1198
Train Epoch: 1 [20480/60000 (34%)]	loss=0.1904
Train Epoch: 1 [21120/60000 (35%)]	loss=0.1424
Train Epoch: 1 [21760/60000 (36%)]	loss=0.3143
Train Epoch: 1 [22400/60000 (37%)]	loss=0.1494
Train Epoch: 1 [23040/60000 (38%)]	loss=0.2901
Train Epoch: 1 [23680/60000 (39%)]	loss=0.4670
Train Epoch: 1 [24320/60000 (41%)]	loss=0.2151
Train Epoch: 1 [24960/60000 (42%)]	loss=0.1521
Train Epoch: 1 [25600/60000 (43%)]	loss=0.2240
Train Epoch: 1 [26240/60000 (44%)]	loss=0.2629
Train Epoch: 1 [26880/60000 (45%)]	loss=0.2330
Train Epoch: 1 [27520/60000 (46%)]	loss=0.2630
Train Epoch: 1 [28160/60000 (47%)]	loss=0.2126
Train Epoch: 1 [28800/60000 (48%)]	loss=0.1327
Train Epoch: 1 [29440/60000 (49%)]	loss=0.2789
Train Epoch: 1 [30080/60000 (50%)]	loss=0.0947
Train Epoch: 1 [30720/60000 (51%)]	loss=0.1280
Train Epoch: 1 [31360/60000 (52%)]	loss=0.2458
Train Epoch: 1 [32000/60000 (53%)]	loss=0.3394
Train Epoch: 1 [32640/60000 (54%)]	loss=0.1527
Train Epoch: 1 [33280/60000 (55%)]	loss=0.0901
Train Epoch: 1 [33920/60000 (57%)]	loss=0.1451
Train Epoch: 1 [34560/60000 (58%)]	loss=0.1994
Train Epoch: 1 [35200/60000 (59%)]	loss=0.2171
Train Epoch: 1 [35840/60000 (60%)]	loss=0.0633
Train Epoch: 1 [36480/60000 (61%)]	loss=0.1369
Train Epoch: 1 [37120/60000 (62%)]	loss=0.1160
Train Epoch: 1 [37760/60000 (63%)]	loss=0.2355
Train Epoch: 1 [38400/60000 (64%)]	loss=0.0634
Train Epoch: 1 [39040/60000 (65%)]	loss=0.1062
Train Epoch: 1 [39680/60000 (66%)]	loss=0.1608
Train Epoch: 1 [40320/60000 (67%)]	loss=0.1101
Train Epoch: 1 [40960/60000 (68%)]	loss=0.1775
Train Epoch: 1 [41600/60000 (69%)]	loss=0.2285
Train Epoch: 1 [42240/60000 (70%)]	loss=0.0737
Train Epoch: 1 [42880/60000 (71%)]	loss=0.1562
Train Epoch: 1 [43520/60000 (72%)]	loss=0.2775
Train Epoch: 1 [44160/60000 (74%)]	loss=0.1418
Train Epoch: 1 [44800/60000 (75%)]	loss=0.1163
Train Epoch: 1 [45440/60000 (76%)]	loss=0.1221
Train Epoch: 1 [46080/60000 (77%)]	loss=0.0768
Train Epoch: 1 [46720/60000 (78%)]	loss=0.1950
Train Epoch: 1 [47360/60000 (79%)]	loss=0.0706
Train Epoch: 1 [48000/60000 (80%)]	loss=0.2091
Train Epoch: 1 [48640/60000 (81%)]	loss=0.1380
Train Epoch: 1 [49280/60000 (82%)]	loss=0.0950
Train Epoch: 1 [49920/60000 (83%)]	loss=0.1070
Train Epoch: 1 [50560/60000 (84%)]	loss=0.1194
Train Epoch: 1 [51200/60000 (85%)]	loss=0.1447
Train Epoch: 1 [51840/60000 (86%)]	loss=0.0662
Train Epoch: 1 [52480/60000 (87%)]	loss=0.0239
Train Epoch: 1 [53120/60000 (88%)]	loss=0.2622
Train Epoch: 1 [53760/60000 (90%)]	loss=0.0928
Train Epoch: 1 [54400/60000 (91%)]	loss=0.1297
Train Epoch: 1 [55040/60000 (92%)]	loss=0.1907
Train Epoch: 1 [55680/60000 (93%)]	loss=0.0347
Train Epoch: 1 [56320/60000 (94%)]	loss=0.0354
Train Epoch: 1 [56960/60000 (95%)]	loss=0.0770
Train Epoch: 1 [57600/60000 (96%)]	loss=0.1175
Train Epoch: 1 [58240/60000 (97%)]	loss=0.1919
Train Epoch: 1 [58880/60000 (98%)]	loss=0.2053
Train Epoch: 1 [59520/60000 (99%)]	loss=0.0639

accuracy=0.9664
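The log lines follow a regular pattern, so the loss values and the final accuracy can be extracted for plotting or for automated checks. The sketch below assumes the log text has already been captured as a string; the function parse_training_log is hypothetical, and the regular expressions are written against the lines shown above.

```python
import re

# Matches lines such as "Train Epoch: 1 [640/60000 (1%)]<tab>loss=2.2135".
LOSS_LINE = re.compile(
    r"Train Epoch: (\d+) \[(\d+)/(\d+) \(\d+%\)\]\s+loss=([\d.]+)"
)
ACCURACY_LINE = re.compile(r"accuracy=([\d.]+)")


def parse_training_log(text):
    """Return ([(epoch, samples_seen, loss), ...], accuracy or None)."""
    losses = [
        (int(m.group(1)), int(m.group(2)), float(m.group(4)))
        for m in LOSS_LINE.finditer(text)
    ]
    match = ACCURACY_LINE.search(text)
    accuracy = float(match.group(1)) if match else None
    return losses, accuracy


sample_log = (
    "Train Epoch: 1 [0/60000 (0%)]\tloss=2.3000\n"
    "Train Epoch: 1 [640/60000 (1%)]\tloss=2.2135\n"
    "\n"
    "accuracy=0.9664\n"
)
losses, accuracy = parse_training_log(sample_log)
```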


Delete the PyTorchJob


In [10]:
pytorchjob_client.delete('pytorch-dist-mnist-gloo')


Out[10]:
{'kind': 'Status',
 'apiVersion': 'v1',
 'metadata': {},
 'status': 'Success',
 'details': {'name': 'pytorch-dist-mnist-gloo',
  'group': 'kubeflow.org',
  'kind': 'pytorchjobs',
  'uid': '47f9dc9a-36af-11ea-beb5-00163e01f7d2'}}
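When experimenting, it is easy to leave a job behind if a cell fails before the delete call. One possible pattern, sketched here under the assumption that the client exposes create(job) and delete(name) as used in this notebook, is a context manager that always deletes the job on exit. The name temporary_job is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def temporary_job(client, job, name):
    """Create `job` with `client` and delete it by `name` on exit.

    `client` is any object with create(job) and delete(name) methods.
    """
    client.create(job)
    try:
        yield name
    finally:
        # Runs even if the body of the with-block raises.
        client.delete(name)
```

For example, wrapping the wait and log cells in "with temporary_job(pytorchjob_client, pytorchjob, 'pytorch-dist-mnist-gloo'):" would remove the job even if one of those calls raised.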
