Version: 1.1.0

Run TensorFlow Jobs

This guide gives an overview of how to set up the training-operator and how to run a TensorFlow job with the YuniKorn scheduler. The training-operator is a unified training operator maintained by Kubeflow. It supports not only TensorFlow but also PyTorch, XGBoost, etc.

Install training-operator

You can use the following command to install the training operator; by default it is installed in the kubeflow namespace. If you have problems with the installation, please refer to this doc for details.

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.3.0"
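
Before moving on, it can help to confirm that the operator actually came up. The sketch below wraps two standard kubectl queries in a helper; the function name check_training_operator is hypothetical, and the kubeflow namespace is taken from the default mentioned above.

```shell
# Hypothetical helper (not part of YuniKorn or Kubeflow): confirms the install.
check_training_operator() {
  # The operator pod should be listed as Running in the kubeflow namespace.
  kubectl get pods -n kubeflow
  # The TFJob custom resource definition should be registered in the cluster.
  kubectl get crd tfjobs.kubeflow.org
}
```

Call check_training_operator once the apply command has finished; if the CRD lookup reports NotFound, the installation did not complete.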

Prepare the Docker image

Before you start running a TensorFlow job on Kubernetes, you'll need to build the Docker image.

  1. Download the files from deployments/examples/tfjob
  2. Build the Docker image with the following command:
docker build -f Dockerfile -t kubeflow/tf-dist-mnist-test:1.0 .

Run a TensorFlow job

Here is a TFJob YAML for the MNIST example.

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: dist-mnist-for-e2e-test
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    PS:
      replicas: 2
      restartPolicy: Never
      template:
        metadata:
          labels:
            applicationId: "tf_job_20200521_001"
            queue: root.sandbox
        spec:
          schedulerName: yunikorn
          containers:
            - name: tensorflow
              image: kubeflow/tf-dist-mnist-test:1.0
    Worker:
      replicas: 4
      restartPolicy: Never
      template:
        metadata:
          labels:
            applicationId: "tf_job_20200521_001"
            queue: root.sandbox
        spec:
          schedulerName: yunikorn
          containers:
            - name: tensorflow
              image: kubeflow/tf-dist-mnist-test:1.0
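
In this manifest, every pod template sets schedulerName: yunikorn, and the PS and Worker templates share the same applicationId label, so all pods are grouped into a single YuniKorn application in the root.sandbox queue. A minimal sketch of a local sanity check for these fields follows; the function name check_yunikorn_fields is hypothetical, and the check is a plain text match rather than a real YAML parse.

```shell
# Hypothetical helper: text-based check of the YuniKorn-related fields
# in a TFJob manifest. It does not parse YAML; it only counts matching lines.
check_yunikorn_fields() {
  manifest="$1"
  # Number of pod templates (one per replica type, such as PS or Worker).
  templates=$(grep -c '^ *template:' "$manifest" || true)
  # Number of pod specs that name YuniKorn as their scheduler.
  schedulers=$(grep -c '^ *schedulerName: yunikorn *$' "$manifest" || true)
  # Number of distinct applicationId values across all templates.
  app_ids=$(grep '^ *applicationId:' "$manifest" | awk '{print $2}' | sort -u | wc -l | tr -d ' ')
  if [ "$templates" -ne "$schedulers" ]; then
    echo "some pod templates do not set schedulerName: yunikorn"
    return 1
  fi
  if [ "$app_ids" -ne 1 ]; then
    echo "expected exactly one applicationId, found $app_ids"
    return 1
  fi
  echo "ok"
}
```

For example, check_yunikorn_fields tf-job-mnist.yaml prints ok when both replica types are configured consistently and returns a non-zero status otherwise.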

Create the TFJob

kubectl create -f deployments/examples/tfjob/tf-job-mnist.yaml
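
Once the job is created, its progress can also be followed from the command line. The sketch below reuses the job name, namespace, and applicationId label from the manifest above; the function name show_tfjob is hypothetical.

```shell
# Hypothetical helper: shows the TFJob and the pods that belong to it.
show_tfjob() {
  # The TFJob resource itself, by the name and namespace from the manifest.
  kubectl get tfjob dist-mnist-for-e2e-test -n kubeflow
  # Pods carrying the applicationId label from the manifest, together with
  # the scheduler each pod requested and its current phase.
  kubectl get pods -n kubeflow -l applicationId=tf_job_20200521_001 \
    -o custom-columns=NAME:.metadata.name,SCHEDULER:.spec.schedulerName,PHASE:.status.phase
}
```

With the manifest above, the second command should list two PS pods and four Worker pods, each showing yunikorn in the SCHEDULER column.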

You can view the job info in the YuniKorn UI. If you do not know how to access the YuniKorn UI, please read the document here.
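
As a rough sketch, one common way to reach the UI from a workstation is a port-forward. The service name yunikorn-service, the yunikorn namespace, and port 9889 are assumptions based on a default Helm installation and may differ in your cluster; the function name forward_yunikorn_ui is hypothetical.

```shell
# Hypothetical helper: forwards the YuniKorn web UI to the local machine.
# Assumes a service named yunikorn-service in the yunikorn namespace that
# serves the UI on port 9889; adjust these values to match your deployment.
forward_yunikorn_ui() {
  kubectl port-forward svc/yunikorn-service 9889:9889 -n yunikorn
}
```

While forward_yunikorn_ui is running, the UI would be reachable at http://localhost:9889 under these assumptions.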

[Figure: the TFJob shown in the YuniKorn UI]