This directory contains a template for running distributed TensorFlow on Kubernetes.
- You must be running Kubernetes 1.3 or above. If you are running an earlier
  version, the DNS addon must be enabled. See Google Container Engine if you
  want to quickly set up a Kubernetes cluster from scratch.
- Jinja templates must be installed.
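  A quick way to check these prerequisites (a sketch; it assumes `kubectl` is
  already pointed at your cluster and that `pip` is available for installing
  Jinja2):

  ```sh
  # Confirm the cluster is running Kubernetes 1.3 or above.
  kubectl version

  # On earlier versions, verify that the DNS addon (kube-dns) is running.
  kubectl get pods --namespace=kube-system -l k8s-app=kube-dns

  # render_template.py renders the Jinja template, so the Jinja2 package is needed.
  pip install Jinja2
  ```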
- Follow the instructions for creating the training program in the parent README.
- Follow the instructions for building and pushing the Docker image in the Docker README.
- Copy the template file:

  ```sh
  cp kubernetes/template.yaml.jinja myjob.template.jinja
  ```
- Edit the `myjob.template.jinja` file to set the job parameters. At the
  minimum, you'll want to specify `name`, `image`, `worker_replicas`,
  `ps_replicas`, `script`, `data_dir`, and `train_dir`. You may optionally
  specify `credential_secret_name` and `credential_secret_key` if you need to
  read and write to Google Cloud Storage. See the Google Cloud Storage section
  below.
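  Before submitting anything, you can render the template and validate the
  output without creating resources (a sketch; `rendered.yaml` is just an
  illustrative filename, and newer kubectl versions spell the flag
  `--dry-run=client`):

  ```sh
  # Render the Jinja template into plain Kubernetes manifests and inspect them.
  python render_template.py myjob.template.jinja > rendered.yaml

  # Client-side validation only; nothing is created on the cluster.
  kubectl create -f rendered.yaml --dry-run
  ```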
- Run the job:

  ```sh
  python render_template.py myjob.template.jinja | kubectl create -f -
  ```

  If you later want to stop the job, then run:

  ```sh
  python render_template.py myjob.template.jinja | kubectl delete -f -
  ```
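  Once the job is created, you can watch its pods come up and follow a worker's
  logs (the pod name below is a placeholder; use a name printed by
  `kubectl get pods`):

  ```sh
  # List the worker and parameter-server pods created for the job.
  kubectl get pods

  # Stream the logs of one of the worker pods.
  kubectl logs -f <worker-pod-name>
  ```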
To support reading and writing to Google Cloud Storage, you need to set up a Kubernetes secret with the credentials.
- Set up a service account and download its key as a JSON file.
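  One way to do this with the `gcloud` CLI (a sketch; the account name, project,
  and bucket below are placeholders, and the exact roles you grant depend on
  your setup):

  ```sh
  # Create a service account for the job.
  gcloud iam service-accounts create tf-k8s-job --display-name="TF on Kubernetes"

  # Download a JSON key for it; this is the file you add as a secret below.
  gcloud iam service-accounts keys create credentials.json \
      --iam-account=tf-k8s-job@<your-project>.iam.gserviceaccount.com

  # Give the account read/write access to the bucket used for data_dir/train_dir.
  gsutil iam ch \
      "serviceAccount:tf-k8s-job@<your-project>.iam.gserviceaccount.com:roles/storage.objectAdmin" \
      gs://<your-bucket>
  ```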
- Add the JSON file as a Kubernetes secret. Replace `[json_filename]` with the
  name of the downloaded file:

  ```sh
  kubectl create secret generic credential --from-file=[json_filename]
  ```
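  You can confirm the secret exists and see the data key Kubernetes derived from
  the filename; that key is what `credential_secret_key` must match:

  ```sh
  # Lists the secret's data keys (the uploaded filename) without printing values.
  kubectl describe secret credential
  ```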
- In your template, set `credential_secret_name` to `"credential"` (the secret
  name created above) and `credential_secret_key` to `"[json_filename]"`.
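  After setting both values, one rough way to check that the rendered manifests
  actually reference the secret (the exact field names depend on the template):

  ```sh
  python render_template.py myjob.template.jinja | grep -i -A 3 credential
  ```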