This directory contains a template for running distributed TensorFlow on Kubernetes.
- You must be running Kubernetes 1.3 or above. If you are running an earlier
  version, the DNS addon must be enabled. See Google Container Engine if you
  want to quickly set up a Kubernetes cluster from scratch.
- Jinja templates must be installed.
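  A quick way to check these prerequisites (a sketch; it assumes `kubectl` is
  already pointed at your cluster and that `pip` is available for installing
  Jinja2):

  ```sh
  # Confirm the cluster is running Kubernetes 1.3 or above.
  kubectl version

  # On earlier versions, verify that the DNS addon (kube-dns) is running.
  kubectl get pods --namespace=kube-system -l k8s-app=kube-dns

  # render_template.py renders the Jinja template, so the Jinja2 package is needed.
  pip install Jinja2
  ```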
- Follow the instructions for creating the training program in the parent README.
- Follow the instructions for building and pushing the Docker image in the Docker README.
- Copy the template file:

  ```sh
  cp kubernetes/template.yaml.jinja myjob.template.jinja
  ```
- Edit the `myjob.template.jinja` file to set the job parameters. At the
  minimum, you'll want to specify `name`, `image`, `worker_replicas`,
  `ps_replicas`, `script`, `data_dir`, and `train_dir`. You may optionally
  specify `credential_secret_name` and `credential_secret_key` if you need to
  read and write to Google Cloud Storage. See the Google Cloud Storage section
  below.
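  Before submitting anything, you can render the template and validate the
  output without creating resources (a sketch; `rendered.yaml` is just an
  illustrative filename, and newer kubectl versions spell the flag
  `--dry-run=client`):

  ```sh
  # Render the Jinja template into plain Kubernetes manifests and inspect them.
  python render_template.py myjob.template.jinja > rendered.yaml

  # Client-side validation only; nothing is created on the cluster.
  kubectl create -f rendered.yaml --dry-run
  ```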
- Run the job:

  ```sh
  python render_template.py myjob.template.jinja | kubectl create -f -
  ```

  If you later want to stop the job, then run:

  ```sh
  python render_template.py myjob.template.jinja | kubectl delete -f -
  ```
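  Once the job is created, you can watch its pods come up and follow a worker's
  logs (the pod name below is a placeholder; use a name printed by
  `kubectl get pods`):

  ```sh
  # List the worker and parameter-server pods created for the job.
  kubectl get pods

  # Stream the logs of one of the worker pods.
  kubectl logs -f <worker-pod-name>
  ```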
To support reading and writing to Google Cloud Storage, you need to set up a Kubernetes secret with the credentials.
- Set up a service account and download its key as a JSON file.
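  One way to do this with the `gcloud` CLI (a sketch; the account name, project,
  and bucket below are placeholders, and the exact roles you grant depend on
  your setup):

  ```sh
  # Create a service account for the job.
  gcloud iam service-accounts create tf-k8s-job --display-name="TF on Kubernetes"

  # Download a JSON key for it; this is the file you add as a secret below.
  gcloud iam service-accounts keys create credentials.json \
      --iam-account=tf-k8s-job@<your-project>.iam.gserviceaccount.com

  # Give the account read/write access to the bucket used for data_dir/train_dir.
  gsutil iam ch \
      "serviceAccount:tf-k8s-job@<your-project>.iam.gserviceaccount.com:roles/storage.objectAdmin" \
      gs://<your-bucket>
  ```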
- Add the JSON file as a Kubernetes secret. Replace `[json_filename]` with the
  name of the downloaded file:

  ```sh
  kubectl create secret generic credential --from-file=[json_filename]
  ```
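  You can confirm the secret exists and see the data key Kubernetes derived from
  the filename; that key is what `credential_secret_key` must match:

  ```sh
  # Lists the secret's data keys (the uploaded filename) without printing values.
  kubectl describe secret credential
  ```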
- In your template, set `credential_secret_name` to `"credential"` (the secret
  name created above) and `credential_secret_key` to `"[json_filename]"`.
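  After setting both values, one rough way to check that the rendered manifests
  actually reference the secret (the exact field names depend on the template):

  ```sh
  python render_template.py myjob.template.jinja | grep -i -A 3 credential
  ```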