Hello everyone,

Currently, we are attempting to create a new method to set up multinode processing on a Spark cluster with GPUs in order to reduce training time [1, 2]. To validate the hypothesis that the only changes needed are in the communication protocol between the TensorFlow Server and the cluster (which, if validated, would be straightforward to implement), we developed a method to distribute the workload on a 96-core single-node cluster.
Our experiment did not validate this hypothesis: we found that TensorFlow Server was simply replicating the same process in every thread, consuming the entire dataset in each one instead of distributing the batches, which only multiplied the training time, contrary to our expectations and to the references provided above.
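To make the distinction concrete, here is a minimal sketch of the per-worker input sharding we expected versus the full-dataset replication we observed. `NUM_WORKERS`, `worker_index`, and the toy dataset are placeholders, not values from our actual setup:

```python
import tensorflow as tf

# Placeholder values; in a real run these would come from the cluster/task config.
NUM_WORKERS = 4
worker_index = 0  # index of the current worker/thread

# What we expected: each worker trains on a disjoint shard of the data,
# so the per-worker workload is roughly 1/NUM_WORKERS of the total.
sharded_dataset = (
    tf.data.Dataset.range(1000)
    .shard(num_shards=NUM_WORKERS, index=worker_index)
    .batch(32)
)

# What we observed instead: every worker iterated over the full dataset
# end to end, i.e. the same work replicated NUM_WORKERS times.
replicated_dataset = tf.data.Dataset.range(1000).batch(32)
```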
Moreover, we observed the same increase in training time with the native built-in K8s deployment, but we are not sure our K8s setup was configured correctly, so we will repeat the test on a self-managed K8s setup in AWS using the AWS implementation/wrapper of Coach.
We have two possible explanations for these findings:

1. We are using TensorFlow Server incorrectly, and this fork could help (related to this MR); see the cluster-spec sketch below.
2. TensorFlow Server alone is not able to distribute the workload, and the Kubernetes orchestrator plays a role in the communication/sync mechanisms. In this case, the AWS implementation should show a reduction in training time.
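For reference on explanation 1, a minimal sketch of how we understand the distributed setup is supposed to look with the TF 1.x `tf.train.Server` / `tf.train.ClusterSpec` API; the hostnames, ports, and task index below are hypothetical placeholders, not our actual configuration:

```python
import tensorflow as tf  # TF 1.x-style API

# Hypothetical cluster layout: one parameter server and two workers.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Each process starts a server for its own role and task index.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Variables are placed on the ps job; ops run on this worker. If every worker
# process instead builds and runs the same full training loop over the whole
# dataset (our observation), the work is merely replicated, not distributed.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    pass  # model construction would go here
```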