Job keeps running even after successful deployment and execution #213

Closed

herambgadgil opened this issue Jul 27, 2020 · 4 comments

@herambgadgil

Hello,

I am trying to deploy a test job from my RStudio Desktop to the GCP AI Platform. I am able to deploy the job successfully after the suggested amendment to the .\library\cloudml\cloudml\cloudml\deploy.py file with line.decode('utf-8'), but the job keeps running and consuming resources even after it has completed successfully. I see the output in gs://bucket/r-cloudml/runs/auto-generated-job-id/iris.rds, along with gs://bucket/r-cloudml/runs/auto-generated-job-id/tfruns.d/completed whose value is set to TRUE. Attaching the test directory, which contains the cloudml_init.R file that executes the code: r-keras-tensorflow.zip

One more thing: it doesn't use the jobId provided in the job.yml file and auto-generates one instead.

sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252    LC_MONETARY=English_India.1252
[4] LC_NUMERIC=C                   LC_TIME=English_India.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.6.3 tools_3.6.3    tinytex_0.20   xfun_0.12

cloudml::gcloud_version()
$`Google Cloud SDK`
[1] ‘301.0.0’

$beta
[1] ‘2020.7.10’

$bq
[1] ‘2.0.58’

$core
[1] ‘2020.7.10’

$gsutil
[1] ‘4.51’

pythonVersion: 3.7
runtimeVersion: 2.1

Thanks in advance!
Heramb

@herambgadgil
Author

Found a workable solution. The problem is with the chunk below in path-to-library/cloudml/cloudml/cloudml/deploy.py:

# Stream output from subprocess to console.
for line in iter(process.stdout.readline, ""):
    sys.stdout.write(line.decode('utf-8'))

Once execution is complete, this loop does not halt and spins endlessly: in Python 3 the pipe yields bytes, so readline() returns b"" at EOF, which never matches the "" sentinel.

Resolution: comment out the chunk above in deploy.py, and the execution will complete successfully.
Downside: you won't see step-by-step installation progress, so the logs won't give you a hint if there is an error in the script. The chunk below will still check for successful execution, but if there is an error in the script, the job will keep running endlessly.

# Finalize the process.
stdout, stderr = process.communicate()

# Detect a non-zero exit code.
if process.returncode != 0:
  fmt = "Command %s failed: exit code %s"
  print(fmt % (commands, process.returncode))
else:
  print("Command %s ran successfully." % (commands, ))

Note: I'm a novice in Python and the cloud environment, so I'm not sure if this is the best way to go.

@javierluraschi
Contributor

I tried a similar fix; can you try it with:

remotes::install_github("rstudio/cloudml")

Thanks!

@herambgadgil
Author

@javierluraschi That works! Thanks a lot! :-)

Just a couple more points: if I provide a custom job ID and a storage location in job.yml and place it in the working directory, cloudml_train() doesn't recognize them and uses the default job ID (cloudml_datetimestamp) and the default storage location.

### job.yml

jobId: local-r-heramb
storage: gs://data-science-storage-bucket/
custom_commands: ~

This is not a very critical issue; just highlighting it.

Thanks once again for resolving this! I appreciate it very much.

-Heramb

@javierluraschi
Contributor

Thanks! It will take us longer to address #214, but let me push these critical updates to CRAN first.
