
Support for Training ImageNet #135

Open
wwells opened this issue Apr 1, 2018 · 24 comments

Comments

@wwells

wwells commented Apr 1, 2018

Hi CloudML team. Thanks for a great package!

I've been seeing a new issue in the last 24 hours where submitted jobs fail to run, stalling out during package installation. I tested a few different models (including ones that ran successfully previously), and they all stop at the same point during the build.

https://issuetracker.google.com/issues/77356837

screen shot 2018-04-01 at 8 47 36 am

@jjallaire
Member

I am seeing the same issue (although it halts at a different point during install, so I don't think the glue package has anything to do with this). I added a note to this effect to the CloudML ticket and will follow up with our contacts there as well.

cc @javierluraschi

@jjallaire
Member

In my case I just noticed that the job hung for 25 minutes (on something that normally takes less than a second) and then completed as expected. Here's the gap in the log:

screen shot 2018-04-01 at 9 53 56 am

Are your jobs hanging perpetually or do they eventually complete?

@jjallaire
Member

I've submitted another job and it is hanging at the exact same spot (installation of purrr).

@wwells
Author

wwells commented Apr 1, 2018

Thanks for the quick response (and the great book on Deep Learning with R!).

I let a small test job run for about 10 hrs last night (a 4-GPU request) and then stopped it when I woke up this morning to prevent additional expense. It didn't complete. I ran two more test jobs this morning before submitting this issue; both stalled at the exact same place, so I stopped them after roughly 10 minutes of stalling. Just tried one more time and hit the same stalling point (for me at least, installing glue).

@guoqingxu

Can you send the project and job IDs to [email protected], please? Also, please provide us with your setup.py (and the other packages you were trying to install). We will try to reproduce and locate the issue. This might be caused by dependency conflicts. Thanks!

@wwells
Author

wwells commented Apr 1, 2018

Thanks. Project and job IDs sent. FWIW - I used the default settings for cloudml_train() in R (no setup.py or config files), passing either 'complex_model_m_gpu' or 'standard_gpu' as master_type. I was just trying to confirm that my pipeline and data generators were set up correctly before customizing further.
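
For reference, a minimal sketch of the kind of call described above (the script name is the one from my gist; master_type selects the Cloud ML Engine machine tier):

library(cloudml)

# Submit the local training script with default settings, overriding only
# the machine tier; no setup.py or custom config files are involved.
cloudml_train("imagenetTrain.R", master_type = "complex_model_m_gpu")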

@guoqingxu

In the logs of the 10-hour job, following the log entry 'Installing glue (1.2.0) ...', there's a log entry 'Retrying request, attempt #1...', which came from "/tools/google-cloud-sdk/platform/gsutil/gslib/util.py". Can you provide more details on how the glue package was supposed to be installed? Was the package coming from a GCS bucket or from a public repository (sorry, I have no knowledge of how R works with TensorFlow)? It looks like gsutil failed to fetch the package, but no timeout was set on this operation.

@jjallaire
Member

@javierluraschi You may need to fill in some of the gaps here regarding what's actually going on under the hood.

@jjallaire
Member

I'll wait for @javierluraschi to weigh in with definitive details, but my understanding is that "Installing glue (1.2.0) ..." is doing nothing more than attempting to interact with a mounted volume (perhaps copying files, perhaps just symlinking). @javierluraschi and @kevinushey, I think the code being executed is here: https://github.com/rstudio/packrat/blob/3c2ba63c5101422b85250c3ce077ad73e5b9ea2a/R/restore.R#L403

@kevinushey
Collaborator

If the glue package isn't available in the (Packrat) cache, Packrat will go out to CRAN to install it (assuming glue itself was indeed installed from a CRAN mirror using install.packages()). If the system has no internet connection, it's possible that the R session can hang on a subsequent call to available.packages() or install.packages(). Could that be the case here?

If the package is already available in the Packrat cache, though, the above analysis should be irrelevant -- I just want to point it out in case it's a possibility.
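
To illustrate the failure mode described above, here is a rough sketch (not code from Packrat itself) of how a CRAN lookup without network access can block, and how one might bound it with R's download timeout:

# Bound the time spent on repository metadata / package downloads; on a host
# with no (or flaky) internet access, these calls are where a restore can hang.
options(timeout = 60)
repos <- "https://cran.rstudio.com"

av <- tryCatch(
  available.packages(repos = repos),
  error = function(e) {
    message("CRAN lookup failed: ", conditionMessage(e))
    NULL
  }
)

# Fall back to installing from CRAN only if the lookup succeeded and the
# package isn't already installed locally (e.g. restored from the cache).
if (!is.null(av) && "glue" %in% rownames(av) &&
    !requireNamespace("glue", quietly = TRUE)) {
  install.packages("glue", repos = repos)
}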

@guoqingxu

The workers of a CMLE training job do have internet access. For instance, it's very common for users to install extra Python packages via pip.

Was there any GCS access involved? Or was the log entry about GCS retrying just a red herring? How can we enable more verbose logging to get more info?

@javierluraschi
Contributor

@wwells I just tried running a simple job, which succeeded; would you mind running MNIST as well? You can find the file path under system.file("examples/mnist/train.R", package = "cloudml").
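
In case it helps, a sketch of how that might look from R (assuming the example ships at that path in your installed version of the package):

library(cloudml)

# Locate the bundled MNIST example script and submit it as a training job
# with the default (standard CPU) configuration.
mnist_script <- system.file("examples/mnist/train.R", package = "cloudml")
cloudml_train(mnist_script)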

Running MNIST should tell us whether this is broken for any training script or only under particular configurations; once we know that, I can help scope what the issue might be.

BTW, we have a Travis job that runs every week and validates training; I've changed it to run daily to help catch breaking changes sooner.

@jjallaire
Member

@javierluraschi The issue I saw was similar but not identical. The job would pause for 25 minutes in the middle of the packrat installation (when it was doing no more than symlinking packages from the shared bucket). In the two trials I ran, it paused for exactly 25 minutes at exactly the same point in the run (installation of purrr, see above).

@jjallaire
Member

I just re-ran the same job again and there was no delay / hanging during package installation.

@wwells Let us know if you are still seeing this behavior on your end.

@wwells
Author

wwells commented Apr 2, 2018

Thanks @javierluraschi. I can confirm I was able to run the MNIST train.R successfully.

Unfortunately, I just tried to resubmit my other job and it's stopping at the glue installation step again...

screen shot 2018-04-02 at 5 37 09 pm

I'm going to kill the job. Here's what I'm trying to submit. It's just a dummy job, but I'm trying to confirm that my cloud storage data generators are correctly configured.
https://gist.github.com/wwells/0d3d8170a60323efead0b2711952d967

cloudml_train("imagenetTrain.R", "standard_gpu")

@guoqingxu

@wwells Did the MNIST job use a GPU VM? I'm trying to find the difference between MNIST and the failed job. We load TensorFlow with GPU support only on GPU VMs.

@wwells
Author

wwells commented Apr 2, 2018

@guoqingxu - no, it did not. I submitted the MNIST job with the default standard CPU flavor. FWIW, around mid-day on Saturday, March 31, I did submit a job, cloudml_2018_04_01_000701068, with master type complex_model_m_gpu that completed as expected. That was the last one...

@guoqingxu

@wwells So basically we can conclude that the issue only happens on GPU VMs. @jjallaire Can you please verify whether the MNIST sample works on a GPU VM?

@wwells
Author

wwells commented Apr 3, 2018

Thanks @guoqingxu. Apologies - I corrected my last post. The cloudml_2018_04_01_000701068 job that did run successfully used complex_model_m_gpu, not CPU. I also just successfully ran the MNIST job on a GPU: cloudml_train("train.R", "standard_gpu")

@javierluraschi
Contributor

@wwells I get an access-denied error for the imagenet dataset, but the R script does run, which means the copy succeeds under my account...

screen shot 2018-04-02 at 9 09 51 pm

Where can I get the imagenet dataset from? Are you following https://cloud.google.com/blog/big-data/2016/12/how-to-classify-images-with-tensorflow-using-google-cloud-machine-learning-and-cloud-dataflow perhaps?

However, this makes me think there might be some network restriction, storage size restriction, etc. blocking the transfer from your account.

As a start, we could try deleting the R cloudml staging directory; in your Google Storage account you should see an r-cloudml folder as follows:

screen shot 2018-04-02 at 9 12 37 pm

If you don't have models or previous job results that you need to keep around, I would manually delete the r-cloudml folder and retry training. This wouldn't really explain why a file download hangs unless the logs are incomplete, but this is the only state R keeps between jobs, so it's worth deleting the folder and retraining with a clean job.
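
If it's useful, here is a hedged sketch of how that cleanup could be done from R; the bucket name below is a placeholder for your own staging bucket, not something taken from this thread:

# Remove the r-cloudml staging folder from the bucket before resubmitting.
# Replace the bucket name with your project's actual staging bucket.
bucket <- "gs://your-staging-bucket"   # placeholder
system2("gsutil", c("rm", "-r", paste0(bucket, "/r-cloudml")))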

@wwells
Author

wwells commented Apr 3, 2018

Thanks @javierluraschi. Deleting that folder seemed to do the trick for getting past R package installation. The job now seems to be exhibiting behavior similar to the 10-hr failed job, just at a different step.

I'm going to stop the job. It's not obvious to me that the issue is the data_generator calling out to the bucket. But if it is: the image_net bucket is set up as Regional storage in us-central1. I selected this because it's the region where my K80 GPU quota was increased to 4.

Should this be Multi-Regional Storage? Do I need to transfer contents to a new bucket to facilitate use with CloudML?

screen shot 2018-04-03 at 11 14 50 am

ImageNet

The image_net bucket isn't public, but it comprises the data from the ImageNet 2012 dataset (~145 GiB, http://www.image-net.org/challenges/LSVRC/2012/nonpub-downloads). Scripts to download and organize it are in https://github.com/wwells/CUNY_DATA_698/tree/master/ImageNetCNN. I also set up a smaller development pipeline using the Tiny ImageNet dataset (https://tiny-imagenet.herokuapp.com/).

FWIW - this would be a great dataset for Google to host as open data so others don't have to handle the setup. Stanford Vision Lab's servers are definitely getting punished with traffic.

@javierluraschi
Contributor

@wwells the logs seem to show that this is hanging while compiling the model, which should be unrelated to where the data is. However, I would still recommend rerunning your script against an imagenet subset, maybe 0.1% or so of the original 145 GiB, to check whether or not this is a data-size-related problem.

@wwells
Author

wwells commented Apr 8, 2018

Thanks @javierluraschi. I was able to work on this again. Here's an overview of the issues to help with diagnosis.

1: A dummy train over the subset dataset succeeded, but when the same script was run on the full dataset (using the YAML flags method; see the sketch after this overview) it failed. Curiously, it stopped around the glue installation step again.

screen shot 2018-04-07 at 8 15 34 pm

2: I then copied the cloud-ml staging bucket elsewhere, deleted it, and attempted to rerun the train over the full dataset. This time it got past package installation but stalled in a different place. I stopped the job after 15 hrs, the last 10 of which produced no further logs. There's no indication from the logs that the errors have anything to do with which bucket I'm pointing to.

screen shot 2018-04-08 at 11 20 16 am
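
For context on the flags approach mentioned in item 1, a rough sketch assuming the usual tfruns-style flags mechanism (the flag name, paths, and submit call are illustrative, not taken from the actual script):

# In the training script: declare a flag selecting the data location, with a
# default that points at the small development subset.
library(tfruns)
FLAGS <- flags(
  flag_string("data_dir", "gs://image_net/tiny-subset")  # illustrative default
)

# When submitting, override the flag to point at the full dataset, e.g.:
# cloudml_train("imagenetTrain.R", "standard_gpu",
#               flags = list(data_dir = "gs://image_net/full"))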

On May 29, 2018, @javierluraschi changed the title from "Failure to load Glue Package" to "Support for Training ImageNet".
@Z-ingdotnet

Z-ingdotnet commented Jan 17, 2020

(Quoting wwells' original comment above.)

Hi wwells,
I am also experiencing this issue. I'm only submitting the MNIST example job, and it stopped with the same error.
gcloud_install() was successful and the SDK runs fine. The GCP configuration also worked without problems from within R via the initialization function gcloud_init().

I could not figure out why, and I'm wondering if you have come across a solution.

Furthermore, I would appreciate it if you could share how you obtained the log sessions shown in the images you attached, as in my R terminal I only see the error code.

Much appreciated
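
In case it helps with the log question: the screenshots above come from the job's log viewer in the Google Cloud console, and you can also stream logs from R. A hedged sketch, assuming the cloudml package's job helpers (the job name below is a placeholder):

library(cloudml)

# Stream the Cloud ML Engine logs for a submitted job into the R console;
# "your_job_name" stands in for the id printed when the job was created.
job_stream_logs("your_job_name")

# Or check the job's current state without streaming.
job_status("your_job_name")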
