Support for Training ImageNet #135
Comments
I am seeing the same issue (although it halts at a different point during install, so I don't think the glue package has anything to do with this). I added a note to this effect to the CloudML ticket and will follow up with our contacts there as well.
I've submitted another job and it is hanging at the exact same spot (installation of purrr).
Thanks for the quick response (and the great book on Deep Learning with R!). I let a small test job run for about 10 hours last night (4-GPU request) and then stopped it when I woke up this morning to prevent additional expense. It didn't complete. I ran two more test jobs this morning before submitting this issue, and they both stalled at the exact same place, so I stopped them after 10 minutes or so of stalling. I just tried one more time and hit the same stalling point (for me at least, installing glue).
Can you send the project and job IDs to [email protected] please? Also, please provide the setup.py (and the other packages you were trying to install). We will try to reproduce and locate the issue. This might be caused by dependency conflicts. Thanks!
Thanks. Project and job IDs sent. FWIW, I used the default settings for …
In the logs of the 10-hour job, following the log entry 'Installing glue (1.2.0) ...', there's a log entry 'Retrying request, attempt #1...', which came from "/tools/google-cloud-sdk/platform/gsutil/gslib/util.py". Can you provide more details on how the glue package was supposed to be installed? Was the package coming from a GCS bucket or from a public repository (sorry, I have no knowledge of how R works with TensorFlow)? It looks like gsutil failed to fetch the package but no timeout was set on this operation.
@javierluraschi You may need to fill in some of the gaps here regarding what's actually going on under the hood.
I'll wait for @javierluraschi to weigh in with definitive details, but my understanding is that "Installing glue (1.2.0) ..." is doing nothing more than attempting to interact with a mounted volume (perhaps copying files, perhaps just symlinking). @javierluraschi and @kevinushey, I think the code being executed is here: https://github.com/rstudio/packrat/blob/3c2ba63c5101422b85250c3ce077ad73e5b9ea2a/R/restore.R#L403
If the package is already available in the Packrat cache, then the above analysis should be irrelevant, though -- just want to point it out in case it's a possibility.
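For illustration, here is roughly what the cached-package restore step amounts to; this is not packrat's actual code, and cache_dir / lib_dir are hypothetical stand-ins for the mounted cache and the project library:

    # Illustration only; not packrat's implementation.
    link_or_copy_pkg <- function(pkg, cache_dir, lib_dir) {
      src <- file.path(cache_dir, pkg)
      dst <- file.path(lib_dir, pkg)
      if (!dir.exists(src)) stop("package not found in cache: ", pkg)
      # Prefer a cheap symlink into the library; fall back to a full copy.
      ok <- file.symlink(src, dst)
      if (!isTRUE(ok)) ok <- file.copy(src, lib_dir, recursive = TRUE)
      invisible(ok)
    }

Either way, the step should be a fast filesystem operation rather than a network fetch, which is why a multi-minute hang here is surprising.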
The workers of a CMLE training job have internet access; for instance, it's very common for users to install extra Python packages via pip. Was there any GCS access involved? Or was the log entry about GCS retrying just a red herring? How can we enable more verbose logging to get more info?
@wwells I just tried running a simple job, which succeeded. Would you mind running the MNIST sample to check on your end? BTW, we have a Travis job which runs every week and validates training; I've changed this to run daily to help catch breaking changes with ease.
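For reference, submitting that check from R looks roughly like the sketch below; the script name is hypothetical, and the job helpers are the cloudml functions as I recall them:

    # Sketch only: submit a training script to Cloud ML Engine, then wait for
    # and download the results when it finishes.
    library(cloudml)

    job <- cloudml_train("mnist_mlp.R")
    job_collect(job)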
@javierluraschi The issue I saw was similar but not identical. The job would pause for 25 minutes in the middle of the packrat installation (when it was doing no more than symlinking packages from the shared bucket). In the two trials I ran, it paused for exactly 25 minutes at exactly the same point in the run (installation of purrr, see above).
I just re-ran the same job again and there was no delay / hanging during package installation. @wwells Let us know if you are still seeing this behavior on your end.
Thanks @javierluraschi. I can confirm I was able to run the MNIST sample successfully. Unfortunately, I just tried to resubmit my other job and it's stopping at the glue installation spot again... I'm going to kill the job. Here's what I'm trying to submit. It's just a dummy job, but I'm trying to confirm that my cloud storage data generators are correctly configured.
@wwells Did the MNIST job use a GPU VM? I'm trying to find the difference between MNIST and the failed job. We load TensorFlow with GPU support only on GPU VMs.
@guoqingxu - no, it did not. I submitted the MNIST job with the default standard CPU flavor. FWIW, on Saturday, March 31, around mid-day, I did submit a job …
@wwells So basically we can conclude that the issue only happens on GPU VMs. @jjallaire Can you please verify whether the MNIST sample works on a GPU VM?
Thanks @guoqingxu. Apologies, I corrected my last post. The …
@wwells I get an access denied to the imagenet dataset, but the R script does run, which means that the copy succeeds under my account... Where can I get the imagenet dataset from? Are you following https://cloud.google.com/blog/big-data/2016/12/how-to-classify-images-with-tensorflow-using-google-cloud-machine-learning-and-cloud-dataflow perhaps? However, this makes me think that there might be some network restriction, storage size restriction, etc. that might be blocking the transfer from your account. As a start, we could try deleting the R cloudml staging directory: in your Google Storage account you should see a staging bucket created by cloudml. If you don't have models or previous job results that you need to keep around, I would manually delete that staging directory.
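A rough sketch of that cleanup step, with a hypothetical bucket/prefix (substitute the staging location from your own account); gsutil is called from R via system2() purely for convenience:

    # Hypothetical staging location; check what is there before deleting anything.
    staging <- "gs://my-project-staging/r-cloudml"

    system2("gsutil", c("ls", "-r", staging))        # inspect contents
    system2("gsutil", c("-m", "rm", "-r", staging))  # remove; the next job recreates it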
Thanks @javierluraschi. Deleting that folder seemed to do the trick for getting past R package installation. However, the job seems to be exhibiting similar behavior to the 10-hour failed job, just at a different step, so I'm going to stop it. It's not obvious to me that the issue is the data generator calling out to the bucket, but if it is: the image_net bucket is set up as Regional Storage in us-central1. I selected this because it's the region where my K80 GPU quota was increased to 4. Should this be Multi-Regional Storage? Do I need to transfer the contents to a new bucket to facilitate use with CloudML?

ImageNet: the image_net bucket isn't public, but it's comprised of the data from the ImageNet 2012 dataset (~145 GiB, http://www.image-net.org/challenges/LSVRC/2012/nonpub-downloads). Scripts to download and organize it are in https://github.com/wwells/CUNY_DATA_698/tree/master/ImageNetCNN. I also set up a smaller development pipeline using the Tiny ImageNet dataset (https://tiny-imagenet.herokuapp.com/). FWIW, this would be a great dataset for Google to host as open data so others don't have to handle the setup; Stanford Vision Lab's servers are definitely getting punished with traffic.
@wwells the logs seem to show that this is hanging while compiling the model, which should be unrelated to where the data is. However, I would still recommend rerunning your script against an imagenet subset, maybe 0.1% or so of the original 145 GiB, to check whether or not this is a data-size-related problem.
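One way to build such a subset is to copy a handful of class directories to a separate prefix and point the script there; the bucket names and synset folders below are only examples of the idea:

    # Hypothetical source/destination prefixes and arbitrary example synsets.
    src <- "gs://image_net/train"
    dst <- "gs://image_net/train_subset"
    classes <- c("n01440764", "n01443537", "n01484850")

    for (cl in classes) {
      system2("gsutil", c("-m", "cp", "-r", file.path(src, cl), paste0(dst, "/")))
    }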
Thanks @javierluraschi. I was able to work on this again. Here's an overview of the issues to help with diagnosis.

1. Successful dummy train over the subset dataset, but when the same script was run on the full dataset (using the yml flag method) it failed. Curiously, it stopped around the …

2. I then copied the cloud-ml staging bucket elsewhere, deleted it, and attempted to rerun the train over the full dataset again. This time it got past package installation but stalled in a different place. I stopped the job after 15 hours, the last 10 of which produced no further logs.

There's no indication from the logs that the errors should have anything to do with which bucket I'm pointing to.
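For context, the "yml flag method" referred to above is typically something like the following tfruns sketch; the flag names and paths here are hypothetical, and the defaults can be overridden from a flags.yml file (e.g. data_dir: "gs://image_net/train_subset"):

    # Sketch only: declare flags with defaults in the training script.
    library(tfruns)

    FLAGS <- flags(
      flag_string("data_dir", "gs://image_net/train"),
      flag_numeric("batch_size", 32)
    )

    # The data generators read from FLAGS$data_dir, so switching between the
    # full dataset and the subset is a one-line change in flags.yml.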
Hi wwells, I could not figure out why this happens, and I'm just wondering if you have come across a solution. Furthermore, I would appreciate it if you could share how you obtained the log sessions shown in the images you attached, as in my R terminal I only see the error code … Much appreciated.
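In case it helps with that last question, the job logs can usually be pulled from R or from the gcloud CLI; the job id below is hypothetical (use the one printed when the job was submitted):

    library(cloudml)

    job_status("cloudml_2018_04_02_123456")       # query the job's current state
    job_stream_logs("cloudml_2018_04_02_123456")  # stream its Stackdriver logs

    # Equivalently, from a terminal:
    #   gcloud ml-engine jobs stream-logs cloudml_2018_04_02_123456

The same logs are also visible in the Google Cloud Console under the ML Engine jobs page.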
Hi CloudML team. Thanks for a great package!
I've been seeing a new issue in the last 24 hours where submitted jobs fail to run, stalling out during package installation. I tested a few different models (including ones that ran successfully previously), and I am seeing them all stop at the same point during the build.
https://issuetracker.google.com/issues/77356837