Support for Training ImageNet #135
Comments
I am seeing the same issue (although it halts at a different point during install, so I don't think the glue package has anything to do with this). I added a note to this effect to the CloudML ticket and will follow up with our contacts there as well.
I've submitted another job and it is hanging at the exact same spot (installation of purrr).
Thanks for the quick response (and the great book on Deep Learning with R!). I let a small test job run for about 10 hours last night (4-GPU request) and then stopped it when I woke up this morning to prevent additional expense. It didn't complete. I ran two more test jobs this morning before submitting this issue, and they both stalled at the exact same place, so I stopped them after 10 minutes or so of stalling. I just tried one more time and hit the same stalling point (for me at least, installing glue).
Can you send the project and job IDs to [email protected] please? Also, please provide the setup.py (and the other packages you were trying to install). We will try to reproduce and locate the issue. This might be caused by dependency conflicts. Thanks!
Thanks. Project and job IDs sent. FWIW, I used the default settings for …
In the logs of the 10-hour job, following the log entry 'Installing glue (1.2.0) ...', there's a log entry 'Retrying request, attempt #1...', which came from "/tools/google-cloud-sdk/platform/gsutil/gslib/util.py". Can you provide more details on how the glue package was supposed to be installed? Was the package coming from a GCS bucket or from a public repository (sorry, I have no knowledge of how R works with TensorFlow)? It looks like gsutil failed to fetch the package but no timeout was set on this operation.
@javierluraschi You may need to fill in some of the gaps here regarding what's actually going on under the hood.
I'll wait for @javierluraschi to weigh in with definitive details, but my understanding is that "Installing glue (1.2.0) ..." is doing nothing more than attempting to interact with a mounted volume (perhaps copying files, perhaps just symlinking). @javierluraschi and @kevinushey, I think the code being executed is here: https://github.com/rstudio/packrat/blob/3c2ba63c5101422b85250c3ce077ad73e5b9ea2a/R/restore.R#L403
If the package is already available in the Packrat cache, then the above analysis should be irrelevant, though -- just want to point it out in case it's a possibility.
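For illustration, here is roughly what the cached-package restore step amounts to; this is not packrat's actual code, and cache_dir / lib_dir are hypothetical stand-ins for the mounted cache and the project library:

    # Illustration only; not packrat's implementation.
    link_or_copy_pkg <- function(pkg, cache_dir, lib_dir) {
      src <- file.path(cache_dir, pkg)
      dst <- file.path(lib_dir, pkg)
      if (!dir.exists(src)) stop("package not found in cache: ", pkg)
      # Prefer a cheap symlink into the library; fall back to a full copy.
      ok <- file.symlink(src, dst)
      if (!isTRUE(ok)) ok <- file.copy(src, lib_dir, recursive = TRUE)
      invisible(ok)
    }

Either way, the step should be a fast filesystem operation rather than a network fetch, which is why a multi-minute hang here is surprising.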
The workers of a CMLE training job have internet access; for instance, it's very common for users to install extra Python packages via pip. Was there any GCS access involved? Or was the log entry about GCS retrying just a red herring? How can we enable more verbose logging to get more info?
@wwells I just tried running a simple job, which succeeded. Would you mind running the MNIST sample to check on your end? BTW, we have a Travis job which runs every week and validates training; I've changed this to run daily to help catch breaking changes with ease.
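For reference, submitting that check from R looks roughly like the sketch below; the script name is hypothetical, and the job helpers are the cloudml functions as I recall them:

    # Sketch only: submit a training script to Cloud ML Engine, then wait for
    # and download the results when it finishes.
    library(cloudml)

    job <- cloudml_train("mnist_mlp.R")
    job_collect(job)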
@javierluraschi The issue I saw was similar but not identical. The job would pause for 25 minutes in the middle of the packrat installation (when it was doing no more than symlinking packages from the shared bucket). In the two trials I ran, it paused for exactly 25 minutes at exactly the same point in the run (installation of purrr, see above).
I just re-ran the same job again and there was no delay / hanging during package installation. @wwells Let us know if you are still seeing this behavior on your end.
Thanks @javierluraschi. I can confirm I was able to run the MNIST sample successfully. Unfortunately, I just tried to resubmit my other job and it's stopping at the glue installation spot again... I'm going to kill the job. Here's what I'm trying to submit. It's just a dummy job, but I'm trying to confirm that my cloud storage data generators are correctly configured.
@wwells Did the MNIST job use a GPU VM? I'm trying to find the difference between MNIST and the failed job. We load TensorFlow with GPU support only on GPU VMs.
@guoqingxu - no, it did not. I submitted the MNIST job with the default standard CPU flavor. FWIW, on Saturday, March 31, around mid-day, I did submit a job …
@wwells So basically we can conclude that the issue only happens on GPU VMs. @jjallaire Can you please verify whether the MNIST sample works on a GPU VM?
Thanks @guoqingxu. Apologies, I corrected my last post. The …
@wwells I get an access denied to the imagenet dataset, but the R script does run, which means that the copy succeeds under my account... Where can I get the imagenet dataset from? Are you following https://cloud.google.com/blog/big-data/2016/12/how-to-classify-images-with-tensorflow-using-google-cloud-machine-learning-and-cloud-dataflow perhaps? However, this makes me think that there might be some network restriction, storage size restriction, etc. that might be blocking the transfer from your account. As a start, we could try deleting the R cloudml staging directory: in your Google Storage account you should see a staging bucket created by cloudml. If you don't have models or previous job results that you need to keep around, I would manually delete that staging directory.
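A rough sketch of that cleanup step, with a hypothetical bucket/prefix (substitute the staging location from your own account); gsutil is called from R via system2() purely for convenience:

    # Hypothetical staging location; check what is there before deleting anything.
    staging <- "gs://my-project-staging/r-cloudml"

    system2("gsutil", c("ls", "-r", staging))        # inspect contents
    system2("gsutil", c("-m", "rm", "-r", staging))  # remove; the next job recreates it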
Thanks @javierluraschi. Deleting that folder seemed to do the trick for getting past R package installation. However, the job seems to be exhibiting similar behavior to the 10-hour failed job, just at a different step, so I'm going to stop it. It's not obvious to me that the issue is the data generator calling out to the bucket, but if it is: the image_net bucket is set up as Regional Storage in us-central1. I selected this because it's the region where my K80 GPU quota was increased to 4. Should this be Multi-Regional Storage? Do I need to transfer the contents to a new bucket to facilitate use with CloudML?

ImageNet: the image_net bucket isn't public, but it's comprised of the data from the ImageNet 2012 dataset (~145 GiB, http://www.image-net.org/challenges/LSVRC/2012/nonpub-downloads). Scripts to download and organize it are in https://github.com/wwells/CUNY_DATA_698/tree/master/ImageNetCNN. I also set up a smaller development pipeline using the Tiny ImageNet dataset (https://tiny-imagenet.herokuapp.com/). FWIW, this would be a great dataset for Google to host as open data so others don't have to handle the setup; Stanford Vision Lab's servers are definitely getting punished with traffic.
@wwells the logs seem to show that this is hanging while compiling the model, which should be unrelated to where the data is. However, I would still recommend rerunning your script against an imagenet subset, maybe 0.1% or so of the original 145 GiB, to check whether or not this is a data-size-related problem.
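One way to build such a subset is to copy a handful of class directories to a separate prefix and point the script there; the bucket names and synset folders below are only examples of the idea:

    # Hypothetical source/destination prefixes and arbitrary example synsets.
    src <- "gs://image_net/train"
    dst <- "gs://image_net/train_subset"
    classes <- c("n01440764", "n01443537", "n01484850")

    for (cl in classes) {
      system2("gsutil", c("-m", "cp", "-r", file.path(src, cl), paste0(dst, "/")))
    }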
Thanks @javierluraschi. I was able to work on this again. Here's an overview of the issues to help with diagnosis.

1. Successful dummy train over the subset dataset, but when the same script was run on the full dataset (using the yml flag method) it failed. Curiously, it stopped around the …

2. I then copied the cloud-ml staging bucket elsewhere, deleted it, and attempted to rerun the train over the full dataset again. This time it got past package installation but stalled in a different place. I stopped the job after 15 hours, the last 10 of which produced no further logs.

There's no indication from the logs that the errors should have anything to do with which bucket I'm pointing to.
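For context, the "yml flag method" referred to above is typically something like the following tfruns sketch; the flag names and paths here are hypothetical, and the defaults can be overridden from a flags.yml file (e.g. data_dir: "gs://image_net/train_subset"):

    # Sketch only: declare flags with defaults in the training script.
    library(tfruns)

    FLAGS <- flags(
      flag_string("data_dir", "gs://image_net/train"),
      flag_numeric("batch_size", 32)
    )

    # The data generators read from FLAGS$data_dir, so switching between the
    # full dataset and the subset is a one-line change in flags.yml.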
Hi wwells, I could not figure out why this happens, and I'm just wondering if you have come across a solution. Furthermore, I would appreciate it if you could share how you obtained the log sessions shown in the images you attached, as in my R terminal I only see the error code … Much appreciated.
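In case it helps with that last question, the job logs can usually be pulled from R or from the gcloud CLI; the job id below is hypothetical (use the one printed when the job was submitted):

    library(cloudml)

    job_status("cloudml_2018_04_02_123456")       # query the job's current state
    job_stream_logs("cloudml_2018_04_02_123456")  # stream its Stackdriver logs

    # Equivalently, from a terminal:
    #   gcloud ml-engine jobs stream-logs cloudml_2018_04_02_123456

The same logs are also visible in the Google Cloud Console under the ML Engine jobs page.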
Hi CloudML team. Thanks for a great package!
I've been seeing a new issue in the last 24 hours where submitted jobs fail to run, stalling out during package installation. I tested a few different models (including ones that ran successfully previously), and I am seeing them all stop at the same point during the build.
https://issuetracker.google.com/issues/77356837