Saving checkpoint step gets stuck forever (intermittent problem) #1603

Open
rodrigo-f-nogueira opened this issue Oct 25, 2024 · 0 comments
Perhaps this is a problem related to Google Cloud Storage, but since last week it has been very common for training to get stuck at the checkpoint-saving step:

I1025 08:58:19.649921 139866418210816 checkpoints.py:1191] Restoring dataset iterator from 'train_ds-027-of-032'.
E1025 08:58:19.650164 139866418210816 dataset_iterator.py:130] DatasetIterator.load() is deprecated. Please use restore().
I1025 08:58:21.356897 139866418210816 train.py:393] Initialize/restore complete (4.48 seconds).
I1025 08:58:21.358641 139866418210816 evaluation.py:370] Initializing Evaluator for 'ds'
I1025 08:58:21.358845 139866418210816 evaluation.py:72] Task ds has no metrics defined; skipping eval.
W1025 08:58:21.358932 139866418210816 evaluation.py:386] No eval task with valid split and metric fn found. Skipping eval.
I1025 08:58:21.359025 139866418210816 train.py:568] Saving checkpoint before the training loop starts.  *** It doesn't pass this step ***

It is an intermittent problem, i.e., sometimes the checkpoint saves and sometimes it doesn't.
