exp-workers failing without logs #10673

Open
nv-pipo opened this issue Jan 16, 2025 · 0 comments
Labels: A: experiments (Related to dvc exp), bug (Did we break something?), help wanted, triage (Needs to be triaged)

nv-pipo commented Jan 16, 2025

Bug Report

DVC EXP workers dying

Running multiple workers results in failed experiments and no logs

Description

Running dvc queue start with -j greater than 1 causes some experiments to fail that should not fail, and the failed experiments have no logs. In addition, the exp-worker sometimes dies along with the failed experiments.

Reproduce

Example:

params.yaml

value: 1

dvc.yaml

stages:
  experiment_candles:
    cmd: sleep 5 ; echo DONE
    params:
      - params.yaml:

1. git init
2. dvc init
3. Copy dvc.yaml
4. Copy params.yaml
5. git add *.yaml
6. git commit -m "initial commit"
7. Queue experiments
    dvc exp run \
        --queue \
        --set-param "value=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99"
8. Start 20 jobs
    dvc queue start -j 20
9. Check for failed jobs
    dvc queue status | grep Failed
10. Check logs of failed jobs
    dvc queue logs ...
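
The last two steps (finding failed jobs, then checking their logs) can be combined so that every failed task is inspected in one pass. This is only a rough sketch: it assumes the task identifier is the first whitespace-separated column of the dvc queue status output, that failed rows contain the word Failed (as the grep above implies), and that dvc queue logs accepts that identifier. The names failed_tasks and show_failed_logs are hypothetical helpers, not part of DVC.

```shell
#!/bin/sh
# Print the identifiers of failed tasks, reading `dvc queue status`
# output on stdin. Assumption: the identifier is the first column.
failed_tasks() {
    awk '/Failed/ { print $1 }'
}

# Show the logs of every failed task. With the bug described in this
# report, the log section for a failed task is expected to be empty.
show_failed_logs() {
    dvc queue status | failed_tasks | while read -r task; do
        echo "=== logs for $task ==="
        dvc queue logs "$task"
    done
}

# Usage (inside the test repository):
#   show_failed_logs
```

An empty section under a "logs for" header would correspond to the missing logs described above.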

Note: the failure is intermittent, so you may need to repeat from step 7 several times before it shows up.
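
Because the failure does not happen on every run, it may help to script the queue, start, and check steps and repeat them. The following is a sketch under assumptions, not a verified reproducer: it assumes rows for unfinished tasks in dvc queue status contain Queued or Running, and the helper names value_list, count_failed, and repro_once are hypothetical. The poll limit is arbitrary and exists only so the loop ends if a worker dies and leaves tasks unfinished.

```shell
#!/bin/sh
# Build the comma-separated list "0,1,...,N-1" used for --set-param.
value_list() {
    n=$1; i=0; out=""
    while [ "$i" -lt "$n" ]; do
        out="${out:+$out,}$i"
        i=$((i + 1))
    done
    printf '%s\n' "$out"
}

# Count rows marked Failed in `dvc queue status` output read from stdin.
count_failed() {
    grep -c 'Failed' || true
}

# Queue 100 experiments, run them with 20 workers, print the failure count.
# Assumption: rows still in progress contain "Queued" or "Running".
repro_once() {
    dvc exp run --queue --set-param "value=$(value_list 100)"
    dvc queue start -j 20
    polls=0
    # Poll every 5 seconds, giving up after 120 polls (about 10 minutes).
    while [ "$polls" -lt 120 ] && dvc queue status | grep -Eq 'Queued|Running'; do
        sleep 5
        polls=$((polls + 1))
    done
    dvc queue status | count_failed
}

# Usage: repeat until at least one experiment fails, at most 10 rounds.
#   round=0
#   while [ "$round" -lt 10 ] && [ "$(repro_once | tail -n 1)" -eq 0 ]; do
#       round=$((round + 1))
#   done
```

The count includes failures left over from earlier rounds, which is enough to tell whether the problem has appeared at least once.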

Output sample

[Image: screenshot of the output, attached to the original issue]

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 3.59.0 (brew)
--------------------------
Platform: Python 3.13.1 on macOS-15.2-arm64-arm-64bit-Mach-O
Subprojects:
        dvc_data = 3.16.7
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.40.2
        scmrepo = 3.3.9
Supports:
        azure (adlfs = 2024.12.0, knack = 0.12.0, azure-identity = 1.19.0),
        gdrive (pydrive2 = 1.21.3),
        gs (gcsfs = 2024.12.0),
        hdfs (fsspec = 2024.12.0, pyarrow = 18.1.0),
        http (aiohttp = 3.11.11, aiohttp-retry = 2.9.1),
        https (aiohttp = 3.11.11, aiohttp-retry = 2.9.1),
        oss (ossfs = 2023.12.0),
        s3 (s3fs = 2024.12.0, boto3 = 1.35.93),
        ssh (sshfs = 2024.9.0),
        webdav (webdav4 = 0.10.0),
        webdavs (webdav4 = 0.10.0),
        webhdfs (fsspec = 2024.12.0)
Config:
        Global: /Users/pichurri/Library/Application Support/dvc
        System: /Users/pichurri/homebrew/share/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s3s1
Repo: dvc, git
Repo.site_cache_dir: /Users/pichurri/homebrew/var/cache/dvc/repo/7b5c17002f7a7963a4dc1afee2b961e2

nv-pipo changed the title from "workers race condition" to "exp-workers failing without logs" on Jan 16, 2025
shcheklein added the labels bug, triage, A: experiments, and help wanted on Jan 26, 2025