dvc exp run --queue gives unclear error without committed pipeline files #10697

Open
pwithams opened this issue Feb 28, 2025 · 2 comments

pwithams commented Feb 28, 2025

Bug Report

dvc exp run --queue: fails with "No such file or directory" on a cache path similar to .dvc/tmp/exps

Description

  1. It appears that dvc exp run --queue only works on DVC pipelines that have been previously committed to git
  2. The error from this is not clear

When running a queued experiment with dvc exp run --queue, the job is queued and can be started with dvc queue start. However, it will fail with an error similar to ERROR: unexpected error - [Errno 2] No such file or directory: '[path to repo]/.dvc/tmp/exps/tmpabc123/....

The same experiment runs successfully with dvc repro and dvc exp run. The queued run also works once the pipeline is committed with git, which suggests the uncommitted pipeline is either the cause or related to it, but the error message does not mention this.

Also, once the pipeline is committed and a new uncommitted change is made, it is unclear which version of the pipeline is being run: the committed version, or the "dirty" version in the current directory.

Other minor issues

These can be separate issues if required.

  • print statements are not shown with --follow unless explicitly flushed, though this may just be unavoidable Celery behaviour
  • when running dvc queue logs [task] on a task that needs some slow dvc checkout startup, it gives a "no logs available" message, but with --follow it gives ERROR: unexpected error - : [Errno 2] No such file or directory: '/[path to repo]/.dvc/tmp/exps/run/[uuid]/[uuid].json'. The same command succeeds later, presumably once the job has actually started.
  • the UTC timestamps shown by dvc queue status switch to MM DD, YYYY format on the next day, which hides helpful time info, especially if you don't work in UTC (i.e. the switch can happen in the middle of your working day)
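
For the first point, a minimal sketch of the workaround, based on the main.py loop from the Reproduce section below. The function name run_loop and its parameters are only for illustration and are not part of the original script. print(..., flush=True) is standard Python and is equivalent to calling sys.stdout.flush() after each print.

```python
import time


def run_loop(duration=10.0, interval=5.0):
    """Print a timestamp every `interval` seconds for roughly `duration` seconds.

    Returns the number of lines printed.
    """
    start_time = time.time()
    lines = 0
    while time.time() - start_time < duration:
        # flush=True writes the line out immediately instead of leaving it in
        # the buffer Python uses when stdout is not attached to a terminal.
        print("Running at", time.time(), flush=True)
        lines += 1
        time.sleep(interval)
    return lines
```

Calling run_loop() with the defaults matches the timing of the original script. Running the unmodified script with python3 -u main.py, or with PYTHONUNBUFFERED=1 set, also disables Python's output buffering and should have the same effect, though that was not tested for this report.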

Reproduce

  1. Create a new repo: mkdir /tmp/example; cd /tmp/example; git init; dvc init;
  2. Create a pipeline: mkdir pipeline and copy the following as files:
    main.py
  
import time

start_time = time.time()

# Print a timestamp every 5 seconds for about 10 seconds
while time.time() - start_time < 10:
    print("Running at", time.time())
    time.sleep(5)

dvc.yaml

stages:
  main:
    cmd: python3 main.py
  3. Run dvc repro: pipeline runs successfully
  4. Run dvc exp run: fails because the repo has no git commit yet (not really a problem in most repos, and the error message is clear)
  5. Run git commit -m "Setup repo" to commit the init files, but not the pipeline files, so that the repo has at least one commit
  6. Run dvc exp run: experiment runs successfully
  7. Run dvc exp run --queue: command runs successfully
  8. Run dvc queue start: command runs successfully
  9. Run dvc queue logs [task name]: shows "ERROR: unexpected error - [Errno 2] No such file or directory"
  10. Run dvc queue status: task shown as "Failed"
  11. Commit pipeline files
  12. Re-run steps 7 and 8
  13. Run dvc queue logs [task name]: no error, task running as expected
  14. Run dvc queue status: task eventually shown as success
  15. Bonus: Run dvc queue logs [task name] --follow: note that print statements are not shown until end of task, unless sys.stdout.flush() is called
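
For convenience, the commands from dvc repro through dvc queue status can be scripted. This is a rough sketch, not something run for the original report. It assumes main.py and dvc.yaml already exist in the directory passed as cwd, and that git init and dvc init have been run. The names REPRO_COMMANDS and run_commands are made up for illustration. dvc queue logs is left out because the task name is only known after queuing.

```python
import shlex
import subprocess

# Commands from the list above, in order, up to the point where the
# queued task is reported as failed.
REPRO_COMMANDS = [
    ["dvc", "repro"],
    ["dvc", "exp", "run"],  # reported to fail: the repo has no git commit yet
    ["git", "commit", "-m", "Setup repo"],  # commits the init files only
    ["dvc", "exp", "run"],
    ["dvc", "exp", "run", "--queue"],
    ["dvc", "queue", "start"],
    ["dvc", "queue", "status"],
]


def run_commands(commands, cwd=".", runner=subprocess.run):
    """Run each command in `cwd` and return (command line, exit code) pairs."""
    results = []
    for cmd in commands:
        # check=False keeps going after a failure, because some of the
        # steps are expected to fail.
        completed = runner(cmd, cwd=cwd, check=False)
        results.append((shlex.join(cmd), completed.returncode))
    return results
```

Usage would be something like run_commands(REPRO_COMMANDS, cwd="/tmp/example"). The worker started by dvc queue start runs in the background, so the final dvc queue status may need to be repeated a few seconds later before it shows the failure.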

Expected

Either dvc exp run --queue should work without first committing the pipeline, or a clear error message should be shown indicating it needs to be committed first.

If a committed pipeline is required it should be clear whether the committed version or the current "dirty" version of the pipeline is being run.
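
Until then, a user-side check before queuing can at least make the situation visible. The sketch below is a workaround idea only and says nothing about how DVC decides internally. The helper name pipeline_file_state is hypothetical, and it assumes cwd is the git repo root and path is relative to it. It relies on two standard git commands: git cat-file -e HEAD:<path>, which exits non-zero when the file is not in the last commit, and git status --porcelain -- <path>, which prints a line when the file has staged or unstaged changes.

```python
import subprocess


def pipeline_file_state(path="dvc.yaml", cwd=".", runner=subprocess.run):
    """Classify `path` as 'uncommitted', 'dirty', or 'clean' relative to HEAD."""
    # Exit code is non-zero if the file does not exist in the HEAD commit
    # (or if there is no commit at all).
    in_head = runner(
        ["git", "cat-file", "-e", f"HEAD:{path}"],
        cwd=cwd,
        capture_output=True,
        text=True,
    )
    if in_head.returncode != 0:
        return "uncommitted"

    # For a file that is in HEAD, porcelain output is empty unless the file
    # has staged or unstaged modifications.
    status = runner(
        ["git", "status", "--porcelain", "--", path],
        cwd=cwd,
        capture_output=True,
        text=True,
    )
    return "dirty" if status.stdout.strip() else "clean"
```

In the reproduction above, this would be expected to return "uncommitted" for dvc.yaml before the pipeline files are committed, which is the state in which the queued task failed.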

Environment information

DVC version: 3.51.2 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.6
Supports:
        http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2024.6.1, boto3 = 1.34.131)
Config:
        Global: ~/.config/dvc
        System: /etc/xdg/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/sdc
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/c8f65ca41ec45168d44ff0121e3c0037
@giorgiabarzan

+1

@thomaskleiven

We have the same issue. Does anyone have an idea of what's causing this?
