dvc data status takes ≈1min, even when nothing has changed #10772

Open
@ryan-williams

Bug Report

Description

hudcostreets/ctbk.dev contains 2,105 .parquet.dvc files (roughly: ≈15 datasets, each partitioned into ≈150 monthly files since 2013) totaling 62G.

dvc data status takes ≈1min (even on repeated incantations, when nothing has changed), apparently using just 1 thread. I also experimented with simple DVC pipelines, and some of those dvc commands also seemed to index the repo for ≈1min before doing anything.

Is this expected? I'd think DVC could observe file mtimes and skip files quickly on repeat invocations, at a minimum.
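To make the mtime idea concrete, here is a minimal sketch of the kind of check I have in mind. It is not DVC's implementation; `stat_signature`, `changed_files`, and the JSON cache file are hypothetical names used only for illustration. It assumes a file can be treated as unchanged when its mtime, size, and inode all match what was recorded on the previous run, so no file contents are read.

```python
import json
import os
from pathlib import Path


def stat_signature(path):
    """Return a cheap [mtime_ns, size, inode] fingerprint; file contents are not read."""
    st = os.stat(path)
    return [st.st_mtime_ns, st.st_size, st.st_ino]


def changed_files(paths, cache_file):
    """Return the paths whose stat signature differs from the previous run.

    The signatures seen on this run are written to `cache_file` (hypothetical
    JSON cache) so the next call can compare against them.
    """
    cache_file = Path(cache_file)
    previous = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    current = {str(p): stat_signature(p) for p in paths}
    changed = [p for p, sig in current.items() if previous.get(p) != sig]
    cache_file.write_text(json.dumps(current))
    return changed
```

With something like this, a repeat run over ≈2,100 files would cost one `stat` call per file, and only the files reported as changed would need to be re-hashed.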

Reproduce

On an m6a.8xlarge instance with:

  • Ubuntu 24.04 amd64 (ami-0731becbf832f281e)
  • 384G EBS gp3 volume with ext4 FS
  • Python 3.12.9

# Clone / Install
git clone https://github.com/hudcostreets/ctbk.dev
cd ctbk.dev
git checkout b787b184
pip install -e .

# Before `pull`ing:
time dvc data status
# real	1m10.727s
# user	1m1.733s
# sys	0m7.713s
time dvc data status
# real	1m10.522s
# user	1m1.988s
# sys	0m7.733s

dvc pull  # ≈20mins, 62G downloaded to .dvc/cache + 62G applied in worktree

# After `pull`ing:
time dvc data status
# real	0m59.191s
# user	0m51.454s
# sys	0m7.089s
time dvc data status
# real	0m58.269s
# user	0m50.611s
# sys	0m7.263s

Expected

  • dvc data status should be ≈instantaneous when invoked a 2nd time.
  • If I've only changed one file, it should be correspondingly fast.
  • Indexing(?) the first time should be multi-threaded
    • user is ≈90% of real in the timings above, implying mostly single-threaded processing; monitoring in htop corroborates this.
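The single-thread inference can be checked with a little arithmetic on the `time` output above: (user + sys) / real approximates the average number of cores kept busy. The helper below is only an illustration (`cores_busy` is a made-up name), with the timings copied from the Reproduce section and converted to seconds.

```python
def cores_busy(real_s, user_s, sys_s):
    """Approximate average number of busy cores: CPU time divided by wall-clock time."""
    return (user_s + sys_s) / real_s


# Timings from the Reproduce section, in seconds.
before_pull = cores_busy(real_s=70.727, user_s=61.733, sys_s=7.713)
after_pull = cores_busy(real_s=59.191, user_s=51.454, sys_s=7.089)

print(f"before pull: {before_pull:.2f} cores busy")
print(f"after pull:  {after_pull:.2f} cores busy")
```

Both ratios come out just under 1, which is what a single busy thread looks like; a run that used many of the m6a.8xlarge's cores would show a ratio well above 1.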

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 3.60.0 (pip)
-------------------------
Platform: Python 3.12.9 on Linux-6.8.0-1029-aws-x86_64-with-glibc2.39
Subprojects:
	dvc_data = 3.16.10
	dvc_objects = 5.1.1
	dvc_render = 1.0.2
	dvc_task = 0.40.2
	scmrepo = 3.3.11
Supports:
	http (aiohttp = 3.12.9, aiohttp-retry = 2.9.1),
	https (aiohttp = 3.12.9, aiohttp-retry = 2.9.1),
	s3 (s3fs = 2025.5.1, boto3 = 1.37.3),
	ssh (sshfs = 2025.2.0)
Config:
	Global: /home/ubuntu/.config/dvc
	System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p1
Caches: local
Remotes: s3, ssh
Workspace directory: ext4 on /dev/nvme0n1p1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/c6ba21dd4999cafbf0b94985971d7666

Additional Information (if any):

I'm working on an EXT4 EBS volume:

cat /etc/fstab | column -t -s$'\t'
# LABEL=cloudimg-rootfs  /          ext4   discard,commit=30,errors=remount-ro  0 1
# LABEL=BOOT             /boot      ext4   defaults                             0 2
# LABEL=UEFI             /boot/efi  vfat   umask=0077                           0 1

I also see similar performance on an M1 macbook APFS data volume.

xref #9428

Labels: A: data-management (Related to dvc add/checkout/commit/move/remove), performance (improvement over resource / time consuming tasks)