Bug Report
Description
hudcostreets/ctbk.dev contains 2,105 .parquet.dvc files (roughly: ≈15 datasets, each partitioned into ≈150 monthly files since 2013) totaling 62G. dvc data status takes ≈1min (even on repeated incantations, when nothing has changed), apparently using just 1 thread. I also experimented with simple DVC pipelines, and some of those dvc commands also seemed to index the repo for ≈1min before doing anything.
Is this expected? I'd think DVC could observe file mtimes and skip files quickly on repeat invocations, at a minimum.
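To make the mtime idea concrete, here is a minimal sketch of skipping unchanged files by caching each hash against the file's (mtime, size) pair. It is an illustration only, not DVC's implementation; `file_md5`, `cached_md5`, and the in-memory `cache` and `stats` dicts are hypothetical names.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks (the expensive step we want to skip)."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_md5(path, cache, stats):
    """Return the file's MD5, rehashing only if (mtime_ns, size) changed.

    `cache` maps path -> ((mtime_ns, size), md5); `stats` counts hits/misses.
    Both are plain dicts here; a real tool would persist the cache on disk.
    """
    st = os.stat(path)
    key = (st.st_mtime_ns, st.st_size)
    entry = cache.get(path)
    if entry is not None and entry[0] == key:
        stats["hits"] += 1  # unchanged since last run: one stat() call, no read
        return entry[1]
    stats["misses"] += 1  # new or modified file: read and hash it
    md5 = file_md5(path)
    cache[path] = (key, md5)
    return md5
```

With a cache like this, a repeat run over ≈2,000 unchanged files costs ≈2,000 `stat()` calls rather than a full read of every file. One known caveat of this approach: an edit that keeps the same size and lands within the filesystem's timestamp resolution would be missed.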
Reproduce
On an m6a.8xlarge instance with:
- Ubuntu 24.04 amd64 (ami-0731becbf832f281e)
- 384G EBS gp3 volume with ext4 FS
- Python 3.12.9
# Clone / Install
git clone https://github.com/hudcostreets/ctbk.dev
cd ctbk.dev
git checkout b787b184
pip install -e .
# Before `pull`ing:
time dvc data status
# real 1m10.727s
# user 1m1.733s
# sys 0m7.713s
time dvc data status
# real 1m10.522s
# user 1m1.988s
# sys 0m7.733s
dvc pull # ≈20mins, 62G downloaded to .dvc/cache + 62G applied in worktree
# After `pull`ing:
time dvc data status
# real 0m59.191s
# user 0m51.454s
# sys 0m7.089s
time dvc data status
# real 0m58.269s
# user 0m50.611s
# sys 0m7.263s
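To put a number on how parallel these runs are, the sketch below converts `time`-style values into seconds and computes (user + sys) / real. A ratio near 1 means roughly one busy core; a well-parallelized run on a multi-core machine would come out well above 1. The helper names `parse_time` and `cpu_utilization` are hypothetical.

```python
import re


def parse_time(text):
    """Convert a bash `time` value such as '1m10.727s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def cpu_utilization(real, user, sys):
    """Return (user + sys) / real, i.e. the average number of busy cores."""
    return (parse_time(user) + parse_time(sys)) / parse_time(real)


if __name__ == "__main__":
    # First run before `pull`ing, copied from the timings above.
    ratio = cpu_utilization("1m10.727s", "1m1.733s", "0m7.713s")
    print(f"average busy cores: {ratio:.2f}")
```

For the first run above this comes out at about 0.98, i.e. just under one core kept busy for the whole minute.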
Expected
- dvc data status should be ≈instantaneous when invoked a 2nd time.
- If I've only changed one file, it should be correspondingly fast.
- Indexing(?) the first time should be multi-threaded.
  - user is ≈90% of real in the timings above, implying mostly single-threaded processing; monitoring in htop corroborates this.
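As a rough illustration of the multi-threaded first pass requested above, the following sketch hashes many files concurrently with a thread pool. It assumes the slow step is reading and hashing file contents, which this report does not confirm; `hash_files` and its `jobs` parameter are hypothetical and not part of DVC. Threads can help here because CPython's `hashlib` releases the GIL while hashing larger buffers.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def file_md5(path, chunk_size=1 << 20):
    """Hash one file's contents in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_files(paths, jobs=8):
    """Hash many files concurrently; returns a {path: md5} dict.

    `jobs` is an assumed tuning knob for the number of worker threads.
    """
    paths = list(paths)  # materialize once so we can zip paths with results
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # pool.map preserves input order, so results line up with paths.
        return dict(zip(paths, pool.map(file_md5, paths)))
```

On a 32-vCPU instance like the one above, something along these lines could in principle spread the initial indexing across cores, though the actual speedup would depend on where the time really goes.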
Environment information
Output of dvc doctor:
$ dvc doctor
DVC version: 3.60.0 (pip)
-------------------------
Platform: Python 3.12.9 on Linux-6.8.0-1029-aws-x86_64-with-glibc2.39
Subprojects:
dvc_data = 3.16.10
dvc_objects = 5.1.1
dvc_render = 1.0.2
dvc_task = 0.40.2
scmrepo = 3.3.11
Supports:
http (aiohttp = 3.12.9, aiohttp-retry = 2.9.1),
https (aiohttp = 3.12.9, aiohttp-retry = 2.9.1),
s3 (s3fs = 2025.5.1, boto3 = 1.37.3),
ssh (sshfs = 2025.2.0)
Config:
Global: /home/ubuntu/.config/dvc
System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p1
Caches: local
Remotes: s3, ssh
Workspace directory: ext4 on /dev/nvme0n1p1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/c6ba21dd4999cafbf0b94985971d7666
Additional Information (if any):
I'm working on an EXT4 EBS volume:
cat /etc/fstab | column -t -s$'\t'
# LABEL=cloudimg-rootfs / ext4 discard,commit=30,errors=remount-ro 0 1
# LABEL=BOOT /boot ext4 defaults 0 2
# LABEL=UEFI /boot/efi vfat umask=0077 0 1
I also see similar performance on an M1 MacBook APFS data volume.
xref #9428