
HuggingFace not using specified cache_dir #544


Closed
deependujha opened this issue Apr 8, 2025 · 4 comments · Fixed by #560 or #569
Labels: bug, help wanted


@deependujha
Collaborator

🐛 Bug

To Reproduce

  • Run the code below and observe that the my_cache directory is not used
import litdata as ld

# Define the Hugging Face dataset URI
hf_dataset_uri = "hf://datasets/leonardPKU/clevr_cogen_a_train/data"

# Create a streaming dataset
# The dataset is about 13.2 GB, which exceeds max_cache_size,
# so the cache should be cleared by the end of streaming
dataset = ld.StreamingDataset(hf_dataset_uri, cache_dir="my_cache", max_cache_size="10GB")

# Stream the dataset using StreamingDataLoader
dataloader = ld.StreamingDataLoader(dataset, batch_size=4)
for sample in dataloader:
    pass 
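
One way to check which directory actually receives the chunks is to measure directory sizes after (or while) iterating. The helper below is a minimal sketch using only the standard library; dir_size_bytes is a hypothetical name and not part of litdata.

```python
import os


def dir_size_bytes(path: str) -> int:
    """Return the total size in bytes of all files under `path` (0 if it does not exist)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                total += os.path.getsize(file_path)
            except OSError:
                # A file may disappear mid-walk, e.g. a chunk evicted from the cache.
                pass
    return total
```

Calling dir_size_bytes("my_cache") after the loop and comparing it with the size of other candidate cache locations should show whether the user-provided directory was populated at all.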

Expected behavior

Additional context

Environment detail
  • PyTorch Version (e.g., 1.0):
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:
@deependujha deependujha added bug Something isn't working help wanted Extra attention is needed labels Apr 8, 2025
@philgzl
Contributor

philgzl commented Apr 18, 2025

I am facing the same issue.

Isn't it because of these two lines?

cache_dir.path = index_path
input_dir.path = index_path

  • The first line overrides the user-provided cache_dir.
  • The second line also has to be removed; otherwise _try_create_cache_dir is not called in subsample_streaming_dataset, and input_dir.path is used as the cache dir.
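
The effect of the two assignments can be illustrated in isolation. This is a simplified sketch, not litdata's code: the Dir dataclass, the apply_overrides function, and the index_path value are stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Dir:
    """Hypothetical stand-in for a resolved directory: a local path plus an optional URL."""
    path: Optional[str] = None
    url: Optional[str] = None


def apply_overrides(cache_dir: Dir, input_dir: Dir, index_path: str) -> Tuple[Dir, Dir]:
    # Mirrors the two assignments quoted above: both paths are overwritten.
    cache_dir.path = index_path
    input_dir.path = index_path
    return cache_dir, input_dir


user_cache = Dir(path="my_cache")  # what the user asked for
input_dir = Dir(url="hf://datasets/leonardPKU/clevr_cogen_a_train/data")
index_path = ".cache/litdata-cache-index-pq"  # illustrative value

cache_after, input_after = apply_overrides(user_cache, input_dir, index_path)
print(cache_after.path)  # the index path, no longer "my_cache"
print(input_after.path)  # also the index path
```

In this sketch the user's "my_cache" value is lost as soon as the first assignment runs, which matches the behavior reported in the issue.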

I wanted to open a PR but idk how strict the discuss-in-issue-before-submitting rule is.

@deependujha
Collaborator Author

Hi @philgzl, feel free to open a PR.

@philgzl philgzl mentioned this issue Apr 18, 2025
@bhimrazy
Collaborator

bhimrazy commented Apr 18, 2025

One of the changes that would come with #560:

Previously, all chunks were also stored in .cache/litdata-cache-index-pq.

This directory was originally used to download and store chunks during the indexing process before actual streaming, which is no longer the case.

Now the chunks are stored in DEFAULT_CACHE_DIR, as is typical for LitData-optimized chunks, which is a nice improvement.

Given that, I don’t see a reason to keep the .cache/litdata-cache-index-pq directory anymore. Would it make sense to consolidate everything under DEFAULT_CACHE_DIR? Otherwise, we’d be maintaining two separate directories: one for storing the index, and another for the copied index file and the chunks.

In case a user wants to store the index file separately, they can run index_hf_dataset first by passing a cache_dir, and then pass that path as the index_path to StreamingDataset.

Otherwise, the index can simply be stored alongside the chunks, either in DEFAULT_CACHE_DIR or in the user-provided cache_dir passed to StreamingDataset.
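
The precedence proposed above could be sketched roughly as follows. This is only an illustration of the idea: resolve_index_dir is a hypothetical helper, and the DEFAULT_CACHE_DIR value here is a placeholder rather than litdata's actual default.

```python
import os
from typing import Optional

# Placeholder value for illustration; the real DEFAULT_CACHE_DIR is defined by litdata.
DEFAULT_CACHE_DIR = os.path.join(os.path.expanduser("~"), ".cache", "example-default-cache")


def resolve_index_dir(
    index_path: Optional[str] = None,
    cache_dir: Optional[str] = None,
    default_cache_dir: str = DEFAULT_CACHE_DIR,
) -> str:
    """Pick where the index lives (hypothetical helper)."""
    if index_path is not None:
        # The user indexed the dataset beforehand and pointed to that location.
        return index_path
    if cache_dir is not None:
        # Keep the index next to the chunks in the user-provided cache directory.
        return cache_dir
    # Fall back to the single default directory for both index and chunks.
    return default_cache_dir
```

With this ordering, an explicit index_path wins, a user-provided cache_dir comes next, and a single default directory covers everything else.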

What do you all think?
cc: @tchaton @deependujha @philgzl

@philgzl
Contributor

philgzl commented Apr 18, 2025

I agree.
