Fix: Cache model in huggingface_local to prevent OOM (Issue #449)

@adambarla

Problem

When fn_completions is set to huggingface_local_completions, alpaca_eval reloads the model for every chunk of data. This leads to:

  1. Significant time overhead (reloading large models repeatedly).
  2. OOM errors, because the previous model isn't always garbage collected before the new one loads (a rough sketch of this failure mode follows this list).
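For context, the pre-fix failure mode looks roughly like the following. This is illustrative only: `run_chunk`, `chunks`, and the model name are simplified placeholders, not the actual alpaca_eval code.

```python
# Illustrative sketch only: every chunk triggers a fresh from_pretrained call,
# and the previous model may still occupy GPU memory when the new weights load.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def run_chunk(prompts, model_name):
    # Reloaded on every call: slow, and the old model is only freed whenever
    # Python's garbage collector happens to run.
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    ...  # generate completions for `prompts` with `model` and `tokenizer`


chunks = [["prompt 1"], ["prompt 2"]]  # hypothetical prompt chunks
for chunk in chunks:
    run_chunk(chunk, "meta-llama/Llama-2-7b-hf")  # placeholder model name
```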

Fixes #449.

Solution

Implemented module-level caching for the model and tokenizer, following the existing pattern in vllm_local.py; a minimal sketch of the idea is included after the list below.

  • Added a _get_or_load_model helper function.
  • Uses a global _loaded_model to persist the model across calls.
  • Checks whether the requested model matches the cached one before loading.
  • Explicitly unloads and garbage-collects the old model when switching to a different one.
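A minimal sketch of this caching logic, assuming transformers-style loading. Apart from `_loaded_model`, the global names, arguments, and the optional peft adapter step are illustrative rather than the exact code in huggingface_local.py:

```python
# Minimal sketch of module-level model/tokenizer caching (illustrative; the real
# implementation lives in huggingface_local.py and mirrors vllm_local.py).
import gc
import logging

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

logger = logging.getLogger(__name__)

# Module-level cache so the model survives across chunked calls.
_loaded_model = None
_loaded_tokenizer = None
_loaded_model_name = None
_loaded_adapters_name = None


def _get_or_load_model(model_name, adapters_name=None, **model_kwargs):
    """Return a cached (model, tokenizer) pair, loading only when necessary."""
    global _loaded_model, _loaded_tokenizer, _loaded_model_name, _loaded_adapters_name

    # Cache hit: same base model and adapters as the previous call.
    if (
        _loaded_model is not None
        and _loaded_model_name == model_name
        and _loaded_adapters_name == adapters_name
    ):
        logger.info("Reusing cached model %s", model_name)
        return _loaded_model, _loaded_tokenizer

    # Switching models: drop the old references and reclaim GPU memory first,
    # so both models are never resident at the same time.
    if _loaded_model is not None:
        logger.info("Unloading previously cached model %s", _loaded_model_name)
        _loaded_model = None
        _loaded_tokenizer = None
        gc.collect()
        torch.cuda.empty_cache()

    logger.info("Loading model %s", model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)
    if adapters_name is not None:
        # Assumes PEFT-style adapters; adjust if the adapters are loaded differently.
        from peft import PeftModel

        model = PeftModel.from_pretrained(model, adapters_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    _loaded_model = model
    _loaded_tokenizer = tokenizer
    _loaded_model_name = model_name
    _loaded_adapters_name = adapters_name
    return _loaded_model, _loaded_tokenizer
```

The helper in the PR may differ in signature and load arguments; the important part is that the cache check happens before any new from_pretrained call, and that the old weights are released before the new ones are loaded.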

Testing

Tested locally with a dataset split into multiple chunks. Verified that:

  • Model loads only once.
  • GPU memory usage remains stable across chunks.
  • "Reusing cached model" logs appear for subsequent chunks.

Commit message

Add caching for models and tokenizers in huggingface_local_completions
to avoid reloading the model for each chunk.
This prevents OOM errors when processing large datasets split into multiple chunks.

- Add _get_or_load_model() helper function to handle caching logic
- Cache model, tokenizer, model_name, and adapters_name at module level
- Unload previous model when switching to a different model
- Follows the same pattern as vllm_local.py in the codebase

Closes tatsu-lab#449