
Add dataset cache / mixing support #522

Open · wants to merge 29 commits into main

Conversation

@vwxyzjn (Collaborator) commented Jan 16, 2025

This PR adds a utility to transform and cache datasets with different configurations.
The main things we are looking for are:

  • handle dataset mixing
  • handle different tokenization functions
  • cache the tokenized dataset so we don't have to re-tokenize every time (a rough sketch of the
    intended flow follows this list)
    • This is especially important when we have 405B SFT models: each of the 32 nodes spends
      roughly 5 minutes just tokenizing the dataset, which translates to 32 nodes * 5 minutes * 8
      GPUs = 1280 GPU-minutes ≈ 21 hours of wasted H100 time.
    • Sometimes we also launch on clusters that don't have a shared cache (e.g., GCP), so each of
      the 32 nodes would download the individual datasets separately and wait for concatenation and
      tokenization (actually twice, because accelerator.main_process_first() assumes a shared cache).
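
Here is a minimal sketch of the load-or-tokenize flow described above. The helper names (`mix_datasets`, `get_cached_dataset`), the config fields, and the `"messages"` schema are hypothetical illustrations, not the actual API added in this PR; the core idea is just to key the cache on a hash of the dataset-mixer and tokenizer settings so every node can reuse the same cached artifact instead of re-tokenizing.

```python
# Hypothetical sketch: cache tokenized dataset mixes keyed by a config hash.
import hashlib
import json
import os

from datasets import Dataset, concatenate_datasets, load_dataset, load_from_disk
from transformers import AutoTokenizer


def mix_datasets(dataset_mixer: dict, split: str = "train") -> Dataset:
    """Load each dataset, take the requested fraction, and concatenate."""
    parts = []
    for name, frac in dataset_mixer.items():
        ds = load_dataset(name, split=split)
        parts.append(ds.select(range(int(len(ds) * frac))))
    return concatenate_datasets(parts)


def get_cached_dataset(dataset_mixer, tokenizer_name, max_seq_length, cache_dir="dataset_cache"):
    # The cache key covers everything that changes the tokenized output.
    config = {
        "mixer": dataset_mixer,
        "tokenizer": tokenizer_name,
        "max_seq_length": max_seq_length,
    }
    cache_key = hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()[:16]
    cache_path = os.path.join(cache_dir, cache_key)

    if os.path.exists(cache_path):
        # Cache hit: skip downloading / concatenating / tokenizing entirely.
        return load_from_disk(cache_path)

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    raw = mix_datasets(dataset_mixer)
    # Toy tokenization for illustration only; the real code would apply the
    # appropriate SFT/DPO formatting over the full conversation.
    tokenized = raw.map(
        lambda ex: tokenizer(ex["messages"][-1]["content"], truncation=True, max_length=max_seq_length),
        remove_columns=raw.column_names,
    )
    tokenized.save_to_disk(cache_path)
    return tokenized
```

In a multi-node setup this would presumably run once (e.g., on one process, or with the result pushed to shared storage or the Hub), so the remaining nodes only download the cached artifact rather than repeating the tokenization.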

I did an end-to-end test and found the fine-tuned performance to be the same.

[image]

@vwxyzjn vwxyzjn marked this pull request as draft January 16, 2025 17:31
@vwxyzjn (Collaborator Author) commented Jan 21, 2025

Confirmed that the tokenized datasets are exactly the same for SFT / DPO

[images]
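
For reference, one way such an equality check could be done (not necessarily how it was done here) is to compare the old and new tokenized datasets row by row; the paths below are placeholders:

```python
# Hypothetical check: compare tokenized datasets from the old and new code paths.
from datasets import load_from_disk

old_ds = load_from_disk("cache/old_tokenized")
new_ds = load_from_disk("cache/new_tokenized")

assert len(old_ds) == len(new_ds)
assert old_ds.column_names == new_ds.column_names
for old_row, new_row in zip(old_ds, new_ds):
    # Every field (input_ids, labels, attention_mask, ...) must match exactly.
    assert old_row == new_row
print("tokenized datasets are identical")
```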

@vwxyzjn vwxyzjn marked this pull request as ready for review January 21, 2025 18:43
@natolambert (Collaborator) left a comment

LGTM. Some minor nits, and please double-check the possibly redundant files.

docs/ai2_internal.md (outdated, resolved)
docs/ai2_internal.md (resolved)
docs/ai2_internal.md (resolved)
docs/ai2_internal.md (resolved)
open_instruct/dpo_tune_cache1.py (outdated, resolved)
@vwxyzjn (Collaborator Author) left a comment

I removed the duplicate files.

docs/ai2_internal.md (resolved)
docs/ai2_internal.md (resolved)
@vwxyzjn vwxyzjn mentioned this pull request Jan 23, 2025