
Add dataset cache / mixing support #522

Open · wants to merge 29 commits into main

Conversation

@vwxyzjn (Collaborator) commented Jan 16, 2025

This PR adds a utility to transform and cache datasets with different configurations.
The main things we are looking for are:

  • handle dataset mixing
  • handle different tokenization functions
  • cache the tokenized dataset so we don't have to re-tokenize every time (a rough sketch of the
    intended flow follows this list)
    • This is especially important when we have 405B SFT models: each of the 32 nodes spends
      roughly 5 minutes just tokenizing the dataset, which translates to 32 nodes * 5 minutes * 8
      GPUs = 1280 GPU-minutes ≈ 21 hours of wasted H100 time.
    • Sometimes we also launch on clusters that don't have a shared cache (e.g., GCP), so each of
      the 32 nodes would download the individual datasets separately and wait for concatenation and
      tokenization (actually twice, because accelerator.main_process_first() assumes a shared cache).
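
Here is a minimal sketch of the load-or-tokenize flow described above. The helper names (`mix_datasets`, `get_cached_dataset`), the config fields, and the `"messages"` schema are hypothetical illustrations, not the actual API added in this PR; the core idea is just to key the cache on a hash of the dataset-mixer and tokenizer settings so every node can reuse the same cached artifact instead of re-tokenizing.

```python
# Hypothetical sketch: cache tokenized dataset mixes keyed by a config hash.
import hashlib
import json
import os

from datasets import Dataset, concatenate_datasets, load_dataset, load_from_disk
from transformers import AutoTokenizer


def mix_datasets(dataset_mixer: dict, split: str = "train") -> Dataset:
    """Load each dataset, take the requested fraction, and concatenate."""
    parts = []
    for name, frac in dataset_mixer.items():
        ds = load_dataset(name, split=split)
        parts.append(ds.select(range(int(len(ds) * frac))))
    return concatenate_datasets(parts)


def get_cached_dataset(dataset_mixer, tokenizer_name, max_seq_length, cache_dir="dataset_cache"):
    # The cache key covers everything that changes the tokenized output.
    config = {
        "mixer": dataset_mixer,
        "tokenizer": tokenizer_name,
        "max_seq_length": max_seq_length,
    }
    cache_key = hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()[:16]
    cache_path = os.path.join(cache_dir, cache_key)

    if os.path.exists(cache_path):
        # Cache hit: skip downloading / concatenating / tokenizing entirely.
        return load_from_disk(cache_path)

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    raw = mix_datasets(dataset_mixer)
    # Toy tokenization for illustration only; the real code would apply the
    # appropriate SFT/DPO formatting over the full conversation.
    tokenized = raw.map(
        lambda ex: tokenizer(ex["messages"][-1]["content"], truncation=True, max_length=max_seq_length),
        remove_columns=raw.column_names,
    )
    tokenized.save_to_disk(cache_path)
    return tokenized
```

In a multi-node setup this would presumably run once (e.g., on one process, or with the result pushed to shared storage or the Hub), so the remaining nodes only download the cached artifact rather than repeating the tokenization.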

I did an end-to-end test and found the fine-tuned performance to be the same.

[image]

@vwxyzjn vwxyzjn marked this pull request as draft January 16, 2025 17:31
@vwxyzjn (Collaborator Author) commented Jan 21, 2025

Confirmed that the tokenized datasets are exactly the same for SFT / DPO

[images]
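
For reference, one way such an equality check could be done (not necessarily how it was done here) is to compare the old and new tokenized datasets row by row; the paths below are placeholders:

```python
# Hypothetical check: compare tokenized datasets from the old and new code paths.
from datasets import load_from_disk

old_ds = load_from_disk("cache/old_tokenized")
new_ds = load_from_disk("cache/new_tokenized")

assert len(old_ds) == len(new_ds)
assert old_ds.column_names == new_ds.column_names
for old_row, new_row in zip(old_ds, new_ds):
    # Every field (input_ids, labels, attention_mask, ...) must match exactly.
    assert old_row == new_row
print("tokenized datasets are identical")
```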

@vwxyzjn vwxyzjn marked this pull request as ready for review January 21, 2025 18:43
@natolambert (Collaborator) left a comment

LGTM. Some minor nits, and please double-check the possibly redundant files.

docs/ai2_internal.md (outdated, resolved)
docs/ai2_internal.md (resolved)
docs/ai2_internal.md (resolved)
docs/ai2_internal.md (resolved)
open_instruct/dpo_tune_cache1.py (outdated, resolved)
@vwxyzjn (Collaborator Author) left a comment

I removed the duplicate files.

docs/ai2_internal.md (resolved)
docs/ai2_internal.md (resolved)
@vwxyzjn vwxyzjn mentioned this pull request Jan 23, 2025