Add metadata to dataset config files #420

jlamypoirier · 2025-12-12T05:45:19Z

✨ Description

Gather all available metadata from the readers during dataset preparation and add it to the YAML config file. This works with both blending and splitting. Example:

$ cat /tmp/fast_llm_tests/common/dataset/dataset_with_image_patches_55_132/fast_llm_config.yaml 
config:
  path: shard_0_0.fast_llm_dataset
  type: memmap
metadata:
  image_patches:
    data_type: uint8
    num_documents: 1000
    num_patch_groups: 1355
    num_patches: 10707
    num_pixels: 513936
    patch_shape:
    - 3
    - 4
    - 4
  num_tokens: 59145
  tokens:
    data_type: int32
    num_documents: 1000
    num_tokens: 59145

The file format differs from before (the config is now nested inside a dict instead of sitting at the top level), but the old format can still be loaded.
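To illustrate the two layouts, here is a minimal sketch of how a caller might accept both of them once the YAML has been parsed into a dict. The helper name `split_config_and_metadata` and the detection rule (a top-level `config` key holding a mapping) are assumptions made for this example, not Fast-LLM's actual loading code.

```python
from typing import Any


def split_config_and_metadata(
    loaded: dict[str, Any],
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return (dataset_config, metadata) for either file layout.

    Hypothetical helper. Assumption: a file uses the new layout when its
    top-level "config" value is a mapping; the real detection may differ.
    """
    inner = loaded.get("config")
    if isinstance(inner, dict):
        # New layout: dataset config nested under "config", metadata beside it.
        return inner, loaded.get("metadata") or {}
    # Old layout: the whole mapping is the dataset config, with no metadata.
    return loaded, {}


# New layout, trimmed from the example in the description.
new_style = {
    "config": {"path": "shard_0_0.fast_llm_dataset", "type": "memmap"},
    "metadata": {"num_tokens": 59145},
}
# Old layout: the same dataset config at the top level.
old_style = {"path": "shard_0_0.fast_llm_dataset", "type": "memmap"}

new_config, new_metadata = split_config_and_metadata(new_style)
old_config, old_metadata = split_config_and_metadata(old_style)
```

Under these assumptions both layouts yield the same dataset config, and only the new layout carries metadata.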


tscholak · 2025-12-13T17:32:15Z

This looks very useful, thanks @jlamypoirier
@oleksost @RaymondLi0 is this close to what you had in mind?

oleksost · 2025-12-15T18:09:25Z

yes, this looks useful.
So the fast_llm_config.yaml will contain such an entry per shard? Wouldn't it be more useful to have a dataset view instead of a per-shard view?

jlamypoirier · 2025-12-15T18:32:24Z

> yes, this looks useful. So the fast_llm_config.yaml will contain such entry per shard? Wouldn't it be more useful to have a dataset view instead of per shard view?

This is already global, the example just happens to have only one shard.

jlamypoirier added 30 commits December 3, 2025 20:25

Fix rotary 2d

2ab1825

stuff

8305dd5

stuff

b6e38b8

Merge branch 'main' into jlp/consistent_preprocessing

72f915d

stuff

350fb3d

fix

d27a815

Merge branch 'main' into jlp/consistent_preprocessing

72f3a31

fixes

5ab6cd0

Merge remote-tracking branch 'origin/main' into jlp/consistent_preprocessing

1e74469

stuff

6454db4

cleanup

916af7a

Merge remote-tracking branch 'origin/main' into jlp/consistent_preprocessing

355af7c

Merge branch 'jlp/consistent_preprocessing' into jlp/varlen_tweaks

8f6841e

cleanup

bd7a8e6

fix

660fecc

fixes

db93bb5

Merge branch 'jlp/consistent_preprocessing' into jlp/varlen_tweaks

a3fa577

Merge remote-tracking branch 'origin/main' into jlp/varlen_tweaks

a1c0ade

misc

96ce759

Merge remote-tracking branch 'origin/main' into jlp/varlen_tweaks

e23ea04

stuff

e5fe8b2

fixes

68f457b

Merge remote-tracking branch 'origin/main' into jlp/varlen_tweaks

fa668fa

Remove mamba and discrete mamba 2

f7c5d1b

fix

31d856d

Merge remote-tracking branch 'origin/main' into jlp_remove_mamba

e74d30d

fixes

75ad78a

Add metadata to dataset config files

30e0419

fix

9fae16e

Merge remote-tracking branch 'origin/main' into jlp_remove_mamba

2b6527a

jlamypoirier added 2 commits December 13, 2025 00:58

fix

4ddabf1

Merge branch 'jlp_remove_mamba' into jlp_dataset_metadata

69095ea

Base automatically changed from jlp_remove_mamba to main December 13, 2025 06:21

Merge remote-tracking branch 'origin/main' into jlp_dataset_metadata

fabba8f
