@jlamypoirier
Collaborator

✨ Description

Gather all available metadata from the readers during dataset preparation and add it to the YAML config file. This works with both blending and splitting. Example:

$ cat /tmp/fast_llm_tests/common/dataset/dataset_with_image_patches_55_132/fast_llm_config.yaml 
config:
  path: shard_0_0.fast_llm_dataset
  type: memmap
metadata:
  image_patches:
    data_type: uint8
    num_documents: 1000
    num_patch_groups: 1355
    num_patches: 10707
    num_pixels: 513936
    patch_shape:
    - 3
    - 4
    - 4
  num_tokens: 59145
  tokens:
    data_type: int32
    num_documents: 1000
    num_tokens: 59145
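
The numbers in this example are internally consistent, which can be checked with a short sketch. It assumes that `num_pixels` counts every element of every patch and that the top-level `num_tokens` mirrors the token reader's count. Both relations are inferred from the example values, not stated in the PR, and `check_metadata` is a hypothetical helper.

```python
from math import prod

# Values copied from the example metadata above.
metadata = {
    "image_patches": {
        "num_patches": 10707,
        "num_pixels": 513936,
        "patch_shape": [3, 4, 4],
    },
    "num_tokens": 59145,
    "tokens": {"num_documents": 1000, "num_tokens": 59145},
}


def check_metadata(metadata: dict) -> bool:
    """Hypothetical sanity check; the two relations below are assumptions."""
    patches = metadata["image_patches"]
    # Assumed: num_pixels == num_patches * (elements per patch).
    pixels_ok = patches["num_pixels"] == patches["num_patches"] * prod(patches["patch_shape"])
    # Assumed: the top-level token count repeats the token reader's count.
    tokens_ok = metadata["num_tokens"] == metadata["tokens"]["num_tokens"]
    return pixels_ok and tokens_ok
```

Here 10707 patches of shape 3 x 4 x 4 give 10707 * 48 = 513936 pixels, matching the reported value.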

The file format has changed: the dataset config is now nested under a `config` key instead of sitting at the top level, with the metadata stored next to it under `metadata`. The old format can still be loaded.
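
For anyone reading these files outside of Fast-LLM, here is a minimal sketch of handling both layouts on an already-parsed YAML mapping. The function name and the detection heuristic (a `config` mapping present, no top-level `type`) are assumptions for illustration, not the loader implemented in this PR.

```python
def split_config_and_metadata(raw: dict) -> tuple[dict, dict]:
    """Hypothetical helper: return (config, metadata) for either file layout."""
    # New layout (assumed detection rule): config nested under "config",
    # metadata under "metadata", and no dataset "type" at the top level.
    if isinstance(raw.get("config"), dict) and "type" not in raw:
        return raw["config"], raw.get("metadata", {})
    # Old layout: the whole mapping is the dataset config, with no metadata.
    return raw, {}


# New-style file, as in the example above (metadata abbreviated).
new_style = {
    "config": {"path": "shard_0_0.fast_llm_dataset", "type": "memmap"},
    "metadata": {"num_tokens": 59145},
}
# Old-style file: the same config at the top level.
old_style = {"path": "shard_0_0.fast_llm_dataset", "type": "memmap"}

config, metadata = split_config_and_metadata(new_style)
legacy_config, legacy_metadata = split_config_and_metadata(old_style)
```

Both calls yield the same dataset config; only the new-style file carries metadata.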

Base automatically changed from jlp_remove_mamba to main December 13, 2025 06:21
@tscholak
Collaborator

This looks very useful, thanks @jlamypoirier
@oleksost @RaymondLi0 is this close to what you had in mind?

@oleksost
Contributor

oleksost commented Dec 15, 2025

yes, this looks useful.
So the fast_llm_config.yaml will contain such an entry per shard? Wouldn't it be more useful to have a dataset view instead of a per-shard view?

@jlamypoirier
Collaborator Author

> yes, this looks useful. So the fast_llm_config.yaml will contain such an entry per shard? Wouldn't it be more useful to have a dataset view instead of a per-shard view?

This is already global, the example just happens to have only one shard.
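
To illustrate what a dataset-wide view means with several shards, here is a rough sketch that combines per-shard metadata by summing integer counts and requiring everything else (such as `data_type` and `patch_shape`) to agree. This is an assumed merge rule for illustration only, not the aggregation code from this PR, and `merge_metadata` and `merge_shards` are hypothetical names.

```python
from functools import reduce


def merge_metadata(a, b):
    """Hypothetical merge rule: sum integer counts, require equality otherwise."""
    if isinstance(a, dict) and isinstance(b, dict):
        if a.keys() != b.keys():
            raise ValueError(f"Mismatched keys: {sorted(a)} vs {sorted(b)}")
        return {key: merge_metadata(a[key], b[key]) for key in a}
    both_ints = all(isinstance(x, int) and not isinstance(x, bool) for x in (a, b))
    if both_ints:
        # Counts such as num_documents and num_tokens add up across shards.
        return a + b
    if a != b:
        # Descriptive fields such as data_type and patch_shape must match.
        raise ValueError(f"Incompatible values: {a!r} vs {b!r}")
    return a


def merge_shards(shards: list[dict]) -> dict:
    """Fold the per-shard metadata dicts into a single dataset-wide dict."""
    return reduce(merge_metadata, shards)
```

With a single shard, as in the example above, `merge_shards` returns that shard's metadata unchanged.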


4 participants