Conversation

@VijayVignesh1

Before submitting
  • Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #317

PR review

Adds support for multi-sample items.

Adds a boolean parameter which, given a list of transform functions, creates a batch of sub-samples for each sample.

Sample code:

    def transform_fn_sq(x, *args, **kwargs):
        """A simple transform function that squares the input."""
        return x ** 2

    def transform_fn_add(x):
        """A simple transform function that adds 3 to the input."""
        return x + 3

    def transform_fn_identity(x):
        """A simple transform function that returns the input as is."""
        return x

    dataset = StreamingDataset(
        data_dir,
        cache_dir=str(cache_dir),
        shuffle=False,
        transform=[transform_fn_sq, transform_fn_add, transform_fn_identity],
        is_multisample=True,
    )
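Under this design, iterating the dataset should yield one sub-sample per transform for each underlying item. A minimal sketch of the intended semantics, using a plain generator in place of the actual `StreamingDataset` machinery:

```python
def multisample_iter(items, transforms):
    """Yield len(transforms) sub-samples for each underlying item."""
    for item in items:
        for fn in transforms:
            yield fn(item)

# Each of the two items expands into three sub-samples
# (square, add 3, identity), mirroring the transform list above.
samples = list(multisample_iter([1, 2], [lambda x: x ** 2, lambda x: x + 3, lambda x: x]))
# → [1, 4, 1, 4, 5, 2]
```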

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@VijayVignesh1 VijayVignesh1 force-pushed the feature/add_multisample_support branch from 1b01b6f to 6a77302 Compare October 24, 2025 20:12
@VijayVignesh1
Author

@tchaton @deependujha @bhimrazy Can you verify the approach once? I can then make changes to the README.

    index_path: Optional[str] = None,
    force_override_state_dict: bool = False,
    transform: Optional[Union[Callable, list[Callable]]] = None,
    is_multisample: bool = False,
Collaborator

How will you know what sample_count the user wants?

Suggested change

    - is_multisample: bool = False,
    + sample_count: int = 1,

Author

This is better. I'll add this.

Comment on lines 289 to +291
    def __len__(self) -> int:
    -     return self.get_len(self.num_workers, self.batch_size if self.batch_size else 1)
    +     original_len = self.get_len(self.num_workers, self.batch_size if self.batch_size else 1)
    +     return original_len if not self.is_multisample else original_len * len(self.transform)
Collaborator

This is an interesting approach, using the number of transforms to determine dataset length implies that the user has defined as many transform functions as the number of desired samples.

It makes sense conceptually, but I'm a bit unsure about the practicality. In most cases the transforms won't differ drastically; a user could easily handle minor variations in transforms using simple conditionals and a sample_idx parameter.

So I’m wondering if this design might add unnecessary complexity for limited flexibility.
Curious to hear your thoughts on this, @bhimrazy.
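The alternative described above, a single transform driven by a sample index, could look like the following hypothetical sketch (`sample_idx` as a plain function argument is an assumption here, not an existing library parameter):

```python
def transform(x, sample_idx=0):
    """One transform covering minor per-sample variations via sample_idx."""
    if sample_idx == 0:
        return x          # identity
    elif sample_idx == 1:
        return x + 3      # shift
    return x ** 2         # square
```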

Author

I think that approach makes sense. Adding sample_count as a parameter, instead of deriving it from the number of transform functions, is more logical. This works well when self.transform is a single function. When it's a list of transform functions, however, we need to decide how to handle it: we can either let the list override the multi-sample behavior or return sample_count samples for each transform function. I believe the first option is better, since the user has already defined the multi-sample outputs through the list of transforms.
Thoughts? @deependujha
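The first option discussed above could be resolved with a small helper along these lines (a hypothetical sketch; `resolve_sample_count` is an invented name, not part of the actual implementation):

```python
from typing import Callable, Optional, Union

def resolve_sample_count(
    transform: Optional[Union[Callable, list[Callable]]],
    sample_count: int = 1,
) -> int:
    """A list of transforms overrides sample_count; otherwise sample_count wins."""
    if isinstance(transform, list):
        return len(transform)
    return sample_count
```

The dataset length would then be `original_len * resolve_sample_count(...)`, which reduces to `original_len` in the default single-transform, `sample_count=1` case.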

Collaborator

@deependujha deependujha Oct 29, 2025

@VijayVignesh1 let's wait for @bhimrazy thoughts, if he would like to suggest something.

@codecov

codecov bot commented Oct 29, 2025

Codecov Report

❌ Patch coverage is 80.00000% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 80%. Comparing base (359bbf1) to head (6a77302).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #740   +/-   ##
===================================
  Coverage    80%    80%           
===================================
  Files        52     52           
  Lines      7330   7350   +20     
===================================
+ Hits       5869   5887   +18     
- Misses     1461   1463    +2     

Development

Successfully merging this pull request may close these issues.

Add support for multi sample item in optimize and yielding from the _getitem_ of the StreamingDataset