Feat: Append data to pre-optimized dataset #184

Merged

Conversation

@deependujha (Collaborator) commented Jun 26, 2024

Before submitting
  • Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #23.

Adds support for appending to, or overwriting, a pre-optimized dataset.
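
For a quick look at the API, here is a minimal usage sketch distilled from the test below (the `fast_data` output directory is a hypothetical name; the `optimize`/`StreamingDataset` calls mirror the ones in the test):

import litdata as ld

def fn(index):
    return index, index**2

# The first run creates the optimized dataset.
ld.optimize(fn=fn, inputs=list(range(100)), output_dir="fast_data", num_workers=1, chunk_bytes="64MB")

# Subsequent runs on the same output_dir must opt in with mode="append" or mode="overwrite".
ld.optimize(fn=fn, inputs=list(range(100, 200)), output_dir="fast_data", num_workers=1, chunk_bytes="64MB", mode="append")

print(len(ld.StreamingDataset("fast_data")))  # 200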

The following test case conveys the feature best:
import os

import pytest

from litdata import StreamingDataset, optimize


def test_optimize(tmpdir):
    output_dir = str(tmpdir / "output_dir")  # or an S3 URI

    def compress(index):
        return index, index**2

    def different_compress(index):
        return index, index**2, index**3

    optimize(
        fn=compress,
        inputs=list(range(100)),
        num_workers=1,
        output_dir=output_dir,
        chunk_bytes="64MB",
    )

    ds = StreamingDataset(output_dir)

    assert len(ds) == 100
    assert ds[:] == [(i, i**2) for i in range(100)]

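    # Re-running optimize on an existing output_dir without an explicit mode raises a RuntimeError.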
    with pytest.raises(RuntimeError, match="HINT: If you want to append/overwrite to the existing dataset"):
        optimize(
            fn=compress,
            inputs=list(range(100, 200)),
            num_workers=1,
            output_dir=output_dir,
            chunk_bytes="64MB",
        )

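    # An unrecognized mode value raises a ValueError.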
    with pytest.raises(ValueError, match="The provided `mode` should be either `append` or `overwrite`"):
        optimize(
            fn=compress,
            inputs=list(range(100, 200)),
            num_workers=1,
            output_dir=output_dir,
            chunk_bytes="64MB",
            mode="some-random-mode",
        )

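    # mode="overwrite" discards the existing samples and writes the new ones.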
    optimize(
        fn=compress,
        inputs=list(range(100, 200)),
        num_workers=3,
        output_dir=output_dir,
        chunk_bytes="64MB",
        mode="overwrite",
    )

    ds = StreamingDataset(output_dir)

    assert len(ds) == 100
    assert ds[:] == [(i, i**2) for i in range(100, 200)]

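    # mode="append" keeps the existing 100 samples and adds 100 more.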
    optimize(
        fn=compress,
        inputs=list(range(200, 300)),
        num_workers=os.cpu_count(),
        output_dir=output_dir,
        chunk_bytes="64MB",
        mode="append",
    )

    ds = StreamingDataset(output_dir)

    assert len(ds) == 200
    assert ds[:] == [(i, i**2) for i in range(100, 300)]

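    # A second append works the same way, independent of num_workers.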
    optimize(
        fn=compress,
        inputs=list(range(300, 400)),
        num_workers=2,
        output_dir=output_dir,
        chunk_bytes="64MB",
        mode="append",
    )

    ds = StreamingDataset(output_dir)

    assert len(ds) == 300
    assert ds[:] == [(i, i**2) for i in range(100, 400)]

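    # Appending items with a different schema (three fields instead of two) fails the chunk-config consistency check.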
    with pytest.raises(Exception, match="The config isn't consistent between chunks"):
        optimize(
            fn=different_compress,
            inputs=list(range(100, 200)),
            num_workers=1,
            output_dir=output_dir,
            chunk_bytes="64MB",
            mode="append",
        )

    ds = StreamingDataset(output_dir)

    assert len(ds) == 300
    assert ds[:] == [(i, i**2) for i in range(100, 400)]

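    # Overwriting with a different schema succeeds, since the old chunks are replaced.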
    optimize(
        fn=different_compress,
        inputs=list(range(800, 900)),
        num_workers=1,
        output_dir=output_dir,
        chunk_bytes="64MB",
        mode="overwrite",
    )

    ds = StreamingDataset(output_dir)

    assert len(ds) == 100
    assert ds[:] == [(i, i**2, i**3) for i in range(800, 900)]

The above test case is also present at `litdata/tests/processing/test_functions.py:48`.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in a GitHub issue, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

…rks perfectly. `mode="append"` implementation pending.
@deependujha deependujha marked this pull request as draft June 26, 2024 05:49

codecov bot commented Jun 26, 2024

Codecov Report

Attention: Patch coverage is 76.47059% with 20 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@a8f33df). Learn more about missing BASE report.

Additional details and impacted files
@@          Coverage Diff          @@
##             main   #184   +/-   ##
=====================================
  Coverage        ?    78%           
=====================================
  Files           ?     33           
  Lines           ?   4399           
  Branches        ?      0           
=====================================
  Hits            ?   3414           
  Misses          ?    985           
  Partials        ?      0           

@deependujha deependujha marked this pull request as ready for review June 27, 2024 06:39
@tchaton tchaton requested a review from Borda as a code owner June 27, 2024 07:03
@tchaton tchaton enabled auto-merge (squash) June 27, 2024 07:37
@tchaton tchaton merged commit fe6e026 into Lightning-AI:main Jun 27, 2024
28 checks passed
@deependujha deependujha deleted the feat/append-data-to-preoptimize-dataset branch June 27, 2024 11:20