
Feat: Append data to pre-optimize dataset#184

Merged
tchaton merged 19 commits into Lightning-AI:main from
deependujha:feat/append-data-to-preoptimize-dataset
Jun 27, 2024
Conversation


@deependujha deependujha commented Jun 26, 2024

Before submitting
  • Was this discussed/agreed via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #23.

Adds support for appending to, or overwriting, a pre-optimized dataset via a new `mode` argument to `optimize` (accepted values: `"append"` and `"overwrite"`).
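The intended semantics can be summarized with a small sketch. This is illustrative only, not litdata's actual implementation; `write_optimized` and the chunk lists are hypothetical stand-ins, with error messages mirroring the ones asserted in the test below:

```python
# Illustrative sketch of the append/overwrite semantics added in this PR.
# `write_optimized` is a hypothetical helper, not litdata's actual code.

def write_optimized(existing_chunks, new_chunks, mode=None):
    """Combine pre-existing chunks with newly optimized ones.

    mode=None        -> error if the output directory already holds chunks
    mode="append"    -> keep existing chunks and place the new ones after them
    mode="overwrite" -> discard existing chunks, keep only the new ones
    """
    if mode not in (None, "append", "overwrite"):
        raise ValueError("The provided `mode` should be either `append` or `overwrite`")
    if existing_chunks and mode is None:
        raise RuntimeError(
            "The output directory already contains an optimized dataset. "
            "HINT: If you want to append/overwrite to the existing dataset, "
            "pass `mode='append'` or `mode='overwrite'`."
        )
    if mode == "overwrite":
        return list(new_chunks)
    return list(existing_chunks) + list(new_chunks)
```

This mirrors the assertions in the test case: overwriting 100 existing items with 100 new ones leaves 100 items, while appending 100 to 100 yields 200.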

The following test case conveys the feature best:
import os

import pytest

from litdata import StreamingDataset, optimize


def test_optimize(tmpdir):
    output_dir = str(tmpdir / "output_dir") # or s3 URI

    def compress(index):
        return index, index**2

    def different_compress(index):
        return index, index**2, index**3

    optimize(
        fn=compress,
        inputs=list(range(100)),
        num_workers=1,
        output_dir=output_dir,
        chunk_bytes="64MB",
    )

    ds = StreamingDataset(output_dir)

    assert len(ds) == 100
    assert ds[:] == [(i, i**2) for i in range(100)]

    with pytest.raises(RuntimeError, match="HINT: If you want to append/overwrite to the existing dataset"):
        optimize(
            fn=compress,
            inputs=list(range(100, 200)),
            num_workers=1,
            output_dir=output_dir,
            chunk_bytes="64MB",
        )

    with pytest.raises(ValueError, match="The provided `mode` should be either `append` or `overwrite`"):
        optimize(
            fn=compress,
            inputs=list(range(100, 200)),
            num_workers=1,
            output_dir=output_dir,
            chunk_bytes="64MB",
            mode="some-random-mode",
        )

    optimize(
        fn=compress,
        inputs=list(range(100, 200)),
        num_workers=3,
        output_dir=output_dir,
        chunk_bytes="64MB",
        mode="overwrite",
    )

    ds = StreamingDataset(output_dir)

    assert len(ds) == 100
    assert ds[:] == [(i, i**2) for i in range(100, 200)]

    optimize(
        fn=compress,
        inputs=list(range(200, 300)),
        num_workers=os.cpu_count(),
        output_dir=output_dir,
        chunk_bytes="64MB",
        mode="append",
    )

    ds = StreamingDataset(output_dir)

    assert len(ds) == 200
    assert ds[:] == [(i, i**2) for i in range(100, 300)]

    optimize(
        fn=compress,
        inputs=list(range(300, 400)),
        num_workers=2,
        output_dir=output_dir,
        chunk_bytes="64MB",
        mode="append",
    )

    ds = StreamingDataset(output_dir)

    assert len(ds) == 300
    assert ds[:] == [(i, i**2) for i in range(100, 400)]

    with pytest.raises(Exception, match="The config isn't consistent between chunks"):
        optimize(
            fn=different_compress,
            inputs=list(range(100, 200)),
            num_workers=1,
            output_dir=output_dir,
            chunk_bytes="64MB",
            mode="append",
        )

    ds = StreamingDataset(output_dir)

    assert len(ds) == 300
    assert ds[:] == [(i, i**2) for i in range(100, 400)]

    optimize(
        fn=different_compress,
        inputs=list(range(800, 900)),
        num_workers=1,
        output_dir=output_dir,
        chunk_bytes="64MB",
        mode="overwrite",
    )

    ds = StreamingDataset(output_dir)

    assert len(ds) == 100
    assert ds[:] == [(i, i**2, i**3) for i in range(800, 900)]

The above test case is also present in the repository at litdata/tests/processing/test_functions.py:48.
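The "config isn't consistent between chunks" failure in the test (appending `different_compress` output, a 3-tuple, onto a dataset written from 2-tuples) can be sketched as a simple format check. This is a hypothetical illustration assuming each chunk records its per-item data format; it is not litdata's actual code:

```python
# Hypothetical sketch: appending is only allowed when the new samples'
# data format matches the format recorded for the existing chunks.

def check_append_config(existing_format, new_format):
    # existing_format is None for a fresh dataset; otherwise e.g.
    # ("int", "int") for compress() vs ("int", "int", "int") for
    # different_compress().
    if existing_format is not None and existing_format != new_format:
        raise Exception("The config isn't consistent between chunks")
    return new_format
```

With `mode="overwrite"` no such check is needed, since the old chunks (and their config) are discarded, which is why the final `different_compress` call in the test succeeds.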

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@deependujha deependujha marked this pull request as draft June 26, 2024 05:49

codecov bot commented Jun 26, 2024

Codecov Report

Attention: Patch coverage is 76.47059% with 20 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@a8f33df). Learn more about missing BASE report.

Additional details and impacted files
@@          Coverage Diff          @@
##             main   #184   +/-   ##
=====================================
  Coverage        ?    78%           
=====================================
  Files           ?     33           
  Lines           ?   4399           
  Branches        ?      0           
=====================================
  Hits            ?   3414           
  Misses          ?    985           
  Partials        ?      0           

@deependujha deependujha marked this pull request as ready for review June 27, 2024 06:39
@tchaton tchaton requested a review from Borda as a code owner June 27, 2024 07:03
@tchaton tchaton enabled auto-merge (squash) June 27, 2024 07:37
@tchaton tchaton merged commit fe6e026 into Lightning-AI:main Jun 27, 2024
@deependujha deependujha deleted the feat/append-data-to-preoptimize-dataset branch June 27, 2024 11:20

Development

Successfully merging this pull request may close these issues.

Append data to pre-optimized dataset
