Feat: Append data to pre-optimized dataset #184

Merged

Conversation

@deependujha (Collaborator) commented Jun 26, 2024

Before submitting
  • Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #23.

Adds support for appending to, or overwriting, a pre-optimized dataset.
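
For a quick look at the API, here is a minimal usage sketch distilled from the test below (the `fast_data` output directory is a hypothetical name; the `optimize`/`StreamingDataset` calls mirror the ones in the test):

import litdata as ld

def fn(index):
    return index, index**2

# The first run creates the optimized dataset.
ld.optimize(fn=fn, inputs=list(range(100)), output_dir="fast_data", num_workers=1, chunk_bytes="64MB")

# Subsequent runs on the same output_dir must opt in with mode="append" or mode="overwrite".
ld.optimize(fn=fn, inputs=list(range(100, 200)), output_dir="fast_data", num_workers=1, chunk_bytes="64MB", mode="append")

print(len(ld.StreamingDataset("fast_data")))  # 200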

The following test case conveys the feature best:
import os

import pytest

from litdata import StreamingDataset, optimize


def test_optimize(tmpdir):
    output_dir = str(tmpdir / "output_dir")  # or an S3 URI

    def compress(index):
        return index, index**2

    def different_compress(index):
        return index, index**2, index**3

    optimize(
        fn=compress,
        inputs=list(range(100)),
        num_workers=1,
        output_dir=output_dir,
        chunk_bytes="64MB",
    )

    ds = StreamingDataset(output_dir)

    assert len(ds) == 100
    assert ds[:] == [(i, i**2) for i in range(100)]

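    # Re-running optimize on an existing output_dir without an explicit mode raises a RuntimeError.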
    with pytest.raises(RuntimeError, match="HINT: If you want to append/overwrite to the existing dataset"):
        optimize(
            fn=compress,
            inputs=list(range(100, 200)),
            num_workers=1,
            output_dir=output_dir,
            chunk_bytes="64MB",
        )

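    # An unrecognized mode value raises a ValueError.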
    with pytest.raises(ValueError, match="The provided `mode` should be either `append` or `overwrite`"):
        optimize(
            fn=compress,
            inputs=list(range(100, 200)),
            num_workers=1,
            output_dir=output_dir,
            chunk_bytes="64MB",
            mode="some-random-mode",
        )

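    # mode="overwrite" discards the existing samples and writes the new ones.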
    optimize(
        fn=compress,
        inputs=list(range(100, 200)),
        num_workers=3,
        output_dir=output_dir,
        chunk_bytes="64MB",
        mode="overwrite",
    )

    ds = StreamingDataset(output_dir)

    assert len(ds) == 100
    assert ds[:] == [(i, i**2) for i in range(100, 200)]

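    # mode="append" keeps the existing 100 samples and adds 100 more.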
    optimize(
        fn=compress,
        inputs=list(range(200, 300)),
        num_workers=os.cpu_count(),
        output_dir=output_dir,
        chunk_bytes="64MB",
        mode="append",
    )

    ds = StreamingDataset(output_dir)

    assert len(ds) == 200
    assert ds[:] == [(i, i**2) for i in range(100, 300)]

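    # A second append works the same way, independent of num_workers.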
    optimize(
        fn=compress,
        inputs=list(range(300, 400)),
        num_workers=2,
        output_dir=output_dir,
        chunk_bytes="64MB",
        mode="append",
    )

    ds = StreamingDataset(output_dir)

    assert len(ds) == 300
    assert ds[:] == [(i, i**2) for i in range(100, 400)]

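    # Appending items with a different schema (three fields instead of two) fails the chunk-config consistency check.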
    with pytest.raises(Exception, match="The config isn't consistent between chunks"):
        optimize(
            fn=different_compress,
            inputs=list(range(100, 200)),
            num_workers=1,
            output_dir=output_dir,
            chunk_bytes="64MB",
            mode="append",
        )

    ds = StreamingDataset(output_dir)

    assert len(ds) == 300
    assert ds[:] == [(i, i**2) for i in range(100, 400)]

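    # Overwriting with a different schema succeeds, since the old chunks are replaced.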
    optimize(
        fn=different_compress,
        inputs=list(range(800, 900)),
        num_workers=1,
        output_dir=output_dir,
        chunk_bytes="64MB",
        mode="overwrite",
    )

    ds = StreamingDataset(output_dir)

    assert len(ds) == 100
    assert ds[:] == [(i, i**2, i**3) for i in range(800, 900)]

The above test case is also present at `litdata/tests/processing/test_functions.py:48`.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in a GitHub issue, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

…rks perfectly. `mode="append"` implementation pending.
@deependujha deependujha marked this pull request as draft June 26, 2024 05:49

codecov bot commented Jun 26, 2024

Codecov Report

Attention: Patch coverage is 76.47059% with 20 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@a8f33df). Learn more about missing BASE report.

Additional details and impacted files
@@          Coverage Diff          @@
##             main   #184   +/-   ##
=====================================
  Coverage        ?    78%           
=====================================
  Files           ?     33           
  Lines           ?   4399           
  Branches        ?      0           
=====================================
  Hits            ?   3414           
  Misses          ?    985           
  Partials        ?      0           

@deependujha deependujha marked this pull request as ready for review June 27, 2024 06:39
@tchaton tchaton requested a review from Borda as a code owner June 27, 2024 07:03
@tchaton tchaton enabled auto-merge (squash) June 27, 2024 07:37
@tchaton tchaton merged commit fe6e026 into Lightning-AI:main Jun 27, 2024
28 checks passed
@deependujha deependujha deleted the feat/append-data-to-preoptimize-dataset branch June 27, 2024 11:20