Releases: Lightning-AI/litData
Release 0.2.58
What's Changed
- Fix: Hide debug statements behind _DEBUG by @robTheBuildr in #730
- Chore: bump version to 0.2.58 by @robTheBuildr in #732
Full Changelog: v0.2.57...v0.2.58
Release 0.2.57
What's Changed
- Better support for streaming optimized dataset by @robTheBuildr in #727
- Update CODEOWNERS to modify ownership assignments by @Borda in #726
- Bump Version 0.2.57 by @tchaton in #729
Full Changelog: v0.2.56...v0.2.57
v0.2.56
What's Changed
- Fix(be): Avoid decompression race condition by @robTheBuildr in #718
- Bump version 0.2.56 by @tchaton in #719
New Contributors
- @robTheBuildr made their first contribution in #718
Full Changelog: v0.2.55...v0.2.56
LitData v0.2.55
Lightning AI ⚡ is excited to announce the release of LitData v0.2.55
Highlights
[Fixed] Writing compressed data to a lightning_storage folder
This release focuses on fixing errors when writing compressed output data to a lightning_storage folder. Previously, a code snippet like the following would break.
```python
from litdata import StreamingDataset, StreamingDataLoader, optimize
import time


def should_keep(data):
    if data % 2 == 0:
        yield data


if __name__ == "__main__":
    output_dir = "/teamspace/lightning_storage/my-folder-1/output"

    optimize(
        fn=should_keep,
        inputs=list(range(500)),
        output_dir=output_dir,
        chunk_bytes="64MB",
        num_workers=4,
        compression="zstd",  # Previously, this would cause an error
    )

    time.sleep(20)

    dataset = StreamingDataset(output_dir)
    dataloader = StreamingDataLoader(dataset, batch_size=32, num_workers=4)
    for _ in dataloader:
        # process code here
        pass
```
Changes
Fixed
- Fix errors when using compression and r2 in optimize() by @pwgardipee in #715
Changed
- Remove s5cmd from the R2 downloader by @pwgardipee in #714
Full Changelog: v0.2.54...v0.2.55
🧑💻 Contributors
We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone, nice job!
Key Contributors
Thank you ❤️ and we hope you'll keep them coming!
LitData v0.2.54
Lightning AI ⚡ is excited to announce the release of LitData v0.2.54
Highlights
Lightning AI Storage - Direct download
Lightning Studios have special directories for data connections that are available to an entire teamspace. LitData functions that reference those directories see a significant performance increase because uploads and downloads happen directly from the bucket that backs the folder. LitData already supports existing folder types such as S3 and GCS folders, and this release adds support for the recently launched lightning_storage folders.
For example, in the code below, data is downloaded directly from the bucket backing the my-bucket-1 Lightning Storage folder.
```python
from litdata import StreamingDataset

if __name__ == "__main__":
    data_dir = "/teamspace/lightning_storage/my-bucket-1/data"
    dataset = StreamingDataset(data_dir)
    for sample in dataset:
        print(sample)
```
References to any of the following directories will work similarly:
- /teamspace/lightning_storage/...
- /teamspace/s3_connections/...
- /teamspace/gcs_connections/...
- /teamspace/s3_folders/...
- /teamspace/gcs_folders/...
Changes
Added
- Add downloader for R2 by @pwgardipee in #711
Full Changelog: v0.2.53...v0.2.54
🧑💻 Contributors
We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone, nice job!
Key Contributors
Thank you ❤️ and we hope you'll keep them coming!
LitData v0.2.53
Lightning AI ⚡ is excited to announce the release of LitData v0.2.53
Highlights
Lightning AI Storage - Direct download and upload
Lightning Studios have special directories for data connections that are available to an entire teamspace. LitData functions that reference those directories see a significant performance increase because uploads and downloads happen directly from the bucket that backs the folder. LitData already supports existing folder types such as S3 and GCS folders, and this release adds support for the recently launched lightning_storage folders.
For example, output artifacts from the code below are uploaded directly to the my-data-1 Lightning Storage bucket.
```python
from litdata import optimize


def should_keep(data):
    if data % 2 == 0:
        yield data


if __name__ == "__main__":
    optimize(
        fn=should_keep,
        inputs=list(range(1000)),
        output_dir="/teamspace/lightning_storage/my-data-1/output",
        chunk_bytes="64MB",
        num_workers=1,
    )
```
Similarly, in the example below, data is downloaded directly from the my-bucket-1 Lightning Storage bucket.
```python
from litdata import StreamingRawDataset

if __name__ == "__main__":
    data_dir = "/teamspace/lightning_storage/my-bucket-1/data"
    raw_dataset = StreamingRawDataset(data_dir)
    data = list(raw_dataset)
    print(data)
```
References to any of the following directories will work similarly:
- /teamspace/lightning_storage/...
- /teamspace/s3_connections/...
- /teamspace/gcs_connections/...
- /teamspace/s3_folders/...
- /teamspace/gcs_folders/...
Changes
Added
- Add support for resolving directories in `/teamspace/lightning_storage` by @bhimrazy in #695
- Add support for direct upload to r2 buckets by @pwgardipee in #705
- Add readme docs for references to data connection dirs by @pwgardipee in #708
Changed
Chores
- chore(deps): bump actions/first-interaction from 2 to 3 in the gha-updates group by @dependabot[bot] in #693
- chore(deps): update coverage requirement from ==7.8.* to ==7.10.* by @dependabot[bot] in #701
- chore(deps): bump pytest-random-order from 1.1.1 to 1.2.0 by @dependabot[bot] in #703
- chore(deps): bump cryptography from 45.0.4 to 45.0.7 by @dependabot[bot] in #704
- chore(deps): bump the gha-updates group with 3 updates by @dependabot[bot] in #707
Full Changelog: v0.2.52...v0.2.53
🧑💻 Contributors
We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone, nice job!
Key Contributors
New Contributors
- @Red-Eyed made their first contribution in #700
- @pwgardipee made their first contribution in #705
Thank you ❤️ and we hope you'll keep them coming!
LitData v0.2.52
Lightning AI ⚡ is excited to announce the release of LitData v0.2.52
Highlights
Grouping Support in StreamingRawDataset
StreamingRawDataset now supports flexible grouping of items during setup—ideal for pairing related files like images and masks.
```python
from typing import Union

from litdata import StreamingRawDataset
from litdata.raw import FileMetadata


class CustomStreamingRawDataset(StreamingRawDataset):
    def setup(self, files: list[FileMetadata]) -> Union[list[FileMetadata], list[list[FileMetadata]]]:
        # Example: group files in pairs [[image_1, mask_1], ...]
        return [files[i : i + 2] for i in range(0, len(files), 2)]


dataset = CustomStreamingRawDataset("s3://bucket/files/")
```
Remote Index Caching for Faster Startup
StreamingRawDataset now caches its file index both locally and remotely, speeding up initialization for large cloud datasets. It loads from the local cache first, then tries the remote cache, and rebuilds the index only if needed. Use `recompute_index=True` to force a rebuild.
```python
from litdata import StreamingRawDataset

dataset = StreamingRawDataset("s3://bucket/files/")  # Loads cached index if available
dataset = StreamingRawDataset("s3://bucket/files/", recompute_index=True)  # Force rebuild
```
Shuffle Control Added to train_test_split
Splitting your streaming datasets is now more flexible with the new shuffle parameter. You can choose whether to shuffle your dataset before splitting, giving you better control over how your training, testing, and validation sets are created.
```python
from litdata import train_test_split

train_ds, test_ds = train_test_split(streaming_dataset, splits=[0.8, 0.2], shuffle=True)
```
Changes
Added
- Added grouping functionality to `StreamingRawDataset`, allowing flexible item structuring in the `setup` method (#665 by @bhimrazy)
- Added `shuffle` parameter to `train_test_split` (#675 by @otogamer)
- Added CI workflow to check for broken links (#676 by @Vimal-Shady)
- Added remote and local index caching in `StreamingRawDataset` to speed up dataset initialization with a multi-level cache system (#666 by @bhimrazy)
Changed
Fixed
- Fixed broken 'Get Started' link in README (#674 by @Vimal-Shady)
- Fixed and enabled parallel test execution with pytest-xdist in CI workflow (#620 by @deependujha)
- Clean up leftover chunk lock files by prefix during Reader delete operation (#683 by @jwills)
- Ensure all tests run correctly with ignore pattern fix (#679 by @Borda)
Chores
- Bumped lightning-sdk from 0.1.46 to 2025.8.1 (#668 by @dependabot[bot])
- Bumped pytest-rerunfailures from 14.0 to 15.1 (#667 by @dependabot[bot])
- Bumped pytest-cov from 6.1.1 to 6.2.1 (#669 by @dependabot[bot])
- Bumped the gha-updates group with 2 updates (#690 by @dependabot[bot])
- Bumped `litdata` version to 0.2.52 (#691 by @bhimrazy)
Full Changelog: v0.2.51...v0.2.52
🧑💻 Contributors
We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone, nice job!
Key Contributors
@deependujha, @Borda, @bhimrazy
New Contributors
- @Vimal-Shady made their first contribution in #674
- @otogamer made their first contribution in #675
- @jwills made their first contribution in #683
Thank you ❤️ and we hope you'll keep them coming!
LitData v0.2.51
Lightning AI ⚡ is excited to announce the release of LitData v0.2.51
Highlights
Stream Raw Datasets from Cloud Storage (Beta)
Effortlessly stream raw files (e.g., images, text) directly from S3, GCS, or Azure cloud storage without preprocessing. Perfect for workflows needing immediate access to data in its original format.
```python
from litdata.streaming.raw_dataset import StreamingRawDataset
from torch.utils.data import DataLoader

dataset = StreamingRawDataset("s3://bucket/files/")

# Use with PyTorch DataLoader
loader = DataLoader(dataset, batch_size=32)
for batch in loader:
    # Process raw bytes
    pass
```
Benchmarks
Streaming speed for raw ImageNet (1.2M images) from cloud storage:
| Storage | Images/s (No Transform) | Images/s (With Transform) |
|---|---|---|
| AWS S3 | ~6,400 ± 100 | ~3,200 ± 100 |
| Google Cloud Storage | ~5,650 ± 100 | ~3,100 ± 100 |
Note: Use `StreamingRawDataset` for direct data streaming. Opt for `StreamingDataset` for maximum speed with pre-optimized data.
Resume ParallelStreamingDataset
The ParallelStreamingDataset now supports a resume option, allowing you to seamlessly continue training from the previous epoch's state when cycling through datasets. Enable it with resume=True to avoid restarting at index 0 each epoch, ensuring consistent sample progression across epochs.
```python
from litdata.streaming.parallel import ParallelStreamingDataset
from torch.utils.data import DataLoader

dataset = ParallelStreamingDataset(datasets=[dataset1, dataset2], length=100, resume=True)
loader = DataLoader(dataset, batch_size=32)
for batch in loader:
    # Resumes from previous epoch's state
    pass
```
Per-Dataset Batch Sizes in CombinedStreamingDataset
The CombinedStreamingDataset now supports per-dataset batch sizes when using batching_method="per_stream". Specify unique batch sizes for each dataset using set_batch_size() with a list of integers. The iterator respects these limits, switching datasets once the per-stream quota is met, optimizing GPU utilization for datasets with varying tensor sizes.
```python
from litdata.streaming.combined import CombinedStreamingDataset

dataset = CombinedStreamingDataset(
    datasets=[dataset1, dataset2],
    weights=[0.5, 0.5],
    batching_method="per_stream",
    seed=123,
)
dataset.set_batch_size([4, 8])  # Set batch sizes: 4 for dataset1, 8 for dataset2

for sample in dataset:
    # Iterator yields samples respecting per-dataset batch size limits
    pass
```
Changes
Added
- Added support for setting the cache directory via the `LITDATA_CACHE_DIR` environment variable (#639 by @deependujha); see the sketch after this list
- Added CLI option to clear default cache (#627 by @deependujha)
- Added resume support to `ParallelStreamingDataset` (#650 by @philgzl)
- Added `verbose` option to `optimize_fn` (#654 by @deependujha)
- Added support for multiple `transform_fn` in `StreamingDataset` (#655 by @deependujha)
- Enabled per-dataset batch size support in `CombinedStreamingDataset` (#635 by @MagellaX)
- Added support for `StreamingRawDataset` to stream raw datasets from cloud storage (#652 by @bhimrazy)
- Added GCP support for directory resolution in `resolve_dir` (#659 by @bhimrazy)
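As a minimal sketch of using the `LITDATA_CACHE_DIR` environment variable, assuming it is read when the dataset is created (the cache path and dataset URI below are hypothetical):

```python
import os

from litdata import StreamingDataset

# Assumption: LITDATA_CACHE_DIR should be set before the dataset is created;
# the path below is a hypothetical example.
os.environ["LITDATA_CACHE_DIR"] = "/tmp/litdata_cache"

dataset = StreamingDataset("s3://my-bucket/optimized_data")  # hypothetical URI
for sample in dataset:
    pass  # downloaded chunks should land under /tmp/litdata_cache
```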
Changed
- Cleaned up logic in `_loop` by removing hacky index assignment (#640 by @deependujha)
- Updated CODEOWNERS (#646 by @Borda)
- Switched to `astral-sh/setup-uv` for Python setup and used `uv pip` for package installation (#656 by @bhimrazy)
- Replaced PIL with torchvision's `decode_image` for more robust JPEG deserialization (#660 by @bhimrazy)
Fixed
Chores
- Bumped `cryptography` from 42.0.8 to 45.0.4 (#644 by @dependabot[bot])
- Updated `numpy` requirement from <2.0 to <3.0 (#645 by @dependabot[bot])
- Bumped `pytest-timeout` from 2.3.1 to 2.4.0 (#643 by @dependabot[bot])
- Applied pre-commit suggestions & bumped Python to 3.9 (#653 by @pre-commit-ci[bot])
- Bumped `actions/first-interaction` from 1 to 2 in GitHub Actions updates (#657 by @dependabot[bot])
- Bumped version to 0.2.51 (#664 by @bhimrazy)
Full Changelog: v0.2.50...v0.2.51
🧑💻 Contributors
We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone, nice job!
Key Contributors
@deependujha, @Borda, @bhimrazy, @philgzl
New Contributors
- @lukemerrick made their first contribution in #647
- @MagellaX made their first contribution in #635
Thank you ❤️ and we hope you'll keep them coming!
LitData v0.2.50: Fast Random Access & S3 Improvements 🧪⚡️
Lightning AI is excited to announce the release of LitData v0.2.50, a lightweight and powerful streaming data library designed for fast AI model training.
This release focuses on improving the developer experience and performance for streamed datasets, with a particular focus on:
- Faster random access support
- Transform hooks for datasets
- Better S3 interoperability
- CI stability and performance improvements
👉 Check out the full changelog here: Compare v0.2.49...v0.2.50
🚀 Highlights
🔄 Fast Random Access (No Chunk Download Needed)
You can now access samples randomly from remote datasets without downloading entire chunks, dramatically reducing IO overhead during sparse reads.
This is especially useful for visualization tools or quickly inspecting your dataset without requiring full downloads.
🚀 Benchmark (on Lightning Studio, chunk size: 64MB)
10 random accesses:
- 🔹 `v0.2.49`: 20–22 seconds
- 🔹 `v0.2.50`: 5–6 seconds
The benchmark was designed to ensure enough separation between accesses, avoiding repeated reads from the same chunk.
Single item access:
- 🔹 `v0.2.49`: ~2 seconds
- 🔹 `v0.2.50`: ~0.83 seconds
Sample code
```python
import litdata as ld

uri = "gs://litdata-gcp-bucket/optimized_data"
ds = ld.StreamingDataset(uri, cache_dir="my_cache")

# With random access, check `my_cache`: it shouldn't download chunks
for i in range(0, 1000, 100):
    print(i, ds[i])

# It should download chunks now
for data in ds:
    print(data)
```
🧩 Transform Support in StreamingDataset
You can now apply transforms to samples in StreamingDataset and CombinedStreamingDataset.
There are two supported ways to use it:
- Pass a transform function when initializing the dataset:
```python
from torchvision import transforms

from litdata import StreamingDataset

# Define a simple transform function
torch_transform = transforms.Compose([
    transforms.Resize((256, 256)),   # Resize to 256x256
    transforms.ToTensor(),           # Convert to PyTorch tensor (C x H x W)
    transforms.Normalize(            # Normalize using ImageNet stats
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])


def transform_fn(x, *args, **kwargs):
    """Define your transform function."""
    return torch_transform(x)  # Apply the transform to the input image


# Create dataset with appropriate configuration
dataset = StreamingDataset(data_dir, cache_dir=str(cache_dir), shuffle=shuffle, transform=transform_fn)
```
- Subclass and override the `transform` method:
```python
class StreamingDatasetWithTransform(StreamingDataset):
    """A custom dataset class that inherits from StreamingDataset and applies a transform."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.torch_transform = transforms.Compose([
            transforms.Resize((256, 256)),   # Resize to 256x256
            transforms.ToTensor(),           # Convert to PyTorch tensor (C x H x W)
            transforms.Normalize(            # Normalize using ImageNet stats
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225],
            ),
        ])

    # Define your transform method
    def transform(self, x, *args, **kwargs):
        """A simple transform function."""
        return self.torch_transform(x)


dataset = StreamingDatasetWithTransform(data_dir, cache_dir=str(cache_dir), shuffle=shuffle)
```
This makes it easier to insert preprocessing logic directly into the streaming pipeline.
📖 AWS S3 Streaming Docs (with boto3 & unsigned requests Example)
The documentation now includes a clear example of how to stream datasets from AWS S3 using boto3, including support for unsigned requests. It also prioritizes boto3 in the list of options for better clarity.
```python
import botocore

from litdata import StreamingDataset

storage_options = {
    "config": botocore.config.Config(
        retries={"max_attempts": 1000, "mode": "adaptive"},
        signature_version=botocore.UNSIGNED,
    )
}
dataset = StreamingDataset(
    input_dir="s3://pl-flash-data/optimized_tiny_imagenet",
    storage_options=storage_options,
)
```
📖 Batching Methods in CombinedStreamingDataset
)📖 Batching Methods in CombinedStreamingDataset
The CombinedStreamingDataset supports two different batching methods through the batching_method parameter:
Stratified Batching (Default):
With batching_method="stratified" (the default), each batch contains samples from multiple datasets according to the specified weights:
```python
# Default stratified batching - batches mix samples from all datasets
combined_dataset = CombinedStreamingDataset(
    datasets=[dataset1, dataset2],
    batching_method="stratified",  # This is the default
)
```
Per-Stream Batching:
With batching_method="per_stream", each batch contains samples exclusively from a single dataset. This is useful when datasets have different shapes or structures:
```python
# Per-stream batching - each batch contains samples from only one dataset
combined_dataset = CombinedStreamingDataset(
    datasets=[dataset1, dataset2],
    batching_method="per_stream",
)
```
🐛 Bug Fixes
- Fixed breaking `tqdm` progress bar when optimizing a dataset
- Suppressed multiple `lightning-sdk` warnings
🧪 Testing & CI
- Python 3.12 and 3.13 now supported in CI matrix (#589)
- Test durations now logged for debugging (#614)
- Added missing CI dependencies (#634)
- Refactored large, slow tests to reduce CI runtime (#629, #632)
📎 Minor Improvements
- Updated bug report template for easier Lightning Studio reproduction (#611)
📦 Dependency Updates
- `mosaicml-streaming`: 0.8.1 → 0.11.0 (#624)
- `transformers`: <4.50.0 → <4.53.0 (#623)
- `pytest`: 8.3.* → 8.4.* (#625)
🧑💻 Contributors
Thanks to everyone who contributed to this release!
Special thanks to @bhimrazy, @deependujha, @Borda, and @dependabot.
What's Changed
- 🕒 Add Test Duration Reporting to Pytest in CI by @bhimrazy in #614
- Update bug report template with Lightning Studio sharing instructions by @bhimrazy in #611
- docs: Add documentation for batching methods in CombinedStreamingDataset by @bhimrazy in #609
- fix: suppress FileNotFoundError when acquiring file lock for count file by @bhimrazy in #615
- chore: suppress FileNotFoundError for locks in downloader classes by @bhimrazy in #617
- Add Dependabot for Pip & GitHub Actions by @Borda in #621
- chore(deps): update pytest requirement from ==8.3.* to ==8.4.* by @dependabot in #625
- chore(deps): bump mosaicml-streaming from 0.8.1 to 0.11.0 by @dependabot in #624
- chore(deps): update transformers requirement from <4.50.0 to <4.53.0 by @dependabot in #623
- chore(deps): bump the gha-updates group with 2 updates by @dependabot in #622
- Feat: add transform support for StreamingDataset by @deependujha in #618
- fix: breaking tqdm progress bar in optimizing dataset by @deependujha in #619
- upd: Optimize test (`test_dataset_for_text_tokens_with_large_num_chunks`) to reduce time consumption by @bhimrazy in #629
- docs: Update documentation for AWS S3 dataset st...
v0.2.49
What's Changed
- Add `ParallelStreamingDataset` by @philgzl in #576
- feat: add support for shared queue for data processing by @deependujha in #602
- Add custom collate function for Getting Started example (resolves the `collate_fn` TypeError) by @bhimrazy in #607; a minimal sketch appears after this list
- feat: support Queue-based streaming inputs for optimize via new recipe by @deependujha in #606
- fix: Mark flaky tests to rerun on failure by @bhimrazy in #610
- bump version to 0.2.49 by @bhimrazy in #613
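As a rough illustration of the kind of custom collate function that avoids such a TypeError when batching streamed samples (a minimal sketch only; the dataset URI, field names, and batch handling here are hypothetical and may differ from the actual Getting Started example in #607):

```python
import torch
from torch.utils.data import DataLoader

from litdata import StreamingDataset


def custom_collate_fn(batch):
    # Hypothetical: each sample is a dict with an image tensor and an integer label.
    images = torch.stack([sample["image"] for sample in batch])
    labels = torch.tensor([sample["class"] for sample in batch])
    return images, labels


dataset = StreamingDataset("s3://my-bucket/optimized_data")  # hypothetical URI
loader = DataLoader(dataset, batch_size=32, collate_fn=custom_collate_fn)
```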
Full Changelog: v0.2.48...v0.2.49