Review and suggest improvements for pull request #1227

shcheklein · 2025-07-11T03:54:33Z

Add comprehensive audio support to DataChain with performance optimizations, enhanced error handling, and detailed documentation.

Slack Thread

Summary by Sourcery

Introduce end-to-end audio support in DataChain by adding new audio file models, processing utilities, streaming capabilities, performance optimizations, enhanced error handling, and comprehensive documentation and tests.

New Features:

Add AudioFile, AudioFragment, and Audio models for handling audio files, metadata extraction, and fragment generation
Extend FileType literal and File.as_audio_file() to support audio files

Enhancements:

Implement audio processing utilities (audio_info, audio_segment_np, audio_segment_bytes, save_audio_fragment, estimate_memory_usage, validate_audio_format)
Enable streaming audio segments with configurable memory limits and pre-computed metadata for performance
Enhance UDF stream setting to recursively configure nested File objects

Build:

Add torchaudio and soundfile dependencies under the 'audio' extras in pyproject.toml
Include audio extras in test dependencies

Documentation:

Add detailed audio processing documentation and usage guide (docs/audio_processing.md)
Provide an audio-to-text example script demonstrating streaming, fragment processing, and ML integration

Tests:

Add comprehensive unit tests for audio utilities and error handling
Add functional tests covering end-to-end audio workflows in DataChain

sourcery-ai · 2025-07-11T03:54:37Z

Reviewer's Guide

This PR extends DataChain with end-to-end audio support by enhancing the core File API (FileType, as_audio_file, get_file_type) to handle audio, implementing dedicated AudioFile, AudioFragment, and Audio data models, introducing a new audio utilities module powered by torchaudio/soundfile, refining UDF stream initialization via recursive traversal, updating configuration/API exports, and providing comprehensive documentation, examples and tests.

Class diagram for new and updated audio data models

classDiagram
    class File {
        +as_audio_file() AudioFile
    }
    class AudioFile {
        +get_info() Audio
        +get_fragment(start: float, end: float) AudioFragment
        +get_fragments(duration: float, start: float=0, end: float=None, audio_duration: float=None) Iterator~AudioFragment~
    }
    class AudioFragment {
        +audio: AudioFile
        +start: float
        +end: float
        +get_np() tuple[ndarray, int]
        +read_bytes(format: str="wav") bytes
        +save(output: str, format: Optional[str]=None) AudioFile
    }
    class Audio {
        +sample_rate: int
        +channels: int
        +duration: float
        +samples: int
        +format: str
        +codec: str
        +bit_rate: int
    }
    File <|-- AudioFile
    AudioFile o-- AudioFragment : fragments
    AudioFragment o-- AudioFile : audio
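
The `get_fragments(duration, start, end, audio_duration)` signature in the diagram suggests fixed-length windowing over a time range. The following is a minimal sketch of how such boundaries could be computed, assuming the final window is clipped to `end`. `fragment_bounds` is a hypothetical helper and not DataChain's implementation; in particular, it raises when neither `end` nor `audio_duration` is given, whereas the PR falls back to `get_info().duration`.

```python
from collections.abc import Iterator
from typing import Optional


def fragment_bounds(
    duration: float,
    start: float = 0.0,
    end: Optional[float] = None,
    audio_duration: Optional[float] = None,
) -> Iterator[tuple[float, float]]:
    """Yield (start, end) pairs, each at most `duration` seconds long."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    if end is None:
        # Assumption: use a pre-computed total length when no end is given.
        if audio_duration is None:
            raise ValueError("either end or audio_duration is required")
        end = audio_duration
    if start < 0 or start >= end:
        raise ValueError(f"Invalid time range: ({start:.3f}, {end:.3f})")

    index = 0
    while True:
        # Multiply rather than accumulate to avoid floating-point drift.
        frag_start = start + index * duration
        if frag_start >= end:
            break
        yield frag_start, min(frag_start + duration, end)
        index += 1
```

Passing a pre-computed `audio_duration` is what lets a caller avoid probing the file once per fragment.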

Class diagram for UDF recursive stream setting enhancement

classDiagram
    class UDF {
        +_set_stream_recursive(obj, catalog, cache, download_cb, visited=None)
    }
    class File
    class DataModel
    UDF --> File : sets stream
    UDF --> DataModel : traverses fields
    DataModel <|-- Audio
    DataModel <|-- AudioFragment
    DataModel <|-- Video
    DataModel <|-- AudioFile

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Extend core File API to support audio models | Expand FileType literal to include 'audio'; implement as_audio_file on File; add AudioFile, AudioFragment, Audio classes and update get_file_type | `src/datachain/lib/file.py` |
| Refine UDF streaming logic with recursive stream setting | Replace direct File check with a recursive helper; implement _set_stream_recursive to traverse nested DataModel fields; use visited set to avoid cycles | `src/datachain/lib/udf.py` |
| Introduce dedicated audio processing utilities | Add audio.py with functions: audio_info, audio_segment_np, audio_segment_bytes, save_audio_fragment, estimate_memory_usage, validate_audio_format; leverage torchaudio/soundfile with supported formats and memory management; define AudioFormat types and constants | `src/datachain/lib/audio.py` |
| Update configuration and package exports | Add torchaudio and soundfile under 'audio' extras in pyproject.toml; include 'audio' in test dependencies; expose Audio, AudioFile, AudioFragment in `__init__.py` | `pyproject.toml` `src/datachain/__init__.py` |
| Add documentation, examples, and tests for audio workflows | Create docs/audio_processing.md with usage guide and best practices; add unit tests (test_audio.py) covering utilities and error handling; add functional tests and example script for audio-to-text | `docs/audio_processing.md` `tests/unit/lib/test_audio.py` `tests/func/test_audio.py` `examples/multimodal/audio-to-text.py` |

cloudflare-workers-and-pages · 2025-07-11T03:54:43Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`d692549`
Status:	🚫 Build failed.

sourcery-ai

Hey @shcheklein - I've reviewed your changes - here's some feedback:

Consider caching the result of AudioFile.get_info() (e.g. via a simple memoization) so that multiple fragment operations don’t repeatedly call torchaudio.info and re-open the file.
The recursive _set_stream_recursive only descends into DataModel fields; you may want to also handle iterable containers (lists, tuples, dicts) of File/DataModel objects to ensure all nested files get their stream set.
It would be clearer to add an explicit else or error in get_file_type for unsupported FileType values instead of relying on fall-through, so typos in the type string are caught early.
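
The first suggestion above (memoizing `get_info()`) could be sketched as follows. This is only an illustration under stated assumptions: `AudioInfo` and `CachedAudioFile` are hypothetical stand-ins for the PR's `Audio` and `AudioFile` models, and `_probe` is a placeholder for the expensive metadata call (`torchaudio.info` in the PR).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AudioInfo:
    """Stand-in for the PR's Audio metadata model."""
    sample_rate: int
    duration: float


@dataclass
class CachedAudioFile:
    """Hypothetical stand-in for AudioFile that memoizes its metadata."""
    path: str
    probe_calls: int = 0  # counts expensive probes, for demonstration only
    _info: Optional[AudioInfo] = field(default=None, repr=False)

    def _probe(self) -> AudioInfo:
        # Placeholder for the expensive call that opens and inspects the file.
        self.probe_calls += 1
        return AudioInfo(sample_rate=16000, duration=2.0)

    def get_info(self) -> AudioInfo:
        # Probe on first use only; later calls reuse the stored result.
        if self._info is None:
            self._info = self._probe()
        return self._info
```

If the real `AudioFile` is a serialized data model, the cache would need to live outside its persisted fields, and it would have to be invalidated if the underlying file changes.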

Prompt for AI Agents

Please address the comments from this code review:
## Overall Comments
- Consider caching the result of `AudioFile.get_info()` (e.g. via a simple memoization) so that multiple fragment operations don’t repeatedly call `torchaudio.info` and re-open the file.
- The recursive `_set_stream_recursive` only descends into DataModel fields; you may want to also handle iterable containers (lists, tuples, dicts) of File/DataModel objects to ensure all nested files get their stream set.
- It would be clearer to add an explicit `else` or error in `get_file_type` for unsupported `FileType` values instead of relying on fall-through, so typos in the type string are caught early.

## Individual Comments

### Comment 1
<location> `src/datachain/lib/file.py:1066` </location>
<code_context>
+    start: float
+    end: float
+
+    def get_np(self) -> "tuple[ndarray, int]":
+        """
+        Returns the audio fragment as a NumPy array with sample rate.
</code_context>

<issue_to_address>
Clarify channel axis handling in get_np for single-channel audio.

Squeezing single-channel audio can cause output shape inconsistencies. To avoid breaking downstream code, always return a 2D array (samples, channels).
</issue_to_address>

### Comment 2
<location> `src/datachain/lib/udf.py:277` </location>
<code_context>
+    def _set_stream_recursive(
</code_context>

<issue_to_address>
Recursive stream setting does not handle lists or dicts of DataModels.

Extend the recursion to also handle lists and dicts containing DataModels or Files, as these are currently not traversed.
</issue_to_address>

### Comment 3
<location> `src/datachain/lib/audio.py:100` </location>
<code_context>
+def audio_segment_np(
</code_context>

<issue_to_address>
audio_segment_np may return inconsistent array shapes for mono vs stereo.

Currently, single-channel audio is squeezed to 1D, which may break code expecting a 2D (samples, channels) shape. Recommend always returning a 2D array for consistency.

Suggested implementation:

```python
    """
    Load audio segment as numpy array with memory management.

    Multi-channel audio is transposed to (samples, channels) format.
    For very large segments, considers memory constraints.

    Always returns a 2D numpy array of shape (samples, channels), even for mono audio.

```

```python
    # ... (code that loads audio into `audio_np` and `sample_rate`)

    # Ensure output is always 2D: (samples, channels)
    if audio_np.ndim == 1:
        audio_np = audio_np[:, None]

    return audio_np, sample_rate

```
</issue_to_address>

sourcery-ai · 2025-07-11T03:56:07Z

src/datachain/lib/file.py

+    start: float
+    end: float
+
+    def get_np(self) -> "tuple[ndarray, int]":


suggestion: Clarify channel axis handling in get_np for single-channel audio.

Squeezing single-channel audio can cause output shape inconsistencies. To avoid breaking downstream code, always return a 2D array (samples, channels).

sourcery-ai · 2025-07-11T03:56:07Z

src/datachain/lib/udf.py

+    def _set_stream_recursive(
+        self, obj: Any, catalog: "Catalog", cache: bool, download_cb: Callback,
+        visited: Optional[set] = None
+    ) -> None:
+        """Recursively set the catalog stream on all File objects within an object."""
+        if visited is None:
+            visited = set()
+
+        if id(obj) in visited:
+            return


issue: Recursive stream setting does not handle lists or dicts of DataModels.

Extend the recursion to also handle lists and dicts containing DataModels or Files, as these are currently not traversed.
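
A minimal sketch of what the extended traversal could look like, assuming the same `visited`-by-`id` approach shown in the diff. `FakeFile` and `FakeFragment` are hypothetical stand-ins (plain dataclasses) for `File` and a `DataModel`; the real method also takes `catalog`, `cache`, and `download_cb`, which are omitted here.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, Optional


@dataclass
class FakeFile:
    """Stand-in for File; `stream_set` records that the stream was configured."""
    path: str
    stream_set: bool = False


@dataclass
class FakeFragment:
    """Stand-in for a DataModel that nests a file and a list of other objects."""
    audio: FakeFile
    extras: list


def set_stream_recursive(obj: Any, visited: Optional[set] = None) -> None:
    """Mark every FakeFile reachable from obj, including inside containers."""
    if visited is None:
        visited = set()
    if id(obj) in visited:
        return  # already handled, or part of a reference cycle
    visited.add(id(obj))

    if isinstance(obj, FakeFile):
        obj.stream_set = True
    elif isinstance(obj, dict):
        for value in obj.values():
            set_stream_recursive(value, visited)
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            set_stream_recursive(item, visited)
    elif is_dataclass(obj) and not isinstance(obj, type):
        # Stand-in for iterating over the fields of a DataModel instance.
        for f in fields(obj):
            set_stream_recursive(getattr(obj, f.name), visited)
```

Strings and other scalars fall through every branch, so they are never iterated character by character.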

sourcery-ai · 2025-07-11T03:56:07Z

src/datachain/lib/audio.py

+def audio_segment_np(
+    audio: "AudioFile", 
+    start: float = 0, 
+    duration: Optional[float] = None,
+    max_memory_mb: Optional[int] = None
+) -> "tuple[ndarray, int]":
+    """
+    Load audio segment as numpy array with memory management.
+
+    Multi-channel audio is transposed to (samples, channels) format.


suggestion (bug_risk): audio_segment_np may return inconsistent array shapes for mono vs stereo.

Currently, single-channel audio is squeezed to 1D, which may break code expecting a 2D (samples, channels) shape. Recommend always returning a 2D array for consistency.

Suggested implementation:

    """
    Load audio segment as numpy array with memory management.

    Multi-channel audio is transposed to (samples, channels) format.
    For very large segments, considers memory constraints.

    Always returns a 2D numpy array of shape (samples, channels), even for mono audio.

    # ... (code that loads audio into `audio_np` and `sample_rate`)

    # Ensure output is always 2D: (samples, channels)
    if audio_np.ndim == 1:
        audio_np = audio_np[:, None]

    return audio_np, sample_rate

sourcery-ai · 2025-07-11T03:56:08Z

tests/func/test_audio.py

+    for i, (duration, freq) in enumerate([(2.0, 440.0), (3.0, 880.0)]):
+        audio_data = generate_test_wav(
+            duration=duration, sample_rate=16000, frequency=freq
+        )
+        audio_path = tmp_path / f"test_audio_{i}.wav"
+        audio_path.write_bytes(audio_data)
+        audio_files.append(str(audio_path))


issue (code-quality): Avoid loops in tests. (no-loop-in-tests)

Explanation
Avoid complex code, like loops, in test functions.
Google's software engineering guidelines say:
"Clear tests are trivially correct upon inspection"
To achieve that, avoid complex code in tests:

- loops
- conditionals

Some ways to fix this:

- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests
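
One of the fixes named in this comment, moving the logic into a helper, could look roughly like the sketch below. It uses only the standard library. `write_sine_wav` and `make_test_audio_files` are hypothetical names introduced for illustration; the PR's own `generate_test_wav` helper is not shown in this thread and may work differently.

```python
import math
import struct
import wave
from pathlib import Path


def write_sine_wav(
    path: Path, duration: float, frequency: float, sample_rate: int = 16000
) -> Path:
    """Write a mono 16-bit PCM sine wave to `path` and return the path."""
    n_samples = int(duration * sample_rate)
    amplitude = 0.5 * 32767  # half of full scale for signed 16-bit samples
    frames = b"".join(
        struct.pack(
            "<h",
            int(amplitude * math.sin(2 * math.pi * frequency * i / sample_rate)),
        )
        for i in range(n_samples)
    )
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
    return path


def make_test_audio_files(directory: Path) -> list[str]:
    """Create the two inputs the test needs: 2 s at 440 Hz and 3 s at 880 Hz."""
    specs = [(2.0, 440.0), (3.0, 880.0)]
    return [
        str(write_sine_wav(directory / f"test_audio_{i}.wav", duration, freq))
        for i, (duration, freq) in enumerate(specs)
    ]
```

With a helper like this, the test body reduces to a single call such as `audio_files = make_test_audio_files(tmp_path)`, and the loop lives in code that can be checked on its own.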

sourcery-ai · 2025-07-11T03:56:08Z

tests/func/test_audio.py

+    for file, info in results:
+        assert isinstance(file, AudioFile)
+        assert isinstance(info, Audio)
+        assert info.sample_rate == 16000
+        assert info.channels == 1
+        assert info.duration > 0
+        assert info.samples > 0
+        assert info.format != ""


issue (code-quality): Avoid loops in tests. (no-loop-in-tests)

sourcery-ai · 2025-07-11T03:56:09Z

src/datachain/lib/audio.py

+        raise ValueError(f"Invalid time range: ({start:.3f}, {end:.3f})")
+
+    if format is None:
+        format = audio.get_file_ext()


issue (code-quality): Don't assign to builtin variable format (avoid-builtin-shadow)

Explanation
Python has a number of builtin variables: functions and constants that form a part of the language, such as list, getattr, and type (see https://docs.python.org/3/library/functions.html). It is valid, in the language, to re-bind such variables:

list = [1, 2, 3]

However, this is considered poor practice:

- It will confuse other developers.
- It will confuse syntax highlighters and linters.
- It means you can no longer use that builtin for its original purpose.

How can you solve this?

Rename the variable to something more specific, such as integers. In a pinch, my_list and similar names are colloquially recognized placeholders.
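
Applied to the `format` parameter flagged here, the rename could look like the sketch below. `fragment_output_name` and `audio_format` are hypothetical names chosen for illustration, and falling back to the source file's extension is an assumption modeled on the `audio.get_file_ext()` call in the diff.

```python
import os
from typing import Optional


def fragment_output_name(
    source: str, start: float, end: float, audio_format: Optional[str] = None
) -> str:
    """Build an output file name for a fragment of `source`."""
    stem, ext = os.path.splitext(os.path.basename(source))
    if audio_format is None:
        # Fall back to the source extension, e.g. "clip.WAV" -> "wav".
        audio_format = ext.lstrip(".").lower()
    if not audio_format:
        raise ValueError(f"Cannot infer audio format from {source!r}")
    # The builtin format() is still usable because nothing shadows it.
    return f"{stem}_{format(start, '.3f')}_{format(end, '.3f')}.{audio_format}"
```

Note that `format` also appears as a keyword parameter on `AudioFragment.read_bytes` and `AudioFragment.save` in the class diagram, so renaming it there would change the public signature for callers that pass `format=...`.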

sourcery-ai · 2025-07-11T03:56:09Z

src/datachain/lib/file.py

+            if audio_duration is not None:
+                end = audio_duration
+            else:
+                end = self.get_info().duration
+


suggestion (code-quality): We've found these issues:

Swap if/else branches (swap-if-else-branches)

Replace if statement with if expression (assign-if-exp)

Suggested change

-            if audio_duration is not None:
-                end = audio_duration
-            else:
-                end = self.get_info().duration
+            end = self.get_info().duration if audio_duration is None else audio_duration

sourcery-ai · 2025-07-11T03:56:09Z

tests/func/test_audio.py

+    fragment = file.get_fragment(start, start + 0.5)
+    yield fragment


suggestion (code-quality): Inline variable that is immediately yielded (inline-immediately-yielded-variable)

Suggested change

-    fragment = file.get_fragment(start, start + 0.5)
-    yield fragment
+    yield file.get_fragment(start, start + 0.5)

sourcery-ai · 2025-07-11T03:56:09Z

tests/func/test_audio.py

+
+    # Check that all files have expected audio metadata
+    for file, info in results:
+        assert isinstance(file, AudioFile)


issue (code-quality): Extract code out into function [×2] (extract-method)

sourcery-ai · 2025-07-11T03:56:09Z

tests/func/test_audio.py

+        ).to_values("info"))
+        # If we get here, the error was handled and we should have gotten an exception
+        # in the processing, not here
+        assert len(results) == 0 or any(isinstance(r, Exception) for r in results)


suggestion (code-quality): Simplify sequence length comparison (simplify-len-comparison)

Suggested change

-        assert len(results) == 0 or any(isinstance(r, Exception) for r in results)
+        assert not results or any(isinstance(r, Exception) for r in results)

ilongin and others added 2 commits July 11, 2025 00:36

Added missing datasets API reference to docs (#1225)

2e433bb

Added missing datasets API reference to docs (#1225)

feat(audio): Add comprehensive audio processing support to DataChain

1aefdb1

Co-authored-by: ivan <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

d692549

for more information, see https://pre-commit.ci

shcheklein changed the base branch from main to audio-fragments-decoder July 11, 2025 03:54

sourcery-ai bot reviewed Jul 11, 2025

shcheklein closed this Jul 11, 2025
