Review and suggest improvements for pull request #1227

@shcheklein shcheklein commented Jul 11, 2025

Add comprehensive audio support to DataChain with performance optimizations, enhanced error handling, and detailed documentation.


Slack Thread

Summary by Sourcery

Introduce end-to-end audio support in DataChain by adding new audio file models, processing utilities, streaming capabilities, performance optimizations, enhanced error handling, and comprehensive documentation and tests

New Features:

  • Add AudioFile, AudioFragment, and Audio models for handling audio files, metadata extraction, and fragment generation
  • Extend FileType literal and File.as_audio_file() to support audio files

Enhancements:

  • Implement audio processing utilities (audio_info, audio_segment_np, audio_segment_bytes, save_audio_fragment, estimate_memory_usage, validate_audio_format)
  • Enable streaming audio segments with configurable memory limits and pre-computed metadata for performance
  • Enhance UDF stream setting to recursively configure nested File objects

Build:

  • Add torchaudio and soundfile dependencies under the 'audio' extras in pyproject.toml
  • Include audio extras in test dependencies

Documentation:

  • Add detailed audio processing documentation and usage guide (docs/audio_processing.md)
  • Provide an audio-to-text example script demonstrating streaming, fragment processing, and ML integration

Tests:

  • Add comprehensive unit tests for audio utilities and error handling
  • Add functional tests covering end-to-end audio workflows in DataChain

sourcery-ai bot commented Jul 11, 2025

Reviewer's Guide

This PR extends DataChain with end-to-end audio support by enhancing the core File API (FileType, as_audio_file, get_file_type) to handle audio, implementing dedicated AudioFile, AudioFragment, and Audio data models, introducing a new audio utilities module powered by torchaudio/soundfile, refining UDF stream initialization via recursive traversal, updating configuration/API exports, and providing comprehensive documentation, examples and tests.

Class diagram for new and updated audio data models

classDiagram
    class File {
        +as_audio_file() AudioFile
    }
    class AudioFile {
        +get_info() Audio
        +get_fragment(start: float, end: float) AudioFragment
        +get_fragments(duration: float, start: float=0, end: float=None, audio_duration: float=None) Iterator~AudioFragment~
    }
    class AudioFragment {
        +audio: AudioFile
        +start: float
        +end: float
        +get_np() tuple[ndarray, int]
        +read_bytes(format: str="wav") bytes
        +save(output: str, format: Optional[str]=None) AudioFile
    }
    class Audio {
        +sample_rate: int
        +channels: int
        +duration: float
        +samples: int
        +format: str
        +codec: str
        +bit_rate: int
    }
    File <|-- AudioFile
    AudioFile o-- AudioFragment : fragments
    AudioFragment o-- AudioFile : audio
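
One way to read the `get_fragments(duration, start, end)` signature in the diagram is as a generator of consecutive, non-overlapping time windows, with the last window clipped to `end`. The sketch below works under that assumption; `fragment_bounds` is a hypothetical helper, not code from the PR, and it only computes the `(start, end)` pairs that would be passed to `get_fragment`.

```python
from typing import Iterator


def fragment_bounds(duration: float, start: float, end: float) -> Iterator[tuple[float, float]]:
    """Yield (start, end) windows of length `duration`, clipping the last one to `end`."""
    if duration <= 0:
        raise ValueError("duration must be a positive number")
    if start < 0 or end <= start:
        raise ValueError(f"Invalid time range: ({start:.3f}, {end:.3f})")
    index = 0
    while True:
        # Multiply instead of accumulating to avoid floating-point drift.
        lo = start + index * duration
        if lo >= end:
            break
        yield lo, min(lo + duration, end)
        index += 1
```

Because this is a generator, the validation errors surface on first iteration rather than at call time.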

Class diagram for UDF recursive stream setting enhancement

classDiagram
    class UDF {
        +_set_stream_recursive(obj, catalog, cache, download_cb, visited=None)
    }
    class File
    class DataModel
    UDF --> File : sets stream
    UDF --> DataModel : traverses fields
    DataModel <|-- Audio
    DataModel <|-- AudioFragment
    DataModel <|-- Video
    DataModel <|-- AudioFile

File-Level Changes

Change: Extend core File API to support audio models
Details:
  • Expand FileType literal to include 'audio'
  • Implement as_audio_file on File
  • Add AudioFile, AudioFragment, Audio classes and update get_file_type
Files: src/datachain/lib/file.py

Change: Refine UDF streaming logic with recursive stream setting
Details:
  • Replace direct File check with a recursive helper
  • Implement _set_stream_recursive to traverse nested DataModel fields
  • Use visited set to avoid cycles
Files: src/datachain/lib/udf.py

Change: Introduce dedicated audio processing utilities
Details:
  • Add audio.py with functions: audio_info, audio_segment_np, audio_segment_bytes, save_audio_fragment, estimate_memory_usage, validate_audio_format
  • Leverage torchaudio/soundfile with supported formats and memory management
  • Define AudioFormat types and constants
Files: src/datachain/lib/audio.py

Change: Update configuration and package exports
Details:
  • Add torchaudio and soundfile under 'audio' extras in pyproject.toml
  • Include 'audio' in test dependencies
  • Expose Audio, AudioFile, AudioFragment in __init__.py
Files: pyproject.toml, src/datachain/__init__.py

Change: Add documentation, examples, and tests for audio workflows
Details:
  • Create docs/audio_processing.md with usage guide and best practices
  • Add unit tests (test_audio.py) covering utilities and error handling
  • Add functional tests and example script for audio-to-text
Files: docs/audio_processing.md, tests/unit/lib/test_audio.py, tests/func/test_audio.py, examples/multimodal/audio-to-text.py

cloudflare-workers-and-pages bot commented Jul 11, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: d692549
Status: 🚫 Build failed.

@shcheklein shcheklein changed the base branch from main to audio-fragments-decoder July 11, 2025 03:54

@sourcery-ai sourcery-ai bot left a comment

Hey @shcheklein - I've reviewed your changes - here's some feedback:

  • Consider caching the result of AudioFile.get_info() (e.g. via a simple memoization) so that multiple fragment operations don’t repeatedly call torchaudio.info and re-open the file.
  • The recursive _set_stream_recursive only descends into DataModel fields; you may want to also handle iterable containers (lists, tuples, dicts) of File/DataModel objects to ensure all nested files get their stream set.
  • It would be clearer to add an explicit else or error in get_file_type for unsupported FileType values instead of relying on fall-through, so typos in the type string are caught early.
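
To illustrate the first point, a minimal memoization sketch follows. It uses a stand-in class rather than the real `AudioFile`; `CachedInfoFile`, `_probe`, and the placeholder metadata values are all hypothetical, and `_probe` merely stands in for the expensive metadata read.

```python
class CachedInfoFile:
    """Stand-in for AudioFile that reads its metadata at most once."""

    def __init__(self, path: str):
        self.path = path
        self.probe_calls = 0  # counts how often the expensive read runs
        self._info = None     # cached metadata, filled on first access

    def _probe(self) -> dict:
        # Placeholder for the real, expensive metadata read.
        self.probe_calls += 1
        return {"sample_rate": 16000, "channels": 1, "duration": 2.0}

    def get_info(self) -> dict:
        # Only probe on the first call; later calls reuse the cached result.
        if self._info is None:
            self._info = self._probe()
        return self._info
```

If the real model class restricts attribute assignment (as validated data models often do), the cache would probably need to live in a private attribute or an external mapping instead.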
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider caching the result of `AudioFile.get_info()` (e.g. via a simple memoization) so that multiple fragment operations don’t repeatedly call `torchaudio.info` and re-open the file.
- The recursive `_set_stream_recursive` only descends into DataModel fields; you may want to also handle iterable containers (lists, tuples, dicts) of File/DataModel objects to ensure all nested files get their stream set.
- It would be clearer to add an explicit `else` or error in `get_file_type` for unsupported `FileType` values instead of relying on fall-through, so typos in the type string are caught early.

## Individual Comments

### Comment 1
<location> `src/datachain/lib/file.py:1066` </location>
<code_context>
+    start: float
+    end: float
+
+    def get_np(self) -> "tuple[ndarray, int]":
+        """
+        Returns the audio fragment as a NumPy array with sample rate.
</code_context>

<issue_to_address>
Clarify channel axis handling in get_np for single-channel audio.

Squeezing single-channel audio can cause output shape inconsistencies. To avoid breaking downstream code, always return a 2D array (samples, channels).
</issue_to_address>

### Comment 2
<location> `src/datachain/lib/udf.py:277` </location>
<code_context>
+    def _set_stream_recursive(
</code_context>

<issue_to_address>
Recursive stream setting does not handle lists or dicts of DataModels.

Extend the recursion to also handle lists and dicts containing DataModels or Files, as these are currently not traversed.
</issue_to_address>

### Comment 3
<location> `src/datachain/lib/audio.py:100` </location>
<code_context>
+def audio_segment_np(
</code_context>

<issue_to_address>
audio_segment_np may return inconsistent array shapes for mono vs stereo.

Currently, single-channel audio is squeezed to 1D, which may break code expecting a 2D (samples, channels) shape. Recommend always returning a 2D array for consistency.

Suggested implementation:

```python
    """
    Load audio segment as numpy array with memory management.

    Multi-channel audio is transposed to (samples, channels) format.
    For very large segments, considers memory constraints.

    Always returns a 2D numpy array of shape (samples, channels), even for mono audio.

```

```python
    # ... (code that loads audio into `audio_np` and `sample_rate`)

    # Ensure output is always 2D: (samples, channels)
    if audio_np.ndim == 1:
        audio_np = audio_np[:, None]

    return audio_np, sample_rate

```
</issue_to_address>

    start: float
    end: float

    def get_np(self) -> "tuple[ndarray, int]":

suggestion: Clarify channel axis handling in get_np for single-channel audio.

Squeezing single-channel audio can cause output shape inconsistencies. To avoid breaking downstream code, always return a 2D array (samples, channels).

Comment on lines 277 to 286
def _set_stream_recursive(
    self, obj: Any, catalog: "Catalog", cache: bool, download_cb: Callback,
    visited: Optional[set] = None
) -> None:
    """Recursively set the catalog stream on all File objects within an object."""
    if visited is None:
        visited = set()

    if id(obj) in visited:
        return

issue: Recursive stream setting does not handle lists or dicts of DataModels.

Extend the recursion to also handle lists and dicts containing DataModels or Files, as these are currently not traversed.
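
A sketch of what the extended traversal could look like is given below. It is self-contained and uses stand-in dataclasses instead of the real `File` and `DataModel` types; `FakeFile`, `FakeFragment`, and `set_stream_recursive` are hypothetical names, and setting `stream_set = True` stands in for the actual stream-setting call.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any, Optional


@dataclass
class FakeFile:
    path: str
    stream_set: bool = False


@dataclass
class FakeFragment:
    audio: FakeFile
    related: list = field(default_factory=list)


def set_stream_recursive(obj: Any, visited: Optional[set] = None) -> None:
    """Mark every FakeFile reachable through fields, lists, tuples, and dicts."""
    if visited is None:
        visited = set()
    if id(obj) in visited:
        return  # already handled; also breaks reference cycles
    visited.add(id(obj))

    if isinstance(obj, FakeFile):
        obj.stream_set = True  # stands in for the real stream-setting call
    elif isinstance(obj, dict):
        for value in obj.values():
            set_stream_recursive(value, visited)
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            set_stream_recursive(item, visited)
    elif is_dataclass(obj) and not isinstance(obj, type):
        for f in fields(obj):
            set_stream_recursive(getattr(obj, f.name), visited)
```

The `visited` set keyed by `id(obj)` keeps the same cycle protection as the original helper while letting containers be walked like model fields.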

Comment on lines 100 to 109
def audio_segment_np(
    audio: "AudioFile",
    start: float = 0,
    duration: Optional[float] = None,
    max_memory_mb: Optional[int] = None
) -> "tuple[ndarray, int]":
    """
    Load audio segment as numpy array with memory management.

    Multi-channel audio is transposed to (samples, channels) format.

suggestion (bug_risk): audio_segment_np may return inconsistent array shapes for mono vs stereo.

Currently, single-channel audio is squeezed to 1D, which may break code expecting a 2D (samples, channels) shape. Recommend always returning a 2D array for consistency.

Suggested implementation:

    """
    Load audio segment as numpy array with memory management.

    Multi-channel audio is transposed to (samples, channels) format.
    For very large segments, considers memory constraints.

    Always returns a 2D numpy array of shape (samples, channels), even for mono audio.

    # ... (code that loads audio into `audio_np` and `sample_rate`)

    # Ensure output is always 2D: (samples, channels)
    if audio_np.ndim == 1:
        audio_np = audio_np[:, None]

    return audio_np, sample_rate

Comment on lines +66 to +72
for i, (duration, freq) in enumerate([(2.0, 440.0), (3.0, 880.0)]):
    audio_data = generate_test_wav(
        duration=duration, sample_rate=16000, frequency=freq
    )
    audio_path = tmp_path / f"test_audio_{i}.wav"
    audio_path.write_bytes(audio_data)
    audio_files.append(str(audio_path))

issue (code-quality): Avoid loops in tests. (no-loop-in-tests)

Explanation: Avoid complex code, like loops, in test functions.

Google's software engineering guidelines say:
"Clear tests are trivially correct upon inspection."
To achieve that, avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests
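
As one possible shape for the "move the complex logic into helpers" option, the file-generation loop could live in a small helper so the test body only states what it expects. The sketch below uses only the standard library; `write_sine_wav` and `make_audio_files` are hypothetical names and do not reproduce the PR's `generate_test_wav`.

```python
import math
import struct
import wave
from pathlib import Path


def write_sine_wav(path: Path, duration: float, frequency: float,
                   sample_rate: int = 16000) -> Path:
    """Write a mono 16-bit sine-wave WAV file and return its path."""
    n_frames = int(duration * sample_rate)
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * frequency * i / sample_rate)))
        for i in range(n_frames)
    )
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
    return path


def make_audio_files(directory: Path, specs=((2.0, 440.0), (3.0, 880.0))) -> list[str]:
    """Create one WAV per (duration, frequency) spec; the loop lives here, not in the test."""
    return [
        str(write_sine_wav(directory / f"test_audio_{i}.wav", duration, freq))
        for i, (duration, freq) in enumerate(specs)
    ]
```

In a pytest suite, a helper like this could be wrapped in a fixture that receives `tmp_path`, which leaves the test itself free of loops.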

Comment on lines +89 to +96
for file, info in results:
    assert isinstance(file, AudioFile)
    assert isinstance(info, Audio)
    assert info.sample_rate == 16000
    assert info.channels == 1
    assert info.duration > 0
    assert info.samples > 0
    assert info.format != ""

issue (code-quality): Avoid loops in tests. (no-loop-in-tests)

    raise ValueError(f"Invalid time range: ({start:.3f}, {end:.3f})")

if format is None:
    format = audio.get_file_ext()

issue (code-quality): Don't assign to builtin variable format (avoid-builtin-shadow)


Explanation: Python has a number of builtin variables: functions and constants that
form a part of the language, such as list, getattr, and type
(See https://docs.python.org/3/library/functions.html).
It is valid, in the language, to re-bind such variables:

list = [1, 2, 3]

However, this is considered poor practice.

  • It will confuse other developers.
  • It will confuse syntax highlighters and linters.
  • It means you can no longer use that builtin for its original purpose.

How can you solve this?

Rename the variable to something more specific, such as integers.
In a pinch, my_list and similar names are colloquially-recognized
placeholders.
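
Applied to the flagged snippet, the rename could look like the sketch below. `resolve_audio_format` and the `audio_format` parameter are hypothetical names chosen for illustration; the extension lookup via `os.path.splitext` stands in for `audio.get_file_ext()`.

```python
import os
from typing import Optional


def resolve_audio_format(path: str, audio_format: Optional[str] = None) -> str:
    """Return the explicit format if given, otherwise the lowercase file extension."""
    if audio_format is None:
        audio_format = os.path.splitext(path)[1].lstrip(".").lower()
    if not audio_format:
        raise ValueError(f"Cannot infer audio format from {path!r}")
    # The builtin format() is still usable here because nothing shadows it.
    return format(audio_format, "s")
```

Note that renaming a keyword parameter that callers already pass by name is an API change, so the trade-off may differ for public methods.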

Comment on lines +1033 to +1037
if audio_duration is not None:
    end = audio_duration
else:
    end = self.get_info().duration

suggestion (code-quality): We've found these issues:

Suggested change
- if audio_duration is not None:
-     end = audio_duration
- else:
-     end = self.get_info().duration
+ end = self.get_info().duration if audio_duration is None else audio_duration

Comment on lines +41 to +42
fragment = file.get_fragment(start, start + 0.5)
yield fragment

suggestion (code-quality): Inline variable that is immediately yielded (inline-immediately-yielded-variable)

Suggested change
- fragment = file.get_fragment(start, start + 0.5)
- yield fragment
+ yield file.get_fragment(start, start + 0.5)


# Check that all files have expected audio metadata
for file, info in results:
    assert isinstance(file, AudioFile)

issue (code-quality): Extract code out into function [×2] (extract-method)

).to_values("info"))
# If we get here, the error was handled and we should have gotten an exception
# in the processing, not here
assert len(results) == 0 or any(isinstance(r, Exception) for r in results)

suggestion (code-quality): Simplify sequence length comparison (simplify-len-comparison)

Suggested change
- assert len(results) == 0 or any(isinstance(r, Exception) for r in results)
+ assert not results or any(isinstance(r, Exception) for r in results)

@shcheklein shcheklein closed this Jul 11, 2025