-
Notifications
You must be signed in to change notification settings - Fork 116
Review and suggest improvements for pull request #1227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Review and suggest improvements for pull request #1227
Conversation
Added missing datasets API reference to docs (#1225)
Co-authored-by: ivan <[email protected]>
Reviewer's GuideThis PR extends DataChain with end-to-end audio support by enhancing the core File API (FileType, as_audio_file, get_file_type) to handle audio, implementing dedicated AudioFile, AudioFragment, and Audio data models, introducing a new audio utilities module powered by torchaudio/soundfile, refining UDF stream initialization via recursive traversal, updating configuration/API exports, and providing comprehensive documentation, examples and tests. Class diagram for new and updated audio data modelsclassDiagram
class File {
+as_audio_file() AudioFile
}
class AudioFile {
+get_info() Audio
+get_fragment(start: float, end: float) AudioFragment
+get_fragments(duration: float, start: float=0, end: float=None, audio_duration: float=None) Iterator~AudioFragment~
}
class AudioFragment {
+audio: AudioFile
+start: float
+end: float
+get_np() tuple[ndarray, int]
+read_bytes(format: str="wav") bytes
+save(output: str, format: Optional[str]=None) AudioFile
}
class Audio {
+sample_rate: int
+channels: int
+duration: float
+samples: int
+format: str
+codec: str
+bit_rate: int
}
File <|-- AudioFile
AudioFile o-- AudioFragment : fragments
AudioFragment o-- AudioFile : audio
Class diagram for UDF recursive stream setting enhancementclassDiagram
class UDF {
+_set_stream_recursive(obj, catalog, cache, download_cb, visited=None)
}
class File
class DataModel
UDF --> File : sets stream
UDF --> DataModel : traverses fields
DataModel <|-- Audio
DataModel <|-- AudioFragment
DataModel <|-- Video
DataModel <|-- AudioFile
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @shcheklein - I've reviewed your changes - here's some feedback:
- Consider caching the result of
AudioFile.get_info()
(e.g. via a simple memoization) so that multiple fragment operations don’t repeatedly calltorchaudio.info
and re-open the file. - The recursive
_set_stream_recursive
only descends into DataModel fields; you may want to also handle iterable containers (lists, tuples, dicts) of File/DataModel objects to ensure all nested files get their stream set. - It would be clearer to add an explicit
else
or error inget_file_type
for unsupportedFileType
values instead of relying on fall-through, so typos in the type string are caught early.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider caching the result of `AudioFile.get_info()` (e.g. via a simple memoization) so that multiple fragment operations don’t repeatedly call `torchaudio.info` and re-open the file.
- The recursive `_set_stream_recursive` only descends into DataModel fields; you may want to also handle iterable containers (lists, tuples, dicts) of File/DataModel objects to ensure all nested files get their stream set.
- It would be clearer to add an explicit `else` or error in `get_file_type` for unsupported `FileType` values instead of relying on fall-through, so typos in the type string are caught early.
## Individual Comments
### Comment 1
<location> `src/datachain/lib/file.py:1066` </location>
<code_context>
+ start: float
+ end: float
+
+ def get_np(self) -> "tuple[ndarray, int]":
+ """
+ Returns the audio fragment as a NumPy array with sample rate.
</code_context>
<issue_to_address>
Clarify channel axis handling in get_np for single-channel audio.
Squeezing single-channel audio can cause output shape inconsistencies. To avoid breaking downstream code, always return a 2D array (samples, channels).
</issue_to_address>
### Comment 2
<location> `src/datachain/lib/udf.py:277` </location>
<code_context>
+ def _set_stream_recursive(
</code_context>
<issue_to_address>
Recursive stream setting does not handle lists or dicts of DataModels.
Extend the recursion to also handle lists and dicts containing DataModels or Files, as these are currently not traversed.
</issue_to_address>
### Comment 3
<location> `src/datachain/lib/audio.py:100` </location>
<code_context>
+def audio_segment_np(
</code_context>
<issue_to_address>
audio_segment_np may return inconsistent array shapes for mono vs stereo.
Currently, single-channel audio is squeezed to 1D, which may break code expecting a 2D (samples, channels) shape. Recommend always returning a 2D array for consistency.
Suggested implementation:
```python
"""
Load audio segment as numpy array with memory management.
Multi-channel audio is transposed to (samples, channels) format.
For very large segments, considers memory constraints.
Always returns a 2D numpy array of shape (samples, channels), even for mono audio.
```
```python
# ... (code that loads audio into `audio_np` and `sample_rate`)
# Ensure output is always 2D: (samples, channels)
if audio_np.ndim == 1:
audio_np = audio_np[:, None]
return audio_np, sample_rate
```
</issue_to_address>
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
start: float | ||
end: float | ||
|
||
def get_np(self) -> "tuple[ndarray, int]": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Clarify channel axis handling in get_np for single-channel audio.
Squeezing single-channel audio can cause output shape inconsistencies. To avoid breaking downstream code, always return a 2D array (samples, channels).
src/datachain/lib/udf.py
Outdated
def _set_stream_recursive( | ||
self, obj: Any, catalog: "Catalog", cache: bool, download_cb: Callback, | ||
visited: Optional[set] = None | ||
) -> None: | ||
"""Recursively set the catalog stream on all File objects within an object.""" | ||
if visited is None: | ||
visited = set() | ||
|
||
if id(obj) in visited: | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: Recursive stream setting does not handle lists or dicts of DataModels.
Extend the recursion to also handle lists and dicts containing DataModels or Files, as these are currently not traversed.
src/datachain/lib/audio.py
Outdated
def audio_segment_np( | ||
audio: "AudioFile", | ||
start: float = 0, | ||
duration: Optional[float] = None, | ||
max_memory_mb: Optional[int] = None | ||
) -> "tuple[ndarray, int]": | ||
""" | ||
Load audio segment as numpy array with memory management. | ||
|
||
Multi-channel audio is transposed to (samples, channels) format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (bug_risk): audio_segment_np may return inconsistent array shapes for mono vs stereo.
Currently, single-channel audio is squeezed to 1D, which may break code expecting a 2D (samples, channels) shape. Recommend always returning a 2D array for consistency.
Suggested implementation:
"""
Load audio segment as numpy array with memory management.
Multi-channel audio is transposed to (samples, channels) format.
For very large segments, considers memory constraints.
Always returns a 2D numpy array of shape (samples, channels), even for mono audio.
# ... (code that loads audio into `audio_np` and `sample_rate`)
# Ensure output is always 2D: (samples, channels)
if audio_np.ndim == 1:
audio_np = audio_np[:, None]
return audio_np, sample_rate
tests/func/test_audio.py
Outdated
for i, (duration, freq) in enumerate([(2.0, 440.0), (3.0, 880.0)]): | ||
audio_data = generate_test_wav( | ||
duration=duration, sample_rate=16000, frequency=freq | ||
) | ||
audio_path = tmp_path / f"test_audio_{i}.wav" | ||
audio_path.write_bytes(audio_data) | ||
audio_files.append(str(audio_path)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (code-quality): Avoid loops in tests. (no-loop-in-tests
)
Explanation
Avoid complex code, like loops, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
tests/func/test_audio.py
Outdated
for file, info in results: | ||
assert isinstance(file, AudioFile) | ||
assert isinstance(info, Audio) | ||
assert info.sample_rate == 16000 | ||
assert info.channels == 1 | ||
assert info.duration > 0 | ||
assert info.samples > 0 | ||
assert info.format != "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (code-quality): Avoid loops in tests. (no-loop-in-tests
)
Explanation
Avoid complex code, like loops, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
src/datachain/lib/audio.py
Outdated
raise ValueError(f"Invalid time range: ({start:.3f}, {end:.3f})") | ||
|
||
if format is None: | ||
format = audio.get_file_ext() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (code-quality): Don't assign to builtin variable format
(avoid-builtin-shadow
)
Explanation
Python has a number ofbuiltin
variables: functions and constants thatform a part of the language, such as
list
, getattr
, and type
(See https://docs.python.org/3/library/functions.html).
It is valid, in the language, to re-bind such variables:
list = [1, 2, 3]
However, this is considered poor practice.
- It will confuse other developers.
- It will confuse syntax highlighters and linters.
- It means you can no longer use that builtin for its original purpose.
How can you solve this?
Rename the variable something more specific, such as integers
.
In a pinch, my_list
and similar names are colloquially-recognized
placeholders.
if audio_duration is not None: | ||
end = audio_duration | ||
else: | ||
end = self.get_info().duration | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (code-quality): We've found these issues:
- Swap if/else branches (
swap-if-else-branches
) - Replace if statement with if expression (
assign-if-exp
)
if audio_duration is not None: | |
end = audio_duration | |
else: | |
end = self.get_info().duration | |
end = self.get_info().duration if audio_duration is None else audio_duration |
tests/func/test_audio.py
Outdated
fragment = file.get_fragment(start, start + 0.5) | ||
yield fragment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (code-quality): Inline variable that is immediately yielded (inline-immediately-yielded-variable
)
fragment = file.get_fragment(start, start + 0.5) | |
yield fragment | |
yield file.get_fragment(start, start + 0.5) |
tests/func/test_audio.py
Outdated
|
||
# Check that all files have expected audio metadata | ||
for file, info in results: | ||
assert isinstance(file, AudioFile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (code-quality): Extract code out into function [×2] (extract-method
)
tests/func/test_audio.py
Outdated
).to_values("info")) | ||
# If we get here, the error was handled and we should have gotten an exception | ||
# in the processing, not here | ||
assert len(results) == 0 or any(isinstance(r, Exception) for r in results) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (code-quality): Simplify sequence length comparison (simplify-len-comparison
)
assert len(results) == 0 or any(isinstance(r, Exception) for r in results) | |
assert not results or any(isinstance(r, Exception) for r in results) |
Add comprehensive audio support to DataChain with performance optimizations, enhanced error handling, and detailed documentation.
Slack Thread
Summary by Sourcery
Introduce end-to-end audio support in DataChain by adding new audio file models, processing utilities, streaming capabilities, performance optimizations, enhanced error handling, and comprehensive documentation and tests
New Features:
Enhancements:
Build:
Documentation:
Tests: