kaldi: add a switch/option to read the durations from kaldi utt2dur … #832
…instead of touching the audio file
I tested:
and compared the durations with the original files. They weren't equal to the durations in utt2dur, but I assume it's just a float representation issue.
😱
I found some issues, could you add a unit test to cover those? The existing tests didn't catch them.
Regarding reco2dur, IIRC it only represents durations to two decimal places, so the rounding error is at most +/- 5 ms. The default Lhotse tolerance for a duration mismatch is +/- 0.25 s (see https://github.com/lhotse-speech/lhotse/blob/master/lhotse/audio.py#L70), which keeps Lhotse happy even though most MP3 decoders diverge in the total number of samples, so it should be OK to set this option to true by default and save everybody's time. Historically this code already used reco2dur, but at the time Lhotse tolerated zero mismatch between the manifest and the audio duration, so I removed it (and much later realized that approach cannot work with various audio codecs, and added the tolerance).
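To make the numbers in this comment concrete, here is a minimal sketch of the comparison. The two-decimal rounding and the 0.25 s tolerance come from the comment itself; the names `TOLERANCE_S` and `within_tolerance` are made up for illustration and are not Lhotse's API.

```python
# Illustrative only: the names below are hypothetical, not Lhotse's API.
TOLERANCE_S = 0.25  # default mismatch tolerance quoted in the review comment


def within_tolerance(
    manifest_dur: float, audio_dur: float, tol: float = TOLERANCE_S
) -> bool:
    """Return True if the two durations differ by at most `tol` seconds."""
    return abs(manifest_dur - audio_dur) <= tol


# A 16 kHz recording with 123456 samples lasts 7.716 s.
true_duration = 123456 / 16000
# A file that keeps two decimal places would list it as 7.72 s.
listed_duration = round(true_duration, 2)
# Rounding to two decimals can move the value by at most 0.005 s.
rounding_error = abs(listed_duration - true_duration)
```

The rounding error is far below the tolerance, which is the argument for trusting the listed durations by default.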
lhotse/kaldi.py
Outdated
- ] = f"sph2pipe {source.source} -f wav -c {channel+1} -p | ffmpeg -threads 1 -i pipe:0 -ar {sampling_rate} -f wav -threads 1 pipe:1 |"
+ audios[channel] = (
+ f"sph2pipe {source.source} -f wav -c {channel+1} -p | ffmpeg -threads 1 "
+ "-i pipe:0 -ar {sampling_rate} -f wav -threads 1 pipe:1 |"
Suggested change:
- "-i pipe:0 -ar {sampling_rate} -f wav -threads 1 pipe:1 |"
+ f"-i pipe:0 -ar {sampling_rate} -f wav -threads 1 pipe:1 |"
lhotse/kaldi.py
Outdated
- ] = f"ffmpeg -threads 1 -i {source.source} -ar {sampling_rate} -map_channel 0.0.{channel} -f wav -threads 1 pipe:1 |"
+ audios[channel] = (
+ f"ffmpeg -threads 1 -i {source.source} -ar {sampling_rate} "
+ "-map_channel 0.0.{channel} -f wav -threads 1 pipe:1 |"
Suggested change:
- "-map_channel 0.0.{channel} -f wav -threads 1 pipe:1 |"
+ f"-map_channel 0.0.{channel} -f wav -threads 1 pipe:1 |"
yeah, sorry, didn't realize these...
working on the unit test for the new feature (ETA 5 mins)
y.
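The two suggested changes come down to how Python joins adjacent string literals: each literal is processed on its own, so the `f` prefix has to be repeated on every piece that contains a placeholder. A small sketch, using a made-up file name `audio.flac` and hard-coded values in place of the real `source` object:

```python
sampling_rate = 16000
channel = 0

# Buggy: only the first literal is an f-string, so "{channel}" in the
# second literal is kept as plain text.
buggy = (
    f"ffmpeg -threads 1 -i audio.flac -ar {sampling_rate} "
    "-map_channel 0.0.{channel} -f wav -threads 1 pipe:1 |"
)

# Fixed: both literals carry the f prefix, so both placeholders are filled.
fixed = (
    f"ffmpeg -threads 1 -i audio.flac -ar {sampling_rate} "
    f"-map_channel 0.0.{channel} -f wav -threads 1 pipe:1 |"
)

print(buggy)
print(fixed)
```

A leftover `{` in the result is therefore a cheap signal that a prefix was dropped.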
Added tests for my incorrect formatting and added a test for the new feature. Let's wait, but the tests pass fine on my setup.
Can you also change to use reco2dur by default, if present? It should speed things up and shouldn't hurt anybody.
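One possible reading of "by default, if present" is sketched below. It assumes reco2dur is the usual two-column Kaldi file (recording ID, duration in seconds) in the data directory; `get_durations` and its `measure` callback are hypothetical names, not the code in this PR.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterable


def get_durations(
    data_dir: Path,
    recording_ids: Iterable[str],
    measure: Callable[[str], float],
    use_reco2dur: bool = True,
) -> Dict[str, float]:
    """Prefer durations listed in reco2dur; otherwise measure each recording."""
    reco2dur = Path(data_dir) / "reco2dur"
    if use_reco2dur and reco2dur.is_file():
        durations = {}
        for line in reco2dur.read_text().splitlines():
            if line.strip():
                rec_id, dur = line.split()
                durations[rec_id] = float(dur)
        return durations
    # Slow path: `measure` stands in for opening the audio and reading its length.
    return {rec_id: measure(rec_id) for rec_id in recording_ids}


with tempfile.TemporaryDirectory() as tmp:
    # No reco2dur yet, so the durations are measured.
    slow = get_durations(Path(tmp), ["rec1"], measure=lambda _: 7.716)
    (Path(tmp) / "reco2dur").write_text("rec1 7.72\n")
    # reco2dur is present now, so the audio is never touched.
    fast = get_durations(Path(tmp), ["rec1"], measure=lambda _: 7.716)
```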
The default behavior is to ignore reco2dur even if it exists and to read the files fully, as before. The durations are read from that file only when the import is called with the proper parameter.
y.
our emails just crossed...
ok, if you are fine with it, happily.
Just FYI -- it also implicitly tests for the existence of the file (because it will complain if the file does not exist), so it might cause some confusion.
y.
OK, I'm done with the changes (including making the option enabled by default).
I'm investigating. I assume it's something with my cast/conversion to float.
OK, should be fixed now. It was just an issue in the test, due to the change of defaults (I was much stricter than I should have been).
OK, we survived. Ready for review :)
test/test_kaldi_dirs.py
@@ -220,6 +221,7 @@ def test_ok_on_file_singlechannel_sph_source_type(tmp_path, channel):
assert list(out.keys()) == [channel]
assert out[channel].startswith("sph2pipe")
assert "nonexistent.sph" in out[channel]
+ assert "{" not in out[channel]
Just curious: what is this assert verifying?
It just detects that someone has goofed up and didn't fully format the f-string (or forgot to add the f"" prefix).
lhotse/bin/modes/kaldi.py
Outdated
@@ -42,7 +42,7 @@ def kaldi():
@click.option(
"-d",
"--use-reco2dur",
I think with default=True and is_flag=True it is impossible to turn this off. The idiomatic click way would be @click.option("--use-reco2dur/--compute-durations", default=True, help="...") and then name the Python function argument use_reco2dur: bool (the same as the first option before the slash, i.e. the same as right now).
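A minimal sketch of the on/off flag pattern described in this comment. The option names are the ones proposed above; the command name `import_dir`, its body, and the `run` helper are placeholders for illustration, not the actual Lhotse CLI.

```python
import click
from click.testing import CliRunner


@click.command()
@click.option(
    "--use-reco2dur/--compute-durations",
    default=True,
    help="Read durations from reco2dur, or compute them from the audio.",
)
def import_dir(use_reco2dur: bool) -> None:
    # Placeholder body: it only reports which mode was selected.
    click.echo(f"use_reco2dur={use_reco2dur}")


def run(args):
    """Invoke the command in-process and return its stripped output."""
    result = CliRunner().invoke(import_dir, args)
    return result.output.strip()
```

With the slash form, click registers both switches for the same parameter, so `--compute-durations` turns the default off.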
lhotse/kaldi.py
Outdated
- num_samples=compute_num_samples(durations[recording_id], sampling_rate),
- duration=durations[recording_id],
+ num_samples=compute_num_samples(
+ float(durations[recording_id]), sampling_rate
I think it could be cleaner to do something like load_kaldi_text_mapping(path, float_vals: bool = False) and, when float_vals=True, cast the second column to float in the function. Otherwise you have to cast in multiple places, which looks like a nasty surprise for anybody modifying these things in the future :)
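A rough sketch of what this suggestion could look like. Only the function name and the `float_vals` parameter come from the comment; the body is an assumption about a simple two-column reader and is not the code that was merged.

```python
import tempfile
from pathlib import Path
from typing import Dict, Union


def load_kaldi_text_mapping(
    path: Union[str, Path], float_vals: bool = False
) -> Dict[str, Union[str, float]]:
    """Load a two-column Kaldi text file (key, value) into a dict.

    With float_vals=True the second column is cast to float here,
    so callers do not have to cast at every use site.
    """
    mapping: Dict[str, Union[str, float]] = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            key, value = line.split(maxsplit=1)
            mapping[key] = float(value) if float_vals else value
    return mapping


with tempfile.TemporaryDirectory() as tmp:
    reco2dur = Path(tmp) / "reco2dur"
    reco2dur.write_text("rec1 7.72\nrec2 12.5\n")
    durations = load_kaldi_text_mapping(reco2dur, float_vals=True)
    raw = load_kaldi_text_mapping(reco2dur)
```

Keeping the cast inside the loader means code such as the `compute_num_samples(...)` call in the diff above can use the value directly.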
Just to detect improperly formatted strings (I split an f"" string into two and forgot to prefix the second string with "f"). So it tries to detect left-over "{placeholders}". It's only in a test, so I didn't make it more extensive/foolproof.
y.
good idea, will fix
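@pzelasko your comments are addressed. I don't understand why the 3.7 unit tests are failing.
could this be an issue? https://stackoverflow.com/questions/73929564/entrypoints-object-has-no-attribute-get-digital-ocean
🤦🏻♂️ do you mind adding importlib-metadata<5.0.0 to requirements and checking if it helps? Thanks...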
hm, that wasn't it :(
the last errors come from hypothesis, but I'm not sure if that's the culprit :(
y.
LGTM, just one minor comment. Don't worry about failing tests, it'll either sort itself out (pypi/packaging flake for some third party dependency) or I will look into it later.
setup.py
Outdated
@@ -141,6 +141,7 @@ def mark_lhotse_version(version: str) -> None:
install_requires = [
+ "importlib-metadata<5.0.0",
if these lines don't help, let's remove them
setup.py
Outdated
@@ -180,6 +181,7 @@ def mark_lhotse_version(version: str) -> None:
docs_require = (project_root / "docs" / "requirements.txt").read_text().splitlines()
tests_require = [
+ "importlib-metadata<5.0.0",
if these lines don't help, let's remove them
will remove. Thanks!
y.
I added one more test, but IMO I'm done now :)
ah, sorry, fixing this
…instead of touching the audio file
Should resolve #788.
Also, let me know if -d feels like the right option for it; to me it is not quite right, but no other option felt better.