Reset decoder states on resync #31

drowe67 · 2024-10-16T22:09:58Z

During alpha testing of freedv-gui + RADE, Mooneer and Walter reported a howling sound from the decoder that could be cleared by re-initialising RADE. This may be due to the decoder being kicked into a bad state where it gets stuck. As a precaution, this PR resets the decoder states on re-sync.
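To illustrate why stale state matters, here is a minimal sketch using a hypothetical `ToyStatefulDecoder`. It is not the RADE decoder: the class, its leaky-integrator update, and all values are assumptions chosen only to show that a burst of bad input keeps colouring later output until `reset()` is called.

```python
class ToyStatefulDecoder:
    """Hypothetical stand-in for a recurrent decoder: output depends on past inputs."""

    def __init__(self, memory=0.9):
        self.memory = memory  # how strongly past state is carried forward
        self.state = 0.0

    def reset(self):
        # Return to the same state the decoder had at start-up.
        self.state = 0.0

    def step(self, x):
        # Leaky integrator: new state mixes the old state with the new input.
        self.state = self.memory * self.state + (1.0 - self.memory) * x
        return self.state


def decode(decoder, frames):
    return [decoder.step(x) for x in frames]


clean = [1.0] * 5
garbage = [100.0] * 5                # e.g. frames decoded while out of sync

dec = ToyStatefulDecoder()
reference = decode(dec, clean)       # fresh decoder on clean input

dec.reset()
decode(dec, garbage)                 # state is now far from anything "normal"
without_reset = decode(dec, clean)   # still polluted by the garbage

dec.reset()
decode(dec, garbage)
dec.reset()                          # the kind of reset applied on re-sync
with_reset = decode(dec, clean)      # identical to the fresh decoder
```

In this toy the pollution decays on its own; a real recurrent network has no such guarantee, which is why an explicit reset is a cheap precaution.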

TODO

Any other states in Rx we should reset, e.g. classical DSP code in dsp.py? Does anything go crazy when it gets zeros fed into it?
Can we reproduce the problem from the command line?
It's possible a set of inputs not seen in training could push the network into an undefined state, either on the tx (a certain speaker) or rx (certain channel noise) side. We should be able to trap that with an example.
Can we reset the FARGAN decoder states (external C library)?
(tmiw) Fix gap-in-rx-audio dropped sample bug in freedv-gui

drowe67 · 2024-10-16T22:12:24Z

@tmiw ☝️

tmiw · 2024-10-17T14:04:13Z

This looks like it partially helps based on tests using the recording I sent over. I did notice that it could take a bit before it goes fully out of sync, though. I did try this change on my local copy to reduce the duration of the sound further, but I don't know if there are any negative side effects:

              if candidate:
                 self.valid_count = self.Nmf_unsync
              else:
                 self.valid_count -= 1
+                model.core_decoder_statefull.module.reset()
                 if unsync_enable and self.valid_count == 0:
                    next_state = "search"

Re: FARGAN reset, I tried calling fargan_init() and fargan_cont() again when sync goes from 0 to != 0 but that didn't seem to make any difference.

As for freedv-gui, I really do suspect it's related to doing tests on Wi-Fi rather than Ethernet (my Flex 6300 is connected via TCP/IP). I'm going to listen for a bit on the air this morning and see if I can get another RADE recording to look at.

drowe67 · 2024-10-17T19:51:50Z

> As for freedv-gui, I really do suspect it's related to doing tests on Wi-Fi rather than Ethernet (my Flex 6300 is connected via TCP/IP). I'm going to listen for a bit on the air this morning and see if I can get another RADE recording to look at.

Yep - Wi-Fi sounds like a bad idea. As per email, I'd avoid OTA signals.

We really need a way to detect this problem automatically and give a go/no-go result, for example so we can test that Ethernet is working properly. Manual listening is tedious and won't pick up short gaps (a few tens of samples) that will kill the link but that we can't hear. A way for end users to test for this issue would be useful too, e.g. documentation or instructions for a simple listening test.
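A go/no-go check of this kind could be sketched as a search for runs of exactly-zero samples. This assumes dropped audio shows up as zero fill; if dropped samples are simply removed rather than zero-filled, this check will not see them. The names `find_zero_runs` and `gap_check` and the default threshold are hypothetical.

```python
def find_zero_runs(samples, min_len=20):
    """Return (start, length) for each run of exact zeros at least min_len long."""
    runs = []
    start = None
    for i, s in enumerate(samples):
        if s == 0:
            if start is None:
                start = i            # a new run of zeros begins here
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i - start))
            start = None
    # A run that extends to the end of the recording.
    if start is not None and len(samples) - start >= min_len:
        runs.append((start, len(samples) - start))
    return runs


def gap_check(samples, min_len=20):
    """Go/no-go: 'PASS' when no suspicious zero run is found."""
    return "PASS" if not find_zero_runs(samples, min_len) else "FAIL"
```

A continuous test signal (for example a tone) makes this more reliable than speech, since isolated zero crossings are far shorter than `min_len`, while natural pauses in speech could be digitally silent on some audio paths.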

I feel we need to nail the dropout issue first, then we can return to the howling issue with a known good signal. Another possibility is something on the tx side getting into a weird state. Once again, if we can reproduce the issue with a clean, dropout-free signal, we will have an easier time tracking it down. For example, we might see NaNs from the decoder, or a repeating sequence of output features, or the modem re-syncing for no reason.
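Two of the symptoms listed above (NaNs and a repeating sequence of output features) can be trapped mechanically. The following is a sketch only: `check_features` and `max_repeat` are hypothetical, and the frames are plain lists of floats rather than a real feature file, so no frame width is assumed.

```python
import math


def check_features(frames, max_repeat=10):
    """Scan decoder output frames for NaN/inf values and for stuck (repeating) output.

    frames: list of equal-length lists of floats, one list per feature vector.
    Returns the index of the first non-finite frame and the longest repeat run.
    """
    first_nonfinite = None
    longest_repeat = 1 if frames else 0
    run = 1
    for i, frame in enumerate(frames):
        if first_nonfinite is None and any(not math.isfinite(v) for v in frame):
            first_nonfinite = i
        if i > 0 and frame == frames[i - 1]:
            run += 1                                  # same vector as last frame
            longest_repeat = max(longest_repeat, run)
        else:
            run = 1
    return {
        "first_nonfinite": first_nonfinite,
        "longest_repeat": longest_repeat,
        "stuck": longest_repeat >= max_repeat,
    }
```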

> This looks like it partially helps based on tests using the recording I sent over. I did notice that it could take a bit before it goes fully out of sync, though.

There's a timer that counts errors over a few seconds and then resets the sync state machine. You can see the error count in the rx logs. The trade-off is that we need to ride through fades without a re-sync, and fades can also take the channel out for a few seconds.
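The ride-through behaviour can be sketched as a countdown that every valid frame reloads. This is a simplified model, not the receiver's code: `SyncTracker`, `hangover_frames`, and the two-state machine are assumptions made for illustration.

```python
class SyncTracker:
    """Toy sync state machine: drop to 'search' only after a run of bad frames."""

    def __init__(self, hangover_frames):
        self.hangover_frames = hangover_frames
        self.state = "search"
        self.count = 0
        self.resyncs = 0

    def update(self, frame_valid):
        if self.state == "search":
            if frame_valid:
                self.state = "sync"
                self.count = self.hangover_frames
        else:  # "sync"
            if frame_valid:
                self.count = self.hangover_frames   # reload the countdown
            else:
                self.count -= 1
                if self.count == 0:
                    self.state = "search"           # give up and re-acquire
                    self.resyncs += 1
        return self.state


def run(tracker, valid_flags):
    return [tracker.update(v) for v in valid_flags]
```

A larger `hangover_frames` rides through longer fades, but it also delays the moment a genuine loss of sync is noticed, which is the trade-off described above.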

tmiw · 2024-10-17T21:40:19Z

If it helps, I'm able to duplicate the noise issue with the following:

Creating transmit audio file

1. Open Audacity and create a 16 kHz mono WAV file (I just took my existing voice keyer file and duplicated the audio until I reached like 4-5 minutes or so in length).
2. Go to File->New and set the record audio device to some sort of loopback device (i.e. the one created by sudo modprobe snd-aloop on Linux). Save and close the file created in (1) to avoid confusion.
3. In FreeDV:
   a. Adjust the audio settings so that the output radio device is that loopback device. Disable all CAT control.
   b. Right-click on the Voice Keyer button and select "Use another voice keyer file..."
   c. Select the file created in (1) and click Open.
4. Click the Record button in the empty document in Audacity and then push the Voice Keyer button in FreeDV.
5. When the voice keyer finishes one transmit cycle, push the Voice Keyer button to stop it, then push Stop in Audacity.
6. Review the recorded file in Audacity to verify that there are no gaps in the transmit audio.

Add gaps to audio

1. In Audacity, press Ctrl-A to select all audio and then go to Tools->Regular Interval Labels (Note: you may need to install this plugin using Audacity's plugin manager first.)
2. Use the following settings:
   - Create labels based on: Label Interval
   - Label interval (seconds): 10.0
   - Length of label region (seconds): 0.25 (This can be adjusted to test various scenarios. For example, RADE/FreeDV still seemed to behave okay at 0.1 seconds.)
3. Click Apply.
4. Press Ctrl-A again and go to Edit->Labeled Audio->Silence Audio.

Test decode

1. In Audacity, change the playback device to point to the loopback audio device provided above.
2. In FreeDV, change the input radio device to the loopback audio device. Choose RADE and press Start.
3. Go back to Audacity and press Play. Listen for any artifacts (i.e. the howling noise).

Unfortunately it's not exactly automated but it's at least repeatable. In theory one could save the resulting audio from "Add gaps to audio" and use that in a RADE ctest or something. There might also be a way to use sox or something to automatically add the dropouts, too.
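The "Add gaps to audio" steps could be automated along the following lines. This is a sketch in plain Python rather than sox, and `silence_regular_intervals` is a hypothetical helper. Like Audacity's Silence Audio, it overwrites samples with zeros, so the length of the audio is unchanged. Where the first gap falls is an assumption; here it defaults to one full interval into the file.

```python
def silence_regular_intervals(samples, fs, interval_s=10.0, gap_s=0.25, first_gap_s=None):
    """Return a copy of samples with gap_s seconds zeroed every interval_s seconds."""
    interval = int(round(interval_s * fs))
    gap = int(round(gap_s * fs))
    if interval <= 0:
        raise ValueError("interval_s must be positive")
    out = list(samples)                       # leave the caller's data untouched
    start = interval if first_gap_s is None else int(round(first_gap_s * fs))
    while start < len(out):
        end = min(start + gap, len(out))
        out[start:end] = [0] * (end - start)  # overwrite, do not insert
        start += interval
    return out
```

For example, `silence_regular_intervals(audio, 16000)` would zero 0.25 s of a 16 kHz recording every 10 s, matching the settings listed above.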

Re: dropouts, I didn't see any obvious ones when I followed "Creating transmit audio file" above, either at light load (~90% idle) or after starting something like 14 yes >/dev/null processes (~0% idle).

BTW we might not even need to go that far to duplicate the howling noise. Simply stopping playback in Audacity (causing FreeDV to receive silence) triggered it, I think without even needing to add gaps in the audio first.

tmiw · 2024-10-17T21:46:11Z

> We really need a way to detect this problem automatically and give a go/no-go result, for example so we can test that Ethernet is working properly. Manual listening is tedious and won't pick up short gaps (a few tens of samples) that will kill the link but that we can't hear. A way for end users to test for this issue would be useful too, e.g. documentation or instructions for a simple listening test.

I'm not fully sure how the use of Ethernet/some sort of reliable datalink could be reliably detected without introducing a lot of OS-specific dependencies (and potentially special radio-specific logic). Given that most users use USB connected radios, I suspect this can be deferred.

tmiw · 2024-10-18T03:25:08Z

Figured out a possible way to duplicate the dropouts in a RADE ctest:

(radae-venv) MooneerMBP16158:radae mooneer$ ./inference.sh model19_check3/checkpoints/checkpoint_epoch_100.pth wav/brian_g8sez.wav /dev/null                        --rate_Fs --pilots --pilot_eq --eq_ls --cp 0.004 --bottleneck 3 --auxdata --write_rx rx.f32 --correct_freq_offset;                        cat features_in.f32 | python3 radae_txe.py --model model19_check3/checkpoints/checkpoint_epoch_100.pth --txbpf | sox -t raw -e floating-point -b 32 -c 1 -r 8000 - -t raw -e floating-point -b 32 -c 1 -r 8000 - pad 0.25@5 > rx.f32
encoder: 937200 weights
decoder: 907764 weights
encoder: 937200 weights
decoder: 907764 weights
Rs: 33.33 Rs': 50.00 Ts': 0.020 Nsmf: 120 Ns:   4 Nc:  30 M: 160 Ncp: 32
Processing: 972 feature vectors
          Eb/No   C/No     SNR3k  Rb'    Eq     PAPR
Target..: 100.00  133.01   98.24  3000
Measured:  97.45  132.22   97.45                 0.79
loss: 0.128 BER: 0.000
encoder: 937200 weights
decoder: 907764 weights
encoder: 937200 weights
decoder: 907764 weights
Rs: 33.33 Rs': 50.00 Ts': 0.020 Nsmf: 120 Ns:   4 Nc:  30 M: 160 Ncp: 32
Input BPF bandwidth: 1740.000162 centre: 1474.999994
sox WARN sox: `-' output clipped 3 samples; decrease volume?
(radae-venv) MooneerMBP16158:radae mooneer$ cat rx.f32 | python3 radae_rxe.py --model model19_check3/checkpoints/checkpoint_epoch_100.pth -v 1 > features_txs_out.f32
encoder: 937200 weights
decoder: 907764 weights
encoder: 937200 weights
decoder: 907764 weights
Rs: 33.33 Rs': 50.00 Ts': 0.020 Nsmf: 120 Ns:   4 Nc:  30 M: 160 Ncp: 32
Input BPF bandwidth: 1740.000162 centre: 1474.999994
  1 state: search     valid: 1 0  0 Dthresh:     2.16 Dtmax12:     5.17     0.00 tmax:  324 fmax:   0.00
  2 state: candidate  valid: 1 0  1 Dthresh:     4.98 Dtmax12:     8.25     0.00 tmax:  326 fmax:   0.00
  3 state: candidate  valid: 1 0  2 Dthresh:     5.74 Dtmax12:    10.27     0.00 tmax:  328 fmax:   0.00
  4 state: candidate  valid: 1 0  3 Dthresh:     5.79 Dtmax12:    10.26     0.00 tmax:  328 fmax:   0.00
 48 state: search     valid: 1 0  0 Dthresh:     5.50 Dtmax12:    10.20     1.89 tmax:  368 fmax:   0.00
 49 state: candidate  valid: 1 0  1 Dthresh:     5.51 Dtmax12:    10.26     1.89 tmax:  368 fmax:   0.00
 50 state: candidate  valid: 1 0  2 Dthresh:     5.64 Dtmax12:    10.29     1.89 tmax:  368 fmax:   0.00
 51 state: candidate  valid: 1 0  3 Dthresh:     5.70 Dtmax12:    10.26     1.89 tmax:  368 fmax:   0.00
(radae-venv) MooneerMBP16158:radae mooneer$ python3 loss.py features_in.f32 features_txs_out.f32 --loss_test 0.15 --acq_time_test 0.5 --clip_start 5
Loss between features_in.f32 and features_txs_out.f32
  loss: 2.546 start: 77 acq_time:  0.77 s
FAIL
(radae-venv) MooneerMBP16158:radae mooneer$

I'm not sure what loss should be, though, but if I use 0.1 instead of 0.25 in the sox call:

(radae-venv) MooneerMBP16158:radae mooneer$ ./inference.sh model19_check3/checkpoints/checkpoint_epoch_100.pth wav/brian_g8sez.wav /dev/null                        --rate_Fs --pilots --pilot_eq --eq_ls --cp 0.004 --bottleneck 3 --auxdata --write_rx rx.f32 --correct_freq_offset;                        cat features_in.f32 | python3 radae_txe.py --model model19_check3/checkpoints/checkpoint_epoch_100.pth --txbpf | sox -t raw -e floating-point -b 32 -c 1 -r 8000 - -t raw -e floating-point -b 32 -c 1 -r 8000 - pad 0.1@5 > rx.f32
encoder: 937200 weights
decoder: 907764 weights
encoder: 937200 weights
decoder: 907764 weights
Rs: 33.33 Rs': 50.00 Ts': 0.020 Nsmf: 120 Ns:   4 Nc:  30 M: 160 Ncp: 32
Processing: 972 feature vectors
          Eb/No   C/No     SNR3k  Rb'    Eq     PAPR
Target..: 100.00  133.01   98.24  3000
Measured:  97.45  132.22   97.45                 0.79
loss: 0.128 BER: 0.000
encoder: 937200 weights
decoder: 907764 weights
encoder: 937200 weights
decoder: 907764 weights
Rs: 33.33 Rs': 50.00 Ts': 0.020 Nsmf: 120 Ns:   4 Nc:  30 M: 160 Ncp: 32
Input BPF bandwidth: 1740.000162 centre: 1474.999994
sox WARN sox: `-' output clipped 3 samples; decrease volume?
(radae-venv) MooneerMBP16158:radae mooneer$ cat rx.f32 | python3 radae_rxe.py --model model19_check3/checkpoints/checkpoint_epoch_100.pth -v 1 > features_txs_out.f32
encoder: 937200 weights
decoder: 907764 weights
encoder: 937200 weights
decoder: 907764 weights
Rs: 33.33 Rs': 50.00 Ts': 0.020 Nsmf: 120 Ns:   4 Nc:  30 M: 160 Ncp: 32
Input BPF bandwidth: 1740.000162 centre: 1474.999994
  1 state: search     valid: 1 0  0 Dthresh:     2.16 Dtmax12:     5.17     0.00 tmax:  324 fmax:   0.00
  2 state: candidate  valid: 1 0  1 Dthresh:     4.98 Dtmax12:     8.25     0.00 tmax:  326 fmax:   0.00
  3 state: candidate  valid: 1 0  2 Dthresh:     5.74 Dtmax12:    10.27     0.00 tmax:  328 fmax:   0.00
  4 state: candidate  valid: 1 0  3 Dthresh:     5.79 Dtmax12:    10.26     0.00 tmax:  328 fmax:   0.00
 29 state: search     valid: 1 0 19 Dthresh:     5.57 Dtmax12:    10.22     2.03 tmax:  728 fmax:   0.00
 30 state: candidate  valid: 1 0  1 Dthresh:     5.62 Dtmax12:    10.22     2.03 tmax:  728 fmax:   0.00
 31 state: candidate  valid: 1 0  2 Dthresh:     5.62 Dtmax12:    10.30     2.03 tmax:  728 fmax:   0.00
 32 state: candidate  valid: 1 0  3 Dthresh:     5.59 Dtmax12:    10.30     2.03 tmax:  728 fmax:   0.00
(radae-venv) MooneerMBP16158:radae mooneer$ python3 loss.py features_in.f32 features_txs_out.f32 --loss_test 0.15 --acq_time_test 0.5 --clip_start 5
Loss between features_in.f32 and features_txs_out.f32
  loss: 1.105 start: 89 acq_time:  0.89 s
FAIL
(radae-venv) MooneerMBP16158:radae mooneer$

and 0:

(radae-venv) MooneerMBP16158:radae mooneer$ ./inference.sh model19_check3/checkpoints/checkpoint_epoch_100.pth wav/brian_g8sez.wav /dev/null                        --rate_Fs --pilots --pilot_eq --eq_ls --cp 0.004 --bottleneck 3 --auxdata --write_rx rx.f32 --correct_freq_offset;                        cat features_in.f32 | python3 radae_txe.py --model model19_check3/checkpoints/checkpoint_epoch_100.pth --txbpf | sox -t raw -e floating-point -b 32 -c 1 -r 8000 - -t raw -e floating-point -b 32 -c 1 -r 8000 - pad 0@5 > rx.f32
encoder: 937200 weights
decoder: 907764 weights
encoder: 937200 weights
decoder: 907764 weights
Rs: 33.33 Rs': 50.00 Ts': 0.020 Nsmf: 120 Ns:   4 Nc:  30 M: 160 Ncp: 32
Processing: 972 feature vectors
          Eb/No   C/No     SNR3k  Rb'    Eq     PAPR
Target..: 100.00  133.01   98.24  3000
Measured:  97.45  132.22   97.45                 0.79
loss: 0.128 BER: 0.000
encoder: 937200 weights
decoder: 907764 weights
encoder: 937200 weights
decoder: 907764 weights
Rs: 33.33 Rs': 50.00 Ts': 0.020 Nsmf: 120 Ns:   4 Nc:  30 M: 160 Ncp: 32
Input BPF bandwidth: 1740.000162 centre: 1474.999994
sox WARN sox: `-' output clipped 3 samples; decrease volume?
(radae-venv) MooneerMBP16158:radae mooneer$ cat rx.f32 | python3 radae_rxe.py --model model19_check3/checkpoints/checkpoint_epoch_100.pth -v 1 > features_txs_out.f32
encoder: 937200 weights
decoder: 907764 weights
encoder: 937200 weights
decoder: 907764 weights
Rs: 33.33 Rs': 50.00 Ts': 0.020 Nsmf: 120 Ns:   4 Nc:  30 M: 160 Ncp: 32
Input BPF bandwidth: 1740.000162 centre: 1474.999994
  1 state: search     valid: 1 0  0 Dthresh:     2.16 Dtmax12:     5.17     0.00 tmax:  324 fmax:   0.00
  2 state: candidate  valid: 1 0  1 Dthresh:     4.98 Dtmax12:     8.25     0.00 tmax:  326 fmax:   0.00
  3 state: candidate  valid: 1 0  2 Dthresh:     5.74 Dtmax12:    10.27     0.00 tmax:  328 fmax:   0.00
  4 state: candidate  valid: 1 0  3 Dthresh:     5.79 Dtmax12:    10.26     0.00 tmax:  328 fmax:   0.00
(radae-venv) MooneerMBP16158:radae mooneer$ python3 loss.py features_in.f32 features_txs_out.f32 --loss_test 0.15 --acq_time_test 0.5 --clip_start 5
Loss between features_in.f32 and features_txs_out.f32
  loss: 0.130 start: 41 acq_time:  0.41 s
PASS
(radae-venv) MooneerMBP16158:radae mooneer$
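For a ctest that should not depend on sox, the `pad LENGTH@POSITION` step (insert LENGTH seconds of silence at POSITION seconds) could be reproduced in Python. This is a sketch: `pad_raw_f32` is a hypothetical helper for headerless float32 data, and clamping a position beyond the end of the data is a choice made here, not a statement about sox's behaviour.

```python
from array import array


def pad_raw_f32(raw, rate, channels, length_s, position_s):
    """Insert length_s seconds of silence at position_s into raw float32 audio bytes."""
    data = array("f")
    data.frombytes(raw)
    at = int(round(position_s * rate)) * channels   # insert offset, in floats
    n = int(round(length_s * rate)) * channels      # floats of silence to insert
    at = min(at, len(data))                         # clamp (assumption, see above)
    padded = data[:at] + array("f", [0.0] * n) + data[at:]
    return padded.tobytes()
```

Unlike silencing a region in place, this lengthens the stream, so everything after the insert point arrives late by `length_s`.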

drowe67 · 2024-10-24T22:15:01Z

Hi @tmiw - thanks for working on those demos. I can't seem to reproduce the "OP" howling bug. There are some transient issues when a gap is introduced as:

- the decoder won't know sync is lost for a few seconds, so you'll get R2D2
- it will then have to re-sync

I'm wondering if the number of gaps in the OP sample was so large that it caused continual issues. This does highlight the need to make sure the audio stream (on tx and rx) is gap free - gaps are death to any modern mode like 700X or RADE.

Do you have a way to reproduce the howling issue using the command line tools? Any further examples from off-air recordings?

I tried your example above:

./inference.sh model19_check3/checkpoints/checkpoint_epoch_100.pth wav/brian_g8sez.wav /dev/null --rate_Fs --pilots --pilot_eq --eq_ls --cp 0.004 --bottleneck 3 --auxdata --write_rx rx.f32 --correct_freq_offset
cat features_in.f32 | python3 radae_txe.py --model model19_check3/checkpoints/checkpoint_epoch_100.pth | sox -t raw -e floating-point -b 32 -c 1 -r 8000 - -t raw -e floating-point -b 32 -c 1 -r 8000 - pad 0.25@5 > rx_gap.f32
cat rx_gap.f32 | python3 radae_rxe.py --model model19_check3/checkpoints/checkpoint_epoch_100.pth -v 1 > features_out_gap.f32

Then checked out the loss, and listened with:

python3 loss.py features_in.f32 features_out.f32 --features_hat2 features_out_gap.f32 --plot
./build/src/lpcnet_demo -fargan-synthesis features_out_gap.f32 - | aplay -f S16_LE -r 16000

It sounds OK, except for some transients due to the gap and re-sync. The key point is that it recovers - my understanding of the bug was that it led to some sort of long-term instability?

The loss from the features_out_gap.f32 was indeed high, but that's because the gap breaks the time alignment the tool depends on. So if you have gaps in the audio, loss.py breaks and can't be used.
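The alignment effect can be shown with a toy example. This is not loss.py's computation: `frame_mse` is a hypothetical index-by-index mean squared error, and the "features" are a single slowly varying scalar per frame.

```python
import math


def frame_mse(a, b):
    """Mean squared error between two feature sequences, compared index by index."""
    n = min(len(a), len(b))
    return sum((a[i] - b[i]) ** 2 for i in range(n)) / n


# One scalar "feature" per frame, slowly varying like speech parameters.
features_in = [math.sin(2 * math.pi * i / 50) for i in range(500)]

# A perfect decode, but with 12 extra frames inserted at frame 250.
gap = 12
features_gap = features_in[:250] + [0.0] * gap + features_in[250:]

loss_before_gap = frame_mse(features_in[:250], features_gap[:250])
loss_whole_file = frame_mse(features_in, features_gap)
# Re-align by skipping the inserted frames: the decoded content is still correct.
loss_realigned = frame_mse(features_in[250:], features_gap[250 + gap:])
```

Every frame after the gap is compared against the wrong reference frame, so the whole-file figure is large even though each decoded frame is exact once the offset is removed.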

tmiw · 2024-10-26T02:00:54Z

> Do you have a way to reproduce the howling issue using the command line tools? Any further examples from off-air recordings?

I was trying to figure out how I duplicated it before, and it looks like I mistyped the number of channels in the sox call. It should be 2, not 1, since radae_txe.py seems to return both real and imaginary components. The commands I used:

(radae-venv) MooneerMBP16158:radae mooneer$ cat features_in.f32 | python3 radae_txe.py --model model19_check3/checkpoints/checkpoint_epoch_100.pth | sox -t raw -e floating-point -b 32 -c 2 -r 8000 - -t raw -e floating-point -b 32 -c 2 -r 8000 - pad 0.25@5 > rx_gap.f32
encoder: 937200 weights
decoder: 907764 weights
encoder: 937200 weights
decoder: 907764 weights
Rs: 33.33 Rs': 50.00 Ts': 0.020 Nsmf: 120 Ns:   4 Nc:  30 M: 160 Ncp: 32
(radae-venv) MooneerMBP16158:radae mooneer$ cat rx_gap.f32 | python3 radae_rxe.py --model model19_check3/checkpoints/checkpoint_epoch_100.pth -v 1 > features_out_gap.f32
encoder: 937200 weights
decoder: 907764 weights
encoder: 937200 weights
decoder: 907764 weights
Rs: 33.33 Rs': 50.00 Ts': 0.020 Nsmf: 120 Ns:   4 Nc:  30 M: 160 Ncp: 32
Input BPF bandwidth: 1740.000162 centre: 1474.999994
  1 state: search     valid: 1 0  0 Dthresh:     2.44 Dtmax12:     5.32     0.00 tmax:  274 fmax:   0.00
  2 state: candidate  valid: 1 0  1 Dthresh:     5.44 Dtmax12:    10.10     0.00 tmax:  275 fmax:   0.00
  3 state: candidate  valid: 1 0  2 Dthresh:     6.10 Dtmax12:    10.56     0.00 tmax:  276 fmax:   0.00
  4 state: candidate  valid: 1 0  3 Dthresh:     6.12 Dtmax12:    10.55     0.00 tmax:  276 fmax:   0.00
 69 state: search     valid: 1 0  0 Dthresh:     5.94 Dtmax12:    10.55     2.20 tmax:  356 fmax:   0.00
 70 state: candidate  valid: 1 0  1 Dthresh:     5.88 Dtmax12:    10.52     2.20 tmax:  356 fmax:   0.00
 71 state: candidate  valid: 1 0  2 Dthresh:     5.84 Dtmax12:    10.64     2.20 tmax:  356 fmax:   0.00
 72 state: candidate  valid: 1 0  3 Dthresh:     5.86 Dtmax12:    10.66     2.20 tmax:  356 fmax:   0.00
(radae-venv) MooneerMBP16158:radae mooneer$ ./build/src/lpcnet_demo -fargan-synthesis features_out_gap.f32 - > rx.raw

I then did a raw file import into Audacity and got the following. That full scale audio segment matches the behavior I've been seeing in freedv-gui:

Hope this helps!
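The effect of the channel count on `pad 0.25@5` can be worked out directly, assuming the stream is interleaved float32 I/Q at 8 kHz (an assumption based on the two-channel observation above). sox measures position and length in frames of `channels` floats, so declaring one channel halves both in real time. `pad_in_stream_time` is a hypothetical helper.

```python
def pad_in_stream_time(pad_len_s, pad_pos_s, rate, declared_channels, actual_channels):
    """Where a sox-style pad really lands when the channel count is mis-declared.

    Returns (position_s, length_s) measured in the true stream's time.
    """
    floats_per_second = rate * actual_channels
    pos_floats = pad_pos_s * rate * declared_channels
    len_floats = pad_len_s * rate * declared_channels
    return pos_floats / floats_per_second, len_floats / floats_per_second


# -c 1 on what is really a 2-channel (I/Q) stream
wrong = pad_in_stream_time(0.25, 5, 8000, declared_channels=1, actual_channels=2)
# -c 2, matching the data
right = pad_in_stream_time(0.25, 5, 8000, declared_channels=2, actual_channels=2)
```

Under that assumption, `-c 1` produces a 0.125 s gap at 2.5 s, while `-c 2` produces the intended 0.25 s gap at 5 s.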

drowe67 · 2024-10-27T19:55:35Z

Thanks @tmiw - I can reproduce the issue here. It's definitely the RADE decoder: in a mesh plot of the features_out_gap vectors, the features are just zeroed out for the duration of the noise on the synthesized speech output. So the RADE decoder must be getting stuck in a weird state.

I'll dig in some more.

drowe67 · 2024-10-27T20:48:09Z

@tmiw - 4438249 has fixed the problem for me

tmiw · 2024-10-28T07:35:25Z

> @tmiw - 4438249 has fixed the problem for me

I built freedv-gui with that commit and it looks like the issue is fixed with the recorded OTA samples that have had the issue before. 👍

reset decoder states on resync

a50d25b

This was referenced Oct 19, 2024

freedv_2_0_0 sample drops outs drowe67/freedv-gui#750

Closed

freedv_2_0_0 RADE howling on silence input drowe67/freedv-gui#751

Closed

prevent /0 errors that were causing instability with gaps in audio

4438249
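The commit itself is not shown in this thread, so the following is only a generic illustration of the failure class its message names, not the actual change. Normalising by a measured level is a common place for a divide-by-zero: an all-zero input (a gap) has zero RMS. `normalise_unsafe`, `normalise`, and `eps` are hypothetical.

```python
import math


def normalise_unsafe(frame):
    """Scale a frame to unit RMS; fails on an all-zero frame."""
    rms = math.sqrt(sum(v * v for v in frame) / len(frame))
    return [v / rms for v in frame]          # ZeroDivisionError when rms == 0


def normalise(frame, eps=1e-12):
    """Same scaling, with a floor on the divisor so silence stays silence."""
    rms = math.sqrt(sum(v * v for v in frame) / len(frame))
    return [v / max(rms, eps) for v in frame]
```

In plain Python the unguarded version raises an exception. With NumPy or PyTorch arrays the same expression typically yields NaN or inf instead, which can then propagate silently into any state carried between frames.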

drowe67 merged commit 516f4e4 into main Oct 29, 2024
1 check passed
