Reset decoder states on resync #31

Merged
merged 2 commits into from
Oct 29, 2024

Conversation

drowe67
Owner

@drowe67 drowe67 commented Oct 16, 2024

During alpha testing of freedv-gui + RADE, Mooneer and Walter reported a howling sound from the decoder that could be cleared by re-initialising RADE. This may be due to the decoder being kicked into a bad state where it gets stuck. As a precaution, this PR resets the decoder states on re-sync.

TODO

  • Any other states in Rx we should reset, e.g. classical DSP code in dsp.py? Does anything go crazy when it gets zeros fed into it?
  • Can we reproduce the problem from the command line?
  • It's possible a set of inputs not seen in training could push the network into an undefined state, either on the tx (a certain speaker) or rx (certain channel noise) side. We should be able to trap that with an example.
  • Can we reset the FARGAN decoder states (external C library)?
  • (tmiw) Fix gap-in-rx-audio dropped sample bug in freedv-gui
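
To illustrate why a recurrent decoder might get stuck when fed zeros, and why a reset helps, here is a toy sketch. It is not RADE code: `ToyStatefulDecoder`, its single `tanh` unit, and the feedback weight of 2.0 are all assumptions, chosen so the state has a non-zero resting point. It only shows that such a state, once kicked, can persist on zero input, and that an explicit `reset()` clears it.

```python
import math


class ToyStatefulDecoder:
    """Toy one-unit recurrent 'decoder' (hypothetical, not the RADE model)."""

    def __init__(self, feedback=2.0):
        self.feedback = feedback  # feedback > 1 gives the unit non-zero resting points
        self.h = 0.0              # recurrent state carried between frames

    def step(self, x):
        # The new state depends on the previous state and the current input.
        self.h = math.tanh(self.feedback * self.h + x)
        return self.h

    def reset(self):
        # Return the recurrent state to its initial value.
        self.h = 0.0


dec = ToyStatefulDecoder()
dec.step(5.0)                                   # one unusual input kicks the state
stuck = [dec.step(0.0) for _ in range(50)][-1]  # then feed zeros: state stays high
dec.reset()                                     # what a resync hook would do
clean = [dec.step(0.0) for _ in range(50)][-1]  # zeros after reset: state stays at 0
print(f"stuck={stuck:.3f} clean={clean:.3f}")
```

In this toy the state never decays on its own, which is the kind of behaviour a reset on re-sync is meant to guard against.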

@drowe67
Owner Author

drowe67 commented Oct 16, 2024

@tmiw ☝️

@tmiw
Collaborator

tmiw commented Oct 17, 2024

This looks like it partially helps based on tests using the recording I sent over. I did notice that it could take a bit before it goes fully out of sync, though. I did try this change on my local copy to reduce the duration of the sound further, but I don't know if there are any negative side effects:

              if candidate:
                 self.valid_count = self.Nmf_unsync
              else:
                 self.valid_count -= 1
+                model.core_decoder_statefull.module.reset()
                 if unsync_enable and self.valid_count == 0:
                    next_state = "search"

Re: FARGAN reset, I tried calling fargan_init() and fargan_cont() again when sync goes from 0 to != 0 but that didn't seem to make any difference.

As for freedv-gui, I really do suspect it's related to doing tests on Wi-Fi rather than Ethernet (my Flex 6300 is connected via TCP/IP). I'm going to listen for a bit on the air this morning and see if I can get another RADE recording to look at.

@drowe67
Owner Author

drowe67 commented Oct 17, 2024

> As for freedv-gui, I really do suspect it's related to doing tests on Wi-Fi rather than Ethernet (my Flex 6300 is connected via TCP/IP). I'm going to listen for a bit on the air this morning and see if I can get another RADE recording to look at.

Yep - Wi-Fi sounds like a bad idea. As per email, I'd avoid OTA signals.

We really need a way to detect this problem automatically and give a go/no-go result, for example so we can test that Ethernet is working properly. Manual listening is tedious and won't pick up short gaps (a few tens of samples) that will kill the link but that we can't hear. A way for end users to test for and pick up this issue would be useful too, e.g. documentation or instructions for a simple listening test.
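
One possible approach to an automatic go/no-go check is to scan the received samples for runs of exact zeros. This is a sketch under assumptions: `find_zero_runs` and `min_run` are hypothetical names, and real dropouts may show up as repeated or discontinuous samples rather than exact zeros, which this would miss.

```python
def find_zero_runs(samples, min_run=20):
    """Return (start_index, length) for each run of >= min_run exact zeros."""
    runs = []
    start = None
    for i, s in enumerate(samples):
        if s == 0:
            if start is None:
                start = i          # a new run of zeros begins here
        elif start is not None:
            if i - start >= min_run:
                runs.append((start, i - start))
            start = None
    # Handle a run that extends to the end of the buffer.
    if start is not None and len(samples) - start >= min_run:
        runs.append((start, len(samples) - start))
    return runs


# Synthetic signal with a 30-sample gap inserted for the demonstration.
signal = [0.5, -0.5] * 100
signal[60:90] = [0.0] * 30
gaps = find_zero_runs(signal)
print("NO-GO" if gaps else "GO", gaps)
```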

I feel we need to nail the dropout issue first, then we can return to the howling issues with a known good signal. Another possibility is something on the tx side getting into a weird state. Once again - if we can reproduce the issue with a clean, dropout-free signal, we will have an easier time tracking it down. For example, we might see NaNs from the decoder, or a repeating sequence of output features, or the modem re-syncing for no reason.
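
The symptoms listed above could be trapped with a simple scan over the decoder's output feature vectors. The sketch below is illustrative only: `check_features` and `max_repeat` are hypothetical, and it takes frames as plain Python lists rather than reading the `.f32` files used by the RADE tools.

```python
import math


def check_features(frames, max_repeat=10):
    """Flag NaN values and long runs of identical consecutive feature vectors."""
    nan_frames = [i for i, f in enumerate(frames)
                  if any(math.isnan(v) for v in f)]
    longest = run = 1 if frames else 0
    for prev, cur in zip(frames, frames[1:]):
        # Count how many consecutive frames are exactly equal.
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return {
        "nan_frames": nan_frames,
        "longest_repeat": longest,
        "ok": not nan_frames and longest <= max_repeat,
    }


# Five changing frames, twelve identical frames, then one frame containing NaN.
frames = [[0.1 * i, 0.2] for i in range(5)] + [[1.0, 2.0]] * 12 + [[float("nan"), 0.0]]
result = check_features(frames)
print(result)
```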

> This looks like it partially helps based on tests using the recording I sent over. I did notice that it could take a bit before it goes fully out of sync, though.

There's a timer that counts errors over a few seconds and resets the sync state machine. You can see the error count in the rx logs. The trade-off is that we need to ride through fades without a re-sync - they will also drop out the channel for a few seconds.
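
The trade-off can be sketched as a countdown: each good modem frame reloads a counter, each bad one decrements it, and sync is dropped only when the counter reaches zero. The names below (`SyncTracker`, `n_unsync`, `on_unsync`) are hypothetical and the logic is simplified from the snippet quoted earlier in this thread; it is not the actual receiver code.

```python
class SyncTracker:
    """Simplified sync hysteresis: ride through short fades, drop sync on long gaps."""

    def __init__(self, n_unsync, on_unsync=None):
        self.n_unsync = n_unsync      # bad frames tolerated before giving up
        self.valid_count = n_unsync
        self.state = "sync"
        self.on_unsync = on_unsync    # e.g. a decoder reset hook

    def update(self, frame_valid):
        if self.state != "sync":
            # Simplification: a single valid frame re-acquires sync.
            if frame_valid:
                self.state = "sync"
                self.valid_count = self.n_unsync
            return self.state
        if frame_valid:
            self.valid_count = self.n_unsync
        else:
            self.valid_count -= 1
            if self.valid_count == 0:
                self.state = "search"
                if self.on_unsync:
                    self.on_unsync()
        return self.state


resets = []
tracker = SyncTracker(n_unsync=5, on_unsync=lambda: resets.append("reset"))
short_fade = [tracker.update(ok) for ok in [True, False, False, False, True]]
long_gap = [tracker.update(False) for _ in range(5)]
print(short_fade[-1], long_gap[-1], resets)
```

A larger `n_unsync` rides through longer fades, at the cost of taking longer to notice that the signal has really gone.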

@tmiw
Collaborator

tmiw commented Oct 17, 2024

If it helps, I'm able to duplicate the noise issue with the following:

Creating transmit audio file

  1. Open Audacity and create a 16 kHz mono WAV file (I just took my existing voice keyer file and duplicated the audio until I reached like 4-5 minutes or so in length).
  2. Go to File->New and set the record audio device to some sort of loopback device (e.g. the one created by sudo modprobe snd-aloop on Linux). Save and close the file created in (1) to avoid confusion.
  3. In FreeDV:
    a. Adjust the audio settings so that the output radio device is that loopback device. Disable all CAT control.
    b. Right-click on the Voice Keyer button and select "Use another voice keyer file..."
    c. Select the file created in (1) and click Open.
  4. Click the Record button in the empty document in Audacity and then push the Voice Keyer button in FreeDV.
  5. When the voice keyer finishes one transmit cycle, push the Voice Keyer button to stop it, then push Stop in Audacity.
  6. Review the recorded file in Audacity to verify that there are no gaps in the transmit audio.

Add gaps to audio

  1. In Audacity, press Ctrl-A to select all audio and then go to Tools->Regular Interval Labels (Note: you may need to install this plugin using Audacity's plugin manager first.)
  2. Use the following settings:
    • Create labels based on: Label Interval
    • Label interval (seconds): 10.0
    • Length of label region (seconds): 0.25 (This can be adjusted to test various scenarios. For example, RADE/FreeDV still seemed to behave okay at 0.1 seconds.)
  3. Click Apply.
  4. Press Ctrl-A again and go to Edit->Labeled Audio->Silence Audio.

Test decode

  1. In Audacity, change the playback device to point to the loopback audio device provided above.
  2. In FreeDV, change the input radio device to the loopback audio device. Choose RADE and press Start.
  3. Go back to Audacity and press Play. Listen for any artifacts (i.e. the howling noise).

Unfortunately it's not exactly automated but it's at least repeatable. In theory one could save the resulting audio from "Add gaps to audio" and use that in a RADE ctest or something. There might also be a way to use sox or something to automatically add the dropouts, too.
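
As a rough idea of how the "Add gaps to audio" step could be scripted, the sketch below zeroes a fixed-length region at regular intervals. Like Silence Audio, it replaces samples rather than inserting them, so the length is unchanged. `silence_regions` and its parameters are hypothetical, starting the first region at time zero is an assumption, and it works on an in-memory list rather than a WAV file to stay self-contained.

```python
def silence_regions(samples, rate, interval_s=10.0, gap_s=0.25):
    """Return a copy of samples with gap_s of silence at the start of every interval_s."""
    out = list(samples)
    interval = int(round(interval_s * rate))
    gap = int(round(gap_s * rate))
    # Assumption: the first silenced region starts at sample 0.
    for start in range(0, len(out), interval):
        end = min(start + gap, len(out))
        out[start:end] = [0.0] * (end - start)
    return out


# 25 s of constant signal at a toy rate of 100 samples/s.
tone = [1.0] * (25 * 100)
gapped = silence_regions(tone, rate=100)
print(len(gapped), gapped.count(0.0))
```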

Re: dropouts, I didn't see any obvious ones when I followed "Creating transmit audio file" above, either at light load (~90% idle) or after starting something like 14 yes >/dev/null processes (~0% idle).

BTW we might not even need to go that far to duplicate the howling noise. Simply stopping playback in Audacity (causing FreeDV to receive silence) triggered it, I think without even needing to add gaps in the audio first.

@tmiw
Collaborator

tmiw commented Oct 17, 2024

> We really need a way to detect this problem automatically and give a go/no-go result, for example so we can test that Ethernet is working properly. Manual listening is tedious and won't pick up short gaps (a few tens of samples) that will kill the link but that we can't hear. A way for end users to test for and pick up this issue would be useful too, e.g. documentation or instructions for a simple listening test.

I'm not fully sure how the use of Ethernet/some sort of reliable datalink could be reliably detected without introducing a lot of OS-specific dependencies (and potentially special radio-specific logic). Given that most users use USB connected radios, I suspect this can be deferred.

@tmiw
Collaborator

tmiw commented Oct 18, 2024

Figured out a possible way to duplicate the dropouts in a RADE ctest:

(radae-venv) MooneerMBP16158:radae mooneer$ ./inference.sh model19_check3/checkpoints/checkpoint_epoch_100.pth wav/brian_g8sez.wav /dev/null --rate_Fs --pilots --pilot_eq --eq_ls --cp 0.004 --bottleneck 3 --auxdata --write_rx rx.f32 --correct_freq_offset; cat features_in.f32 | python3 radae_txe.py --model model19_check3/checkpoints/checkpoint_epoch_100.pth --txbpf | sox -t raw -e floating-point -b 32 -c 1 -r 8000 - -t raw -e floating-point -b 32 -c 1 -r 8000 - pad 0.25@5 > rx.f32
encoder: 937200 weights
decoder: 907764 weights
encoder: 937200 weights
decoder: 907764 weights
Rs: 33.33 Rs': 50.00 Ts': 0.020 Nsmf: 120 Ns:   4 Nc:  30 M: 160 Ncp: 32
Processing: 972 feature vectors
          Eb/No   C/No     SNR3k  Rb'    Eq     PAPR
Target..: 100.00  133.01   98.24  3000
Measured:  97.45  132.22   97.45                 0.79
loss: 0.128 BER: 0.000
encoder: 937200 weights
decoder: 907764 weights
encoder: 937200 weights
decoder: 907764 weights
Rs: 33.33 Rs': 50.00 Ts': 0.020 Nsmf: 120 Ns:   4 Nc:  30 M: 160 Ncp: 32
Input BPF bandwidth: 1740.000162 centre: 1474.999994
sox WARN sox: `-' output clipped 3 samples; decrease volume?
(radae-venv) MooneerMBP16158:radae mooneer$ cat rx.f32 | python3 radae_rxe.py --model model19_check3/checkpoints/checkpoint_epoch_100.pth -v 1 > features_txs_out.f32
encoder: 937200 weights
decoder: 907764 weights
encoder: 937200 weights
decoder: 907764 weights
Rs: 33.33 Rs': 50.00 Ts': 0.020 Nsmf: 120 Ns:   4 Nc:  30 M: 160 Ncp: 32
Input BPF bandwidth: 1740.000162 centre: 1474.999994
  1 state: search     valid: 1 0  0 Dthresh:     2.16 Dtmax12:     5.17     0.00 tmax:  324 fmax:   0.00
  2 state: candidate  valid: 1 0  1 Dthresh:     4.98 Dtmax12:     8.25     0.00 tmax:  326 fmax:   0.00
  3 state: candidate  valid: 1 0  2 Dthresh:     5.74 Dtmax12:    10.27     0.00 tmax:  328 fmax:   0.00
  4 state: candidate  valid: 1 0  3 Dthresh:     5.79 Dtmax12:    10.26     0.00 tmax:  328 fmax:   0.00
 48 state: search     valid: 1 0  0 Dthresh:     5.50 Dtmax12:    10.20     1.89 tmax:  368 fmax:   0.00
 49 state: candidate  valid: 1 0  1 Dthresh:     5.51 Dtmax12:    10.26     1.89 tmax:  368 fmax:   0.00
 50 state: candidate  valid: 1 0  2 Dthresh:     5.64 Dtmax12:    10.29     1.89 tmax:  368 fmax:   0.00
 51 state: candidate  valid: 1 0  3 Dthresh:     5.70 Dtmax12:    10.26     1.89 tmax:  368 fmax:   0.00
(radae-venv) MooneerMBP16158:radae mooneer$ python3 loss.py features_in.f32 features_txs_out.f32 --loss_test 0.15 --acq_time_test 0.5 --clip_start 5
Loss between features_in.f32 and features_txs_out.f32
  loss: 2.546 start: 77 acq_time:  0.77 s
FAIL
(radae-venv) MooneerMBP16158:radae mooneer$

I'm not sure what the loss should be, but if I use 0.1 instead of 0.25 in the sox call:

(radae-venv) MooneerMBP16158:radae mooneer$ ./inference.sh model19_check3/checkpoints/checkpoint_epoch_100.pth wav/brian_g8sez.wav /dev/null --rate_Fs --pilots --pilot_eq --eq_ls --cp 0.004 --bottleneck 3 --auxdata --write_rx rx.f32 --correct_freq_offset; cat features_in.f32 | python3 radae_txe.py --model model19_check3/checkpoints/checkpoint_epoch_100.pth --txbpf | sox -t raw -e floating-point -b 32 -c 1 -r 8000 - -t raw -e floating-point -b 32 -c 1 -r 8000 - pad 0.1@5 > rx.f32
encoder: 937200 weights
decoder: 907764 weights
encoder: 937200 weights
decoder: 907764 weights
Rs: 33.33 Rs': 50.00 Ts': 0.020 Nsmf: 120 Ns:   4 Nc:  30 M: 160 Ncp: 32
Processing: 972 feature vectors
          Eb/No   C/No     SNR3k  Rb'    Eq     PAPR
Target..: 100.00  133.01   98.24  3000
Measured:  97.45  132.22   97.45                 0.79
loss: 0.128 BER: 0.000
encoder: 937200 weights
decoder: 907764 weights
encoder: 937200 weights
decoder: 907764 weights
Rs: 33.33 Rs': 50.00 Ts': 0.020 Nsmf: 120 Ns:   4 Nc:  30 M: 160 Ncp: 32
Input BPF bandwidth: 1740.000162 centre: 1474.999994
sox WARN sox: `-' output clipped 3 samples; decrease volume?
(radae-venv) MooneerMBP16158:radae mooneer$ cat rx.f32 | python3 radae_rxe.py --model model19_check3/checkpoints/checkpoint_epoch_100.pth -v 1 > features_txs_out.f32
encoder: 937200 weights
decoder: 907764 weights
encoder: 937200 weights
decoder: 907764 weights
Rs: 33.33 Rs': 50.00 Ts': 0.020 Nsmf: 120 Ns:   4 Nc:  30 M: 160 Ncp: 32
Input BPF bandwidth: 1740.000162 centre: 1474.999994
  1 state: search     valid: 1 0  0 Dthresh:     2.16 Dtmax12:     5.17     0.00 tmax:  324 fmax:   0.00
  2 state: candidate  valid: 1 0  1 Dthresh:     4.98 Dtmax12:     8.25     0.00 tmax:  326 fmax:   0.00
  3 state: candidate  valid: 1 0  2 Dthresh:     5.74 Dtmax12:    10.27     0.00 tmax:  328 fmax:   0.00
  4 state: candidate  valid: 1 0  3 Dthresh:     5.79 Dtmax12:    10.26     0.00 tmax:  328 fmax:   0.00
 29 state: search     valid: 1 0 19 Dthresh:     5.57 Dtmax12:    10.22     2.03 tmax:  728 fmax:   0.00
 30 state: candidate  valid: 1 0  1 Dthresh:     5.62 Dtmax12:    10.22     2.03 tmax:  728 fmax:   0.00
 31 state: candidate  valid: 1 0  2 Dthresh:     5.62 Dtmax12:    10.30     2.03 tmax:  728 fmax:   0.00
 32 state: candidate  valid: 1 0  3 Dthresh:     5.59 Dtmax12:    10.30     2.03 tmax:  728 fmax:   0.00
(radae-venv) MooneerMBP16158:radae mooneer$ python3 loss.py features_in.f32 features_txs_out.f32 --loss_test 0.15 --acq_time_test 0.5 --clip_start 5
Loss between features_in.f32 and features_txs_out.f32
  loss: 1.105 start: 89 acq_time:  0.89 s
FAIL
(radae-venv) MooneerMBP16158:radae mooneer$ 

and 0:

(radae-venv) MooneerMBP16158:radae mooneer$ ./inference.sh model19_check3/checkpoints/checkpoint_epoch_100.pth wav/brian_g8sez.wav /dev/null --rate_Fs --pilots --pilot_eq --eq_ls --cp 0.004 --bottleneck 3 --auxdata --write_rx rx.f32 --correct_freq_offset; cat features_in.f32 | python3 radae_txe.py --model model19_check3/checkpoints/checkpoint_epoch_100.pth --txbpf | sox -t raw -e floating-point -b 32 -c 1 -r 8000 - -t raw -e floating-point -b 32 -c 1 -r 8000 - pad 0@5 > rx.f32
encoder: 937200 weights
decoder: 907764 weights
encoder: 937200 weights
decoder: 907764 weights
Rs: 33.33 Rs': 50.00 Ts': 0.020 Nsmf: 120 Ns:   4 Nc:  30 M: 160 Ncp: 32
Processing: 972 feature vectors
          Eb/No   C/No     SNR3k  Rb'    Eq     PAPR
Target..: 100.00  133.01   98.24  3000
Measured:  97.45  132.22   97.45                 0.79
loss: 0.128 BER: 0.000
encoder: 937200 weights
decoder: 907764 weights
encoder: 937200 weights
decoder: 907764 weights
Rs: 33.33 Rs': 50.00 Ts': 0.020 Nsmf: 120 Ns:   4 Nc:  30 M: 160 Ncp: 32
Input BPF bandwidth: 1740.000162 centre: 1474.999994
sox WARN sox: `-' output clipped 3 samples; decrease volume?
(radae-venv) MooneerMBP16158:radae mooneer$ cat rx.f32 | python3 radae_rxe.py --model model19_check3/checkpoints/checkpoint_epoch_100.pth -v 1 > features_txs_out.f32
encoder: 937200 weights
decoder: 907764 weights
encoder: 937200 weights
decoder: 907764 weights
Rs: 33.33 Rs': 50.00 Ts': 0.020 Nsmf: 120 Ns:   4 Nc:  30 M: 160 Ncp: 32
Input BPF bandwidth: 1740.000162 centre: 1474.999994
  1 state: search     valid: 1 0  0 Dthresh:     2.16 Dtmax12:     5.17     0.00 tmax:  324 fmax:   0.00
  2 state: candidate  valid: 1 0  1 Dthresh:     4.98 Dtmax12:     8.25     0.00 tmax:  326 fmax:   0.00
  3 state: candidate  valid: 1 0  2 Dthresh:     5.74 Dtmax12:    10.27     0.00 tmax:  328 fmax:   0.00
  4 state: candidate  valid: 1 0  3 Dthresh:     5.79 Dtmax12:    10.26     0.00 tmax:  328 fmax:   0.00
(radae-venv) MooneerMBP16158:radae mooneer$ python3 loss.py features_in.f32 features_txs_out.f32 --loss_test 0.15 --acq_time_test 0.5 --clip_start 5
Loss between features_in.f32 and features_txs_out.f32
  loss: 0.130 start: 41 acq_time:  0.41 s
PASS
(radae-venv) MooneerMBP16158:radae mooneer$

@drowe67
Owner Author

drowe67 commented Oct 24, 2024

Hi @tmiw - thanks for working on those demos. I can't seem to reproduce the "OP" howling bug. There are some transient issues when a gap is introduced as:

  • the decoder won't know sync is lost for a few seconds, so you'll get R2D2
  • it will then have to re-sync

I'm wondering if the number of gaps in the OP sample was so large that it caused continual issues. This does highlight the need to make sure the audio stream (on tx and rx) is gap-free - gaps are death to any modern mode like 700X or RADE.

Do you have a way to reproduce the howling issue using the command line tools? Any further examples from off-air recordings?

I tried your example above:

./inference.sh model19_check3/checkpoints/checkpoint_epoch_100.pth wav/brian_g8sez.wav /dev/null --rate_Fs --pilots --pilot_eq --eq_ls --cp 0.004 --bottleneck 3 --auxdata --write_rx rx.f32 --correct_freq_offset
cat features_in.f32 | python3 radae_txe.py --model model19_check3/checkpoints/checkpoint_epoch_100.pth | sox -t raw -e floating-point -b 32 -c 1 -r 8000 - -t raw -e floating-point -b 32 -c 1 -r 8000 - pad 0.25@5 > rx_gap.f32
cat rx_gap.f32 | python3 radae_rxe.py --model model19_check3/checkpoints/checkpoint_epoch_100.pth -v 1 > features_out_gap.f32

Then checked out the loss, and listened with:

python3 loss.py features_in.f32 features_out.f32 --features_hat2 features_out_gap.f32 --plot
./build/src/lpcnet_demo -fargan-synthesis features_out_gap.f32 - | aplay -f S16_LE -r 16000

It sounds OK, except for some transients due to the gap and re-sync. The key point is that it recovers - my understanding of the bug was that it led to some sort of long-term instability?

The loss from the features_out_gap.f32 was indeed high, but that's because the gap breaks the time alignment the tool depends on. So if you have gaps in the audio, loss.py breaks and can't be used.
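
A small numerical sketch of why an inserted gap inflates a frame-by-frame comparison: after the insertion point, every output frame is compared against the wrong input frame. `mean_sq_err` is a hypothetical stand-in for such a comparison and is not the metric implemented in loss.py.

```python
def mean_sq_err(a, b):
    """Mean squared difference over the overlapping part of two sequences."""
    n = min(len(a), len(b))
    return sum((x - y) ** 2 for x, y in zip(a[:n], b[:n])) / n


# A toy feature track that changes every frame.
ref = [float(i % 7) for i in range(200)]

# Same track with 3 extra frames inserted at frame 50 (time alignment is lost).
shifted = ref[:50] + [0.0] * 3 + ref[50:]

# Same track with 3 frames overwritten at frame 50 (time alignment is preserved).
overwritten = ref[:50] + [0.0] * 3 + ref[53:]

err_same = mean_sq_err(ref, ref)
err_overwritten = mean_sq_err(ref, overwritten)
err_shifted = mean_sq_err(ref, shifted)
print(err_same, err_overwritten, err_shifted)
```

The overwritten track differs in only three frames, while the shifted track disagrees in every frame after the insertion, even though both contain the same amount of silence.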

@tmiw
Collaborator

tmiw commented Oct 26, 2024

> Do you have a way to reproduce the howling issue using the command line tools? Any further examples from off-air recordings?

I was trying to figure out how I duplicated it before and it looks like I mistyped the number of channels in the sox call. It should be 2, not 1, since radae_txe.py seems to return both real and imaginary components. The commands I used:

(radae-venv) MooneerMBP16158:radae mooneer$ cat features_in.f32 | python3 radae_txe.py --model model19_check3/checkpoints/checkpoint_epoch_100.pth | sox -t raw -e floating-point -b 32 -c 2 -r 8000 - -t raw -e floating-point -b 32 -c 2 -r 8000 - pad 0.25@5 > rx_gap.f32
encoder: 937200 weights
decoder: 907764 weights
encoder: 937200 weights
decoder: 907764 weights
Rs: 33.33 Rs': 50.00 Ts': 0.020 Nsmf: 120 Ns:   4 Nc:  30 M: 160 Ncp: 32
(radae-venv) MooneerMBP16158:radae mooneer$ cat rx_gap.f32 | python3 radae_rxe.py --model model19_check3/checkpoints/checkpoint_epoch_100.pth -v 1 > features_out_gap.f32
encoder: 937200 weights
decoder: 907764 weights
encoder: 937200 weights
decoder: 907764 weights
Rs: 33.33 Rs': 50.00 Ts': 0.020 Nsmf: 120 Ns:   4 Nc:  30 M: 160 Ncp: 32
Input BPF bandwidth: 1740.000162 centre: 1474.999994
  1 state: search     valid: 1 0  0 Dthresh:     2.44 Dtmax12:     5.32     0.00 tmax:  274 fmax:   0.00
  2 state: candidate  valid: 1 0  1 Dthresh:     5.44 Dtmax12:    10.10     0.00 tmax:  275 fmax:   0.00
  3 state: candidate  valid: 1 0  2 Dthresh:     6.10 Dtmax12:    10.56     0.00 tmax:  276 fmax:   0.00
  4 state: candidate  valid: 1 0  3 Dthresh:     6.12 Dtmax12:    10.55     0.00 tmax:  276 fmax:   0.00
 69 state: search     valid: 1 0  0 Dthresh:     5.94 Dtmax12:    10.55     2.20 tmax:  356 fmax:   0.00
 70 state: candidate  valid: 1 0  1 Dthresh:     5.88 Dtmax12:    10.52     2.20 tmax:  356 fmax:   0.00
 71 state: candidate  valid: 1 0  2 Dthresh:     5.84 Dtmax12:    10.64     2.20 tmax:  356 fmax:   0.00
 72 state: candidate  valid: 1 0  3 Dthresh:     5.86 Dtmax12:    10.66     2.20 tmax:  356 fmax:   0.00
(radae-venv) MooneerMBP16158:radae mooneer$ ./build/src/lpcnet_demo -fargan-synthesis features_out_gap.f32 - > rx.raw

I then did a raw file import into Audacity and got the following. That full scale audio segment matches the behavior I've been seeing in freedv-gui:

[image: Audacity view of the imported rx.raw, showing the full-scale audio segment]

Hope this helps!
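
The effect of the channel count on `pad 0.25@5` can be worked through with simple arithmetic. This sketch assumes the stream is interleaved real/imaginary float32 values at 8000 complex samples per second, which is what the `-c 2 -r 8000` options above imply; `pad_in_complex_time` is a hypothetical helper, not part of sox or RADE.

```python
def pad_in_complex_time(length_s, position_s, rate=8000, sox_channels=2):
    """Return (gap seconds, position seconds) as seen by a complex I/Q reader.

    sox counts frames of `sox_channels` floats each; the reader consumes
    2 floats (real, imaginary) per complex sample.
    """
    floats_per_complex = 2
    gap_floats = int(round(length_s * rate)) * sox_channels
    pos_floats = int(round(position_s * rate)) * sox_channels
    return (gap_floats / floats_per_complex / rate,
            pos_floats / floats_per_complex / rate)


mono = pad_in_complex_time(0.25, 5, sox_channels=1)
stereo = pad_in_complex_time(0.25, 5, sox_channels=2)
print(mono)    # (0.125, 2.5)
print(stereo)  # (0.25, 5.0)
```

Under these assumptions, `-c 1` would give a gap half as long and at half the intended time, whereas `-c 2` gives the intended 0.25 s gap at 5 s.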

@drowe67
Owner Author

drowe67 commented Oct 27, 2024

Thanks @tmiw - I can reproduce the issue here. It's definitely the RADE decoder. Here is a mesh plot of the features_out_gap vectors: they are just zeroed out for the duration of the noise on the synthesized speech output. So the RADE decoder must be getting stuck in a weird state.

[Screenshot from 2024-10-28 06-23-26: mesh plot of the features_out_gap vectors]

I'll dig in some more.

@drowe67
Owner Author

drowe67 commented Oct 27, 2024

@tmiw - 4438249 has fixed the problem for me

@tmiw
Collaborator

tmiw commented Oct 28, 2024

> @tmiw - 4438249 has fixed the problem for me

I built freedv-gui with that commit and it looks like the issue is fixed with the recorded OTA samples that have had the issue before. 👍

@drowe67 drowe67 merged commit 516f4e4 into main Oct 29, 2024
1 check passed