LLM user frame processor with tests #703

Draft · aconchillo wants to merge 5 commits into main from aleix/edgar/llm-user-frame-processor
Conversation

@aconchillo (Contributor) commented Nov 9, 2024

This PR is based on #450 and fixes the LLM context aggregators to improve VAD and user transcription ordering issues.

@skandere

Hey @aconchillo how is it going? any updates on when you think this PR will be merged?

@skandere

Hello team! @aconchillo @marcus-daily , any ETA on when you think this PR could be merged? I've been using a forked repo with this fix, works super well. Would love to go back to the main repo!

@aconchillo (Contributor, Author)

There are a couple of use cases (the ones where we interrupt) that I'm not sure about and have actually never run into. One of the cases is:

S E T1 T2

In this case we get 2 transcriptions after UserStoppedSpeaking. In this PR implementation (which is still the original one from @EdgarMMProsper), when we get T2 we interrupt the bot. It's actually weird, because if T1 was a final transcription we shouldn't really get another final transcription without another UserStartedSpeaking-UserStoppedSpeaking, but I guess it can happen? Interrupting the bot from the context aggregator is something I'd like to avoid since I don't think it's the right place. In general we don't want to send random interruptions from processors to fix some other issue (i.e. we want separation of concerns for each processor as much as possible).
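
To make the scenario concrete, here is a minimal sketch of the interrupt-on-T2 behavior described above (not the actual PR code; the frame and class names are simplified stand-ins):

```python
# Stand-ins for the real Pipecat frame types.
class UserStartedSpeakingFrame: pass
class UserStoppedSpeakingFrame: pass
class BotInterruptionFrame: pass

class TranscriptionFrame:
    def __init__(self, text: str):
        self.text = text


class UserAggregatorSketch:
    """S E T1 T2: a final transcription (T2) arriving after
    UserStoppedSpeaking, once an aggregation (T1) has already been
    pushed downstream, triggers a bot interruption."""

    def __init__(self):
        self._user_speaking = False
        self._aggregation_pushed = False

    async def process_frame(self, frame, push):
        if isinstance(frame, UserStartedSpeakingFrame):
            self._user_speaking = True
            self._aggregation_pushed = False
        elif isinstance(frame, UserStoppedSpeakingFrame):
            self._user_speaking = False
        elif isinstance(frame, TranscriptionFrame):
            if not self._user_speaking and self._aggregation_pushed:
                # Late transcription (T2) after T1 was already sent
                # downstream: interrupt the bot before adding the new text.
                await push(BotInterruptionFrame())
            await push(frame)
            self._aggregation_pushed = True
```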

@EdgarMMProsper

> There are a couple of use cases (the ones where we interrupt) that I'm not sure about and have actually never run into. One of the cases is:
>
> S E T1 T2
>
> In this case we get 2 transcriptions after UserStoppedSpeaking. In this PR implementation (which is still the original one from @EdgarMMProsper), when we get T2 we interrupt the bot. It's actually weird, because if T1 was a final transcription we shouldn't really get another final transcription without another UserStartedSpeaking-UserStoppedSpeaking, but I guess it can happen? Interrupting the bot from the context aggregator is something I'd like to avoid since I don't think it's the right place. In general we don't want to send random interruptions from processors to fix some other issue (i.e. we want separation of concerns for each processor as much as possible).

In our experience this does happen quite a lot. If you do not want to send interruptions from this processor (I agree it is not very clean), you can implement similar logic in the input processor directly.

@aconchillo (Contributor, Author)

> There are a couple of use cases (the ones where we interrupt) that I'm not sure about and have actually never run into. One of the cases is:
>
> S E T1 T2
>
> In this case we get 2 transcriptions after UserStoppedSpeaking. In this PR implementation (which is still the original one from @EdgarMMProsper), when we get T2 we interrupt the bot. It's actually weird, because if T1 was a final transcription we shouldn't really get another final transcription without another UserStartedSpeaking-UserStoppedSpeaking, but I guess it can happen? Interrupting the bot from the context aggregator is something I'd like to avoid since I don't think it's the right place. In general we don't want to send random interruptions from processors to fix some other issue (i.e. we want separation of concerns for each processor as much as possible).
>
> In our experience this does happen quite a lot. If you do not want to send interruptions from this processor (I agree it is not very clean), you can implement similar logic in the input processor directly.

Interesting. Well, I guess we can give this a try 😃. I have some local changes (e.g. pushing BotInterruptionFrame upstream instead) and some unit test fixes, because many internal things have changed since this PR was originally written. I should be able to finish it today/tomorrow.

@aconchillo (Contributor, Author)

@EdgarMMProsper @skandere In the case discussed above:

S E T1 T2

what is an example of the contents of T1 and T2? Why do we need to interrupt instead of simply queuing T2?

@aconchillo (Contributor, Author)

> @EdgarMMProsper @skandere In the case discussed above:
>
> S E T1 T2
>
> what is an example of the contents of T1 and T2? Why do we need to interrupt instead of simply queuing T2?

I was actually reading the original commit and `sent_aggregation_after_last_interruption` is actually never set to True, which means interruptions will never happen. So, I guess my question above is still valid.
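
Schematically, the pattern being described is something like this (a rough reconstruction for illustration, not the actual commit code):

```python
class BotInterruptionFrame: pass  # stand-in, as in the sketch above

class AggregatorSketch:
    def __init__(self):
        # In the original commit the flag starts False and, as noted above,
        # nothing ever sets it to True.
        self._sent_aggregation_after_last_interruption = False

    async def _maybe_interrupt(self, push):
        # The interruption path is gated on this flag, so if the flag is
        # never set to True this branch can never run and no interruption
        # is ever pushed.
        if self._sent_aggregation_after_last_interruption:
            await push(BotInterruptionFrame())
            self._sent_aggregation_after_last_interruption = False
```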

@EdgarMMProsper

> @EdgarMMProsper @skandere In the case discussed above:
>
> S E T1 T2
>
> what is an example of the contents of T1 and T2? Why do we need to interrupt instead of simply queuing T2?

T1: start of the sentence. T2: end of the sentence. Both things said between S and E but received a bit late because of STT latency. If you do not interrupt, the bot will generate two very similar outputs, "repeating" itself.

@aconchillo (Contributor, Author)

> @EdgarMMProsper @skandere In the case discussed above:
>
> S E T1 T2
>
> what is an example of the contents of T1 and T2? Why do we need to interrupt instead of simply queuing T2?
>
> T1: start of the sentence. T2: end of the sentence. Both things said between S and E but received a bit late because of STT latency. If you do not interrupt, the bot will generate two very similar outputs, "repeating" itself.

It seems the series 22 foundational examples would solve this issue without needing to interrupt. I was assuming T1 and T2 would be two different sentences, in which case the interruptions didn't make sense. But if it's the same sentence, then those examples might work better than interrupting. What if T1 has more context than T2?
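
For what it's worth, the non-interrupting alternative would just fold a late transcription into the previous user message, something like this (purely illustrative, assuming an OpenAI-style messages list):

```python
def append_late_transcription(messages: list[dict], text: str) -> None:
    """Merge a transcription that arrives late (T2) into the last user
    message (T1) instead of interrupting the bot."""
    if messages and messages[-1]["role"] == "user":
        messages[-1]["content"] = (messages[-1]["content"] + " " + text).strip()
    else:
        messages.append({"role": "user", "content": text})


# Example: T1 and T2 are two halves of the same sentence.
context = [{"role": "user", "content": "Can you book me a flight"}]
append_late_transcription(context, "to Barcelona next Friday?")
# context[-1]["content"] == "Can you book me a flight to Barcelona next Friday?"
```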

@aconchillo (Contributor, Author) commented Dec 22, 2024

> @EdgarMMProsper @skandere In the case discussed above:
>
> S E T1 T2
>
> what is an example of the contents of T1 and T2? Why do we need to interrupt instead of simply queuing T2?
>
> T1: start of the sentence. T2: end of the sentence. Both things said between S and E but received a bit late because of STT latency. If you do not interrupt, the bot will generate two very similar outputs, "repeating" itself.
>
> It seems the series 22 foundational examples would solve this issue without needing to interrupt. I was assuming T1 and T2 would be two different sentences, in which case the interruptions didn't make sense. But if it's the same sentence, then those examples might work better than interrupting. What if T1 has more context than T2?

I've decided I will make this a constructor argument. So you will have the option to interrupt the bot if you want to. But I have the feeling the series 22 examples (or any other approach) would work better.
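
A rough sketch of what that constructor argument could look like (the parameter name and default are illustrative, not the final API):

```python
class LLMUserContextAggregator:  # class name as in the discussion; details illustrative
    def __init__(self, context, *, interrupt_on_late_transcription: bool = False):
        # Hypothetical flag: when True, a final transcription arriving after
        # UserStoppedSpeaking pushes a BotInterruptionFrame before the new
        # aggregation; when False, the late text is simply queued.
        self._context = context
        self._interrupt_on_late_transcription = interrupt_on_late_transcription
```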

@aconchillo force-pushed the aleix/edgar/llm-user-frame-processor branch 3 times, most recently from bf78cee to 569d10a on December 24, 2024 02:04
This seems necessary because of how pytest works. If a task is cancelled, pytest will know the task has been cancelled even if `asyncio.CancelledError` is handled internally in the task.
@aconchillo force-pushed the aleix/edgar/llm-user-frame-processor branch from 569d10a to eef3f32 on December 24, 2024 02:15
@kwindla (Contributor) left a comment


lgtm

@aconchillo marked this pull request as draft on December 24, 2024 03:10
@aconchillo (Contributor, Author)

Alright, I have been playing with this a bit. I wanted to add it to 0.0.52, but it's not going to happen. We will rethink LLM aggregators in January, also considering how multi-modal models work. For now, I'll park this.
