
Refactor dataprep multimedia2text #1065

Open · wants to merge 1 commit into main

Conversation

@Spycsh (Member) commented Dec 24, 2024

Description

Refactor dataprep multimedia2text

  • audio2text is largely duplicated with the existing ASR component
  • video2audio is simple and can be moved to the preparation stage in the example DocSum
  • reduce the number of microservices to save resources
  • use ffmpeg for faster video2audio conversion (and leverage Gaudi acceleration in the future)

Issues

na

Type of change

List the type of change like below. Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds new functionality)
  • Breaking change (fix or feature that would break existing design and interface)
  • Others (enhancement, documentation, validation, etc.)

Dependencies

na

Tests

na

@Spycsh (Member, Author) commented Dec 24, 2024

Hello @MSCetin37, we need to refactor the multimedia2text component and the relevant example DocSum. First, thanks for your previous contribution of multimedia2text. After rechecking your code, we found that audio2text largely duplicates the existing ASR component, and that video2audio can be replaced with a simple ffmpeg command that converts video to audio. After rechecking DocSum, the only example that requires this multimedia2text functionality, we found that we can refactor the logic with minimal extra hardware resources, as follows:

    def add_remote_service(self):
        asr = MicroService(
            name="asr",
            host=ASR_SERVICE_HOST_IP,
            port=ASR_SERVICE_PORT,
            endpoint="/v1/audio/transcriptions",
            use_remote_service=True,
            service_type=ServiceType.ASR,
        )

        llm = MicroService(
            name="llm",
            host=LLM_SERVICE_HOST_IP,
            port=LLM_SERVICE_PORT,
            endpoint="/v1/chat/docsum",
            use_remote_service=True,
            service_type=ServiceType.LLM,
        )

        self.megaservice.add(asr).add(llm)
        self.megaservice.flow_to(asr, llm)

    async def handle_request(self, request: Request, files: List[UploadFile] = File(default=None)):
        # 1. If the file is audio, run megaservice.schedule directly.
        # 2. If the file is video, first run a simple
        #    'ffmpeg -i input_video.mp4 -q:a 0 -map a input_video.mp3', then run megaservice.schedule.
        # 3. If the input is text, skip the ASR node in the runtime graph and run megaservice.schedule.
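
To make this routing concrete, here is a minimal sketch of how handle_request could branch on the input type. The helper video2audio_ffmpeg, the base64 byte_str/query field names, and the exact megaservice.schedule call are illustrative assumptions, not the final DocSum implementation.

import base64
import os
import subprocess
import tempfile
from typing import List

from fastapi import File, Request, UploadFile

VIDEO_EXTS = {".mp4", ".mkv", ".mov", ".avi"}


def video2audio_ffmpeg(video_path: str, audio_path: str) -> None:
    """Hypothetical helper: extract the audio track from a video with ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-q:a", "0", "-map", "a", audio_path],
        check=True,
    )


async def handle_request(self, request: Request, files: List[UploadFile] = File(default=None)):
    if files:
        upload = files[0]
        data = await upload.read()
        ext = os.path.splitext(upload.filename)[1].lower()
        if ext in VIDEO_EXTS:
            # Case 2: video -> extract the audio track locally, then feed it to ASR -> LLM.
            with tempfile.NamedTemporaryFile(suffix=ext, delete=False) as vf:
                vf.write(data)
            audio_path = os.path.splitext(vf.name)[0] + ".mp3"
            video2audio_ffmpeg(vf.name, audio_path)
            with open(audio_path, "rb") as af:
                data = af.read()
        # Case 1 (and converted case 2): schedule the full ASR -> LLM graph.
        # The "byte_str" field name and base64 encoding are assumed here.
        return await self.megaservice.schedule(initial_inputs={"byte_str": base64.b64encode(data).decode()})
    # Case 3: plain text -> bypass ASR; the exact mechanism for skipping the ASR node
    # in the runtime graph depends on the orchestrator API and is not shown here.
    body = await request.json()
    return await self.megaservice.schedule(initial_inputs={"query": body.get("messages", "")})

Whichever shape the final code takes, only the video-to-audio conversion runs locally; everything else is the existing ASR -> LLM megaservice flow.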

This will have the following advantages:

  • remove the duplicate part of audio2text
  • move the video2audio to the preparation stage in the example DocSum
  • reduce the number of microservices to save resources
  • use ffmpeg for faster video2audio conversion (and leverage Gaudi acceleration in the future)

I will open a PR for this on the GenAIExamples side later. @MSCetin37, if you have any suggestions, please don't hesitate to tell me!

@Spycsh (Member, Author) commented Dec 24, 2024

Here is the relevant refactor PR in the examples repo: opea-project/GenAIExamples#1286.

@MSCetin37 (Contributor) commented Dec 24, 2024

@Spycsh
Thanks for letting me know about the next phase of the implementation. I have a few concerns:

> * Remove the duplicate part of audio2text

  • The response of the ASR does not align with the input of the LLM service. That was the reason why we implemented the audio2text service, which rearranges the response of the Whisper service so the results can be used as input to the LLM service. You might need to implement this logic inside the async function or multimedia2text service.

> * Move the video2audio to the preparation stage in the example DocSum

  • Video2audio is a simple service and might not require a dedicated service. However, we were planning to use this service for other examples as well, such as the MultiModal Q&A example.

My overall thoughts are that implementing a service (similar to multimedia2text) that can convert any data domain to a targeted domain will simplify and expand the scope of the implementations in OPEA.
For example:

  • Video to audio
  • Audio to text
  • Text to audio (speech)
  • PDF/doc to audio (speech)
  • etc.

@Spycsh (Member, Author) commented Dec 24, 2024

Hi @MSCetin37,

> The response of the ASR does not align with the input of the LLM service. That was the reason why we implemented the audio2text service, which rearranges the response of the Whisper service so the results can be used as input to the LLM service. You might need to implement this logic inside the async function or multimedia2text service.

Yes, as you may have noticed, we leverage the align_input hook in the megaservice to align the inputs; https://github.com/opea-project/GenAIExamples/blob/main/DocSum/docsum.py#L35 is an example.
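
For illustration, here is a minimal sketch of what such an alignment hook could look like, assuming a ServiceOrchestrator-style align_inputs callback and an ASR response field named asr_result; the actual hook signature and field names in DocSum may differ.

from comps import ServiceType  # same import family as the snippet above


def align_inputs(self, inputs, cur_node, runtime_graph, llm_parameters_dict, **kwargs):
    """Hypothetical alignment hook: reshape the ASR (Whisper) response into LLM input."""
    if self.megaservice.services[cur_node].service_type == ServiceType.LLM and "asr_result" in inputs:
        # Map the transcription text to the field the LLM summarization endpoint expects.
        inputs["query"] = inputs.pop("asr_result")
    return inputs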

> Video2audio is a simple service and might not require a dedicated service. However, we were planning to use this service for other examples as well, such as the MultiModal Q&A example.

Sure. However, I think video2audio is very lightweight and can easily be replaced with 'ffmpeg -i input_video.mp4 -q:a 0 -map a input_video.mp3'. Also, separating it out as a standalone microservice does not really improve the pipeline (it does not need to be scaled out); it only increases latency and occupies hardware resources. In https://github.com/opea-project/GenAIExamples/pull/1286/files#diff-53767fc53b41e36c885b4416f3bbedebedae4f26be0ce616668ef368818fa3eeR47 I replace the original functionality with this very simple implementation.

> My overall thoughts are that implementing a service (similar to multimedia2text) that can convert any data domain to a targeted domain will simplify and expand the scope of the implementations in OPEA.

I agree with you if what you mean is adding a multimedia2text component that serves as a wrapper (or controller) to access the existing whisper (ASR) / speecht5 (TTS) / video2audio (local ffmpeg or moviepy conversion) functionality, and potentially other multimedia-related functionality. I think it would be good for the future MultiModal Q&A. The name also does not have to be exactly "multimedia2text"; it could be something else (multimediaprocessing?). A rough sketch of this wrapper idea is below.
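
As a rough illustration only (the class name, endpoints, and payload shapes are hypothetical and do not follow the new component.py API):

# Hypothetical sketch of a "multimediaprocessing" wrapper that dispatches media
# conversions to existing backends: remote ASR/TTS microservices or local ffmpeg.
import subprocess

import requests  # assumed HTTP client for the remote endpoints


class MultimediaProcessor:
    def __init__(self, asr_endpoint: str, tts_endpoint: str):
        self.asr_endpoint = asr_endpoint  # e.g. the whisper/ASR transcription endpoint
        self.tts_endpoint = tts_endpoint  # e.g. the speecht5/TTS synthesis endpoint

    def video_to_audio(self, video_path: str, audio_path: str) -> str:
        # Local conversion; no dedicated microservice needed.
        subprocess.run(["ffmpeg", "-y", "-i", video_path, "-q:a", "0", "-map", "a", audio_path], check=True)
        return audio_path

    def audio_to_text(self, audio_path: str) -> str:
        # Delegate to the existing ASR microservice (payload shape assumed).
        with open(audio_path, "rb") as f:
            resp = requests.post(self.asr_endpoint, files={"file": f})
        resp.raise_for_status()
        return resp.json().get("text", "")

    def text_to_audio(self, text: str) -> bytes:
        # Delegate to the existing TTS microservice (payload shape assumed).
        resp = requests.post(self.tts_endpoint, json={"input": text})
        resp.raise_for_status()
        return resp.content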

There is some refactoring work that I have to do this week, so do you agree we keep it simple for now? The hard requirement is that dataprep should not contain audio2text and video2audio. I will make the change based on our current design, and later I will sync with you about adding a new "multimedia2text" component (its scope and name), following the new controller logic in https://github.com/opea-project/GenAIComps/blob/refactor_comps/comps/cores/common/component.py#L11. Do you think this is reasonable?

@MSCetin37 (Contributor)

Sounds good to me.
