
Refactor dataprep multimedia2text #1065

Open · wants to merge 1 commit into main

Conversation

@Spycsh (Member) commented Dec 24, 2024

Description

Refactor dataprep multimedia2text

  • audio2text is largely duplicated with the existing ASR component
  • video2audio is simple and can be moved to the preparation stage in the example DocSum
  • reduce the number of microservices to save resources
  • use ffmpeg for faster video2audio conversion (and leverage Gaudi acceleration in the future)

Issues

na

Type of change

List the type of change like below. Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds new functionality)
  • Breaking change (fix or feature that would break existing design and interface)
  • Others (enhancement, documentation, validation, etc.)

Dependencies

na

Tests

na

@Spycsh (Member, Author) commented Dec 24, 2024

Hello @MSCetin37, we need to refactor the multimedia2text component and the relevant example DocSum. First, thanks for your previous contribution of multimedia2text. After rechecking your code, we found that audio2text largely duplicates the existing ASR component, and that video2audio can be replaced with a simple ffmpeg command that converts video to audio. After rechecking DocSum, the only example that requires this multimedia2text functionality, we found that we can refactor the logic with minimal extra hardware resources, as follows:

    def add_remote_service(self):
        asr = MicroService(
            name="asr",
            host=ASR_SERVICE_HOST_IP,
            port=ASR_SERVICE_PORT,
            endpoint="/v1/audio/transcriptions",
            use_remote_service=True,
            service_type=ServiceType.ASR,
        )

        llm = MicroService(
            name="llm",
            host=LLM_SERVICE_HOST_IP,
            port=LLM_SERVICE_PORT,
            endpoint="/v1/chat/docsum",
            use_remote_service=True,
            service_type=ServiceType.LLM,
        )

        self.megaservice.add(asr).add(llm)
        self.megaservice.flow_to(asr, llm)

    async def handle_request(self, request: Request, files: List[UploadFile] = File(default=None)):
        # 1. If the file is audio, run megaservice.schedule directly.
        # 2. If the file is video, first run a simple
        #    'ffmpeg -i input_video.mp4 -q:a 0 -map a input_video.mp3', then run megaservice.schedule.
        # 3. If the input is text, skip the ASR node in the runtime graph and run megaservice.schedule.
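
To make this routing concrete, here is a minimal sketch of how handle_request could branch on the input type. The helper video2audio_ffmpeg, the base64 byte_str/query field names, and the exact megaservice.schedule call are illustrative assumptions, not the final DocSum implementation.

import base64
import os
import subprocess
import tempfile
from typing import List

from fastapi import File, Request, UploadFile

VIDEO_EXTS = {".mp4", ".mkv", ".mov", ".avi"}


def video2audio_ffmpeg(video_path: str, audio_path: str) -> None:
    """Hypothetical helper: extract the audio track from a video with ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-q:a", "0", "-map", "a", audio_path],
        check=True,
    )


async def handle_request(self, request: Request, files: List[UploadFile] = File(default=None)):
    if files:
        upload = files[0]
        data = await upload.read()
        ext = os.path.splitext(upload.filename)[1].lower()
        if ext in VIDEO_EXTS:
            # Case 2: video -> extract the audio track locally, then feed it to ASR -> LLM.
            with tempfile.NamedTemporaryFile(suffix=ext, delete=False) as vf:
                vf.write(data)
            audio_path = os.path.splitext(vf.name)[0] + ".mp3"
            video2audio_ffmpeg(vf.name, audio_path)
            with open(audio_path, "rb") as af:
                data = af.read()
        # Case 1 (and converted case 2): schedule the full ASR -> LLM graph.
        # The "byte_str" field name and base64 encoding are assumed here.
        return await self.megaservice.schedule(initial_inputs={"byte_str": base64.b64encode(data).decode()})
    # Case 3: plain text -> bypass ASR; the exact mechanism for skipping the ASR node
    # in the runtime graph depends on the orchestrator API and is not shown here.
    body = await request.json()
    return await self.megaservice.schedule(initial_inputs={"query": body.get("messages", "")})

Whichever shape the final code takes, only the video-to-audio conversion runs locally; everything else is the existing ASR -> LLM megaservice flow.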

This will have the following advantages:

  • remove the duplicate part of audio2text
  • move the video2audio to the preparation stage in the example DocSum
  • reduce the number of microservices to save resources
  • use ffmpeg for faster video2audio conversion (and leverage Gaudi acceleration in the future)

I will open a PR for this on the GenAIExamples side later. @MSCetin37, if you have any suggestions, please don't hesitate to tell me!

@Spycsh (Member, Author) commented Dec 24, 2024

Here is the relevant refactor PR in the examples repo: opea-project/GenAIExamples#1286.

@MSCetin37 (Contributor) commented Dec 24, 2024

@Spycsh
Thanks for letting me know about the next phase of the implementation. I have a few concerns:

> * Remove the duplicate part of audio2text

  • The response of the ASR does not align with the input of the LLM service. That was the reason why we implemented the audio2text service, which rearranges the response of the Whisper service so the results can be used as input to the LLM service. You might need to implement this logic inside the async function or multimedia2text service.

> * Move the video2audio to the preparation stage in the example DocSum

  • Video2audio is a simple service and might not require a dedicated service. However, we were planning to use this service for other examples as well, such as the MultiModal Q&A example.

My overall thoughts are that implementing a service (similar to multimedia2text) that can convert any data domain to a targeted domain will simplify and expand the scope of the implementations in OPEA.
For example:

  • Video to audio
  • Audio to text
  • Text to audio (speech)
  • PDF/doc to audio (speech)
  • etc.

@Spycsh (Member, Author) commented Dec 24, 2024

Hi @MSCetin37,

> The response of the ASR does not align with the input of the LLM service. That was the reason why we implemented the audio2text service, which rearranges the response of the Whisper service so the results can be used as input to the LLM service. You might need to implement this logic inside the async function or multimedia2text service.

Yes, as you may have noticed, we leverage the align_input hook in the megaservice to align the inputs; https://github.com/opea-project/GenAIExamples/blob/main/DocSum/docsum.py#L35 is an example.
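
For illustration, here is a minimal sketch of what such an alignment hook could look like, assuming a ServiceOrchestrator-style align_inputs callback and an ASR response field named asr_result; the actual hook signature and field names in DocSum may differ.

from comps import ServiceType  # same import family as the snippet above


def align_inputs(self, inputs, cur_node, runtime_graph, llm_parameters_dict, **kwargs):
    """Hypothetical alignment hook: reshape the ASR (Whisper) response into LLM input."""
    if self.megaservice.services[cur_node].service_type == ServiceType.LLM and "asr_result" in inputs:
        # Map the transcription text to the field the LLM summarization endpoint expects.
        inputs["query"] = inputs.pop("asr_result")
    return inputs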

> Video2audio is a simple service and might not require a dedicated service. However, we were planning to use this service for other examples as well, such as the MultiModal Q&A example.

Sure. However, I think video2audio is very lightweight and can easily be replaced with 'ffmpeg -i input_video.mp4 -q:a 0 -map a input_video.mp3'. Also, separating it out as a standalone microservice does not really improve the pipeline (it does not need to be scaled out); it only increases latency and occupies hardware resources. In https://github.com/opea-project/GenAIExamples/pull/1286/files#diff-53767fc53b41e36c885b4416f3bbedebedae4f26be0ce616668ef368818fa3eeR47 I replace the original functionality with this very simple implementation.

> My overall thoughts are that implementing a service (similar to multimedia2text) that can convert any data domain to a targeted domain will simplify and expand the scope of the implementations in OPEA.

I agree with you if what you mean is adding a multimedia2text component that serves as a wrapper (or controller) to access the existing whisper (ASR) / speecht5 (TTS) / video2audio (local ffmpeg or moviepy conversion) functionality, and potentially other multimedia-related functionality. I think it would be good for the future MultiModal Q&A. The name also does not have to be exactly "multimedia2text"; it could be something else (multimediaprocessing?). A rough sketch of this wrapper idea is below.
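
As a rough illustration only (the class name, endpoints, and payload shapes are hypothetical and do not follow the new component.py API):

# Hypothetical sketch of a "multimediaprocessing" wrapper that dispatches media
# conversions to existing backends: remote ASR/TTS microservices or local ffmpeg.
import subprocess

import requests  # assumed HTTP client for the remote endpoints


class MultimediaProcessor:
    def __init__(self, asr_endpoint: str, tts_endpoint: str):
        self.asr_endpoint = asr_endpoint  # e.g. the whisper/ASR transcription endpoint
        self.tts_endpoint = tts_endpoint  # e.g. the speecht5/TTS synthesis endpoint

    def video_to_audio(self, video_path: str, audio_path: str) -> str:
        # Local conversion; no dedicated microservice needed.
        subprocess.run(["ffmpeg", "-y", "-i", video_path, "-q:a", "0", "-map", "a", audio_path], check=True)
        return audio_path

    def audio_to_text(self, audio_path: str) -> str:
        # Delegate to the existing ASR microservice (payload shape assumed).
        with open(audio_path, "rb") as f:
            resp = requests.post(self.asr_endpoint, files={"file": f})
        resp.raise_for_status()
        return resp.json().get("text", "")

    def text_to_audio(self, text: str) -> bytes:
        # Delegate to the existing TTS microservice (payload shape assumed).
        resp = requests.post(self.tts_endpoint, json={"input": text})
        resp.raise_for_status()
        return resp.content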

There is some refactoring work that I have to do this week, so do you agree we keep it simple for now? The hard requirement is that dataprep should not contain audio2text and video2audio. I will make the change based on our current design, and later I will sync with you about adding a new "multimedia2text" component (its scope and name), following the new controller logic in https://github.com/opea-project/GenAIComps/blob/refactor_comps/comps/cores/common/component.py#L11. Do you think this is reasonable?

@MSCetin37 (Contributor)

Sounds good to me.
