
dataset parsing error, and async_result.get(timeout=0.05) #9546

@clinton81

Description


Reminder

  • I have read the above rules and searched the existing issues.

System Info

fine-tuning
dataset: share_gpt4

Reproduction

English:

In the mm_plugin.py code there is a class of bug that prevents me from fine-tuning a model with the ShareGPT dataset.

The problem is that the multimodal dataset processing module defines the following placeholders:
IMAGE_PLACEHOLDER = "<image>", VIDEO_PLACEHOLDER = "<video>", AUDIO_PLACEHOLDER = "<audio>".

As a result, mm_plugin.py simply checks whether the messages contain IMAGE_PLACEHOLDER, VIDEO_PLACEHOLDER, or AUDIO_PLACEHOLDER as substrings. For example:

def _validate_messages(
    self,
    messages: list[dict[str, str]],
    images: list["ImageInput"],
    videos: list["VideoInput"],
    audios: list["AudioInput"],
):
    r"""Validate if the number of images, videos and audios match the number of placeholders in messages."""
    num_image_tokens, num_video_tokens, num_audio_tokens = 0, 0, 0
    for message in messages:
        num_image_tokens += message["content"].count(IMAGE_PLACEHOLDER)
        num_video_tokens += message["content"].count(VIDEO_PLACEHOLDER)
        num_audio_tokens += message["content"].count(AUDIO_PLACEHOLDER)

As you can see, this is nothing more than a naive substring count. When the dataset naturally contains literal strings like <image> or <video>, this logic breaks and raises an error.
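The false positive is easy to reproduce with the placeholder constants alone (a minimal sketch, not the actual plugin code):

```python
# Minimal sketch of the false positive: a plain substring count cannot
# tell a multimodal placeholder apart from a literal HTML tag in text.
VIDEO_PLACEHOLDER = "<video>"
AUDIO_PLACEHOLDER = "<audio>"

content = "HTML5 includes new elements like <video> and <audio> ..."

# Both counts come out as 1 even though this sample has no attached
# media, so a comparison against len(videos) == 0 fails validation.
num_video_tokens = content.count(VIDEO_PLACEHOLDER)
num_audio_tokens = content.count(AUDIO_PLACEHOLDER)
```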

Unfortunately, the ShareGPT4 dataset does contain such strings. For example:

{
"content": "HTML (Hypertext Markup Language) ... HTML5 includes new semantic elements like <header>, <footer>, <nav>, <section>, <article>, <aside>, and <figure> which provide better structure to web pages, making them easier to understand for both humans and search engines.\n3. Multimedia: HTML5 includes new elements like <video> and <audio> which allow for embedding multimedia content directly into web pages, without the need ....",
"role": "assistant"
}, {
"content": "difference between html and html5 in tabular format",
"role": "user"
}, {
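As a workaround, affected samples can be located before training. Here is a rough scanning script (the `conversations`/`messages` and `content`/`value` key names are assumptions; adjust them to the actual dataset schema):

```python
import json

PLACEHOLDERS = ("<image>", "<video>", "<audio>")

def find_colliding_samples(path: str) -> list[int]:
    """Return indices of samples whose text contains a literal placeholder string."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    hits = []
    for i, sample in enumerate(data):
        # ShareGPT-style files typically use "conversations" or "messages".
        turns = sample.get("conversations") or sample.get("messages") or []
        for turn in turns:
            text = turn.get("content") or turn.get("value") or ""
            if any(ph in text for ph in PLACEHOLDERS):
                hits.append(i)
                break
    return hits
```

Text-only samples flagged by this scan could then have the literal tags escaped (for example `<video>` → `&lt;video&gt;`) before being fed to the trainer.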

This same issue also appears in the process_messages function inside mm_plugin.py:

@override
def process_messages(
    self,
    messages: list[dict[str, str]],
    images: list["ImageInput"],
    videos: list["VideoInput"],
    audios: list["AudioInput"],
    processor: Optional["MMProcessor"],
) -> list[dict[str, str]]:


...


while VIDEO_PLACEHOLDER in content:
    video_pos = content.find(VIDEO_PLACEHOLDER)
    audio_pos = content.find(AUDIO_PLACEHOLDER, video_pos)
    if audio_pos == -1 or audio_pos < video_pos:
        raise ValueError(
            f"Each {VIDEO_PLACEHOLDER} must be followed by an {AUDIO_PLACEHOLDER} when using audio in video."
        )

To sum up: the parsing logic is overly naive and fragile. It cannot distinguish actual multimodal placeholders from ordinary text that happens to contain HTML tags such as <video> or <audio>.

I hope you can address this issue properly.
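One possible direction (my own sketch, not a patch against the real plugin, which also has to track placeholder ordering): only enforce the placeholder count when the corresponding media inputs are actually attached, so text-only samples containing literal HTML tags pass through:

```python
IMAGE_PLACEHOLDER, VIDEO_PLACEHOLDER, AUDIO_PLACEHOLDER = "<image>", "<video>", "<audio>"

def validate_messages(messages, images, videos, audios):
    """Count placeholders, but only compare against media that is present."""
    counts = {IMAGE_PLACEHOLDER: 0, VIDEO_PLACEHOLDER: 0, AUDIO_PLACEHOLDER: 0}
    for message in messages:
        for ph in counts:
            counts[ph] += message["content"].count(ph)
    for ph, media in ((IMAGE_PLACEHOLDER, images),
                      (VIDEO_PLACEHOLDER, videos),
                      (AUDIO_PLACEHOLDER, audios)):
        # A text-only sample that merely mentions "<video>" has no media
        # attached, so it is skipped instead of raising.
        if media and counts[ph] != len(media):
            raise ValueError(f"Found {counts[ph]} {ph} placeholders but {len(media)} inputs.")
```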


Others

No response

Metadata


    Labels

    solved (This problem has been already solved)
