Reminder
- I have read the above rules and searched the existing issues.
System Info
fine-tuning
dataset: share_gpt4
Reproduction
English:
In the mm_plugin.py code there is a kind of bug that prevents me from fine-tuning the model with the ShareGPT dataset.
The problem is that the multimodal dataset processing module defines the following placeholders:
IMAGE_PLACEHOLDER = "<image>", VIDEO_PLACEHOLDER = "<video>", AUDIO_PLACEHOLDER = "<audio>".
mm_plugin.py then simply checks whether the messages contain IMAGE_PLACEHOLDER, VIDEO_PLACEHOLDER, or AUDIO_PLACEHOLDER. For example:
def _validate_messages(
    self,
    messages: list[dict[str, str]],
    images: list["ImageInput"],
    videos: list["VideoInput"],
    audios: list["AudioInput"],
):
    r"""Validate if the number of images, videos and audios match the number of placeholders in messages."""
    num_image_tokens, num_video_tokens, num_audio_tokens = 0, 0, 0
    for message in messages:
        num_image_tokens += message["content"].count(IMAGE_PLACEHOLDER)
        num_video_tokens += message["content"].count(VIDEO_PLACEHOLDER)
        num_audio_tokens += message["content"].count(AUDIO_PLACEHOLDER)
As you can see, this is a very simple string check.
When the dataset naturally contains literal strings like <image> or <video>, this logic breaks and throws errors.
Unfortunately, the ShareGPT4 dataset does contain such strings. For example:
{
"content": "HTML (Hypertext Markup Language) ... HTML5 includes new semantic elements like <header>, <footer>, <nav>, <section>, <article>, <aside>, and <figure> which provide better structure to web pages, making them easier to understand for both humans and search engines.\n3. Multimedia: HTML5 includes new elements like <video> and <audio> which allow for embedding multimedia content directly into web pages, without the need ....",
"role": "assistant"
}, {
"content": "difference between html and html5 in tabular format",
"role": "user"
}, {
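To make the failure concrete, here is a minimal, self-contained sketch of the counting step applied to this kind of sample (the message text is abbreviated, and the final comparison against len(videos)/len(audios) is my assumption about what _validate_messages does after counting):

IMAGE_PLACEHOLDER, VIDEO_PLACEHOLDER, AUDIO_PLACEHOLDER = "<image>", "<video>", "<audio>"

# Abbreviated ShareGPT sample: plain prose that merely mentions HTML tags.
messages = [
    {"role": "assistant",
     "content": "HTML5 includes new elements like <video> and <audio> which allow "
                "for embedding multimedia content directly into web pages."},
    {"role": "user", "content": "difference between html and html5 in tabular format"},
]
images, videos, audios = [], [], []  # a text-only sample: no media attached

num_video_tokens = sum(m["content"].count(VIDEO_PLACEHOLDER) for m in messages)
num_audio_tokens = sum(m["content"].count(AUDIO_PLACEHOLDER) for m in messages)

# The literal HTML tags are counted as placeholders (1 each), so the tallies
# no longer match the empty media lists and a perfectly valid sample is rejected.
if num_video_tokens != len(videos) or num_audio_tokens != len(audios):
    raise ValueError("number of media placeholders does not match the provided media")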
This same issue also appears in the process_messages function inside mm_plugin.py:
@override
def process_messages(
    self,
    messages: list[dict[str, str]],
    images: list["ImageInput"],
    videos: list["VideoInput"],
    audios: list["AudioInput"],
    processor: Optional["MMProcessor"],
) -> list[dict[str, str]]:
    ...
    while VIDEO_PLACEHOLDER in content:
        video_pos = content.find(VIDEO_PLACEHOLDER)
        audio_pos = content.find(AUDIO_PLACEHOLDER, video_pos)
        if audio_pos == -1 or audio_pos < video_pos:
            raise ValueError(
                f"Each {VIDEO_PLACEHOLDER} must be followed by an {AUDIO_PLACEHOLDER} when using audio in video."
            )
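The same kind of text trips this check as well. As a minimal standalone illustration (it copies the find/raise logic above rather than calling the plugin, and the content string is made up):

VIDEO_PLACEHOLDER, AUDIO_PLACEHOLDER = "<video>", "<audio>"

# Hypothetical text-only message that merely mentions the HTML <video> element.
content = "HTML5 lets you embed clips with the <video> element."

while VIDEO_PLACEHOLDER in content:
    video_pos = content.find(VIDEO_PLACEHOLDER)
    audio_pos = content.find(AUDIO_PLACEHOLDER, video_pos)
    if audio_pos == -1 or audio_pos < video_pos:
        # Reached for ordinary prose: the literal "<video>" tag is treated as a
        # multimodal placeholder that is missing its "<audio>" companion.
        raise ValueError(
            f"Each {VIDEO_PLACEHOLDER} must be followed by an {AUDIO_PLACEHOLDER} when using audio in video."
        )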
What I also want to say is that this parsing logic is overly naive and fragile: it cannot distinguish actual multimodal placeholders from ordinary text that happens to contain HTML tags such as <video> or <audio>.
I hope you can address this issue properly.
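Until this is fixed upstream, a possible user-side workaround is to pre-escape literal tags in text-only samples before they reach mm_plugin.py. The sketch below is only an illustration under my assumptions (the "messages"/"images"/"videos"/"audios" keys and the HTML-entity escaping are hypothetical choices, and escaping does slightly alter the training text):

# Hypothetical preprocessing: for samples that carry no images/videos/audios,
# rewrite placeholder-lookalike tags so the plugin's string checks cannot
# mistake ordinary HTML mentions for multimodal placeholders.
PLACEHOLDERS = ("<image>", "<video>", "<audio>")

def escape_literal_tags(sample: dict) -> dict:
    has_media = any(sample.get(key) for key in ("images", "videos", "audios"))
    if has_media:
        return sample  # real multimodal sample: leave placeholders untouched
    for message in sample["messages"]:
        content = message["content"]
        for tag in PLACEHOLDERS:
            content = content.replace(tag, tag.replace("<", "&lt;").replace(">", "&gt;"))
        message["content"] = content
    return sample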
Chinese:
In the mm_plugin.py code there is a class of bug that currently prevents me from fine-tuning a model with the share_gpt dataset.
The problem is that the multimodal dataset processor defines IMAGE_PLACEHOLDER = "<image>", VIDEO_PLACEHOLDER = "<video>", AUDIO_PLACEHOLDER = "<audio>",
while mm_plugin.py simply checks whether the messages contain IMAGE_PLACEHOLDER, VIDEO_PLACEHOLDER, or AUDIO_PLACEHOLDER. For example, the following code:
def _validate_messages(
    self,
    messages: list[dict[str, str]],
    images: list["ImageInput"],
    videos: list["VideoInput"],
    audios: list["AudioInput"],
):
    r"""Validate if the number of images, videos and audios match the number of placeholders in messages."""
    num_image_tokens, num_video_tokens, num_audio_tokens = 0, 0, 0
    for message in messages:
        num_image_tokens += message["content"].count(IMAGE_PLACEHOLDER)
        num_video_tokens += message["content"].count(VIDEO_PLACEHOLDER)
        num_audio_tokens += message["content"].count(AUDIO_PLACEHOLDER)
When the dataset itself happens to contain <image> or <video>, this check fails and an exception is thrown.
The share_gpt4 dataset happens to contain exactly such strings:
....
{
"content": "HTML (Hypertext Markup Language) ... HTML5 includes new semantic elements like <header>, <footer>, <nav>, <section>, <article>, <aside>, and <figure> which provide better structure to web pages, making them easier to understand for both humans and search engines.\n3. Multimedia: HTML5 includes new elements like <video> and <audio> which allow for embedding multimedia content directly into web pages, without the need ....",
"role": "assistant"
}, {
"content": "difference between html and html5 in tabular format",
"role": "user"
}, {
....
The same problem also exists in the process_messages function of mm_plugin.py:
@override
def process_messages(
    self,
    messages: list[dict[str, str]],
    images: list["ImageInput"],
    videos: list["VideoInput"],
    audios: list["AudioInput"],
    processor: Optional["MMProcessor"],
) -> list[dict[str, str]]:
    ...
    while VIDEO_PLACEHOLDER in content:
        video_pos = content.find(VIDEO_PLACEHOLDER)
        audio_pos = content.find(AUDIO_PLACEHOLDER, video_pos)
        if audio_pos == -1 or audio_pos < video_pos:
            raise ValueError(
                f"Each {VIDEO_PLACEHOLDER} must be followed by an {AUDIO_PLACEHOLDER} when using audio in video."
            )
This parsing logic was written far too carelessly. I hope you can fix it properly.
Others
No response