Reminder
- I have read the above rules and searched the existing issues.
System Info
fine-tuning
dataset: share_gpt4
Reproduction
English:
In the mm_plugin.py code there is a kind of bug that prevents me from fine-tuning the model with the ShareGPT dataset.
The problem is that the multimodal dataset processing module defines the following placeholders:
IMAGE_PLACEHOLDER = "<image>", VIDEO_PLACEHOLDER = "<video>", AUDIO_PLACEHOLDER = "<audio>".
mm_plugin.py then simply checks whether the messages contain IMAGE_PLACEHOLDER, VIDEO_PLACEHOLDER, or AUDIO_PLACEHOLDER. For example:
def _validate_messages(
    self,
    messages: list[dict[str, str]],
    images: list["ImageInput"],
    videos: list["VideoInput"],
    audios: list["AudioInput"],
):
    r"""Validate if the number of images, videos and audios match the number of placeholders in messages."""
    num_image_tokens, num_video_tokens, num_audio_tokens = 0, 0, 0
    for message in messages:
        num_image_tokens += message["content"].count(IMAGE_PLACEHOLDER)
        num_video_tokens += message["content"].count(VIDEO_PLACEHOLDER)
        num_audio_tokens += message["content"].count(AUDIO_PLACEHOLDER)
As you can see, this is a very simple string check.
When the dataset naturally contains literal strings like <image> or <video>, this logic breaks and throws errors.
Unfortunately, the ShareGPT4 dataset does contain such strings. For example:
{
"content": "HTML (Hypertext Markup Language) ... HTML5 includes new semantic elements like <header>, <footer>, <nav>, <section>, <article>, <aside>, and <figure> which provide better structure to web pages, making them easier to understand for both humans and search engines.\n3. Multimedia: HTML5 includes new elements like <video> and <audio> which allow for embedding multimedia content directly into web pages, without the need ....",
"role": "assistant"
}, {
"content": "difference between html and html5 in tabular format",
"role": "user"
}, {
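To make the failure concrete, here is a minimal, self-contained sketch of the counting step applied to this kind of sample (the message text is abbreviated, and the final comparison against len(videos)/len(audios) is my assumption about what _validate_messages does after counting):

IMAGE_PLACEHOLDER, VIDEO_PLACEHOLDER, AUDIO_PLACEHOLDER = "<image>", "<video>", "<audio>"

# Abbreviated ShareGPT sample: plain prose that merely mentions HTML tags.
messages = [
    {"role": "assistant",
     "content": "HTML5 includes new elements like <video> and <audio> which allow "
                "for embedding multimedia content directly into web pages."},
    {"role": "user", "content": "difference between html and html5 in tabular format"},
]
images, videos, audios = [], [], []  # a text-only sample: no media attached

num_video_tokens = sum(m["content"].count(VIDEO_PLACEHOLDER) for m in messages)
num_audio_tokens = sum(m["content"].count(AUDIO_PLACEHOLDER) for m in messages)

# The literal HTML tags are counted as placeholders (1 each), so the tallies
# no longer match the empty media lists and a perfectly valid sample is rejected.
if num_video_tokens != len(videos) or num_audio_tokens != len(audios):
    raise ValueError("number of media placeholders does not match the provided media")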
This same issue also appears in the process_messages function inside mm_plugin.py:
@override
def process_messages(
    self,
    messages: list[dict[str, str]],
    images: list["ImageInput"],
    videos: list["VideoInput"],
    audios: list["AudioInput"],
    processor: Optional["MMProcessor"],
) -> list[dict[str, str]]:
    ...
    while VIDEO_PLACEHOLDER in content:
        video_pos = content.find(VIDEO_PLACEHOLDER)
        audio_pos = content.find(AUDIO_PLACEHOLDER, video_pos)
        if audio_pos == -1 or audio_pos < video_pos:
            raise ValueError(
                f"Each {VIDEO_PLACEHOLDER} must be followed by an {AUDIO_PLACEHOLDER} when using audio in video."
            )
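The same kind of text trips this check as well. As a minimal standalone illustration (it copies the find/raise logic above rather than calling the plugin, and the content string is made up):

VIDEO_PLACEHOLDER, AUDIO_PLACEHOLDER = "<video>", "<audio>"

# Hypothetical text-only message that merely mentions the HTML <video> element.
content = "HTML5 lets you embed clips with the <video> element."

while VIDEO_PLACEHOLDER in content:
    video_pos = content.find(VIDEO_PLACEHOLDER)
    audio_pos = content.find(AUDIO_PLACEHOLDER, video_pos)
    if audio_pos == -1 or audio_pos < video_pos:
        # Reached for ordinary prose: the literal "<video>" tag is treated as a
        # multimodal placeholder that is missing its "<audio>" companion.
        raise ValueError(
            f"Each {VIDEO_PLACEHOLDER} must be followed by an {AUDIO_PLACEHOLDER} when using audio in video."
        )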
What I also want to say is that this parsing logic is overly naive and fragile: it cannot distinguish actual multimodal placeholders from ordinary text that happens to contain HTML tags such as <video> or <audio>.
I hope you can address this issue properly.
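Until this is fixed upstream, a possible user-side workaround is to pre-escape literal tags in text-only samples before they reach mm_plugin.py. The sketch below is only an illustration under my assumptions (the "messages"/"images"/"videos"/"audios" keys and the HTML-entity escaping are hypothetical choices, and escaping does slightly alter the training text):

# Hypothetical preprocessing: for samples that carry no images/videos/audios,
# rewrite placeholder-lookalike tags so the plugin's string checks cannot
# mistake ordinary HTML mentions for multimodal placeholders.
PLACEHOLDERS = ("<image>", "<video>", "<audio>")

def escape_literal_tags(sample: dict) -> dict:
    has_media = any(sample.get(key) for key in ("images", "videos", "audios"))
    if has_media:
        return sample  # real multimodal sample: leave placeholders untouched
    for message in sample["messages"]:
        content = message["content"]
        for tag in PLACEHOLDERS:
            content = content.replace(tag, tag.replace("<", "&lt;").replace(">", "&gt;"))
        message["content"] = content
    return sample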
Chinese:
In the mm_plugin.py code there is a class of bug that currently prevents me from fine-tuning a model with the share_gpt dataset.
The problem is that the multimodal dataset processor defines IMAGE_PLACEHOLDER = "<image>", VIDEO_PLACEHOLDER = "<video>", AUDIO_PLACEHOLDER = "<audio>",
while mm_plugin.py simply checks whether the messages contain IMAGE_PLACEHOLDER, VIDEO_PLACEHOLDER, or AUDIO_PLACEHOLDER. For example, the following code:
def _validate_messages(
    self,
    messages: list[dict[str, str]],
    images: list["ImageInput"],
    videos: list["VideoInput"],
    audios: list["AudioInput"],
):
    r"""Validate if the number of images, videos and audios match the number of placeholders in messages."""
    num_image_tokens, num_video_tokens, num_audio_tokens = 0, 0, 0
    for message in messages:
        num_image_tokens += message["content"].count(IMAGE_PLACEHOLDER)
        num_video_tokens += message["content"].count(VIDEO_PLACEHOLDER)
        num_audio_tokens += message["content"].count(AUDIO_PLACEHOLDER)
When the dataset itself happens to contain <image> or <video>, this check fails and an exception is thrown.
The share_gpt4 dataset happens to contain exactly such strings:
....
{
"content": "HTML (Hypertext Markup Language) ... HTML5 includes new semantic elements like <header>, <footer>, <nav>, <section>, <article>, <aside>, and <figure> which provide better structure to web pages, making them easier to understand for both humans and search engines.\n3. Multimedia: HTML5 includes new elements like <video> and <audio> which allow for embedding multimedia content directly into web pages, without the need ....",
"role": "assistant"
}, {
"content": "difference between html and html5 in tabular format",
"role": "user"
}, {
....
The same problem also exists in the process_messages function of mm_plugin.py:
@override
def process_messages(
    self,
    messages: list[dict[str, str]],
    images: list["ImageInput"],
    videos: list["VideoInput"],
    audios: list["AudioInput"],
    processor: Optional["MMProcessor"],
) -> list[dict[str, str]]:
    ...
    while VIDEO_PLACEHOLDER in content:
        video_pos = content.find(VIDEO_PLACEHOLDER)
        audio_pos = content.find(AUDIO_PLACEHOLDER, video_pos)
        if audio_pos == -1 or audio_pos < video_pos:
            raise ValueError(
                f"Each {VIDEO_PLACEHOLDER} must be followed by an {AUDIO_PLACEHOLDER} when using audio in video."
            )
This parsing logic was written far too carelessly. I hope you can fix it properly.
Others
No response