
Inputs and input_ids include labels??! Is it right that the input tokens contain the whole output? #7603

@Luffy966

Description

Reminder

  • I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.3.dev0
  • Platform: Linux-5.4.0-162-generic-x86_64-with-glibc2.31
  • Python version: 3.12.9
  • PyTorch version: 2.6.0+cu118 (GPU)
  • Transformers version: 4.49.0
  • Datasets version: 3.4.1
  • Accelerate version: 1.5.2
  • PEFT version: 0.15.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA GeForce RTX 4090
  • GPU number: 8
  • GPU memory: 23.65GB
  • DeepSpeed version: 0.16.4
  • vLLM version: 0.8.1

Reproduction

I noticed by chance in the logs that, in the printed training example, the input tokens include the label tokens, which looks very abnormal!!

At first I assumed it was my own mistake, so I browsed other issues that print a training example, and found the same thing happens with every dataset and example. The following log is from: #7156 (comment)

training example:
input_ids:
[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 151652, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151653, 74785, 279, 1887, 14878, 323, 862, 19026, 311, 279, 2766, 13, 151645, 198, 151644, 77091, 198, 32, 5220, 448, 2805, 6869, 11, 10078, 8664, 11, 12233, 264, 4158, 1909, 323, 3691, 24549, 14638, 304, 4065, 315, 279, 1965, 11, 12771, 705, 264, 2311, 504, 279, 1965, 11, 323, 8930, 432, 311, 1349, 151645, 198]
inputs:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<|vision_start|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|vision_end|>Describe the main subjects and their contributions to the video.<|im_end|>
<|im_start|>assistant
A woman with short hair, slightly fat, wearing a white top and black pants stood in front of the table, picked up a book from the table, and opened it to read<|im_end|>

label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 32, 5220, 448, 2805, 6869, 11, 10078, 8664, 11, 12233, 264, 4158, 1909, 323, 3691, 24549, 14638, 304, 4065, 315, 279, 1965, 11, 12771, 705, 264, 2311, 504, 279, 1965, 11, 323, 8930, 432, 311, 1349, 151645, 198]
labels:
A woman with short hair, slightly fat, wearing a white top and black pants stood in front of the table, picked up a book from the table, and opened it to read<|im_end|>
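A quick way to sanity-check the log above is to verify that label_ids has the same length as input_ids and that every unmasked label equals the input token at the same position. Below is a minimal sketch with shortened stand-in id lists (hypothetical values, not the real sequences from the log):

```python
IGNORE_INDEX = -100  # sentinel used for masked label positions

# stand-in sequences: a short "prompt" followed by a short "response"
input_ids = [151644, 8948, 198, 32, 5220, 151645, 198]
label_ids = [-100, -100, -100, 32, 5220, 151645, 198]

# labels must align one-to-one with the inputs
assert len(input_ids) == len(label_ids)

# every position is either masked (-100) or identical to the input token
for inp, lab in zip(input_ids, label_ids):
    assert lab == IGNORE_INDEX or lab == inp
```

This is exactly the pattern in the log: the prompt (system + user turns, including the video pad tokens) is masked with -100, and only the assistant response tokens survive in label_ids.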

I checked the LLaMA-Factory source code and found that these lines in SupervisedDatasetProcessor._encode_data_example() in src/llamafactory/data/processor/supervised.py look suspicious:

if self.data_args.mask_history:  # reversed sequences
    input_ids = source_ids + target_ids + input_ids
    labels = source_label + target_label + labels
else:
    input_ids += source_ids + target_ids
    labels += source_label + target_label

Whether or not mask_history is enabled, target_ids are always appended to input_ids. Is that normal? @hiyouga
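For context, this layout is the standard teacher-forcing setup for causal-LM supervised fine-tuning: the model must see the full prompt + response as input, and the prompt is excluded from the loss not by removing it from input_ids but by masking those positions with -100 in labels (the ignore index used by Hugging Face Transformers). A minimal sketch of the idea, with a hypothetical helper mirroring (in simplified form) what the processor does:

```python
IGNORE_INDEX = -100  # same sentinel HF Transformers skips when computing loss

def build_supervised_example(source_ids, target_ids):
    """Hypothetical simplification of the supervised processor: the model
    input is prompt + response, while the prompt positions in the labels
    are masked so they contribute no loss."""
    input_ids = source_ids + target_ids
    labels = [IGNORE_INDEX] * len(source_ids) + target_ids
    return input_ids, labels

# toy prompt and response token ids (made-up values)
src = [101, 102, 103]
tgt = [7, 8, 9]
input_ids, labels = build_supervised_example(src, tgt)

assert len(input_ids) == len(labels)
# loss is only computed where labels != IGNORE_INDEX, i.e. on the response
supervised = [t for t in labels if t != IGNORE_INDEX]
assert supervised == tgt
```

So target_ids appearing in input_ids is expected; what matters is that the corresponding prompt positions in labels are -100, which the log above shows they are.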

Others

No response

Labels: invalid (This doesn't seem right)