
Inputs and input_ids include labels??! Is it right that the input tokens contain the whole output? #7603

@Luffy966

Description

Reminder

  • I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.3.dev0
  • Platform: Linux-5.4.0-162-generic-x86_64-with-glibc2.31
  • Python version: 3.12.9
  • PyTorch version: 2.6.0+cu118 (GPU)
  • Transformers version: 4.49.0
  • Datasets version: 3.4.1
  • Accelerate version: 1.5.2
  • PEFT version: 0.15.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA GeForce RTX 4090
  • GPU number: 8
  • GPU memory: 23.65GB
  • DeepSpeed version: 0.16.4
  • vLLM version: 0.8.1

Reproduction

I noticed by chance in the logs that, in the printed training example, the input tokens include the label tokens, which looks very abnormal!!

At first I assumed it was my own mistake, so I browsed other issues that print a training example, and found the same thing happens with every dataset and example. The following log is from: #7156 (comment)

training example:
input_ids:
[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 151652, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151653, 74785, 279, 1887, 14878, 323, 862, 19026, 311, 279, 2766, 13, 151645, 198, 151644, 77091, 198, 32, 5220, 448, 2805, 6869, 11, 10078, 8664, 11, 12233, 264, 4158, 1909, 323, 3691, 24549, 14638, 304, 4065, 315, 279, 1965, 11, 12771, 705, 264, 2311, 504, 279, 1965, 11, 323, 8930, 432, 311, 1349, 151645, 198]
inputs:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<|vision_start|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|video_pad|><|vision_end|>Describe the main subjects and their contributions to the video.<|im_end|>
<|im_start|>assistant
A woman with short hair, slightly fat, wearing a white top and black pants stood in front of the table, picked up a book from the table, and opened it to read<|im_end|>

label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 32, 5220, 448, 2805, 6869, 11, 10078, 8664, 11, 12233, 264, 4158, 1909, 323, 3691, 24549, 14638, 304, 4065, 315, 279, 1965, 11, 12771, 705, 264, 2311, 504, 279, 1965, 11, 323, 8930, 432, 311, 1349, 151645, 198]
labels:
A woman with short hair, slightly fat, wearing a white top and black pants stood in front of the table, picked up a book from the table, and opened it to read<|im_end|>
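A quick way to sanity-check the log above is to verify that label_ids has the same length as input_ids and that every unmasked label equals the input token at the same position. Below is a minimal sketch with shortened stand-in id lists (hypothetical values, not the real sequences from the log):

```python
IGNORE_INDEX = -100  # sentinel used for masked label positions

# stand-in sequences: a short "prompt" followed by a short "response"
input_ids = [151644, 8948, 198, 32, 5220, 151645, 198]
label_ids = [-100, -100, -100, 32, 5220, 151645, 198]

# labels must align one-to-one with the inputs
assert len(input_ids) == len(label_ids)

# every position is either masked (-100) or identical to the input token
for inp, lab in zip(input_ids, label_ids):
    assert lab == IGNORE_INDEX or lab == inp
```

This is exactly the pattern in the log: the prompt (system + user turns, including the video pad tokens) is masked with -100, and only the assistant response tokens survive in label_ids.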

I checked the LLaMA-Factory source code and found that these lines in SupervisedDatasetProcessor._encode_data_example() in src/llamafactory/data/processor/supervised.py look suspicious:

if self.data_args.mask_history:  # reversed sequences
    input_ids = source_ids + target_ids + input_ids
    labels = source_label + target_label + labels
else:
    input_ids += source_ids + target_ids
    labels += source_label + target_label

Whether or not mask_history is enabled, target_ids are always appended to input_ids. Is that normal? @hiyouga
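For context, this layout is the standard teacher-forcing setup for causal-LM supervised fine-tuning: the model must see the full prompt + response as input, and the prompt is excluded from the loss not by removing it from input_ids but by masking those positions with -100 in labels (the ignore index used by Hugging Face Transformers). A minimal sketch of the idea, with a hypothetical helper mirroring (in simplified form) what the processor does:

```python
IGNORE_INDEX = -100  # same sentinel HF Transformers skips when computing loss

def build_supervised_example(source_ids, target_ids):
    """Hypothetical simplification of the supervised processor: the model
    input is prompt + response, while the prompt positions in the labels
    are masked so they contribute no loss."""
    input_ids = source_ids + target_ids
    labels = [IGNORE_INDEX] * len(source_ids) + target_ids
    return input_ids, labels

# toy prompt and response token ids (made-up values)
src = [101, 102, 103]
tgt = [7, 8, 9]
input_ids, labels = build_supervised_example(src, tgt)

assert len(input_ids) == len(labels)
# loss is only computed where labels != IGNORE_INDEX, i.e. on the response
supervised = [t for t in labels if t != IGNORE_INDEX]
assert supervised == tgt
```

So target_ids appearing in input_ids is expected; what matters is that the corresponding prompt positions in labels are -100, which the log above shows they are.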

Others

No response

Labels: invalid (This doesn't seem right)