BatchEncoding.to throws away columns silently, thus no way to pass non-tensor columns such as String in Trainer metric computation #34983

@fzyzcjy

System Info

unrelated

Who can help?

@muellerzr @SunMarc
(original tags, no longer valid)

@ArthurZucker
(re-tagging because I want to discuss a patch release)

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Hi, thanks for the library! Consider these two simple lines:

import transformers

# A BatchEncoding whose only column holds strings, not tensors
x = transformers.tokenization_utils_base.BatchEncoding({'a': ['x', 'y']})
x.to('cpu')  # or cuda or whatever

The column a is then silently removed :(

This is annoying in the following scenario: for each of my training/eval samples, I have a string column that serves as a tag for it, and I want to use it when computing metrics and losses.

That does not work. After some debugging, the root cause is that the column gets silently removed by the to call mentioned above.

torch does not seem to support tensors of dtype str, so converting the column to a tensor is not an option either, and there seems to be no way to pass this data through.
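For illustration, here is a minimal sketch of the behavior I would find less surprising: move tensor-like values to the device and pass everything else through untouched. This is not the library's implementation. The helper name to_device_keep_all and the FakeTensor stand-in are hypothetical, and "tensor-like" is assumed to mean "has a callable .to method" so that the sketch runs without torch installed. With torch available, an isinstance(value, torch.Tensor) check would be the stricter choice.

```python
from typing import Any, Mapping


def to_device_keep_all(batch: Mapping[str, Any], device: str) -> dict:
    """Move tensor-like values to `device`; keep all other values unchanged.

    Assumption: a value counts as tensor-like if it has a callable `.to`
    method (true of torch.Tensor). This is duck typing, not a torch check.
    """
    moved = {}
    for key, value in batch.items():
        to_method = getattr(value, "to", None)
        if callable(to_method):
            moved[key] = to_method(device)
        else:
            # Strings, lists of strings, etc. are kept instead of dropped.
            moved[key] = value
    return moved


class FakeTensor:
    """Hypothetical stand-in for torch.Tensor so the sketch runs without torch."""

    def __init__(self, values, device="cpu"):
        self.values = values
        self.device = device

    def to(self, device):
        # Mimic torch's behavior of returning a tensor on the target device.
        return FakeTensor(self.values, device)


batch = {"input_ids": FakeTensor([[1, 2], [3, 4]]), "a": ["x", "y"]}
moved = to_device_keep_all(batch, "cuda")
print(sorted(moved))  # the string column "a" is still present
```

Since the helper only relies on .items(), it should also accept a dict-like object such as a BatchEncoding, though I have only sketched it against a plain dict here.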

Expected behavior

Non-tensor columns should not be silently dropped by to (see above).
