docs(zh-TW): fix tokenizer input type in chapter3/2.mdx #1120

jieyao-MilestoneHub · 2025-10-10T03:37:20Z

Description

Fix the tokenizer input in the MRPC preprocessing example.
In recent versions of datasets, raw_datasets["train"]["sentence1"] and raw_datasets["train"]["sentence2"] are datasets.arrow_dataset.Column objects rather than Python lists, so they must be converted to lists before being passed to the tokenizer.
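
To illustrate why the conversion matters, here is a minimal stdlib-only sketch. `ColumnLike` and `check_text_input` are hypothetical stand-ins, not the real `datasets` or `transformers` code: the first mimics an object that is indexable and iterable but is not a `list`, and the second mimics a strict type check of the kind that produces the error reported in this PR.

```python
from collections.abc import Sequence


class ColumnLike(Sequence):
    """Hypothetical stand-in for a lazy dataset column: a sequence, but not a list."""

    def __init__(self, values):
        self._values = tuple(values)

    def __getitem__(self, index):
        return self._values[index]

    def __len__(self):
        return len(self._values)


def check_text_input(text):
    """Simplified stand-in for a strict input check (assumption, not the library's code)."""
    is_str_list = isinstance(text, list) and all(isinstance(t, str) for t in text)
    if not (isinstance(text, str) or is_str_list):
        raise ValueError("text input must be of type `str` or `list[str]`")
    return text


column = ColumnLike(["First sentence.", "Second sentence."])

# The column behaves like a sequence, yet the strict check rejects it.
try:
    check_text_input(column)
    rejected = False
except ValueError:
    rejected = True

# Wrapping it in list() materializes a plain list[str], which passes.
converted = check_text_input(list(column))
```

The point of the sketch is only that "looks like a list" and "is a list" are different things to an `isinstance` check, which is why wrapping the column in `list()` resolves the error.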

Change

# Before
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)

# After
tokenized_dataset = tokenizer(
    list(raw_datasets["train"]["sentence1"]),
    list(raw_datasets["train"]["sentence2"]),
    padding=True,
    truncation=True,
)
# or, converting the underlying Arrow columns directly
tokenized_dataset = tokenizer(
    raw_datasets["train"].data["sentence1"].to_pylist(),
    raw_datasets["train"].data["sentence2"].to_pylist(),
    padding=True,
    truncation=True,
)

Impact

Prevents

ValueError: text input must be of type `str`, `list[str]`, or `list[list[str]]`

and ensures compatibility with transformers ≥4.56 and datasets ≥3.0.
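
For code that has to run against several library versions, one option is to normalize the column before tokenizing. The helper below is a hedged sketch: `ensure_text_list` is a hypothetical name introduced here for illustration, and it uses only the standard library. It passes real lists through unchanged, materializes other iterables with `list()`, and rejects a bare string, since `list("abc")` would silently split it into characters.

```python
def ensure_text_list(column):
    """Hypothetical helper: return `column` as a plain list of strings."""
    if isinstance(column, (str, bytes)):
        # list("abc") would yield ["a", "b", "c"], so refuse bare strings.
        raise TypeError("expected a column of strings, got a single string")
    # Reuse an existing list; otherwise materialize the iterable.
    values = column if isinstance(column, list) else list(column)
    for position, value in enumerate(values):
        if not isinstance(value, str):
            raise TypeError(f"item {position} is {type(value).__name__}, expected str")
    return values


# A tuple stands in for any non-list column type.
sentences = ensure_text_list(("A first sentence.", "A second sentence."))

# An input that is already a list is returned as the same object.
already_list = ["x", "y"]
same_object = ensure_text_list(already_list) is already_list
```

In the example from this PR, such a helper would be applied to each column before the tokenizer call, in place of the bare `list(...)` wrappers.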

jieyao-MilestoneHub · 2025-10-10T03:41:16Z

Hi @thliang01 — I noticed this tokenizer input issue in the zh-TW course file.
Could you please help review and confirm if the fix looks correct? Thank you! 🙏

HuggingFaceDocBuilderDev · 2025-10-10T03:53:43Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

docs: fix tokenizer input by wrapping dataset column with list()

af2e776

Joel Hsu(徐捷耀) added 2 commits October 10, 2025 12:06

docs: fix tokenizer input by wrapping dataset column with list()

feb3c78

Merge branch 'fix-doc-tokenizer-input' of https://github.com/jieyao-M…

1ab86f5

…ilestoneHub/course into fix-doc-tokenizer-input
