make_seq2seq_fields crashes with confusing error when empty prompt array is passed #483

@markknoffler

Description

Bug Overview

The make_seq2seq_fields function in gemma/gm/data/_functional.py crashes with a confusing NumPy error when an empty prompt array is passed, instead of handling it gracefully or providing a clear error message.

Bug Location

  • File: gemma/gm/data/_functional.py
  • Line: 142
  • Function: make_seq2seq_fields

Root Cause

The function attempts to create a target mask with this code:

target_mask = np.concatenate([
    np.zeros((len(prompt) - 1,), dtype=np.bool_),
    np.ones((len(response),), dtype=np.bool_),
])

When len(prompt) == 0, the expression len(prompt) - 1 evaluates to -1, causing np.zeros((-1,), ...) to raise:

ValueError: negative dimensions are not allowed
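
The failure can be reproduced with NumPy alone, without installing gemma. The following is a minimal sketch; `build_target_mask` is a hypothetical stand-in that only copies the mask expression quoted above and is not part of the library.

```python
import numpy as np


def build_target_mask(prompt, response):
    # Hypothetical stand-in: mirrors the mask expression quoted in this issue.
    return np.concatenate([
        np.zeros((len(prompt) - 1,), dtype=np.bool_),
        np.ones((len(response),), dtype=np.bool_),
    ])


# Non-empty prompt: the mask has len(prompt) - 1 + len(response) entries.
mask = build_target_mask([10, 11, 12], [20, 21, 1])
print(mask.tolist())  # [False, False, True, True, True]

# Empty prompt: the first shape becomes (-1,), which NumPy rejects.
try:
    build_target_mask([], [20, 21, 1])
except ValueError as err:
    print(err)  # negative dimensions are not allowed
```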

Why This Is a Problem

  1. No input validation - The function doesn't check for empty prompts before attempting array operations
  2. Confusing error message - Users get a cryptic NumPy error instead of understanding what they did wrong
  3. Crashes entire pipeline - Instead of handling the edge case gracefully, the entire operation fails
  4. Difficult to debug - The error is raised by a NumPy call inside the library, so nothing in the message points back to the caller's input
  5. Poor user experience - Users cannot easily identify that the issue is with an empty prompt

How to Reproduce

Scenario 1: Direct API usage

from gemma import gm

result = gm.data.make_seq2seq_fields(
    prompt=[],  # Empty prompt
    response=[20, 21, 1]
)
# Raises: ValueError: negative dimensions are not allowed

Scenario 2: Through AddSeq2SeqFields transform

from gemma import gm

transform = gm.data.AddSeq2SeqFields(
    in_prompt="prompt",
    in_response="response",
    out_input="input",
    out_target="target",
    out_target_mask="target_mask",
)

element = {
    "prompt": [],  # Empty prompt tokens
    "response": [20, 21, 1]
}

result = transform.map(element)
# Raises: ValueError: negative dimensions are not allowed

Error Traceback

Traceback (most recent call last):
  File ".../_functional.py", line 142, in make_seq2seq_fields
    np.zeros((len(prompt) - 1,), dtype=np.bool_),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: negative dimensions are not allowed

Expected Behavior

The function should either:

  1. Raise a clear, descriptive error indicating that empty prompts are not supported, OR
  2. Handle empty prompts gracefully with a warning and continue execution
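
As an illustration of option 1, here is a rough sketch of what an up-front check could look like. The function name `make_seq2seq_fields_checked`, the dict return value, and the input/target shift are assumptions made for this example (only the mask expression comes from the code quoted in this issue); this is not the gemma implementation.

```python
import numpy as np


def make_seq2seq_fields_checked(prompt, response):
    """Hypothetical validated variant; names and return type are assumptions."""
    prompt = np.asarray(prompt, dtype=np.int32)
    response = np.asarray(response, dtype=np.int32)

    # Fail early with a message that names the actual problem.
    if len(prompt) == 0:
        raise ValueError(
            "make_seq2seq_fields: `prompt` is empty; "
            "at least one prompt token is required."
        )

    sequence = np.concatenate([prompt, response])

    # Same mask expression as in the issue; safe now that len(prompt) >= 1.
    target_mask = np.concatenate([
        np.zeros((len(prompt) - 1,), dtype=np.bool_),
        np.ones((len(response),), dtype=np.bool_),
    ])

    # Assumed next-token layout: input and target are the sequence shifted by one.
    return {
        "input": sequence[:-1],
        "target": sequence[1:],
        "target_mask": target_mask,
    }


fields = make_seq2seq_fields_checked(prompt=[10, 11], response=[20, 21, 1])
print(fields["target_mask"].tolist())  # [False, True, True, True]
```

With a check like this, an empty prompt would surface as a `ValueError` whose message mentions `prompt` directly rather than array dimensions.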

Impact

  • Affects users calling make_seq2seq_fields directly with empty prompt arrays
  • Affects data pipelines using AddSeq2SeqFields transform with potentially empty prompt tokens
  • Can cause entire batch processing pipelines to fail
  • Difficult for users to identify and fix the root cause
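
Until the function validates its input, a pipeline could guard against this case before the transform runs. The sketch below is a possible user-side workaround in plain Python; `split_by_prompt` is a hypothetical helper, and the `"prompt"` key matches the `in_prompt` value used in Scenario 2.

```python
def split_by_prompt(elements, prompt_key="prompt"):
    """Separate elements with a non-empty prompt from those without one."""
    kept, skipped = [], []
    for element in elements:
        if len(element[prompt_key]) > 0:
            kept.append(element)
        else:
            skipped.append(element)
    return kept, skipped


elements = [
    {"prompt": [10, 11], "response": [20, 21, 1]},
    {"prompt": [], "response": [20, 21, 1]},  # would trigger the ValueError
]

kept, skipped = split_by_prompt(elements)
print(len(kept), len(skipped))  # 1 1
```

Keeping the skipped elements in a separate list makes it possible to log or inspect them instead of silently dropping data.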