
QA pipeline prediction generates wrong response when `top_k` param > 1 #38984

Open
@WeichenXu123

Description


System Info

  • transformers version: 4.53.0.dev0
  • Platform: Linux-5.4.0-1128-aws-fips-x86_64-with-glibc2.31
  • Python version: 3.11.11
  • Huggingface_hub version: 0.33.0
  • Safetensors version: 0.5.3
  • Accelerate version: 1.8.1
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.7.1+cu126 (NA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import transformers

architecture = "csarron/mobilebert-uncased-squad-v2"
tokenizer = transformers.AutoTokenizer.from_pretrained(architecture, low_cpu_mem_usage=True)
model = transformers.MobileBertForQuestionAnswering.from_pretrained(
    architecture, low_cpu_mem_usage=True
)
pipeline = transformers.pipeline(task="question-answering", model=model, tokenizer=tokenizer)


# one batch entry whose 'question' and 'context' fields are parallel lists
data = [
    {'question': ['What color is it?', 'How do the people go?', "What does the 'wolf' howl at?"],
     'context': [
         "Some people said it was green but I know that it's pink.",
         'The people on the bus go up and down. Up and down.',
         "The pack of 'wolves' stood on the cliff and a 'lone wolf' howled at the moon for hours."
     ]}
]

# prediction result is wrong when top_k > 1 (see "Expected behavior" below)
pipeline(data, top_k=2, max_answer_len=5)

Expected behavior

Expected prediction response:

[[{'score': 0.5683297514915466, 'start': 51, 'end': 55, 'answer': 'pink'},
  {'score': 0.028800610452890396, 'start': 51, 'end': 56, 'answer': 'pink.'}],
 [{'score': 0.3008899986743927, 'start': 25, 'end': 36, 'answer': 'up and down'},
  {'score': 0.12070021033287048, 'start': 38, 'end': 49, 'answer': 'Up and down'}],
 [{'score': 0.8356598615646362, 'start': 68, 'end': 76, 'answer': 'the moon'},
  {'score': 0.0971309095621109, 'start': 72, 'end': 76, 'answer': 'moon'}]]

But it actually gets the following response: the second example comes back as a single dict instead of a list of two answers, so one 'Up and down' answer is missing. (Its score, 0.4215902090072632, equals the sum of the two expected scores 0.3008899986743927 + 0.12070021033287048, which suggests the two answers were merged into one.)

[[{'score': 0.5683297514915466, 'start': 51, 'end': 55, 'answer': 'pink'},
  {'score': 0.028800610452890396, 'start': 51, 'end': 56, 'answer': 'pink.'}],
 {'score': 0.4215902090072632, 'start': 25, 'end': 36, 'answer': 'up and down'},
 [{'score': 0.8356598615646362, 'start': 68, 'end': 76, 'answer': 'the moon'},
  {'score': 0.0971309095621109, 'start': 72, 'end': 76, 'answer': 'moon'}]]

Activity

Rocketknight1 (Member) commented on Jun 23, 2025

cc @yushi2006, I did a git bisect and this change occurs because of #38761! I think the issue is that top_k and the new answer-merging logic are conflicting, so we get fewer than top_k answers when some of them are merged together. What users probably want is for answers to be merged before top_k is applied. I probably should have caught this in the review.

Maybe we should do a follow-up PR to fix it and move the score-merging before top_k? There are multiple ways to do this; if you want to take the PR, let me know, and if not we'll do it internally at some point.
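
A minimal standalone sketch of that ordering, just to make the idea concrete (this is not the actual pipeline code; the candidate format and the rule that duplicate answers are matched case-insensitively and have their scores summed are assumptions on my part):

def merge_then_top_k(candidates, top_k):
    """candidates: list of dicts like {'score': ..., 'start': ..., 'end': ..., 'answer': ...}."""
    merged = {}
    for cand in candidates:
        key = cand["answer"].strip().lower()  # assumed duplicate criterion
        if key in merged:
            merged[key]["score"] += cand["score"]  # fold duplicate spans into one entry
        else:
            merged[key] = dict(cand)
    # apply top_k only after merging, so merging can never leave the user
    # with fewer answers than requested
    return sorted(merged.values(), key=lambda c: c["score"], reverse=True)[:top_k]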

itsmejul commented on Jul 2, 2025

I think the easiest way would be to just remove the top-k sampling in decode_spans and keep the full score matrix until after we merge duplicate answers, compute answer probabilities, and save the answers, then sample top-k only at the very end.
Obviously this adds quadratic overhead, because we would need to compute the probabilities for all start-end combinations, and I'm not sure there is a more efficient way around it. The only alternative I can think of is to artificially increase top_k at first (say, 10*top_k) before the merging, and then sample again with the actual top_k value afterwards. That would add only constant extra overhead, but it would not guarantee exact probabilities in the results (which we currently also don't have). @Rocketknight1 What do you think?
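
A rough sketch of that second idea, purely to illustrate the control flow (decode_spans and merge_answers are hypothetical stand-ins for the relevant pipeline steps, and 10 is just the arbitrary factor mentioned above):

OVERSAMPLE_FACTOR = 10  # assumed constant; bounds the extra work but keeps scores approximate

def answers_with_oversampling(start_logits, end_logits, top_k, decode_spans, merge_answers):
    # 1) decode more spans than requested so later merging is unlikely to leave us short
    candidates = decode_spans(start_logits, end_logits, top_k * OVERSAMPLE_FACTOR)
    # 2) merge duplicate answers and combine their probabilities
    merged = merge_answers(candidates)
    # 3) apply the user-facing top_k only at the very end
    return sorted(merged, key=lambda c: c["score"], reverse=True)[:top_k]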

yushi2006 (Contributor) commented on Jul 3, 2025

Hey @Rocketknight1! I just noticed I was tagged here — sorry I missed it earlier. I’m jumping on the bug now and will get a fix out soon. Appreciate the mention!

yushi2006 (Contributor) commented on Jul 7, 2025

Hey @Rocketknight1! I've finished fixing this bug; I'd appreciate it if you could review the fix.
