Skip to content

Changed Attempt.outputs to return all assistant outputs #1168

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

mrowebot
Copy link
Contributor

@mrowebot mrowebot commented Apr 17, 2025

This PR fixes #1127.

Changed

  • attempts.Attempt.outputs to return all assistant turns outputs, not just the final turn output.

Added

  • last_output property to attempts.Attempt to return the last assistant turn output.
  • Unit test cases for probes.atkgen.Tox and detectors.unsafe_content.ToxicCommentModel to validate that outputs and detector_results cardinalities match.

Verification

  • Run the tests and ensure they pass python -m pytest tests/
  • python -m garak --model_type huggingface --model_name gpt2 --probes atkgen.Tox --detectors unsafe_content.ToxicCommentModel

The above CMD will write the probe attempts including detector results to the garak*.report.jsonl file and will contain entries where the "entry_type": "attempt" entry will contain outputs and detector_results with the same cardinality, should any detection results have been produced.

@mrowebot mrowebot marked this pull request as ready for review April 17, 2025 22:13
@mrowebot mrowebot changed the title 1127 Changed Attempt.outputs to return all assistant outputs #1127 Changed Attempt.outputs to return all assistant outputs Apr 17, 2025
@mrowebot mrowebot changed the title #1127 Changed Attempt.outputs to return all assistant outputs 1127 Changed Attempt.outputs to return all assistant outputs Apr 17, 2025
@jmartin-tech
Copy link
Collaborator

jmartin-tech commented Apr 17, 2025

The attempt.outputs method is currently in flux as turn an conversation support is coming soon in main. Until a week ago detectors used attempt.all_outputs() and as of land for #943 with language translation options they all now rely on attempt.outputs_for(lang) to get the list of outputs a detector should perform evaluation on, these methods are currently what are expected to have a 1:1 match in to the list of detector_results[detector_name] however I do see that Evaluator.evaluate() relies on attempt.outputs() which likely indicates #1127 is a unique issue. The rework of turn/conversation will need likely incorporate testing to validate detector result count to output expectations are enforced.

Note that the atkgen probe's custom behavior and mutation of Attempt is likely the true root of this divergence as that probe emits Attempts that to not conform to normal detector expectations.

I suspect this change is not quite what is needed however I need to do some more testing to be sure.

@jmartin-tech jmartin-tech changed the title 1127 Changed Attempt.outputs to return all assistant outputs Changed Attempt.outputs to return all assistant outputs Apr 17, 2025
@mrowebot
Copy link
Contributor Author

Sounds like this PR needs to be held until attempts.output is clearer, as this could be an edge case involving atkgen

@leondz
Copy link
Collaborator

leondz commented Apr 18, 2025

General refactor is upcoming in #1089 which is planned for the coming fortnight

@mrowebot
Copy link
Contributor Author

@leondz park this then until refactor and potentially delete? Feels like it might be a redundant change.

@jmartin-tech jmartin-tech marked this pull request as draft April 18, 2025 22:54
@leondz
Copy link
Collaborator

leondz commented Apr 19, 2025

The Turn & Conversation functionality won't hit release until end May, possibly end June. Depending on review, if we can land this earlier than that, I'm happy

@jmartin-tech
Copy link
Collaborator

@leondz sorry if I was unclear, the root cause of the issue seems to be in atkgen's custom mutation of it's attempts not matching the expectations of the detector the attempts are passed to.

While attempt.outputs() is only called in Evaluator.evaluate at this time, I do not believe a global changing in behavior of this method is the desired result as this impacts all evaluation processing. This PR is set as draft for now and I will take a closer look in the next few days at possible follow on impacts before shifting it back to ready for review state.

@mrowebot
Copy link
Contributor Author

@jmartin-tech let me know if any changes are needed here and happy to update the branch/PR.

@leondz
Copy link
Collaborator

leondz commented Apr 23, 2025

This highlighted an unclear behaviour in atkgen and whether use of all_outputs or outputs is preferred when doing detection on an attempt. Atkgen expects that the detector will be run over all outputs at any conversational turn.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: attempt["outputs"] log has fewer entries than in detector_results
3 participants