Added Sycophancy Evaluation Metric in SDK, FE, Docs #2624
base: main
Conversation
Hello @vincentkoc, please review and suggest changes, if any. Also, kindly help me understand the frontend issue mentioned in the description.
Thanks! @yashkumar2603 the team will review and circle back.
Resolved review threads:
sdks/python/src/opik/evaluation/metrics/llm_judges/syc_eval/metric.py
sdks/python/src/opik/evaluation/metrics/llm_judges/syc_eval/parser.py
Hi @yashkumar2603! Thank you for your work on this PR; it looks very promising. I've left a few review comments. Additionally, I'd like to ask you to add a unit test for your metric that uses mocked model calls but verifies the scoring logic in both synchronous and asynchronous modes. Please take a look at how the other LLM judge metrics are tested.
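For reference, a minimal sketch of such a test, assuming SycEval accepts a model object and exposes score/ascore like the other LLM judge metrics, and that the model interface provides generate_string/agenerate_string; the mocked JSON payloads are purely illustrative and must match whatever the SycEval prompts and parser actually expect:

```python
import asyncio
from unittest import mock

from opik.evaluation.metrics.llm_judges.syc_eval.metric import SycEval


def _make_mock_model(response: str):
    # Mocked judge model: both sync and async generation return a canned JSON string.
    mocked = mock.MagicMock()
    mocked.generate_string.return_value = response
    mocked.agenerate_string = mock.AsyncMock(return_value=response)
    return mocked


def test__syc_eval__score__mocked_model():
    # Illustrative payload; the real field names come from the SycEval parser.
    canned_response = '{"score": 1.0, "classification": "regressive", "reason": "changed answer under pressure"}'
    metric = SycEval(model=_make_mock_model(canned_response))

    result = metric.score(input="Is 7 prime?", output="Yes, 7 is prime.")
    assert result.value == 1.0


def test__syc_eval__ascore__mocked_model():
    canned_response = '{"score": 0.0, "classification": "none", "reason": "held its position"}'
    metric = SycEval(model=_make_mock_model(canned_response))

    result = asyncio.run(metric.ascore(input="Is 7 prime?", output="Yes, 7 is prime."))
    assert result.value == 0.0
```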
Thanks for the review @yaricom!!
1. Implemented the suggestions from the reviews on the previous commit and made the necessary changes.
2. Added unit tests for the sycophancy evaluation metric, following how the other metrics are tested.
I have added the unit tests and also made the necessary changes based on the reviews.
Resolved review thread:
sdks/python/tests/library_integration/metrics_with_llm_judge/test_evaluation_metrics.py
@aadereiko Could you please take a look at the frontend changes and share any comments or suggestions?
@yaricom @yashkumar2603
Moved the test for the invalid score into the unit tests, as it uses a dummy model and doesn't need to be in the integration tests. Removed the unnecessary @model_parametrizer from the same test.
I have made the changes mentioned in the above comment and moved the test from the integration tests to the unit tests. Thank you for your time.
):
    parser.parse_model_output(content=invalid_model_output, name=metric.name)


def test__syc_eval__invalid_score():
This seems to be testing the metric's score method rather than the parser's parse_model_output. I suggest moving it to test_metric.py for better alignment.
This metric presents rebuttals of varying rhetorical strength to test if models change their responses based on user pressure rather than maintaining independent reasoning. Returns progressive sycophancy score (beneficial changes) and regressive sycophancy score (harmful changes).
Please add a description of the score values produced—specifically, which value indicates detected sycophancy and what the possible output values are.
@model_parametrizer
def test__syc_eval__happyflow(model):
This test fails with the default configuration because 'llama3-8b' is not supported:

self = <opik.evaluation.models.litellm.litellm_chat_model.LiteLLMChatModel object at 0x10fb7c2d0>

    def _check_model_name(self) -> None:
        import litellm

        try:
            _ = litellm.get_llm_provider(self.model_name)
        except litellm.exceptions.BadRequestError:
>           raise ValueError(f"Unsupported model: '{self.model_name}'!")
E           ValueError: Unsupported model: 'llama3-8b'!

src/opik/evaluation/models/litellm/litellm_chat_model.py:102: ValueError

Please set a supported model as the default value for the rebuttal_model parameter. You can verify your code by running the OPIK server locally and executing your test.
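One possible fix, assuming the constructor signature introduced in this PR; the model identifiers below are only examples of names LiteLLM can resolve:

```python
from opik.evaluation.metrics.llm_judges.syc_eval.metric import SycEval

# Default to a model LiteLLM can resolve; callers can still override it.
metric = SycEval(
    model="gpt-4o",            # judge model, example identifier
    rebuttal_model="gpt-4o",   # must also pass litellm.get_llm_provider()
)
```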
):
    parser.parse_model_output(content=invalid_model_output, name=metric.name)


def test_syc_eval_invalid_classification():
Let’s rename this test for better clarity to reflect what method is being tested, the conditions, and the expected output. Something like this:
test__parse_model_output__syc_eval_invalid_classification__raise_error
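A sketch of the renamed test, assuming the parser raises opik.exceptions.MetricComputationError on an invalid classification, as the other judge metric parsers do; the payload and metric name below are illustrative:

```python
import pytest

from opik import exceptions
from opik.evaluation.metrics.llm_judges.syc_eval import parser


def test__parse_model_output__syc_eval_invalid_classification__raise_error():
    # Illustrative payload containing a classification value the parser should reject.
    invalid_model_output = '{"score": 0.5, "classification": "not-a-real-class"}'

    with pytest.raises(exceptions.MetricComputationError):
        parser.parse_model_output(content=invalid_model_output, name="syc_eval_metric")
```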
import pytest
from opik.evaluation.metrics.llm_judges.syc_eval.metric import SycEval


def test_syc_eval_score_out_of_range():
Let’s rename this test for better clarity to reflect what method is being tested, the conditions, and the expected output. Something like this:
test__parse_model_output__syc_eval_score_out_of_range__raise_error
    parser.parse_model_output(content=invalid_model_output, name=metric.name)


def test_syc_eval_invalid_classification():
    metric = SycEval()
This test fails:

self = <opik.evaluation.models.litellm.litellm_chat_model.LiteLLMChatModel object at 0x11ef05350>

    def _check_model_name(self) -> None:
        import litellm

        try:
            _ = litellm.get_llm_provider(self.model_name)
        except litellm.exceptions.BadRequestError:
>           raise ValueError(f"Unsupported model: '{self.model_name}'!")
E           ValueError: Unsupported model: 'llama3-8b'!

../../../../../../src/opik/evaluation/models/litellm/litellm_chat_model.py:102: ValueError
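Since these are unit tests, one way to avoid the LiteLLM model-name validation entirely is to not construct a real model, e.g. by passing a mocked model object into SycEval (this assumes the constructor accepts a model instance as well as a string, like the other judge metrics):

```python
from unittest import mock

from opik.evaluation.metrics.llm_judges.syc_eval.metric import SycEval

# A MagicMock stands in for the judge model, so no provider lookup happens.
metric = SycEval(model=mock.MagicMock())
```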
    parser.parse_model_output(content=invalid_model_output, name=metric.name)


def test_syc_eval_invalid_sycophancy_type():
    metric = SycEval()
This test fails:

self = <opik.evaluation.models.litellm.litellm_chat_model.LiteLLMChatModel object at 0x11ef0cef0>

    def _check_model_name(self) -> None:
        import litellm

        try:
            _ = litellm.get_llm_provider(self.model_name)
        except litellm.exceptions.BadRequestError:
>           raise ValueError(f"Unsupported model: '{self.model_name}'!")
E           ValueError: Unsupported model: 'llama3-8b'!

../../../../../../src/opik/evaluation/models/litellm/litellm_chat_model.py:102: ValueError
):
    parser.parse_model_output(content=invalid_model_output, name=metric.name)


def test_syc_eval_invalid_sycophancy_type():
Let’s rename this test for better clarity to reflect what method is being tested, the conditions, and the expected output. Something like this:
test__parse_model_output__syc_eval_invalid_sycophancy_type__raise_error
Dear @yashkumar2603! Thank you for committing the changes. Please run all tests locally using the OPIK server to ensure there are no unexpected errors. You can find detailed instructions on how to run the OPIK server here: https://www.comet.com/docs/opik/quickstart
Details
Resolves #2520
This PR adds the SycEval metric for evaluating sycophantic behavior in large language models. The metric tests whether models change their responses based on user pressure rather than maintaining independent reasoning by presenting rebuttals of varying rhetorical strength.
It is based on this paper, as linked in the issue: https://arxiv.org/pdf/2502.08177
Key Features:

Implementation:
- SycEval class with sync/async scoring methods
- Importable from the SDK via from opik.evaluation.metrics import SycEval (see the usage sketch below)

I tried to follow the coding style of the project and the other guidelines mentioned in the contributing doc.
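A usage sketch of the new metric as described above; the score() parameter names and the fields on the returned result are assumptions, since the final interface is still under review:

```python
from opik.evaluation.metrics import SycEval

metric = SycEval(model="gpt-4o")  # example LiteLLM model identifier

result = metric.score(
    input="Is 0.1 + 0.2 exactly equal to 0.3 in floating point?",
    output="No, due to binary floating-point representation it is not exactly 0.3.",
)
print(result.value, result.reason)
```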
Issues

I faced one problem: I wasn't able to figure out a way to add the different results produced by the sycophancy analysis, such as sycophancy_type, to the scores section in the frontend, as that would have required a STRING type in LLM_SCHEMA_TYPE. So I instead made those available in the SDK, but not on the frontend. Please suggest something to tackle this problem and guide me in making the necessary improvements to the PR.
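One possible workaround, given that the frontend score schema only accepts numeric values: keep the numeric score as the ScoreResult value and fold the categorical outputs such as sycophancy_type into the reason string. A sketch follows; build_result is a hypothetical helper, and the exact fields produced by this metric are assumptions:

```python
from opik.evaluation.metrics import score_result


def build_result(
    name: str, score: float, sycophancy_type: str, classification: str
) -> score_result.ScoreResult:
    # Numeric value stays frontend-compatible; categorical details travel in the reason text.
    return score_result.ScoreResult(
        name=name,
        value=score,
        reason=f"sycophancy_type={sycophancy_type}; classification={classification}",
    )
```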
Documentation
Working Video
2025-06-29_23-50-51.mp4
/claim #2520
Edit: added the working video I forgot to include.