
feat: replace all claude3 specific impl with canonicalEvaluator that … #101


Merged
merged 1 commit into awslabs:main on Mar 10, 2025

Conversation


@amznokapl amznokapl commented Mar 7, 2025

Summary

There is no open issue for this PR aside from issue #13, which requests support for Haiku 3 and which this PR would close. Rather than adding just one more model, a wider approach is taken here through a restructuring of the code.

The main goal of this pull request is to allow users to specify and configure the bedrock model they choose as an evaluator. The motivation behind this PR is that Sonnet 3 capacity issues posed a material problem for my team's use case.

To justify this goal it should be noted that:

  1. Sonnet 3 is out of date, with below-SOTA performance across tasks.
  2. Sonnet 3 is expensive in per-token costs.
  3. Sonnet 3 has reduced Bedrock capacity compared to other models.
  4. Models are changing quickly. Consumers of this package should be able to flexibly update models when desired.

Code Approach

The key insight for making the code base significantly more agnostic to specific models is that the Claude3 evaluator implementation contained almost no code specific to that one model. Substituting any other Anthropic model (or a Llama model, with some other small changes added in this PR) also works. The Claude3 evaluator was much more tightly coupled to how the templates, and the interactions with those templates, were set up.

In the diff, we see that most of the implementation of the previous Claude3 evaluator is unchanged.

Supporting Anthropic and Meta to start

Since the request and response shapes differ across model providers, I wrote a small handler class that handles Meta and Anthropic models to start. This class can be easily extended in the future to support other providers. See src/agenteval/evaluators/bedrock/request/bedrock_request_handler.py for exactly how requests to, and responses from, the two providers differ.
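As an illustrative sketch only (the class name, method names, and exact body shapes below are assumptions, not the actual contents of bedrock_request_handler.py), a provider-aware handler might dispatch on the model ID prefix like this:

```python
import json


class BedrockRequestHandler:
    """Hypothetical sketch of a provider-aware request/response handler.

    The real implementation lives in
    src/agenteval/evaluators/bedrock/request/bedrock_request_handler.py
    and may differ in names and structure.
    """

    @staticmethod
    def build_request_body(model_id: str, prompt: str, request_config: dict) -> str:
        if model_id.startswith("anthropic."):
            # Anthropic models on Bedrock use a messages-style body.
            body = {
                "anthropic_version": "bedrock-2023-05-31",
                "messages": [{"role": "user", "content": prompt}],
                **request_config,
            }
        elif model_id.startswith("meta."):
            # Meta Llama models take a flat prompt plus generation parameters.
            body = {"prompt": prompt, **request_config}
        else:
            raise ValueError(f"Unsupported model provider for {model_id}")
        return json.dumps(body)

    @staticmethod
    def parse_completion(model_id: str, response_body: str) -> str:
        data = json.loads(response_body)
        if model_id.startswith("anthropic."):
            return data["content"][0]["text"]
        if model_id.startswith("meta."):
            return data["generation"]
        raise ValueError(f"Unsupported model provider for {model_id}")
```

Centralizing the provider-specific differences in one class keeps the evaluator itself model-agnostic; supporting a new provider means adding one branch here rather than writing a new evaluator.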

Going forward

Now the end user can pick any Anthropic or Meta model and specify whatever request_config they desire (temperature, max_tokens, etc.), while still using the same overall evaluation approach defined in this repository.

Going forward, an alternative evaluation method can be defined next to the canonical method defined in this PR (previously the claude3 evaluator). Future evaluators will similarly accept the BedrockModelConfig, which can also be extended to support other providers.
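As a rough sketch of what such a config might carry (the field names below are assumptions; see the PR diff for the real BedrockModelConfig definition):

```python
from dataclasses import dataclass, field


@dataclass
class BedrockModelConfig:
    """Hypothetical sketch; the actual BedrockModelConfig in the PR
    may use different field names and defaults."""

    # Full Bedrock model identifier, e.g. "anthropic.claude-3-haiku-...".
    model_id: str
    # Provider-specific inference parameters (temperature, max_tokens, ...).
    request_config: dict = field(default_factory=dict)


# A future evaluator could accept this config instead of hard-coding Claude 3:
config = BedrockModelConfig(
    model_id="meta.llama3-70b-instruct-v1:0",
    request_config={"temperature": 0.0, "max_gen_len": 512},
)
```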

Backwards Compatibility

The changes presented in this PR should be fully backwards compatible. If the previous model, claude-3, is specified in the YAML, exactly the same behavior should be observed.
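For example (the field names below are illustrative, not necessarily the package's exact YAML schema), an existing plan can keep its pinned evaluator, or opt into any newly supported Bedrock model:

```yaml
# Illustrative only; field names may not match the package's exact schema.
evaluator:
  model: claude-3   # previous behavior, unchanged
  # or, after this PR, any supported Anthropic or Meta model, e.g.:
  #   model: meta.llama3-70b-instruct-v1:0
  #   request_config:
  #     temperature: 0.0
  #     max_tokens: 1024
```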

Note: There is some chance that loggers expecting certain file paths will be affected by this change.

Limitations

There is a very wide surface area with what can be tested here. Since this PR retains the existing functionality as is, I hope that we can test the new capabilities presented in this PR in the wild.

I did not spend time tuning the default model configurations. I kept the Bedrock request configurations in parity with the existing Sonnet 3 configuration (temperature, topP, topK, maxTokens). Since users can now define any configuration, I thought this was a sensible approach.

Testing

  • Linting: python3 -m flake8 src/ && python3 -m black --check src/ && python3 -m isort src/ --check --diff
  • Unit Tests: added unit tests to cover new functionality
  • E2E Tests: ran package against the example added in the samples/ folder successfully.

Unit test run

python3 -m pytest .

tests/src/agenteval/evaluators/bedrock_request/test_bedrock_request_handler.py .....                                                                             [  6%]
tests/src/agenteval/evaluators/canonical/test_evaluator.py .........                                                                                             [ 19%]
tests/src/agenteval/evaluators/test_evaluator_factory.py ..                                                                                                      [ 21%]
tests/src/agenteval/plan/test_logging.py ...                                                                                                                     [ 26%]
tests/src/agenteval/plan/test_plan.py .........                                                                                                                  [ 38%]
tests/src/agenteval/targets/bedrock_agent/test_target.py ..                                                                                                      [ 41%]
tests/src/agenteval/targets/bedrock_flow/test_target.py .                                                                                                        [ 42%]
tests/src/agenteval/targets/bedrock_knowledge_base/test_target.py .                                                                                              [ 43%]
tests/src/agenteval/targets/lexv2/test_target.py ...                                                                                                             [ 47%]
tests/src/agenteval/targets/q_business/test_target.py .                                                                                                          [ 49%]
tests/src/agenteval/targets/sagemaker_endpoint/test_target.py .........                                                                                          [ 61%]
tests/src/agenteval/targets/test_target_factory.py ...                                                                                                           [ 65%]
tests/src/agenteval/test/test_test_suite.py .....                                                                                                                [ 72%]
tests/src/agenteval/test_cli.py ....                                                                                                                             [ 78%]
tests/src/agenteval/test_metrics.py ...                                                                                                                          [ 82%]
tests/src/agenteval/test_summary.py .                                                                                                                            [ 83%]
tests/src/agenteval/test_trace.py .....                                                                                                                          [ 90%]
tests/src/agenteval/utils/test_aws.py .                                                                                                                          [ 91%]
tests/src/agenteval/utils/test_imports.py ......                                                                                                                 [100%]

========================================================================== 73 passed in 0.69s ==========================================================================

…supports all models defined in newly introduced BedrockModelConfig
@sharonxiaohanli sharonxiaohanli self-requested a review March 10, 2025 20:02
@sharonxiaohanli sharonxiaohanli merged commit a325ad3 into awslabs:main Mar 10, 2025
4 checks passed