
feat: replace all claude3 specific impl with canonicalEvaluator that … #101


Merged
merged 1 commit into awslabs:main on Mar 10, 2025

Conversation


@amznokapl amznokapl commented Mar 7, 2025

Summary

There is no open issue for this PR aside from issue #13, which requests support for Haiku 3 and which this PR would close. Rather than adding just one more model, a wider approach is taken here through a restructuring of the code.

The main goal of this pull request is to allow users to specify and configure the bedrock model they choose as an evaluator. The motivation behind this PR is that Sonnet 3 capacity issues posed a material problem for my team's use case.

To justify this goal it should be noted that:

  1. Sonnet 3 is out of date, with below-SOTA performance across tasks.
  2. Sonnet 3 is expensive in per-token costs.
  3. Sonnet 3 has reduced Bedrock capacity compared to other models.
  4. Models are changing quickly. Consumers of this package should be able to flexibly update models when desired.

Code Approach

The key insight for making the code base significantly more agnostic to specific models is that the Claude3 evaluator implementation contained almost no code specific to that one model. Substituting any other Anthropic model (or a Llama model, with some other small changes added in this PR) also works. The Claude3 evaluator was much more tightly coupled to how the templates, and the interactions with those templates, were set up.

In the diff, we see that most of the implementation of the previous Claude3 evaluator is unchanged.

Supporting Anthropic and Meta to start

Since the request and response shapes differ across model providers, I wrote a small handler class that handles Meta and Anthropic models to start. This class can be easily extended in the future to support other providers. See src/agenteval/evaluators/bedrock/request/bedrock_request_handler.py for exactly how requests to, and responses from, the two providers differ.
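As an illustrative sketch only (the class name, method names, and exact body shapes below are assumptions, not the actual contents of bedrock_request_handler.py), a provider-aware handler might dispatch on the model ID prefix like this:

```python
import json


class BedrockRequestHandler:
    """Hypothetical sketch of a provider-aware request/response handler.

    The real implementation lives in
    src/agenteval/evaluators/bedrock/request/bedrock_request_handler.py
    and may differ in names and structure.
    """

    @staticmethod
    def build_request_body(model_id: str, prompt: str, request_config: dict) -> str:
        if model_id.startswith("anthropic."):
            # Anthropic models on Bedrock use a messages-style body.
            body = {
                "anthropic_version": "bedrock-2023-05-31",
                "messages": [{"role": "user", "content": prompt}],
                **request_config,
            }
        elif model_id.startswith("meta."):
            # Meta Llama models take a flat prompt plus generation parameters.
            body = {"prompt": prompt, **request_config}
        else:
            raise ValueError(f"Unsupported model provider for {model_id}")
        return json.dumps(body)

    @staticmethod
    def parse_completion(model_id: str, response_body: str) -> str:
        data = json.loads(response_body)
        if model_id.startswith("anthropic."):
            return data["content"][0]["text"]
        if model_id.startswith("meta."):
            return data["generation"]
        raise ValueError(f"Unsupported model provider for {model_id}")
```

Centralizing the provider-specific differences in one class keeps the evaluator itself model-agnostic; supporting a new provider means adding one branch here rather than writing a new evaluator.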

Going forward

Now the end user can pick any Anthropic or Meta model and specify whatever request_config they desire (temperature, max_tokens, etc.), while still using the same overall evaluation approach defined in this repository.

Going forward, an alternative evaluation method can be defined next to the canonical method defined in this PR (previously the claude3 evaluator). Future evaluators will similarly accept the BedrockModelConfig, which can also be extended to support other providers.
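As a rough sketch of what such a config might carry (the field names below are assumptions; see the PR diff for the real BedrockModelConfig definition):

```python
from dataclasses import dataclass, field


@dataclass
class BedrockModelConfig:
    """Hypothetical sketch; the actual BedrockModelConfig in the PR
    may use different field names and defaults."""

    # Full Bedrock model identifier, e.g. "anthropic.claude-3-haiku-...".
    model_id: str
    # Provider-specific inference parameters (temperature, max_tokens, ...).
    request_config: dict = field(default_factory=dict)


# A future evaluator could accept this config instead of hard-coding Claude 3:
config = BedrockModelConfig(
    model_id="meta.llama3-70b-instruct-v1:0",
    request_config={"temperature": 0.0, "max_gen_len": 512},
)
```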

Backwards Compatibility

The changes presented in this PR should be fully backwards compatible. If the previous model, claude-3, is specified in the YAML, exactly the same behavior should be observed.
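For example (the field names below are illustrative, not necessarily the package's exact YAML schema), an existing plan can keep its pinned evaluator, or opt into any newly supported Bedrock model:

```yaml
# Illustrative only; field names may not match the package's exact schema.
evaluator:
  model: claude-3   # previous behavior, unchanged
  # or, after this PR, any supported Anthropic or Meta model, e.g.:
  #   model: meta.llama3-70b-instruct-v1:0
  #   request_config:
  #     temperature: 0.0
  #     max_tokens: 1024
```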

Note: There is some chance that loggers expecting certain file paths will be affected by this change.

Limitations

There is a very wide surface area with what can be tested here. Since this PR retains the existing functionality as is, I hope that we can test the new capabilities presented in this PR in the wild.

I did not spend time tuning the default model configurations. I kept the Bedrock request configurations in parity with the existing Sonnet 3 configuration (temperature, topP, topK, maxTokens). Since users can now define any configuration, I thought this was a sensible approach.

Testing

  • Linting: python3 -m flake8 src/ && python3 -m black --check src/ && python3 -m isort src/ --check --diff
  • Unit Tests: added unit tests to cover new functionality
  • E2E Tests: ran package against the example added in the samples/ folder successfully.

Unit test run

python3 -m pytest .

tests/src/agenteval/evaluators/bedrock_request/test_bedrock_request_handler.py .....                                                                             [  6%]
tests/src/agenteval/evaluators/canonical/test_evaluator.py .........                                                                                             [ 19%]
tests/src/agenteval/evaluators/test_evaluator_factory.py ..                                                                                                      [ 21%]
tests/src/agenteval/plan/test_logging.py ...                                                                                                                     [ 26%]
tests/src/agenteval/plan/test_plan.py .........                                                                                                                  [ 38%]
tests/src/agenteval/targets/bedrock_agent/test_target.py ..                                                                                                      [ 41%]
tests/src/agenteval/targets/bedrock_flow/test_target.py .                                                                                                        [ 42%]
tests/src/agenteval/targets/bedrock_knowledge_base/test_target.py .                                                                                              [ 43%]
tests/src/agenteval/targets/lexv2/test_target.py ...                                                                                                             [ 47%]
tests/src/agenteval/targets/q_business/test_target.py .                                                                                                          [ 49%]
tests/src/agenteval/targets/sagemaker_endpoint/test_target.py .........                                                                                          [ 61%]
tests/src/agenteval/targets/test_target_factory.py ...                                                                                                           [ 65%]
tests/src/agenteval/test/test_test_suite.py .....                                                                                                                [ 72%]
tests/src/agenteval/test_cli.py ....                                                                                                                             [ 78%]
tests/src/agenteval/test_metrics.py ...                                                                                                                          [ 82%]
tests/src/agenteval/test_summary.py .                                                                                                                            [ 83%]
tests/src/agenteval/test_trace.py .....                                                                                                                          [ 90%]
tests/src/agenteval/utils/test_aws.py .                                                                                                                          [ 91%]
tests/src/agenteval/utils/test_imports.py ......                                                                                                                 [100%]

========================================================================== 73 passed in 0.69s ==========================================================================

…supports all models defined in newly introduced BedrockModelConfig
@sharonxiaohanli sharonxiaohanli self-requested a review March 10, 2025 20:02
@sharonxiaohanli sharonxiaohanli merged commit a325ad3 into awslabs:main Mar 10, 2025
4 checks passed