feat: replace all claude3 specific impl with canonicalEvaluator that … #101
## Summary

There is no open issue for this PR apart from #13, which requests support for Haiku 3 and which this PR would close. Rather than adding just one more model, this PR takes a wider approach by restructuring the code.

The main goal of this pull request is to let users specify and configure the Bedrock model of their choice as an evaluator. The motivation is that Sonnet 3 capacity issues posed a material problem for my team's use case. The sections below lay out the justification for this goal and the approach taken.
## Code Approach

The key insight for making the code base significantly more model-agnostic is that the `Claude3` evaluator implementation contained almost no code specific to that one model: substituting any other Anthropic model (or, with a few small changes added in this PR, a Llama model) also works. The `Claude3` evaluator was coupled far more to how the templates, and the interactions with those templates, were set up than to the model itself. Accordingly, in the diff, most of the implementation of the previous `Claude3` evaluator is unchanged; the model-specific pieces are factored out as sketched below.
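To make that separation concrete, here is a minimal sketch (not the repository's actual code) of a model-agnostic evaluation turn. `build_request_body` and `parse_completion` are hypothetical helpers, sketched in the next section:

```python
import json

import boto3


def evaluate(prompt: str, model_id: str, request_config: dict) -> str:
    """Run one evaluation turn; nothing here depends on a specific model."""
    client = boto3.client("bedrock-runtime")
    # Provider-specific step 1: shape the request body for this model family.
    body = build_request_body(model_id, prompt, request_config)
    response = client.invoke_model(modelId=model_id, body=json.dumps(body))
    # Provider-specific step 2: extract the completion text from the response.
    return parse_completion(model_id, json.loads(response["body"].read()))
```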
### Supporting Anthropic and Meta to start

Since the request and response shapes differ across model providers, I wrote a small handler class that covers Meta and Anthropic models to start. The class can easily be extended to support other providers in the future. See `src/agenteval/evaluators/bedrock/request/bedrock_request_handler.py` for exactly how requests to, and responses from, the two providers differ; a simplified sketch follows.
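This sketch shows only the shape of the provider split; the real logic lives in the handler class above and may differ in detail. The request and response bodies follow the Bedrock `InvokeModel` formats for the two providers:

```python
def build_request_body(model_id: str, prompt: str, request_config: dict) -> dict:
    """Shape an InvokeModel request body for the model's provider."""
    if model_id.startswith("anthropic."):
        # Anthropic models on Bedrock use the Messages API body.
        return {
            "anthropic_version": "bedrock-2023-05-31",
            "messages": [{"role": "user", "content": prompt}],
            **request_config,  # e.g. temperature, max_tokens
        }
    if model_id.startswith("meta."):
        # Meta Llama models on Bedrock take a flat prompt string.
        return {"prompt": prompt, **request_config}  # e.g. temperature, max_gen_len
    raise ValueError(f"No request handler for model: {model_id}")


def parse_completion(model_id: str, response_body: dict) -> str:
    """Extract the completion text from a provider-specific response."""
    if model_id.startswith("anthropic."):
        return response_body["content"][0]["text"]
    if model_id.startswith("meta."):
        return response_body["generation"]
    raise ValueError(f"No response parser for model: {model_id}")
```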
### Going forward

The end user can now pick any Anthropic or Meta model and specify whatever `request_config` they desire (`temperature`, `max_tokens`, etc.) while still using the same overall evaluation approach defined in this repository; an illustrative configuration is shown below.

An alternative evaluation method can later be defined alongside the `canonical` method introduced in this PR (previously the `claude3` evaluator). Future evaluators will similarly accept the `BedrockModelConfig`, which can also be extended to support other providers.
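As an illustration only (key names other than `model` and `request_config` are assumptions; the exact schema is defined by this PR's code), a user-facing YAML configuration might look like:

```yaml
evaluator:
  model: anthropic.claude-3-haiku-20240307-v1:0  # any supported Anthropic or Meta model
  request_config:
    temperature: 0.0
    max_tokens: 1024
```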
## Backwards Compatibility

The changes presented in this PR should be fully backwards compatible: if the previous `model: claude-3` is specified in the YAML, exactly the same behavior should be seen.

Note: there is some chance that loggers expecting certain file paths will be affected by this change.
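For reference, a minimal sketch of the legacy configuration, which should behave exactly as it did before this PR (surrounding keys elided):

```yaml
evaluator:
  model: claude-3  # pre-existing alias; same behavior as before
```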
## Limitations

There is a very wide surface area of what could be tested here. Since this PR retains the existing functionality as-is, I hope we can test the new capabilities it introduces in the wild.

I did not spend time tuning the default model configurations; I kept the Bedrock request configurations in parity with the existing Sonnet 3 configuration (`temperature`, `topP`, `topK`, `maxTokens`). Since users can now define any configuration, this seemed like a sensible approach.

## Testing
- Lint and formatting checks pass: `python3 -m flake8 src/ && python3 -m black --check src/ && python3 -m isort src/ --check --diff`
- Ran the examples in the `samples/` folder successfully.
- Unit test run: `python3 -m pytest .`