# update eval docs to use poetry #486

**Status:** Open. Wants to merge 2 commits into `master`. The changes shown below are from 1 commit.

## `docs/evals-introduction.md` (14 additions, 9 deletions)
@@ -106,22 +106,27 @@ Evaluations serve several critical purposes:

### Basic Usage

+Install dependencies:
+```bash
+poetry install
+```

Run all evaluations:
```bash
-pytest ./tests/llm/test_*.py
+poetry run pytest ./tests/llm/test_*.py
```

By default, the tests load mock files and present them to the LLM whenever it asks for them. If no mock file exists for a tool call, the call is passed through to the live tool itself. In many cases this causes the eval to fail unless the live environment (Kubernetes cluster) matches what the LLM expects.
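
If a missing mock causes a failure because your cluster does not match what the LLM expects, one option is to run that particular test against a live cluster instead (see the Live Testing section below), for example:

```bash
RUN_LIVE=true poetry run pytest ./tests/llm/test_ask_holmes.py -k "01_how_many_pods"
```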

Run a specific test suite:
```bash
-pytest ./tests/llm/test_ask_holmes.py
-pytest ./tests/llm/test_investigate.py
+poetry run pytest ./tests/llm/test_ask_holmes.py
+poetry run pytest ./tests/llm/test_investigate.py
```

Run a specific test case:
```bash
-pytest ./tests/llm/test_ask_holmes.py -k "01_how_many_pods"
+poetry run pytest ./tests/llm/test_ask_holmes.py -k "01_how_many_pods"
```

> You can investigate and debug why an eval fails using the output provided in the console. The output includes the correctness score, the reasoning for the score, information about which tools were called, the expected answer, and the LLM's answer.
@@ -148,7 +153,7 @@ Run a comprehensive evaluation:
export MODEL=gpt-4o

# Run with parallel execution for speed
-pytest -n 10 ./tests/llm/test_*.py
+poetry run pytest -n 10 ./tests/llm/test_*.py
```

### Live Testing
@@ -157,7 +162,7 @@ For tests that require actual Kubernetes resources:
```bash
export RUN_LIVE=true

-pytest ./tests/llm/test_ask_holmes.py -k "specific_test"
+poetry run pytest ./tests/llm/test_ask_holmes.py -k "specific_test"
```

Live testing requires a Kubernetes cluster and will execute `before-test` and `after-test` commands to set up/tear down resources. Not all tests support live testing. Some tests require manual setup.
@@ -166,12 +171,12 @@

1. **Create Baseline**: Run evaluations with a reference model
```bash
-EXPERIMENT_ID=baseline_gpt4o MODEL=gpt-4o pytest -n 10 ./tests/llm/test_*
+EXPERIMENT_ID=baseline_gpt4o MODEL=gpt-4o poetry run pytest -n 10 ./tests/llm/test_*
```

> **Contributor Author:** Need to verify that environment variables like `EXPERIMENT_ID` are still propagated to pytest this way.
>
> **Collaborator:** It's still possible to propagate the env to pytest with poetry.
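
For reference, `poetry run` launches `pytest` in the current shell environment, so either of the usual styles should still propagate these variables. A quick illustrative sketch (not taken from this repository):

```bash
# Inline prefix: the variables apply only to this single invocation
EXPERIMENT_ID=baseline_gpt4o MODEL=gpt-4o poetry run pytest -n 10 ./tests/llm/test_*

# Exported: the variables persist for later commands in the same shell
export EXPERIMENT_ID=baseline_gpt4o
export MODEL=gpt-4o
poetry run pytest -n 10 ./tests/llm/test_*
```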

2. **Test New Model**: Run evaluations with the model you want to compare
```bash
-EXPERIMENT_ID=test_claude35 MODEL=anthropic/claude-3.5 pytest -n 10 ./tests/llm/test_*
+EXPERIMENT_ID=test_claude35 MODEL=anthropic/claude-3.5 poetry run pytest -n 10 ./tests/llm/test_*
```

3. **Compare Results**: Use the Braintrust dashboard to analyze performance differences
@@ -197,7 +202,7 @@ Learn how to analyze evaluation results using Braintrust in the [Reporting Guide

Enable verbose output:
```bash
-pytest -v -s ./tests/llm/test_ask_holmes.py -k "specific_test"
+poetry run pytest -v -s ./tests/llm/test_ask_holmes.py -k "specific_test"
```

This shows detailed output including:

## `docs/evals-reporting.md` (2 additions, 2 deletions)
@@ -42,15 +42,15 @@ export BRAINTRUST_API_KEY=sk-your-key
export UPLOAD_DATASET=true
export PUSH_EVALS_TO_BRAINTRUST=true

-pytest ./tests/llm/test_ask_holmes.py
+poetry run pytest ./tests/llm/test_ask_holmes.py
```

### Named Experiment

```bash
export EXPERIMENT_ID=baseline_gpt4o
export MODEL=gpt-4o
-pytest -n 10 ./tests/llm/test_*.py
+poetry run pytest -n 10 ./tests/llm/test_*.py
```

### Key Environment Variables

## `docs/evals-writing.md` (6 additions, 6 deletions)
@@ -68,13 +68,13 @@ The best way to do this is to:
1. Deploy the test case you want to build an eval for in a Kubernetes cluster (run the `before_test` script manually)
2. Configure HolmesGPT to connect to the cluster (via kubectl and any other relevant toolsets)
3. Enable auto-generation of mock files by setting `generate_mocks: True` in your `test_case.yaml`
-4. Repeatedly run the eval with `ITERATIONS=100 pytest tests/llm/test_ask_holmes.py -k 99_pod_health_check`
+4. Repeatedly run the eval with `ITERATIONS=100 poetry run pytest tests/llm/test_ask_holmes.py -k 99_pod_health_check`
5. Remove the prefix `.AUTOGENERATED` from all autogenerated files (one possible way to do this is sketched after this list)
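
A possible way to handle that final rename step, assuming the generated mocks live under `tests/llm/fixtures` (both the path and the file layout are assumptions; adjust them to your test case's folder):

```bash
# Strip the ".AUTOGENERATED" marker from every generated mock file name.
# The fixtures path is an assumption; point it at your test case's directory.
find tests/llm/fixtures -name '*AUTOGENERATED*' | while read -r f; do
  mv "$f" "${f/.AUTOGENERATED/}"
done
```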

### Step 4: Run the Test

```bash
-pytest ./tests/llm/test_ask_holmes.py -k "99_pod_health_check" -v
+poetry run pytest ./tests/llm/test_ask_holmes.py -k "99_pod_health_check" -v
```

## Test Case Configuration Reference
@@ -128,7 +128,7 @@ after_test: kubectl delete -f ./manifest.yaml
Set `generate_mocks: true` in `test_case.yaml` and run with a live cluster:

```bash
-ITERATIONS=100 pytest ./tests/llm/test_ask_holmes.py -k "your_test"
+ITERATIONS=100 poetry run pytest ./tests/llm/test_ask_holmes.py -k "your_test"
```

This captures real tool outputs and saves them as mock files.
@@ -214,7 +214,7 @@ after-test: kubectl delete -f manifest.yaml
### Step 3: Run Live Test

```bash
-RUN_LIVE=true pytest ./tests/llm/test_ask_holmes.py -k "your_test"
+RUN_LIVE=true poetry run pytest ./tests/llm/test_ask_holmes.py -k "your_test"
```

> `RUN_LIVE` is currently incompatible with `ITERATIONS` > 1.
@@ -272,11 +272,11 @@ evaluation:

```bash
# Verbose output showing all details
-pytest -v -s ./tests/llm/test_ask_holmes.py -k "your_test"
+poetry run pytest -v -s ./tests/llm/test_ask_holmes.py -k "your_test"

# Generate fresh mocks from live system
# set `generate_mocks: True` in `test_case.yaml` and then:
-pytest ./tests/llm/test_ask_holmes.py -k "your_test"
+poetry run pytest ./tests/llm/test_ask_holmes.py -k "your_test"
```

This completes the evaluation writing guide. The next step is setting up reporting and analysis using Braintrust.