# update eval docs to use poetry #486

**Status:** Open. Wants to merge 2 commits into `master`. The changes shown below are from 1 commit.

## `docs/evals-introduction.md` (14 additions, 9 deletions)
@@ -106,22 +106,27 @@ Evaluations serve several critical purposes:

### Basic Usage

+Install dependencies:
+```bash
+poetry install
+```

Run all evaluations:
```bash
-pytest ./tests/llm/test_*.py
+poetry run pytest ./tests/llm/test_*.py
```

By default, the tests load mock files and present them to the LLM whenever it asks for them. If no mock file exists for a tool call, the call is passed through to the live tool itself. In many cases this causes the eval to fail unless the live environment (Kubernetes cluster) matches what the LLM expects.
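
If a missing mock causes a failure because your cluster does not match what the LLM expects, one option is to run that particular test against a live cluster instead (see the Live Testing section below), for example:

```bash
RUN_LIVE=true poetry run pytest ./tests/llm/test_ask_holmes.py -k "01_how_many_pods"
```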

Run a specific test suite:
```bash
-pytest ./tests/llm/test_ask_holmes.py
-pytest ./tests/llm/test_investigate.py
+poetry run pytest ./tests/llm/test_ask_holmes.py
+poetry run pytest ./tests/llm/test_investigate.py
```

Run a specific test case:
```bash
-pytest ./tests/llm/test_ask_holmes.py -k "01_how_many_pods"
+poetry run pytest ./tests/llm/test_ask_holmes.py -k "01_how_many_pods"
```

> You can investigate and debug why an eval fails using the output provided in the console. The output includes the correctness score, the reasoning for the score, information about which tools were called, the expected answer, and the LLM's answer.
@@ -148,7 +153,7 @@ Run a comprehensive evaluation:
export MODEL=gpt-4o

# Run with parallel execution for speed
-pytest -n 10 ./tests/llm/test_*.py
+poetry run pytest -n 10 ./tests/llm/test_*.py
```

### Live Testing
@@ -157,7 +162,7 @@ For tests that require actual Kubernetes resources:
```bash
export RUN_LIVE=true

-pytest ./tests/llm/test_ask_holmes.py -k "specific_test"
+poetry run pytest ./tests/llm/test_ask_holmes.py -k "specific_test"
```

Live testing requires a Kubernetes cluster and will execute `before-test` and `after-test` commands to set up/tear down resources. Not all tests support live testing. Some tests require manual setup.
@@ -166,12 +171,12 @@

1. **Create Baseline**: Run evaluations with a reference model
```bash
-EXPERIMENT_ID=baseline_gpt4o MODEL=gpt-4o pytest -n 10 ./tests/llm/test_*
+EXPERIMENT_ID=baseline_gpt4o MODEL=gpt-4o poetry run pytest -n 10 ./tests/llm/test_*
```

> **Contributor Author:** Need to verify that environment variables like `EXPERIMENT_ID` are still propagated to pytest this way.
>
> **Collaborator:** It's still possible to propagate the env to pytest with poetry.
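
For reference, `poetry run` launches `pytest` in the current shell environment, so either of the usual styles should still propagate these variables. A quick illustrative sketch (not taken from this repository):

```bash
# Inline prefix: the variables apply only to this single invocation
EXPERIMENT_ID=baseline_gpt4o MODEL=gpt-4o poetry run pytest -n 10 ./tests/llm/test_*

# Exported: the variables persist for later commands in the same shell
export EXPERIMENT_ID=baseline_gpt4o
export MODEL=gpt-4o
poetry run pytest -n 10 ./tests/llm/test_*
```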

2. **Test New Model**: Run evaluations with the model you want to compare
```bash
-EXPERIMENT_ID=test_claude35 MODEL=anthropic/claude-3.5 pytest -n 10 ./tests/llm/test_*
+EXPERIMENT_ID=test_claude35 MODEL=anthropic/claude-3.5 poetry run pytest -n 10 ./tests/llm/test_*
```

3. **Compare Results**: Use the Braintrust dashboard to analyze performance differences
@@ -197,7 +202,7 @@ Learn how to analyze evaluation results using Braintrust in the [Reporting Guide

Enable verbose output:
```bash
-pytest -v -s ./tests/llm/test_ask_holmes.py -k "specific_test"
+poetry run pytest -v -s ./tests/llm/test_ask_holmes.py -k "specific_test"
```

This shows detailed output including:

## `docs/evals-reporting.md` (2 additions, 2 deletions)
@@ -42,15 +42,15 @@ export BRAINTRUST_API_KEY=sk-your-key
export UPLOAD_DATASET=true
export PUSH_EVALS_TO_BRAINTRUST=true

-pytest ./tests/llm/test_ask_holmes.py
+poetry run pytest ./tests/llm/test_ask_holmes.py
```

### Named Experiment

```bash
export EXPERIMENT_ID=baseline_gpt4o
export MODEL=gpt-4o
-pytest -n 10 ./tests/llm/test_*.py
+poetry run pytest -n 10 ./tests/llm/test_*.py
```

### Key Environment Variables

## `docs/evals-writing.md` (6 additions, 6 deletions)
@@ -68,13 +68,13 @@ The best way to do this is to:
1. Deploy the test case you want to build an eval for in a Kubernetes cluster (run the `before_test` script manually)
2. Configure HolmesGPT to connect to the cluster (via kubectl and any other relevant toolsets)
3. Enable auto-generation of mock files by setting `generate_mocks: True` in your `test_case.yaml`
-4. Repeatedly run the eval with `ITERATIONS=100 pytest tests/llm/test_ask_holmes.py -k 99_pod_health_check`
+4. Repeatedly run the eval with `ITERATIONS=100 poetry run pytest tests/llm/test_ask_holmes.py -k 99_pod_health_check`
5. Remove the prefix `.AUTOGENERATED` from all autogenerated files (one possible way to do this is sketched after this list)
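
A possible way to handle that final rename step, assuming the generated mocks live under `tests/llm/fixtures` (both the path and the file layout are assumptions; adjust them to your test case's folder):

```bash
# Strip the ".AUTOGENERATED" marker from every generated mock file name.
# The fixtures path is an assumption; point it at your test case's directory.
find tests/llm/fixtures -name '*AUTOGENERATED*' | while read -r f; do
  mv "$f" "${f/.AUTOGENERATED/}"
done
```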

### Step 4: Run the Test

```bash
-pytest ./tests/llm/test_ask_holmes.py -k "99_pod_health_check" -v
+poetry run pytest ./tests/llm/test_ask_holmes.py -k "99_pod_health_check" -v
```

## Test Case Configuration Reference
@@ -128,7 +128,7 @@ after_test: kubectl delete -f ./manifest.yaml
Set `generate_mocks: true` in `test_case.yaml` and run with a live cluster:

```bash
-ITERATIONS=100 pytest ./tests/llm/test_ask_holmes.py -k "your_test"
+ITERATIONS=100 poetry run pytest ./tests/llm/test_ask_holmes.py -k "your_test"
```

This captures real tool outputs and saves them as mock files.
@@ -214,7 +214,7 @@ after-test: kubectl delete -f manifest.yaml
### Step 3: Run Live Test

```bash
-RUN_LIVE=true pytest ./tests/llm/test_ask_holmes.py -k "your_test"
+RUN_LIVE=true poetry run pytest ./tests/llm/test_ask_holmes.py -k "your_test"
```

> `RUN_LIVE` is currently incompatible with `ITERATIONS` > 1.
@@ -272,11 +272,11 @@ evaluation:

```bash
# Verbose output showing all details
-pytest -v -s ./tests/llm/test_ask_holmes.py -k "your_test"
+poetry run pytest -v -s ./tests/llm/test_ask_holmes.py -k "your_test"

# Generate fresh mocks from live system
# set `generate_mocks: True` in `test_case.yaml` and then:
-pytest ./tests/llm/test_ask_holmes.py -k "your_test"
+poetry run pytest ./tests/llm/test_ask_holmes.py -k "your_test"
```

This completes the evaluation writing guide. The next step is setting up reporting and analysis using Braintrust.