Description
Proposal summary
We would like to extend the existing evaluation metrics with a new metric called "Structured Output Compliance". Essentially, it checks that a model's output is JSON and/or JSON-LD compatible. Ideally, the solution would also support validating the output against a Pydantic schema.
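As a rough illustration of the heuristic side of this check, here is a minimal sketch assuming `pydantic` v2 is available. The `check_structured_output` helper name and the `"@context"`-based JSON-LD check are illustrative assumptions, not part of the existing SDK:

```python
import json
from typing import Any, Optional, Type

from pydantic import BaseModel, ValidationError


def check_structured_output(
    output: str,
    schema: Optional[Type[BaseModel]] = None,
    require_json_ld: bool = False,
) -> bool:
    """Return True when `output` parses as JSON and, optionally,
    validates against a Pydantic schema / looks like a JSON-LD document."""
    try:
        parsed: Any = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False

    # Very loose JSON-LD heuristic: top-level documents normally carry "@context".
    if require_json_ld and not (isinstance(parsed, dict) and "@context" in parsed):
        return False

    if schema is not None:
        try:
            schema.model_validate(parsed)  # pydantic v2 validation
        except ValidationError:
            return False

    return True
```

Example usage:

```python
class Person(BaseModel):
    name: str
    age: int

check_structured_output('{"name": "Ada", "age": 36}', schema=Person)  # True
check_structured_output("not json at all")                            # False
```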
Example of an existing judge metric (Hallucination) is defined here:
- Docs: https://www.comet.com/docs/opik/evaluation/metrics/hallucination
- Docs Code: https://github.com/comet-ml/opik/blob/main/apps/opik-documentation/documentation/fern/docs/evaluation/metrics/hallucination.mdx
- Python SDK: https://github.com/comet-ml/opik/tree/main/sdks/python/src/opik/evaluation/metrics/llm_judges/hallucination
- Python Examples: https://github.com/comet-ml/opik/blob/main/sdks/python/examples/metrics.py
- Frontend: https://github.com/comet-ml/opik/blob/main/apps/opik-frontend/src/constants/llm.ts
Ideally this is implemented as an LLM-as-a-judge metric, but it could also use plain heuristics by extending the existing regex/heuristic metric. The expectation is that the new judge is added both to the frontend, so it can be used for LLM-as-a-judge scoring from the UI (Online Evaluation tab), and to the Python SDK. The appropriate docs need to be updated, and a video of the metric working should be attached.
The return value should be a boolean (True/False), and if the metric uses an LLM-as-a-judge evaluation it should also contain a "reason".
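For reference, a minimal sketch of how the heuristic variant could plug into the Python SDK, assuming the `base_metric.BaseMetric` / `score_result.ScoreResult` pattern used by the existing metrics (see the hallucination links above); the class name, constructor arguments, and reason strings are illustrative assumptions:

```python
import json
from typing import Any, Optional, Type

import pydantic
from opik.evaluation.metrics import base_metric, score_result


class StructuredOutputCompliance(base_metric.BaseMetric):
    """Heuristic sketch: compliant outputs score 1.0 (True), non-compliant 0.0 (False),
    with a short reason attached when the check fails."""

    def __init__(
        self,
        name: str = "structured_output_compliance_metric",
        schema: Optional[Type[pydantic.BaseModel]] = None,
    ):
        super().__init__(name=name)
        self._schema = schema

    def score(self, output: str, **ignored_kwargs: Any) -> score_result.ScoreResult:
        try:
            parsed = json.loads(output)
            if self._schema is not None:
                self._schema.model_validate(parsed)
            return score_result.ScoreResult(
                name=self.name,
                value=1.0,  # compliance expressed as 1.0/0.0 to fit the existing score format
                reason="Output is valid JSON and matches the schema.",
            )
        except (json.JSONDecodeError, pydantic.ValidationError) as exc:
            return score_result.ScoreResult(
                name=self.name,
                value=0.0,
                reason=f"Output is not compliant: {exc}",
            )
```

The LLM-as-a-judge variant would follow the same return shape, with the "reason" produced by the judge model instead of the exception message.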
Motivation
I would like to see a more robust set of metrics and evaluations based on recent research. We also know that structured data compliance is critical for a number of use cases.