Open
Description
Hi MLCommons team,
I am looking to evaluate some HuggingFace models on this demo benchmark, but I am not sure how to properly evaluate the responses. Is there documentation I missed that explains how to run the benchmark and obtain safety numbers for any given HF dataset?
Thank you!