How do we obtain evaluation metrics for this demo subset?

Hi MLCommons team,

I would like to evaluate some Hugging Face models on [this demo benchmark](https://github.com/mlcommons/ailuminate/blob/main/airr_official_1.0_practice_prompt_set_release_public_subset.csv), but I am not sure how to properly evaluate the responses. Is there documentation I missed that explains how to run the benchmark and obtain safety numbers for a given HF dataset?

Thank you!
