

github-actions[bot] edited this page Dec 13, 2024 · 7 revisions

prompt flow evaluator

Models in this category


  • Bleu-Score-Evaluator

    | | |
    | -- | -- |
    | Score range | Float [0-1]: higher means better quality. |
    | What is this metric? | BLEU (Bilingual Evaluation Understudy) score is commonly used in natural language processing (NLP) and machine translation. It measures how closely the generated text matches the reference text. ... |
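To make the mechanics concrete, here is a minimal sentence-level BLEU sketch in pure Python: clipped n-gram precisions combined by geometric mean, times a brevity penalty. It assumes a single reference and whitespace tokenization, and omits the smoothing a production implementation would normally add:

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """Count the n-grams of the given order in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions (orders 1..max_n) times a brevity penalty.
    Single reference, whitespace tokenization, no smoothing."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_counts, r_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        precisions.append(overlap / max(sum(c_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    geo_mean = exp(sum(log(p) for p in precisions) / max_n)
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) >= len(ref) else exp(1 - len(ref) / len(cand))
    return bp * geo_mean
```

An exact match scores 1.0; a candidate sharing no n-grams with the reference scores 0.0.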

  • Coherence-Evaluator

    | | |
    | -- | -- |
    | Score range | Integer [1-5]: 1 is the lowest quality and 5 is the highest quality. |
    | What is this metric? | Measures how well the language model can produce output that flows smoothly, reads naturally, and resembles human-like language. |
    | How does it work? | The coherence... |

  • Content-Safety-Evaluator

    | | |
    | -- | -- |
    | Score range | Integer [0-7]: where 0 is the least harmful and 7 is the most harmful. A text label is also provided. |
    | What is this metric? | Measures comprehensively the severity level of the content harm of a response, covering violence, sexual, self-harm, and hate and u... |
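The 0-7 severity score comes with a text label. A small sketch of the score-to-label mapping, assuming two-point bins (Very low 0-1, Low 2-3, Medium 4-5, High 6-7; only the first bin is stated explicitly in the severity-scale sections of this page):

```python
def severity_label(score: int) -> str:
    """Map a 0-7 harm severity score to its text label.
    Bin boundaries beyond "Very low" (0-1) are an assumption
    following the same two-point pattern."""
    if not 0 <= score <= 7:
        raise ValueError("severity score must be in [0, 7]")
    labels = ["Very low", "Very low", "Low", "Low",
              "Medium", "Medium", "High", "High"]
    return labels[score]
```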

  • ECI-Evaluator

    Definition

Election Critical Information (ECI) refers to any content related to elections, including voting processes, candidate information, and election results. The ECI evaluator uses the Azure AI Safety Evaluation service to assess the generated responses for ECI without a disclaimer.

#...

  • F1Score-Evaluator

    | | |
    | -- | -- |
    | Score range | Float [0-1]: higher means better quality. |
    | What is this metric? | F1 score measures the similarity by shared tokens between the generated text and the ground truth, focusing on both precision and recall. |
    | How does it work? | The F1-score computes the ratio... |
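The ratio the entry describes can be sketched directly: precision and recall over tokens shared between the response and the ground truth, combined into F1. Whitespace tokenization is an assumption here; real implementations usually normalize case and punctuation as well:

```python
from collections import Counter

def f1_score(response, ground_truth):
    """Token-level F1 between a response and its ground truth."""
    r_tokens = response.lower().split()
    g_tokens = ground_truth.lower().split()
    # Multiset intersection counts each shared token at most
    # as often as it appears on the rarer side.
    num_same = sum((Counter(r_tokens) & Counter(g_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(r_tokens)
    recall = num_same / len(g_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, a response that repeats a fragment of the ground truth verbatim gets perfect precision but low recall, and F1 balances the two.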

  • Fluency-Evaluator

    | | |
    | -- | -- |
    | Score range | Integer [1-5]: 1 is the lowest quality and 5 is the highest quality. |
    | What is this metric? | Fluency measures the effectiveness and clarity of written communication, focusing on grammatical accuracy, vocabulary range, sentence complexity, coherence, and overa... |

  • Gleu-Score-Evaluator

    | | |
    | -- | -- |
    | Score range | Float [0-1]: higher means better quality. |
    | What is this metric? | The GLEU (Google-BLEU) score measures the similarity by shared n-grams between the generated text and ground truth, similar to the BLEU score, focusing on both precision and recall. But it addre... |
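A minimal GLEU sketch: pool n-gram counts for orders 1 through 4 on both sides, then take the minimum of precision and recall. Taking the minimum is how GLEU addresses BLEU's precision-only bias on single sentences. Whitespace tokenization is assumed:

```python
from collections import Counter

def ngram_counts(tokens, max_n=4):
    """Pool counts of all n-grams of orders 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def gleu_score(candidate, reference, max_n=4):
    """Sentence-level GLEU: min of pooled n-gram precision and recall."""
    c = ngram_counts(candidate.split(), max_n)
    r = ngram_counts(reference.split(), max_n)
    overlap = sum((c & r).values())
    precision = overlap / max(sum(c.values()), 1)
    recall = overlap / max(sum(r.values()), 1)
    return min(precision, recall)
```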

  • Groundedness-Evaluator

    | | |
    | -- | -- |
    | Score range | Integer [1-5]: 1 is the lowest quality and 5 is the highest quality. |
    | What is this metric? | Groundedness measures how well the generated response aligns with the given context in a retrieval-augmented generation scenario, focusing on its relevance and accura... |

  • Groundedness-Pro-Evaluator

    | | |
    | -- | -- |
    | Score range | Boolean [true, false]: false if the response is ungrounded and true if it is grounded. |
    | What is this metric? | Groundedness Pro (powered by Azure AI Content Safety) detects whether the generated text response is consistent or accurate with respect to the given ... |

  • Hate-and-Unfairness-Evaluator

    Definition

Hateful and unfair content refers to any language pertaining to hate toward or unfair representations of individuals and social groups along factors including but not limited to race, ethnicity, nationality, gender, sexual orientation, religion, immigration status, ability, persona...

  • Indirect-Attack-Evaluator

    Definition

Indirect attacks, also known as cross-domain prompt injection attacks (XPIA), occur when jailbreak attacks are injected into the context of a document or source, which may result in altered, unexpected behavior.

Indirect attack evaluations are broken down into three subcategories: ...

  • Meteor-Score-Evaluator

    | | |
    | -- | -- |
    | Score range | Float [0-1]: higher means better quality. |
    | What is this metric? | METEOR score measures the similarity by shared n-grams between the generated text and the ground truth, similar to the BLEU score, focusing on precision and recall. But it addresses limitations ... |
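The precision/recall combination at METEOR's core can be sketched as a recall-weighted harmonic mean (weighted 9:1 toward recall). This sketch deliberately omits METEOR's stemming, synonym matching, and fragmentation penalty, so it is an illustration of the weighting only:

```python
from collections import Counter

def meteor_fmean(response, ground_truth):
    """Recall-weighted harmonic mean at the core of METEOR.
    Exact unigram matches only; no stemming, synonyms, or
    fragmentation penalty."""
    r = response.lower().split()
    g = ground_truth.lower().split()
    matches = sum((Counter(r) & Counter(g)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(r)
    recall = matches / len(g)
    # Weight recall 9x more heavily than precision.
    return 10 * precision * recall / (recall + 9 * precision)
```

Because recall dominates, a short response that misses ground-truth content is penalized more than a longer one containing extra words.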

  • Protected-Material-Evaluator

    Definition

Protected material is any text that is under copyright, including song lyrics, recipes, and articles. Protected material evaluation leverages the Azure AI Content Safety Protected Material for Text service to perform the classification.

Labeling

Protected Material evaluations ...

  • QA-Evaluator

    | | |
    | -- | -- |
    | Score range | Float [0-1] for the F1 score evaluator: the higher, the more similar the response is to the ground truth. Integer [1-5] for the AI-assisted quality evaluators for question-and-answering (QA) scenarios: 1 is bad and 5 is good. |
    | What is this metric? | Measures compr... |

  • Relevance-Evaluator

    | | |
    | -- | -- |
    | Score range | Integer [1-5]: 1 is the lowest quality and 5 is the highest quality. |
    | What is this metric? | Relevance measures how effectively a response addresses a query, assessing how pertinent, accurate, and complete the response is with respect to the question. |

  • Retrieval-Evaluator

    | | |
    | -- | -- |
    | Score range | Integer [1-5]: 1 is the lowest quality and 5 is the highest quality. |
    | What is this metric? | Retrieval measures the quality of search without ground truth. It focuses on how relevant the context chunks (encoded as a string) are to address a query and how the ... |

  • Rouge-Score-Evaluator

    | | |
    | -- | -- |
    | Score range | Float [0-1]: higher means better quality. |
    | What is this metric? | ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate automatic summarization and machine translation. It measures the overlap between generated text and... |
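As an illustration, here is ROUGE-1, the unigram member of the ROUGE family, computed as overlap-based recall, precision, and F1. Whitespace tokenization is assumed; the full family also defines ROUGE-2 (bigrams) and ROUGE-L (longest common subsequence):

```python
from collections import Counter

def rouge_1(response, reference):
    """ROUGE-1: unigram overlap reported as precision, recall, and F1."""
    r = Counter(response.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((r & ref).values())
    precision = overlap / max(sum(r.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

ROUGE is recall-oriented: for summarization, the recall component asks how much of the reference summary the generated text covers.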

  • Self-Harm-Related-Content-Evaluator

    Definition

Self-harm-related content includes language pertaining to actions intended to hurt, injure, or damage one's body or kill oneself.

Severity scale

Safety evaluations annotate self-harm-related content using a 0-7 scale.

Very Low (0-1) refers to

  • Content that contains self-...

  • Sexual-Content-Evaluator

    Definition

Sexual content includes language pertaining to anatomical organs and genitals, romantic relationships, acts portrayed in erotic terms, pregnancy, physical sexual acts (including assault or sexual violence), prostitution, pornography, and sexual abuse.

Severity scale

Safety eva...

  • Similarity-Evaluator

    | | |
    | -- | -- |
    | Score range | Integer [1-5]: 1 is the lowest quality and 5 is the highest quality. |
    | What is this metric? | Similarity measures the degree of similarity between the generated text and its ground truth with respect to a query. |
    | How does it work? | The similarity metric i... |

  • Violent-Content-Evaluator

    Definition

Violent content includes language pertaining to physical actions intended to hurt, injure, damage, or kill someone or something. It also includes descriptions of weapons and guns (and related entities such as manufacturers and associations).

Severity scale

Safety evaluations ...
