# LlmAsJudgeEvals

This library provides a service for evaluating responses from Large Language Models (LLMs) using the LLM itself as a judge. It leverages Semantic Kernel to define and execute evaluation functions based on prompt templates.

**For a more precise evaluation score, the library uses `logprobs` and calculates a probability-weighted total across the candidate score values for each evaluation criterion.**
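
As an illustration of that scoring idea (a standalone sketch, not the library's internal implementation), the example below converts hypothetical `logprobs` for the judge's score token into a probability-weighted score; all names and values in it are made up for the illustration:

```csharp
// Illustrative sketch only: the values and names below are hypothetical, not part of the library.
using System;
using System.Collections.Generic;
using System.Linq;

class WeightedScoreSketch
{
    static void Main()
    {
        // Hypothetical logprobs for the candidate score tokens "1" through "5".
        var logprobs = new Dictionary<int, double>
        {
            [1] = -4.2, [2] = -3.1, [3] = -1.9, [4] = -0.7, [5] = -1.6
        };

        // Convert logprobs to probabilities and normalize over the candidates.
        var probs = logprobs.ToDictionary(kv => kv.Key, kv => Math.Exp(kv.Value));
        var total = probs.Values.Sum();

        // Probability-weighted total: sum of score * probability, which gives a smoother
        // value than simply taking the single most likely score token.
        var weightedScore = probs.Sum(kv => kv.Key * kv.Value / total);

        Console.WriteLine($"Weighted score: {weightedScore:F2}");
    }
}
```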
## Installation

Install the package via NuGet:

```
Install-Package HillPhelmuth.SemanticKernel.LlmAsJudgeEvals
```
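The same package can also be added with the standard .NET CLI command:

```
dotnet add package HillPhelmuth.SemanticKernel.LlmAsJudgeEvals
```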
## Usage

### Built-in Evaluation Functions
The package includes a set of built-in evaluation functions, each focusing on a specific aspect of LLM output quality:

* **Coherence:** Evaluates the logical flow and consistency of the response.
* **Empathy:** Assesses the level of empathy and understanding conveyed in the response.
* **Fluency:** Measures the smoothness and naturalness of the language used.
* **GptGroundedness:** Determines how well the response is grounded in factual information.
* **GptGroundedness2:** An alternative approach to evaluating groundedness.
* **GptSimilarity:** Compares the response to a reference text or objectively correct answer for similarity.
* **Helpfulness:** Assesses the degree to which the response is helpful and informative.
* **PerceivedIntelligence:** Evaluates the perceived intelligence and knowledge reflected in the response.
* **PerceivedIntelligenceNonRag:** A variant of PerceivedIntelligence tailored for non-Retrieval Augmented Generation (RAG) models.
* **Relevance:** Measures the relevance of the response to the given prompt or question and, for RAG, to a reference text.

The example below runs the built-in **Coherence** evaluation:

```csharp
// Initialize the Semantic Kernel with an OpenAI chat completion service
var kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion("openai-model-name", "openai-apiKey")
    .Build();

// Create an instance of the EvalService
var evalService = new EvalService(kernel);

// Create an input model for the built-in Coherence evaluation:
// the answer to evaluate and the question or prompt that produced it
var coherenceInput = InputModel.CoherenceModel(
    "This is the answer to evaluate.",
    "This is the question or prompt that generated the answer");

// Execute the evaluation
var result = await evalService.ExecuteEval(coherenceInput);

Console.WriteLine($"Evaluation score: {result.Score}");
```

### Custom Evaluation Functions

You can also register your own evaluation function from a prompt and `PromptExecutionSettings`, then execute it by name:

```csharp
// Initialize the Semantic Kernel with an OpenAI chat completion service
var kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion("openai-model-name", "openai-apiKey")
    .Build();

// Create an instance of the EvalService
var evalService = new EvalService(kernel);

// Add a custom evaluation function (optional)
evalService.AddEvalFunction("MyEvalFunction", "This is the prompt for my evaluation function.", new PromptExecutionSettings());

// Create an input model for the custom evaluation function
var inputModel = new InputModel
{
    FunctionName = "MyEvalFunction", // Replace with the name of your evaluation function
    RequiredInputs = new Dictionary<string, string>
    {
        { "input", "This is the text to evaluate." }
    }
};

// Execute the evaluation
var result = await evalService.ExecuteEval(inputModel);

Console.WriteLine($"Evaluation score: {result.Score}");
```

## Features

* **Define evaluation functions using prompt templates:** Evaluation functions can be defined with prompt templates written in YAML.
* **Execute evaluations:** The `EvalService` provides methods for executing evaluations on input data.
* **Aggregate results:** The `EvalService` can aggregate evaluation scores across multiple inputs (see the sketch below).
* **Built-in evaluation functions:** The package includes a set of pre-defined evaluation functions based on common evaluation metrics.
* **Logprobs-based scoring:** Leverages `logprobs` for a more granular and precise evaluation score.
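
As a minimal sketch of aggregating results (reusing `evalService` and `InputModel.CoherenceModel` from the usage examples above, and averaging by hand rather than calling any particular built-in aggregation method), scores from several evaluations can be combined like this:

```csharp
// Minimal aggregation sketch: run several evaluations and average the scores by hand.
// Assumes the evalService from the examples above and that result.Score is numeric (e.g., double);
// it does not use any built-in aggregation method of EvalService.
var inputs = new[]
{
    InputModel.CoherenceModel("First answer to evaluate.", "First question or prompt"),
    InputModel.CoherenceModel("Second answer to evaluate.", "Second question or prompt")
};

double totalScore = 0;
foreach (var input in inputs)
{
    var evalResult = await evalService.ExecuteEval(input);
    totalScore += evalResult.Score;
}

Console.WriteLine($"Average score: {totalScore / inputs.Length:F2}");
```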