This project implements the evaluation of agents on the ML-Bench dataset using OpenHands. ML-Bench is a comprehensive benchmark designed to assess the effectiveness of Large Language Models (LLMs) in leveraging existing functions in open-source libraries for machine learning tasks. The benchmark consists of 10,040 samples spanning 130 tasks over 14 notable machine learning GitHub repositories.
The ML-Bench task presents a scenario where, given a GitHub repository, the language model has access to all files within the repository. Upon receiving a user instruction with a set of specific parameters, the agent is required to write code that invokes models or functions from the GitHub repository. The generated code must align with the user's instruction, particularly in terms of the specified parameters, and must be executable.
The task introduces new challenges for LLMs, such as comprehending long and language-code interleaved documents, understanding complex cross-file code structures, and effectively navigating the codebase to locate relevant information. ML-Bench serves as a critical tool for assessing the efficiency and adaptability of various methods in real-world scenarios.
For more details on the ML-Bench task and dataset, please refer to the paper: ML-Bench: Evaluating Large Language Models for Code Generation in Repository-Level Machine Learning Tasks.
Please follow the instructions here to set up your local development environment and LLM.
To run the evaluation on the ML-Bench dataset, use the following command:
```bash
./evaluation/ml_bench/scripts/run_infer.sh [model_config] [git-version] [split] [agent] [eval_limit]
# e.g., ./evaluation/ml_bench/scripts/run_infer.sh eval_gpt4_1106_preview 0.6.2 full CodeActAgent 10
```
You can replace `eval_gpt4_1106_preview` with any model you set up in `config.toml`.
To score the evaluation output, use the following command:
```bash
./evaluation/ml_bench/scripts/summarise_results.py [eval_output_dir]
# e.g., ./evaluation/ml_bench/scripts/summarise_results.py evaluation/evaluation_outputs/outputs/ml_bench/CodeActAgent/gpt-4-1106-preview_maxiter_10_N_v1.5
```
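If you only need a quick look at the aggregate numbers, a minimal sketch like the one below can be run directly against the `output.jsonl` file in your `eval_output_dir`. This is not the actual `summarise_results.py`; it is an illustration that assumes the output format shown later in this document (in particular the `metrics.success` field).

```python
import json
import sys

def summarise(output_jsonl: str) -> None:
    """Print a quick success-rate summary from an ML-Bench output.jsonl file."""
    total, succeeded = 0, 0
    with open(output_jsonl) as f:
        for line in f:
            instance = json.loads(line)
            total += 1
            # `metrics.success` is 1 for a successful instance (see the example output below).
            succeeded += int(instance.get("metrics", {}).get("success", 0))
    if total:
        print(f"{succeeded}/{total} instances succeeded ({succeeded / total:.1%})")

if __name__ == "__main__":
    summarise(sys.argv[1])
```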
To run error analysis on the ML-Bench dataset, use the following command:
```bash
./evaluation/ml_bench/scripts/run_analysis.sh [eval_output_dir] [model_config]
# e.g., ./evaluation/ml_bench/scripts/run_analysis.sh evaluation/evaluation_outputs/outputs/ml_bench/CodeActAgent/gpt-4-1106-preview_maxiter_10_N_v1.5/output.jsonl eval_gpt4_1106_preview
```
This command generates a report on the evaluation output and provides insights into the agent's performance.
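Before diving into the full report, it can help to tally instances by `eval_exit_code`; exit code 124, for example, indicates the run was terminated by the timeout (see the note in the example output below). The following sketch is not part of the official analysis scripts; it simply assumes the `output.jsonl` format shown below.

```python
import json
import sys
from collections import Counter

# Tally eval_exit_code values across an output.jsonl file (format shown below).
exit_codes = Counter()
with open(sys.argv[1]) as f:
    for line in f:
        exit_codes[json.loads(line).get("eval_exit_code")] += 1

for code, count in exit_codes.most_common():
    print(f"eval_exit_code={code}: {count} instance(s)")
```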
For each task in the ML-Bench dataset, OpenHands provides the agent with a set number of iterations to complete the task. The `history` field in the evaluation output shows each iteration's response and the actions taken by the agent to complete the task.
Here's an example of the evaluation output for a single task instance:
```json
{
  "instance_id": 3,
  "repo": "https://github.com/dmlc/dgl",
  "instruction": "Please complete the Machine Learning task in the following repository: dgl\n\nThe task is: DGL Implementation of NGCF model\n\nI have a deep desire to embark on a journey brimming with knowledge and expertise. My objective is to train a cutting-edge NGCF Model, known for its unparalleled capabilities, on the illustrious dataset known as gowalla. To ensure swift execution, I kindly request your assistance in crafting the code, making use of the powerful GPU #3 and an embedding size of 32. Can you lend a helping hand to transform this dream into a reality?\n\nYou should create a script named `run.sh` under the specified path in the repo to run the task.\n\nYou can find the task repo at: /workspace/dgl/examples/pytorch/NGCF/NGCF\n\nYou should terminate the subprocess after running the task (e.g., call subprocess.Popen(args).wait()).When you think you have completed the task, please run the following command: <execute_bash> exit </execute_bash>.\n",
  "metadata": {
    "agent_class": "CodeActAgent",
    "model_name": "gpt-4-1106-preview",
    "max_iterations": 10,
    "eval_output_dir": "evaluation/evaluation_outputs/outputs/ml_bench/CodeActAgent/gpt-4-1106-preview_maxiter_10_N_v1.5",
    "start_time": "2024-05-26 17:39:59",
    "git_commit": "dd8ee9044a94a213dc2e31d2085dbf2924ee80a1"
  },
  "history": [
    [
      {
        "id": 0,
        "timestamp": "2024-05-26T17:40:41.060009",
        "source": "user",
        "message": "Please complete the Machine Learning task in the following repository: dgl\n\nThe task is: DGL Implementation of NGCF model\n\nI have a deep desire to embark on a journey brimming with knowledge and expertise. My objective is to train a cutting-edge NGCF Model, known for its unparalleled capabilities, on the illustrious dataset known as gowalla. To ensure swift execution, I kindly request your assistance in crafting the code, making use of the powerful GPU #3 and an embedding size of 32. Can you lend a helping hand to transform this dream into a reality?\n\nYou should create a script named `run.sh` under the specified path in the repo to run the task.\n\nYou can find the task repo at: /workspace/dgl/examples/pytorch/NGCF/NGCF\n\nYou should terminate the subprocess after running the task (e.g., call subprocess.Popen(args).wait()).When you think you have completed the task, please run the following command: <execute_bash> exit </execute_bash>.\n",
        "action": "message",
        "args": {
          "content": "Please complete the Machine Learning task in the following repository: dgl\n\nThe task is: DGL Implementation of NGCF model\n\nI have a deep desire to embark on a journey brimming with knowledge and expertise. My objective is to train a cutting-edge NGCF Model, known for its unparalleled capabilities, on the illustrious dataset known as gowalla. To ensure swift execution, I kindly request your assistance in crafting the code, making use of the powerful GPU #3 and an embedding size of 32. Can you lend a helping hand to transform this dream into a reality?\n\nYou should create a script named `run.sh` under the specified path in the repo to run the task.\n\nYou can find the task repo at: /workspace/dgl/examples/pytorch/NGCF/NGCF\n\nYou should terminate the subprocess after running the task (e.g., call subprocess.Popen(args).wait()).When you think you have completed the task, please run the following command: <execute_bash> exit </execute_bash>.\n",
          "wait_for_response": false
        }
      },
      {
        "message": "No observation",
        "observation": "null",
        "content": "",
        "extras": {}
      }
    ],
    // ... more iterations
  ],
  "eval_exit_code": 124, // ML-Bench believes the agent is successful if it continues to run until timeout
  "eval_output": "",
  "eval_script": "pip install Matplotlib==2.2.2\r\ncd /workspace/dgl/examples/pytorch/dgmg\r\npython main.py",
  "metrics": {
    "success": 1
  }
}
```
The `history` field contains the agent's actions and observations at each iteration, including the commands executed, file edits, and the agent's thoughts.
The `eval_exit_code` and `eval_output` fields provide information about the execution of the evaluation command and its output.
The `metrics` field contains the parsed evaluation metrics from the `eval_output`.
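To see how these fields fit together, the short sketch below loads the first instance from `output.jsonl` and walks its `history`. It assumes the output format shown above; the path is taken from the example metadata and should be replaced with your own `eval_output_dir`.

```python
import json

# Path taken from the example above; replace it with your own eval_output_dir.
path = "evaluation/evaluation_outputs/outputs/ml_bench/CodeActAgent/gpt-4-1106-preview_maxiter_10_N_v1.5/output.jsonl"

with open(path) as f:
    instance = json.loads(f.readline())  # first task instance in the file

print("instance_id:", instance["instance_id"])
print("eval_exit_code:", instance["eval_exit_code"])  # 124 means the run hit the timeout
print("success:", instance["metrics"]["success"])

# Each history entry in the example above is an (action, observation) pair for one iteration.
for action, observation in instance["history"]:
    print(action.get("action"), "->", observation.get("observation"))
```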
You can customize the evaluation script by modifying the `evaluation/ml_bench/run_infer.py` file. This script handles loading the ML-Bench dataset, running the agent on each task instance, and saving the evaluation outputs.
Feel free to adjust the configuration, logging, and output formatting to suit your needs.
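If you do modify it, the overall shape of the flow is roughly the loop below. This is only a schematic under assumptions, not `run_infer.py`'s actual API: `run_agent` is a hypothetical stand-in for however the script drives the OpenHands agent, and the record it returns is whatever you choose to serialize.

```python
import json
from typing import Callable, Iterable, Optional

def run_evaluation(
    instances: Iterable[dict],
    run_agent: Callable[[dict], dict],  # hypothetical stand-in for the OpenHands agent invocation
    output_path: str,
    eval_limit: Optional[int] = None,
) -> None:
    """Schematic evaluation loop: run the agent on each task instance and
    append one JSON record per instance to the output file."""
    with open(output_path, "w") as out:
        for i, instance in enumerate(instances):
            if eval_limit is not None and i >= eval_limit:
                break
            record = run_agent(instance)
            out.write(json.dumps(record) + "\n")
```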
If you encounter any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.
This project is licensed under the MIT License.