Arch-Function is a research and development initiative focused on state-of-the-art function calling in large language models. Our goal is to build models that can understand, interpret, and execute complex function calls accurately and reliably.
The project encompasses multiple model families engineered for function calling tasks: they are designed to understand complex function signatures, identify required parameters, and produce accurate function call outputs from natural language prompts. The current release includes three collections, each available in multiple sizes, with additional models planned for future releases.
- [2025-06]: Arch-Agent collection released for advanced multi-turn, multi-step workflow automation, achieving Top-3 performance on the BFCL Leaderboard!
- [2025-02]: Arch-Function-Chat collection launched with conversational function calling capabilities!
- [2024-12]: Complete model suite updated with the latest improvements across all sizes of the Arch-Function collection!
- [2024-09]: Arch-Function collection officially launched on Hugging Face, achieving Top-7 performance on the BFCL Leaderboard!
Hugging Face Collection: Arch-Function
Model Name | Size | Key Features | Downloads |
---|---|---|---|
Arch-Function-1.5B | 1.5B | • Compact size for edge deployment • Efficient function calling • Low resource requirements | 🤗 HuggingFace |
Arch-Function-3B | 3B | • Balanced performance and efficiency • High accuracy function calling • Production-ready | 🤗 HuggingFace |
Arch-Function-7B | 7B | • Maximum performance • Complex function handling • Enterprise-grade capabilities | 🤗 HuggingFace |
Hugging Face Collection: Arch-Function-Chat
Model Name | Size | Key Features | Downloads |
---|---|---|---|
Arch-Function-Chat-1.5B | 1.5B | • Conversational function calling • Interactive agent capabilities • Lightweight deployment | 🤗 HuggingFace |
Arch-Function-Chat-3B | 3B | • Advanced dialogue management • Context-aware function usage • Multi-turn conversations | 🤗 HuggingFace |
Arch-Function-Chat-7B | 7B | • Sophisticated reasoning • Complex multi-step workflows • Premium chat experience | 🤗 HuggingFace |
Hugging Face Collection: Arch-Agent
Model Name | Size | Key Features | Downloads |
---|---|---|---|
Arch-Agent-1.5B | 1.5B | • Lightweight autonomous workflows • Edge-optimized performance • Low resource requirements | 🤗 HuggingFace |
Arch-Agent-3B | 3B | • Balanced autonomous performance • Multi-step task execution • High accuracy workflows | 🤗 HuggingFace |
Arch-Agent-7B | 7B | • Advanced autonomous behavior • Complex workflow orchestration • Maximum performance | 🤗 HuggingFace |
Arch-Agent-32B | 32B | • Premium autonomous systems • Sophisticated multi-step workflows • Superior capabilities | 🤗 HuggingFace |
Here we provide a recipe for fine-tuning Arch-Function models with LLaMA-Factory:
- Create the environment following the LLaMA-Factory installation instructions.
- If you would like to use DeepSpeed and FlashAttention, install the packages with the following commands:
pip install deepspeed
pip install flash-attn --no-build-isolation
LLaMA-Factory supports datasets in the `alpaca` and `sharegpt` formats. We recommend the `sharegpt` format for function calling tasks. Below is an example of a dataset in the `sharegpt` format:
[
{
"conversations": [
{
"from": "human",
"value": "user instruction"
},
{
"from": "function_call",
"value": "tool arguments"
},
{
"from": "observation",
"value": "tool result"
},
{
"from": "gpt",
"value": "model response"
}
],
"system": "system prompt (optional)",
"tools": "tool description (optional)"
}
]
Next, update `data/dataset_info.json` with the dataset description below:
"dataset_name": {
"file_name": "data.json",
"formatting": "sharegpt",
"columns": {
"messages": "conversations",
"system": "system",
"tools": "tools"
}
}
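For illustration, the sketch below builds a single record in this format and writes it to `data.json` (the file referenced by the dataset description above). The conversation content and tool definition are hypothetical and only meant to show the expected structure.

```python
import json

# Hypothetical example record in the sharegpt format described above.
record = {
    "conversations": [
        {"from": "human", "value": "What is the weather in Seattle?"},
        {
            "from": "function_call",
            "value": json.dumps(
                {"name": "get_weather", "arguments": {"location": "Seattle, WA"}}
            ),
        },
        {
            "from": "observation",
            "value": json.dumps({"temperature": 18, "unit": "celsius"}),
        },
        {"from": "gpt", "value": "It is currently 18°C in Seattle."},
    ],
    "system": "You are a helpful assistant that can call tools when needed.",
    "tools": json.dumps(
        [
            {
                "name": "get_weather",
                "description": "Get the current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {"location": {"type": "str"}},
                    "required": ["location"],
                },
            }
        ]
    ),
}

# Write the dataset file registered in data/dataset_info.json.
with open("data.json", "w", encoding="utf-8") as f:
    json.dump([record], f, ensure_ascii=False, indent=2)
```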
LLaMA-Factory provides diverse training examples for LLMs under its `examples` directory. You can follow these examples and create a training script for your purpose. To kick off training, run the following command:
CUDA_VISIBLE_DEVICES={YOUR_DEVICE_IDS} llamafactory-cli train {PATH_TO_YOUR_TRAINING_SCRIPT}
Key considerations for fine-tuning:
- Prepare high-quality function calling examples in the proper format (a validation sketch follows this list)
- Use gradient accumulation for larger effective batch sizes
- Monitor validation loss to prevent overfitting
- Consider using LoRA for parameter-efficient fine-tuning
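As a sanity check on the first point, here is a minimal validation sketch. It assumes the sharegpt-format `data.json` shown above and only verifies that every `function_call` turn carries parseable JSON with `name` and `arguments` keys; adapt it to your own dataset.

```python
import json

def validate_function_calls(path: str = "data.json") -> None:
    """Check that each function_call turn holds valid JSON with the expected keys."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    for idx, record in enumerate(records):
        for turn in record["conversations"]:
            if turn["from"] != "function_call":
                continue
            call = json.loads(turn["value"])  # raises ValueError on malformed JSON
            missing = {"name", "arguments"} - call.keys()
            if missing:
                raise ValueError(f"record {idx}: function_call missing {missing}")

validate_function_calls()
```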
To run inference with Arch-Function models for function calling tasks, follow the steps below:
Arch-Function models are supported in the Hugging Face transformers library, and we advise you to install the latest version with the following command:
pip install "transformers>=4.51.0"
Below is a script demonstrating how to use Arch-Function models for function calling tasks. First, specify the desired model name and create the model and its corresponding tokenizer:
import json
from typing import Any, Dict, List
from transformers import AutoModelForCausalLM, AutoTokenizer
# Specify the desired model name here
model_name = "katanemo/Arch-Agent-7B"
model = AutoModelForCausalLM.from_pretrained(
model_name, device_map="auto", torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
Our models perform best when using the recommended prompt format, which can be found in the corresponding model cards on Hugging Face. You can run the following script to format prompts:
# Please use the recommended prompt for each model.
TASK_PROMPT = (
"You are a helpful assistant designed to assist with the user query by making one or more function calls if needed."
"\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\n"
"You are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{tool_text}"
"\n</tools>\n\nFor each function call, return a json object with function name and arguments within "
"""<tool_call></tool_call> XML tags:\n<tool_call>\n{{"name": <function-name>, """
""""arguments": <args-json-object>}}\n</tool_call>"""
)
# Define available tools
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "str",
"description": "The city and state, e.g. San Francisco, New York",
},
"unit": {
"type": "str",
"enum": ["celsius", "fahrenheit"],
"description": "The unit of temperature to return",
},
},
"required": ["location"],
},
},
}
]
# Helper function to create the system prompt for our model
def format_prompt(tools: List[Dict[str, Any]]):
tool_text = "\n".join(
[json.dumps(tool["function"], ensure_ascii=False) for tool in tools]
)
return TASK_PROMPT.format(tool_text=tool_text)
system_prompt = format_prompt(tools)
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": "What is the weather in Seattle?"},
]
Now, you can run the following script to do inference with Arch-Function models.
model_inputs = tokenizer.apply_chat_template(
messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=32768)
generated_ids = [
output_ids[len(input_ids) :]
for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
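Since the TASK_PROMPT above asks the model to wrap each call in `<tool_call></tool_call>` XML tags, you typically need to extract the JSON payload from `response` before executing anything. Below is a minimal parsing sketch, not an official API; it simply pulls out whatever well-formed calls the model emitted.

```python
import json
import re
from typing import Any, Dict, List

def parse_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Extract JSON objects wrapped in <tool_call></tool_call> tags."""
    calls = []
    for payload in re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL):
        try:
            calls.append(json.loads(payload))
        except json.JSONDecodeError:
            continue  # skip malformed calls instead of failing the whole response
    return calls

for call in parse_tool_calls(response):
    print(call["name"], call["arguments"])
```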
Inference optimization tips:
- Use appropriate temperature settings (0.0 - 0.1 for function calling)
- Use proper prompt formatting for best results
- Consider batching for multiple requests (see the sketch after this list)
- Use quantized models for faster inference
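Regarding batching, one straightforward approach with the `transformers` setup above is to render each conversation to text with the chat template and tokenize the prompts together with left padding. A minimal sketch, reusing `model`, `tokenizer`, and `system_prompt` from the earlier script; the example queries are hypothetical:

```python
# Hypothetical batch of user queries; each reuses the tool-aware system prompt.
queries = ["What is the weather in Seattle?", "What is the weather in Boston?"]
conversations = [
    [{"role": "system", "content": system_prompt}, {"role": "user", "content": q}]
    for q in queries
]

# Render prompts as text, then tokenize them as one left-padded batch.
prompts = [
    tokenizer.apply_chat_template(conv, add_generation_prompt=True, tokenize=False)
    for conv in conversations
]
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

outputs = model.generate(**batch, max_new_tokens=512)
# Strip the prompt tokens; with left padding all rows share the padded prompt length.
new_tokens = outputs[:, batch.input_ids.shape[1]:]
for text in tokenizer.batch_decode(new_tokens, skip_special_tokens=True):
    print(text)
```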
Below we show how to deploy Arch-Function models using popular model hosting frameworks.
vLLM provides high-throughput serving with advanced optimizations. Follow the steps below to deploy Arch-Function models with vLLM.
# Install vLLM
pip install vllm
vllm serve katanemo/Arch-Agent-7B \
--host 127.0.0.1 \
--port 8000 \
--tensor-parallel-size 1
To get responses from the vLLM server for function calling, first format the prompts as described in the inference section above. Then replace `messages` in the script below with the formatted prompts and run the script (an example of formatted messages follows the script).
from openai import OpenAI
# Point to the local server
client = OpenAI(
api_key="EMPTY",
base_url="http://127.0.0.1:8000/v1",
)
# Send requests and get responses from the server
completion = client.chat.completions.create(
model="katanemo/Arch-Agent-7B",
messages=[
{"role": "user", "content": "Get the current temperature in San Francisco"}
],
temperature=0.01,
max_tokens=1024
)
print(completion.choices[0].message.content)
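As an illustration, formatted `messages` would prepend the tool-aware system prompt built with `format_prompt(tools)` from the inference section above. This is a sketch only; adapt the tool list to your use case.

```python
# Sketch only: format_prompt and tools come from the inference section above.
messages = [
    {"role": "system", "content": format_prompt(tools)},
    {"role": "user", "content": "Get the current temperature in San Francisco"},
]
```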
ollama provides easy local deployment with automatic model management. Below we provide scripts showing how to use ollama for deployment.
Please see the ollama documentation for installation. If necessary, use the following command to install the ollama Python library:
pip install ollama
Specify your desired model name below and run the following command to start the ollama server. Note that ollama only supports the `gguf` format.
ollama run hf.co/katanemo/Arch-Agent-7B.gguf
Format the prompts as described in the inference section above, then replace the messages in the script below with the formatted prompts and run the script to get responses.
from ollama import Client
# Point to the local server. By default, it uses port 11434.
client = Client(host="http://127.0.0.1:11434")
# Send requests and get responses from the server
completion = client.chat(
model="hf.co/katanemo/Arch-Agent-1.5B.gguf",
messages=[
{"role": "user", "content": "Get the current temperature in San Francisco"}
],
options={"temperature": 0.01, "num_ctx": 1024}
)
print(completion.message.content)
SGLang offers structured generation capabilities with high performance. To use SGLang for deployment, follow the steps below.
# Install SGLang
pip install "sglang[all]"
python -m sglang.launch_server \
--model-path katanemo/Arch-Agent-7B \
--host 127.0.0.1 \
--port 8000 \
--tp 1 \
--trust-remote-code
As SGLang provides OpenAI-compatible APIs, you can get responses from the server in the same way as with vLLM. First format the prompts as described in the inference section above, then replace `messages` in the script below with the formatted prompts and run the script.
# Client code for the SGLang server (OpenAI-compatible)
from openai import OpenAI
# Point to the local server
client = OpenAI(
api_key="EMPTY",
base_url="http://127.0.0.1:8000/v1",
)
# Send requests and get responses from the server
completion = client.chat.completions.create(
model="katanemo/Arch-Agent-7B",
messages=[
{"role": "user", "content": "Get the current temperature in San Francisco"}
],
temperature=0.01,
max_tokens=1024
)
print(completion.choices[0].message.content)
The Arch-Function project is actively developing next-generation models that will:
- Further advance function calling accuracy beyond current SOTA
- Introduce novel architectures optimized for tool usage
- Expand to multimodal function calling capabilities
- Support more complex reasoning patterns in function selection
Please refer to the individual model pages on Hugging Face for specific licensing information.
We welcome contributions to improve the Arch-Function tutorials and documentation! You can help by:
- Fixing errors or improving existing tutorials
- Adding new deployment examples or use cases
- Suggesting additional framework integrations
- Improving documentation clarity
Feel free to open an issue or submit a pull request with your improvements.
For questions and support:
- Open an issue in this repository
- Visit our Hugging Face Hub
- Check the Katanemo organization on GitHub