
Commit ec910bd

Authored by DifferentialityDevelopment and Tim Pietrusky (RunPod)
feat: enable tool calling support (#25)
* Update worker-config.json
* Update engine.py
* Update README.md
* chore: added HF_TOKEN; use meta-llama/Llama-3.2-1B-Instruct for testing
* feat: added TOOL_CALL_PARSER

---------

Co-authored-by: Tim Pietrusky <[email protected]>
Co-authored-by: NERDDISCO <[email protected]>
Co-authored-by: Tim Pietrusky <[email protected]>
1 parent: 7467328

8 files changed, +76 −8 lines changed

.github/CONTRIBUTING.md

Lines changed: 33 additions & 4 deletions
````diff
@@ -27,14 +27,40 @@ Welcome! This guide explains how to develop and deploy the SGLang Worker for RunPod
 git clone <repo-url>
 cd worker-sglang
 
+# Create .env file for Hugging Face token (required for gated models)
+echo "HF_TOKEN=your_huggingface_token_here" > .env
+
 # Build locally for testing (optional - will be built in CI)
 docker build --platform linux/amd64 -t worker-sglang-local .
 
-# Test with docker-compose
+# Test with docker-compose (will automatically use .env file)
 docker-compose up
 ```
 
-### 3. Making Changes
+### 3. Environment Configuration
+
+The project uses a `.env` file for local development. Docker Compose automatically reads this file.
+
+**Required for local testing:**
+
+```bash
+# .env file (create in project root)
+HF_TOKEN=your_huggingface_token_here
+```
+
+**Getting your HF_TOKEN:**
+
+1. Go to [Hugging Face Settings](https://huggingface.co/settings/tokens)
+2. Create a new token with "Read" permissions
+3. Copy the token to your `.env` file
+
+**⚠️ Security Note:**
+
+- Never commit the `.env` file to git
+- The `.env` file is already in `.gitignore`
+- Use environment variables in production/CI
+
+### 4. Making Changes
 
 1. **Create feature branch:**
 
@@ -44,7 +70,7 @@ docker-compose up
 
 2. **Make your changes** to:
 
-   - Core files in `.runpod/` directory
+   - Core files in project root
    - Configuration files
    - Documentation
 
@@ -54,8 +80,11 @@ docker-compose up
 # Test Docker build
 docker build --platform linux/amd64 -t test-build .
 
-# Test with sample input
+# Test with sample input (ensure .env file exists first)
 docker run --rm test-build python3 -c "import handler; print('Import successful')"
+
+# Test with docker-compose (uses .env automatically)
+docker-compose up
 ```
 
 4. **Commit following conventions:**
````
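
As an aside (not part of the commit): before building the image, you can sanity-check that the `HF_TOKEN` in `.env` actually authenticates against the Hugging Face Hub. A minimal Python sketch, assuming the `python-dotenv` and `huggingface_hub` packages are installed:

```python
import os

from dotenv import load_dotenv
from huggingface_hub import whoami

load_dotenv()  # reads HF_TOKEN from the .env file in the project root
token = os.environ.get("HF_TOKEN")
if not token:
    raise SystemExit("HF_TOKEN missing - create the .env file described above")

# whoami() raises an HTTP error if the token is invalid or expired
info = whoami(token=token)
print(f"Token OK, authenticated as: {info['name']}")
```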

.gitignore

Lines changed: 2 additions & 1 deletion
```diff
@@ -1,2 +1,3 @@
-!*.png
+!*.png
+.env
 .DS_Store
```

.runpod/hub.json

Lines changed: 18 additions & 0 deletions
```diff
@@ -30,6 +30,24 @@
         "required": false
       }
     },
+    {
+      "key": "TOOL_CALL_PARSER",
+      "input": {
+        "name": "Tool Call Parser",
+        "type": "string",
+        "description": "Defines the parser used to interpret tool call responses",
+        "default": "",
+        "required": false,
+        "advanced": true,
+        "options": [
+          { "value": "llama3", "label": "llama3" },
+          { "value": "llama4", "label": "llama4" },
+          { "value": "mistral", "label": "mistral" },
+          { "value": "qwen25", "label": "qwen25" },
+          { "value": "deepseekv3", "label": "deepseekv3" }
+        ]
+      }
+    },
     {
       "key": "TOKENIZER_PATH",
       "input": {
```

README.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -51,6 +51,7 @@ All behaviour is controlled through environment variables:
 | `ENABLE_P2P_CHECK` | Enable P2P check for GPU access | false | boolean (true or false) |
 | `ENABLE_FLASHINFER_MLA` | Enable FlashInfer MLA optimization | false | boolean (true or false) |
 | `TRITON_ATTENTION_REDUCE_IN_FP32` | Cast Triton attention reduce op to FP32 | false | boolean (true or false) |
+| `TOOL_CALL_PARSER` | Defines the parser used to interpret responses | qwen25 | "llama3", "llama4", "mistral", "qwen25", "deepseekv3" |
 
 ## API Usage
```
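
The parser should match the tool-call format of the served model family; the commit itself pairs `meta-llama/Llama-3.2-1B-Instruct` with `TOOL_CALL_PARSER=llama3` in `docker-compose.yml`. A hypothetical helper (not in the repo) that picks one of the documented options from a model path might look like:

```python
# Hypothetical helper: map a model path to one of the documented
# TOOL_CALL_PARSER options. Substring keys are assumptions based on
# common Hugging Face naming conventions.
PARSER_BY_FAMILY = {
    "llama-4": "llama4",
    "llama-3": "llama3",
    "mistral": "mistral",
    "qwen2.5": "qwen25",
    "deepseek-v3": "deepseekv3",
}

def pick_parser(model_path: str) -> str:
    """Return a parser name for model_path, falling back to the default."""
    name = model_path.lower()
    for family, parser in PARSER_BY_FAMILY.items():
        if family in name:
            return parser
    return "qwen25"  # documented default in the table above

# The commit's own pairing from docker-compose.yml:
assert pick_parser("meta-llama/Llama-3.2-1B-Instruct") == "llama3"
```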

docker-compose.yml

Lines changed: 3 additions & 1 deletion
```diff
@@ -14,10 +14,12 @@ services:
     environment:
       - HOST=0.0.0.0
       - PORT=30000
-      - MODEL_PATH=HuggingFaceTB/SmolLM2-1.7B-Instruct
+      - MODEL_PATH=meta-llama/Llama-3.2-1B-Instruct
       - TRUST_REMOTE_CODE=true
       - ATTENTION_BACKEND=flashinfer
       - SAMPLING_BACKEND=flashinfer
+      - TOOL_CALL_PARSER=llama3
+      - HF_TOKEN=${HF_TOKEN}
 
       # make it work locally with <= 8 GB VRAM
       - MEM_FRACTION_STATIC=0.5
```

engine.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -60,6 +60,7 @@ def start_server(self):
             "LOAD_BALANCE_METHOD": "--load-balance-method",
             "ATTENTION_BACKEND": "--attention-backend",
             "SAMPLING_BACKEND": "--sampling-backend",
+            "TOOL_CALL_PARSER": "--tool-call-parser"
         }
 
         # Boolean flags
```
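
The hunk only shows the dictionary entry; the rest of `start_server()` is not visible here. A hedged sketch of how such an env-to-flag map is typically consumed, so that `TOOL_CALL_PARSER=llama3` ends up as `--tool-call-parser llama3` on the SGLang launch command (the real implementation in `engine.py` may differ):

```python
# Sketch only, not the actual engine.py: build SGLang server arguments
# from environment variables using a map like the one in the diff above.
import os

ENV_TO_FLAG = {
    "ATTENTION_BACKEND": "--attention-backend",
    "SAMPLING_BACKEND": "--sampling-backend",
    "TOOL_CALL_PARSER": "--tool-call-parser",
}

def build_server_args(env=None):
    env = os.environ if env is None else env
    args = ["python3", "-m", "sglang.launch_server"]
    for var, flag in ENV_TO_FLAG.items():
        value = env.get(var)
        if value:  # only forward variables that are actually set
            args += [flag, value]
    return args

print(build_server_args({"TOOL_CALL_PARSER": "llama3"}))
# ['python3', '-m', 'sglang.launch_server', '--tool-call-parser', 'llama3']
```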

test_input.json

Lines changed: 1 addition & 1 deletion
```diff
@@ -2,7 +2,7 @@
   "input": {
     "openai_route": "/v1/chat/completions",
     "openai_input": {
-      "model": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
+      "model": "meta-llama/Llama-3.2-1B-Instruct",
       "messages": [
         { "role": "user", "content": "What is the capital of France?" }
       ]
```
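
With a parser configured, the same `openai_input` structure can carry an OpenAI-style `tools` array. Below is a hypothetical tool-calling variant of `test_input.json`, built in Python; the `get_weather` function is invented for illustration and is not part of the repo:

```python
# Hypothetical tool-calling variant of test_input.json. The tools array
# follows the OpenAI function-calling schema used by /v1/chat/completions.
import json

payload = {
    "input": {
        "openai_route": "/v1/chat/completions",
        "openai_input": {
            "model": "meta-llama/Llama-3.2-1B-Instruct",
            "messages": [
                {"role": "user", "content": "What is the weather in Paris?"}
            ],
            "tools": [
                {
                    "type": "function",
                    "function": {
                        "name": "get_weather",  # invented example tool
                        "description": "Get the current weather for a city",
                        "parameters": {
                            "type": "object",
                            "properties": {"city": {"type": "string"}},
                            "required": ["city"],
                        },
                    },
                }
            ],
        },
    }
}

print(json.dumps(payload, indent=2))
```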

worker-config.json

Lines changed: 17 additions & 1 deletion
```diff
@@ -13,7 +13,8 @@
       "LOAD_FORMAT",
       "DTYPE",
       "CHAT_TEMPLATE",
-      "SERVED_MODEL_NAME"
+      "SERVED_MODEL_NAME",
+      "TOOL_CALL_PARSER"
     ]
   },
   {
@@ -213,6 +214,21 @@
         {"value": "float32", "label": "float32"}
       ]
     },
+    "TOOL_CALL_PARSER": {
+      "env_var_name": "TOOL_CALL_PARSER",
+      "value": "qwen25",
+      "title": "Tool Call Parser",
+      "description": "Defines the parser used to interpret responses",
+      "required": false,
+      "type": "select",
+      "options": [
+        {"value": "llama3", "label": "llama3"},
+        {"value": "llama4", "label": "llama4"},
+        {"value": "mistral", "label": "mistral"},
+        {"value": "qwen25", "label": "qwen25"},
+        {"value": "deepseekv3", "label": "deepseekv3"}
+      ]
+    },
     "CONTEXT_LENGTH": {
       "env_var_name": "CONTEXT_LENGTH",
       "value": "",
```
