Self-Improving Prompt Engineering System with MinionS and LangSmith
Note: Grok 3 put only a little effort into this; it has not been implemented or tested.
1. Objective
Develop a self-improving system to capture, evaluate, and store prompt engineering refinements for the MinionS protocol using LangSmith, maintaining a `prompt_strategies.md` file to optimize local (e.g., Ollama with Llama 3.2) and cloud (e.g., GPT-4o) LLM collaboration.
Goals:
- Log prompts and feedback in LangSmith.
- Evaluate prompts for accuracy and efficiency.
- Store high-performing prompts in a `.md` file.
- Automate updates to the `.md` file.
- Reuse prompts to reduce iteration.
- Handle edge cases (e.g., missing prompts, API failures).
Success Metrics:
- 90%+ prompt reuse success rate.
- Reduce MinionS communication rounds (e.g., 3 to 1).
- Maintain prompts with >0.9 accuracy/efficiency scores.
2. Tools
- MinionS Protocol (GitHub):
  - Install: `pip install torch transformers streamlit`
  - Run demo: `streamlit run app.py`
  - Config: Local LLM (Ollama, Llama 3.2, `temperature=0.0`), Cloud LLM (OpenAI GPT-4o, API key).
  - Code:

    ```python
    from minions.clients.ollama import OllamaClient
    from minions.clients.openai import OpenAIClient
    from minions.minions import Minions

    local_client = OllamaClient(model_name="llama3.2", temperature=0.0)
    remote_client = OpenAIClient(model_name="gpt-4o")
    minions = Minions(local_client, remote_client)
    ```
- LangSmith:
  - Install: `pip install langsmith`
  - Config: API key, `LANGCHAIN_TRACING_V2=true`, `LANGCHAIN_PROJECT=MinionS_Prompts`
  - Tracing: wrap OpenAI calls with `wrap_openai`; log Ollama calls manually (see the sketch after this list).
- Other Tools: `markdown` and `re` libraries (`pip install markdown`), cron/GitHub Actions for scheduling.
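As a rough illustration of that tracing split (not from the original draft), assuming the standard `langsmith` helpers and the `local_client` from the MinionS snippet above:

```python
import os

from openai import OpenAI
from langsmith import traceable
from langsmith.wrappers import wrap_openai

# Assumed environment setup, mirroring the config above.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "MinionS_Prompts"

# Cloud side: wrapping the OpenAI client traces its chat calls automatically.
openai_client = wrap_openai(OpenAI())

# Local side: no built-in wrapper is assumed for Ollama, so trace the call site manually.
@traceable(run_type="llm", name="ollama-llama3.2")
def ollama_generate(local_client, prompt: str) -> str:
    return local_client.generate(prompt)
```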
3. Actionable Implementation Steps
3.1 Capture Refinements
- Log prompts during task decomposition:
  ```python
  from langsmith import Client

  client = Client()

  def decompose_task(task, context, remote_client):
      # Ask the cloud model to break the task into subtask prompts, logging each one.
      prompt = f"Decompose the task: {task} into subtasks."
      subtask_prompts = remote_client.chat(prompt)
      for i, p in enumerate(subtask_prompts):
          client.log_run(inputs={"prompt": p, "task": task}, outputs={},
                         project_name="MinionS_Prompts", metadata={"subtask_id": i})
      return subtask_prompts
  ```
- Log local LLM outputs:
  ```python
  def execute_subtask(subtask_prompt, local_client):
      # Run a single subtask on the local model and log the result.
      output = local_client.generate(subtask_prompt)
      client.log_run(inputs={"prompt": subtask_prompt}, outputs={"response": output},
                     project_name="MinionS_Prompts")
      return output
  ```
- Edge Case: Fallback prompt if API fails: “Summarize key findings.”
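A minimal sketch of that fallback path, assuming the `decompose_task` helper above; the broad `except` is only illustrative:

```python
FALLBACK_PROMPT = "Summarize key findings."

def safe_decompose(task, context, remote_client):
    # If the cloud decomposition call fails, return the generic fallback prompt
    # so the local model can still produce something useful.
    try:
        return decompose_task(task, context, remote_client)
    except Exception:
        return [FALLBACK_PROMPT]
```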
3.2 Evaluate Prompts
- Use LangSmith’s LLM-as-Judge:
  ```python
  def evaluate_prompt(run):
      # Toy scoring: exact-substring check against an expected answer.
      expected = "Expected output"
      actual = run.outputs.get("response", "")
      score = 1.0 if expected in actual else 0.5
      return {"score": score}

  client.evaluate_runs(project_name="MinionS_Prompts", evaluator=evaluate_prompt)
  ```
- Query top prompts:
  ```python
  runs = client.list_runs(project_name="MinionS_Prompts", filters={"score": {"gte": 0.9}})
  ```
3.3 Store Prompts
- Update `prompt_strategies.md`:

  ```python
  from datetime import datetime

  def update_prompt_md(runs, filename="prompt_strategies.md"):
      content = "# Prompt Engineering Strategies\n\n"
      task_types = {}
      for run in runs:
          task_type = run.metadata.get("task_type", "Unknown")
          prompt = run.inputs.get("prompt", "")
          score = run.scores.get("score", 0.0)
          task_types.setdefault(task_type, []).append((prompt, score))
      for task_type, prompts in task_types.items():
          content += f"## Task Type: {task_type}\n\n### Effective Subtasks\n"
          for i, (prompt, score) in enumerate(prompts, 1):
              content += (f'{i}. "{prompt}"\n   - Score: {score:.2f}\n   - Uses: 1\n'
                          f"   - Updated: {datetime.now().strftime('%Y-%m-%d')}\n\n")
      with open(filename, "w") as f:
          f.write(content)
  ```
- Example `.md`:

  ```markdown
  # Prompt Engineering Strategies

  ## Task Type: Medical Report Analysis

  ### Effective Subtasks
  1. "Extract blood pressure and classify: normal (<120/80 mmHg), elevated (120-139/80-89 mmHg), high (>140/90 mmHg)."
     - Score: 0.95
     - Uses: 10
     - Updated: 2025-07-03
  ```
3.4 Reuse Prompts
- Parse the `.md` file:

  ```python
  import re

  def get_prompts_for_task(task_type, filename="prompt_strategies.md"):
      try:
          with open(filename, "r") as f:
              content = f.read()
          pattern = rf"## Task Type: {re.escape(task_type)}\n\n### Effective Subtasks\n(.*?)(?=\n## Task Type|\Z)"
          match = re.search(pattern, content, re.DOTALL)
          if match:
              # Quoted strings inside the section are the stored prompts.
              return re.findall(r"\"(.*?)\"", match.group(1))
          return []
      except FileNotFoundError:
          return []
  ```
- Use in MinionS:
  ```python
  def run_task(task, context, task_type, minions):
      prompts = get_prompts_for_task(task_type)
      if prompts:
          # A stored prompt lets the protocol finish in a single round.
          output = minions(task=prompts[0], context=[context], max_rounds=1)
      else:
          output = minions(task=task, context=[context], max_rounds=2)
      client.log_run(inputs={"prompt": task, "task": task},
                     outputs={"response": output["final_answer"]},
                     metadata={"task_type": task_type})
      return output
  ```
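A hedged usage example, reusing the `minions` instance from section 2; `report_text` is an assumed variable holding the document text:

```python
# Illustrative call only; `report_text` is assumed to hold the raw report.
result = run_task(
    task="Evaluate cardiovascular risk",
    context=report_text,
    task_type="Medical Report Analysis",
    minions=minions,
)
print(result["final_answer"])
```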
3.5 Automate Updates
- Schedule daily updates via cron: `0 0 * * * python update_prompts.py` (a sketch of `update_prompts.py` follows this list).
- Edge Case: Log to a temporary JSON file if the `.md` write fails:

  ```python
  import json

  def save_temp_json(runs):
      with open("temp_prompts.json", "w") as f:
          json.dump([{"prompt": r.inputs["prompt"], "score": r.scores["score"]}
                     for r in runs], f)
  ```
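One possible shape for the `update_prompts.py` that the cron entry invokes (a sketch, not part of the original draft); `prompt_store` is a hypothetical module assumed to expose the helpers defined above:

```python
# update_prompts.py - hedged sketch of the scheduled refresh script.
from langsmith import Client

from prompt_store import save_temp_json, update_prompt_md  # hypothetical module

client = Client()

def main():
    # Re-query the high-scoring runs (same filter as section 3.2) and rewrite the .md file.
    runs = list(client.list_runs(project_name="MinionS_Prompts",
                                 filters={"score": {"gte": 0.9}}))
    try:
        update_prompt_md(runs)
    except OSError:
        # Fall back to the temporary JSON dump if the .md write fails.
        save_temp_json(runs)

if __name__ == "__main__":
    main()
```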
4. Edge Cases and Challenges
- No Stored Prompts: Generate and log new prompts.
- API Failures: Retry with exponential backoff or use a fallback prompt (see the sketch after this list).
- Low Scores: Flag runs with scores <0.5 for manual review.
- Scalability: Use a GitHub repository for the `.md` file to handle large teams.
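A minimal sketch of the retry-with-backoff idea; the retry count and delays are illustrative defaults:

```python
import time

def call_with_backoff(fn, *args, retries=3, base_delay=1.0, **kwargs):
    # Retry a flaky cloud call with exponential backoff; callers can catch the
    # final exception and switch to the fallback prompt from section 3.1.
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```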
5. Next Steps for Resumption
- Verify Setup: Ensure MinionS and LangSmith are configured.
- Test Logging: Run a sample task (e.g., medical report analysis) and verify LangSmith logs.
- Initialize `.md` File: Create `prompt_strategies.md` with the initial structure.
- Schedule Automation: Set up a cron job or GitHub Action.
- Monitor: Check LangSmith dashboard for prompt performance trends.
6. Example Workflow
For a task like “Evaluate cardiovascular risk”:
- Check `prompt_strategies.md` for "Medical Report Analysis" prompts.
- If none, decompose the task with the cloud LLM and log the prompts.
- Execute subtasks with local LLM, log outputs.
- Evaluate with LangSmith, store prompts scoring >0.9.
- Update the `.md` file.