LLM-powered NPC dialogue with memory and TTS for Unity games
A production-ready FastAPI backend that brings NPCs to life using:
- Gemini for contextual, personality-driven dialogue with structured JSON output
- SQLite for lightweight NPC memory (salience × recency retrieval)
- Google Cloud TTS for emotional voice synthesis with SSML
- WebSocket streaming for real-time typewriter effects
rpgai/
├── server/                    # Python FastAPI backend
│   ├── main.py                # FastAPI app: WebSocket chat, memory, TTS
│   ├── llm_client.py          # Gemini client with structured output
│   ├── schemas.py             # Pydantic models + JSON Schema
│   ├── memory.py              # SQLite DAO (salience/recency retrieval)
│   ├── tts.py                 # Google Cloud TTS (SSML)
│   ├── settings.py            # Configuration management
│   └── requirements.txt       # Python dependencies
├── tests/                     # Unit tests
│   ├── test_memory.py
│   └── test_schema.py
├── unity/Assets/Scripts/      # Unity C# templates
│   ├── Net/
│   │   ├── HttpClient.cs      # UnityWebRequest JSON helpers
│   │   └── LLMWebSocketClient.cs  # WebSocket streaming client
│   ├── Dialogue/
│   │   ├── DialogueController.cs  # Main dialogue orchestrator
│   │   └── NpcResponse.cs     # Response data models
│   ├── State/
│   │   └── GameContextProvider.cs # Game state context
│   └── Audio/
│       └── TTSPlayer.cs       # TTS audio playback
├── media/                     # Generated audio files (gitignored)
├── .env                       # API keys (create from .env.example)
└── README.md                  # This file
- Python 3.9+
- Gemini API key
- Google Cloud Account with Text-to-Speech and Speech-to-Text APIs enabled
- Unity 2021.3+ (for client-side integration)
# Clone or navigate to the project
cd rpgai
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r server/requirements.txt
# Configure environment variables
cp .env.example .env
# Edit .env and add:
# GEMINI_API_KEY=your_key_here
# GOOGLE_APPLICATION_CREDENTIALS=/path/to/gcp-credentials.json

# Development mode (auto-reload)
uvicorn server.main:app --reload --port 8000
# Production mode
uvicorn server.main:app --host 0.0.0.0 --port 8000 --workers 4

Server will be available at:
- API Docs: http://localhost:8000/docs
- Health Check: http://localhost:8000/healthz
- WebSocket: ws://localhost:8000/v1/chat.stream
# Run all tests
pytest tests/ -v
# Run specific test file
pytest tests/test_memory.py -v
# With coverage
pytest tests/ --cov=server --cov-report=html

# Gemini API
GEMINI_API_KEY=your_gemini_api_key_here
GEMINI_MODEL=gemini-2.0-flash-exp
# Google Cloud TTS
GOOGLE_APPLICATION_CREDENTIALS=/path/to/gcp-credentials.json
# Server
HOST=0.0.0.0
PORT=8000
LOG_LEVEL=INFO
# Database
DB_PATH=npc_memory.db
# Media Storage
MEDIA_DIR=./media
MEDIA_BASE_URL=http://localhost:8000/media
# Model Parameters
TEMPERATURE=0.7
TOP_P=0.9
MAX_OUTPUT_TOKENS=220

- Create a GCP project and enable:
- Text-to-Speech API (for NPC voice output)
- Speech-to-Text API (for player voice input)
- Create a service account with appropriate permissions
- Download the JSON service account key
- Set GOOGLE_APPLICATION_CREDENTIALS to the path of your JSON key
# Enable APIs (using gcloud CLI)
gcloud services enable texttospeech.googleapis.com
gcloud services enable speech.googleapis.com

Endpoint: ws://localhost:8000/v1/chat.stream
Flow:
- Client connects
- Client sends ONE JSON message (ChatTurnRequest)
- Server streams tokens:
{"type":"token", "text":"..."}
- Server sends final:
{"type":"final", "json":"{...NpcDialogueResponse}"}
Request Payload:
{
"npc_id": "elenor",
"player_id": "p1",
"player_text": "Can you teach me a spell?",
"persona": {
"name": "Elenor",
"role": "Elven mage",
"values": ["order", "wisdom", "loyalty"],
"quirks": ["measured", "formal"],
"backstory": ["mentored apothecary", "distrusts smugglers"]
},
"context": {
"scene": "Silverwoods_clearing",
"time_of_day": "dusk",
"weather": "light_rain",
"last_player_action": "returned_lost_ring",
"player_reputation": 12,
"npc_health": 100,
"npc_alertness": 0.3
}
}

Response Stream:
{"type": "token", "text": "Ah, "}
{"type": "token", "text": "you wish to "}
{"type": "token", "text": "learn magic? "}
{"type": "final", "json": "{\"utterance\":\"Ah, you wish to learn magic? Very well.\",\"emotion\":\"neutral\",\"behavior_directive\":\"none\",\"memory_writes\":[{\"salience\":1,\"text\":\"Player asked about magic training\"}]}"}

POST /v1/memory/write
Content-Type: application/json
{
"npc_id": "elenor",
"player_id": "p1",
"text": "Player returned lost ring",
"salience": 2,
"keys": ["ring", "kindness"],
"private": true
}
# Response
{"ok": true, "id": 1}

GET /v1/memory/top?npc_id=elenor&player_id=p1&k=3
# Response
{
"memories": [
{
"id": 1,
"npc_id": "elenor",
"player_id": "p1",
"text": "Player returned lost ring",
"salience": 2,
"private": true,
"keys": "[\"ring\", \"kindness\"]",
"ts": 1699564800
}
]
}

POST /v1/voice/tts
Content-Type: application/json
{
"ssml": "<speak><prosody rate='95%' pitch='+1st'>Greetings, traveler.</prosody></speak>",
"voice_name": "en-US-Neural2-C"
}
# Response
{
"audio_url": "http://localhost:8000/media/abc123.mp3"
}

Available Voice Presets:
- en-US-Neural2-C - Feminine, calm (default)
- en-US-Neural2-F - Feminine, young
- en-US-Neural2-D - Masculine, deep
- en-US-Neural2-A - Masculine, casual
- en-GB-Neural2-B - Elderly, wise
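The SSML payload shown above can be assembled programmatically before calling /v1/voice/tts. A minimal sketch (the `build_ssml` helper and its default rate/pitch values are illustrative, not part of the server API):

```python
from xml.sax.saxutils import escape

def build_ssml(text: str, rate: str = "95%", pitch: str = "+1st") -> str:
    """Wrap plain NPC text in the <speak>/<prosody> envelope expected by the TTS endpoint."""
    # escape() guards against '&', '<', '>' in NPC utterances breaking the SSML.
    return f"<speak><prosody rate='{rate}' pitch='{pitch}'>{escape(text)}</prosody></speak>"

ssml = build_ssml("Greetings, traveler.")
print(ssml)
# <speak><prosody rate='95%' pitch='+1st'>Greetings, traveler.</prosody></speak>
```

This keeps prosody tuning (per-NPC rate/pitch) in one place instead of hand-writing SSML strings at each call site.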
POST /v1/voice/stt
Content-Type: multipart/form-data
# Form fields:
# - audio: Audio file (WAV, MP3, FLAC, OGG, WEBM)
# - language_code: Language code (default: en-US)
# Response
{
"text": "Hello, I would like to buy some potions",
"confidence": 0.95
}

Usage Example (curl):
# Record audio (on macOS/Linux)
ffmpeg -f avfoundation -i ":0" -t 5 recording.wav
# Or on Windows
# ffmpeg -f dshow -i audio="Microphone" -t 5 recording.wav
# Send to STT endpoint
curl -X POST http://localhost:8000/v1/voice/stt \
-F "[email protected]" \
    -F "language_code=en-US"

Workflow with Voice Input:
- Player records audio in Unity using Microphone.Start()
- Unity converts the audio to WAV/MP3 and sends it to /v1/voice/stt
- Server transcribes the audio to text using GCP Speech-to-Text
- Unity receives the transcribed text and uses it as player_text in the dialogue request
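Before uploading, a client can sanity-check the recording against the formats the STT endpoint accepts. A minimal sketch (the `stt_form_fields` helper is illustrative; only the accepted formats and the `audio`/`language_code` form fields come from the endpoint description above):

```python
from pathlib import Path

# Formats accepted by /v1/voice/stt, per the form-field notes above.
ACCEPTED = {".wav", ".mp3", ".flac", ".ogg", ".webm"}

def stt_form_fields(audio_path: str, language_code: str = "en-US") -> dict:
    """Validate the audio extension and return the non-file form fields."""
    ext = Path(audio_path).suffix.lower()
    if ext not in ACCEPTED:
        raise ValueError(f"unsupported audio format: {ext}")
    return {"language_code": language_code}

fields = stt_form_fields("recording.wav")
print(fields)  # {'language_code': 'en-US'}
# With the requests library installed, the upload would then look like:
#   requests.post("http://localhost:8000/v1/voice/stt",
#                 files={"audio": open("recording.wav", "rb")}, data=fields)
```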
See Unity README for complete voice input implementation example.
Install NativeWebSocket via Unity Package Manager:
https://github.com/endel/NativeWebSocket.git#upm
- Create an empty GameObject called DialogueSystem
- Attach the DialogueController component
- Attach the GameContextProvider component
- Create another GameObject called TTSPlayer and attach the TTSPlayer component
- Wire references in the Inspector
using UnityEngine;
using RPGAI.Dialogue;
using RPGAI.Audio;
public class PlayerInteraction : MonoBehaviour
{
[SerializeField] private DialogueController dialogue;
[SerializeField] private TTSPlayer ttsPlayer;
private NpcPersona elenorPersona = new NpcPersona
{
name = "Elenor",
role = "Elven mage",
values = new[] { "order", "wisdom", "loyalty" },
quirks = new[] { "measured", "formal" },
backstory = new[] { "mentored apothecary", "distrusts smugglers" }
};
async void Start()
{
// Subscribe to events
dialogue.OnTokenReceived += ShowToken;
dialogue.OnResponseComplete += HandleResponse;
// Send a message
await dialogue.SendPlayerMessage(
"elenor",
"Can you teach me a spell?",
elenorPersona
);
}
private void ShowToken(string token)
{
// Update UI with typewriter effect
Debug.Log($"Token: {token}");
}
private async void HandleResponse(NpcResponse response)
{
Debug.Log($"Utterance: {response.utterance}");
Debug.Log($"Emotion: {response.emotion}");
Debug.Log($"Behavior: {response.behavior_directive}");
// Apply emotion to animator
// animator.SetTrigger(response.emotion.ToString());
// Execute behavior
// behaviorTree.Execute(response.behavior_directive);
// Play TTS
await ttsPlayer.PlayTTS(response.utterance);
}
}

import asyncio
import websockets
import json
async def test_chat():
uri = "ws://localhost:8000/v1/chat.stream"
async with websockets.connect(uri) as websocket:
payload = {
"npc_id": "elenor",
"player_id": "p1",
"player_text": "Hello!",
"persona": {
"name": "Elenor",
"role": "Elven mage",
"values": ["wisdom"],
"quirks": ["formal"],
"backstory": ["lives in forest"]
},
"context": {
"scene": "forest",
"time_of_day": "noon",
"weather": "clear",
"player_reputation": 0,
"npc_health": 100,
"npc_alertness": 0.0
}
}
await websocket.send(json.dumps(payload))
async for message in websocket:
data = json.loads(message)
if data["type"] == "token":
print(data["text"], end="", flush=True)
elif data["type"] == "final":
print(f"\n\nFinal JSON: {data['json']}")
break
asyncio.run(test_chat())

curl -X POST http://localhost:8000/v1/voice/tts \
-H "Content-Type: application/json" \
-d '{
"ssml": "<speak>Greetings, traveler.</speak>",
"voice_name": "en-US-Neural2-C"
}'
# Response: {"audio_url":"http://localhost:8000/media/abc123.mp3"}
# Visit the URL in your browser to play the audio

The LLM is configured to ALWAYS return JSON matching this schema:
{
"utterance": "string (max 320 chars)",
"emotion": "neutral|happy|angry|fear|sad|surprised|disgust",
"style_tags": ["formal", "whisper", ...], // optional, max 3
"behavior_directive": "none|approach|step_back|flee|attack|call_guard|give_item|start_quest|open_shop|heal_player",
"memory_writes": [ // optional, max 2
{
"salience": 0-3,
"text": "string (max 160 chars)",
"keys": ["keyword1", ...], // optional, max 4
"private": true
}
],
"public_events": [ // optional, max 1
{
"event_type": "string",
"payload": {}
}
],
"voice_hint": { // optional
"voice_preset": "string",
"ssml_style": "default|narration|whispered|shouted|urgent|calm"
}
}

- Unity sends context → WebSocket with persona + game state + player text
- Backend retrieves memories → top 3 by salience × recency
- Gemini generates response → structured JSON with emotion, behavior, utterance
- Backend streams tokens → Unity shows typewriter effect
- Backend sends final JSON → Unity parses and applies
- Unity triggers actions → animation, behavior tree, TTS playback
- Memories auto-saved → backend writes memory_writes to SQLite
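Note that the final frame nests the NpcDialogueResponse as a JSON string inside the outer message, so clients decode twice. A minimal sketch with a hand-written example frame (the field values are illustrative):

```python
import json

# A "final" frame as streamed by the server: the NPC response is itself a
# JSON string nested inside the outer message, so it is decoded twice.
frame = '{"type": "final", "json": "{\\"utterance\\": \\"Very well.\\", \\"emotion\\": \\"neutral\\", \\"behavior_directive\\": \\"none\\"}"}'

outer = json.loads(frame)              # first decode: the stream envelope
assert outer["type"] == "final"
response = json.loads(outer["json"])   # second decode: the NpcDialogueResponse
print(response["utterance"])           # Very well.
print(response["behavior_directive"])  # none
```

The Unity side does the same double-decode when parsing the "final" message before applying emotion, behavior, and TTS.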
- Salience (0-3): Importance level (3 = critical, 0 = trivial)
- Recency: Unix timestamp
- Retrieval: ORDER BY salience DESC, ts DESC LIMIT k
- Isolation: memories are per (npc_id, player_id) pair
Turn 1: Player steals item
→ Memory: "Player stole potion" (salience=3)
→ NPC: "How dare you! Guards!" (emotion=angry, behavior=call_guard)

Turn 2: Player returns later
→ Retrieves: "Player stole potion" (salience=3)
→ NPC: "You! Get out of my shop!" (emotion=angry, behavior=step_back)

Turn 3: Player gives gift
→ Retrieves: "Player stole potion" (salience=3)
→ Memory: "Player gave gift as apology" (salience=2)
→ NPC: "Perhaps... I misjudged you." (emotion=neutral, behavior=none)
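The retrieval rule (ORDER BY salience DESC, ts DESC LIMIT k) explains why the theft keeps surfacing in this example. A self-contained sketch of that query (the table layout here is illustrative; the real schema lives in server/memory.py):

```python
import sqlite3
import time

# In-memory sketch of salience x recency retrieval.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE memories (npc_id TEXT, player_id TEXT, text TEXT, salience INTEGER, ts INTEGER)"
)
now = int(time.time())
db.executemany(
    "INSERT INTO memories VALUES (?, ?, ?, ?, ?)",
    [
        ("elenor", "p1", "Player stole potion", 3, now - 3600),
        ("elenor", "p1", "Player gave gift as apology", 2, now),
        ("elenor", "p1", "Player browsed shelves", 0, now - 60),
    ],
)

# Top-k: highest salience first, ties broken by recency.
top = db.execute(
    "SELECT text FROM memories WHERE npc_id=? AND player_id=? "
    "ORDER BY salience DESC, ts DESC LIMIT ?",
    ("elenor", "p1", 3),
).fetchall()
print([t for (t,) in top])
# ['Player stole potion', 'Player gave gift as apology', 'Player browsed shelves']
```

Even though the gift is the most recent memory, the salience-first ordering keeps the theft at the top of the prompt context, which is what drives the NPC's lingering hostility.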
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY server/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY server/ ./server/
EXPOSE 8000
CMD ["uvicorn", "server.main:app", "--host", "0.0.0.0", "--port", "8000"]

# Build and run
docker build -t rpgai .
docker run -p 8000:8000 --env-file .env rpgai

# /etc/systemd/system/rpgai.service
[Unit]
Description=RPGAI NPC Dialogue Service
After=network.target
[Service]
Type=simple
User=www-data
WorkingDirectory=/opt/rpgai
Environment="PATH=/opt/rpgai/venv/bin"
ExecStart=/opt/rpgai/venv/bin/uvicorn server.main:app --host 0.0.0.0 --port 8000
Restart=always
[Install]
WantedBy=multi-user.target

- Never commit .env to git
- Use secret managers (AWS Secrets Manager, GCP Secret Manager)
- Restrict CORS origins in production
- Enable HTTPS (use Nginx/Caddy as reverse proxy)
- Rate limit WebSocket connections
- Create .env file with GEMINI_API_KEY=your_key
- Check GOOGLE_APPLICATION_CREDENTIALS path is correct
- Verify GCP Text-to-Speech API is enabled
- Check service account has TTS permissions
- Check firewall allows WebSocket connections
- Verify Unity's WebSocket URL matches server address
- Check server logs: tail -f logs/rpgai.log
- Check NPC_DIALOGUE_SCHEMA matches NpcDialogueResponse
- Increase max_output_tokens if responses are truncated
- Review system instruction for clarity
- Use gemini-1.5-flash instead of gemini-2.0-flash-exp
- Reduce MAX_OUTPUT_TOKENS (default 220)
- Cache system instruction (future enhancement)
- Gemini API Docs
- Gemini Structured Output
- Google Cloud TTS
- FastAPI WebSockets
- NativeWebSocket for Unity
MIT License - see LICENSE file for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
This project implements the RPGAI challenge requirements:
- Dynamic AI dialogues with personality-driven responses
- Integrated game prototype (Unity templates provided)
- In-game context integration (weather, reputation, actions)
- Character memory system with salience ranking
- Emotional & behavioral reactions
- Voice integration (Google Cloud TTS)
What makes this special:
- Production-ready architecture (not just a proof-of-concept)
- Structured output ensures Unity never receives malformed JSON
- Memory system creates persistent relationships between player and NPCs
- Streaming responses for smooth UX (typewriter effect)
- Comprehensive documentation for easy integration
Made with ❤️ for immersive RPG experiences.
Questions? Open an issue or check /docs for interactive API documentation.