
Commit 835842f

Add Valkey distributed cache for horizontal scaling
Implement distributed caching using Valkey (Redis-compatible) to enable horizontal scaling of Trino Gateway across multiple instances. This allows query metadata to be shared between gateway instances, ensuring consistent routing regardless of which instance receives a request.

Key features:
- 3-tier caching architecture: L1 (Guava local) → L2 (Valkey distributed) → L3 (PostgreSQL)
- Graceful degradation when Valkey unavailable (falls back to database)
- Configurable health checks and connection pooling
- Cache metrics (hits, misses, writes, errors, hit rate)
- Write-through caching for backend and routing group lookups
- Lazy-loading for external URL lookups
- Convention over Configuration with sensible defaults

Implementation:
- Add ValkeyConfiguration with 11 configurable parameters (minimal 3 required)
- Create DistributedCache interface and ValkeyDistributedCache implementation
- Integrate distributed cache into BaseRoutingManager routing logic
- Use modern Duration API (no deprecated methods)
- Add comprehensive input validation and error handling
- Include 31 unit tests (16 config + 15 cache tests)

Configuration:

    valkeyConfiguration:
      enabled: true
      host: valkey.internal
      port: 6379
      password: ${VALKEY_PASSWORD}

Documentation includes:
- Quick start guide with minimal configuration
- Full configuration reference with tuning guidelines
- Deployment scenarios (single vs. multi-instance)
- Performance tuning recommendations
- Security best practices
- Architecture documentation and troubleshooting

Single-instance deployments don't need distributed caching - local Guava cache is sufficient. Multi-instance deployments benefit from shared cache for consistent query routing.
1 parent 5a36940 commit 835842f

23 files changed, +1623 -20 lines changed

docs/config.yaml

Lines changed: 8 additions & 0 deletions
@@ -11,3 +11,11 @@ dataStore:
 
 clusterStatsConfiguration:
   monitorType: INFO_API
+
+# Valkey distributed cache (optional - for multi-instance deployments)
+valkeyConfiguration:
+  enabled: false
+  host: localhost
+  port: 6379
+  # password: ${VALKEY_PASSWORD} # Uncomment if Valkey requires AUTH
+  # cacheTtlSeconds: 1800 # Cache TTL in seconds (default: 1800 = 30 minutes)

docs/installation.md

Lines changed: 26 additions & 0 deletions
@@ -161,6 +161,32 @@ For additional configurations, use the `log.*` properties from the
 [Trino logging properties documentation](https://trino.io/docs/current/admin/properties-logging.html) and specify
 the properties in `serverConfig`.
 
+### Configure distributed cache (optional)
+
+For multi-instance deployments, Trino Gateway supports distributed caching
+using Valkey (or Redis) to share query metadata across gateway instances.
+This improves query routing and enables horizontal scaling.
+
+For single gateway deployments, distributed caching is not needed - the
+local cache is sufficient.
+
+```yaml
+valkeyConfiguration:
+  enabled: true
+  host: valkey.internal.prod
+  port: 6379
+  password: ${ENV:VALKEY_PASSWORD}
+  cacheTtlSeconds: 1800 # Cache TTL (default: 1800 = 30 minutes)
+```
+
+**Optional parameters**: You can customize `cacheTtlSeconds` based on your query duration:
+- Short queries (< 5 min): 600 seconds (10 minutes)
+- Default queries: 1800 seconds (30 minutes)
+- Long-running queries: 3600 seconds (1 hour)
+
+See [Valkey distributed cache configuration](valkey-configuration.md) for
+detailed configuration options, deployment scenarios, and performance tuning.
+
 ### Proxying additional paths
 
 By default, Trino Gateway only proxies requests to paths starting with

docs/operation.md

Lines changed: 16 additions & 2 deletions
@@ -58,8 +58,8 @@ monitor:
 
 ## Monitoring <a name="monitoring"></a>
 
-Trino Gateway provides a metrics endpoint that uses the OpenMetrics format at
-`/metrics`. Use it to monitor Trino Gateway instances with Prometheus and
+Trino Gateway provides a metrics endpoint that uses the OpenMetrics format at
+`/metrics`. Use it to monitor Trino Gateway instances with Prometheus and
 other compatible systems with the following Prometheus configuration:
 
 ```yaml
@@ -70,6 +70,20 @@ scrape_configs:
 - gateway1.example.com:8080
 ```
 
+### Multi-instance deployments
+
+When running multiple Trino Gateway instances, enable the Valkey distributed
+cache to share query metadata across instances. This ensures consistent query
+routing regardless of which gateway instance receives the request.
+
+Monitor the distributed cache performance by checking:
+- Cache hit rate (target: 85-95%)
+- Cache errors (should be near 0)
+- Valkey server connectivity and memory usage
+
+See [Valkey distributed cache configuration](valkey-configuration.md) for
+setup instructions and monitoring details.
+
 ## Trino Gateway health endpoints
 
 Trino Gateway provides two API endpoints to indicate the current status of the server:

docs/valkey-configuration.md

Lines changed: 165 additions & 0 deletions
@@ -0,0 +1,165 @@
+
+## Performance Tuning
+
+### Cache TTL (`cacheTtlSeconds`)
+
+- **Default (1800s / 30min):** Good for typical workloads
+- **Short-lived queries (<5min):** Use 600s (10min)
+- **Long-running queries (hours):** Use 3600s (1 hour) or more
+- **Interactive development:** Use 300s (5min)
+
+### Health Check Interval (`healthCheckIntervalMs`)
+
+- **Default (30000ms / 30s):** Balanced check frequency
+- **Unstable network:** Increase to 60000ms (1 min)
+- **Critical systems:** Decrease to 10000ms (10s)
+
+### Connection Timeouts (`timeoutMs`)
+
+- **Default (2000ms):** Good for local/same-datacenter Valkey
+- **Cross-region:** Increase to 5000ms
+- **High latency network:** Increase to 10000ms
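
Review note: the `cacheTtlSeconds` guidance in this section reads as a small decision rule, which the following Java sketch spells out. `CacheTtlAdvisor` and `recommendedTtl` are hypothetical names that do not appear in the commit, and the one-hour cutoff for "long-running" is an assumption, since the documentation only says "hours".

```java
import java.time.Duration;

// Hypothetical helper, not part of the commit: maps a typical query duration
// to the cacheTtlSeconds values suggested in the tuning guidance.
final class CacheTtlAdvisor
{
    private CacheTtlAdvisor() {}

    static Duration recommendedTtl(Duration typicalQueryDuration)
    {
        if (typicalQueryDuration.isNegative()) {
            throw new IllegalArgumentException("query duration must not be negative");
        }
        // Short-lived queries (< 5 min): 600s
        if (typicalQueryDuration.compareTo(Duration.ofMinutes(5)) < 0) {
            return Duration.ofSeconds(600);
        }
        // Long-running queries: 3600s (assumed cutoff: one hour or more)
        if (typicalQueryDuration.compareTo(Duration.ofHours(1)) >= 0) {
            return Duration.ofSeconds(3600);
        }
        // Everything else keeps the documented default of 1800s
        return Duration.ofSeconds(1800);
    }
}
```

The returned value would be written to `cacheTtlSeconds` as `recommendedTtl(...).getSeconds()`; the 300s setting for interactive development is a manual choice and is left out of the rule.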
+
+---
+
+## Monitoring
+
+Valkey cache exposes the following metrics (accessible via `ValkeyDistributedCache` instance):
+
+```java
+long hits = cache.getCacheHits();
+long misses = cache.getCacheMisses();
+long writes = cache.getCacheWrites();
+long errors = cache.getCacheErrors();
+double hitRate = cache.getCacheHitRate(); // Percentage
+```
+
+### Expected Metrics (Healthy System)
+
+- **Cache Hit Rate:** 85-95%
+- **Cache Errors:** 0 (or very low)
+- **Cache Writes:** ~Equal to query submission rate
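
Review note: this part of the diff does not show how `getCacheHitRate()` is computed. A minimal sketch of one plausible definition, hits divided by total lookups and expressed as a percentage, follows. `CacheStatsSketch` and its methods are hypothetical, and the choice to report 0.0 before any lookup has happened is an assumption.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical counters, not the commit's ValkeyDistributedCache implementation.
final class CacheStatsSketch
{
    private final LongAdder hits = new LongAdder();
    private final LongAdder misses = new LongAdder();

    void recordHit()
    {
        hits.increment();
    }

    void recordMiss()
    {
        misses.increment();
    }

    // Hit rate as a percentage of all lookups; 0.0 when nothing has been looked up yet.
    double hitRatePercent()
    {
        long hitCount = hits.sum();
        long total = hitCount + misses.sum();
        return total == 0 ? 0.0 : 100.0 * hitCount / total;
    }
}
```

`LongAdder` is used because several request threads may record hits and misses concurrently; a value in the 85-95 range would match the healthy target listed in this section.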
+
+### Troubleshooting
+
+**Low Hit Rate (<70%)**
+- Check TTL settings (may be too short)
+- Verify Valkey isn't evicting entries (check memory)
+- Check if multiple gateway versions deployed (cache key mismatch)
+
+**High Error Rate**
+- Check Valkey connectivity
+- Verify password/AUTH configuration
+- Review Valkey server logs
+
+**Connection Pool Exhaustion**
+- Increase `maxTotal` setting
+- Check for connection leaks (should be none with try-with-resources)
+
+---
+
+## Security Considerations
+
+### Production Deployment Checklist
+
+- [ ] **Enable AUTH:** Set `password` in configuration
+- [ ] **Use Environment Variables:** Don't hardcode passwords
+- [ ] **Network Security:** Deploy Valkey in private VPC/network
+- [ ] **Encryption at Rest:** Enable Valkey persistence encryption
+- [ ] **TLS/SSL:** (Future enhancement - not yet supported)
+- [ ] **Access Control:** Restrict Valkey port (6379) to gateway instances only
+
+### Example Production Setup
+
+```yaml
+# config.yaml
+valkeyConfiguration:
+  enabled: true
+  host: ${VALKEY_INTERNAL_HOST}
+  port: 6379
+  password: ${VALKEY_PASSWORD}
+```
+
+```bash
+# Environment variables (set in deployment)
+export VALKEY_INTERNAL_HOST=valkey.vpc.internal
+export VALKEY_PASSWORD=$(vault read -field=password secret/valkey)
+```
+
+---
+
+## Architecture
+
+### 3-Tier Caching
+
+```
+Request Flow:
+1. Check L1 (Local Guava Cache) → 10k entries, 30min TTL
+   ├─ Hit: Return immediately (~1ms)
+   └─ Miss: Continue to L2
+
+2. Check L2 (Valkey Distributed Cache) → Shared across gateways
+   ├─ Hit: Populate L1, return (~5ms)
+   └─ Miss: Continue to L3
+
+3. Check L3 (PostgreSQL Database) → Source of truth
+   ├─ Found: Populate L2 + L1, return (~50ms)
+   └─ Not Found: Search all backends via HTTP (~200ms)
+```
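
Review note: the request flow above, together with the graceful degradation described in the commit message, can be sketched in plain Java. This is an illustration under stated assumptions, not the `BaseRoutingManager` code from the commit: `SharedCacheSketch` and `TieredLookupSketch` are hypothetical names, a `ConcurrentHashMap` stands in for the Guava L1 cache (no TTL or size bound), the database is modeled as a function, and the final "search all backends via HTTP" step is left out.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical stand-in for the shared (L2) tier; not the commit's DistributedCache interface.
interface SharedCacheSketch
{
    Optional<String> get(String key);

    void set(String key, String value);
}

final class TieredLookupSketch
{
    private final Map<String, String> local = new ConcurrentHashMap<>(); // L1 stand-in
    private final SharedCacheSketch shared;                              // L2
    private final Function<String, Optional<String>> database;           // L3, keyed by query ID

    TieredLookupSketch(SharedCacheSketch shared, Function<String, Optional<String>> database)
    {
        this.shared = shared;
        this.database = database;
    }

    Optional<String> findBackend(String queryId)
    {
        String key = "trino:query:backend:" + queryId;

        // 1. L1 hit: return immediately
        String cached = local.get(key);
        if (cached != null) {
            return Optional.of(cached);
        }

        // 2. L2 lookup; a failing shared cache is treated as a miss (graceful degradation)
        Optional<String> fromShared;
        try {
            fromShared = shared.get(key);
        }
        catch (RuntimeException e) {
            fromShared = Optional.empty();
        }
        if (fromShared.isPresent()) {
            local.put(key, fromShared.get());
            return fromShared;
        }

        // 3. L3 lookup; on success populate L1 and, best effort, L2
        Optional<String> fromDatabase = database.apply(queryId);
        if (fromDatabase.isPresent()) {
            local.put(key, fromDatabase.get());
            try {
                shared.set(key, fromDatabase.get());
            }
            catch (RuntimeException e) {
                // shared cache unavailable: the value is still served from L1 and the database
            }
        }
        return fromDatabase;
    }
}
```

Treating every L2 failure as a miss keeps request handling independent of Valkey availability, which is the behavior the FAQ describes; the cost is extra database load while the shared tier is down.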
+
+### Cache Keys
+
+```
+Backend: trino:query:backend:{queryId}
+Routing Group: trino:query:routinggroup:{queryId}
+External URL: trino:query:externalurl:{queryId}
+```
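
Review note: the key layout above could be centralized in one helper so that every gateway instance builds identical keys. The sketch below is hypothetical (`CacheKeysSketch` is not a class from the commit) and the rejection of blank query IDs is an added assumption.

```java
// Hypothetical key builder following the documented trino:query:<kind>:{queryId} layout.
final class CacheKeysSketch
{
    private static final String PREFIX = "trino:query:";

    private CacheKeysSketch() {}

    static String backend(String queryId)
    {
        return key("backend", queryId);
    }

    static String routingGroup(String queryId)
    {
        return key("routinggroup", queryId);
    }

    static String externalUrl(String queryId)
    {
        return key("externalurl", queryId);
    }

    private static String key(String kind, String queryId)
    {
        // Assumption: a blank query ID is a caller error rather than a valid key
        if (queryId == null || queryId.isBlank()) {
            throw new IllegalArgumentException("queryId must not be blank");
        }
        return PREFIX + kind + ":" + queryId;
    }
}
```

Keeping the format in one place also addresses the troubleshooting hint about cache key mismatches between gateway versions, since a format change then touches a single class.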
+
+---
+
+## Migration Guide
+
+### From Single Gateway to Multi-Gateway
+
+1. **Deploy Valkey server** (standalone or cluster)
+2. **Update config.yaml** on all gateways:
+   ```yaml
+   valkeyConfiguration:
+     enabled: true
+     host: valkey.internal
+     port: 6379
+     password: ${VALKEY_PASSWORD}
+   ```
+3. **Restart gateways** (rolling restart recommended)
+4. **Monitor metrics** to verify cache hit rates
+
+No data migration needed - cache will populate automatically.
+
+---
+
+## FAQ
+
+**Q: Do I need Valkey if I only have one gateway?**
+A: No. Local Guava cache is sufficient for single-instance deployments.
+
+**Q: What happens if Valkey goes down?**
+A: Graceful degradation - queries continue working, falling back to database. Performance may degrade slightly.
+
+**Q: Can I use Redis instead of Valkey?**
+A: Yes! Valkey is a Redis fork with compatible protocol. Just point to your Redis server.
+
+**Q: How much memory does Valkey need?**
+A: Rough estimate: `(queries per minute) × (average query lifetime in minutes) × 500 bytes`
+Example: 1000 q/min × 30 min × 500 bytes = ~15 MB
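
Review note: the memory estimate in this answer is a three-factor product, so it is easy to check for other workloads. A minimal sketch follows; `CacheMemoryEstimate` is a hypothetical name, and the 500 bytes per entry is the document's own rough figure, passed in as a parameter rather than measured.

```java
// Hypothetical calculator for the FAQ's rough sizing formula.
final class CacheMemoryEstimate
{
    private CacheMemoryEstimate() {}

    // (queries per minute) x (average query lifetime in minutes) x (bytes per entry)
    static long estimateBytes(long queriesPerMinute, long averageLifetimeMinutes, long bytesPerEntry)
    {
        if (queriesPerMinute < 0 || averageLifetimeMinutes < 0 || bytesPerEntry < 0) {
            throw new IllegalArgumentException("inputs must not be negative");
        }
        // multiplyExact fails loudly on overflow instead of silently wrapping
        return Math.multiplyExact(Math.multiplyExact(queriesPerMinute, averageLifetimeMinutes), bytesPerEntry);
    }
}
```

For the documented example, `estimateBytes(1000, 30, 500)` returns 15,000,000 bytes, which is the "~15 MB" quoted above in decimal megabytes. The figure covers entry payloads only and does not account for Valkey's own per-key overhead.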
+
+**Q: Can I clear the cache?**
+A: Yes, via Valkey CLI: `redis-cli -h <host> -a <password> FLUSHDB`
+Or selectively (`DEL` takes literal key names, not glob patterns, so scan for matching keys first): `redis-cli --scan --pattern 'trino:query:backend:*' | xargs redis-cli DEL`
+
+---
+
+## Support
+
+For issues or questions:
+- GitHub Issues: https://github.com/trinodb/trino-gateway/issues
+- Trino Community Slack: #trino-gateway channel

gateway-ha/config.yaml

Lines changed: 8 additions & 0 deletions
@@ -20,3 +20,11 @@ clusterStatsConfiguration:
 monitor:
   taskDelay: 1m
   clusterMetricsRegistryRefreshPeriod: 30s
+
+# Valkey distributed cache (optional - for multi-instance deployments)
+valkeyConfiguration:
+  enabled: false # Set to true to enable distributed caching
+  host: localhost
+  port: 6379
+  # password: ${VALKEY_PASSWORD} # Uncomment if Valkey requires AUTH
+  # cacheTtlSeconds: 1800 # Cache TTL in seconds (default: 1800 = 30 minutes)

gateway-ha/pom.xml

Lines changed: 6 additions & 0 deletions
@@ -191,6 +191,12 @@
             <version>${dep.trino.version}</version>
         </dependency>
 
+        <dependency>
+            <groupId>io.valkey</groupId>
+            <artifactId>valkey-java</artifactId>
+            <version>5.5.0</version>
+        </dependency>
+
         <dependency>
             <groupId>jakarta.annotation</groupId>
             <artifactId>jakarta.annotation-api</artifactId>

gateway-ha/src/main/java/io/trino/gateway/ha/config/HaGatewayConfiguration.java

Lines changed: 12 additions & 0 deletions
@@ -48,6 +48,8 @@ public class HaGatewayConfiguration
 
     private UIConfiguration uiConfiguration = new UIConfiguration();
 
+    private ValkeyConfiguration valkeyConfiguration = new ValkeyConfiguration();
+
     // List of Modules with FQCN (Fully Qualified Class Name)
     private List<String> modules;
 
@@ -278,6 +280,16 @@ public void setProxyResponseConfiguration(ProxyResponseConfiguration proxyRespon
         this.proxyResponseConfiguration = proxyResponseConfiguration;
     }
 
+    public ValkeyConfiguration getValkeyConfiguration()
+    {
+        return valkeyConfiguration;
+    }
+
+    public void setValkeyConfiguration(ValkeyConfiguration valkeyConfiguration)
+    {
+        this.valkeyConfiguration = valkeyConfiguration;
+    }
+
     private void validateStatementPath(String statementPath, List<String> statementPaths)
     {
         if (statementPath.startsWith(V1_STATEMENT_PATH) ||
