
Commit 17ae63d

Add Valkey distributed cache for horizontal scaling
This commit adds Valkey (Redis-compatible) distributed cache support to enable horizontal scaling of Trino Gateway across multiple instances.

Key features:
- 3-tier caching: Guava (local) -> Valkey (distributed) -> PostgreSQL (persistent)
- Graceful degradation if Valkey is unavailable
- Connection pooling with health checks
- Configurable TTL and connection parameters
- Comprehensive documentation

New files:
- ValkeyConfiguration.java: Configuration class
- DistributedCache.java: Cache interface
- ValkeyDistributedCache.java: Valkey implementation
- NoopDistributedCache.java: Test implementation
- docs/valkey-configuration.md: Complete documentation

Modified files:
- Integrated cache into routing managers
- Updated dependency injection
- Added Jedis dependency
- Updated configuration files
- Updated tests
1 parent f88ba97 commit 17ae63d

23 files changed: +1770 -20 lines changed

docs/config.yaml

Lines changed: 8 additions & 0 deletions
@@ -11,3 +11,11 @@ dataStore:
 
 clusterStatsConfiguration:
   monitorType: INFO_API
+
+# Valkey distributed cache (optional - for multi-instance deployments)
+valkeyConfiguration:
+  enabled: false
+  host: localhost
+  port: 6379
+  # password: ${VALKEY_PASSWORD}  # Uncomment if Valkey requires AUTH
+  # cacheTtlSeconds: 1800  # Cache TTL in seconds (default: 1800 = 30 minutes)

docs/installation.md

Lines changed: 26 additions & 0 deletions
@@ -161,6 +161,32 @@ For additional configurations, use the `log.*` properties from the
 [Trino logging properties documentation](https://trino.io/docs/current/admin/properties-logging.html) and specify
 the properties in `serverConfig`.
 
+### Configure distributed cache (optional)
+
+For multi-instance deployments, Trino Gateway supports distributed caching
+using Valkey (or Redis) to share query metadata across gateway instances.
+This improves query routing and enables horizontal scaling.
+
+For single gateway deployments, distributed caching is not needed - the
+local cache is sufficient.
+
+```yaml
+valkeyConfiguration:
+  enabled: true
+  host: valkey.internal.prod
+  port: 6379
+  password: ${ENV:VALKEY_PASSWORD}
+  cacheTtlSeconds: 1800  # Cache TTL (default: 1800 = 30 minutes)
+```
+
+**Optional parameters**: You can customize `cacheTtlSeconds` based on your query duration:
+- Short queries (< 5 min): 600 seconds (10 minutes)
+- Default queries: 1800 seconds (30 minutes)
+- Long-running queries: 3600 seconds (1 hour)
+
+See [Valkey distributed cache configuration](valkey-configuration.md) for
+detailed configuration options, deployment scenarios, and performance tuning.
+
 ### Proxying additional paths
 
 By default, Trino Gateway only proxies requests to paths starting with

docs/operation.md

Lines changed: 16 additions & 2 deletions
@@ -58,8 +58,8 @@ monitor:
 
 ## Monitoring <a name="monitoring"></a>
 
-Trino Gateway provides a metrics endpoint that uses the OpenMetrics format at
-`/metrics`. Use it to monitor Trino Gateway instances with Prometheus and
+Trino Gateway provides a metrics endpoint that uses the OpenMetrics format at
+`/metrics`. Use it to monitor Trino Gateway instances with Prometheus and
 other compatible systems with the following Prometheus configuration:
 
 ```yaml
@@ -70,6 +70,20 @@ scrape_configs:
 - gateway1.example.com:8080
 ```
 
+### Multi-instance deployments
+
+When running multiple Trino Gateway instances, enable the Valkey distributed
+cache to share query metadata across instances. This ensures consistent query
+routing regardless of which gateway instance receives the request.
+
+Monitor the distributed cache performance by checking:
+- Cache hit rate (target: 85-95%)
+- Cache errors (should be near 0)
+- Valkey server connectivity and memory usage
+
+See [Valkey distributed cache configuration](valkey-configuration.md) for
+setup instructions and monitoring details.
+
 ## Trino Gateway health endpoints
 
 Trino Gateway provides two API endpoints to indicate the current status of the server:

docs/valkey-configuration.md

Lines changed: 274 additions & 0 deletions
@@ -0,0 +1,274 @@
+# Valkey Distributed Cache Configuration
+
+## Overview
+
+Valkey distributed cache enables horizontal scaling of Trino Gateway by sharing query metadata across multiple gateway instances. When disabled, each gateway maintains its own local cache.
+
+## Quick Start (Minimal Configuration)
+
+```yaml
+valkeyConfiguration:
+  enabled: true
+  host: localhost
+  port: 6379
+  # password: ${VALKEY_PASSWORD}  # Optional: if AUTH required
+```
+
+**That's it!** Sensible defaults are provided for all other settings.
+
+---
21+
## Configuration Reference
22+
23+
### Basic Settings
24+
25+
| Parameter | Type | Default | Description |
26+
|-----------|------|---------|-------------|
27+
| `enabled` | boolean | `false` | Enable/disable distributed caching |
28+
| `host` | string | `localhost` | Valkey server hostname |
29+
| `port` | int | `6379` | Valkey server port |
30+
| `password` | string | `null` | Optional password for AUTH |
31+
| `database` | int | `0` | Database index (0-15) |
32+
33+
### Advanced Settings (Optional)
34+
35+
These settings have sensible defaults and should only be changed for specific performance tuning needs.
36+
37+
| Parameter | Type | Default | Description |
38+
|-----------|------|---------|-------------|
39+
| `maxTotal` | int | `20` | Maximum total connections in pool |
40+
| `maxIdle` | int | `10` | Maximum idle connections |
41+
| `minIdle` | int | `5` | Minimum idle connections |
42+
| `timeoutMs` | int | `2000` | Connection timeout in milliseconds |
43+
| `cacheTtlSeconds` | long | `1800` | Cache entry TTL (30 minutes) |
44+
| `healthCheckIntervalMs` | long | `30000` | Health check interval (30 seconds) |
45+
46+
---
+
+## Environment Variables
+
+Use environment variable substitution for sensitive values:
+
+```yaml
+valkeyConfiguration:
+  enabled: true
+  host: ${VALKEY_HOST:localhost}
+  port: ${VALKEY_PORT:6379}
+  password: ${VALKEY_PASSWORD}
+```
+
+---
+
+## Deployment Scenarios
+
+### Single Gateway Instance
+
+```yaml
+valkeyConfiguration:
+  enabled: false  # Not needed - local cache is sufficient
+```
+
+### Multiple Gateway Instances (Recommended)
+
+```yaml
+valkeyConfiguration:
+  enabled: true
+  host: valkey.internal.prod
+  port: 6379
+  password: ${VALKEY_PASSWORD}
+```
+
+### High-Traffic Production (Advanced Tuning)
+
+```yaml
+valkeyConfiguration:
+  enabled: true
+  host: valkey.internal.prod
+  port: 6379
+  password: ${VALKEY_PASSWORD}
+  maxTotal: 100           # More connections for high concurrency
+  maxIdle: 50
+  minIdle: 25
+  timeoutMs: 5000         # Longer timeout for slower networks
+  cacheTtlSeconds: 3600   # 1 hour for long-running queries
+```
+
+---
+
+## Connection Pool Sizing Guidelines
+
+| Deployment Size | Gateway Instances | Recommended `maxTotal` | Recommended `maxIdle` |
+|-----------------|-------------------|------------------------|----------------------|
+| Small | 1-2 | 20 (default) | 10 (default) |
+| Medium | 3-5 | 50 | 25 |
+| Large | 6-10 | 100 | 50 |
+| Enterprise | 11+ | 200 | 100 |
+
+**Formula:** `maxTotal = (number of gateways) × 10` is a good starting point.
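
The sizing rule above can be sketched as a small helper. This is only an illustration: `PoolSizing` and its methods are hypothetical names, not part of Trino Gateway, and the floor of 20 simply reuses the documented default for `maxTotal`. The "half of `maxTotal`" rule for `maxIdle` is read off the table rather than stated by the source.

```java
// Hypothetical helper (not part of Trino Gateway): applies the
// "gateways x 10" starting point and never goes below the default of 20.
final class PoolSizing {
    static final int DEFAULT_MAX_TOTAL = 20;

    static int recommendedMaxTotal(int gatewayInstances) {
        if (gatewayInstances < 1) {
            throw new IllegalArgumentException("gatewayInstances must be >= 1");
        }
        return Math.max(DEFAULT_MAX_TOTAL, gatewayInstances * 10);
    }

    // Every row of the sizing table sets maxIdle to half of maxTotal.
    static int recommendedMaxIdle(int maxTotal) {
        return maxTotal / 2;
    }

    public static void main(String[] args) {
        for (int gateways : new int[] {2, 5, 10, 20}) {
            int maxTotal = recommendedMaxTotal(gateways);
            System.out.printf("%d gateways -> maxTotal=%d, maxIdle=%d%n",
                    gateways, maxTotal, recommendedMaxIdle(maxTotal));
        }
    }
}
```

The formula matches the table at the upper end of each range (5 gateways give 50, 10 give 100); for the lower end of a range it yields a smaller pool than the table suggests, so treat either value as a starting point to tune.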
+
+---
+
+## Performance Tuning
+
+### Cache TTL (`cacheTtlSeconds`)
+
+- **Default (1800s / 30min):** Good for typical workloads
+- **Short-lived queries (<5min):** Use 600s (10min)
+- **Long-running queries (hours):** Use 3600s (1 hour) or more
+- **Interactive development:** Use 300s (5min)
+
+### Health Check Interval (`healthCheckIntervalMs`)
+
+- **Default (30000ms / 30s):** Balanced check frequency
+- **Unstable network:** Increase to 60000ms (1 min)
+- **Critical systems:** Decrease to 10000ms (10s)
+
+### Connection Timeouts (`timeoutMs`)
+
+- **Default (2000ms):** Good for local/same-datacenter Valkey
+- **Cross-region:** Increase to 5000ms
+- **High latency network:** Increase to 10000ms
+
+---
+
+## Monitoring
+
+Valkey cache exposes the following metrics (accessible via `ValkeyDistributedCache` instance):
+
+```java
+long hits = cache.getCacheHits();
+long misses = cache.getCacheMisses();
+long writes = cache.getCacheWrites();
+long errors = cache.getCacheErrors();
+double hitRate = cache.getCacheHitRate(); // Percentage
+```
+
+### Expected Metrics (Healthy System)
+
+- **Cache Hit Rate:** 85-95%
+- **Cache Errors:** 0 (or very low)
+- **Cache Writes:** ~Equal to query submission rate
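
As a rough illustration of how the hit rate relates to the raw counters, the sketch below assumes the percentage is `hits / (hits + misses) × 100`; the diff shown here does not include the body of `getCacheHitRate()`, so that definition is an assumption. `CacheStats` and the "watch" band between 70% and 85% are hypothetical, chosen only to bridge the 85-95% target and the <70% troubleshooting threshold.

```java
// Hypothetical helper (not part of Trino Gateway): derives a hit-rate
// percentage from hit/miss counters and compares it with the documented targets.
final class CacheStats {
    // Assumed definition: hits as a percentage of all lookups.
    static double hitRatePercent(long hits, long misses) {
        long lookups = hits + misses;
        if (lookups == 0) {
            return 0.0; // no lookups yet, avoid dividing by zero
        }
        return 100.0 * hits / lookups;
    }

    static String classify(double hitRatePercent) {
        if (hitRatePercent >= 85.0) {
            return "healthy";     // within or above the 85-95% target
        }
        if (hitRatePercent >= 70.0) {
            return "watch";       // assumed middle band, not from the source
        }
        return "investigate";     // below the 70% troubleshooting threshold
    }

    public static void main(String[] args) {
        double rate = hitRatePercent(900, 100);
        System.out.println(rate + "% -> " + classify(rate));
    }
}
```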
+
+### Troubleshooting
+
+**Low Hit Rate (<70%)**
+- Check TTL settings (may be too short)
+- Verify Valkey isn't evicting entries (check memory)
+- Check if multiple gateway versions deployed (cache key mismatch)
+
+**High Error Rate**
+- Check Valkey connectivity
+- Verify password/AUTH configuration
+- Review Valkey server logs
+
+**Connection Pool Exhaustion**
+- Increase `maxTotal` setting
+- Check for connection leaks (should be none with try-with-resources)
+
+---
+
+## Security Considerations
+
+### Production Deployment Checklist
+
+- [ ] **Enable AUTH:** Set `password` in configuration
+- [ ] **Use Environment Variables:** Don't hardcode passwords
+- [ ] **Network Security:** Deploy Valkey in private VPC/network
+- [ ] **Encryption at Rest:** Encrypt the volume that stores Valkey persistence files (Valkey does not encrypt RDB/AOF files itself)
+- [ ] **TLS/SSL:** (Future enhancement - not yet supported)
+- [ ] **Access Control:** Restrict Valkey port (6379) to gateway instances only
+
+### Example Production Setup
+
+```yaml
+# config.yaml
+valkeyConfiguration:
+  enabled: true
+  host: ${VALKEY_INTERNAL_HOST}
+  port: 6379
+  password: ${VALKEY_PASSWORD}
+```
+
+```bash
+# Environment variables (set in deployment)
+export VALKEY_INTERNAL_HOST=valkey.vpc.internal
+export VALKEY_PASSWORD=$(vault read -field=password secret/valkey)
+```
+
+---
+
+## Architecture
+
+### 3-Tier Caching
+
+```
+Request Flow:
+1. Check L1 (Local Guava Cache) → 10k entries, 30min TTL
+   ├─ Hit: Return immediately (~1ms)
+   └─ Miss: Continue to L2
+
+2. Check L2 (Valkey Distributed Cache) → Shared across gateways
+   ├─ Hit: Populate L1, return (~5ms)
+   └─ Miss: Continue to L3
+
+3. Check L3 (PostgreSQL Database) → Source of truth
+   ├─ Found: Populate L2 + L1, return (~50ms)
+   └─ Not Found: Search all backends via HTTP (~200ms)
+```
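
The request flow above is a read-through lookup with back-population of the faster tiers. A minimal sketch, assuming plain in-memory maps as stand-ins for the Guava cache, Valkey, and PostgreSQL; `TieredLookup` and `findBackend` are hypothetical names and do not reflect the routing manager code in this commit.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration of the 3-tier read path. Each map stands in
// for one tier; a real deployment would use Guava, Valkey, and PostgreSQL.
final class TieredLookup {
    final Map<String, String> l1 = new HashMap<>(); // local cache stand-in
    final Map<String, String> l2 = new HashMap<>(); // distributed cache stand-in
    final Map<String, String> l3 = new HashMap<>(); // database stand-in

    Optional<String> findBackend(String queryId) {
        String backend = l1.get(queryId);
        if (backend != null) {
            return Optional.of(backend);          // L1 hit: return immediately
        }
        backend = l2.get(queryId);
        if (backend != null) {
            l1.put(queryId, backend);             // L2 hit: populate L1
            return Optional.of(backend);
        }
        backend = l3.get(queryId);
        if (backend != null) {
            l2.put(queryId, backend);             // L3 hit: populate L2 and L1
            l1.put(queryId, backend);
            return Optional.of(backend);
        }
        return Optional.empty();                  // caller would search backends over HTTP
    }
}
```

After a single L3 hit, later lookups for the same query ID are served from L1 on this instance and from L2 on other instances, which is what lets any gateway route follow-up requests for a query it did not originally receive.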
+
+### Cache Keys
+
+```
+Backend:        trino:query:backend:{queryId}
+Routing Group:  trino:query:routinggroup:{queryId}
+External URL:   trino:query:externalurl:{queryId}
+```
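
The key layout above could be produced by a helper along these lines. `CacheKeys` is a hypothetical class written for illustration; only the three key formats come from the documentation, and the sample query ID is made up.

```java
// Hypothetical key builder mirroring the documented key formats.
final class CacheKeys {
    private static final String PREFIX = "trino:query:";

    static String backend(String queryId) {
        return PREFIX + "backend:" + queryId;
    }

    static String routingGroup(String queryId) {
        return PREFIX + "routinggroup:" + queryId;
    }

    static String externalUrl(String queryId) {
        return PREFIX + "externalurl:" + queryId;
    }

    public static void main(String[] args) {
        String queryId = "20240101_000000_00001_abcde"; // made-up query ID
        System.out.println(backend(queryId));
        System.out.println(routingGroup(queryId));
        System.out.println(externalUrl(queryId));
    }
}
```

Keeping the prefix in one place matters because gateways running different key formats would miss each other's entries, which is the "cache key mismatch" case listed under troubleshooting.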
+
+---
+
+## Migration Guide
+
+### From Single Gateway to Multi-Gateway
+
+1. **Deploy Valkey server** (standalone or cluster)
+2. **Update config.yaml** on all gateways:
+   ```yaml
+   valkeyConfiguration:
+     enabled: true
+     host: valkey.internal
+     port: 6379
+     password: ${VALKEY_PASSWORD}
+   ```
+3. **Restart gateways** (rolling restart recommended)
+4. **Monitor metrics** to verify cache hit rates
+
+No data migration needed - cache will populate automatically.
+
+---
+
+## FAQ
+
+**Q: Do I need Valkey if I only have one gateway?**
+A: No. Local Guava cache is sufficient for single-instance deployments.
+
+**Q: What happens if Valkey goes down?**
+A: Graceful degradation - queries continue working, falling back to database. Performance may degrade slightly.
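
One way to picture this degradation is a lookup that treats any failure of the distributed cache as a miss. The sketch below is an assumption about the general pattern, not the code in `ValkeyDistributedCache`; `FallbackLookup`, its `lookup` method, and the supplier parameters are hypothetical.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Hypothetical illustration: a distributed-cache failure is counted and
// then handled like a cache miss, so the database still answers the lookup.
final class FallbackLookup {
    final AtomicLong cacheErrors = new AtomicLong();

    Optional<String> lookup(Supplier<Optional<String>> distributedCache,
                            Supplier<Optional<String>> database) {
        try {
            Optional<String> cached = distributedCache.get();
            if (cached.isPresent()) {
                return cached;
            }
        } catch (RuntimeException e) {
            // Cache is unreachable or misbehaving: record it and fall through.
            cacheErrors.incrementAndGet();
        }
        return database.get();
    }
}
```

The error counter plays the role of the "Cache Errors" metric: it should stay near zero while the cache is healthy and rise when lookups are being served from the database instead.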
+
+**Q: Can I use Redis instead of Valkey?**
+A: Yes! Valkey is a Redis fork with compatible protocol. Just point to your Redis server.
+
+**Q: How much memory does Valkey need?**
+A: Rough estimate: `(queries per minute) × (average query lifetime in minutes) × 500 bytes`
+Example: 1000 q/min × 30 min × 500 bytes = ~15 MB
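
The estimate above is simple enough to check in code. `MemoryEstimate` is a hypothetical helper; the 500 bytes per query is the rough figure from the answer, not a measured value.

```java
// Hypothetical helper reproducing the rough memory estimate from the FAQ.
final class MemoryEstimate {
    static final long BYTES_PER_QUERY = 500; // rough figure quoted in the FAQ

    static long estimateBytes(long queriesPerMinute, long avgLifetimeMinutes) {
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        long liveQueries = Math.multiplyExact(queriesPerMinute, avgLifetimeMinutes);
        return Math.multiplyExact(liveQueries, BYTES_PER_QUERY);
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(1000, 30); // 15_000_000 bytes, about 15 MB
        System.out.println(bytes + " bytes (~" + bytes / 1_000_000 + " MB)");
    }
}
```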
+
+**Q: Can I clear the cache?**
+A: Yes, via Valkey CLI: `redis-cli -h <host> -a <password> FLUSHDB`
+Or selectively (`DEL` does not expand glob patterns, so scan for matching keys first): `redis-cli --scan --pattern 'trino:query:backend:*' | xargs redis-cli DEL`
+
+---
+
+## Support
+
+For issues or questions:
+- GitHub Issues: https://github.com/trinodb/trino-gateway/issues
+- Trino Community Slack: #trino-gateway channel

gateway-ha/config.yaml

Lines changed: 8 additions & 0 deletions
@@ -23,3 +23,11 @@ clusterStatsConfiguration:
 monitor:
   taskDelay: 1m
   clusterMetricsRegistryRefreshPeriod: 30s
+
+# Valkey distributed cache (optional - for multi-instance deployments)
+valkeyConfiguration:
+  enabled: false  # Set to true to enable distributed caching
+  host: localhost
+  port: 6379
+  # password: ${VALKEY_PASSWORD}  # Uncomment if Valkey requires AUTH
+  # cacheTtlSeconds: 1800  # Cache TTL in seconds (default: 1800 = 30 minutes)

gateway-ha/pom.xml

Lines changed: 6 additions & 0 deletions
@@ -189,6 +189,12 @@
             <version>${dep.trino.version}</version>
         </dependency>
 
+        <dependency>
+            <groupId>io.valkey</groupId>
+            <artifactId>valkey-java</artifactId>
+            <version>5.5.0</version>
+        </dependency>
+
         <dependency>
             <groupId>jakarta.annotation</groupId>
             <artifactId>jakarta.annotation-api</artifactId>
