Closed
8 changes: 8 additions & 0 deletions docs/config.yaml
@@ -11,3 +11,11 @@ dataStore:

clusterStatsConfiguration:
  monitorType: INFO_API

# Valkey distributed cache (optional - for multi-instance deployments)
valkeyConfiguration:
  enabled: false
  host: localhost
  port: 6379
  # password: ${ENV:VALKEY_PASSWORD}  # Uncomment if Valkey requires AUTH
  # cacheTtlSeconds: 1800             # Cache TTL in seconds (default: 1800 = 30 minutes)
26 changes: 26 additions & 0 deletions docs/installation.md
@@ -161,6 +161,32 @@ For additional configurations, use the `log.*` properties from the
[Trino logging properties documentation](https://trino.io/docs/current/admin/properties-logging.html) and specify
the properties in `serverConfig`.

### Configure distributed cache (optional)

For multi-instance deployments, Trino Gateway supports distributed caching
using Valkey (or Redis) to share query metadata across gateway instances.
This improves query routing and enables horizontal scaling.

For single gateway deployments, distributed caching is not needed - the
local cache is sufficient.

```yaml
valkeyConfiguration:
  enabled: true
  host: valkey.internal.prod
  port: 6379
  password: ${ENV:VALKEY_PASSWORD}
  cacheTtlSeconds: 1800  # Cache TTL (default: 1800 = 30 minutes)
```

**Optional parameters**: You can customize `cacheTtlSeconds` based on your query duration:
- Short queries (< 5 min): 600 seconds (10 minutes)
- Default queries: 1800 seconds (30 minutes)
- Long-running queries: 3600 seconds (1 hour)

See [Valkey distributed cache configuration](valkey-configuration.md) for
detailed configuration options, deployment scenarios, and performance tuning.

### Proxying additional paths

By default, Trino Gateway only proxies requests to paths starting with
18 changes: 16 additions & 2 deletions docs/operation.md
@@ -58,8 +58,8 @@ monitor:

## Monitoring <a name="monitoring"></a>

Trino Gateway provides a metrics endpoint that uses the OpenMetrics format at
`/metrics`. Use it to monitor Trino Gateway instances with Prometheus and
other compatible systems with the following Prometheus configuration:

```yaml
@@ -70,6 +70,20 @@ scrape_configs:
- gateway1.example.com:8080
```

### Multi-instance deployments

When running multiple Trino Gateway instances, enable the Valkey distributed
cache to share query metadata across instances. This ensures consistent query
routing regardless of which gateway instance receives the request.

Monitor the distributed cache performance by checking:
- Cache hit rate (target: 85-95%)
- Cache errors (should be near 0)
- Valkey server connectivity and memory usage

See [Valkey distributed cache configuration](valkey-configuration.md) for
setup instructions and monitoring details.

## Trino Gateway health endpoints

Trino Gateway provides two API endpoints to indicate the current status of the server:
274 changes: 274 additions & 0 deletions docs/valkey-configuration.md
@@ -0,0 +1,274 @@
# Valkey Distributed Cache Configuration

## Overview

The Valkey distributed cache enables horizontal scaling of Trino Gateway by sharing query metadata across multiple gateway instances. When it is disabled, each gateway maintains only its own local cache.

## Quick Start (Minimal Configuration)

```yaml
valkeyConfiguration:
  enabled: true
  host: localhost
  port: 6379
  # password: ${ENV:VALKEY_PASSWORD}  # Optional: if AUTH required
```

**That's it!** Sensible defaults are provided for all other settings.

---

## Configuration Reference

### Basic Settings

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `enabled` | boolean | `false` | Enable/disable distributed caching |
| `host` | string | `localhost` | Valkey server hostname |
| `port` | int | `6379` | Valkey server port |
| `password` | string | `null` | Optional password for AUTH |
| `database` | int | `0` | Database index (0-15) |

### Advanced Settings (Optional)

These settings have sensible defaults and should only be changed for specific performance tuning needs.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `maxTotal` | int | `20` | Maximum total connections in pool |
| `maxIdle` | int | `10` | Maximum idle connections |
| `minIdle` | int | `5` | Minimum idle connections |
| `timeoutMs` | int | `2000` | Connection timeout in milliseconds |
| `cacheTtlSeconds` | long | `1800` | Cache entry TTL (30 minutes) |
| `healthCheckIntervalMs` | long | `30000` | Health check interval (30 seconds) |
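
The defaults in the two tables above can be captured in a small value type. The following is a minimal sketch, not the gateway's actual configuration class: the name `ValkeySettings` and the validation rules (for example, `minIdle <= maxIdle <= maxTotal`) are assumptions made for illustration.

```java
// Hypothetical holder for the documented settings. The record name and the
// validation rules are assumptions, not the gateway's real configuration class.
record ValkeySettings(
        boolean enabled, String host, int port, String password, int database,
        int maxTotal, int maxIdle, int minIdle, int timeoutMs,
        long cacheTtlSeconds, long healthCheckIntervalMs) {

    // Compact constructor: reject values that cannot describe a usable pool.
    ValkeySettings {
        if (port < 1 || port > 65_535) {
            throw new IllegalArgumentException("port out of range: " + port);
        }
        if (minIdle < 0 || minIdle > maxIdle || maxIdle > maxTotal) {
            throw new IllegalArgumentException("expected 0 <= minIdle <= maxIdle <= maxTotal");
        }
        if (timeoutMs <= 0 || cacheTtlSeconds <= 0 || healthCheckIntervalMs <= 0) {
            throw new IllegalArgumentException("timeout, TTL, and interval must be positive");
        }
    }

    // Defaults copied from the Basic and Advanced settings tables.
    static ValkeySettings defaults() {
        return new ValkeySettings(false, "localhost", 6379, null, 0,
                20, 10, 5, 2000, 1800L, 30_000L);
    }
}
```

Calling `ValkeySettings.defaults()` yields the documented defaults, while an inconsistent pool such as `maxIdle` larger than `maxTotal` is rejected at construction time.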

---

## Environment Variables

Use environment variable substitution for sensitive values:

```yaml
valkeyConfiguration:
  enabled: true
  host: ${ENV:VALKEY_HOST}
  port: ${ENV:VALKEY_PORT}
  password: ${ENV:VALKEY_PASSWORD}
```

---

## Deployment Scenarios

### Single Gateway Instance

```yaml
valkeyConfiguration:
  enabled: false  # Not needed - local cache is sufficient
```

### Multiple Gateway Instances (Recommended)

```yaml
valkeyConfiguration:
  enabled: true
  host: valkey.internal.prod
  port: 6379
  password: ${ENV:VALKEY_PASSWORD}
```

### High-Traffic Production (Advanced Tuning)

```yaml
valkeyConfiguration:
  enabled: true
  host: valkey.internal.prod
  port: 6379
  password: ${ENV:VALKEY_PASSWORD}
  maxTotal: 100          # More connections for high concurrency
  maxIdle: 50
  minIdle: 25
  timeoutMs: 5000        # Longer timeout for slower networks
  cacheTtlSeconds: 3600  # 1 hour for long-running queries
```

---

## Connection Pool Sizing Guidelines

| Deployment Size | Gateway Instances | Recommended `maxTotal` | Recommended `maxIdle` |
|-----------------|-------------------|------------------------|-----------------------|
| Small | 1-2 | 20 (default) | 10 (default) |
| Medium | 3-5 | 50 | 25 |
| Large | 6-10 | 100 | 50 |
| Enterprise | 11+ | 200 | 100 |

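**Formula:** `maxTotal = (number of gateways) × 10` is a good starting point, with the default of 20 as a lower bound. The pool is configured per gateway, so the Valkey server can see up to `maxTotal × (number of gateways)` connections in total.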
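
The sizing rule can be written out as a short helper. This is a sketch under stated assumptions: the class and method names are hypothetical, the floor of 20 and the `maxIdle = maxTotal / 2` ratio are read off the table above, and the worst-case connection count assumes every gateway fills its own pool at the same time.

```java
// Hypothetical helper that encodes the sizing guideline; not part of the gateway.
final class PoolSizing {
    private PoolSizing() {}

    // Ten connections per gateway instance, never below the documented default of 20.
    static int recommendedMaxTotal(int gatewayInstances) {
        if (gatewayInstances < 1) {
            throw new IllegalArgumentException("need at least one gateway instance");
        }
        return Math.max(20, Math.multiplyExact(gatewayInstances, 10));
    }

    // The table pairs each maxTotal with a maxIdle of half that value.
    static int recommendedMaxIdle(int maxTotal) {
        return maxTotal / 2;
    }

    // Each gateway owns its own pool, so the server may see instances * maxTotal connections.
    static long worstCaseServerConnections(int gatewayInstances, int maxTotal) {
        return (long) gatewayInstances * maxTotal;
    }

    public static void main(String[] args) {
        int gateways = 5;
        int maxTotal = recommendedMaxTotal(gateways);
        System.out.println("maxTotal=" + maxTotal
                + ", maxIdle=" + recommendedMaxIdle(maxTotal)
                + ", worstCaseServerConnections=" + worstCaseServerConnections(gateways, maxTotal));
    }
}
```

For five gateways this gives `maxTotal=50`, `maxIdle=25`, and up to 250 connections on the Valkey side, which is worth comparing against the server's connection limit before raising pool sizes further.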

---

## Performance Tuning

### Cache TTL (`cacheTtlSeconds`)

- **Default (1800s / 30min):** Good for typical workloads
- **Short-lived queries (<5min):** Use 600s (10min)
- **Long-running queries (hours):** Use 3600s (1 hour) or more
- **Interactive development:** Use 300s (5min)

### Health Check Interval (`healthCheckIntervalMs`)

- **Default (30000ms / 30s):** Balanced check frequency
- **Unstable network:** Increase to 60000ms (1 min)
- **Critical systems:** Decrease to 10000ms (10s)

### Connection Timeouts (`timeoutMs`)

- **Default (2000ms):** Good for local/same-datacenter Valkey
- **Cross-region:** Increase to 5000ms
- **High latency network:** Increase to 10000ms

---

## Monitoring

The Valkey cache exposes the following metrics, accessible through a `ValkeyDistributedCache` instance:

```java
long hits = cache.getCacheHits();
long misses = cache.getCacheMisses();
long writes = cache.getCacheWrites();
long errors = cache.getCacheErrors();
double hitRate = cache.getCacheHitRate(); // Percentage
```

### Expected Metrics (Healthy System)

- **Cache Hit Rate:** 85-95%
- **Cache Errors:** 0 (or very low)
- **Cache Writes:** ~Equal to query submission rate
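
As a rough illustration of how the hit-rate targets might be checked, the sketch below derives a percentage from raw hit and miss counters and classifies it. The class, the enum, and the formula `hits / (hits + misses)` are assumptions; how `getCacheHitRate()` is actually computed is not shown in this document. The 85% and 70% thresholds come from the healthy range above and the low-hit-rate threshold in the troubleshooting notes.

```java
// Hypothetical classifier for cache hit rates; thresholds taken from this document.
final class CacheHealth {
    private CacheHealth() {}

    enum Status { HEALTHY, BELOW_TARGET, LOW }

    // Hit rate as a percentage of all lookups; 0 when nothing has been looked up yet.
    static double hitRatePercent(long hits, long misses) {
        long lookups = hits + misses;
        return lookups == 0 ? 0.0 : 100.0 * hits / lookups;
    }

    // 85% and above meets the target; below 70% calls for troubleshooting.
    static Status classify(double hitRatePercent) {
        if (hitRatePercent >= 85.0) {
            return Status.HEALTHY;
        }
        if (hitRatePercent >= 70.0) {
            return Status.BELOW_TARGET;
        }
        return Status.LOW;
    }
}
```

With 900 hits and 100 misses the rate is 90%, which this sketch classifies as `HEALTHY`.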

### Troubleshooting

**Low Hit Rate (<70%)**
- Check TTL settings (may be too short)
- Verify Valkey isn't evicting entries (check memory)
- Check whether multiple gateway versions are deployed (cache key mismatch)

**High Error Rate**
- Check Valkey connectivity
- Verify password/AUTH configuration
- Review Valkey server logs

**Connection Pool Exhaustion**
- Increase `maxTotal` setting
- Check for connection leaks (should be none with try-with-resources)

---

## Security Considerations

### Production Deployment Checklist

- [ ] **Enable AUTH:** Set `password` in configuration
- [ ] **Use Environment Variables:** Don't hardcode passwords
- [ ] **Network Security:** Deploy Valkey in private VPC/network
- [ ] **Encryption at Rest:** Encrypt the disk or volume that stores Valkey persistence files
- [ ] **TLS/SSL:** (Future enhancement - not yet supported)
- [ ] **Access Control:** Restrict Valkey port (6379) to gateway instances only

### Example Production Setup

```yaml
# config.yaml
valkeyConfiguration:
  enabled: true
  host: ${ENV:VALKEY_INTERNAL_HOST}
  port: 6379
  password: ${ENV:VALKEY_PASSWORD}
```

```bash
# Environment variables (set in deployment)
export VALKEY_INTERNAL_HOST=valkey.vpc.internal
export VALKEY_PASSWORD=$(vault read -field=password secret/valkey)
```

---

## Architecture

### 3-Tier Caching

```
Request Flow:
1. Check L1 (Local Guava Cache) → 10k entries, 30min TTL
   ├─ Hit: Return immediately (~1ms)
   └─ Miss: Continue to L2

2. Check L2 (Valkey Distributed Cache) → Shared across gateways
   ├─ Hit: Populate L1, return (~5ms)
   └─ Miss: Continue to L3

3. Check L3 (PostgreSQL Database) → Source of truth
   ├─ Found: Populate L2 + L1, return (~50ms)
   └─ Not Found: Search all backends via HTTP (~200ms)
```
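
The request flow above can be sketched as a read-through lookup across three tiers. This is an illustration only: `Tier`, `MapTier`, and `TieredLookup` are hypothetical names, all three tiers are in-memory stand-ins rather than Guava, Valkey, or PostgreSQL, and treating an L2 failure as a miss is an assumption that mirrors the graceful degradation described in the FAQ.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical tier abstraction; the gateway's real classes are not shown here.
interface Tier {
    Optional<String> get(String queryId);

    void put(String queryId, String backend);
}

// In-memory stand-in used for every tier in this sketch.
final class MapTier implements Tier {
    private final Map<String, String> entries = new ConcurrentHashMap<>();

    @Override
    public Optional<String> get(String queryId) {
        return Optional.ofNullable(entries.get(queryId));
    }

    @Override
    public void put(String queryId, String backend) {
        entries.put(queryId, backend);
    }
}

final class TieredLookup {
    private final Tier local;    // L1: per-gateway cache
    private final Tier shared;   // L2: distributed cache
    private final Tier database; // L3: source of truth

    TieredLookup(Tier local, Tier shared, Tier database) {
        this.local = local;
        this.shared = shared;
        this.database = database;
    }

    Optional<String> findBackend(String queryId) {
        Optional<String> hit = local.get(queryId);
        if (hit.isPresent()) {
            return hit;
        }
        hit = safeGet(shared, queryId);
        if (hit.isPresent()) {
            local.put(queryId, hit.get()); // populate L1 on an L2 hit
            return hit;
        }
        hit = database.get(queryId);
        if (hit.isPresent()) {
            safePut(shared, queryId, hit.get()); // populate L2, then L1
            local.put(queryId, hit.get());
        }
        return hit; // empty: the caller would search the backends directly
    }

    // An unreachable shared cache is treated as a miss instead of failing the lookup.
    private static Optional<String> safeGet(Tier tier, String queryId) {
        try {
            return tier.get(queryId);
        } catch (RuntimeException e) {
            return Optional.empty();
        }
    }

    private static void safePut(Tier tier, String queryId, String backend) {
        try {
            tier.put(queryId, backend);
        } catch (RuntimeException e) {
            // Ignore write failures; the entry is still served from L3.
        }
    }
}
```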

### Cache Keys

```
Backend: trino:query:backend:{queryId}
Routing Group: trino:query:routinggroup:{queryId}
External URL: trino:query:externalurl:{queryId}
```
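
A small helper keeps the three key formats in one place. The class and method names below are hypothetical; only the key layouts themselves are taken from the listing above.

```java
// Hypothetical key builder; only the key formats come from the documentation.
final class CacheKeys {
    private static final String PREFIX = "trino:query:";

    private CacheKeys() {}

    static String backend(String queryId) {
        return PREFIX + "backend:" + queryId;
    }

    static String routingGroup(String queryId) {
        return PREFIX + "routinggroup:" + queryId;
    }

    static String externalUrl(String queryId) {
        return PREFIX + "externalurl:" + queryId;
    }
}
```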

---

## Migration Guide

### From Single Gateway to Multi-Gateway

1. **Deploy Valkey server** (standalone or cluster)
2. **Update config.yaml** on all gateways:
   ```yaml
   valkeyConfiguration:
     enabled: true
     host: valkey.internal
     port: 6379
     password: ${ENV:VALKEY_PASSWORD}
   ```
3. **Restart gateways** (rolling restart recommended)
4. **Monitor metrics** to verify cache hit rates

No data migration needed - cache will populate automatically.

---

## FAQ

**Q: Do I need Valkey if I only have one gateway?**
A: No. Local Guava cache is sufficient for single-instance deployments.

**Q: What happens if Valkey goes down?**
A: The gateway degrades gracefully: queries continue to work, with lookups falling back to the database. Performance may degrade slightly.

**Q: Can I use Redis instead of Valkey?**
A: Yes. Valkey is a fork of Redis and speaks a compatible protocol, so you can point the configuration at your Redis server.

**Q: How much memory does Valkey need?**
A: Rough estimate: `(queries per minute) × (average query lifetime in minutes) × 500 bytes`
Example: 1000 q/min × 30 min × 500 bytes = ~15 MB
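
The estimate is simple enough to check in code. This sketch only restates the rough formula above; the class and method names are hypothetical, and the 500 bytes per query is the document's rule of thumb rather than a measured value.

```java
// Hypothetical calculator for the rough memory estimate given in this FAQ.
final class MemoryEstimate {
    private MemoryEstimate() {}

    // queries per minute * average lifetime in minutes * bytes per query
    static long estimateBytes(long queriesPerMinute, long avgLifetimeMinutes, long bytesPerQuery) {
        return Math.multiplyExact(
                Math.multiplyExact(queriesPerMinute, avgLifetimeMinutes), bytesPerQuery);
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(1000, 30, 500);
        System.out.println(bytes + " bytes"); // prints "15000000 bytes", about 15 MB
    }
}
```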

**Q: Can I clear the cache?**
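A: Yes. To clear the whole database, run `redis-cli -h <host> -a <password> FLUSHDB` (`valkey-cli` accepts the same arguments).
`DEL` does not expand glob patterns, so to delete selectively, list the matching keys with `--scan` first: `redis-cli --scan --pattern 'trino:query:backend:*' | xargs redis-cli DEL`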

---

## Support

For issues or questions:
- GitHub Issues: https://github.com/trinodb/trino-gateway/issues
- Trino Community Slack: #trino-gateway channel
8 changes: 8 additions & 0 deletions gateway-ha/config.yaml
@@ -23,3 +23,11 @@ clusterStatsConfiguration:
monitor:
  taskDelay: 1m
  clusterMetricsRegistryRefreshPeriod: 30s

# Valkey distributed cache (optional - for multi-instance deployments)
valkeyConfiguration:
  enabled: false  # Set to true to enable distributed caching
  host: localhost
  port: 6379
  # password: ${ENV:VALKEY_PASSWORD}  # Uncomment if Valkey requires AUTH
  # cacheTtlSeconds: 1800             # Cache TTL in seconds (default: 1800 = 30 minutes)
6 changes: 6 additions & 0 deletions gateway-ha/pom.xml
@@ -189,6 +189,12 @@
<version>${dep.trino.version}</version>
</dependency>

<dependency>
<groupId>io.valkey</groupId>
<artifactId>valkey-java</artifactId>
<version>5.5.0</version>
</dependency>

<dependency>
<groupId>jakarta.annotation</groupId>
<artifactId>jakarta.annotation-api</artifactId>