Added valkey dependency #875

hpopuri2 · 2026-01-27T11:37:13Z

Add Valkey Distributed Cache for Horizontal Scaling

Summary

This PR implements distributed caching using Valkey (Redis-compatible) to enable horizontal scaling of Trino Gateway. Multiple gateway instances can now share query metadata through a distributed cache layer, ensuring consistent query routing across all instances.

Motivation

Currently, Trino Gateway uses local Guava caches that are not shared between instances. In multi-instance deployments, this can lead to:

Inconsistent query routing when requests hit different gateway instances
Cache misses requiring expensive database lookups
Inability to leverage cache across horizontally scaled deployments

This implementation addresses these limitations while maintaining backward compatibility and graceful degradation.

Architecture

3-Tier Caching Strategy

Request Flow:

L1 Cache (Local Guava) → ~1ms
├─ Hit: Return immediately
└─ Miss: Check L2
L2 Cache (Valkey Distributed) → ~5ms
├─ Hit: Populate L1, return
└─ Miss: Check L3
L3 Cache (PostgreSQL Database) → ~50ms
├─ Found: Populate L2 + L1, return
└─ Not Found: Search backends via HTTP (~200ms)
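
The lookup order above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `TieredLookup`, `findBackend`, and `isInLocal` are hypothetical names, and plain maps and a function stand in for Guava, Valkey, and PostgreSQL.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of the L1 -> L2 -> L3 lookup order; not the PR's actual classes.
final class TieredLookup
{
    private final Map<String, String> local = new ConcurrentHashMap<>(); // stands in for the Guava L1 cache
    private final Map<String, String> distributed;                       // stands in for Valkey (L2)
    private final Function<String, Optional<String>> database;           // stands in for the PostgreSQL lookup (L3)

    TieredLookup(Map<String, String> distributed, Function<String, Optional<String>> database)
    {
        this.distributed = distributed;
        this.database = database;
    }

    Optional<String> findBackend(String queryId)
    {
        String cached = local.get(queryId);
        if (cached != null) {
            return Optional.of(cached); // L1 hit: return immediately
        }
        String shared = distributed.get(queryId);
        if (shared != null) {
            local.put(queryId, shared); // L2 hit: populate L1
            return Optional.of(shared);
        }
        Optional<String> stored = database.apply(queryId);
        stored.ifPresent(backend -> {
            distributed.put(queryId, backend); // L3 found: populate L2 and L1
            local.put(queryId, backend);
        });
        return stored; // empty means the caller would search backends over HTTP
    }

    boolean isInLocal(String queryId)
    {
        return local.containsKey(queryId);
    }
}
```

In this sketch each miss falls through to the next, slower tier and each hit back-fills the faster tiers, so repeated lookups for the same query ID stay local.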

Cache Keys

trino:query:backend:{queryId} - Backend URL for query
trino:query:routinggroup:{queryId} - Routing group for query
trino:query:externalurl:{queryId} - External URL (lazy-loaded)(@VisibleForTesting)
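
As an illustration of the key layout, a small helper could build these keys in one place. The prefixes are copied from the list above; the class name `QueryCacheKeys` and its method names are assumptions made for this sketch.

```java
// Hypothetical helper; only the key prefixes come from the PR description.
final class QueryCacheKeys
{
    private static final String PREFIX = "trino:query:";

    private QueryCacheKeys() {}

    static String backend(String queryId)
    {
        return PREFIX + "backend:" + queryId;
    }

    static String routingGroup(String queryId)
    {
        return PREFIX + "routinggroup:" + queryId;
    }

    static String externalUrl(String queryId)
    {
        return PREFIX + "externalurl:" + queryId;
    }
}
```

Centralizing key construction like this keeps the three lookups from drifting apart if the prefix ever changes.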

Implementation Details

Core Components

ValkeyConfiguration (gateway-ha/src/main/java/io/trino/gateway/ha/config/ValkeyConfiguration.java)

11 configurable parameters with sensible defaults
Input validation (port range, positive values, health check intervals)
Convention over Configuration - only 3 params required (enabled, host, port)
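
A rough sketch of what setter-level validation with defaults might look like is shown below. Only `enabled: false` and the 1800-second TTL default are stated in this description; the `host` and `port` defaults, the class name `ValkeyConfigSketch`, and the exact error messages are assumptions.

```java
// Hypothetical sketch of defaults plus input validation; not the actual ValkeyConfiguration.
final class ValkeyConfigSketch
{
    private boolean enabled;              // disabled by default, per the description
    private String host = "localhost";    // assumed default
    private int port = 6379;              // assumed default
    private long cacheTtlSeconds = 1800;  // 30 minutes, per the description

    boolean isEnabled()
    {
        return enabled;
    }

    void setEnabled(boolean enabled)
    {
        this.enabled = enabled;
    }

    String getHost()
    {
        return host;
    }

    int getPort()
    {
        return port;
    }

    void setPort(int port)
    {
        // Port range check: valid TCP ports are 1..65535
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port must be between 1 and 65535, got " + port);
        }
        this.port = port;
    }

    long getCacheTtlSeconds()
    {
        return cacheTtlSeconds;
    }

    void setCacheTtlSeconds(long cacheTtlSeconds)
    {
        // Positive-value check
        if (cacheTtlSeconds <= 0) {
            throw new IllegalArgumentException("cacheTtlSeconds must be positive, got " + cacheTtlSeconds);
        }
        this.cacheTtlSeconds = cacheTtlSeconds;
    }
}
```

Rejecting bad values in the setters means a misconfigured `config.yaml` fails at startup rather than on the first cache operation.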

ValkeyDistributedCache (gateway-ha/src/main/java/io/trino/gateway/ha/router/ValkeyDistributedCache.java)

Implements DistributedCache interface
JedisPool connection pooling with configurable pool size
Modern Duration API (no deprecated methods)
Graceful degradation when disabled or unhealthy
Periodic health checks (PING) with configurable interval
Metrics tracking: hits, misses, writes, errors, hit rate

DistributedCache Interface (gateway-ha/src/main/java/io/trino/gateway/ha/router/DistributedCache.java)

Clean abstraction for cache operations (get, set, delete, isHealthy)
Enables future alternative implementations
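
For orientation, the abstraction might look something like the sketch below. The four operation names come from the list above; the parameter and return types, the `Sketch` suffixes, and the choice to report the no-op cache as unhealthy are assumptions, and the real `DistributedCache` and `NoopDistributedCache` may differ.

```java
import java.util.Optional;

// Hypothetical shape of the abstraction; the PR's real signatures may differ.
interface DistributedCacheSketch
{
    Optional<String> get(String key);

    void set(String key, String value);

    void delete(String key);

    boolean isHealthy();
}

// No-op variant in the spirit of the NoopDistributedCache used by the tests.
final class NoopCacheSketch
        implements DistributedCacheSketch
{
    @Override
    public Optional<String> get(String key)
    {
        return Optional.empty(); // never holds anything
    }

    @Override
    public void set(String key, String value) {}

    @Override
    public void delete(String key) {}

    @Override
    public boolean isHealthy()
    {
        return false; // assumption: a no-op cache reports itself as unavailable
    }
}
```

Callers that depend only on the interface can be handed the no-op variant in tests or single-instance deployments without any branching.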

Integration

BaseRoutingManager - Updated routing logic:

Write-through caching for backend and routing group (cache on write)
Lazy-loading for external URLs (cache on first read)
Cache key documentation with Javadoc
Graceful fallback to database when cache unavailable
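
The difference between the two caching styles can be sketched as below. This is illustrative only: `RoutingCacheSketch`, `recordBackend`, `externalUrl`, and the loader function are hypothetical and are not `BaseRoutingManager`'s API, and in-memory maps stand in for the cache and the database.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative only: names and signatures are assumptions, not BaseRoutingManager's API.
final class RoutingCacheSketch
{
    private final Map<String, String> cache = new ConcurrentHashMap<>();    // stand-in for the distributed cache
    private final Map<String, String> database = new ConcurrentHashMap<>(); // stand-in for persisted query history
    private final Function<String, Optional<String>> externalUrlLoader;     // stand-in for the database read

    RoutingCacheSketch(Function<String, Optional<String>> externalUrlLoader)
    {
        this.externalUrlLoader = externalUrlLoader;
    }

    // Write-through: the value is cached at the moment it is written.
    void recordBackend(String queryId, String backendUrl)
    {
        database.put(queryId, backendUrl);
        cache.put("trino:query:backend:" + queryId, backendUrl);
    }

    // Lazy-loading: the value is cached only when someone first reads it.
    Optional<String> externalUrl(String queryId)
    {
        String key = "trino:query:externalurl:" + queryId;
        String cached = cache.get(key);
        if (cached != null) {
            return Optional.of(cached);
        }
        Optional<String> loaded = externalUrlLoader.apply(queryId);
        loaded.ifPresent(url -> cache.put(key, url));
        return loaded;
    }

    boolean isCached(String key)
    {
        return cache.containsKey(key);
    }
}
```

Write-through suits values every gateway will need as soon as the query is routed, while lazy-loading avoids caching values that may never be requested.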

HaGatewayProviderModule - Dependency injection:

Provides DistributedCache singleton
Wires ValkeyConfiguration to ValkeyDistributedCache

Configuration

Minimal (Recommended for Getting Started)

```yaml
valkeyConfiguration:
  enabled: true
  host: localhost
  port: 6379
  # password: ${VALKEY_PASSWORD}  # Optional: if AUTH required
  # cacheTtlSeconds: 1800  # Optional: Cache TTL (default: 1800 = 30 minutes)
```

Advanced (Production Tuning)

valkeyConfiguration:
  enabled: true
  host: valkey.internal.prod
  port: 6379
  password: ${VALKEY_PASSWORD}
  database: 0
  maxTotal: 100              # More connections for high concurrency
  maxIdle: 50
  minIdle: 25
  timeoutMs: 5000            # Longer timeout for slower networks
  cacheTtlSeconds: 3600      # 1 hour for long-running queries
  healthCheckIntervalMs: 60000  # 1 minute health checks

Single Instance (No Changes Required)

valkeyConfiguration:
  enabled: false  # Default - local cache sufficient

Testing

Unit Tests (31 total, all passing)

TestValkeyConfiguration (16 tests)

Default values verification
Setter/getter correctness
Input validation (port range, positive values, etc.)
Edge cases (null password, various database indices)

TestValkeyDistributedCache (15 tests)

Disabled cache behavior (returns empty, fails gracefully)
Invalid host handling (marks unhealthy)
Close operations (safe cleanup)
All cache operations (get, set, delete with variations)
Null/empty array handling
Password and database configuration
Pool configuration acceptance

Integration Tests (existing tests updated)

All routing manager tests updated with distributed cache
NoopDistributedCache for tests not requiring real cache
No regression in existing functionality

Documentation

Comprehensive documentation added:

New File: docs/valkey-configuration.md (273 lines)

Quick start with minimal config
Full configuration reference
Deployment scenarios (single vs. multi-instance)
Performance tuning guidelines
Connection pool sizing recommendations
Monitoring and troubleshooting
Security checklist
Architecture explanation
Migration guide
FAQ

Updated Files:

docs/installation.md - Added "Configure distributed cache" section
docs/operation.md - Added "Multi-instance deployments" monitoring section
mkdocs.yml - Added navigation entry for Valkey docs

Backward Compatibility

✅ Fully backward compatible

Disabled by default (enabled: false)
No changes required to existing configs
Single-instance deployments work exactly as before
Existing tests pass unchanged apart from constructor updates to accept the distributed cache

Migration Path

From Single to Multi-Gateway

Deploy Valkey server
Update config.yaml on all gateways:
valkeyConfiguration:
  enabled: true
  host: valkey.internal
  port: 6379
  password: ${VALKEY_PASSWORD}
Rolling restart gateways
Monitor cache hit rates

No data migration needed - cache populates automatically.

Graceful Degradation

When Valkey is unavailable:

✅ Queries continue working
✅ Falls back to database lookups
✅ Logs warnings (not errors)
✅ Marks cache as unhealthy
✅ Auto-recovery when Valkey returns
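
The degradation behavior listed above could be approximated as in the following sketch. It is not the PR's `ValkeyDistributedCache`: the class `DegradingCacheSketch`, its constructor arguments, and the log message are hypothetical, and a function and a boolean supplier stand in for the Valkey `GET` and `PING` calls.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;
import java.util.function.Function;
import java.util.logging.Logger;

// Hypothetical sketch of graceful degradation; not the PR's ValkeyDistributedCache.
final class DegradingCacheSketch
{
    private static final Logger LOG = Logger.getLogger(DegradingCacheSketch.class.getName());

    private final Function<String, Optional<String>> remoteGet; // stand-in for a Valkey GET; may throw
    private final BooleanSupplier ping;                         // stand-in for a Valkey PING
    private final AtomicBoolean healthy = new AtomicBoolean(true);

    DegradingCacheSketch(Function<String, Optional<String>> remoteGet, BooleanSupplier ping)
    {
        this.remoteGet = remoteGet;
        this.ping = ping;
    }

    Optional<String> get(String key)
    {
        if (!healthy.get()) {
            return Optional.empty(); // skip the remote call; the caller falls back to the database
        }
        try {
            return remoteGet.apply(key);
        }
        catch (RuntimeException e) {
            healthy.set(false);
            LOG.warning("Cache read failed, falling back to database: " + e.getMessage());
            return Optional.empty();
        }
    }

    // Intended to be called periodically; a successful ping restores the healthy flag.
    void healthCheck()
    {
        boolean ok;
        try {
            ok = ping.getAsBoolean();
        }
        catch (RuntimeException e) {
            ok = false;
        }
        healthy.set(ok);
    }

    boolean isHealthy()
    {
        return healthy.get();
    }
}
```

Because a failed read looks like an ordinary miss to the caller, queries keep flowing through the database path until the next successful health check.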

Monitoring

Cache metrics available via ValkeyDistributedCache:

getCacheHits() - Total cache hits
getCacheMisses() - Total cache misses
getCacheWrites() - Total successful writes
getCacheErrors() - Total operation errors
getCacheHitRate() - Hit rate percentage (0-100)
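
A minimal sketch of how the hit rate could be derived from the counters follows. The getter name mirrors the list above; the class name `CacheMetricsSketch`, the `record*` methods, and the choice to report 0 when no lookups have happened yet are assumptions.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical counters; only the getter names mirror the PR description.
final class CacheMetricsSketch
{
    private final AtomicLong hits = new AtomicLong();
    private final AtomicLong misses = new AtomicLong();

    void recordHit()
    {
        hits.incrementAndGet();
    }

    void recordMiss()
    {
        misses.incrementAndGet();
    }

    long getCacheHits()
    {
        return hits.get();
    }

    long getCacheMisses()
    {
        return misses.get();
    }

    // Hit rate as a percentage in the range 0-100.
    double getCacheHitRate()
    {
        long hitCount = hits.get();
        long total = hitCount + misses.get();
        // Assumption: report 0 rather than NaN before any lookup has happened
        return total == 0 ? 0.0 : 100.0 * hitCount / total;
    }
}
```

With three hits and one miss recorded, `getCacheHitRate()` in this sketch returns 75.0.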

Future work: Expose these via /metrics endpoint for Prometheus.

Dependencies

Added: io.valkey:valkey-java:5.5.0

Valkey is a Redis fork with a compatible protocol
Works with both Valkey and Redis servers
Apache 2.0 licensed

Files Changed

New Files (7)

gateway-ha/src/main/java/io/trino/gateway/ha/config/ValkeyConfiguration.java
gateway-ha/src/main/java/io/trino/gateway/ha/router/DistributedCache.java
gateway-ha/src/main/java/io/trino/gateway/ha/router/ValkeyDistributedCache.java
gateway-ha/src/test/java/io/trino/gateway/ha/config/TestValkeyConfiguration.java
gateway-ha/src/test/java/io/trino/gateway/ha/router/NoopDistributedCache.java
gateway-ha/src/test/java/io/trino/gateway/ha/router/TestValkeyDistributedCache.java
docs/valkey-configuration.md

Modified Files (16)

Configuration: gateway-ha/config.yaml, docs/config.yaml
Core: HaGatewayConfiguration.java, HaGatewayProviderModule.java, BaseRoutingManager.java
Routing: QueryCountBasedRouter.java, StochasticRoutingManager.java, ProxyRequestHandler.java
Tests: 4 routing manager tests updated
Docs: installation.md, operation.md, mkdocs.yml

Checklist

Code follows project style guidelines
No deprecated APIs used (Duration instead of milliseconds)
Comprehensive input validation
Unit tests written and passing (31 tests)
Integration tests updated and passing
Documentation complete and accurate
Configuration examples provided
Backward compatible (disabled by default)
Graceful degradation implemented
Security considerations documented
Performance tuning guidelines included

Future Enhancements

Expose cache metrics via /metrics OpenMetrics endpoint
Add TLS/SSL support for Valkey connections
Support Redis Cluster mode
Add cache warming on startup
Implement cache eviction strategies
Add circuit breaker pattern for cache failures

Testing Instructions

Local Testing (Single Instance)

No changes needed - works as before

java -jar gateway-ha.jar config.yaml

Multi-Instance with Valkey

Start Valkey

docker run -d -p 6379:6379 valkey/valkey:latest

Update config.yaml

valkeyConfiguration:
  enabled: true
  host: localhost
  port: 6379

Start multiple gateways

java -jar gateway-ha.jar config.yaml

Verify Cache Working

Check logs for:

"Valkey distributed cache initialized: localhost:6379"

"Valkey health check passed"

Submit queries and verify cache hits increase

Implement distributed caching using Valkey (Redis-compatible) to enable horizontal scaling of Trino Gateway across multiple instances. This allows query metadata to be shared between gateway instances, ensuring consistent routing regardless of which instance receives a request.

Key features:
- 3-tier caching architecture: L1 (Guava local) → L2 (Valkey distributed) → L3 (PostgreSQL)
- Graceful degradation when Valkey unavailable (falls back to database)
- Configurable health checks and connection pooling
- Cache metrics (hits, misses, writes, errors, hit rate)
- Write-through caching for backend and routing group lookups
- Lazy-loading for external URL lookups
- Convention over Configuration with sensible defaults

Implementation:
- Add ValkeyConfiguration with 11 configurable parameters (minimal 3 required)
- Create DistributedCache interface and ValkeyDistributedCache implementation
- Integrate distributed cache into BaseRoutingManager routing logic
- Use modern Duration API (no deprecated methods)
- Add comprehensive input validation and error handling
- Include 31 unit tests (16 config + 15 cache tests)

Configuration:
valkeyConfiguration:
  enabled: true
  host: valkey.internal
  port: 6379
  password: ${VALKEY_PASSWORD}

Documentation includes:
- Quick start guide with minimal configuration
- Full configuration reference with tuning guidelines
- Deployment scenarios (single vs. multi-instance)
- Performance tuning recommendations
- Security best practices
- Architecture documentation and troubleshooting

Single-instance deployments don't need distributed caching - local Guava cache is sufficient. Multi-instance deployments benefit from shared cache for consistent query routing.

…endency # Conflicts: # docs/config.yaml # docs/installation.md # docs/valkey-configuration.md # gateway-ha/config.yaml

After merge with main branch, UserConfiguration and ApiAuthenticator imports are no longer used due to code refactoring. Removing them to fix checkstyle violations.
- Remove unused import: io.trino.gateway.ha.config.UserConfiguration
- Remove unused import: io.trino.gateway.ha.security.ApiAuthenticator

After the merge, the code uses ImmutableList, MonitorConfiguration, and List, but their imports were missing, causing compilation failures.
- Add import: com.google.common.collect.ImmutableList (for getClusterStatsObservers)
- Add import: io.trino.gateway.ha.config.MonitorConfiguration (for getMonitorConfiguration)
- Add import: java.util.List (for method return type)

…endency

hpopuri2 added 3 commits January 27, 2026 16:33

Merge remote-tracking branch 'origin/valkeydependency' into valkeydep…

d7d65f0

…endency # Conflicts: # docs/config.yaml # docs/installation.md # docs/valkey-configuration.md # gateway-ha/config.yaml

cla-bot bot added the cla-signed label Jan 27, 2026

hpopuri2 added 5 commits January 27, 2026 17:09

Merge branch 'main' into valkeydependency

eb5d034

changes for build

11a3878

Merge remote-tracking branch 'origin/valkeydependency' into valkeydep…

db5df0c

…endency

hpopuri2 closed this Jan 27, 2026
