Skip to content

Conversation

@hpopuri2
Copy link

Add Valkey Distributed Cache for Horizontal Scaling

Summary

This PR implements distributed caching using Valkey (Redis-compatible) to enable horizontal scaling of Trino Gateway. Multiple gateway instances can now share query metadata through a distributed cache layer, ensuring consistent query routing across all
instances.

Motivation

Currently, Trino Gateway uses local Guava caches that are not shared between instances. In multi-instance deployments, this can lead to:

  • Inconsistent query routing when requests hit different gateway instances
  • Cache misses requiring expensive database lookups
  • Inability to leverage cache across horizontally scaled deployments

This implementation addresses these limitations while maintaining backward compatibility and graceful degradation.

Architecture

3-Tier Caching Strategy

Request Flow:

  1. L1 Cache (Local Guava) → ~1ms
    ├─ Hit: Return immediately
    └─ Miss: Check L2
  2. L2 Cache (Valkey Distributed) → ~5ms
    ├─ Hit: Populate L1, return
    └─ Miss: Check L3
  3. L3 Cache (PostgreSQL Database) → 50ms
    ├─ Found: Populate L2 + L1, return
    └─ Not Found: Search backends via HTTP (200ms)

Cache Keys

  • trino:query:backend:{queryId} - Backend URL for query
  • trino:query:routinggroup:{queryId} - Routing group for query
  • trino:query:externalurl:{queryId} - External URL (lazy-loaded)(@VisibleForTesting)

Implementation Details

Core Components

ValkeyConfiguration (gateway-ha/src/main/java/io/trino/gateway/ha/config/ValkeyConfiguration.java)

  • 11 configurable parameters with sensible defaults
  • Input validation (port range, positive values, health check intervals)
  • Convention over Configuration - only 3 params required (enabled, host, port)

ValkeyDistributedCache (gateway-ha/src/main/java/io/trino/gateway/ha/router/ValkeyDistributedCache.java)

  • Implements DistributedCache interface
  • JedisPool connection pooling with configurable pool size
  • Modern Duration API (no deprecated methods)
  • Graceful degradation when disabled or unhealthy
  • Periodic health checks (PING) with configurable interval
  • Metrics tracking: hits, misses, writes, errors, hit rate

DistributedCache Interface (gateway-ha/src/main/java/io/trino/gateway/ha/router/DistributedCache.java)

  • Clean abstraction for cache operations (get, set, delete, isHealthy)
  • Enables future alternative implementations

Integration

BaseRoutingManager - Updated routing logic:

  • Write-through caching for backend and routing group (cache on write)
  • Lazy-loading for external URLs (cache on first read)
  • Cache key documentation with Javadoc
  • Graceful fallback to database when cache unavailable

HaGatewayProviderModule - Dependency injection:

  • Provides DistributedCache singleton
  • Wires ValkeyConfiguration to ValkeyDistributedCache

Configuration

Minimal (Recommended for Getting Started)

  ```yaml
  valkeyConfiguration:
    enabled: true
    host: localhost
    port: 6379
    # password: ${VALKEY_PASSWORD}  # Optional: if AUTH required
    # cacheTtlSeconds: 1800  # Optional: Cache TTL (default: 1800 = 30 minutes)

      

Advanced (Production Tuning)

valkeyConfiguration:
  enabled: true
  host: valkey.internal.prod
  port: 6379
  password: ${VALKEY_PASSWORD}
  database: 0
  maxTotal: 100              # More connections for high concurrency
  maxIdle: 50
  minIdle: 25
  timeoutMs: 5000            # Longer timeout for slower networks
  cacheTtlSeconds: 3600      # 1 hour for long-running queries
  healthCheckIntervalMs: 60000  # 1 minute health checks

Single Instance (No Changes Required)

valkeyConfiguration:
enabled: false # Default - local cache sufficient

Testing

Unit Tests (31 total, all passing)

TestValkeyConfiguration (16 tests)

  • Default values verification
  • Setter/getter correctness
  • Input validation (port range, positive values, etc.)
  • Edge cases (null password, various database indices)

TestValkeyDistributedCache (15 tests)

  • Disabled cache behavior (returns empty, fails gracefully)
  • Invalid host handling (marks unhealthy)
  • Close operations (safe cleanup)
  • All cache operations (get, set, delete with variations)
  • Null/empty array handling
  • Password and database configuration
  • Pool configuration acceptance

Integration Tests (existing tests updated)

  • All routing manager tests updated with distributed cache
  • NoopDistributedCache for tests not requiring real cache
  • No regression in existing functionality

Documentation

Comprehensive documentation added:

New File: docs/valkey-configuration.md (273 lines)

  • Quick start with minimal config
  • Full configuration reference
  • Deployment scenarios (single vs. multi-instance)
  • Performance tuning guidelines
  • Connection pool sizing recommendations
  • Monitoring and troubleshooting
  • Security checklist
  • Architecture explanation
  • Migration guide
  • FAQ

Updated Files:

  • docs/installation.md - Added "Configure distributed cache" section
  • docs/operation.md - Added "Multi-instance deployments" monitoring section
  • mkdocs.yml - Added navigation entry for Valkey docs

Backward Compatibility

✅ Fully backward compatible

  • Disabled by default (enabled: false)
  • No changes required to existing configs
  • Single-instance deployments work exactly as before
  • Existing tests pass without modification (after updating constructors)

Migration Path

From Single to Multi-Gateway

  1. Deploy Valkey server
  2. Update config.yaml on all gateways:
    valkeyConfiguration:
    enabled: true
    host: valkey.internal
    port: 6379
    password: ${VALKEY_PASSWORD}
  3. Rolling restart gateways
  4. Monitor cache hit rates

No data migration needed - cache populates automatically.

Graceful Degradation

When Valkey is unavailable:

  • ✅ Queries continue working
  • ✅ Falls back to database lookups
  • ✅ Logs warnings (not errors)
  • ✅ Marks cache as unhealthy
  • ✅ Auto-recovery when Valkey returns

Monitoring

Cache metrics available via ValkeyDistributedCache:

  • getCacheHits() - Total cache hits
  • getCacheMisses() - Total cache misses
  • getCacheWrites() - Total successful writes
  • getCacheErrors() - Total operation errors
  • getCacheHitRate() - Hit rate percentage (0-100)

Future work: Expose these via /metrics endpoint for Prometheus.

Dependencies

Added: io.valkey:valkey-java:5.5.0

  • Valkey is a Redis fork with compatible protocol
  • Works with both Valkey and Redis servers
  • Apache 2.0 licensed

Files Changed

New Files (7)

  • gateway-ha/src/main/java/io/trino/gateway/ha/config/ValkeyConfiguration.java
  • gateway-ha/src/main/java/io/trino/gateway/ha/router/DistributedCache.java
  • gateway-ha/src/main/java/io/trino/gateway/ha/router/ValkeyDistributedCache.java
  • gateway-ha/src/test/java/io/trino/gateway/ha/config/TestValkeyConfiguration.java
  • gateway-ha/src/test/java/io/trino/gateway/ha/router/NoopDistributedCache.java
  • gateway-ha/src/test/java/io/trino/gateway/ha/router/TestValkeyDistributedCache.java
  • docs/valkey-configuration.md

Modified Files (16)

  • Configuration: gateway-ha/config.yaml, docs/config.yaml
  • Core: HaGatewayConfiguration.java, HaGatewayProviderModule.java, BaseRoutingManager.java
  • Routing: QueryCountBasedRouter.java, StochasticRoutingManager.java, ProxyRequestHandler.java
  • Tests: 4 routing manager tests updated
  • Docs: installation.md, operation.md, mkdocs.yml

Checklist

  • Code follows project style guidelines
  • No deprecated APIs used (Duration instead of milliseconds)
  • Comprehensive input validation
  • Unit tests written and passing (31 tests)
  • Integration tests updated and passing
  • Documentation complete and accurate
  • Configuration examples provided
  • Backward compatible (disabled by default)
  • Graceful degradation implemented
  • Security considerations documented
  • Performance tuning guidelines included

Future Enhancements

  • Expose cache metrics via /metrics OpenMetrics endpoint
  • Add TLS/SSL support for Valkey connections
  • Support Redis Cluster mode
  • Add cache warming on startup
  • Implement cache eviction strategies
  • Add circuit breaker pattern for cache failures

Testing Instructions

Local Testing (Single Instance)

No changes needed - works as before

java -jar gateway-ha.jar config.yaml

Multi-Instance with Valkey

Start Valkey

docker run -d -p 6379:6379 valkey/valkey:latest

Update config.yaml

valkeyConfiguration:
enabled: true
host: localhost
port: 6379

Start multiple gateways

java -jar gateway-ha.jar config.yaml

Verify Cache Working

Check logs for:

"Valkey distributed cache initialized: localhost:6379"

"Valkey health check passed"

Submit queries and verify cache hits increase

  Implement distributed caching using Valkey (Redis-compatible) to enable
  horizontal scaling of Trino Gateway across multiple instances. This allows
  query metadata to be shared between gateway instances, ensuring consistent
  routing regardless of which instance receives a request.

  Key features:
  - 3-tier caching architecture: L1 (Guava local) → L2 (Valkey distributed) → L3 (PostgreSQL)
  - Graceful degradation when Valkey unavailable (falls back to database)
  - Configurable health checks and connection pooling
  - Cache metrics (hits, misses, writes, errors, hit rate)
  - Write-through caching for backend and routing group lookups
  - Lazy-loading for external URL lookups
  - Convention over Configuration with sensible defaults

  Implementation:
  - Add ValkeyConfiguration with 11 configurable parameters (minimal 3 required)
  - Create DistributedCache interface and ValkeyDistributedCache implementation
  - Integrate distributed cache into BaseRoutingManager routing logic
  - Use modern Duration API (no deprecated methods)
  - Add comprehensive input validation and error handling
  - Include 31 unit tests (16 config + 15 cache tests)

  Configuration:
  valkeyConfiguration:
    enabled: true
    host: valkey.internal
    port: 6379
    password: ${VALKEY_PASSWORD}

  Documentation includes:
  - Quick start guide with minimal configuration
  - Full configuration reference with tuning guidelines
  - Deployment scenarios (single vs. multi-instance)
  - Performance tuning recommendations
  - Security best practices
  - Architecture documentation and troubleshooting

  Single-instance deployments don't need distributed caching - local Guava
  cache is sufficient. Multi-instance deployments benefit from shared cache
  for consistent query routing.
  Implement distributed caching using Valkey (Redis-compatible) to enable
  horizontal scaling of Trino Gateway across multiple instances. This allows
  query metadata to be shared between gateway instances, ensuring consistent
  routing regardless of which instance receives a request.

  Key features:
  - 3-tier caching architecture: L1 (Guava local) → L2 (Valkey distributed) → L3 (PostgreSQL)
  - Graceful degradation when Valkey unavailable (falls back to database)
  - Configurable health checks and connection pooling
  - Cache metrics (hits, misses, writes, errors, hit rate)
  - Write-through caching for backend and routing group lookups
  - Lazy-loading for external URL lookups
  - Convention over Configuration with sensible defaults

  Implementation:
  - Add ValkeyConfiguration with 11 configurable parameters (minimal 3 required)
  - Create DistributedCache interface and ValkeyDistributedCache implementation
  - Integrate distributed cache into BaseRoutingManager routing logic
  - Use modern Duration API (no deprecated methods)
  - Add comprehensive input validation and error handling
  - Include 31 unit tests (16 config + 15 cache tests)

  Configuration:
  valkeyConfiguration:
    enabled: true
    host: valkey.internal
    port: 6379
    password: ${VALKEY_PASSWORD}

  Documentation includes:
  - Quick start guide with minimal configuration
  - Full configuration reference with tuning guidelines
  - Deployment scenarios (single vs. multi-instance)
  - Performance tuning recommendations
  - Security best practices
  - Architecture documentation and troubleshooting

  Single-instance deployments don't need distributed caching - local Guava
  cache is sufficient. Multi-instance deployments benefit from shared cache
  for consistent query routing.
…endency

# Conflicts:
#	docs/config.yaml
#	docs/installation.md
#	docs/valkey-configuration.md
#	gateway-ha/config.yaml
@cla-bot cla-bot bot added the cla-signed label Jan 27, 2026
After merge with main branch, UserConfiguration and ApiAuthenticator
imports are no longer used due to code refactoring. Removing them to
fix checkstyle violations.

- Remove unused import: io.trino.gateway.ha.config.UserConfiguration
- Remove unused import: io.trino.gateway.ha.security.ApiAuthenticator
After merge, code uses ImmutableList, MonitorConfiguration, and List
but imports were missing causing compilation failures.

- Add import: com.google.common.collect.ImmutableList (for getClusterStatsObservers)
- Add import: io.trino.gateway.ha.config.MonitorConfiguration (for getMonitorConfiguration)
- Add import: java.util.List (for method return type)
@hpopuri2 hpopuri2 closed this Jan 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

1 participant