| 1 | +# Cassandra FindTraceIDs Duration Query Behavior |
| 2 | + |
| 3 | +## Status |
| 4 | + |
| 5 | +Accepted |
| 6 | + |
| 7 | +## Context |
| 8 | + |
| 9 | +The Cassandra spanstore implementation in Jaeger handles trace queries with duration filters (DurationMin/DurationMax) through a separate code path that does not intersect with other indexed query parameters such as tags. This behavior differs from other storage backends such as Badger and may seem counterintuitive to users. |
| 10 | + |
| 11 | +### Data Model and Cassandra Constraints |
| 12 | + |
| 13 | +Cassandra's data model constrains which query patterns can be served efficiently. The columns of the `duration_index` table can be seen in the CQL insertion query in [`internal/storage/v1/cassandra/spanstore/writer.go`](../../internal/storage/v1/cassandra/spanstore/writer.go): |
| 14 | + |
| 15 | +```cql |
| 16 | +INSERT INTO duration_index(service_name, operation_name, bucket, duration, start_time, trace_id) |
| 17 | +VALUES (?, ?, ?, ?, ?, ?) |
| 18 | +``` |
| 19 | + |
| 20 | +This table uses a composite partition key consisting of `service_name`, `operation_name`, and `bucket` (an hourly time bucket), with `duration` as a clustering column. In Cassandra, **partition keys require equality constraints** in WHERE clauses, so range queries and arbitrary intersections across different partitions cannot be performed efficiently. |
| 21 | + |
| 22 | +### Duration Index Structure |
| 23 | + |
| 24 | +The duration index is bucketed by hour to limit partition size and improve query performance. From [`internal/storage/v1/cassandra/spanstore/writer.go`](../../internal/storage/v1/cassandra/spanstore/writer.go) (line 57): |
| 25 | + |
| 26 | +```go |
| 27 | +durationBucketSize = time.Hour |
| 28 | +``` |
| 29 | + |
| 30 | +When a span is indexed, its start time is rounded to the nearest hour bucket (line 231 in writer.go): |
| 31 | + |
| 32 | +```go |
| 33 | +timeBucket := startTime.Round(durationBucketSize) |
| 34 | +``` |
| 35 | + |
| 36 | +The `indexByDuration` function (lines 229-243) creates two index entries per span: |
| 37 | +1. One indexed by service name alone (with empty operation name) |
| 38 | +2. One indexed by both service name and operation name |
| 39 | + |
| 40 | +```go |
| 41 | +indexByOperationName("") // index by service name alone |
| 42 | +indexByOperationName(span.OperationName) // index by service name and operation name |
| 43 | +``` |
| 44 | + |
| 45 | +### Query Path Implementation |
| 46 | + |
| 47 | +In [`internal/storage/v1/cassandra/spanstore/reader.go`](../../internal/storage/v1/cassandra/spanstore/reader.go), the `findTraceIDs` method (lines 275-301) performs an early return when duration parameters are present: |
| 48 | + |
| 49 | +```go |
| 50 | +func (s *SpanReader) findTraceIDs(ctx context.Context, traceQuery *spanstore.TraceQueryParameters) (dbmodel.UniqueTraceIDs, error) { |
| 51 | +	if traceQuery.DurationMin != 0 || traceQuery.DurationMax != 0 { |
| 52 | +		return s.queryByDuration(ctx, traceQuery) |
| 53 | +	} |
| 54 | +	// ... other query paths |
| 55 | +} |
| 56 | +``` |
| 57 | + |
| 58 | +This early return means that a duration query is served entirely from the `duration_index` table. Besides the duration bounds, only `ServiceName`, `OperationName`, the start-time range, and `NumTraces` are used; **tags are never consulted on this path**. |
| 59 | + |
| 60 | +The `queryByDuration` method (lines 333-375) iterates over hourly buckets within the query time range and issues a Cassandra query for each bucket: |
| 61 | + |
| 62 | +```go |
| 63 | +startTimeByHour := traceQuery.StartTimeMin.Round(durationBucketSize) |
| 64 | +endTimeByHour := traceQuery.StartTimeMax.Round(durationBucketSize) |
| 65 | + |
| 66 | +for timeBucket := endTimeByHour; timeBucket.After(startTimeByHour) || timeBucket.Equal(startTimeByHour); timeBucket = timeBucket.Add(-1 * durationBucketSize) { |
| 67 | +	query := s.session.Query( |
| 68 | +		queryByDuration, |
| 69 | +		timeBucket, |
| 70 | +		traceQuery.ServiceName, |
| 71 | +		traceQuery.OperationName, |
| 72 | +		minDurationMicros, |
| 73 | +		maxDurationMicros, |
| 74 | +		traceQuery.NumTraces*limitMultiple) |
| 75 | +	// execute query... |
| 76 | +} |
| 77 | +``` |
| 78 | + |
| 79 | +Each query specifies exact values for `bucket`, `service_name`, and `operation_name` (the partition key components), along with a range filter on `duration` (the clustering column). The query definition (lines 51-55) is: |
| 80 | + |
| 81 | +```cql |
| 82 | +SELECT trace_id |
| 83 | +FROM duration_index |
| 84 | +WHERE bucket = ? AND service_name = ? AND operation_name = ? AND duration > ? AND duration < ? |
| 85 | +LIMIT ? |
| 86 | +``` |
| 87 | + |
| 88 | +### Why Not Intersect with Other Indices? |
| 89 | + |
| 90 | +Unlike storage backends such as Badger (which can perform hash-joins and arbitrary index intersections), Cassandra's partition-based architecture makes cross-index intersections expensive and impractical: |
| 91 | + |
| 92 | +1. **Partition key constraints**: The duration index requires equality on `(service_name, operation_name, bucket)`. You cannot efficiently query across multiple operations or join with the tag index without scanning many partitions. |
| 93 | + |
| 94 | +2. **No server-side joins**: Cassandra does not support server-side joins. To intersect duration results with tag results, the client would need to: |
| 95 | + - Query the duration index for all matching trace IDs |
| 96 | + - Query the tag index for all matching trace IDs |
| 97 | + - Perform a client-side intersection |
| 98 | + |
| 99 | + This would be inefficient for large result sets and would require fetching potentially many trace IDs over the network. |
| 100 | + |
| 101 | +3. **Hourly bucket iteration**: The duration query already iterates over hourly buckets. Adding tag intersections would multiply the number of queries and result sets to merge. |
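The client-side intersection from point 2 can be sketched as below. The `traceIDSet` type and the sample IDs are assumptions for illustration; the real code uses `dbmodel.UniqueTraceIDs`, whose exact shape is not shown in this document.

```go
package main

import "fmt"

// traceIDSet is a hypothetical stand-in for a set of trace IDs.
type traceIDSet map[string]struct{}

// intersect returns the IDs present in both sets, iterating over the
// smaller set to keep the number of lookups low.
func intersect(a, b traceIDSet) traceIDSet {
	if len(b) < len(a) {
		a, b = b, a
	}
	out := make(traceIDSet)
	for id := range a {
		if _, ok := b[id]; ok {
			out[id] = struct{}{}
		}
	}
	return out
}

func main() {
	byDuration := traceIDSet{"t1": {}, "t2": {}, "t3": {}}
	byTag := traceIDSet{"t2": {}, "t3": {}, "t4": {}}
	fmt.Println(len(intersect(byDuration, byTag))) // 2
}
```

The intersection itself is cheap; the cost lies in fetching both complete ID sets first. Since each index query carries a `LIMIT`, intersecting two truncated result sets could also miss traces that match both filters.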
| 102 | + |
| 103 | +### Comparison with Badger |
| 104 | + |
| 105 | +The Badger storage backend handles duration queries differently. In [`internal/storage/v1/badger/spanstore/reader.go`](../../internal/storage/v1/badger/spanstore/reader.go) (around line 486), the `FindTraceIDs` method performs duration queries and then uses the results as a filter (`hashOuter`) that can be intersected with other index results: |
| 106 | + |
| 107 | +```go |
| 108 | +if query.DurationMax != 0 || query.DurationMin != 0 { |
| 109 | +	plan.hashOuter = r.durationQueries(plan, query) |
| 110 | +} |
| 111 | +``` |
| 112 | + |
| 113 | +Badger uses an embedded key-value store where range scans and in-memory filtering are efficient, allowing it to merge results from multiple indices. This is a fundamental difference from Cassandra's distributed, partition-oriented design. |
| 114 | + |
| 115 | +## Decision |
| 116 | + |
| 117 | +**The Cassandra spanstore will continue to treat duration queries as a separate query path that does not intersect with tag indices or other non-service/operation filters.** |
| 118 | + |
| 119 | +When a `TraceQueryParameters` contains `DurationMin` or `DurationMax`: |
| 120 | +- The query will use the `duration_index` table exclusively |
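| 121 | +- Besides the duration bounds, only `ServiceName` and `OperationName` (the partition key components), the start-time range, and `NumTraces` will be respected |
| 122 | +- Tag filters will not be evaluated; `validateQuery` rejects queries that combine duration and tags with `ErrDurationAndTagQueryNotSupported` |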
| 123 | +- The code will iterate over hourly time buckets within the query time range |
| 124 | + |
| 125 | +This approach is documented in code comments and in this ADR to set proper expectations. |
| 126 | + |
| 127 | +## Consequences |
| 128 | + |
| 129 | +### Positive |
| 130 | + |
| 131 | +1. **Performance**: Duration queries execute efficiently by scanning only relevant Cassandra partitions (scoped to service, operation, and hourly bucket). |
| 132 | +2. **Scalability**: The bucketed partition strategy prevents hot partitions and distributes load across the cluster. |
| 133 | +3. **Simplicity**: The implementation is straightforward and leverages Cassandra's strengths (partition-scoped queries with range filtering on clustering columns). |
| 134 | + |
| 135 | +### Negative |
| 136 | + |
| 137 | +1. **Limited query expressiveness**: Users cannot combine duration filters with tag filters in a single query. They must choose one or the other. |
| 138 | +2. **Expectation mismatch**: Users familiar with other backends (like Badger) may expect duration and tags to be combinable. |
| 139 | +3. **Workarounds required**: Applications that need both duration and tag filtering must: |
| 140 | + - Issue separate queries (one with duration, one with tags) |
| 141 | + - Perform client-side intersection of results |
| 142 | + - Or use a different storage backend that supports combined queries |
| 143 | + |
| 144 | +### Guidance for Users |
| 145 | + |
| 146 | +- **When using Cassandra spanstore**: Tag filters are not applied when `DurationMin` or `DurationMax` is specified. Expect `ErrDurationAndTagQueryNotSupported` when both duration and tags are specified (enforced in `validateQuery`, lines 227-229 of reader.go). |
| 147 | + |
| 148 | +- **For combined filtering needs**: Consider using the Badger backend, or implement client-side filtering by: |
| 149 | + 1. Querying with duration filters to get a candidate set of trace IDs |
| 150 | + 2. Fetching those traces |
| 151 | + 3. Filtering the results by tag values in your application code |
| 152 | + |
| 153 | +- **Query design**: Structure queries to leverage the indices available. Use `ServiceName` and `OperationName` in conjunction with duration queries for best results. |
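The client-side filtering steps listed above can be sketched as follows. `Span`, `Trace`, and both helper functions are simplified, hypothetical types and names rather than the Jaeger model; the input is assumed to be the traces already fetched for the IDs returned by a duration query.

```go
package main

import "fmt"

// Span and Trace are simplified, hypothetical types for illustration.
type Span struct {
	OperationName string
	Tags          map[string]string
}

type Trace struct {
	TraceID string
	Spans   []Span
}

// hasTag reports whether any span in the trace carries key=value.
func hasTag(t Trace, key, value string) bool {
	for _, s := range t.Spans {
		if v, ok := s.Tags[key]; ok && v == value {
			return true
		}
	}
	return false
}

// filterByTag keeps only the traces that contain a matching tag.
func filterByTag(traces []Trace, key, value string) []Trace {
	var out []Trace
	for _, t := range traces {
		if hasTag(t, key, value) {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	// Candidates as they might come back from a duration-filtered query.
	candidates := []Trace{
		{TraceID: "t1", Spans: []Span{{OperationName: "GET /", Tags: map[string]string{"error": "true"}}}},
		{TraceID: "t2", Spans: []Span{{OperationName: "GET /", Tags: map[string]string{"error": "false"}}}},
	}
	for _, t := range filterByTag(candidates, "error", "true") {
		fmt.Println(t.TraceID) // t1
	}
}
```

Filtering after the fact can return fewer traces than requested, so a caller may want to ask the duration query for more candidates than it ultimately needs.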
| 154 | + |
| 155 | +## References |
| 156 | + |
| 157 | +- Implementation files: |
| 158 | + - [`internal/storage/v1/cassandra/spanstore/reader.go`](../../internal/storage/v1/cassandra/spanstore/reader.go) - Query logic and duration query path |
| 159 | + - [`internal/storage/v1/cassandra/spanstore/writer.go`](../../internal/storage/v1/cassandra/spanstore/writer.go) - Duration index schema and insertion logic |
| 160 | + - [`internal/storage/v1/badger/spanstore/reader.go`](../../internal/storage/v1/badger/spanstore/reader.go) - Badger implementation for comparison |
| 161 | + |
| 162 | +- Cassandra documentation: |
| 163 | + - [Cassandra Data Modeling](https://cassandra.apache.org/doc/latest/data_modeling/index.html) |
| 164 | + - [CQL Partition Keys and Clustering Columns](https://cassandra.apache.org/doc/latest/cql/ddl.html#partition-key) |
| 165 | + |
| 166 | +- Related code: |
| 167 | +  - `durationIndex` constant (writer.go lines 47-50): CQL insert statement |
| 168 | +  - `queryByDuration` constant (reader.go lines 51-55): CQL select statement |
| 169 | + - `durationBucketSize` constant (writer.go line 57): Hourly bucketing |
| 170 | + - Error `ErrDurationAndTagQueryNotSupported` (reader.go line 77): Validation that prevents combining duration and tag queries |