| 1 | +# Timeseries and aggregations |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +Aggregations are declared in the subgraph schema through two types: one that |
| 6 | +stores the raw data points for the time series, and one that defines how raw |
| 7 | +data points are to be aggregated. A very simple aggregation can be declared like this: |
| 8 | + |
| 9 | +```graphql |
| 10 | +type Data @entity(timeseries: true) { |
| 11 | + id: Int8! |
| 12 | + timestamp: Int8! |
| 13 | + price: BigDecimal! |
| 14 | +} |
| 15 | + |
| 16 | +type Stats @aggregation(intervals: ["hour", "day"], source: "Data") { |
| 17 | + id: Int8! |
| 18 | + timestamp: Int8! |
| 19 | + sum: BigDecimal! @aggregate(fn: "sum", arg: "price") |
| 20 | +} |
| 21 | +``` |
| 22 | + |
| 23 | +Mappings for this schema will add data points by creating `Data` entities |
| 24 | +just as they would for normal entities. `graph-node` will then automatically |
| 25 | +populate the `Stats` aggregations whenever a given hour or day ends. |
| 26 | + |
| 27 | +The type for the raw data points is defined with an `@entity(timeseries: |
| 28 | +true)` annotation. Timeseries types are immutable, and must have an `id` |
| 29 | +field and a `timestamp` field. The `timestamp` is set automatically by |
| 30 | +`graph-node` to the timestamp of the current block; if mappings set this |
| 31 | +field, it is silently overridden when the entity is saved. |
| 32 | + |
| 33 | +Aggregations are declared with an `@aggregation` annotation instead of an |
| 34 | +`@entity` annotation. They must have an `id` field and a `timestamp` field. |
| 35 | +Both fields are set automatically by `graph-node`. The `timestamp` is set to |
| 36 | +the beginning of the time period that the aggregation instance represents, |
| 37 | +for example, to the beginning of the hour for an hourly aggregation. The |
| 38 | +`id` field is set to the `id` of one of the raw data points that went into |
| 39 | +the aggregation. Which one is chosen is not specified and should not be |
| 40 | +relied on. |
| 41 | + |
| 42 | +**TODO**: add a `Timestamp` type and use that for `timestamp` |
| 43 | + |
| 44 | +**TODO**: figure out whether we should just automatically add `id` and |
| 45 | +`timestamp` and have validation just check that these fields don't exist |
| 46 | + |
| 47 | +Aggregations can also contain _dimensions_, which are fields that are not |
| 48 | +aggregated but are used to group the data points. For example, the |
| 49 | +`TokenStats` aggregation below has a `token` field that is used to group the |
| 50 | +data points by token: |
| 51 | + |
| 52 | +```graphql |
| 53 | +# Normal entity |
| 54 | +type Token @entity { .. } |
| 55 | + |
| 56 | +# Raw data points |
| 57 | +type TokenData @entity(timeseries: true) { |
| 58 | + id: Bytes! |
| 59 | + timestamp: Int8! |
| 60 | + token: Token! |
| 61 | + amount: BigDecimal! |
| 62 | + priceUSD: BigDecimal! |
| 63 | +} |
| 64 | + |
| 65 | +# Aggregations over TokenData |
| 66 | +type TokenStats @aggregation(intervals: ["hour", "day"], source: "TokenData") { |
| 67 | + id: Int8! |
| 68 | + timestamp: Int8! |
| 69 | + token: Token! |
| 70 | + totalVolume: BigDecimal! @aggregate(fn: "sum", arg: "amount") |
| 71 | + priceUSD: BigDecimal! @aggregate(fn: "last", arg: "priceUSD") |
| 72 | + count: Int8! @aggregate(fn: "count") |
| 73 | +} |
| 74 | +``` |
| 75 | + |
| 76 | +Fields in aggregations without the `@aggregate` directive are called |
| 77 | +_dimensions_, and fields with the `@aggregate` directive are called |
| 78 | +_aggregates_. A timeseries type really represents many timeseries, one for |
| 79 | +each combination of values for the dimensions. |
| 80 | + |
| 81 | +**TODO** As written, this supports buckets that start at zero with every new |
| 82 | +hour/day. We also want to support cumulative statistics, i.e., snapshotting |
| 83 | +of time series where a new bucket starts with the values of the previous |
| 84 | +bucket. |
| 85 | + |
| 86 | +**TODO** Since average is a little more complicated to handle for cumulative |
| 87 | +aggregations, and it doesn't seem to be used in practice, we won't |
| 88 | +initially support it. The same goes for variance, stddev, etc. |
| 89 | + |
| 90 | +**TODO** The timeseries type can be simplified for some situations if |
| 91 | +aggregations can be done over expressions, for example over `priceUSD * |
| 92 | +amount` to track `totalVolumeUSD` |
| 93 | + |
| 94 | +**TODO** It might be necessary to allow `@aggregate` fields that are only |
| 95 | +used for some intervals. We could allow that with syntax like |
| 96 | +`@aggregate(fn: .., arg: .., interval: "day")` |
| 97 | + |
| 98 | +## Specification |
| 99 | + |
| 100 | +### Timeseries |
| 101 | + |
| 102 | +A timeseries is an entity type with the annotation `@entity(timeseries: |
| 103 | +true)`. It must have an `id` attribute and a `timestamp` attribute of type |
| 104 | +`Int8`. It must not also be annotated with `immutable: false` as timeseries |
| 105 | +are always immutable. |
| 106 | + |
| 107 | +### Aggregations |
| 108 | + |
| 109 | +An aggregation is defined with an `@aggregation` annotation. The annotation |
| 110 | +must have two arguments: |
| 111 | + |
| 112 | +- `intervals`: a non-empty array of intervals; currently, only `hour` and `day` |
| 113 | + are supported |
| 114 | +- `source`: the name of a timeseries type. Aggregates are computed based on |
| 115 | + the attributes of the timeseries type. |
| 116 | + |
| 117 | +The aggregation type must have an `id` attribute and a `timestamp` attribute |
| 118 | +of type `Int8`. |
| 119 | + |
| 120 | +The aggregation type must have at least one attribute with the `@aggregate` |
| 121 | +annotation. These attributes must be of a numeric type (`Int`, `Int8`, |
| 122 | +`BigInt`, or `BigDecimal`). The annotation must have two arguments: |
| 123 | + |
| 124 | +- `fn`: the name of an aggregation function |
| 125 | +- `arg`: the name of an attribute in the timeseries type; note that the `count` field of `TokenStats` in the overview omits it |
| 126 | + |
| 127 | +The following aggregation functions are currently supported: |
| 128 | + |
| 129 | +| Name | Description | |
| 130 | +| ------- | ----------------- | |
| 131 | +| `sum` | Sum of all values | |
| 132 | +| `count` | Number of values | |
| 133 | +| `min` | Minimum value | |
| 134 | +| `max` | Maximum value | |
| 135 | +| `first` | First value | |
| 136 | +| `last` | Last value | |
| 137 | + |
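| | +As a sketch of how these functions combine, the hypothetical `PriceCandle` |
| | +aggregation below derives open/high/low/close values from the `Data` |
| | +timeseries declared in the overview. The type and field names are |
| | +illustrative assumptions, not part of this proposal, and the sketch assumes |
| | +that `first` and `last` refer to the earliest and latest data point in a |
| | +bucket, which the table does not spell out: |
| | + |
| | +```graphql |
| | +# Hypothetical example: one candle per hour and per day over `Data.price` |
| | +type PriceCandle @aggregation(intervals: ["hour", "day"], source: "Data") { |
| | + id: Int8! |
| | + timestamp: Int8! |
| | + open: BigDecimal! @aggregate(fn: "first", arg: "price") |
| | + high: BigDecimal! @aggregate(fn: "max", arg: "price") |
| | + low: BigDecimal! @aggregate(fn: "min", arg: "price") |
| | + close: BigDecimal! @aggregate(fn: "last", arg: "price") |
| | +} |
| | +``` |
| | + |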
| 138 | +## Querying |
| 139 | + |
| 140 | +_This section is not implemented yet and will require a bit more thought |
| 141 | +about the details._ |
| 142 | + |
| 143 | +**TODO** As written, timeseries points like `TokenData` can be queried like |
| 144 | +any other entity. It would be nice to restrict how these data points can be |
| 145 | +queried, maybe even forbid it, as that would give us more latitude in how we |
| 146 | +store that data. |
| 147 | + |
| 148 | +We create a top-level query field for each aggregation. That query field |
| 149 | +accepts the following arguments: |
| 150 | + |
| 151 | +- For each dimension, an optional filter to test for equality of that |
| 152 | + dimension |
| 153 | +- A mandatory `interval` |
| 154 | +- An optional `current` to indicate whether to include the current, |
| 155 | + partially filled bucket in the response. Can be either `ignore` (the |
| 156 | + default) or `include` |
| 157 | +- Optional `timestamp_{gte|gt|lt|lte|eq}` filters to restrict the range of |
| 158 | + timestamps to return |
| 159 | + |
| 160 | +Results are always sorted by the dimensions, in the order in which they are declared in the schema, and then by `timestamp` in descending order. |
| 161 | + |
| 162 | +```graphql |
| 163 | +token_stats(interval: "hour", |
| 164 | + current: ignore, |
| 165 | + where: { |
| 166 | + token: "0x1234", |
| 167 | + timestamp_gte: 1234567890, |
| 168 | + timestamp_lt: 1234654290 }) { |
| 169 | + id |
| 170 | + timestamp |
| 171 | + token { id } |
| 172 | + totalVolume |
| 173 | + priceUSD |
| 174 | +} |
| 175 | +``` |
| 176 | + |
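| | +Following the same proposed syntax, a query that also returns the current, |
| | +partially filled daily bucket might look like the sketch below. Since this |
| | +section is not implemented, the exact shape is an assumption: |
| | + |
| | +```graphql |
| | +# Sketch only: daily stats for one token, including the bucket still being filled |
| | +token_stats(interval: "day", |
| | + current: include, |
| | + where: { |
| | + token: "0x1234", |
| | + timestamp_gte: 1234567890 }) { |
| | + timestamp |
| | + totalVolume |
| | + priceUSD |
| | + count |
| | +} |
| | +``` |
| | + |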
| 177 | +**TODO**: what about time-travel? Is it possible to include a block |
| 178 | +constraint? |