docs: Document aggregations

lutter · lutter · commit a54af2b3024a · 2024-01-23T09:56:56.000-08:00
diff --git a/docs/aggregations.md b/docs/aggregations.md
@@ -0,0 +1,178 @@
+# Timeseries and aggregations
+
+## Overview
+
+Aggregations are declared in the subgraph schema through two types: one that
+stores the raw data points for the time series, and one that defines how raw
+data points are to be aggregated. A very simple aggregation can be declared like this:
+
+```graphql
+type Data @entity(timeseries: true) {
+  id: Int8!
+  timestamp: Int8!
+  price: BigDecimal!
+}
+
+type Stats @aggregation(intervals: ["hour", "day"], source: "Data") {
+  id: Int8!
+  timestamp: Int8!
+  sum: BigDecimal! @aggregate(fn: "sum", arg: "price")
+}
+```
+
+Mappings for this schema will add data points by creating `Data` entities
+just as they would for normal entities. `graph-node` will then automatically
+populate the `Stats` aggregations whenever a given hour or day ends.
+
+The type for the raw data points is defined with an `@entity(timeseries:
+true)` annotation. Timeseries types are immutable, and must have an `id`
+field and a `timestamp` field. The `timestamp` is set automatically by
+`graph-node` to the timestamp of the current block; if mappings set this
+field, it is silently overridden when the entity is saved.
+
+Aggregations are declared with an `@aggregation` annotation instead of an
+`@entity` annotation. They must have an `id` field and a `timestamp` field.
+Both fields are set automatically by `graph-node`. The `timestamp` is set to
+the beginning of the time period that the aggregation instance represents,
+for example, to the beginning of the hour for an hourly aggregation. The
+`id` field is set to the `id` of one of the raw data points that went into
+the aggregation. Which one is chosen is not specified and should not be
+relied on.
+
+**TODO**: add a `Timestamp` type and use that for `timestamp`
+
+**TODO**: figure out whether we should just automatically add `id` and
+`timestamp` and have validation just check that these fields don't exist
+
+Aggregations can also contain _dimensions_, which are fields that are not
+aggregated but are used to group the data points. For example, the
+`TokenStats` aggregation below has a `token` field that is used to group the
+data points by token:
+
+```graphql
+# Normal entity
+type Token @entity { .. }
+
+# Raw data points
+type TokenData @entity(timeseries: true) {
+  id: Bytes!
+  timestamp: Int8!
+  token: Token!
+  amount: BigDecimal!
+  priceUSD: BigDecimal!
+}
+
+# Aggregations over TokenData
+type TokenStats @aggregation(intervals: ["hour", "day"], source: "TokenData") {
+  id: Int8!
+  timestamp: Int8!
+  token: Token!
+  totalVolume: BigDecimal! @aggregate(fn: "sum", arg: "amount")
+  priceUSD: BigDecimal! @aggregate(fn: "last", arg: "priceUSD")
+  count: Int8! @aggregate(fn: "count")
+}
+```
+
+Fields in aggregations without the `@aggregate` directive are called
+_dimensions_, and fields with the `@aggregate` directive are called
+_aggregates_. A timeseries type really represents many timeseries, one for
+each combination of values for the dimensions.
+
+**TODO** As written, this supports buckets that start at zero with every new
+hour/day. We also want to support cumulative statistics, i.e., snapshotting
+of time series where a new bucket starts with the values of the previous
+bucket.
+
+**TODO** Since average is a little more complicated to handle for cumulative
+aggregations, and it doesn't seem like it is used in practice, we won't
+initially support it (the same goes for variance, stddev, etc.).
+
+**TODO** The timeseries type can be simplified for some situations if
+aggregations can be done over expressions, for example over `priceUSD *
+amount` to track `totalVolumeUSD`
+
+**TODO** It might be necessary to allow `@aggregate` fields that are only
+used for some intervals. We could allow that with syntax like
+`@aggregate(fn: .., arg: .., interval: "day")`
+
+## Specification
+
+### Timeseries
+
+A timeseries is an entity type with the annotation `@entity(timeseries:
+true)`. It must have an `id` attribute and a `timestamp` attribute of type
+`Int8`. It must not also be annotated with `immutable: false` as timeseries
+are always immutable.
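+
+As an illustration of these rules, the following sketch shows one
+declaration that satisfies them and one that does not. The type name
+`Transfer` and its `amount` attribute are hypothetical and only serve as an
+example:
+
+```graphql
+# Accepted: has `id` and an `Int8` `timestamp`; immutability is implied
+type Transfer @entity(timeseries: true) {
+  id: Int8!
+  timestamp: Int8!
+  amount: BigDecimal!
+}
+
+# Rejected under the rule above: a timeseries cannot be declared mutable
+# type Transfer @entity(timeseries: true, immutable: false) { .. }
+```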
+
+### Aggregations
+
+An aggregation is defined with an `@aggregation` annotation. The annotation
+must have two arguments:
+
+- `intervals`: a non-empty array of intervals; currently, only `hour` and `day`
+  are supported
+- `source`: the name of a timeseries type. Aggregates are computed based on
+  the attributes of the timeseries type.
+
+The aggregation type must have an `id` attribute and a `timestamp` attribute
+of type `Int8`.
+
+The aggregation type must have at least one attribute with the `@aggregate`
+annotation. These attributes must be of a numeric type (`Int`, `Int8`,
+`BigInt`, or `BigDecimal`). The annotation must have two arguments:
+
+- `fn`: the name of an aggregation function
+- `arg`: the name of an attribute in the timeseries type
+
+The following aggregation functions are currently supported:
+
+| Name    | Description       |
+| ------- | ----------------- |
+| `sum`   | Sum of all values |
+| `count` | Number of values  |
+| `min`   | Minimum value     |
+| `max`   | Maximum value     |
+| `first` | First value       |
+| `last`  | Last value        |
+
+## Querying
+
+_This section is not implemented yet and will require more thought about
+the details._
+
+**TODO** As written, timeseries points like `TokenData` can be queried like
+any other entity. It would be nice to restrict how these data points can be
+queried, maybe even forbid it, as that would give us more latitude in how we
+store that data.
+
+We create a top-level query field for each aggregation. That query field
+accepts the following arguments:
+
+- For each dimension, an optional filter to test for equality of that
+  dimension
+- A mandatory `interval`
+- An optional `current` to indicate whether to include the current,
+  partially filled bucket in the response. Can be either `ignore` (the
+  default) or `include`
+- Optional `timestamp_{gte|gt|lt|lte|eq}` filters to restrict the range of
+  timestamps to return
+
+Timeseries are always sorted by the dimensions, in the order in which they
+are declared in the schema, and then by `timestamp` in descending order.
+
+```graphql
+token_stats(interval: "hour",
+      current: ignore,
+      where: {
+        token: "0x1234",
+        timestamp_gte: 1234567890,
+        timestamp_lt: 1234654290 }) {
+  id
+  timestamp
+  token
+  totalVolume
+  priceUSD
+}
+```
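+
+Following the same proposed syntax, which is not implemented and may change,
+a query for daily buckets that also returns the partially filled current
+bucket might look like the sketch below. The token address `0x1234` is a
+placeholder carried over from the example above.
+
+```graphql
+# Sketch only: daily TokenStats buckets, including the bucket in progress
+token_stats(interval: "day",
+      current: include,
+      where: {
+        token: "0x1234" }) {
+  timestamp
+  totalVolume
+  priceUSD
+  count
+}
+```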
+
+**TODO**: what about time-travel? Is it possible to include a block
+constraint?
diff --git a/docs/implementation/schema-generation.md b/docs/implementation/schema-generation.md
@@ -5,13 +5,13 @@ table definition in Postgres.
 
 Schema generation follows a few simple rules:
 
-* the data for a subgraph is entirely stored in a Postgres namespace whose
+- the data for a subgraph is entirely stored in a Postgres namespace whose
   name is `sgdNNNN`. The mapping between namespace name and deployment id is
   kept in `deployment_schemas`
-* the data for each entity type is stored in a table whose structure follows
+- the data for each entity type is stored in a table whose structure follows
   the declaration of the type in the GraphQL schema
-* enums in the GraphQL schema are stored as enum types in Postgres
-* interfaces are not stored in the database, only the concrete types that
+- enums in the GraphQL schema are stored as enum types in Postgres
+- interfaces are not stored in the database, only the concrete types that
   implement the interface are stored
 
 Any table for an entity type has the following structure:
@@ -32,20 +32,20 @@ queries](./time-travel.md).
 The attributes of the GraphQL type correspond directly to columns in the
 generated table. The types of these columns are
 
-* the `id` column can have type `ID`, `String`, and `Bytes`, where `ID` is
+- the `id` column can have type `ID`, `String`, and `Bytes`, where `ID` is
   an alias for `String` for historical reasons.
-* if the attribute has a primitive type, the column has the SQL type that
+- if the attribute has a primitive type, the column has the SQL type that
   most closely mirrors the GraphQL type. `BigDecimal` and `BigInt` are
   stored as `numeric`, `Bytes` is stored as `bytea`, etc.
-* if the attribute references another entity, the column has the type of the
+- if the attribute references another entity, the column has the type of the
   `id` type of the referenced entity type. We do not use foreign key
   constraints to allow storing an entity that references an entity that will
   only be created later. Foreign key constraint violations will therefore
   only be detected when a query is issued, or simply lead to the reference
   missing from the query result.
-* if the attribute has an enum type, we generate a SQL enum type and use
+- if the attribute has an enum type, we generate a SQL enum type and use
   that as the type of the column.
-* if the attribute has a list type, like `[String]`, the corresponding
+- if the attribute has a list type, like `[String]`, the corresponding
   column uses an array type. We do not allow nested arrays like `[[String]]`
   in GraphQL, so arrays will only ever contain entries of a primitive type.
 
@@ -70,6 +70,22 @@ constraint `unique(id)` to such tables, and can avoid expensive GiST
 indexes in favor of simple BTree indexes since the `block$` column is an
 integer.
 
+### Timeseries
+
+Entity types declared with `@entity(timeseries: true)` are represented in
+the same way as immutable entities. The only difference is that timeseries
+must also have a `timestamp` attribute.
+
+### Aggregations
+
+Entity types declared with `@aggregation` are represented by several tables,
+one for each `interval` from the `@aggregation` directive. The tables are
+named `TYPE_INTERVAL` where `TYPE` is the name of the aggregation, and
+`INTERVAL` is the name of the interval; they do not support mutating
+entities as aggregations are never updated, only appended to. The tables
+have one column for each dimension and aggregate. The type of the columns is
+determined in the same way as for those of normal entity types.
+
 ## Indexing
 
 We do not know ahead of time which queries will be issued and therefore
@@ -79,17 +95,17 @@ are open issues at this time.
 
 We generate the following indexes for each table:
 
-* for mutable entity types
-  * an exclusion index over `(id, block_range)` that ensures that the
+- for mutable entity types
+  - an exclusion index over `(id, block_range)` that ensures that the
     versions for the same entity `id` have disjoint block ranges
-  * a BRIN index on `(lower(block_range), COALESCE(upper(block_range),
-    2147483647), vid)` that helps speed up some operations, especially
+  - a BRIN index on `(lower(block_range), COALESCE(upper(block_range),
+    2147483647), vid)` that helps speed up some operations, especially
     reversion, in tables that have good data locality, for example, tables
     where entities are never updated or deleted
-* for immutable entity types
-  * a unique index on `id`
-  * a BRIN index on `(block$, vid)`
-* for each attribute, an index called `attr_N_M_..` where `N` is the number
+- for immutable and timeseries entity types
+  - a unique index on `id`
+  - a BRIN index on `(block$, vid)`
+- for each attribute, an index called `attr_N_M_..` where `N` is the number
   of the entity type in the GraphQL schema, and `M` is the number of the
   attribute within that type. For attributes of a primitive type, the index
   is a BTree index. For attributes that reference other entities, the index