Commit a54af2b

docs: Document aggregations
1 parent ac6a1dd commit a54af2b

2 files changed

+211
-17
lines changed

docs/aggregations.md

Lines changed: 178 additions & 0 deletions
@@ -0,0 +1,178 @@
1+
# Timeseries and aggregations
2+
3+
## Overview
4+
5+
Aggregations are declared in the subgraph schema through two types: one that
6+
stores the raw data points for the time series, and one that defines how raw
7+
data points are to be aggregated. A very simple aggregation can be declared like this:
8+
9+
```graphql
type Data @entity(timeseries: true) {
  id: Int8!
  timestamp: Int8!
  price: BigDecimal!
}

type Stats @aggregation(intervals: ["hour", "day"], source: "Data") {
  id: Int8!
  timestamp: Int8!
  sum: BigDecimal! @aggregate(fn: "sum", arg: "price")
}
```
22+
23+
Mappings for this schema will add data points by creating `Data` entities
24+
just as they would for normal entities. `graph-node` will then automatically
25+
populate the `Stats` aggregations whenever a given hour or day ends.
26+
27+
The type for the raw data points is defined with an `@entity(timeseries:
28+
true)` annotation. Timeseries types are immutable, and must have an `id`
29+
field and a `timestamp` field. The `timestamp` is set automatically by
30+
`graph-node` to the timestamp of the current block; if mappings set this
31+
field, it is silently overridden when the entity is saved.
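
The override rule can be pictured with a small Python sketch. The `save_data_point` helper and the dictionary representation of an entity are hypothetical and only illustrate the behaviour described above; they are not part of `graph-node`.

```python
from typing import Any, Dict


def save_data_point(entity: Dict[str, Any], block_timestamp: int) -> Dict[str, Any]:
    """Return the entity as it would be stored (hypothetical helper).

    Whatever `timestamp` the mapping supplied is replaced by the timestamp
    of the current block.
    """
    stored = dict(entity)  # copy, so the caller's dict is left untouched
    stored["timestamp"] = block_timestamp
    return stored


# The mapping tried to set its own timestamp; the block timestamp wins.
point = {"id": 1, "timestamp": 42, "price": "10.5"}
stored = save_data_point(point, block_timestamp=1_700_000_000)
```

In this sketch the value `42` supplied by the mapping never reaches storage, and no error is raised.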
32+
33+
Aggregations are declared with an `@aggregation` annotation instead of an
34+
`@entity` annotation. They must have an `id` field and a `timestamp` field.
35+
Both fields are set automatically by `graph-node`. The `timestamp` is set to
36+
the beginning of the time period that the aggregation instance represents,
37+
for example, to the beginning of the hour for an hourly aggregation. The
38+
`id` field is set to the `id` of one of the raw data points that went into
39+
the aggregation. Which one is chosen is not specified and should not be
40+
relied on.
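
As a rough illustration of how the bucket `timestamp` relates to the timestamps of raw data points, the Python sketch below rounds a timestamp down to the start of its interval. It assumes timestamps are Unix seconds and that buckets are aligned to UTC; neither detail is stated above. The `bucket_start` name is hypothetical.

```python
# Interval lengths in seconds (assumption: timestamps are Unix seconds).
INTERVAL_SECONDS = {"hour": 3600, "day": 86400}


def bucket_start(timestamp: int, interval: str) -> int:
    """Round a timestamp down to the beginning of its hour or day."""
    length = INTERVAL_SECONDS[interval]
    return timestamp - (timestamp % length)


# 1234567890 is 2009-02-13 23:31:30 UTC.
hour_bucket = bucket_start(1234567890, "hour")  # 2009-02-13 23:00:00 UTC
day_bucket = bucket_start(1234567890, "day")    # 2009-02-13 00:00:00 UTC
```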
41+
42+
**TODO**: add a `Timestamp` type and use that for `timestamp`
43+
44+
**TODO**: figure out whether we should just automatically add `id` and
45+
`timestamp` and have validation just check that these fields don't exist
46+
47+
Aggregations can also contain _dimensions_, which are fields that are not
48+
aggregated but are used to group the data points. For example, the
49+
`TokenStats` aggregation below has a `token` field that is used to group the
50+
data points by token:
51+
52+
```graphql
# Normal entity
type Token @entity { .. }

# Raw data points
type TokenData @entity(timeseries: true) {
  id: Bytes!
  timestamp: Int8!
  token: Token!
  amount: BigDecimal!
  priceUSD: BigDecimal!
}

# Aggregations over TokenData
type TokenStats @aggregation(intervals: ["hour", "day"], source: "TokenData") {
  id: Int8!
  timestamp: Int8!
  token: Token!
  totalVolume: BigDecimal! @aggregate(fn: "sum", arg: "amount")
  priceUSD: BigDecimal! @aggregate(fn: "last", arg: "priceUSD")
  count: Int8! @aggregate(fn: "count")
}
```
75+
76+
Fields in aggregations without the `@aggregate` directive are called
77+
_dimensions_, and fields with the `@aggregate` directive are called
78+
_aggregates_. A timeseries type really represents many timeseries, one for
79+
each combination of values for the dimensions.
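
To make the role of dimensions concrete, here is a minimal Python sketch that groups made-up data points shaped like `TokenData` by the `token` dimension and by hour, then computes a `sum` and a `count` per group. The in-memory representation and the `aggregate_hourly` helper are assumptions for illustration, not how `graph-node` stores or computes aggregations.

```python
from collections import defaultdict
from decimal import Decimal

HOUR = 3600  # assumption: timestamps are in seconds

# Made-up raw data points with one dimension (`token`).
points = [
    {"timestamp": 3600, "token": "A", "amount": Decimal("1.5")},
    {"timestamp": 3700, "token": "B", "amount": Decimal("4")},
    {"timestamp": 5000, "token": "A", "amount": Decimal("2.5")},
    {"timestamp": 7300, "token": "A", "amount": Decimal("3")},
]


def aggregate_hourly(data):
    """Group by (token, start of hour) and compute `sum` and `count`."""
    buckets = defaultdict(lambda: {"totalVolume": Decimal(0), "count": 0})
    for p in data:
        key = (p["token"], p["timestamp"] - p["timestamp"] % HOUR)
        buckets[key]["totalVolume"] += p["amount"]
        buckets[key]["count"] += 1
    return dict(buckets)


stats = aggregate_hourly(points)
```

Each distinct `(token, bucket)` pair yields its own aggregate row, which is why one aggregation type stands for one time series per combination of dimension values.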
80+
81+
**TODO** As written, this supports buckets that start at zero with every new
82+
hour/day. We also want to support cumulative statistics, i.e., snapshotting
83+
of time series where a new bucket starts with the values of the previous
84+
bucket.
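
One possible reading of the difference between per-bucket and cumulative statistics, sketched in Python with made-up values. This only illustrates the idea in the TODO above; it does not describe a design that has been decided.

```python
from itertools import accumulate

# Per-bucket sums in chronological order: every bucket starts at zero.
per_bucket = [5, 0, 7, 3]

# Cumulative variant: every bucket starts with the previous bucket's value.
cumulative = list(accumulate(per_bucket))
```

Here `cumulative` is `[5, 5, 12, 15]`: the second bucket saw no new data but still carries the running total forward.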
85+
86+
**TODO** Since average is a little more complicated to handle for cumulative
87+
aggregations, and it doesn't seem like it is used in practice, we won't
88+
initially support it. (same for variance, stddev etc.)
89+
90+
**TODO** The timeseries type can be simplified for some situations if
91+
aggregations can be done over expressions, for example over `priceUSD *
92+
amount` to track `totalVolumeUSD`
93+
94+
**TODO** It might be necessary to allow `@aggregate` fields that are only
95+
used for some intervals. We could allow that with syntax like
96+
`@aggregate(fn: .., arg: .., interval: "day")`
97+
98+
## Specification
99+
100+
### Timeseries
101+
102+
A timeseries is an entity type with the annotation `@entity(timeseries:
103+
true)`. It must have an `id` attribute and a `timestamp` attribute of type
104+
`Int8`. It must not also be annotated with `immutable: false` as timeseries
105+
are always immutable.
106+
107+
### Aggregations
108+
109+
An aggregation is defined with an `@aggregation` annotation. The annotation
110+
must have two arguments:
111+
112+
- `intervals`: a non-empty array of intervals; currently, only `hour` and `day`
113+
are supported
114+
- `source`: the name of a timeseries type. Aggregates are computed based on
115+
the attributes of the timeseries type.
116+
117+
The aggregation type must have an `id` attribute and a `timestamp` attribute
118+
of type `Int8`.
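
The rules stated so far for the `@aggregation` annotation can be sketched as a small validation routine in Python. The function name, its arguments, and the error messages are hypothetical; the actual validation in `graph-node` may be structured quite differently.

```python
from typing import Dict, List, Set

SUPPORTED_INTERVALS = {"hour", "day"}


def validate_aggregation(
    intervals: List[str],
    source: str,
    timeseries_types: Set[str],
    fields: Dict[str, str],
) -> List[str]:
    """Return a list of rule violations; an empty list means no violations."""
    errors: List[str] = []
    if not intervals:
        errors.append("intervals must be a non-empty array")
    for interval in intervals:
        if interval not in SUPPORTED_INTERVALS:
            errors.append(f"unsupported interval: {interval}")
    if source not in timeseries_types:
        errors.append(f"source is not a timeseries type: {source}")
    if "id" not in fields:
        errors.append("missing id attribute")
    if fields.get("timestamp") != "Int8":
        errors.append("timestamp attribute of type Int8 is required")
    return errors


ok = validate_aggregation(
    ["hour", "day"], "Data", {"Data"}, {"id": "Int8", "timestamp": "Int8"}
)
bad = validate_aggregation(["week"], "Token", {"Data"}, {"id": "Int8"})
```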
119+
120+
The aggregation type must have at least one attribute with the `@aggregate`
121+
annotation. These attributes must be of a numeric type (`Int`, `Int8`,
122+
`BigInt`, or `BigDecimal`). The annotation must have two arguments:
123+
124+
- `fn`: the name of an aggregation function
125+
- `arg`: the name of an attribute in the timeseries type
126+
127+
The following aggregation functions are currently supported:
128+
129+
| Name | Description |
130+
| ------- | ----------------- |
131+
| `sum` | Sum of all values |
132+
| `count` | Number of values |
133+
| `min` | Minimum value |
134+
| `max` | Maximum value |
135+
| `first` | First value |
136+
| `last` | Last value |
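
The functions in the table can be illustrated with a short Python sketch over the values of a single bucket. It assumes the bucket is non-empty and that values are listed in the order in which the data points were recorded, so that `first` and `last` mean the earliest and latest value; that interpretation is an assumption, not something the table spells out.

```python
from decimal import Decimal

# One plain-Python stand-in per supported function name.
AGGREGATION_FUNCTIONS = {
    "sum": lambda values: sum(values, Decimal(0)),
    "count": len,
    "min": min,
    "max": max,
    "first": lambda values: values[0],
    "last": lambda values: values[-1],
}

# Made-up values of one attribute within one bucket, in recording order.
values = [Decimal("3"), Decimal("1"), Decimal("2")]
results = {name: fn(values) for name, fn in AGGREGATION_FUNCTIONS.items()}
```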
137+
138+
## Querying
139+
140+
_This section is not implemented yet, and will require a bit more thought
141+
about details_
142+
143+
**TODO** As written, timeseries points like `TokenData` can be queried like
144+
any other entity. It would be nice to restrict how these data points can be
145+
queried, maybe even forbid it, as that would give us more latitude in how we
146+
store that data.
147+
148+
We create a top-level query field for each aggregation. That query field
149+
accepts the following arguments:
150+
151+
- For each dimension, an optional filter to test for equality of that
152+
dimension
153+
- A mandatory `interval`
154+
- An optional `current` to indicate whether to include the current,
155+
partially filled bucket in the response. Can be either `ignore` (the
156+
default) or `include`
157+
- Optional `timestamp_{gte|gt|lt|lte|eq}` filters to restrict the range of
158+
timestamps to return
159+
- Timeseries are always sorted by the dimensions in the order in which they
160+
are declared in the schema and the `timestamp` in descending order
161+
162+
```graphql
token_stats(interval: "hour",
      current: ignore,
      where: {
        token: "0x1234",
        timestamp_gte: 1234567890,
        timestamp_lt: 1234654290 }) {
  id
  timestamp
  token { id }
  totalVolume
  priceUSD
}
```
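
Since querying is not implemented yet, the following Python sketch is only one way to read the argument list above for hourly buckets. The `query_hourly` helper, the in-memory rows, and the choice to sort the dimension in ascending order are all assumptions; the text only fixes descending order for `timestamp`.

```python
from typing import Any, Dict, List, Optional

HOUR = 3600  # assumption: timestamps are in seconds


def query_hourly(
    rows: List[Dict[str, Any]],
    now: int,
    token: Optional[str] = None,
    timestamp_gte: Optional[int] = None,
    timestamp_lt: Optional[int] = None,
    current: str = "ignore",
) -> List[Dict[str, Any]]:
    """Filter and sort hourly aggregate rows (hypothetical helper)."""
    current_bucket = now - now % HOUR  # the partially filled bucket
    selected = []
    for row in rows:
        if token is not None and row["token"] != token:
            continue
        if timestamp_gte is not None and row["timestamp"] < timestamp_gte:
            continue
        if timestamp_lt is not None and row["timestamp"] >= timestamp_lt:
            continue
        if current == "ignore" and row["timestamp"] == current_bucket:
            continue
        selected.append(row)
    # Dimension first (ascending, assumed), then timestamp descending.
    return sorted(selected, key=lambda r: (r["token"], -r["timestamp"]))


rows = [
    {"token": "0x1234", "timestamp": 0},
    {"token": "0x1234", "timestamp": 3600},
    {"token": "0x1234", "timestamp": 7200},
    {"token": "0xabcd", "timestamp": 3600},
]
complete = query_hourly(rows, now=7500, token="0x1234")
with_current = query_hourly(rows, now=7500, token="0x1234", current="include")
```

With `now=7500` the bucket starting at `7200` is still being filled, so it is returned only when `current` is `include`.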
176+
177+
**TODO**: what about time-travel? Is it possible to include a block
178+
constraint?

docs/implementation/schema-generation.md

Lines changed: 33 additions & 17 deletions
@@ -5,13 +5,13 @@ table definition in Postgres.
55

66
Schema generation follows a few simple rules:
77

8-
* the data for a subgraph is entirely stored in a Postgres namespace whose
8+
- the data for a subgraph is entirely stored in a Postgres namespace whose
99
name is `sgdNNNN`. The mapping between namespace name and deployment id is
1010
kept in `deployment_schemas`
11-
* the data for each entity type is stored in a table whose structure follows
11+
- the data for each entity type is stored in a table whose structure follows
1212
the declaration of the type in the GraphQL schema
13-
* enums in the GraphQL schema are stored as enum types in Postgres
14-
* interfaces are not stored in the database, only the concrete types that
13+
- enums in the GraphQL schema are stored as enum types in Postgres
14+
- interfaces are not stored in the database, only the concrete types that
1515
implement the interface are stored
1616

1717
Any table for an entity type has the following structure:
@@ -32,20 +32,20 @@ queries](./time-travel.md).
3232
The attributes of the GraphQL type correspond directly to columns in the
3333
generated table. The types of these columns are
3434

35-
* the `id` column can have type `ID`, `String`, and `Bytes`, where `ID` is
35+
- the `id` column can have type `ID`, `String`, and `Bytes`, where `ID` is
3636
an alias for `String` for historical reasons.
37-
* if the attribute has a primitive type, the column has the SQL type that
37+
- if the attribute has a primitive type, the column has the SQL type that
3838
most closely mirrors the GraphQL type. `BigDecimal` and `BigInt` are
3939
stored as `numeric`, `Bytes` is stored as `bytea`, etc.
40-
* if the attribute references another entity, the column has the type of the
40+
- if the attribute references another entity, the column has the type of the
4141
`id` type of the referenced entity type. We do not use foreign key
4242
constraints to allow storing an entity that references an entity that will
4343
only be created later. Foreign key constraint violations will therefore
4444
only be detected when a query is issued, or simply lead to the reference
4545
missing from the query result.
46-
* if the attribute has an enum type, we generate a SQL enum type and use
46+
- if the attribute has an enum type, we generate a SQL enum type and use
4747
that as the type of the column.
48-
* if the attribute has a list type, like `[String]`, the corresponding
48+
- if the attribute has a list type, like `[String]`, the corresponding
4949
column uses an array type. We do not allow nested arrays like `[[String]]`
5050
in GraphQL, so arrays will only ever contain entries of a primitive type.
5151

@@ -70,6 +70,22 @@ constraint `unique(id)` to such tables, and can avoid expensive GiST
7070
indexes in favor of simple BTree indexes since the `block$` column is an
7171
integer.
7272

73+
### Timeseries
74+
75+
Entity types declared with `@entity(timeseries: true)` are represented in
76+
the same way as immutable entities. The only difference is that timeseries
77+
must also have a `timestamp` attribute.
78+
79+
### Aggregations
80+
81+
Entity types declared with `@aggregation` are represented by several tables,
82+
one for each `interval` from the `@aggregation` directive. The tables are
83+
named `TYPE_INTERVAL` where `TYPE` is the name of the aggregation, and
84+
`INTERVAL` is the name of the interval; they do not support mutating
85+
entities as aggregations are never updated, only appended to. The tables
86+
have one column for each dimension and aggregate. The type of the columns is
87+
determined in the same way as for those of normal entity types.
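
The naming rule can be sketched in a few lines of Python. The `aggregation_tables` helper is hypothetical, and it keeps the type name verbatim; any case conversion or quoting applied to real table names is not covered by the text above.

```python
from typing import List


def aggregation_tables(type_name: str, intervals: List[str]) -> List[str]:
    """Build one `TYPE_INTERVAL` table name per declared interval."""
    return [f"{type_name}_{interval}" for interval in intervals]


tables = aggregation_tables("Stats", ["hour", "day"])
```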
88+
7389
## Indexing
7490

7591
We do not know ahead of time which queries will be issued and therefore
@@ -79,17 +95,17 @@ are open issues at this time.
7995

8096
We generate the following indexes for each table:
8197

82-
* for mutable entity types
83-
* an exclusion index over `(id, block_range)` that ensures that the
98+
- for mutable entity types
99+
- an exclusion index over `(id, block_range)` that ensures that the
84100
versions for the same entity `id` have disjoint block ranges
85-
* a BRIN index on `(lower(block_range), COALESCE(upper(block_range),
86-
2147483647), vid)` that helps speed up some operations, especially
101+
- a BRIN index on `(lower(block_range), COALESCE(upper(block_range),
102+
2147483647), vid)` that helps speed up some operations, especially
87103
reversion, in tables that have good data locality, for example, tables
88104
where entities are never updated or deleted
89-
* for immutable entity types
90-
* a unique index on `id`
91-
* a BRIN index on `(block$, vid)`
92-
* for each attribute, an index called `attr_N_M_..` where `N` is the number
105+
- for immutable and timeseries entity types
106+
- a unique index on `id`
107+
- a BRIN index on `(block$, vid)`
108+
- for each attribute, an index called `attr_N_M_..` where `N` is the number
93109
of the entity type in the GraphQL schema, and `M` is the number of the
94110
attribute within that type. For attributes of a primitive type, the index
95111
is a BTree index. For attributes that reference other entities, the index
