This repository was archived by the owner on Jan 19, 2020. It is now read-only.
[doc][design] Observation #4
- time series data are rarely updated
  - the cost of an update can be high, since updates are so seldom needed
- values with close timestamps are likely to be close in value in most cases
  - so the compression ratio can be quite high
- the volume of data is extremely large
  - even a single series can overwhelm a single node given enough time
  - by introducing hierarchy we can reduce duplicated data and relieve the pressure on compression
- does computing the delta of two 64-bit values (full timestamps) cost more time than computing the delta of two 16-bit values (offsets)? (the space side of this is shown in the offset sketch after this list)
- how do we support times earlier than 1970? Use negative numbers? Yes
- when delta encoding is used to compress both timestamps and data values, different precisions may result in very similar space costs?
- what if I XOR some part of the timestamp with the value? There is always some correlation between time and value, which could enable more efficient compression. The predictor method used by Akumuli seems pretty similar to this idea (see the XOR sketch after this list)
- column style is good even when you need to query across multiple series, because you can use parallel hardware such as vector processors and GPUs to process whole vectors instead of going row by row (see the columnar sketch after this list)
  - got the idea from http://stackoverflow.com/questions/43098851/cassandra-time-series-managment-regroup-by-date
  - a library that might be useful: https://github.com/arrayfire/arrayfire (works on both CPU and GPU)
- in-memory and on-disk storage can use different compression strategies (real-time vs. historical), e.g. in memory we can use simple delta encoding to allow direct computation on the data, or generation-based compression (see the encoder sketch after this list)
- let users annotate data
- automatically annotate data when an anomaly is detected, and load it into cache because it is likely to be queried again soon
- to reach eventual consistency, i.e. deduplicate and merge replicas, we may use a tree structure like DynamoDB does (see the Merkle tree sketch after this list)
- store timestamps and values separately? One timestamp could have multiple values, but that is more like a row store? And how do we make sure the two sides stay the same size?
- scheduling of requests is needed when running in a multi-tenant environment, or when running on top of Mesos
- for the series churn problem mentioned by Prometheus, annotations could enable better compression by putting small series that are closely related but have no time overlap into one bigger series
- data coming from the same source may or may not be queried together
  - for monitoring data, CPU and memory are reported in the same payload and are often queried together
  - but for sensors, it is more common to query multiple different sensors
- using annotations might be better than using tags, and it may even solve the series burst problem by dividing a large series into several chunks
- series with the same name have the same unit and range; this semantic is useful for drawing graphs
- using TTL in Cassandra may not be a good retention policy, and automatic rollup may not be a good policy either
- the tree model cannot easily be extended to the table model
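
To make the 16-bit offset question concrete, here is a minimal Go sketch (not code from this repo; second-precision timestamps and the per-block base are assumptions) that stores each timestamp as a 16-bit offset from a block's base timestamp instead of as a full 64-bit value. It only shows the space saving; answering the CPU-cost half of the question would need a real benchmark.

```go
package main

import "fmt"

// EncodeOffsets stores timestamps as 16-bit offsets from a per-block
// base instead of full 64-bit values, cutting 8 bytes per point to 2.
// It assumes every timestamp in the block falls within 65535 seconds
// (about 18 hours) of the base.
func EncodeOffsets(base int64, ts []int64) ([]uint16, error) {
	offsets := make([]uint16, len(ts))
	for i, t := range ts {
		d := t - base
		if d < 0 || d > 0xFFFF {
			return nil, fmt.Errorf("timestamp %d outside block range", t)
		}
		offsets[i] = uint16(d)
	}
	return offsets, nil
}

func main() {
	base := int64(1491004800) // block start: 2017-04-01 00:00:00 UTC
	ts := []int64{1491004800, 1491004810, 1491004820, 1491004831}
	offsets, err := EncodeOffsets(base, ts)
	if err != nil {
		panic(err)
	}
	fmt.Println(offsets) // [0 10 20 31], 2 bytes each instead of 8
}
```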
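
For the XOR bullet, a rough sketch of XOR-ing consecutive values in the style of Facebook's Gorilla paper (note this XORs value against previous value, not timestamp against value, and it is neither Akumuli's predictor nor this repo's code): close values share sign, exponent, and leading mantissa bits, so their XOR has long zero runs that a bit-level encoder can store in very few bits.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// xorDeltas XORs each float64's bit pattern with its predecessor.
// Nearby values share their high-order bits, so most results have
// many leading (and often trailing) zeros, which is what makes
// Gorilla-style bit packing effective.
func xorDeltas(vals []float64) []uint64 {
	out := make([]uint64, len(vals))
	var prev uint64
	for i, v := range vals {
		cur := math.Float64bits(v)
		out[i] = cur ^ prev
		prev = cur
	}
	return out
}

func main() {
	vals := []float64{12.0, 12.0, 12.25, 12.25, 13.0}
	for _, x := range xorDeltas(vals) {
		// identical values XOR to zero; close values leave short bit runs
		fmt.Printf("leading zeros: %2d  bits: %064b\n", bits.LeadingZeros64(x), x)
	}
}
```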
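
On the column-style and "store timestamps and values separately" bullets, a sketch of a struct-of-arrays chunk (field and type names are illustrative): the two columns stay aligned by index, and an aggregation scans only the dense value slice, which is the access pattern that vector processors and GPUs like.

```go
package main

import "fmt"

// Chunk stores one series column-wise: timestamps and values live in
// separate, equally sized slices instead of per-point structs.
type Chunk struct {
	Times  []int64
	Values []float64
}

// Append writes to both slices so the columns stay aligned by index;
// this is one answer to the "same size" question above.
func (c *Chunk) Append(t int64, v float64) {
	c.Times = append(c.Times, t)
	c.Values = append(c.Values, v)
}

// Sum scans only the value column; the timestamp column is never
// touched, so the loop runs over one dense array and vectorizes well.
func (c *Chunk) Sum() float64 {
	var s float64
	for _, v := range c.Values {
		s += v
	}
	return s
}

func main() {
	var c Chunk
	c.Append(1491004800, 0.5)
	c.Append(1491004810, 0.7)
	fmt.Println(c.Sum()) // 1.2
}
```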
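
The in-memory vs. on-disk split could be expressed as one encoder interface with a strategy per tier. Both implementations below are illustrative stand-ins under assumed names, not anything this repo defines: the in-memory one keeps raw one-byte deltas so queries can compute on them directly, while the on-disk one run-length encodes the deltas as a stand-in for a heavier historical scheme.

```go
package main

import "fmt"

// Encoder lets the real-time (in-memory) and historical (on-disk)
// tiers plug in different compression strategies behind one interface.
type Encoder interface {
	Encode(vals []int64) []byte
}

// memDelta keeps one raw byte per delta, so the in-memory tier can
// compute on the encoded form directly, with no decode step.
type memDelta struct{}

func (memDelta) Encode(vals []int64) []byte {
	out := make([]byte, 0, len(vals))
	prev := int64(0)
	for _, v := range vals {
		out = append(out, byte(v-prev)) // assumes deltas fit in a byte
		prev = v
	}
	return out
}

// diskDelta run-length encodes the deltas as (delta, count) pairs,
// trading decode cost for space, as a historical tier might.
type diskDelta struct{}

func (diskDelta) Encode(vals []int64) []byte {
	var out []byte
	var prev, last int64
	count := 0
	for i, v := range vals {
		d := v - prev
		prev = v
		if i == 0 || d != last {
			if count > 0 {
				out = append(out, byte(last), byte(count))
			}
			last, count = d, 1
		} else {
			count++
		}
	}
	if count > 0 {
		out = append(out, byte(last), byte(count))
	}
	return out
}

func main() {
	vals := []int64{10, 20, 30, 40, 41} // regular 10s steps, then a jitter
	for _, e := range []Encoder{memDelta{}, diskDelta{}} {
		fmt.Printf("%T -> %d bytes\n", e, len(e.Encode(vals)))
	}
}
```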
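
For the eventual-consistency bullet, a toy Merkle tree comparison of the kind Dynamo-style systems use for anti-entropy (all names are hypothetical, and chunks are plain strings here only for brevity): two replicas first compare root hashes, and only the subtrees whose hashes differ need to be walked and exchanged, so identical data is never re-transferred.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// merkleRoot folds per-chunk hashes pairwise into a single root hash.
// Replicas holding identical chunks produce identical roots, so one
// comparison can prove two nodes are already in sync.
func merkleRoot(level [][32]byte) [32]byte {
	if len(level) == 1 {
		return level[0]
	}
	var next [][32]byte
	for i := 0; i < len(level); i += 2 {
		if i+1 == len(level) {
			next = append(next, level[i]) // odd node is promoted as-is
			continue
		}
		next = append(next, sha256.Sum256(append(level[i][:], level[i+1][:]...)))
	}
	return merkleRoot(next)
}

func hashChunks(chunks []string) [][32]byte {
	hs := make([][32]byte, len(chunks))
	for i, c := range chunks {
		hs[i] = sha256.Sum256([]byte(c))
	}
	return hs
}

func main() {
	a := merkleRoot(hashChunks([]string{"chunk0", "chunk1", "chunk2"}))
	b := merkleRoot(hashChunks([]string{"chunk0", "chunkX", "chunk2"}))
	// differing roots mean the replicas diverged; walk the subtrees
	// whose hashes differ to find exactly which chunks to exchange
	fmt.Println(a == b) // false
}
```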