Description
Key points
- Collection state is saved when a JSONL chunk is written
- If the CLI conversion fails, the collected data is lost
Solution -> do not delete the JSONL files on failure - the CLI resumes the conversion the next time it collects the same partition.
Implementation
Change to JSONL strategy
As far as possible, maintain a 1:1 mapping between source files and JSONL files. Split into multiple JSONL files if the source file is longer than 10K (?) rows, to limit buffering within the plugin Collector.
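A minimal sketch of the splitting rule above. The naming scheme and helper names are assumptions for illustration; only the 10K limit comes from the note (and is itself provisional).

```go
package main

import "fmt"

// maxRowsPerChunk caps buffering in the plugin Collector; 10K is the
// provisional limit from the design note.
const maxRowsPerChunk = 10000

// chunkFileName maps a source file to its nth JSONL chunk, preserving the
// 1:1 source-to-JSONL mapping when the source fits in one chunk.
// (Naming scheme is hypothetical.)
func chunkFileName(sourceFile string, chunkIndex int) string {
	if chunkIndex == 0 {
		return sourceFile + ".jsonl"
	}
	return fmt.Sprintf("%s.%d.jsonl", sourceFile, chunkIndex)
}

// chunkCount returns how many JSONL files a source file of rowCount rows
// is split into.
func chunkCount(rowCount int) int {
	if rowCount <= maxRowsPerChunk {
		return 1
	}
	return (rowCount + maxRowsPerChunk - 1) / maxRowsPerChunk
}
```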
Plugin Collector change
Collector buffering will need to change - store a map of rows, keyed by artifact (or trunk??)
Update Chunk event to include:
// trunk name
Trunk string
// artifact name
Artifact string
// earliest time in the artifact
StartTime time.Time
// latest time in the artifact
EndTime time.Time
Now when the CLI receives a chunk it knows where it came from - so if the chunk conversion fails, it can invalidate the collection state for that period.
Conversion State
The CLI Collector maintains a conversion state - a struct containing a map of all unprocessed JSONL chunks, with their source info.
It also contains the PID of the CLI process and the collection params.
- when collection starts, the collector serialises the state
- NOTE: an OS-level mutex is required
- for every chunk received, the state is updated and saved
- every time a chunk is deleted (following successful conversion), the chunk is removed from the state and the state is saved
- add a deleteChunk function to the ConversionState which does both
- when conversion is complete the state file is deleted
When starting a collection, check for conversion state file for that partition
- if the state file exists,
- read the PID from the state file
- if a process with that PID is running, another collection is in progress for this partition - return an error
- if no process with that PID is running, the previous collection must have failed. Use the params in the state file to complete the previous collection
- delete the state file
- delete other collection temp files for this partition - they must be inactive
Design considerations
- Think of all the possible collection failures - are they all recoverable in a future run? If not, should we indicate that in the state file?
- What if a JSON conversion fails and is not recoverable?
  - invalidate the collection state for the period and trunks covered by the failed chunk - use the chunk info containing trunk and time range
- What to do if a recovery process fails - how do we revert the collection state to before the failed JSON conversion?
  - rely on the chunk failure handling to update the collection state?
  - if it fails altogether, fail all chunks??