Description
Key points
- Collection state is saved when a JSONL chunk is written
- If the CLI conversion fails, the collected data is lost
Solution -> do not delete the JSONL files on failure - the CLI resumes the conversion the next time it collects the same partition.
Implementation
Change to JSONL strategy
As far as possible, maintain a 1:1 mapping between source files and JSONL files. Split into multiple JSONL files if the source file is longer than 10K (?) rows, to limit buffering within the plugin Collector.
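A minimal sketch of the splitting rule above. The naming scheme and helper names are assumptions for illustration; only the 10K limit comes from the note (and is itself provisional).

```go
package main

import "fmt"

// maxRowsPerChunk caps buffering in the plugin Collector; 10K is the
// provisional limit from the design note.
const maxRowsPerChunk = 10000

// chunkFileName maps a source file to its nth JSONL chunk, preserving the
// 1:1 source-to-JSONL mapping when the source fits in one chunk.
// (Naming scheme is hypothetical.)
func chunkFileName(sourceFile string, chunkIndex int) string {
	if chunkIndex == 0 {
		return sourceFile + ".jsonl"
	}
	return fmt.Sprintf("%s.%d.jsonl", sourceFile, chunkIndex)
}

// chunkCount returns how many JSONL files a source file of rowCount rows
// is split into.
func chunkCount(rowCount int) int {
	if rowCount <= maxRowsPerChunk {
		return 1
	}
	return (rowCount + maxRowsPerChunk - 1) / maxRowsPerChunk
}
```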
Plugin Collector change
Collector buffering will need to change - store a map of rows, keyed by artifact (or trunk??)
Update Chunk event to include:
// trunk name
Trunk string
// artifact name
Artifact string
// earliest time in the artifact
StartTime time.Time
// latest time in the artifact
EndTime time.Time
Now when the CLI receives a chunk it knows where it came from - so if the chunk conversion fails, it can invalidate the collection state for that period.
Conversion State
The CLI Collector maintains a conversion state - a struct containing a map of all unprocessed JSONL chunks, with their source info.
It also contains the PID of the CLI process and the collection params.
- when collection starts, the collector serialises the state
- NOTE: an OS-level mutex is required
- for every chunk received, the state is updated and saved
- every time a chunk is deleted (following successful conversion), the chunk is removed from the state and the state is saved
- add a deleteChunk function to the ConversionState which does both
- when conversion is complete the state file is deleted
When starting a collection, check for conversion state file for that partition
- if the state file exists,
- read the PID from the state file
- if a process with that PID is running, another collection is in progress for this partition - return an error
- if no process with that PID is running, the previous collection must have failed. Use the params in the state file to complete the previous collection
- delete the state file
- delete other collection temp files for this partition - they must be inactive
Design considerations
- Think of all the possible collection failures - are they all recoverable in a future run? If not, should we indicate that in the state file?
- What if a JSON conversion fails and is not recoverable?
  - invalidate the collection state for the period and trunks covered by the failed chunk - use the chunk info containing trunk and time range
- What to do if a recovery process fails - how do we revert the collection state to before the failed JSON conversion?
  - rely on the chunk failure handling to update the collection state?
  - if it fails altogether, fail all chunks??