This repository has been archived by the owner on Sep 26, 2023. It is now read-only.
Snowflake Manifest
Anton Parkhomenko edited this page Dec 11, 2018 · 1 revision
Snowflake Manifest is a DynamoDB table that maintains the state of the pipeline and is used by both Transformer and Loader to coordinate their work.
Every directory is represented as a DynamoDB record with the following possible attributes:
- `RunId` - path to the original enriched directory in the archive (note the missing `s3://` prefix)
- `AddedAt` - timestamp of when Transformer discovered the directory and added it to the manifest
- `AddedBy` - version of Transformer; purely informational (but still required)
- `ToSkip` - boolean flag indicating that nothing should process this directory (useful for blacklisting a certain period of time)
- `ProcessedAt` - timestamp of when Transformer finished processing
- `ShredTypes` - array of shredded types found in this directory
- `SavedTo` - path to the processed directory in the Snowflake stage (see the configuration)
- `LoadedAt` - timestamp of when Loader finished loading this directory
- `LoadedBy` - version of Loader; purely informational (but still required)
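For orientation, a fully loaded record might look roughly like the Python dictionary below. This is only a sketch: the attribute names come from the list above, but every value, the timestamp representation (epoch seconds) and the shredded type notation are assumptions made for illustration.

```python
# Illustrative only: attribute names are taken from the list above,
# every value is a made-up placeholder.
example_record = {
    "RunId": "enriched/archive/run=2018-12-01-10-00-00/",  # note: no s3:// prefix
    "AddedAt": 1543658400,       # assumed representation: epoch seconds
    "AddedBy": "0.0.0",          # hypothetical version string
    "ToSkip": False,
    "ProcessedAt": 1543659000,
    "ShredTypes": ["example_shredded_type"],  # hypothetical type name
    "SavedTo": "s3://example-stage/transformed/run=2018-12-01-10-00-00",
    "LoadedAt": 1543659600,
    "LoadedBy": "0.0.0",         # hypothetical version string
}

for name, value in example_record.items():
    print(f"{name}: {value!r}")
```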
Loader and Transformer can change the record at three steps:
- Transformer, when it first discovers the folder. At this point the record contains `RunId`, `AddedAt`, `AddedBy`, `ToSkip` (`NEW` state)
- Transformer, when it finishes transformation. At this point the record contains the previous attributes plus `ProcessedAt`, `ShredTypes`, `SavedTo` (`PROCESSED` state)
- Loader, when it finishes loading. At this point the record contains the previous attributes plus `LoadedAt`, `LoadedBy` (`LOADED` state)
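Since the state is implied by which attributes are present, it can be derived mechanically. The sketch below assumes the record has already been read into a plain Python dictionary (not the typed DynamoDB wire format); `manifest_state` is a hypothetical helper written for this page, not part of Transformer or Loader.

```python
# Attribute groups as described in the three steps above.
NEW_ATTRS = ("RunId", "AddedAt", "AddedBy", "ToSkip")
PROCESSED_ATTRS = ("ProcessedAt", "ShredTypes", "SavedTo")
LOADED_ATTRS = ("LoadedAt", "LoadedBy")


def manifest_state(record):
    """Return "NEW", "PROCESSED" or "LOADED" for a record given as a plain dict."""

    def has_all(names):
        return all(name in record for name in names)

    if not has_all(NEW_ATTRS):
        raise ValueError("record lacks the attributes written at discovery")
    # A partially written group is treated as if the step never completed.
    if not has_all(PROCESSED_ATTRS):
        return "NEW"
    if not has_all(LOADED_ATTRS):
        return "PROCESSED"
    return "LOADED"
```

For example, a record holding only `RunId`, `AddedAt`, `AddedBy` and `ToSkip` is reported as `NEW`, and it stays `NEW` until all three processing attributes are present.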
There are multiple scenarios in which the manifest can become inconsistent. The most common one is Transformer failing in the middle of a job.
Transformer can also finish processing but fail to write the new state back.
In either case the record remains in the `NEW` state and, unless it is edited manually, both Transformer and Loader will ignore it in future runs.
To manually move a folder from the `NEW` state to `PROCESSED` (to tell Loader that it can load it), the operator needs to:
- Check that the directory in the stage contains a `_SUCCESS` file, which means that Transformer actually processed the directory. Otherwise, just delete the record
- If the folder is OK, manually add the `ProcessedAt`, `ShredTypes`, `SavedTo` attributes. `ProcessedAt` can be an arbitrary value slightly greater than `AddedAt`; `SavedTo` must correspond to the actual directory (just copy `run=XYZ` from `RunId` and append it to the stage path). If there is even a small chance that a new shredded type was added, the operator needs to either add it to the list or delete the whole record, because Loader aborts the load if it finds an unknown type