This repository has been archived by the owner on Sep 26, 2023. It is now read-only.

Snowflake Manifest

Anton Parkhomenko edited this page Dec 11, 2018 · 1 revision

Snowflake Manifest is a DynamoDB table that maintains the state of the pipeline and is used by both Transformer and Loader to coordinate their work.

Structure

Every directory is represented as a DynamoDB record with the following possible attributes:

  • RunId - path to the original enriched directory in the archive (note the missing s3:// prefix)
  • AddedAt - timestamp of when Transformer discovered the directory and added it to the manifest
  • AddedBy - version of Transformer; purely informational (but still required)
  • ToSkip - boolean flag indicating that nothing should process this directory (useful for blacklisting a certain period of time)
  • ProcessedAt - timestamp of when Transformer finished processing
  • ShredTypes - array of shredded types found in this directory
  • SavedTo - path to the processed directory in the Snowflake stage (see the configuration)
  • LoadedAt - timestamp of when Loader finished loading this directory
  • LoadedBy - version of Loader; purely informational (but still required)
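
For illustration, a fully populated record might look like the Python dictionary below. This is only a sketch: every value is invented, and the timestamp format (epoch seconds here), the version strings, the bucket names and the shredded-type format are assumptions rather than something the manifest is known to use.

```python
from typing import Any, Dict

# Hypothetical manifest record with every attribute populated.
# All values are made up for illustration; real formats may differ.
example_record: Dict[str, Any] = {
    "RunId": "acme-archive/enriched/run=2018-12-01-00-00-00/",  # no s3:// prefix
    "AddedAt": 1543622400,       # assumed to be epoch seconds
    "AddedBy": "0.0.0",          # placeholder version string
    "ToSkip": False,
    "ProcessedAt": 1543626000,
    "ShredTypes": ["iglu:com.acme/example_event/jsonschema/1-0-0"],
    "SavedTo": "s3://acme-snowflake-stage/run=2018-12-01-00-00-00",
    "LoadedAt": 1543629600,
    "LoadedBy": "0.0.0",         # placeholder version string
}
```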

States

Loader and Transformer can change a record at three points:

  • Transformer, when it first discovers the folder. At this point the record contains RunId, AddedAt, AddedBy, ToSkip (NEW state)
  • Transformer, when it finishes transformation. At this point the record contains the previous attributes plus ProcessedAt, ShredTypes, SavedTo (PROCESSED state)
  • Loader, when it finishes loading. At this point the record contains the previous attributes plus LoadedAt, LoadedBy (LOADED state)
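
Since the state is implied by which attributes are present, it can be derived from a record with a small helper. The sketch below assumes the record has already been read into a plain Python mapping; the function name record_state and the constants are hypothetical, not part of Transformer or Loader.

```python
from typing import Any, Iterable, Mapping

# Attribute groups added at each of the three steps described above.
NEW_ATTRS = ("RunId", "AddedAt", "AddedBy", "ToSkip")
PROCESSED_ATTRS = ("ProcessedAt", "ShredTypes", "SavedTo")
LOADED_ATTRS = ("LoadedAt", "LoadedBy")


def record_state(record: Mapping[str, Any]) -> str:
    """Return "NEW", "PROCESSED" or "LOADED" based on the attributes present."""

    def has_all(names: Iterable[str]) -> bool:
        return all(name in record for name in names)

    if not has_all(NEW_ATTRS):
        raise ValueError("record lacks the attributes required for the NEW state")
    if not has_all(PROCESSED_ATTRS):
        return "NEW"
    if not has_all(LOADED_ATTRS):
        return "PROCESSED"
    return "LOADED"
```

A record with only some of the PROCESSED attributes is still reported as NEW here, which is a design choice of this sketch and not documented behaviour.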

Recovery

There are multiple scenarios in which the manifest can become inconsistent. The most common one is Transformer failing in the middle of a job. Transformer can also finish processing but fail to write the new state back. In either case the record remains in the NEW state, and unless the record is edited manually, both Transformer and Loader will ignore it in future runs.

To manually move a folder from the NEW state to PROCESSED (to tell Loader that it can load it), the operator needs to:

  1. Check whether the directory in the stage contains a _SUCCESS file, which means that Transformer actually processed the directory. Otherwise, just delete the record
  2. If the folder is OK, manually add the ProcessedAt, ShredTypes and SavedTo attributes. ProcessedAt can be any timestamp slightly later than AddedAt; SavedTo must correspond to the actual directory (just copy run=XYZ from RunId and append it to the stage path). If there is even a small chance that a new shredded type was added, the operator needs to either add it to the list or delete the whole record, because Loader aborts the load if it finds an unknown type
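
The two steps above can be sketched as a pure Python helper that computes the attributes to add. It does not talk to DynamoDB or S3; the function names, the has_success_file argument, and the assumption that AddedAt is a numeric timestamp are all hypothetical.

```python
import re
from typing import Any, Dict, List, Mapping, Optional


def saved_to_for(run_id: str, stage_path: str) -> str:
    """Copy the run=XYZ segment from RunId and append it to the stage path."""
    match = re.search(r"run=[^/]+", run_id)
    if match is None:
        raise ValueError(f"no run=... segment found in RunId: {run_id!r}")
    return stage_path.rstrip("/") + "/" + match.group(0)


def processed_attributes(
    record: Mapping[str, Any],
    stage_path: str,
    shred_types: List[str],
    has_success_file: bool,
) -> Optional[Dict[str, Any]]:
    """Return the attributes that move a NEW record to PROCESSED.

    Returns None when the _SUCCESS file is missing, meaning the record
    should be deleted instead of being updated.
    """
    if not has_success_file:
        return None
    return {
        # Any value slightly later than AddedAt is acceptable;
        # AddedAt is assumed to be numeric here.
        "ProcessedAt": record["AddedAt"] + 1,
        "ShredTypes": list(shred_types),
        "SavedTo": saved_to_for(record["RunId"], stage_path),
    }
```

The returned attributes still have to be written to the manifest table by whatever tool the operator prefers, and shred_types must list every type actually present in the directory, otherwise Loader will abort.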