Releases: snowplow-incubator/snowplow-event-recovery
Version 0.7.2
This version adds an iglu-webhook workaround for a snowplow-badrows inconsistency in the encoding of query parameters.
As a reminder, since 0.7.0 recovery ships with a useful CLI that simplifies testing and validating configurations.
Version 0.7.1
Fixes recovery of failed events whose querystring objects contain name-value pairs with values not encoded as strings.
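For illustration, a name-value pair like the first one below (the parameter name is hypothetical), whose value is a raw number rather than a string, previously broke recovery; it is now handled the same as its string-encoded form:
{ "name": "limit", "value": 10 }
{ "name": "limit", "value": "10" }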
Version 0.7.0
This release changes configuration semantics and adds a local testing CLI.
Configuration semantics change
Previously, only the first config matching a bad row was applied. Now, all configs whose conditions are fulfilled will be combined
and run in sequence. This makes it much easier to modularise configurations and build reusable scenarios. An illustrative sketch follows.
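For example, given a configuration like the sketch below (the scenario names, empty conditions, and empty steps are purely illustrative), a matching enrichment_failures bad row now flows through both scenarios in order, whereas previously only the first would have run:
{
  "schema": "iglu:com.snowplowanalytics.snowplow/recoveries/jsonschema/4-0-0",
  "data": {
    "iglu:com.snowplowanalytics.snowplow.badrows/enrichment_failures/jsonschema/1-0-0": [
      { "name": "fix-querystring", "conditions": [], "steps": [] },
      { "name": "passthrough", "conditions": [], "steps": [] }
    ]
  }
}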
Local testing CLI
Previously we used an out-of-tree script, which was hard to manage and was eventually deprecated.
The current CLI allows for validating configurations:
snowplow-event-recovery validate --config /opt/scenarios.json
OK! Config valid
And running small-scale tests that output both good and bad events to files:
snowplow-event-recovery run --config $PWD/scenarios.json -o /tmp -i $PWD/input/ --resolver $PWD/resolver.json
Config loaded
Enriching event...
Enriched event 2dfeb9b7-5a87-4214-8a97-a8b23176856b
...
Total Lines: 156, Recovered: 142
It also allows configuring enrichments (except, for now, those that depend on upstream storage) and setting a custom Iglu resolver:
snowplow-event-recovery run --config $PWD/scenarios.json -o /tmp -i $PWD/input/ --resolver $PWD/resolver.json --enrichments $PWD/enrichments
Total Lines: 3, Recovered: 0
OK! Successfully ran recovery.
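The --enrichments flag points at a directory of standard Snowplow enrichment configurations, one JSON file per enrichment. As a minimal sketch, assuming the anon_ip enrichment, such a file could look like this (the parameter value is illustrative):
{
  "schema": "iglu:com.snowplowanalytics.snowplow/anon_ip/jsonschema/1-0-0",
  "data": {
    "name": "anon_ip",
    "vendor": "com.snowplowanalytics.snowplow",
    "enabled": true,
    "parameters": { "anonOctets": 2 }
  }
}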
Version 0.6.1
This patch release brings a couple of small improvements to the process of recovering base64-encoded values. It also upgrades dependencies to mitigate security vulnerabilities.
Changes
Allow filtering base64-encoded fields in conditions (#134)
Upgrade dependencies to mitigate security vulnerabilities (#136)
Handle url-unsafe Base64 encodings (#137)
Version 0.6.0
This release deprecates the Spark runner: its Spark Kinesis connection is not reliable enough for data recovery.
Meanwhile, we've had a consistently good experience with Flink's connector, so we are bringing a Flink runner to event-recovery.
The Spark runner is unlikely to see any new major releases in the future. We advise moving to Flink.
Flink runner
Event recovery jobs in EMR are run using our dataflow-runner CLI. An example configuration for running is available in this repository.
To run a transient EMR cluster and execute the job, run:
dataflow-runner run-transient \
--emr-playbook $REPO_DIR/spark-playbook.json.tmpl \
--emr-config $REPO_DIR/spark-cluster.json.tmpl \
--vars bucket,$BUCKET_INPUT,region,$AWS_REGION,subnet,$AWS_SUBNET,role,$AWS_IAM_ROLE,keypair,$AWS_KEYPAIR,client,$JOB_OWNER,version,$RECOVERY_VERSION,config,$RECOVERY_CONFIG,resolver,$IGLU_RESOLVER,output,$KINESIS_OUTPUT,inputdir,$BUCKET_INPUT_DIRECTORY,interval,$INTERVAL
Checkpointing
The job optionally supports checkpointing to allow for job restarts. The Kinesis connector, which is part of the checkpointing mechanism, guarantees at-least-once delivery semantics and provides backpressure to the job.
To enable checkpointing, set the --checkpoint-interval and --checkpoint-dir flags.
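As a sketch, assuming the interval is expressed in milliseconds and that an S3 path is a valid checkpoint directory, the flags could be appended to the job arguments like so (both values are illustrative):
--checkpoint-interval 60000 --checkpoint-dir s3://my-recovery-bucket/checkpoints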
Monitoring
The EMR Flink job emits a wide range of metrics through the Cloudwatch Agent, which is installed during cluster bootstrap using a provided script. Custom metrics include:
- number of unrecoverable events in the run: taskmanager_container_XXXX_XXX__Process_0_events_unrecoverable
- number of failed recovery attempts: taskmanager_container_XXXX_XXX__Process_0_events_failed
- number of recovered events: taskmanager_container_XXXX_XXX__Process_0_events_recovered
- total number of events submitted for processing: taskmanager_container_XXXX_XXX__Process_0_numRecordsIn
The metrics are delivered to the snowplow/event-recovery Cloudwatch namespace.
Given a one-second aggregation in Cloudwatch, the metrics should be monotonically increasing, with the unrecoverable, failed, and recovered counts adding up to the total number of events submitted (numRecordsIn).
ℹ️ Please note that Flink's numRecordsOut metric for sinks does not reflect the actual number of records saved in the sink.
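To verify that the metrics are arriving, you can list them with the AWS CLI (assuming credentials and region are already configured):
aws cloudwatch list-metrics --namespace snowplow/event-recovery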
Changes
Set event recovery version dynamically in bad rows (#84)
Add stable Flink kinesis job and deprecate Spark (#132)
Add devenv shell and pre-commit hooks
Add dataflow-runner templates
Version 0.5.2
This is a minor release that adds exporting of Spark metrics to Cloudwatch.
To configure the export, set the two new flags --cloudwatch-namespace and --cloudwatch-dimensions.
For example, to use the snowplow/event-recovery namespace and customer:com.snplw.eng, job:recovery dimensions, the dataflow-runner section for this job may look like this:
{
  "type": "CUSTOM_JAR",
  "name": "snowplow-event-recovery",
  "actionOnFailure": "CANCEL_AND_WAIT",
  "jar": "command-runner.jar",
  "arguments": [
    "spark-submit",
    "--class",
    "com.snowplowanalytics.snowplow.event.recovery.Main",
    "--master",
    "yarn",
    "--deploy-mode",
    "cluster",
    "s3://recovery-test/snowplow-event-recovery-spark-0.5.2.jar",
    "--input",
    "hdfs:///local/to-recover/",
    "--directoryOutput",
    "hdfs:///local/recovered/",
    "--region",
    "eu-central-1",
    "--failedOutput",
    "s3://recovery-test/bad-rows/left",
    "--unrecoverableOutput",
    "s3://recovery-test/bad-rows/left/unrecoverable",
    "--resolver",
    "eyJzY2hlbWEiOiJpZ2x1OmNvbS5zbm93cGxvd2FuYWx5dGljcy5pZ2x1L3Jlc29sdmVyLWNvbmZpZy9qc29uc2NoZW1hLzEtMC0xIiwiZGF0YSI6eyJjYWNoZVNpemUiOjAsInJlcG9zaXRvcmllcyI6W3sibmFtZSI6ICJJZ2x1IENlbnRyYWwiLCJwcmlvcml0eSI6IDAsInZlbmRvclByZWZpeGVzIjogWyAiY29tLnNub3dwbG93YW5hbHl0aWNzIiBdLCJjb25uZWN0aW9uIjogeyJodHRwIjp7InVyaSI6Imh0dHA6Ly9pZ2x1Y2VudHJhbC5jb20ifX19XX19",
    "--config",
    "eyAic2NoZW1hIjogImlnbHU6Y29tLnNub3dwbG93YW5hbHl0aWNzLnNub3dwbG93L3JlY292ZXJpZXMvanNvbnNjaGVtYS80LTAtMCIsICJkYXRhIjogeyAiaWdsdTpjb20uc25vd3Bsb3dhbmFseXRpY3Muc25vd3Bsb3cuYmFkcm93cy9lbnJpY2htZW50X2ZhaWx1cmVzL2pzb25zY2hlbWEvMS0wLTAiOiBbeyJuYW1lIjogInBhc3N0aHJvdWdoIiwgImNvbmRpdGlvbnMiOiBbXSwgInN0ZXBzIjogW119XX19",
    "--cloudwatch-namespace",
    "snowplow/event-recovery",
    "--cloudwatch-dimensions",
    "customer:com.snplw.eng;job:recovery"
  ]
}
Changes
Add basic cloudwatch metrics (#128)
Remove flink (#127)
Add CPFV handling for double Base64-encoded payloads (#127)
Version 0.5.1
Version 0.5.0
A maintenance dependency-update release. Does not include new features or bugfixes.
Changes
Bump cats to 2.7.0 (#112)
Bump circe-optics to 0.14.1 (#109)
Bump circe to 0.14.1 (#108)
Bump cats-effect to 2.5.3 (#107)
Bump monocle-macro to 2.1.0 (#106)
Bump cats-core to 2.6.1 (#105)
Bump snowplow-badrows to 2.1.1 (#104)
Bump iglu-scala-client to 1.1.1 (#103)
Bump spark to 3.1.2 (#102)
Bump jackson-databind to 2.10.5.1 (#101)
Bump slf4j to 1.7.36 (#100)
Bump beam-runners-google-cloud-dataflow-java to 2.36.0 (#99)
Bump scio to 0.11.5 (#98)
Add tag for beam sink (#115)
Add bootstrap script for EMR (#118)
Add detailed check for integration spec (#110)
Change Docker base image to eclipse-temurin:11-jre-focal (#93)
Version 0.4.1
This is a bugfix release addressing overly generic handling of ue_px and iglu-webhook logic.
Changes
Only extract iglu-webhook messages to top-level event (#96)
Version 0.4.0
Fixes and improvements release adding ue_pr
support, Add
step and array multi-matching.
Changes
Bump sbt-native-packager to 1.7.6 (#89)
Bump beam base image to 0.2.0 (#88)
Bump scalafmt to 1.6.1 (#85)
Use github actions for CI (#87)
Raise instead of not modifying when querystring invalid (#86)
Coerce ue_pr to CollectorPayload from RawEvent (#81)
Add check transformation for url-encoded fields (#77)
Greedy array matcher (#91)
Add Add step (#80)