Releases: snowplow-incubator/snowplow-event-recovery
Version 0.7.2
This version adds an iglu-webhook workaround for a snowplow-badrows inconsistency in the encoding of query parameters.
As a reminder, since 0.7.0 recovery ships with a useful CLI that simplifies testing and validating configurations.
Version 0.7.1
Fixes recovery of failed events whose querystring objects contain name-value pairs with values not encoded as strings.
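For illustration, a name-value pair like the first one below (the parameter name is hypothetical), whose value is a raw number rather than a string, previously broke recovery; it is now handled the same as its string-encoded form:
{ "name": "limit", "value": 10 }
{ "name": "limit", "value": "10" }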
Version 0.7.0
This release changes configuration semantics and adds a local testing CLI.
Configuration semantics change
Previously, only the first config matching a bad row was applied. Now, all configs whose conditions are fulfilled will be combined
and run in sequence. This makes it much easier to modularise configurations and build reusable scenarios. An illustrative sketch follows.
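For example, given a configuration like the sketch below (the scenario names, empty conditions, and empty steps are purely illustrative), a matching enrichment_failures bad row now flows through both scenarios in order, whereas previously only the first would have run:
{
  "schema": "iglu:com.snowplowanalytics.snowplow/recoveries/jsonschema/4-0-0",
  "data": {
    "iglu:com.snowplowanalytics.snowplow.badrows/enrichment_failures/jsonschema/1-0-0": [
      { "name": "fix-querystring", "conditions": [], "steps": [] },
      { "name": "passthrough", "conditions": [], "steps": [] }
    ]
  }
}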
Local testing CLI
Previously we used an out-of-tree script, which was hard to manage and was eventually deprecated.
The current CLI allows for validating configurations:
snowplow-event-recovery validate --config /opt/scenarios.json
OK! Config valid
And running small-scale tests that output both good and bad events to files:
snowplow-event-recovery run --config $PWD/scenarios.json -o /tmp -i $PWD/input/ --resolver $PWD/resolver.json
Config loaded
Enriching event...
Enriched event 2dfeb9b7-5a87-4214-8a97-a8b23176856b
...
Total Lines: 156, Recovered: 142
It also allows configuring enrichments (except, for now, those that depend on upstream storage) and setting a custom Iglu resolver:
snowplow-event-recovery run --config $PWD/scenarios.json -o /tmp -i $PWD/input/ --resolver $PWD/resolver.json --enrichments $PWD/enrichments
Total Lines: 3, Recovered: 0
OK! Successfully ran recovery.
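The --enrichments flag points at a directory of standard Snowplow enrichment configurations, one JSON file per enrichment. As a minimal sketch, assuming the anon_ip enrichment, such a file could look like this (the parameter value is illustrative):
{
  "schema": "iglu:com.snowplowanalytics.snowplow/anon_ip/jsonschema/1-0-0",
  "data": {
    "name": "anon_ip",
    "vendor": "com.snowplowanalytics.snowplow",
    "enabled": true,
    "parameters": { "anonOctets": 2 }
  }
}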
Version 0.6.1
This patch release brings a couple of small improvements to the process of recovering base64-encoded values. It also upgrades dependencies to mitigate security vulnerabilities.
Changes
Allow filtering base64-encoded fields in conditions (#134)
Upgrade dependencies to mitigate security vulnerabilities (#136)
Handle url-unsafe Base64 encodings (#137)
Version 0.6.0
This release deprecates the Spark runner: its Spark Kinesis connection is not reliable enough for data recovery.
Meanwhile, we've had a consistently good experience with Flink's connector, so we are bringing a Flink runner to event-recovery.
The Spark runner is unlikely to see any new major releases in the future. We advise moving to Flink.
Flink runner
Event recovery jobs in EMR are run using our dataflow-runner CLI. An example configuration for running is available in this repository.
To run a transient EMR cluster and execute the job, run:
dataflow-runner run-transient \
--emr-playbook $REPO_DIR/spark-playbook.json.tmpl \
--emr-config $REPO_DIR/spark-cluster.json.tmpl \
--vars bucket,$BUCKET_INPUT,region,$AWS_REGION,subnet,$AWS_SUBNET,role,$AWS_IAM_ROLE,keypair,$AWS_KEYPAIR,client,$JOB_OWNER,version,$RECOVERY_VERSION,config,$RECOVERY_CONFIG,resolver,$IGLU_RESOLVER,output,$KINESIS_OUTPUT,inputdir,$BUCKET_INPUT_DIRECTORY,interval,$INTERVAL
Checkpointing
The job optionally supports checkpointing to allow for job restarts. The Kinesis connector, which is part of the checkpointing mechanism, guarantees at-least-once delivery semantics and provides backpressure to the job.
To enable checkpointing, set the --checkpoint-interval and --checkpoint-dir flags.
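As a sketch, assuming the interval is expressed in milliseconds and that an S3 path is a valid checkpoint directory, the flags could be appended to the job arguments like so (both values are illustrative):
--checkpoint-interval 60000 --checkpoint-dir s3://my-recovery-bucket/checkpoints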
Monitoring
The EMR Flink job emits a wide range of metrics through the Cloudwatch Agent, which is installed during cluster bootstrap using a provided script. Custom metrics include:
- number of unrecoverable events in the run: taskmanager_container_XXXX_XXX__Process_0_events_unrecoverable
- number of failed recovery attempts: taskmanager_container_XXXX_XXX__Process_0_events_failed
- number of recovered events: taskmanager_container_XXXX_XXX__Process_0_events_recovered
- total number of events submitted for processing: taskmanager_container_XXXX_XXX__Process_0_numRecordsIn
The metrics are delivered to the snowplow/event-recovery Cloudwatch namespace.
Given a one-second aggregation in Cloudwatch, the metrics should be monotonically increasing, with the unrecoverable, failed, and recovered counts adding up to the total number of events submitted (numRecordsIn).
ℹ️ Please note that Flink's numRecordsOut metric for sinks does not reflect the actual number of records saved in the sink.
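To verify that the metrics are arriving, you can list them with the AWS CLI (assuming credentials and region are already configured):
aws cloudwatch list-metrics --namespace snowplow/event-recovery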
Changes
Set event recovery version dynamically in bad rows (#84)
Add stable Flink kinesis job and deprecate Spark (#132)
Add devenv shell and pre-commit hooks
Add dataflow-runner templates
Version 0.5.2
This is a minor release that adds exporting of Spark metrics to Cloudwatch.
To configure the export, set the two new flags --cloudwatch-namespace and --cloudwatch-dimensions.
For example, to use the snowplow/event-recovery namespace and customer:com.snplw.eng, job:recovery dimensions, the dataflow-runner section for this job may look like this:
{
  "type": "CUSTOM_JAR",
  "name": "snowplow-event-recovery",
  "actionOnFailure": "CANCEL_AND_WAIT",
  "jar": "command-runner.jar",
  "arguments": [
    "spark-submit",
    "--class",
    "com.snowplowanalytics.snowplow.event.recovery.Main",
    "--master",
    "yarn",
    "--deploy-mode",
    "cluster",
    "s3://recovery-test/snowplow-event-recovery-spark-0.5.2.jar",
    "--input",
    "hdfs:///local/to-recover/",
    "--directoryOutput",
    "hdfs:///local/recovered/",
    "--region",
    "eu-central-1",
    "--failedOutput",
    "s3://recovery-test/bad-rows/left",
    "--unrecoverableOutput",
    "s3://recovery-test/bad-rows/left/unrecoverable",
    "--resolver",
    "eyJzY2hlbWEiOiJpZ2x1OmNvbS5zbm93cGxvd2FuYWx5dGljcy5pZ2x1L3Jlc29sdmVyLWNvbmZpZy9qc29uc2NoZW1hLzEtMC0xIiwiZGF0YSI6eyJjYWNoZVNpemUiOjAsInJlcG9zaXRvcmllcyI6W3sibmFtZSI6ICJJZ2x1IENlbnRyYWwiLCJwcmlvcml0eSI6IDAsInZlbmRvclByZWZpeGVzIjogWyAiY29tLnNub3dwbG93YW5hbHl0aWNzIiBdLCJjb25uZWN0aW9uIjogeyJodHRwIjp7InVyaSI6Imh0dHA6Ly9pZ2x1Y2VudHJhbC5jb20ifX19XX19",
    "--config",
    "eyAic2NoZW1hIjogImlnbHU6Y29tLnNub3dwbG93YW5hbHl0aWNzLnNub3dwbG93L3JlY292ZXJpZXMvanNvbnNjaGVtYS80LTAtMCIsICJkYXRhIjogeyAiaWdsdTpjb20uc25vd3Bsb3dhbmFseXRpY3Muc25vd3Bsb3cuYmFkcm93cy9lbnJpY2htZW50X2ZhaWx1cmVzL2pzb25zY2hlbWEvMS0wLTAiOiBbeyJuYW1lIjogInBhc3N0aHJvdWdoIiwgImNvbmRpdGlvbnMiOiBbXSwgInN0ZXBzIjogW119XX19",
    "--cloudwatch-namespace",
    "snowplow/event-recovery",
    "--cloudwatch-dimensions",
    "customer:com.snplw.eng;job:recovery"
  ]
}
Changes
Add basic cloudwatch metrics (#128)
Remove flink (#127)
Add CPFV handling for double Base64-encoded payloads (#127)
Version 0.5.1
Version 0.5.0
A maintenance dependency-update release. Does not include new features or bugfixes.
Changes
Bump cats to 2.7.0 (#112)
Bump circe-optics to 0.14.1 (#109)
Bump circe to 0.14.1 (#108)
Bump cats-effect to 2.5.3 (#107)
Bump monocle-macro to 2.1.0 (#106)
Bump cats-core to 2.6.1 (#105)
Bump snowplow-badrows to 2.1.1 (#104)
Bump iglu-scala-client to 1.1.1 (#103)
Bump spark to 3.1.2 (#102)
Bump jackson-databind to 2.10.5.1 (#101)
Bump slf4j to 1.7.36 (#100)
Bump beam-runners-google-cloud-dataflow-java to 2.36.0 (#99)
Bump scio to 0.11.5 (#98)
Add tag for beam sink (#115)
Add bootstrap script for EMR (#118)
Add detailed check for integration spec (#110)
Change Docker base image to eclipse-temurin:11-jre-focal (#93)
Version 0.4.1
This is a bugfix release addressing overly generic handling of ue_px and iglu-webhook logic.
Changes
Only extract iglu-webhook messages to top-level event (#96)
Version 0.4.0
Fixes and improvements release adding ue_pr
support, Add
step and array multi-matching.
Changes
Bump sbt-native-packager to 1.7.6 (#89)
Bump beam base image to 0.2.0 (#88)
Bump scalafmt to 1.6.1 (#85)
Use github actions for CI (#87)
Raise instead of not modifying when querystring invalid (#86)
Coerce ue_pr to CollectorPayload from RawEvent (#81)
Add check transformation for url-encoded fields (#77)
Greedy array matcher (#91)
Add Add step (#80)