refactor(braze): replace gauge with histogram for batch size metrics #4822
base: develop
Conversation
| Cohort / File(s) | Change Summary |
|---|---|
| Prometheus Metric Definitions `src/util/prometheus.js` | Converted `braze_batch_attributes_pack_size`, `braze_batch_events_pack_size`, `braze_batch_purchase_pack_size` from type 'gauge' to 'histogram'; updated help text to distribution phrasing and added buckets [1, 5, 10, 20, 30, 40, 50, 60, 70, 75]. |
| Metric Usage `src/v0/destinations/braze/util.js` | Updated recording of the three Braze batch pack-size metrics to align with histogram semantics (type change reflected where metrics are emitted). |
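Concretely, the change summarized in the first row plausibly amounts to the sketch below. This is a plain-object illustration; the actual config shape used in `src/util/prometheus.js` is an assumption, only the metric name, type, and bucket list come from the PR.

```javascript
// Hypothetical before/after shape of one metric definition; the repo's
// actual metric-config structure may differ.
const before = {
  name: 'braze_batch_attributes_pack_size',
  help: 'Size of the attributes pack in a Braze batch', // single-value phrasing
  type: 'gauge',
};

const after = {
  name: 'braze_batch_attributes_pack_size',
  help: 'Distribution of attributes pack sizes across Braze batches',
  type: 'histogram',
  // Upper bound 75 mirrors the Braze API's 75-item request limit.
  buckets: [1, 5, 10, 20, 30, 40, 50, 60, 70, 75],
};

console.log(before.type, '->', after.type, after.buckets.length);
```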
Estimated code review effort
🎯 2 (Simple) | ⏱️ ~10 minutes
Areas to focus on:
- Verify the histogram bucket boundaries [1, 5, 10, 20, 30, 40, 50, 60, 70, 75] suit expected batch sizes.
- Confirm all metric emissions use histogram methods (e.g., `.observe()`), not gauge methods like `.set()`.
- Search the repository for any other usages of these metric names that assume gauge APIs and update accordingly.
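The second point is the crux of the migration: a gauge's `.set()` overwrites its value, while a histogram's `.observe()` accumulates every sample. A toy mock (not the repo's stats client; names are illustrative) makes the difference visible:

```javascript
// Toy stand-ins for the two metric types; illustrative only.
const gauge = {
  value: 0,
  set(v) { this.value = v; }, // overwrite: earlier chunks are lost
};

const histogram = {
  observations: [],
  observe(v) { this.observations.push(v); }, // every chunk is retained
};

// Simulate one processBatch call emitting three chunk sizes.
for (const chunkSize of [75, 75, 12]) {
  gauge.set(chunkSize);
  histogram.observe(chunkSize);
}

console.log(gauge.value);            // 12 — only the last chunk survives
console.log(histogram.observations); // [ 75, 75, 12 ]
```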
Possibly related PRs
- feat(braze): optimize Track API batching algorithm #4685 — Changes to Braze batching metrics and related batching/statistics emission logic that touch the same metric names and area.
Suggested reviewers
- vinayteki95
- ItsSudip
- koladilip
Pre-merge checks and finishing touches
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main change: replacing gauge metrics with histogram metrics for Braze batch size tracking. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description check | ✅ Passed | The pull request description is comprehensive and well-structured, covering all required template sections with detailed explanations of the problem, solution, and technical rationale. |
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/util/prometheus.js (1 hunks)
- src/v0/destinations/braze/util.js (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.js
⚙️ CodeRabbit configuration file
Focus on ESLint errors (max 3) and warnings (max 5).
Files:
- src/util/prometheus.js
- src/v0/destinations/braze/util.js
🧠 Learnings (2)
📚 Learning: 2025-05-29T13:29:39.436Z
Learnt from: maheshkutty
Repo: rudderlabs/rudder-transformer PR: 4359
File: src/v0/util/index.js:2272-2272
Timestamp: 2025-05-29T13:29:39.436Z
Learning: The `combineBatchRequestsWithSameJobIds` function in `src/v0/util/index.js` is used by both Mixpanel and Google AdWords Offline Conversions destinations in production. The function assumes metadata is always an array in multiple operations (forEach, filter, sort, map) and needs defensive programming to handle non-array metadata cases to prevent runtime errors.
Applied to files:
src/v0/destinations/braze/util.js
📚 Learning: 2025-07-15T06:27:07.528Z
Learnt from: ItsSudip
Repo: rudderlabs/rudder-transformer PR: 4497
File: src/v0/destinations/tiktok_ads/transformV2.js:686-691
Timestamp: 2025-07-15T06:27:07.528Z
Learning: For TikTok Ads destination in src/v0/destinations/tiktok_ads/transformV2.js, there are plans for a future enhancement to group events by eventSource when batching to ensure all events in a batch have the same event_source value, as acknowledged by ItsSudip in PR #4497.
Applied to files:
src/v0/destinations/braze/util.js
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: Build Transformer Docker Image - PR / Build Transformer Docker Image AMD64
- GitHub Check: Build Transformer Docker Image - PR / Build Transformer Docker Image ARM64
- GitHub Check: Build User Transformer Docker Image - PR / Build Transformer Docker Image AMD64
- GitHub Check: Build User Transformer Docker Image - PR / Build Transformer Docker Image ARM64
- GitHub Check: Code Coverage
- GitHub Check: check-health
- GitHub Check: test_and_publish
- GitHub Check: UT Tests
- GitHub Check: Check for formatting & lint errors
- GitHub Check: Analyze (go)
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (3)
src/util/prometheus.js (1)

478-497: Bucket selection is well-aligned with API limits. The histogram buckets [1, 5, 10, 20, 30, 40, 50, 60, 70, 75] are appropriately chosen to match the Braze API maximum request count of 75. This will provide good granularity for analyzing batch fill rates and identifying underutilization patterns.

src/v0/destinations/braze/util.js (2)

559-576: Histogram metric usage is correctly implemented. The conversion from `stats.gauge()` to `stats.histogram()` in the `addTrackStats` function is properly implemented. Each chunk's size is recorded as a histogram observation, which will correctly capture the distribution of batch sizes across all chunks processed. The method signature and parameters remain compatible.

467-476: Subscription metrics correctly remain as gauges. Good decision to keep `braze_batch_subscription_size` and `braze_batch_subscription_combined_size` as gauge metrics. Since subscription batches undergo deduplication via `combineSubscriptionGroups()` and represent a final deduplicated state rather than individual observations, gauges are the appropriate metric type here.
Allure Test reports for this run are available at:
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff            @@
##           develop    #4822   +/-  ##
========================================
  Coverage    92.25%   92.25%
========================================
  Files          654      654
  Lines        35358    35384    +26
  Branches      8315     8328    +13
========================================
+ Hits         32620    32645    +25
- Misses        2503     2504     +1
  Partials       235      235
```

☔ View full report in Codecov by Sentry.
This PR is considered to be stale. It has been open for 20 days with no further activity thus it is going to be closed in 7 days. To avoid such a case please consider removing the stale label manually or add a comment to the PR. |
## Problem Statement

The previous implementation used Prometheus gauge metrics for tracking Braze batch sizes, which caused significant data loss and measurement inaccuracies:

1. **Data Loss from Overwrites**: The `processBatch` function iterates through multiple batch chunks, calling `addTrackStats()` for each chunk. Each call overwrote the previous gauge value, meaning only the LAST chunk size was retained, losing all other batch size data.
2. **Race Conditions**: Multiple concurrent `processBatch()` calls updating the same gauge (same destination_id label) created race conditions where gauge values were randomly overwritten by whichever request completed last.
3. **No Distribution Insights**: Gauges only store a single value, making it impossible to calculate averages, percentiles, or understand the distribution of batch sizes across time.
4. **Incorrect max_over_time Results**: Production graphs using `max_over_time()` on gauges showed unreliable values due to the overwrite behavior and race conditions, not reflecting actual maximum batch sizes.

## Solution: Histogram Metrics

Changed to Prometheus histogram type for these metrics:

- braze_batch_attributes_pack_size
- braze_batch_events_pack_size
- braze_batch_purchase_pack_size

### Why Histogram is Correct

1. **Preserves All Observations**: Every batch chunk size is recorded as an observation, eliminating data loss from overwrites.
2. **Automatic Statistics**: Histogram provides _count, _sum, and _bucket metrics enabling calculation of:
   - Average batch size: sum/count
   - Percentiles: p50, p95, p99 via histogram_quantile()
   - Total batches created: count metric
   - Total items processed: sum metric
3. **Distribution Analysis**: Buckets [1, 5, 10, 20, 30, 40, 50, 60, 70, 75] aligned with the Braze API limit (75 items max) enable tracking:
   - Batch fill rate (capacity utilization)
   - Small batch detection (underutilization)
   - Near-capacity batch percentage
4. **Concurrency-Safe**: Histogram accumulates observations atomically, eliminating race conditions between concurrent processBatch() calls.
5. **Better Operational Insights**: Can answer critical questions:
   - What's the 95th percentile batch size? (efficiency)
   - How many batches are we creating per second? (throughput)
   - What percentage of batches are < 25% capacity? (optimization opportunity)
   - Is batching working effectively? (compare avg to max capacity)

### Technical Details

Buckets chosen based on the TRACK_BRAZE_MAX_REQ_COUNT=75 constraint:

- Lower buckets (1, 5, 10): Detect severely underutilized batches
- Middle buckets (20, 30, 40, 50): Track typical batch sizes
- Upper buckets (60, 70, 75): Track high-efficiency batching

Note: Subscription metrics are kept as gauges since they use different batching logic (deduplication) and are updated once per batch call, not per chunk.

## Testing

- ✅ All 87 Braze component tests pass
- ✅ TypeScript compilation successful
- ✅ No breaking changes to metric names (backward compatible for dashboards)

## Impact

This change enables proper monitoring of Braze batching efficiency and eliminates measurement inaccuracies caused by gauge overwrites and race conditions. The histogram approach follows Prometheus best practices for measuring distributions of observed values.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
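The "Automatic Statistics" point can be sanity-checked in isolation: given the cumulative `_bucket`, `_count`, and `_sum` series a histogram exports, an average and an approximate percentile fall out directly. The interpolation below mimics what PromQL's `histogram_quantile()` does (linear interpolation within the target bucket); the observation values are made up for illustration.

```javascript
const BUCKETS = [1, 5, 10, 20, 30, 40, 50, 60, 70, 75];

// Fold raw observations into the series a histogram would export.
function toSeries(observations) {
  const cumulative = BUCKETS.map(
    (le) => observations.filter((v) => v <= le).length,
  );
  const count = observations.length;
  const sum = observations.reduce((a, v) => a + v, 0);
  return { cumulative, count, sum };
}

// Approximate quantile via linear interpolation inside the target bucket,
// mirroring PromQL's histogram_quantile().
function quantile(q, { cumulative, count }) {
  const rank = q * count;
  let prevBound = 0;
  let prevCum = 0;
  for (let i = 0; i < BUCKETS.length; i += 1) {
    if (cumulative[i] >= rank) {
      const inBucket = cumulative[i] - prevCum;
      return prevBound + ((rank - prevCum) / inBucket) * (BUCKETS[i] - prevBound);
    }
    prevBound = BUCKETS[i];
    prevCum = cumulative[i];
  }
  return BUCKETS[BUCKETS.length - 1];
}

const series = toSeries([75, 75, 12, 40]); // made-up chunk sizes
const avg = series.sum / series.count;
console.log(avg, quantile(0.5, series)); // 50.5 40
```

With a gauge, none of this is derivable: only the most recent `.set()` value exists, so averages and percentiles are simply unavailable.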
Force-pushed: cb9b4f4 to cf40329
Allure Test reports for this run are available at:



What are the changes introduced in this PR?
Problem Statement
The previous implementation used Prometheus gauge metrics for tracking Braze batch sizes, which caused significant data loss and measurement inaccuracies:
Data Loss from Overwrites: The `processBatch` function iterates through multiple batch chunks, calling `addTrackStats()` for each chunk. Each call overwrote the previous gauge value, meaning only the LAST chunk size was retained, losing all other batch size data.

No Distribution Insights: Gauges only store a single value, making it impossible to calculate averages, percentiles, or understand the distribution of batch sizes across time.

Incorrect max_over_time Results: Production graphs using `max_over_time()` on gauges showed unreliable values due to the overwrite behavior, not reflecting actual maximum batch sizes.

Solution: Histogram Metrics
Changed to Prometheus histogram type for these metrics:

- braze_batch_attributes_pack_size
- braze_batch_events_pack_size
- braze_batch_purchase_pack_size

Why Histogram is Correct

Preserves All Observations: Every batch chunk size is recorded as an observation, eliminating data loss from overwrites.

Automatic Statistics: Histogram provides _count, _sum, and _bucket metrics enabling calculation of:

- Average batch size: sum/count
- Percentiles: p50, p95, p99 via histogram_quantile()
- Total batches created: count metric
- Total items processed: sum metric

Distribution Analysis: Buckets [1, 5, 10, 20, 30, 40, 50, 60, 70, 75] aligned with the Braze API limit (75 items max) enable tracking:

- Batch fill rate (capacity utilization)
- Small batch detection (underutilization)
- Near-capacity batch percentage

Concurrency-Safe: Histogram accumulates observations atomically, eliminating race conditions between concurrent processBatch() calls.

Better Operational Insights: Can answer critical questions:

- What's the 95th percentile batch size? (efficiency)
- How many batches are we creating per second? (throughput)
- What percentage of batches are < 25% capacity? (optimization opportunity)
- Is batching working effectively? (compare avg to max capacity)

Technical Details

Buckets chosen based on the TRACK_BRAZE_MAX_REQ_COUNT=75 constraint:

- Lower buckets (1, 5, 10): Detect severely underutilized batches
- Middle buckets (20, 30, 40, 50): Track typical batch sizes
- Upper buckets (60, 70, 75): Track high-efficiency batching

Note: Subscription metrics are kept as gauges since they use different batching logic (deduplication) and are updated once per batch call, not per chunk.
Testing

- All 87 Braze component tests pass
- TypeScript compilation successful
- No breaking changes to metric names (backward compatible for dashboards)
Impact
This change enables proper monitoring of Braze batching efficiency and eliminates measurement inaccuracies caused by gauge overwrites and race conditions. The histogram approach follows Prometheus best practices for measuring distributions of observed values.
What is the related Linear task?
https://linear.app/rudderstack/issue/INT-5500/braze-use-histogram-to-capture-stats-for-object-length-size
Resolves INT-5500
Please explain the objectives of your changes below
Put down any required details on the broader aspect of your changes. If there are any dependent changes, mandatorily mention them here
Any changes to existing capabilities/behaviour, mention the reason & what are the changes ?
N/A
Any new dependencies introduced with this change?
N/A
Any new generic utility introduced or modified. Please explain the changes.
N/A
Any technical or performance related pointers to consider with the change?
N/A
@coderabbitai review
Developer checklist
My code follows the style guidelines of this project
No breaking changes are being introduced.
All related docs linked with the PR?
All changes manually tested?
Any documentation changes needed with this change?
Is the PR limited to 10 file changes?
Is the PR limited to one linear task?
Are relevant unit and component test-cases added in new readability format?
Reviewer checklist
Is the type of change in the PR title appropriate as per the changes?
Verified that there are no credentials or confidential data exposed with the changes.