
Conversation

@maheshkutty
Contributor

@maheshkutty maheshkutty commented Nov 26, 2025

What are the changes introduced in this PR?

Problem Statement

The previous implementation used Prometheus gauge metrics for tracking Braze batch sizes, which caused significant data loss and measurement inaccuracies:

  1. Data Loss from Overwrites: The processBatch function iterates through multiple batch chunks, calling addTrackStats() for each chunk. Each call overwrote the previous gauge value, so only the last chunk's size was retained and all earlier batch size data was lost (a minimal sketch of this behaviour follows the list).

  2. No Distribution Insights: Gauges store only a single value, making it impossible to calculate averages or percentiles, or to understand how batch sizes are distributed over time.

  3. Incorrect max_over_time Results: Production graphs using max_over_time() on gauges showed unreliable values because of the overwrite behavior, and therefore did not reflect actual maximum batch sizes.
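
To make the overwrite problem concrete, here is a minimal, self-contained sketch using prom-client directly; the demo metric names and chunk-size values are hypothetical stand-ins for the real addTrackStats flow:

```js
const { Gauge, Histogram, register } = require('prom-client');

// Hypothetical demo metrics; the real ones live in src/util/prometheus.js.
const packSizeGauge = new Gauge({
  name: 'braze_batch_events_pack_size_gauge_demo',
  help: 'Demo: only the most recent chunk size survives',
});
const packSizeHistogram = new Histogram({
  name: 'braze_batch_events_pack_size_histogram_demo',
  help: 'Demo: every chunk size is recorded as an observation',
  buckets: [1, 5, 10, 20, 30, 40, 50, 60, 70, 75],
});

// A processBatch-style loop over the chunk sizes produced for one request.
const chunkSizes = [75, 75, 12]; // illustrative values

for (const size of chunkSizes) {
  packSizeGauge.set(size); // gauge: each call overwrites the previous value
  packSizeHistogram.observe(size); // histogram: every chunk contributes to _count/_sum/_bucket
}

// After the loop the gauge only reports the final chunk (12), while the histogram
// reflects all three chunks. prom-client >= 13 returns a Promise from metrics().
register.metrics().then((output) => console.log(output));
```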

Solution: Histogram Metrics

Changed to Prometheus histogram type for these metrics:

  • braze_batch_attributes_pack_size
  • braze_batch_events_pack_size
  • braze_batch_purchase_pack_size

Why Histogram is Correct

  1. Preserves All Observations: Every batch chunk size is recorded as an observation, eliminating data loss from overwrites.

  2. Automatic Statistics: Histograms expose _count, _sum, and _bucket series, enabling calculation of:

    • Average batch size: sum/count
    • Percentiles: p50, p95, p99 via histogram_quantile()
    • Total batches created: count metric
    • Total items processed: sum metric
  3. Distribution Analysis: Buckets [1,5,10,20,30,40,50,60,70,75], aligned with the Braze API limit (75 items max), enable tracking of:

    • Batch fill rate (capacity utilization)
    • Small batch detection (underutilization)
    • Near-capacity batch percentage
  4. Concurrency-Safe: Histograms accumulate observations atomically, eliminating race conditions between concurrent processBatch() calls.

  5. Better Operational Insights: These metrics can answer critical operational questions (example queries are sketched after this list):

    • What's the 95th percentile batch size? (efficiency)
    • How many batches are we creating per second? (throughput)
    • What percentage of batches are < 25% capacity? (optimization opportunity)
    • Is batching working effectively? (compare avg to max capacity)
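
For example, queries along these lines can be built on the exported series (sketched here as plain PromQL strings in JavaScript, e.g. for dashboard provisioning; the 5-minute windows and the focus on the events metric are illustrative choices, not part of this PR):

```js
// Average batch size over the last 5 minutes (items per batch).
const avgBatchSize =
  'rate(braze_batch_events_pack_size_sum[5m]) / rate(braze_batch_events_pack_size_count[5m])';

// 95th percentile batch size, aggregated across label sets.
const p95BatchSize =
  'histogram_quantile(0.95, sum by (le) (rate(braze_batch_events_pack_size_bucket[5m])))';

// Batches created per second (throughput).
const batchesPerSecond = 'rate(braze_batch_events_pack_size_count[5m])';

// Share of batches with at most 20 items: the closest bucket boundary to
// "below 25% of the 75-item capacity" given the configured buckets.
const underfilledShare =
  'sum(rate(braze_batch_events_pack_size_bucket{le="20"}[5m])) / sum(rate(braze_batch_events_pack_size_count[5m]))';

module.exports = { avgBatchSize, p95BatchSize, batchesPerSecond, underfilledShare };
```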

Technical Details

Buckets were chosen based on the TRACK_BRAZE_MAX_REQ_COUNT=75 constraint (a registration sketch follows this list):

  • Lower buckets (1,5,10): Detect severely underutilized batches
  • Middle buckets (20,30,40,50): Track typical batch sizes
  • Upper buckets (60,70,75): Track high-efficiency batching
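
A minimal sketch of what a histogram registration with these buckets looks like (shown with prom-client directly; the exact definition format and help text in src/util/prometheus.js may differ):

```js
const { Histogram } = require('prom-client');

// One of the three converted metrics; the other two follow the same pattern.
const brazeBatchEventsPackSize = new Histogram({
  name: 'braze_batch_events_pack_size',
  help: 'Distribution of event batch sizes sent to Braze',
  buckets: [1, 5, 10, 20, 30, 40, 50, 60, 70, 75], // aligned with TRACK_BRAZE_MAX_REQ_COUNT = 75
});

// Each chunk produced by processBatch records one observation instead of overwriting a gauge value.
brazeBatchEventsPackSize.observe(42);
```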

Note: Subscription metrics are kept as gauges since they use different batching logic (deduplication) and are updated once per batch call, not per chunk.

Testing

  • ✅ All 87 Braze component tests pass
  • ✅ TypeScript compilation successful
  • ✅ No breaking changes to metric names (backward compatible for dashboards)

Impact

This change enables proper monitoring of Braze batching efficiency and eliminates measurement inaccuracies caused by gauge overwrites and race conditions. The histogram approach follows Prometheus best practices for measuring distributions of observed values.

What is the related Linear task?

https://linear.app/rudderstack/issue/INT-5500/braze-use-histogram-to-capture-stats-for-object-length-size
Resolves INT-5500

Please explain the objectives of your changes below

Put down any required details on the broader aspect of your changes. If there are any dependent changes, mandatorily mention them here

Any changes to existing capabilities/behaviour, mention the reason & what are the changes?

N/A

Any new dependencies introduced with this change?

N/A

Any new generic utility introduced or modified. Please explain the changes.

N/A

Any technical or performance related pointers to consider with the change?

N/A

@coderabbitai review


Developer checklist

  • My code follows the style guidelines of this project

  • No breaking changes are being introduced.

  • All related docs linked with the PR?

  • All changes manually tested?

  • Any documentation changes needed with this change?

  • Is the PR limited to 10 file changes?

  • Is the PR limited to one linear task?

  • Are relevant unit and component test-cases added in new readability format?

Reviewer checklist

  • Is the type of change in the PR title appropriate as per the changes?

  • Verified that there are no credentials or confidential data exposed with the changes.

@maheshkutty maheshkutty requested review from a team as code owners November 26, 2025 08:40
@coderabbitai
Contributor

coderabbitai bot commented Nov 26, 2025

Note

.coderabbit.yaml has unrecognized properties

CodeRabbit is using all valid settings from your configuration. Unrecognized properties (listed below) have been ignored and may indicate typos or deprecated fields that can be removed.

⚠️ Parsing warnings (1)
Validation error: Unrecognized key(s) in object: 'auto_resolve_threads'
⚙️ Configuration instructions
  • Please see the configuration documentation for more information.
  • You can also validate your configuration using the online YAML validator.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Walkthrough

Three Braze batch pack-size metrics were changed from Gauge to Histogram in both the Prometheus metric definitions and their usage; help text was updated to reflect distribution measurement, and explicit histogram buckets were added.

Changes

  • Prometheus Metric Definitions (src/util/prometheus.js): Converted braze_batch_attributes_pack_size, braze_batch_events_pack_size, braze_batch_purchase_pack_size from type 'gauge' to 'histogram'; updated help text to distribution phrasing and added buckets [1, 5, 10, 20, 30, 40, 50, 60, 70, 75].
  • Metric Usage (src/v0/destinations/braze/util.js): Updated recording of the three Braze batch pack-size metrics to align with histogram semantics (type change reflected where metrics are emitted).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Areas to focus on:

  • Verify histogram bucket boundaries [1, 5, 10, 20, 30, 40, 50, 60, 70, 75] suit expected batch sizes.
  • Confirm all metric emissions use histogram methods (e.g., .observe()), not gauge methods like .set().
  • Search repository for any other usages of these metric names that assume gauge APIs and update accordingly.

Possibly related PRs

Suggested reviewers

  • vinayteki95
  • ItsSudip
  • koladilip

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title check ✅ Passed: The title accurately summarizes the main change: replacing gauge metrics with histogram metrics for Braze batch size tracking.
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Description check ✅ Passed: The pull request description is comprehensive and well-structured, covering all required template sections with detailed explanations of the problem, solution, and technical rationale.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3925624 and cb9b4f4.

📒 Files selected for processing (2)
  • src/util/prometheus.js (1 hunks)
  • src/v0/destinations/braze/util.js (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.js

⚙️ CodeRabbit configuration file

Focus on ESLint errors (max 3) and warnings (max 5).

Files:

  • src/util/prometheus.js
  • src/v0/destinations/braze/util.js
🧠 Learnings (2)
📚 Learning: 2025-05-29T13:29:39.436Z
Learnt from: maheshkutty
Repo: rudderlabs/rudder-transformer PR: 4359
File: src/v0/util/index.js:2272-2272
Timestamp: 2025-05-29T13:29:39.436Z
Learning: The `combineBatchRequestsWithSameJobIds` function in `src/v0/util/index.js` is used by both Mixpanel and Google AdWords Offline Conversions destinations in production. The function assumes metadata is always an array in multiple operations (forEach, filter, sort, map) and needs defensive programming to handle non-array metadata cases to prevent runtime errors.

Applied to files:

  • src/v0/destinations/braze/util.js
📚 Learning: 2025-07-15T06:27:07.528Z
Learnt from: ItsSudip
Repo: rudderlabs/rudder-transformer PR: 4497
File: src/v0/destinations/tiktok_ads/transformV2.js:686-691
Timestamp: 2025-07-15T06:27:07.528Z
Learning: For TikTok Ads destination in src/v0/destinations/tiktok_ads/transformV2.js, there are plans for a future enhancement to group events by eventSource when batching to ensure all events in a batch have the same event_source value, as acknowledged by ItsSudip in PR #4497.

Applied to files:

  • src/v0/destinations/braze/util.js
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Build Transformer Docker Image - PR / Build Transformer Docker Image AMD64
  • GitHub Check: Build Transformer Docker Image - PR / Build Transformer Docker Image ARM64
  • GitHub Check: Build User Transformer Docker Image - PR / Build Transformer Docker Image AMD64
  • GitHub Check: Build User Transformer Docker Image - PR / Build Transformer Docker Image ARM64
  • GitHub Check: Code Coverage
  • GitHub Check: check-health
  • GitHub Check: test_and_publish
  • GitHub Check: UT Tests
  • GitHub Check: Check for formatting & lint errors
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (3)
src/util/prometheus.js (1)

478-497: Bucket selection is well-aligned with API limits.

The histogram buckets [1, 5, 10, 20, 30, 40, 50, 60, 70, 75] are appropriately chosen to match the Braze API maximum request count of 75. This will provide good granularity for analyzing batch fill rates and identifying underutilization patterns.

src/v0/destinations/braze/util.js (2)

559-576: Histogram metric usage is correctly implemented.

The conversion from stats.gauge() to stats.histogram() in the addTrackStats function is properly implemented. Each chunk's size is recorded as a histogram observation, which will correctly capture the distribution of batch sizes across all chunks processed. The method signature and parameters remain compatible.


467-476: Subscription metrics correctly remain as gauges.

Good decision to keep braze_batch_subscription_size and braze_batch_subscription_combined_size as gauge metrics. Since subscription batches undergo deduplication via combineSubscriptionGroups() and represent a final deduplicated state rather than individual observations, gauges are the appropriate metric type here.
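
For contrast, a minimal sketch of the gauge pattern that remains appropriate for the subscription path (the metric name carries a _demo suffix and the dedup step is illustrative, not the actual combineSubscriptionGroups logic):

```js
const { Gauge } = require('prom-client');

// Illustrative stand-in for the subscription-size gauge.
const subscriptionBatchSize = new Gauge({
  name: 'braze_batch_subscription_combined_size_demo',
  help: 'Demo: size of the final deduplicated subscription batch',
});

// Hypothetical dedup step: subscription groups are merged into one final batch,
// so there is a single meaningful value per processBatch call.
const subscriptionGroups = [{ id: 'g1' }, { id: 'g1' }, { id: 'g2' }];
const deduplicated = [...new Map(subscriptionGroups.map((g) => [g.id, g])).values()];

// One .set() per batch call reflects the final state, which is exactly what a gauge models.
subscriptionBatchSize.set(deduplicated.length);
```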

@devops-github-rudderstack
Contributor

Allure Test reports for this run are available at:

@codecov

codecov bot commented Nov 26, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.25%. Comparing base (9a0036e) to head (cf40329).
⚠️ Report is 8 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #4822   +/-   ##
========================================
  Coverage    92.25%   92.25%           
========================================
  Files          654      654           
  Lines        35358    35384   +26     
  Branches      8315     8328   +13     
========================================
+ Hits         32620    32645   +25     
- Misses        2503     2504    +1     
  Partials       235      235           

☔ View full report in Codecov by Sentry.


@devops-github-rudderstack
Contributor

This PR is considered stale. It has been open for 20 days with no further activity, so it will be closed in 7 days. To avoid this, please remove the stale label manually or add a comment to the PR.

@maheshkutty maheshkutty force-pushed the refactor.braze_object_metrics branch from cb9b4f4 to cf40329 on December 23, 2025 11:26
@devops-github-rudderstack
Contributor

Allure Test reports for this run are available at:

@sonarqubecloud

