
Conversation

@maheshkutty
Contributor

@maheshkutty maheshkutty commented Nov 26, 2025

What are the changes introduced in this PR?

Problem Statement

The previous implementation used Prometheus gauge metrics for tracking Braze batch sizes, which caused significant data loss and measurement inaccuracies:

  1. Data Loss from Overwrites: The processBatch function iterates through multiple batch chunks, calling addTrackStats() for each chunk. Each call overwrote the previous gauge value, so only the last chunk's size was retained and all earlier batch size data was lost (a minimal sketch of this behaviour follows the list).

  2. No Distribution Insights: Gauges store only a single value, making it impossible to calculate averages or percentiles, or to understand how batch sizes are distributed over time.

  3. Incorrect max_over_time Results: Production graphs using max_over_time() on gauges showed unreliable values because of the overwrite behavior, and therefore did not reflect actual maximum batch sizes.
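
To make the overwrite problem concrete, here is a minimal, self-contained sketch using prom-client directly; the demo metric names and chunk-size values are hypothetical stand-ins for the real addTrackStats flow:

```js
const { Gauge, Histogram, register } = require('prom-client');

// Hypothetical demo metrics; the real ones live in src/util/prometheus.js.
const packSizeGauge = new Gauge({
  name: 'braze_batch_events_pack_size_gauge_demo',
  help: 'Demo: only the most recent chunk size survives',
});
const packSizeHistogram = new Histogram({
  name: 'braze_batch_events_pack_size_histogram_demo',
  help: 'Demo: every chunk size is recorded as an observation',
  buckets: [1, 5, 10, 20, 30, 40, 50, 60, 70, 75],
});

// A processBatch-style loop over the chunk sizes produced for one request.
const chunkSizes = [75, 75, 12]; // illustrative values

for (const size of chunkSizes) {
  packSizeGauge.set(size); // gauge: each call overwrites the previous value
  packSizeHistogram.observe(size); // histogram: every chunk contributes to _count/_sum/_bucket
}

// After the loop the gauge only reports the final chunk (12), while the histogram
// reflects all three chunks. prom-client >= 13 returns a Promise from metrics().
register.metrics().then((output) => console.log(output));
```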

Solution: Histogram Metrics

Changed to Prometheus histogram type for these metrics:

  • braze_batch_attributes_pack_size
  • braze_batch_events_pack_size
  • braze_batch_purchase_pack_size

Why Histogram is Correct

  1. Preserves All Observations: Every batch chunk size is recorded as an observation, eliminating data loss from overwrites.

  2. Automatic Statistics: Histograms expose _count, _sum, and _bucket series, enabling calculation of:

    • Average batch size: sum/count
    • Percentiles: p50, p95, p99 via histogram_quantile()
    • Total batches created: count metric
    • Total items processed: sum metric
  3. Distribution Analysis: Buckets [1,5,10,20,30,40,50,60,70,75], aligned with the Braze API limit (75 items max), enable tracking of:

    • Batch fill rate (capacity utilization)
    • Small batch detection (underutilization)
    • Near-capacity batch percentage
  4. Concurrency-Safe: Histograms accumulate observations atomically, eliminating race conditions between concurrent processBatch() calls.

  5. Better Operational Insights: These metrics can answer critical operational questions (example queries are sketched after this list):

    • What's the 95th percentile batch size? (efficiency)
    • How many batches are we creating per second? (throughput)
    • What percentage of batches are < 25% capacity? (optimization opportunity)
    • Is batching working effectively? (compare avg to max capacity)
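
For example, queries along these lines can be built on the exported series (sketched here as plain PromQL strings in JavaScript, e.g. for dashboard provisioning; the 5-minute windows and the focus on the events metric are illustrative choices, not part of this PR):

```js
// Average batch size over the last 5 minutes (items per batch).
const avgBatchSize =
  'rate(braze_batch_events_pack_size_sum[5m]) / rate(braze_batch_events_pack_size_count[5m])';

// 95th percentile batch size, aggregated across label sets.
const p95BatchSize =
  'histogram_quantile(0.95, sum by (le) (rate(braze_batch_events_pack_size_bucket[5m])))';

// Batches created per second (throughput).
const batchesPerSecond = 'rate(braze_batch_events_pack_size_count[5m])';

// Share of batches with at most 20 items: the closest bucket boundary to
// "below 25% of the 75-item capacity" given the configured buckets.
const underfilledShare =
  'sum(rate(braze_batch_events_pack_size_bucket{le="20"}[5m])) / sum(rate(braze_batch_events_pack_size_count[5m]))';

module.exports = { avgBatchSize, p95BatchSize, batchesPerSecond, underfilledShare };
```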

Technical Details

Buckets were chosen based on the TRACK_BRAZE_MAX_REQ_COUNT=75 constraint (a registration sketch follows this list):

  • Lower buckets (1,5,10): Detect severely underutilized batches
  • Middle buckets (20,30,40,50): Track typical batch sizes
  • Upper buckets (60,70,75): Track high-efficiency batching
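
A minimal sketch of what a histogram registration with these buckets looks like (shown with prom-client directly; the exact definition format and help text in src/util/prometheus.js may differ):

```js
const { Histogram } = require('prom-client');

// One of the three converted metrics; the other two follow the same pattern.
const brazeBatchEventsPackSize = new Histogram({
  name: 'braze_batch_events_pack_size',
  help: 'Distribution of event batch sizes sent to Braze',
  buckets: [1, 5, 10, 20, 30, 40, 50, 60, 70, 75], // aligned with TRACK_BRAZE_MAX_REQ_COUNT = 75
});

// Each chunk produced by processBatch records one observation instead of overwriting a gauge value.
brazeBatchEventsPackSize.observe(42);
```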

Note: Subscription metrics are kept as gauges since they use different batching logic (deduplication) and are updated once per batch call, not per chunk.

Testing

  • ✅ All 87 Braze component tests pass
  • ✅ TypeScript compilation successful
  • ✅ No breaking changes to metric names (backward compatible for dashboards)

Impact

This change enables proper monitoring of Braze batching efficiency and eliminates measurement inaccuracies caused by gauge overwrites and race conditions. The histogram approach follows Prometheus best practices for measuring distributions of observed values.

What is the related Linear task?

https://linear.app/rudderstack/issue/INT-5500/braze-use-histogram-to-capture-stats-for-object-length-size
Resolves INT-5500

Please explain the objectives of your changes below

Put down any required details on the broader aspect of your changes. If there are any dependent changes, mandatorily mention them here

Any changes to existing capabilities/behaviour, mention the reason & what are the changes?

N/A

Any new dependencies introduced with this change?

N/A

Any new generic utility introduced or modified. Please explain the changes.

N/A

Any technical or performance related pointers to consider with the change?

N/A

@coderabbitai review


Developer checklist

  • My code follows the style guidelines of this project

  • No breaking changes are being introduced.

  • All related docs linked with the PR?

  • All changes manually tested?

  • Any documentation changes needed with this change?

  • Is the PR limited to 10 file changes?

  • Is the PR limited to one linear task?

  • Are relevant unit and component test-cases added in new readability format?

Reviewer checklist

  • Is the type of change in the PR title appropriate as per the changes?

  • Verified that there are no credentials or confidential data exposed with the changes.

@maheshkutty maheshkutty requested review from a team as code owners November 26, 2025 08:40
@coderabbitai
Contributor

coderabbitai bot commented Nov 26, 2025

Note

.coderabbit.yaml has unrecognized properties

CodeRabbit is using all valid settings from your configuration. Unrecognized properties (listed below) have been ignored and may indicate typos or deprecated fields that can be removed.

⚠️ Parsing warnings (1)
Validation error: Unrecognized key(s) in object: 'auto_resolve_threads'
⚙️ Configuration instructions
  • Please see the configuration documentation for more information.
  • You can also validate your configuration using the online YAML validator.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Walkthrough

Three Braze batch pack-size metrics were changed from Gauge to Histogram in both the Prometheus metric definitions and their usage; help text was updated to reflect distribution measurement, and explicit histogram buckets were added.

Changes

  • Prometheus Metric Definitions (src/util/prometheus.js): Converted braze_batch_attributes_pack_size, braze_batch_events_pack_size, braze_batch_purchase_pack_size from type 'gauge' to 'histogram'; updated help text to distribution phrasing and added buckets [1, 5, 10, 20, 30, 40, 50, 60, 70, 75].
  • Metric Usage (src/v0/destinations/braze/util.js): Updated recording of the three Braze batch pack-size metrics to align with histogram semantics (type change reflected where metrics are emitted).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Areas to focus on:

  • Verify histogram bucket boundaries [1, 5, 10, 20, 30, 40, 50, 60, 70, 75] suit expected batch sizes.
  • Confirm all metric emissions use histogram methods (e.g., .observe()), not gauge methods like .set().
  • Search repository for any other usages of these metric names that assume gauge APIs and update accordingly.

Possibly related PRs

Suggested reviewers

  • vinayteki95
  • ItsSudip
  • koladilip

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title check ✅ Passed: The title accurately summarizes the main change: replacing gauge metrics with histogram metrics for Braze batch size tracking.
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Description check ✅ Passed: The pull request description is comprehensive and well-structured, covering all required template sections with detailed explanations of the problem, solution, and technical rationale.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3925624 and cb9b4f4.

📒 Files selected for processing (2)
  • src/util/prometheus.js (1 hunks)
  • src/v0/destinations/braze/util.js (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.js

⚙️ CodeRabbit configuration file

Focus on ESLint errors (max 3) and warnings (max 5).

Files:

  • src/util/prometheus.js
  • src/v0/destinations/braze/util.js
🧠 Learnings (2)
📚 Learning: 2025-05-29T13:29:39.436Z
Learnt from: maheshkutty
Repo: rudderlabs/rudder-transformer PR: 4359
File: src/v0/util/index.js:2272-2272
Timestamp: 2025-05-29T13:29:39.436Z
Learning: The `combineBatchRequestsWithSameJobIds` function in `src/v0/util/index.js` is used by both Mixpanel and Google AdWords Offline Conversions destinations in production. The function assumes metadata is always an array in multiple operations (forEach, filter, sort, map) and needs defensive programming to handle non-array metadata cases to prevent runtime errors.

Applied to files:

  • src/v0/destinations/braze/util.js
📚 Learning: 2025-07-15T06:27:07.528Z
Learnt from: ItsSudip
Repo: rudderlabs/rudder-transformer PR: 4497
File: src/v0/destinations/tiktok_ads/transformV2.js:686-691
Timestamp: 2025-07-15T06:27:07.528Z
Learning: For TikTok Ads destination in src/v0/destinations/tiktok_ads/transformV2.js, there are plans for a future enhancement to group events by eventSource when batching to ensure all events in a batch have the same event_source value, as acknowledged by ItsSudip in PR #4497.

Applied to files:

  • src/v0/destinations/braze/util.js
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Build Transformer Docker Image - PR / Build Transformer Docker Image AMD64
  • GitHub Check: Build Transformer Docker Image - PR / Build Transformer Docker Image ARM64
  • GitHub Check: Build User Transformer Docker Image - PR / Build Transformer Docker Image AMD64
  • GitHub Check: Build User Transformer Docker Image - PR / Build Transformer Docker Image ARM64
  • GitHub Check: Code Coverage
  • GitHub Check: check-health
  • GitHub Check: test_and_publish
  • GitHub Check: UT Tests
  • GitHub Check: Check for formatting & lint errors
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (3)
src/util/prometheus.js (1)

478-497: Bucket selection is well-aligned with API limits.

The histogram buckets [1, 5, 10, 20, 30, 40, 50, 60, 70, 75] are appropriately chosen to match the Braze API maximum request count of 75. This will provide good granularity for analyzing batch fill rates and identifying underutilization patterns.

src/v0/destinations/braze/util.js (2)

559-576: Histogram metric usage is correctly implemented.

The conversion from stats.gauge() to stats.histogram() in the addTrackStats function is properly implemented. Each chunk's size is recorded as a histogram observation, which will correctly capture the distribution of batch sizes across all chunks processed. The method signature and parameters remain compatible.


467-476: Subscription metrics correctly remain as gauges.

Good decision to keep braze_batch_subscription_size and braze_batch_subscription_combined_size as gauge metrics. Since subscription batches undergo deduplication via combineSubscriptionGroups() and represent a final deduplicated state rather than individual observations, gauges are the appropriate metric type here.
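
For contrast, a minimal sketch of the gauge pattern that remains appropriate for the subscription path (the metric name carries a _demo suffix and the dedup step is illustrative, not the actual combineSubscriptionGroups logic):

```js
const { Gauge } = require('prom-client');

// Illustrative stand-in for the subscription-size gauge.
const subscriptionBatchSize = new Gauge({
  name: 'braze_batch_subscription_combined_size_demo',
  help: 'Demo: size of the final deduplicated subscription batch',
});

// Hypothetical dedup step: subscription groups are merged into one final batch,
// so there is a single meaningful value per processBatch call.
const subscriptionGroups = [{ id: 'g1' }, { id: 'g1' }, { id: 'g2' }];
const deduplicated = [...new Map(subscriptionGroups.map((g) => [g.id, g])).values()];

// One .set() per batch call reflects the final state, which is exactly what a gauge models.
subscriptionBatchSize.set(deduplicated.length);
```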

@devops-github-rudderstack
Contributor

Allure Test reports for this run are available at:

@codecov

codecov bot commented Nov 26, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.25%. Comparing base (9a0036e) to head (cf40329).
⚠️ Report is 8 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #4822   +/-   ##
========================================
  Coverage    92.25%   92.25%           
========================================
  Files          654      654           
  Lines        35358    35384   +26     
  Branches      8315     8328   +13     
========================================
+ Hits         32620    32645   +25     
- Misses        2503     2504    +1     
  Partials       235      235           

☔ View full report in Codecov by Sentry.


@devops-github-rudderstack
Contributor

This PR is considered stale. It has been open for 20 days with no further activity, so it will be closed in 7 days. To avoid this, please remove the stale label manually or add a comment to the PR.

@maheshkutty maheshkutty force-pushed the refactor.braze_object_metrics branch from cb9b4f4 to cf40329 on December 23, 2025 11:26
@devops-github-rudderstack
Contributor

Allure Test reports for this run are available at:

@sonarqubecloud

