feat: support round-robin load balancing for routing & add retry mechanism for TSF event reporting #699

Merged
SkyeBeFreeman merged 2 commits into polarismesh:main from shedfreewu:main
Mar 18, 2026

@shedfreewu
Contributor

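Change overview

This PR contains two independent changes:

  1. Round-robin support for routing load balancing: the route key is generated dynamically from the current instance list to implement weighted round-robin load balancing.
  2. Retry mechanism for TSF event reporting: separate retry strategies are introduced for V1 business failures and for network exceptions.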

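Change details

I. Round-robin support for routing load balancing (WeightedRoundRobinBalance)

Description of changes:

  • Previously the route key was the fixed string namespace.service. When the instance list changed after routing rules filtered it, the round-robin state could not detect the change, which led to unbalanced load.
  • Added generateRouteKey(List<Instance>), which sorts the IDs (or host:port) of the current instance list and uses the hashCode of the result as the route key.
  • The same set of instances produces the same route key regardless of order, which keeps the round-robin state consistent.
  • Added full unit test coverage.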
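
The route-key idea described above can be sketched roughly as follows. This is not the code from the PR: SketchInstance is a hypothetical stand-in for the SDK's Instance type, the String return type is a guess, and falling back to host:port only when the ID is null or empty is an assumption about what "ID (or host:port)" means.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the SDK's Instance type; only the fields needed here.
final class SketchInstance {
    final String id;   // may be null or empty
    final String host;
    final int port;

    SketchInstance(String id, String host, int port) {
        this.id = id;
        this.host = host;
        this.port = port;
    }
}

final class RouteKeySketch {
    // Builds an order-independent key from the current instance set.
    static String generateRouteKey(List<SketchInstance> instances) {
        List<String> parts = new ArrayList<>();
        for (SketchInstance in : instances) {
            // Assumption: use the instance ID when present, otherwise host:port.
            boolean hasId = in.id != null && !in.id.isEmpty();
            parts.add(hasId ? in.id : in.host + ":" + in.port);
        }
        // Sorting first makes the key independent of the input order.
        Collections.sort(parts);
        return String.valueOf(parts.hashCode());
    }
}
```

Because the key is a hash code, two different instance sets could in principle collide and share round-robin state; the sketch does not try to handle that case.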

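II. Retry mechanism for TSF event reporting (TsfEventReporter)

Description of changes:

Two different failure scenarios each get their own retry strategy:

1. Retry on V1 business failure (retCode != 0)

  • When the server returns retCode != 0, the failure is treated as a business-level failure and retried synchronously and immediately within the current reporting call, up to 3 times (V1_MAX_RETRY = 3).
  • Each retry rebuilds the HttpPost and StringEntity, so a retry never fails because the entity has already been consumed and cannot be read again.
  • Once the maximum number of retries is exceeded, a warning is logged and the current batch of events is dropped, so later reports are not blocked.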
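
A minimal sketch of this bounded synchronous retry, under stated assumptions: sendOnce is a hypothetical hook that builds a fresh request and returns the server's retCode, and "up to 3 retries" is read here as one initial attempt plus at most V1_MAX_RETRY further attempts. The actual loop in TsfEventReporter may count differently.

```java
import java.util.function.IntSupplier;

final class V1RetrySketch {
    static final int V1_MAX_RETRY = 3;

    // Returns true as soon as one attempt yields retCode == 0.
    // sendOnce (hypothetical) must build a new request body on every call,
    // so no attempt reuses an entity that was already consumed.
    static boolean sendWithRetry(IntSupplier sendOnce) {
        // Assumption: one initial attempt, then up to V1_MAX_RETRY retries.
        for (int attempt = 0; attempt <= V1_MAX_RETRY; attempt++) {
            if (sendOnce.getAsInt() == 0) {
                return true;
            }
        }
        // The caller would log a warning and drop this batch.
        return false;
    }
}
```

Returning a boolean rather than throwing keeps the drop-and-continue behavior explicit at the call site.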

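2. Pause/resume mechanism on network exceptions (Exception)

  • When a network exception occurs (connection timeout, IO exception, and so on), individual events are not retried one by one. Instead, the strategy is "pause queue consumption + delayed resume", which avoids piling up tasks.
  • Flow:
    • After the exception, the events already taken from the queue are put back at the head of the queue in reverse order, which preserves the original order and avoids losing events.
    • paused = true is set to pause all queue consumption (the V1 queue and the Report queue share the same pause flag).
    • A separate retryExecutors scheduler resumes consumption automatically after a 60-second delay (paused = false).
    • Idempotency guard: if consumption is already paused, no additional resume task is scheduled.
  • A global retry counter, commonRetryCount, is introduced and allows at most 120 retries (with the 60-second delay, roughly 2 hours of continuous retrying at most).
  • Once the maximum number of retries is exceeded, all queues are cleared and the state is reset, giving up completely so that memory cannot grow without bound.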
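
The pause flag, the idempotent resume scheduling, and the give-up path could fit together roughly as below. Only paused, retryExecutors, and commonRetryCount are names taken from the description; the class, MAX_COMMON_RETRY, the queue field names, the String event type, and the exact point at which the counter is incremented are all assumptions made for this sketch.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class PauseResumeSketch {
    static final int MAX_COMMON_RETRY = 120; // hypothetical constant name

    final LinkedBlockingDeque<String> v1Queue = new LinkedBlockingDeque<>();
    final LinkedBlockingDeque<String> reportQueue = new LinkedBlockingDeque<>();
    final AtomicBoolean paused = new AtomicBoolean(false); // shared by both queues
    final AtomicInteger commonRetryCount = new AtomicInteger(0);
    final long resumeDelaySeconds; // 60 in the description; injectable for tests
    final ScheduledExecutorService retryExecutors =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "event-retry-sketch");
                t.setDaemon(true); // do not keep the JVM alive for a pending resume
                return t;
            });

    PauseResumeSketch(long resumeDelaySeconds) {
        this.resumeDelaySeconds = resumeDelaySeconds;
    }

    // Consumers would check this flag before taking events from either queue.
    boolean isPaused() {
        return paused.get();
    }

    // Called after the failed batch has been put back at the head of its queue.
    void onNetworkError() {
        if (commonRetryCount.incrementAndGet() > MAX_COMMON_RETRY) {
            // Give up: drop everything and reset so memory cannot grow unbounded.
            v1Queue.clear();
            reportQueue.clear();
            commonRetryCount.set(0);
            paused.set(false);
            return;
        }
        // Idempotency guard: only the caller that flips false -> true schedules a resume.
        if (paused.compareAndSet(false, true)) {
            retryExecutors.schedule(() -> paused.set(false),
                    resumeDelaySeconds, TimeUnit.SECONDS);
        }
    }
}
```

Using compareAndSet for the guard means two threads that fail at the same moment still schedule only one resume task.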

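3. Report events (rate-limiting events)

  • When the server returns a non-empty errorInfo, the failure is treated as a non-retryable business error and the events are dropped without retrying.
  • Network exceptions go through the same pause/resume mechanism described above.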

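Other improvements:

  • The BlockingQueue (LinkedBlockingQueue) is upgraded to a LinkedBlockingDeque, which allows events to be put back at the head of the queue and keeps event order intact across retries.
  • TsfEventDataPair gains a toString() method to make log-based debugging easier.
  • A test-only constructor is added that allows a shorter retry wait time to be injected, which speeds up the tests.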
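
To illustrate why a deque is needed for re-queuing, here is a small sketch of putting a drained batch back at the head in reverse order. RequeueSketch and requeueAtHead are hypothetical names, not methods from the PR; LinkedBlockingDeque.addFirst is the standard JDK call.

```java
import java.util.List;
import java.util.concurrent.LinkedBlockingDeque;

final class RequeueSketch {
    // Puts a batch (in the order it was taken) back at the head of the deque,
    // so the deque ends up in the same order it had before the batch was taken.
    static <T> void requeueAtHead(LinkedBlockingDeque<T> queue, List<T> batch) {
        // Iterate backwards: the last element goes in first, the first element
        // goes in last and therefore ends up at the very head.
        for (int i = batch.size() - 1; i >= 0; i--) {
            // addFirst throws IllegalStateException on a full bounded deque;
            // the sketch assumes an unbounded one.
            queue.addFirst(batch.get(i));
        }
    }
}
```

A plain LinkedBlockingQueue only supports insertion at the tail, so a re-queued batch would land behind newer events; inserting at the head in reverse avoids that reordering.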

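Tests

  • WeightedRoundRobinBalanceTest - unit tests for the round-robin route key generation logic
  • TsfEventReporterTest - unit tests for the event reporting retry mechanism (covering V1 business failure retries, pause/resume on network exceptions, and related scenarios)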

@codecov
codecov bot commented Mar 13, 2026

Codecov Report

❌ Patch coverage is 68.85246% with 38 lines in your changes missing coverage. Please review.
✅ Project coverage is 21.08%. Comparing base (b5d9223) to head (e008c05).
⚠️ Report is 6 commits behind head on main.

Files with missing lines | Patch % | Lines
...nt/polaris/plugins/event/tsf/TsfEventReporter.java | 66.66% | 28 Missing and 8 partials ⚠️
...balancer/roundrobin/WeightedRoundRobinBalance.java | 92.30% | 0 Missing and 1 partial ⚠️
...polaris/plugins/event/tsf/v1/TsfEventDataPair.java | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #699      +/-   ##
============================================
+ Coverage     20.43%   21.08%   +0.64%     
- Complexity     1037     1119      +82     
============================================
  Files           390      408      +18     
  Lines         16189    16893     +704     
  Branches       2088     2164      +76     
============================================
+ Hits           3309     3562     +253     
- Misses        12474    12918     +444     
- Partials        406      413       +7     

Member

@SkyeBeFreeman SkyeBeFreeman left a comment

LGTM.

@SkyeBeFreeman SkyeBeFreeman merged commit 25e7896 into polarismesh:main Mar 18, 2026
13 checks passed
3 participants