Skip to content

[refactor] AlarmCacheManager refactoring processing logic #3525

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Jul 6, 2025

Conversation

Duansg
Copy link
Contributor

@Duansg Duansg commented Jun 29, 2025

What's changed?

In order to make the semantics of the Promql parser clearer, the original scheme of fingerprinting only by alerts is gradually not applicable to the subsequent logical processing, and subsequently, if there is no value = null data returned, it will not be possible to find and restore the notification through the original fingerprint information, so this refactoring scheme is as follows:
image

On the original firingAlertMap and pendingAlertMap storage scheme, extend defineId, after all, in the existing architecture, AlertDefine's rules are essential, even if there is a special alert notification, you can customize the alert rule ( AlarmDefineCommonEnum) to realize it.

If an alarm/recovery occurs for a single indicator under a rule, it is found through its alarm definition. If all the indicators under this rule are back to normal, the whole row of data can be processed through the alarm definition, the specific process, as shown below:
image

Firing info
image

Data on recovery indicators are as follows
image

Recovery logic processing
image

Modification details:

  1. AlarmCacheManager changes and replaces all called methods.
  2. Compatible with historical alarm handling, first query the ones containing defineId, if not, query the historical ones.
  3. To initialize historical alarm data, a new HZB_ALERT_SINGLE.define_id field has been added and some migration script has been added.
  4. Add new test cases/fix test cases and go through the history of test cases successfully.

Checklist

  • I have read the Contributing Guide
  • I have written the necessary doc or comment.
  • I have added the necessary unit tests and all cases have passed.

Add or update API

  • I have added the necessary e2e tests and all cases have passed.

1、Periodic
iShot_2025-06-29_19 42 03
iShot_2025-06-29_19 42 36
iShot_2025-06-29_19 43 27
iShot_2025-06-29_19 46 57

2、RealTime
iShot_2025-06-29_19 49 19
iShot_2025-06-29_19 50 10
iShot_2025-06-29_19 50 55

3、Collector
iShot_2025-06-29_19 52 43
iShot_2025-06-29_19 53 02

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the AlarmCacheManager processing logic to support the new alarm definition mechanism by introducing a new defineId field and updating the alert handling logic. Key changes include:

  • Adding a new column and migration scripts for define_id in PostgreSQL, MySQL, and H2.
  • Updating the SingleAlert entity and related components (calculators, handler, tests) to pass defineId into alert cache operations.
  • Refactoring AlarmCacheManager to use a Guava Table with stringified defineId as the key, along with adaptations in the alert calculation logic.

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.

Show a summary per file
File Description
db/migration/postgresql/V173__update_column.sql Adds migration logic to add the define_id column if not present.
db/migration/mysql/V173__update_column.sql Uses a stored procedure to conditionally add the define_id column.
db/migration/h2/V173__update_column.sql Adds the define_id column with IF NOT EXISTS clause.
SingleAlert.java Adds defineId field to the SingleAlert entity.
AlarmDefineCommonEnum.java Introduces a new enum constant for COLLECTOR with an associated defineId.
RealTimeAlertCalculatorMatchTest.java, PeriodicAlertCalculatorTest.java Updates tests to verify that the new defineId is used in CacheManager operations.
RealTimeAlertCalculator.java, PeriodicAlertCalculator.java, CollectorAlertHandler.java Pass defineId to cache manager methods to ensure consistency in handling alerts.
AlarmCacheManager.java Refactors pending and firing alert maps to use a Guava Table keyed by defineId and fingerprint, with a fallback for historical alerts.
Comments suppressed due to low confidence (2)

hertzbeat-alerter/src/main/java/org/apache/hertzbeat/alert/calculate/AlarmCacheManager.java:65

  • [nitpick] Consider renaming the parameter 'fingerPrint' to 'fingerprint' for consistency and clarity across the codebase.
    public void putPending(Long defineId, String fingerPrint, SingleAlert alert) {

hertzbeat-alerter/src/main/java/org/apache/hertzbeat/alert/calculate/AlarmCacheManager.java:61

  • It would be helpful to add a comment explaining the rationale for converting defineId to a string (as rowKey) and the use of a historical key when defineId is null.
            this.firingAlertMap.put(rowKey, fingerprint, singleAlert);

@tomsun28
Copy link
Member

Hi, thanks for very thoughtful pr.

Sorry I just took a quick look and haven't learn this PR in depth. I have a question is it possible if we make defineId a member of lables? Then we don't need to handle it specially, because we need to consider not only our internal alarms, but also external ones (which don't have define_id).
I remember that when the internal alarm is triggered, defineId will be attached to labels as a fingerprint.

@Duansg
Copy link
Contributor Author

Duansg commented Jun 29, 2025

but also external ones (which don't have define_id).

HI, thanks for the reply, and in response to a couple of your questions

  1. The current fingerprints seem to have only some information like define name and define bind Labels attached to them.

  2. Regarding your comment that some of the external alerts don't have define_id, this is a problem I also found, so I added AlarmDefineCommonEnum to customize it individually, and although this doesn't affect the internal processing logic, it does redundancy the define_id field.

I reconsidered, if I take into account external alerts, i.e. SingleAlert does not store the define_id field, but exists as bind labels, and when AlarmCacheManager is initialized, it is enough to recover it from the labels, I tried to modify this

@Duansg
Copy link
Contributor Author

Duansg commented Jun 29, 2025

@tomsun28 hi, I reorganized the query logic for define_id and revalidated the following function points:

  1. Re-tested and verified Periodic, RealTime and Collector alarm/recovery logic.
  2. Reorganized the relevant test cases.
  3. Verified that labels has define_id stored successfully.
  4. Verified the initialization logic of restarting the application AlarmCacheManager.

@tomsun28 tomsun28 requested review from MasamiYui and pwallk July 2, 2025 00:39
@tomsun28
Copy link
Member

tomsun28 commented Jul 2, 2025

Hi since this involves basic alerts, sorry that we will review it slowly. I will review it on the weekend.

@MasamiYui
Copy link
Member

Thank you for your PR. I noticed that there are currently three ways to define defineId in the code:
defineId = define.getId();
defineId = getCustomKey(fingerprint);
defineId = singleAlert.getLabels().get(CommonConstants.LABEL_DEFINE_ID);
Could you elaborate on the different application scenarios for each approach?

@Duansg
Copy link
Contributor Author

Duansg commented Jul 4, 2025

Thank you for your PR. I noticed that there are currently three ways to define defineId in the code: defineId = define.getId(); defineId = getCustomKey(fingerprint); defineId = singleAlert.getLabels().get(CommonConstants.LABEL_DEFINE_ID); Could you elaborate on the different application scenarios for each approach?

@MasamiYui

Of course;

1、define.getId()
This one, as shown where it is used, is intended to get the contextual AlertDefine rule when processing Periodic with RealTime calculations. It will be stored as rowKey.

2、getCustomKey(fingerprint)
It is to deal with external or internal not defined by AlertDefine rule, such as Collector up and down line notification, using CUSTOM_FIRING_ROW_KEY splicing, just in order to be able to understand the get(fingerPrint, fingerPrint) such as the more obscure code of the , the splice processing has no special meaning.

3、singleAlert.getLabels().get(CommonConstants.LABEL_DEFINE_ID);
As mentioned in the context with @tomsun28 maintaining define_id separately is unnecessary since it doesn't exist without AlertDefine rules. Therefore, in the latest changes, it is stored via SingleAlert labels to pre-populate the firingAlertMap during the initialization method of org.apache.hertzbeat.alert.calculate.AlarmCacheManager#AlarmCacheManager.

@Duansg
Copy link
Contributor Author

Duansg commented Jul 4, 2025

Thank you for your PR. I noticed that there are currently three ways to define defineId in the code: defineId = define.getId(); defineId = getCustomKey(fingerprint); defineId = singleAlert.getLabels().get(CommonConstants.LABEL_DEFINE_ID); Could you elaborate on the different application scenarios for each approach?

@MasamiYui

Of course;

1、define.getId() This one, as shown where it is used, is intended to get the contextual AlertDefine rule when processing Periodic with RealTime calculations. It will be stored as rowKey.

2、getCustomKey(fingerprint) It is to deal with external or internal not defined by AlertDefine rule, such as Collector up and down line notification, using CUSTOM_FIRING_ROW_KEY splicing, just in order to be able to understand the get(fingerPrint, fingerPrint) such as the more obscure code of the , the splice processing has no special meaning.

3、singleAlert.getLabels().get(CommonConstants.LABEL_DEFINE_ID); As mentioned in the context with @tomsun28 maintaining define_id separately is unnecessary since it doesn't exist without AlertDefine rules. Therefore, in the latest changes, it is stored via SingleAlert labels to pre-populate the firingAlertMap during the initialization method of org.apache.hertzbeat.alert.calculate.AlarmCacheManager#AlarmCacheManager.

@MasamiYui

I'm adding to that is:
The define.getId() is used as the rowKey of the Table, but some without define need to use fingerprint instead, and going to get it in the AlarmCacheManager initialization method is for compatibility with the original initialization logic.

Copy link
Member

@MasamiYui MasamiYui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From implementation perspective, there are no problems.
LGTM.

Copy link
Member

@tomsun28 tomsun28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@pwallk pwallk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@tomsun28 tomsun28 merged commit 553bdba into apache:master Jul 6, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

Successfully merging this pull request may close these issues.

8 participants