Skip to content

feat: add time based eviction to data managed by cachinglayer #43490

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

zhengbuqian
Copy link
Collaborator

issue: #41435

also added disk capacity protection

@sre-ci-robot sre-ci-robot added approved size/L Denotes a PR that changes 100-499 lines. labels Jul 22, 2025
@mergify mergify bot added dco-passed DCO check passed. kind/feature Issues related to feature request from users labels Jul 22, 2025
@sre-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zhengbuqian

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

mergify bot commented Jul 22, 2025

@zhengbuqian cpu-e2e job failed, comment /run-cpu-e2e can trigger the job again.

Copy link
Contributor

mergify bot commented Jul 22, 2025

@zhengbuqian go-sdk check failed, comment rerun go-sdk can trigger the job again.

Copy link
Contributor

mergify bot commented Jul 22, 2025

@zhengbuqian cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

@zhengbuqian
Copy link
Collaborator Author

rerun cpp-unit-test

@zhengbuqian
Copy link
Collaborator Author

rerun go-sdk

@zhengbuqian
Copy link
Collaborator Author

/run-cpu-e2e

@zhengbuqian
Copy link
Collaborator Author

rerun go-sdk

Copy link

codecov bot commented Jul 22, 2025

Codecov Report

Attention: Patch coverage is 67.54386% with 37 lines in your changes missing coverage. Please review.

Project coverage is 78.94%. Comparing base (990a25e) to head (1a52126).
Report is 19 commits behind head on master.

Files with missing lines Patch % Lines
internal/core/src/cachinglayer/lrucache/DList.cpp 69.11% 21 Missing ⚠️
internal/core/src/cachinglayer/Utils.cpp 45.83% 13 Missing ⚠️
internal/core/src/cachinglayer/Utils.h 66.66% 1 Missing ⚠️
internal/core/src/segcore/segcore_init_c.cpp 0.00% 1 Missing ⚠️
pkg/util/paramtable/component_param.go 91.66% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #43490      +/-   ##
==========================================
- Coverage   78.94%   78.94%   -0.01%     
==========================================
  Files        1566     1567       +1     
  Lines      224453   224778     +325     
==========================================
+ Hits       177205   177456     +251     
- Misses      40834    40896      +62     
- Partials     6414     6426      +12     
Components Coverage Δ
Client 79.47% <ø> (ø)
Core 73.97% <62.88%> (+0.09%) ⬆️
Go 79.90% <94.11%> (-0.02%) ⬇️
Files with missing lines Coverage Δ
internal/core/src/cachinglayer/Manager.cpp 83.33% <ø> (ø)
internal/core/src/cachinglayer/lrucache/DList.h 80.95% <ø> (ø)
...ternal/core/src/cachinglayer/lrucache/ListNode.cpp 82.40% <100.00%> (-0.14%) ⬇️
internal/core/src/cachinglayer/lrucache/ListNode.h 100.00% <ø> (ø)
internal/querynodev2/server.go 61.14% <100.00%> (+0.23%) ⬆️
internal/core/src/cachinglayer/Utils.h 71.85% <66.66%> (+0.28%) ⬆️
internal/core/src/segcore/segcore_init_c.cpp 21.34% <0.00%> (-0.25%) ⬇️
pkg/util/paramtable/component_param.go 98.16% <91.66%> (-0.02%) ⬇️
internal/core/src/cachinglayer/Utils.cpp 64.66% <45.83%> (-4.91%) ⬇️
internal/core/src/cachinglayer/lrucache/DList.cpp 79.81% <69.11%> (-3.74%) ⬇️

... and 72 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

mergify bot commented Jul 22, 2025

@zhengbuqian cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

Copy link
Contributor

mergify bot commented Jul 22, 2025

@zhengbuqian cpu-e2e job failed, comment /run-cpu-e2e can trigger the job again.

Copy link
Contributor

mergify bot commented Jul 22, 2025

@zhengbuqian go-sdk check failed, comment rerun go-sdk can trigger the job again.

@zhengbuqian
Copy link
Collaborator Author

rerun go-sdk

@zhengbuqian
Copy link
Collaborator Author

/run-cpu-e2e

@zhengbuqian
Copy link
Collaborator Author

rerun cpp-unit-test

Copy link
Contributor

mergify bot commented Jul 23, 2025

@zhengbuqian go-sdk check failed, comment rerun go-sdk can trigger the job again.

Copy link
Contributor

mergify bot commented Jul 23, 2025

@zhengbuqian cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

Copy link
Contributor

mergify bot commented Jul 23, 2025

@zhengbuqian cpu-e2e job failed, comment /run-cpu-e2e can trigger the job again.

Copy link
Contributor

mergify bot commented Jul 23, 2025

@zhengbuqian cpu-e2e job failed, comment /run-cpu-e2e can trigger the job again.

Signed-off-by: Buqian Zheng <[email protected]>
@zhengbuqian zhengbuqian force-pushed the time-based_eviction branch from 5a843d3 to 5ddb0db Compare July 23, 2025 17:59
Copy link
Contributor

mergify bot commented Jul 23, 2025

@zhengbuqian go-sdk check failed, comment rerun go-sdk can trigger the job again.

Copy link
Contributor

mergify bot commented Jul 23, 2025

@zhengbuqian cpu-e2e job failed, comment /run-cpu-e2e can trigger the job again.

Copy link
Contributor

mergify bot commented Jul 23, 2025

@zhengbuqian cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

@zhengbuqian
Copy link
Collaborator Author

rerun cpp-unit-test

@zhengbuqian
Copy link
Collaborator Author

/run-cpu-e2e

@zhengbuqian
Copy link
Collaborator Author

rerun go-sdk

Copy link
Contributor

mergify bot commented Jul 24, 2025

@zhengbuqian cpu-e2e job failed, comment /run-cpu-e2e can trigger the job again.

Copy link
Contributor

mergify bot commented Jul 24, 2025

@zhengbuqian cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

Copy link
Contributor

mergify bot commented Jul 24, 2025

@zhengbuqian go-sdk check failed, comment rerun go-sdk can trigger the job again.

@@ -2842,6 +2842,7 @@ type queryNodeConfig struct {
TieredCacheTouchWindowMs ParamItem `refreshable:"false"`
TieredEvictionIntervalMs ParamItem `refreshable:"false"`
TieredLoadingMemoryFactor ParamItem `refreshable:"false"`
CacheCellUnaccssedSurvivalTime ParamItem `refreshable:"false"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one typo: CacheCellUnaccssedSurvivalTime -> CacheCellUnaccessedSurvivalTime

// Combine with logical eviction target (take the maximum)
eviction_target.memory_bytes = std::max(
eviction_target.memory_bytes, physical_eviction_needed);
eviction_target.memory_bytes =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should eviction_target.file_bytes and min_eviction.file_bytes be updated here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absolutely!

// we only need to evict for logical limit and we have succeeded.
break;
}

if (physical_eviction_needed = checkPhysicalMemoryLimit(size);
physical_eviction_needed == 0) {
if (physical_eviction_needed = checkPhysicalResourceLimit(size);
Copy link
Contributor

@sparknack sparknack Jul 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since physical memory usage fluctuates over time, should min_eviction be set as a multiple of physical_eviction_needed?
This could handle the case where the minimum amount of evicted memory is immediately reused by another process outside the caching layer's control.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in checkPhysicalResourceLimit we already multiplied size by the loading factor

@@ -144,12 +145,12 @@ DList::reserveMemoryInternal(const ResourceUsage& size) {
"still need to evict {}",
size.ToString(),
evicted_size.ToString(),
FormatBytes(physical_eviction_needed));
physical_eviction_needed.ToString());
}

// Reserve resources (both checks passed)
used_memory_ += size;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After introducing disk resource reservation, used_memory_ seems confusing. What about used_resource_ or another appropriate name?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds fair, done

@@ -269,7 +269,8 @@ DList::tryEvict(const ResourceUsage& expected_eviction,
ResourceUsage actively_pinned{0, 0};

// accumulate victims using expected_eviction.
for (auto it = tail_; it != nullptr; it = it->next_) {
ListNode* it = nullptr;
for (it = tail_; it != nullptr; it = it->next_) {
if (!would_help(it->size())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should these skipped it be considered expired items?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't know, they may be pinned, or not pinned.

why?

Signed-off-by: Buqian Zheng <[email protected]>
@mergify mergify bot added the ci-passed label Jul 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved ci-passed dco-passed DCO check passed. kind/feature Issues related to feature request from users size/L Denotes a PR that changes 100-499 lines.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants