[#1727] feat(server): Introduce local allocation buffer to store blocks in memory #2492

Merged: 27 commits merged into apache:master on Jun 12, 2025

Conversation

xianjingfeng
Member

@xianjingfeng xianjingfeng commented May 28, 2025

What changes were proposed in this pull request?

Introduce local allocation buffer to store blocks in memory.

Why are the changes needed?

Fix: #1727

Does this PR introduce any user-facing change?

Set rss.server.buffer.lab.enable to true in server.conf.
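
As a minimal illustration, the entry in server.conf might look like the fragment below. This assumes the whitespace-separated key/value layout that Uniffle configuration files generally use; only the key name and the value come from the description above.

```
rss.server.buffer.lab.enable true
```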

How was this patch tested?

CI, and verified in a production environment.

@xianjingfeng xianjingfeng requested review from maobaolong, zuston, advancedxy, jerqi and rickyma and removed request for maobaolong and zuston May 28, 2025 10:19
@xianjingfeng
Member Author

Test results of terasort (data size: 1T, parallelism: 8000):

[Four images of the results were attached; not preserved here.]

github-actions bot commented May 28, 2025

Test Results

 3 049 files  + 30   3 049 suites  +30   6h 47m 55s ⏱️ -21s
 1 186 tests +  8   1 185 ✅ +  8   1 💤 ±0  0 ❌ ±0 
15 042 runs  +120  15 027 ✅ +120  15 💤 ±0  0 ❌ ±0 

Results for commit 497c524. ± Comparison against base commit b45e986.

This pull request removes 4 and adds 12 tests. Note that renamed tests count towards both.
org.apache.uniffle.common.ShufflePartitionedBlockTest ‑ testNotEquals{int, long, long, int}[1]
org.apache.uniffle.common.ShufflePartitionedBlockTest ‑ testNotEquals{int, long, long, int}[2]
org.apache.uniffle.common.ShufflePartitionedBlockTest ‑ testNotEquals{int, long, long, int}[3]
org.apache.uniffle.common.ShufflePartitionedBlockTest ‑ testNotEquals{int, long, long, int}[4]
org.apache.uniffle.server.buffer.LABShuffleBufferWithLinkedListTest ‑ appendMultiBlocksTest
org.apache.uniffle.server.buffer.LABShuffleBufferWithLinkedListTest ‑ appendRepeatBlockTest
org.apache.uniffle.server.buffer.LABShuffleBufferWithLinkedListTest ‑ appendTest
org.apache.uniffle.server.buffer.LABShuffleBufferWithLinkedListTest ‑ getShuffleDataTest
org.apache.uniffle.server.buffer.LABShuffleBufferWithLinkedListTest ‑ getShuffleDataWithExpectedTaskIdsFilterTest
org.apache.uniffle.server.buffer.LABShuffleBufferWithLinkedListTest ‑ getShuffleDataWithLocalOrderTest
org.apache.uniffle.server.buffer.LABShuffleBufferWithLinkedListTest ‑ toFlushEventTest
org.apache.uniffle.server.buffer.LABShuffleBufferWithSkipListTest ‑ appendMultiBlocksTest
org.apache.uniffle.server.buffer.LABShuffleBufferWithSkipListTest ‑ appendRepeatBlockTest
org.apache.uniffle.server.buffer.LABShuffleBufferWithSkipListTest ‑ appendTest
…

♻️ This comment has been updated with latest results.

@jerqi
Contributor

jerqi commented May 28, 2025

cc @frankliee

@jerqi
Contributor

jerqi commented May 29, 2025

@rickyma Could you help me review this pull request?

@@ -226,9 +238,9 @@ public StatusCode registerBuffer(
    ShuffleServerMetrics.gaugeTotalPartitionNum.inc();
    ShuffleBuffer shuffleBuffer;
    if (shuffleBufferType == ShuffleBufferType.SKIP_LIST) {
      shuffleBuffer = new ShuffleBufferWithSkipList();
Contributor

We can add the classes SkipListLabShuffleBuffer and LinkedListLabShuffleBuffer, and add a ShuffleBufferFactory.

Member Author

Ok for me
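
The factory suggested in this thread could be sketched roughly as follows. This is only an illustration under stated assumptions: ShuffleBufferType.SKIP_LIST, ShuffleBufferWithSkipList, SkipListLabShuffleBuffer, LinkedListLabShuffleBuffer and ShuffleBufferFactory are names that appear in the conversation, while the LINKED_LIST constant, the create method, its labEnabled parameter, and the empty class bodies are hypothetical stand-ins rather than Uniffle's actual code.

```java
// Sketch only: empty stand-in types, not the real Uniffle classes.
enum ShuffleBufferType { SKIP_LIST, LINKED_LIST }

interface ShuffleBuffer {}

class ShuffleBufferWithSkipList implements ShuffleBuffer {}

class ShuffleBufferWithLinkedList implements ShuffleBuffer {}

// LAB-backed variants, named as proposed in the review comment.
class SkipListLabShuffleBuffer extends ShuffleBufferWithSkipList {}

class LinkedListLabShuffleBuffer extends ShuffleBufferWithLinkedList {}

final class ShuffleBufferFactory {
  private ShuffleBufferFactory() {}

  // Hypothetical signature: picks the implementation from the buffer type
  // and from whether the local allocation buffer is enabled.
  static ShuffleBuffer create(ShuffleBufferType type, boolean labEnabled) {
    switch (type) {
      case SKIP_LIST:
        return labEnabled ? new SkipListLabShuffleBuffer() : new ShuffleBufferWithSkipList();
      case LINKED_LIST:
        return labEnabled ? new LinkedListLabShuffleBuffer() : new ShuffleBufferWithLinkedList();
      default:
        throw new IllegalArgumentException("Unknown buffer type: " + type);
    }
  }
}
```

With a factory of this shape, a call site such as registerBuffer would no longer need to branch on the buffer type itself.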

@xianjingfeng xianjingfeng changed the title from "[#1727] feat(server): Introduce local allocation buffer to store blocks in memory" to "[WIP][#1727] feat(server): Introduce local allocation buffer to store blocks in memory" on May 29, 2025
@xianjingfeng
Member Author

@jerqi Do you mean something like the approach below? I tried it and found that many methods would need to be implemented, and the code looks inelegant.
[image of the first approach; not preserved]
I also tried the approach below, and I think it's better.
[image of the second approach; not preserved]

@jerqi
Contributor

jerqi commented May 30, 2025

> Do you mean to do it like below? I tried this solution and found that many methods need to be implemented. And the code looks inelegant. @jerqi [image] And I also tried the solution like below. And I think it's better. [image]

The second way is OK for me, too. You can extract an interface SupportsLAB for the second way.

@zuston
Member

zuston commented Jun 3, 2025

Good to see this PR for the performance improvement. I will take a look later today.

Member

@zuston zuston left a comment

Looks good for this abstraction.

zuston previously approved these changes Jun 4, 2025
jerqi previously approved these changes Jun 5, 2025
Contributor

@jerqi jerqi left a comment

LGTM. cc @frankliee, could you take a look? I will wait until tomorrow.

zuston previously approved these changes Jun 5, 2025
jerqi previously approved these changes Jun 5, 2025
@advancedxy
Contributor

I am quite busy these days. But I will take a look at this later today.

Contributor

@advancedxy advancedxy left a comment

Thanks for this, the comparison result is promising.
Left some minor comments, others lgtm.


abstract void allocateDataBuffer();

public int alloc(int size) {
Contributor

The implementation and the method name don't seem aligned. Could you add some Javadoc and/or change the method name here?
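
The body of alloc is not shown on this page, so the sketch below only illustrates the kind of Javadoc this comment asks for, using a bump-pointer chunk in the style that local allocation buffers commonly follow. ChunkSketch, capacity and nextFreeOffset are hypothetical names, and the return convention (start offset, or -1 when the request does not fit) is an assumption rather than the PR's confirmed behavior.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical chunk: a fixed-size region handed out by advancing an offset. */
final class ChunkSketch {
  private final int capacity;
  private final AtomicInteger nextFreeOffset = new AtomicInteger(0);

  ChunkSketch(int capacity) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.capacity = capacity;
  }

  /**
   * Reserves {@code size} bytes inside this chunk without copying any data.
   *
   * @param size number of bytes to reserve; must be positive
   * @return the start offset of the reserved region, or -1 if the chunk
   *     does not have {@code size} free bytes left
   */
  int alloc(int size) {
    if (size <= 0) {
      throw new IllegalArgumentException("size must be positive");
    }
    while (true) {
      int oldOffset = nextFreeOffset.get();
      // Compare against the remaining space to avoid int overflow.
      if (size > capacity - oldOffset) {
        return -1;
      }
      // CAS so that concurrent callers never receive overlapping regions.
      if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size)) {
        return oldOffset;
      }
    }
  }
}
```

Documenting the return value this way makes it clear that the method reserves space and reports where it starts, which is the kind of detail a bare name like alloc leaves open.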

  private Chunk createChunk(boolean pool, int size) {
    Chunk chunk;
    int id = chunkID.getAndIncrement();
    assert id > 0;
Contributor

This seems redundant?

Member Author

[Reply not preserved.]

Contributor

I don't think so. Shouldn't chunkID always be greater than zero?

Even if it can overflow, we should use Preconditions.check instead of assert, which is a no-op if assertions are not enabled.

Member Author

It's possible to overflow, let's use Preconditions.check.
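
The point made in this thread can be sketched as follows. Java's assert statement is skipped unless the JVM is started with -ea, whereas an explicit check always runs. ChunkIdSketch and nextId are hypothetical names; the plain if/throw stands in for a precondition helper such as Guava's Preconditions.checkState(id > 0, ...), assuming that is the Preconditions class meant above.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical ID generator showing an always-on overflow check. */
final class ChunkIdSketch {
  private final AtomicInteger chunkID;

  ChunkIdSketch(int firstId) {
    this.chunkID = new AtomicInteger(firstId);
  }

  int nextId() {
    // getAndIncrement wraps to Integer.MIN_VALUE after Integer.MAX_VALUE.
    int id = chunkID.getAndIncrement();
    // Unlike `assert id > 0;`, this check runs even without the -ea flag.
    if (id <= 0) {
      throw new IllegalStateException("chunkID overflowed: " + id);
    }
    return id;
  }
}
```

Once the counter wraps, every later call fails fast instead of silently handing out a non-positive ID.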

    return dataLength == that.dataLength
        && crc == that.crc
        && blockId == that.blockId
        && data.equals(that.data);
Contributor

Why this change? I think we should at least keep the dataLength and crc checks.

private Chunk currChunk;

List<Integer> chunks = new LinkedList<>();
private final int maxAlloc;
Contributor

nit: this field name is a bit vague. Could it be called capacity or something similar?

Member Author

It means that when a block is too big (bigger than maxAlloc), it will not be allocated on the LAB.
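
Read that way, maxAlloc acts as a per-block size threshold rather than a total capacity. A minimal sketch of that rule, assuming the comparison is "bigger than" as stated above; LabThresholdSketch and fitsInLab are hypothetical names, not part of the PR:

```java
/** Hypothetical helper showing maxAlloc used as a per-block threshold. */
final class LabThresholdSketch {
  private final int maxAlloc;

  LabThresholdSketch(int maxAlloc) {
    if (maxAlloc <= 0) {
      throw new IllegalArgumentException("maxAlloc must be positive");
    }
    this.maxAlloc = maxAlloc;
  }

  /** Returns true if a block of this size would be placed on the LAB. */
  boolean fitsInLab(int blockSize) {
    // Only blocks strictly bigger than maxAlloc bypass the LAB.
    return blockSize <= maxAlloc;
  }
}
```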

@xianjingfeng xianjingfeng dismissed stale reviews from jerqi and zuston via 9cd9949 June 6, 2025 09:13
@xianjingfeng xianjingfeng requested a review from advancedxy June 6, 2025 09:20
@zuston
Member

zuston commented Jun 10, 2025

Gentle ping @advancedxy

Contributor

@advancedxy advancedxy left a comment

One minor comment, others lgtm.

@jerqi jerqi requested a review from advancedxy June 11, 2025 03:42
@jerqi jerqi merged commit c83a1b5 into apache:master Jun 12, 2025
41 checks passed
xianjingfeng added a commit that referenced this pull request Jul 25, 2025
…apacityRatio (#2554)

### What changes were proposed in this pull request?
Change the default value of chunkPoolCapacityRatio

### Why are the changes needed?
The proportion and frequency of small blocks are not high. If this value is set too high, it may cause off-heap memory overflow.
Fix: #2492

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Verify in production environment.
Successfully merging this pull request may close these issues.

[Improvement] Introduce local allocation buffer to store blocks in memory
5 participants