feat: add multi level merge sort that will always fit in memory #15700
#[tokio::test]
async fn test_low_cardinality() -> Result<()> {
This fails on main with an OOM.
for batch in batches_to_spill {
    in_progress_file.append_batch(&batch)?;

    *max_record_batch_size =
        (*max_record_batch_size).max(batch.get_actually_used_size());
I think it's not realistic to correctly know a batch's size after a roundtrip of spilling and reading back with this get_actually_used_size() implementation. The actual implementation might give us some surprises, and it can get even more complex in the future; for example, we might implement extra encodings for #14078, and the memory size of a batch after reading back becomes even harder to estimate.
Unless the actual array content differs before and after the spill, this function will always return the correct result regardless of the spill file format, as we calculate the actual array content size.
There might be some types of arrays with complex internal buffer management. A simple example:
Before spilling, a StringView array has 10 MB of actual content, backed by 3 * 4 MB buffers.
After spilling and reading back, the reader implementation decides to use a single 16 MB buffer instead.
Different allocation policies cause different fragmentation, and the physical memory consumed varies.
Here are some real bugs found recently due to similar reasons (this explains why I'm worried about inconsistent memory size for logically equivalent batches):
#14644
#14823
#13377
Note these were caused by only primitive and string arrays; for more complex types like struct, array, or other nested types, I think such inconsistency is even more likely.
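A minimal sketch (not from this thread) of the general phenomenon: logically equivalent data can report very different memory sizes depending on how its buffers are laid out. It uses slicing as a stand-in for the spill roundtrip, since a sliced array still holds on to its parent's full allocation:

```rust
use arrow::array::{Array, Int32Array};

fn main() {
    // ~4 MB of values in a single buffer.
    let big = Int32Array::from_iter_values(0..1_000_000);

    // A 10-row slice shares the parent's buffer, so its reported memory size
    // stays close to the full 4 MB even though its logical content is ~40 bytes.
    let view = big.slice(0, 10);

    // Copying the same 10 values into a fresh array yields a compact buffer.
    let copy = Int32Array::from_iter_values(0..10);

    println!("slice: {} bytes", view.get_array_memory_size());
    println!("copy:  {} bytes", copy.get_array_memory_size());
}
```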
I'm trying to reproduce that so I can answer better. How do you create that StringView array so it causes what you described?
So after looking at the code I came to the conclusion that this is still the closest there is to accurately estimating memory
I think we should not estimate; even if it's correct 99% of the time, IMO it's impossible to make sure it's always accurate for nested types' reader implementations. If the estimate is way off for edge cases, the bug would be hard to investigate.
If we want to follow this optimistic approach, the only required memory accounting I think is during buffering batches inside SortExec, and all the remaining memory-tracking code can be deleted to make the implementation much simpler. The potential problem is unexpected behavior for non-primitive types (e.g. a dictionary array's row-format size can explode).
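To make the dictionary concern concrete, here is a hedged, self-contained sketch (not from the PR): the row format materializes the dictionary value for every row, so the converted rows can be orders of magnitude larger than the dictionary-encoded array they came from.

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, DictionaryArray, Int32Array, StringArray};
use arrow::datatypes::Int32Type;
use arrow::error::ArrowError;
use arrow::row::{RowConverter, SortField};

fn main() -> Result<(), ArrowError> {
    // 100_000 rows that all point at one of two long string values.
    let values = StringArray::from_iter_values(["a".repeat(1000), "b".repeat(1000)]);
    let keys = Int32Array::from_iter_values((0..100_000).map(|i| i % 2));
    let dict = DictionaryArray::<Int32Type>::try_new(keys, Arc::new(values))?;

    // The dictionary stores each long value only once.
    let dict_size = dict.get_array_memory_size();
    let sort_field = SortField::new(dict.data_type().clone());

    // The row format materializes the value for every row, so it can be
    // orders of magnitude larger than the dictionary-encoded input.
    let converter = RowConverter::new(vec![sort_field])?;
    let column: ArrayRef = Arc::new(dict);
    let rows = converter.convert_columns(&[column])?;

    println!("dictionary array: {dict_size} bytes");
    println!("row format:       {} bytes", rows.size());
    Ok(())
}
```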
If I added tests for every type to make sure the memory accounting is correct, would you approve?
@2010YOUY01 and @ding-young I wonder if you can review this PR again to help @rluvaton get it merged? Specifically, if it needs more tests, perhaps you can help identify which are needed.
@alamb Sure! I may not be able to provide a detailed review right away, but I can definitely help by running the tests added in the PR locally and looking into memory accounting for the nested types that have been mentioned.
I have some concerns about this PR's design direction (see more in #15700 (comment)), and I don't think they can be addressed by more extensive tests.
Why is that? You raised some concerns about miscalculating the size of the record batch; adding tests will make sure we are calculating it correctly.
In the interest of this valuable work not being lost, is there any way that #15700 (comment) could be addressed by a method that's not more tests? Could we calculate the actual batch sizes every time we load into memory? Even if possible, that opens up questions of what to do if we load a batch and then exceed our memory budget, but maybe it's a path forward?
Hi @adriangb, thanks for raising this point. I'm currently reviewing both this PR and the other cascading merge sort PR (#15610). I'm not taking sides between the two approaches, but I agree that accurately estimating memory consumption is tricky considering the issues discussed above and the fact that compression is now supported in spill files. We may need to think more about whether we can special-case scenarios where the memory size changes after spilling and reloading, or perhaps add some kind of backup logic to handle such situations more gracefully.
I've rebased this branch on the latest main and tested whether estimated size changes after we load
I have an idea to fix this concern: adding a max merge degree configuration. I think this approach has two advantages:
I (or possibly @ding-young) can handle this patch in a follow-up PR. I think we can move forward with this one; I'll review it in the next few days.
So should I fix this PR's conflicts? It seems like this PR has a chance to be merged.
@rluvaton If you'd like, I can send a PR to your (fork's) branch that resolves the merge conflicts, since I already have one. Anyway, there were only minor diffs to handle when I rebased your branch on main.
I would appreciate it; it would greatly help me.
@rluvaton I opened a PR on your fork. Would you take a look when you have some time?
@2010YOUY01 I've updated based on your comments and commented back on some.
Thanks for the updates. I think the overall structure of this PR looks great; most of my questions/suggestions from the 1st round of review have been addressed.
Now I only have two additional questions:
- #15700 (comment) If this issue is not significant, I still think it's better to first integrate this kernel into Arrow and keep the solution here simple. Otherwise we should add UTs for it before merge.
- Do we need this explicit batch chunking when writing spill files? #15700 (comment)
I left some additional suggestions for polishing in the 2nd round, but they're totally fine to be done as follow-up PRs.
I think this PR is almost ready, cc @ding-young and @alamb if you also want to review it again.
use futures::TryStreamExt;
use futures::{Stream, StreamExt};

/// Merges a stream of sorted cursors and record batches into a single sorted stream
It would be great to add a high-level doc about how this multi-level merge works.
added with diagram
// This is for avoiding double reservation of memory from our side and the sort preserving merge stream
// side.
// and doing a lot of code changes to avoid accounting for the memory used by the streams
unbounded_memory_pool: Arc<dyn MemoryPool>,
I think a clearer way to implement this is to let StreamingMergeBuilder include a new interface, with_bypass_mempool(), which would construct a temporary unbounded memory pool inside SPM and let its memory reservation point to it.
Added and created the temporary unbounded memory pool inside SPM. Made that function pub(super) to avoid exposing this to users, as I feel it is a hack.
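For reference, a rough sketch of what such a hook could look like inside the streaming_merge module (hedged; the consumer name and exact wiring are illustrative, not necessarily the merged code):

```rust
use std::sync::Arc;
use datafusion_execution::memory_pool::{MemoryConsumer, MemoryPool, UnboundedMemoryPool};

impl StreamingMergeBuilder<'_> {
    /// Swap the caller-provided reservation for one backed by a fresh unbounded
    /// pool, so SPM does not double-account memory that the multi-level merge
    /// has already reserved against the real pool.
    pub(super) fn with_bypass_mempool(self) -> Self {
        let pool: Arc<dyn MemoryPool> = Arc::new(UnboundedMemoryPool::default());
        let reservation =
            MemoryConsumer::new("merge stream bypass mempool").register(&pool);
        self.with_reservation(reservation)
    }
}
```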
async fn create_stream(mut self) -> Result<SendableRecordBatchStream> {
    loop {
        // Hold this for the lifetime of the stream
If we have this reservation with the same lifetime as the stream, would it be better to create a MultiLevelMergeStream and make this reservation a struct field?
Changed, and made sure to use the memory reservation when only merging in-memory streams; otherwise, the worst-case scenario is used.
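A hedged sketch of that shape (field names are illustrative): holding the reservation as a struct field ties its lifetime to the stream, so it is released only when the stream is dropped.

```rust
use arrow::datatypes::SchemaRef;
use datafusion_execution::memory_pool::MemoryReservation;
use datafusion_execution::SendableRecordBatchStream;

struct MultiLevelMergeStream {
    schema: SchemaRef,
    /// Dropped together with the stream, releasing the worst-case reservation
    /// taken before the merge started.
    reservation: MemoryReservation,
    inner: SendableRecordBatchStream,
}
```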
    }
}

fn create_sorted_stream(
It would be great to include a comment on its high-level idea.
I also think merge_sorted_runs_within_mem_limit() might be a more precise name?
added and renamed
// (reserving memory for the biggest batch in each stream)
// This is a hack
.with_reservation(
    MemoryConsumer::new("merge stream mock memory")
I see, this makes sense.
To enforce such validation, in the future we can extend StreamingMergeBuilder with each stream's max batch size and do some inner sanity checks:
let res = StreamingMergeBuilder::new()
    .with_streams(streams)
    .with_max_batch_size_per_stream(max_batch_sizes)
    // ...remaining options elided...
    .build()?;
let Some(expressions) = expressions else {
    return internal_err!("Sort expressions cannot be empty for streaming merge");
};

if !sorted_spill_files.is_empty() {
I agree it's good to reduce the number of APIs; then the two approaches seem to have similar complexity.
use arrow::downcast_primitive_array;
use arrow_schema::DataType;

/// TODO - NEED TO MOVE THIS TO ARROW
It's not obvious to me under what situation it can overestimate by a lot. I was thinking those batch arrays won't over-allocate buffers too much, because we have a configured batch size.
Do you have a reproducer? Perhaps we can look into it further.
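For context, a hedged sketch of the kind of accounting being discussed (not the PR's actual get_actually_used_size): charge an array only for the value bytes its offset/length window references, rather than the capacity of the underlying buffers. For nested and variable-length types the equivalent accounting is where it gets tricky, which is the crux of the concern above.

```rust
use arrow::array::{Array, PrimitiveArray};
use arrow::datatypes::ArrowPrimitiveType;

/// Bytes logically referenced by this (possibly sliced) primitive array,
/// ignoring any extra capacity in the shared underlying buffers.
fn sliced_value_bytes<T: ArrowPrimitiveType>(array: &PrimitiveArray<T>) -> usize {
    let values = array.len() * std::mem::size_of::<T::Native>();
    // Validity bitmap covering the same window: one bit per row, if present.
    let nulls = array.nulls().map(|_| array.len().div_ceil(8)).unwrap_or(0);
    values + nulls
}
```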
/// # Errors
/// - Returns an error if spilling would exceed the disk usage limit configured
///   by `max_temp_directory_size` in `DiskManager`
pub(crate) fn spill_record_batch_by_size_and_return_max_batch_memory(
Note the SpillManager mod is exported as pub(crate):
pub(crate) mod spill_manager;
Then we're free to modify the existing functions and combine this one with spill_record_batch_by_size() to reuse the code.
let mut offset = 0;
let total_rows = batch.num_rows();

// Keep slicing the batch until we have left with a batch that is smaller than
Why do we have to do this step here?
I looked at its only use case:
.spill_record_batch_stream_by_size(
The stream this function takes is produced by SPM, which has already chunked the output by batch_size in the configuration.
You are right, simplified.
pub(crate) fn spill_record_batch_by_size_and_return_max_batch_memory(
    &self,
    batch: &RecordBatch,
    request_description: &str,
Nit: it might be better to unify the naming. Some functions use request_msg, others use request_description.
done
Can you please re-review? I don't believe there are any actionable comments left for me.
Pull Request Overview
This PR adds a multi-level merge sort implementation that prevents out-of-memory failures by intelligently managing memory during sorting operations. The implementation spills intermediate results to disk when memory is insufficient and dynamically adjusts buffer sizes and stream counts to work within available memory constraints.
- Implements MultiLevelMergeBuilder that performs hierarchical merging with automatic spilling
- Adds memory-aware spill file management with worst-case memory reservation strategy
- Integrates the multi-level merge into both sort and aggregate operations
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| spill_manager.rs | Adds methods to track maximum batch memory sizes during spilling operations |
| streaming_merge.rs | Integrates multi-level merge capabilities and adds SortedSpillFile structure |
| sort.rs | Updates external sorter to track batch memory sizes and use multi-level merge |
| multi_level_merge.rs | New core implementation of memory-aware multi-level merge algorithm |
| row_hash.rs | Updates aggregate spilling to use new spill file format with memory tracking |
| Test files | Adds comprehensive fuzz tests for memory-constrained environments |
use arrow::datatypes::SchemaRef;
use arrow::record_batch::RecordBatch;
use datafusion_execution::runtime_env::RuntimeEnv;
use std::sync::Arc;
[nitpick] The import order has been changed unnecessarily. The original import order (std imports first, then external crates) followed Rust conventions better.
use arrow::datatypes::SchemaRef;
use arrow::record_batch::RecordBatch;
use datafusion_execution::runtime_env::RuntimeEnv;
use std::sync::Arc;

use std::sync::Arc;
use arrow::datatypes::SchemaRef;
use arrow::record_batch::RecordBatch;
use datafusion_execution::runtime_env::RuntimeEnv;
isn't the import order defined in clippy/fmt?
///
/// This is not marked as `pub` because it is not recommended to use this method
The comment should explain why this method is not recommended and what the risks are of bypassing the memory pool.
///
/// This is not marked as `pub` because it is not recommended to use this method

///
/// This method bypasses the memory pool, which can lead to unregulated memory usage.
/// Using an unbounded memory pool may result in excessive memory consumption and
/// potential system instability if memory usage exceeds available resources.
/// This is not marked as `pub` because it is not recommended to use this method
/// except in testing or controlled scenarios where memory usage is closely monitored.
// TODO - add a threshold for number of files to disk even if empty and reading from disk so
// we can avoid the memory reservation
This TODO comment is incomplete and unclear. It should be completed to explain the specific optimization being considered.
// TODO - add a threshold for number of files to disk even if empty and reading from disk so
// we can avoid the memory reservation

// TODO - Consider adding a threshold for the number of files to spill to disk, even if the files are empty
// and we are reading from disk. This optimization aims to reduce memory reservation by limiting
// the number of in-memory streams. The threshold could be based on factors such as available memory,
// the size of the sorted spill files, or the expected performance impact of disk I/O. Further analysis
// and testing are needed to determine the appropriate threshold value and its implementation.
// If we're only merging memory streams, we don't need to attach the memory reservation
// as it's empty
if is_only_merging_memory_streams {
    assert_eq!(memory_reservation.size(), 0, "when only merging memory streams, we should not have any memory reservation and let the merge sort handle the memory");
Using assert_eq! in production code can cause panics. Consider returning an error result or using a debug assertion instead.
assert_eq!(memory_reservation.size(), 0, "when only merging memory streams, we should not have any memory reservation and let the merge sort handle the memory");

if memory_reservation.size() != 0 {
    return Err(datafusion_common::DataFusionError::Internal(
        "when only merging memory streams, we should not have any memory reservation and let the merge sort handle the memory".to_string(),
    ));
}
the assertion is on purpose to ensure correctness
    minimum_number_of_required_streams: usize,
    reservation: &mut MemoryReservation,
) -> Result<(Vec<SortedSpillFile>, usize)> {
    assert_ne!(buffer_len, 0, "Buffer length must be greater than 0");
Using assert_ne! in production code can cause panics. Consider returning an error result for invalid input parameters instead.
This is on purpose; I want it to panic here, as this is a correctness check.
Thanks again @rluvaton
I have filed the issues for the follow-up tasks.
I plan to wait a few days before merging, in case others would like to review it too.
Thanks a lot! It looks good to me.
Thanks everyone @rluvaton @ding-young and others who have helped along the way 🚀
Amazing work! Aside from updating to
Nothing else is needed; it will be triggered automatically, and it doesn't require a configuration option to control.
EPIC! I will also add this to the list of things we should mention in the 50.0.0 release.
/// - No: return that sorted stream as the final output stream
///
/// ```text
/// Initial State: Multiple sorted streams + spill files
😍
…he#15700)
* feat: add multi level merge sort that will always fit in memory
* test: add fuzz test for aggregate
* update
* add more tests
* fix test
* update tests
* added more aggregate fuzz
* align with add fuzz tests
* add sort fuzz
* fix lints and formatting
* moved spill in memory constrained envs to separate test
* rename `StreamExec` to `OnceExec`
* added comment on the usize in the `in_progress_spill_file` inside ExternalSorter
* rename buffer_size to buffer_len
* reuse code in spill fuzz
* double the amount of memory needed to sort
* add diagram for explaining the overview
* update based on code review
* fix test based on new memory calculation
* remove get_size in favor of get_sliced_size
* change to result
I'm trying this out for our compaction system and am not able to get my sort to work without hitting memory limits. Note that I am using

-- About 6.32 GB of parquet compressed (~ 10x compression ratio)
-- Split into ~60 ~100 MB files
CREATE EXTERNAL TABLE t1
STORED AS PARQUET
LOCATION '/Users/adriangb/Downloads/data/day=2025-08-05/';

explain
COPY (
  SELECT *
  FROM t1
  ORDER BY deployment_environment, kind, service_name, trace_id
)
TO '/Users/adriangb/Downloads/out.parquet';

COPY (
  SELECT *
  FROM t1
  ORDER BY deployment_environment, kind, service_name, trace_id
)
TO '/Users/adriangb/Downloads/out.parquet';

Even a limit of 16GB fails:

❯ ./target/release/datafusion-cli --mem-pool-type 'fair' --memory-limit '16g' --disk-limit '250gb' -f q.sql
DataFusion CLI v49.0.0
0 row(s) fetched.
Elapsed 0.243 seconds.
+---------------+-------------------------------+
| plan_type | plan |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
| | │ DataSinkExec │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ SortPreservingMergeExec │ |
| | │ -------------------- │ |
| | │ deployment_environment ASC│ |
| | │ NULLS LAST, kind ASC │ |
| | │ NULLS LAST, │ |
| | │ service_name │ |
| | │ ASC NULLS LAST, │ |
| | │ trace_id ASC NULLS │ |
| | │ LAST │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ SortExec │ |
| | │ -------------------- │ |
| | │ deployment_environment@35 │ |
| | │ ASC NULLS LAST, kind@6 │ |
| | │ ASC NULLS LAST, │ |
| | │ service_name@27 │ |
| | │ ASC NULLS LAST, │ |
| | │ trace_id@4 ASC │ |
| | │ NULLS LAST │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ DataSourceExec │ |
| | │ -------------------- │ |
| | │ files: 68 │ |
| | │ format: parquet │ |
| | └───────────────────────────┘ |
| | |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.255 seconds.
Resources exhausted: Additional allocation failed with top memory consumers (across reservations) as:
ExternalSorter[4]#12(can spill: true) consumed 1089.2 MB,
ExternalSorter[11]#26(can spill: true) consumed 1018.9 MB,
ExternalSorter[8]#20(can spill: true) consumed 1004.6 MB.
Error: Failed to allocate additional 744.1 KB for ExternalSorterMerge[6] with 0.0 B already allocated for this reservation - 0.0 B remain available for the total pool

I can maybe share the data with some sort of NDA, but honestly it's not that interesting; it's just a lot of random data.
@adriangb There are still some cases where external sorting gives up with a memory allocation failure, especially when (1) … and (2) when memory pressure is so high that the multi-level merge step can't grow the reservation.
We need to estimate the memory required for a row-formatted batch correctly for (1), and further limit the number of spills to merge if the multi-level merge fails (#16908) for (2). I've been working on these follow-up issues recently, but it will take time. Anyway, until these fixes are done, I'd recommend trying to run the above query with a smaller
Which issue does this PR close?
Rationale for this change
We need a merge sort that does not fail with out-of-memory errors.
What changes are included in this PR?
Implemented a multi-level merge sort on top of SortPreservingMergeStream that spills intermediate results when there is not enough memory.

How does it work:

When using the MultiLevelMerge you provide in-memory streams and spill files; each spill file contains the memory size of the record batch with the largest memory consumption.
Why is this important?
SortPreservingMergeStream uses BatchBuilder, which grows and shrinks memory based on the record batches it gets; however, if there is not enough memory it will just fail.
This solution reserves beforehand, for each spill file, the worst-case record batch size, so there is no way to run out of memory mid-sort.
It will also try to reduce the buffer size and the number of streams to the minimum when there is not enough memory, and will only fail if there is not enough memory to hold 2 record batches with no buffering in the stream.
It can also easily be adjusted to allow a predefined maximum memory for the merge stream.
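A hedged pseudocode sketch of that back-off (names are illustrative, not the exact implementation): each pass reserves, in the worst case, one largest batch per participating spill file, and lowers the merge degree until the reservation succeeds or only an unbuffered 2-way merge remains.

```rust
/// Per spill file: the memory size of its largest batch, recorded at spill time.
struct SpillFileInfo {
    max_record_batch_memory: usize,
}

/// Pick how many spill files to merge in one pass, backing off when the
/// worst-case reservation cannot be satisfied. Returns None if even a
/// 2-way merge with no extra buffering cannot be reserved.
fn pick_merge_degree(
    files: &[SpillFileInfo],
    mut try_reserve: impl FnMut(usize) -> bool,
) -> Option<usize> {
    let mut degree = files.len().max(2);
    while degree >= 2 {
        let worst_case: usize = files
            .iter()
            .take(degree)
            .map(|f| f.max_record_batch_memory)
            .sum();
        if try_reserve(worst_case) {
            return Some(degree);
        }
        degree -= 1;
    }
    None
}
```

Each pass merges that many files into a new sorted spill file, and the process repeats until the remaining runs fit in a single final merge.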
Are these changes tested?
Yes, added fuzz tests for aggregate and sort.
Are there any user-facing changes?
not really
Related to #15610