Skip to content

Conversation

@adriangb
Copy link
Contributor

The rationale is that we want to use it in DataFusion but don't want to force a dependency on Parquet to use this trait.

adriangb and others added 6 commits January 10, 2026 10:52
This PR introduces two new crates:

- `arrow-memory-size`: Contains the `HeapSize` trait for estimating heap
  memory usage, with implementations for std types, arrow-buffer types,
  and all arrow-array types.

- `arrow-memory-size-derive`: Provides `#[derive(HeapSize)]` proc macro
  for automatically implementing HeapSize on structs and enums.

The HeapSize infrastructure was previously in the parquet crate. Moving
it to a standalone crate allows users who don't use parquet to depend on
memory size estimation utilities.

**Breaking Change**: The `#[derive(HeapSize)]` macro has been removed from
`parquet_derive`. Users should switch to `arrow-memory-size-derive`.

The parquet crate re-exports `HeapSize` from `arrow_memory_size` for
backward compatibility.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Move HeapSize implementations from arrow-memory-size to their
  respective crates (arrow-buffer, arrow-array)
- arrow-memory-size now contains only the trait and std impls
- Re-export HeapSize and HeapSizeDerive from arrow::util
- Add derive macro attributes: #[heap_size(ignore)], #[heap_size(size = N)],
  #[heap_size(size_fn = path)]
- Add type coverage: tuples (up to 12), arrays [T; N], Mutex, RwLock
- Add README files with comparison to deepsize and get-size2 crates

This allows users who don't use arrow to depend on just arrow-memory-size
for the trait, while arrow crates implement HeapSize for their own types.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Remove prescriptive recommendations about which crates to use
- Explain motivations: minimal API, customizable Arc/Rc semantics, Arrow integration
- Simplify comparison section to just explain our approach

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate labels Jan 10, 2026
Tests for the HeapSize derive macro are not parquet-specific,
so move them from parquet_derive_test to arrow-memory-size-derive/tests/.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Comment on lines 61 to 65
impl HeapSize for StringArray {
fn heap_size(&self) -> usize {
self.get_buffer_memory_size()
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could probably be a blanket implementation:

impl<T> HeapSize for T
where
    T: Array

Proc-macro crates cannot resolve intra-doc links to external crates,
causing rustdoc errors for the arrow_memory_size::HeapSize references.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@adriangb adriangb marked this pull request as ready for review January 11, 2026 20:34
Replace individual HeapSize implementations with a macro to reduce
code duplication. A blanket impl<T: Array> is not possible due to
Rust's orphan rules (E0210).

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@alamb
Copy link
Contributor

alamb commented Jan 12, 2026

I am not sure we need a new crate just for the HeapSize trait in a downstream crate (DataFUsion)

What about making a trait like this in DataFusion common:

pub trait DFHeapSize {
...
}

And then providing a blanket implementation for DFHeapSize for everything that implements HeapSize

impl <T: HeapSize> for DFHeapSize { 
...
}

That way we can use the arrow implementations, but we don't have to make a whole new crate etc

arrow-data = { version = "57.2.0", path = "./arrow-data" }
arrow-ipc = { version = "57.2.0", path = "./arrow-ipc" }
arrow-json = { version = "57.2.0", path = "./arrow-json" }
arrow-memory-size = { version = "57.2.0", path = "./arrow-memory-size" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to do this I would suggest we put it in arrow-buffer or something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to just move it there instead of a new crate, certainly easier

@adriangb
Copy link
Contributor Author

I agree we can move this into arrow-buffer and then make a trait in DataFusion if that sounds better. We still a different crate for the macros though 😞

@alamb
Copy link
Contributor

alamb commented Jan 12, 2026

What do we think about just adding the new crate in DataFusion?

Or is there more to this PR that I am missing?

@alamb
Copy link
Contributor

alamb commented Jan 12, 2026

I haven't had a chance to look through it (2k lines is a lot!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants