@2010YOUY01
Contributor

Which issue does this PR close?

An initial attempt towards #18467

Rationale for this change

Rationale for the additional lint rule clippy::needless_pass_by_value

There is a clippy lint rule that is not turned on by the current strictness level in CI: https://rust-lang.github.io/rust-clippy/master/index.html#needless_pass_by_value
Note that it belongs to the Clippy category pedantic, which the Clippy documentation (https://doc.rust-lang.org/nightly/clippy) describes as "lints which are rather strict or have occasional false positives".

It seems we have been suffering from excessive copying for quite some time, and @alamb is currently on the front line of fixing it (see #18413). I think this extra lint rule can help.

Implementation plan

This PR enables the rule only in the datafusion-common package, and applies #[allow(clippy::needless_pass_by_value)] to all existing violations.
If this PR makes sense, we can open a tracking issue and roll out this check to the remaining workspace packages. At least this can help prevent new inefficient patterns and identify existing issues that we can fix gradually.
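
As a rough illustration of what the lint targets, here is a minimal sketch. The function names are hypothetical and not taken from datafusion-common. A function that only reads its argument does not need to own it, and the by-value form is where the PR's #[allow(...)] attribute would go:

```rust
// Hypothetical example, not DataFusion code.

// Flagged by clippy::needless_pass_by_value: `names` is only read,
// yet callers must give up (or clone) their Vec to call this.
// The attribute suppresses the warning, as this PR does for existing code.
#[allow(clippy::needless_pass_by_value)]
fn total_len_by_value(names: Vec<String>) -> usize {
    names.iter().map(|n| n.len()).sum()
}

// Borrowing a slice expresses the same logic without taking ownership.
fn total_len_by_ref(names: &[String]) -> usize {
    names.iter().map(|n| n.len()).sum()
}
```

With the by-reference version, a caller keeps its vector after the call and no clone is needed.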

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions bot added the development-process (Related to development process of DataFusion) and common (Related to common crate) labels on Nov 3, 2025
Contributor

@alamb left a comment

Thank you @2010YOUY01 -- I think this looks like an improvement to me

If we are going to add this lint, I think we should also update the various APIs to pass by reference (not just add #[allow(clippy::needless_pass_by_value)])

If this PR makes sense, we can open a tracking issue and roll out this check to the remaining workspace packages. At least this can help prevent new inefficient patterns and identify existing issues that we can fix gradually.

I think it makes a lot of sense

use std::sync::Arc;

#[allow(clippy::needless_pass_by_value)]
fn create_qualified_schema(qualifier: &str, names: Vec<&str>) -> Result<DFSchema> {
Contributor

we could change this to impl IntoIterator<Item = &str> for example

Suggested change
- fn create_qualified_schema(qualifier: &str, names: Vec<&str>) -> Result<DFSchema> {
+ fn create_qualified_schema(qualifier: &str, names: impl IntoIterator<Item = &str>) -> Result<DFSchema> {
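
A minimal sketch of how such a signature behaves, assuming a hypothetical stand-in function (`qualified_names`) that returns plain strings rather than a DFSchema so that it needs no dependencies:

```rust
// Hypothetical stand-in for `create_qualified_schema`; it only builds
// "qualifier.name" strings so the sketch stays dependency-free.
fn qualified_names<'a>(
    qualifier: &str,
    names: impl IntoIterator<Item = &'a str>,
) -> Vec<String> {
    names
        .into_iter()
        .map(|name| format!("{qualifier}.{name}"))
        .collect()
}
```

Existing callers that pass a `Vec<&str>` keep compiling, while new callers can pass an array or any iterator of `&str` without first collecting into a vector.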

/// If `rehash==true` this combines the previous hash value in the buffer
/// with the new hash using `combine_hashes`
#[cfg(not(feature = "force_hash_collisions"))]
#[allow(clippy::needless_pass_by_value)]
Contributor

Why is this needed? What is clippy's alternate suggestion?

Contributor

(I am wondering if this has found a good potential for optimization...)

Contributor Author

The arg array can safely be taken by reference.

However, I believe all current call sites already pass it by moving, so they’re fine.

That said, there might be optimization opportunities elsewhere: this needless_pass_by_value warning could indicate that some callers are cloning unnecessarily just to satisfy the Rust compiler.
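
To make the "cloning just to satisfy the compiler" case concrete, here is a small sketch. All names are hypothetical and the checksum is only a placeholder for a routine that reads its input, not DataFusion's hashing code:

```rust
use std::sync::Arc;

// Hypothetical by-value API: the body only reads `values`.
#[allow(clippy::needless_pass_by_value)]
fn checksum_by_value(values: Arc<Vec<u64>>) -> u64 {
    values
        .iter()
        .fold(0u64, |acc, v| acc.wrapping_mul(31).wrapping_add(*v))
}

// Same logic, borrowing a slice instead.
fn checksum_by_ref(values: &[u64]) -> u64 {
    values
        .iter()
        .fold(0u64, |acc, v| acc.wrapping_mul(31).wrapping_add(*v))
}

// A caller that still needs `values` afterwards must clone the Arc
// to call the by-value version...
fn caller_with_clone(values: &Arc<Vec<u64>>) -> u64 {
    checksum_by_value(Arc::clone(values))
}

// ...but can simply lend it to the by-reference version.
fn caller_without_clone(values: &Arc<Vec<u64>>) -> u64 {
    checksum_by_ref(values)
}
```

Call sites that already move the value in, as described above, lose nothing either way; the by-value signature only costs something for callers shaped like `caller_with_clone`.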

cargo clippy --all-targets --workspace --features avro,pyarrow,integration-tests,extended_tests -- -D warnings

# Update packages incrementally for stricter Clippy checks
# TODO: add tracking issue for the remaining workspace packages like `datafusion-catalog`
Contributor

I think this would be better to add at the module level (rather than in the CI script) so that it would also be flagged locally when people run clippy.

Similar to this:

#![deny(clippy::clone_on_ref_ptr)]

Contributor Author

I reverted the CI script change and updated datafusion/common/Cargo.toml instead in 2736b39, since we can update the whole datafusion-common crate at once.

When updating a larger crate, I think it's possible to use a module-level attribute like the one suggested above to update it module by module.
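
A module-by-module rollout could look roughly like the sketch below. The module and function names are hypothetical; the only assumption is that an inner lint attribute at the top of a module applies to everything inside that module:

```rust
// Hypothetical crate layout: only `strict` has opted in to the lint so far.
mod strict {
    // Inner attribute: applies to every item inside this module.
    #![deny(clippy::needless_pass_by_value)]

    pub fn longest(names: &[&str]) -> usize {
        names.iter().map(|n| n.len()).max().unwrap_or(0)
    }
}

mod legacy {
    // Not migrated yet: no module-level attribute here, so the lint
    // level configured for the rest of the crate still applies.
    pub fn longest(names: Vec<&str>) -> usize {
        names.iter().map(|n| n.len()).max().unwrap_or(0)
    }
}
```

Once every module carries the attribute, it can be replaced by a single crate-wide setting.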

@github-actions bot added the substrait (Changes to the substrait crate) and proto (Related to proto crate) labels on Nov 4, 2025
@github-actions bot removed the substrait (Changes to the substrait crate) and proto (Related to proto crate) labels on Nov 4, 2025
@2010YOUY01
Contributor Author

Thank you @2010YOUY01 -- I think this looks like an improvement to me

If we are going to add this lint, I think we should also update the various APIs to pass by reference (not just add #[allow(clippy::needless_pass_by_value)])

If this PR makes sense, we can open a tracking issue and roll out this check to the remaining workspace packages. At least this can help prevent new inefficient patterns and identify existing issues that we can fix gradually.

I think it makes a lot of sense

Thank you for the feedback!

For the following cases, I kept the #[allow(clippy::needless_pass_by_value)] to suppress the lint:

  • Tests. This rule does seem a bit annoying for tests, and I couldn't find a way to turn it off for all test modules. However, we can apply the suppressing attribute to individual test modules.
  • Public APIs
  • Intentional moves
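
As an illustration of the "intentional moves" case, here is a hedged sketch with a hypothetical `Token` type (not from DataFusion). The body only reads the argument, so Clippy would likely suggest a reference, but taking ownership is the point of the API:

```rust
// Hypothetical single-use token.
struct Token {
    id: u32,
}

// The body only reads `token`, so the lint would likely suggest `&Token`.
// Taking it by value is deliberate: the move makes the token single-use,
// because the caller cannot touch it again after this call.
#[allow(clippy::needless_pass_by_value)]
fn redeem(token: Token) -> u32 {
    token.id * 2
}
```

In a case like this the suppression documents a design decision rather than hiding an inefficiency.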

For future work, I think we could start with crates that are performance critical or currently suffering from unnecessary clones.

@xudong963
Member

performance critical

One place is the logical optimizer I think

Contributor

@alamb left a comment

Thanks @2010YOUY01 -- this looks like an improvement to me

I think changing allow to expect would probably be a good improvement

I would also suggest as follow on PRs:

  1. Consolidate the clippy configuration into Cargo.toml (there are a bunch of configs in library functions too -- for example
    // Make sure fast / cheap clones on Arc are explicit:
    // https://github.com/apache/datafusion/issues/11143
    #![deny(clippy::clone_on_ref_ptr)]
  2. Change existing allow to expect to clean up our clippy linting

use arrow::datatypes::{DataType, SchemaBuilder};
use std::sync::Arc;

#[allow(clippy::needless_pass_by_value)]
Contributor

I personally prefer using

Suggested change
- #[allow(clippy::needless_pass_by_value)]
+ #[expect(clippy::needless_pass_by_value)]

Which will error if the lint is no longer needed

allow will sit there silently even when the lint is fixed
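
The difference can be sketched as follows. The function names are hypothetical, and the sketch assumes Rust 1.81 or newer, where `#[expect]` is stable. An unfulfilled expectation is reported through the `unfulfilled_lint_expectations` lint, which is a warning by default and therefore fails a build that runs with `-D warnings`:

```rust
// Hypothetical functions illustrating the two attributes.

// `expect` suppresses the lint like `allow` does, but if the lint stops
// firing here (for example after switching to `&[i32]`), the build
// reports `unfulfilled_lint_expectations`.
#[expect(clippy::needless_pass_by_value)]
fn count_expect(items: Vec<i32>) -> usize {
    items.len()
}

// This `allow` is stale: the function already borrows its argument,
// yet nothing ever points out that the attribute can be removed.
#[allow(clippy::needless_pass_by_value)]
fn count_allow(items: &[i32]) -> usize {
    items.len()
}
```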

Contributor Author

fixed in dac1004

- ScalarValue::List(arr) => fmt_list(arr.to_owned() as ArrayRef, f)?,
- ScalarValue::LargeList(arr) => fmt_list(arr.to_owned() as ArrayRef, f)?,
- ScalarValue::FixedSizeList(arr) => fmt_list(arr.to_owned() as ArrayRef, f)?,
+ ScalarValue::List(arr) => fmt_list(arr.as_ref(), f)?,
Contributor

👍
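
The shape of this change can be sketched without Arrow. The trait, alias, and struct below are hypothetical stand-ins for Arrow's `Array`, `ArrayRef`, and a concrete array type, and `describe` stands in for `fmt_list`:

```rust
use std::sync::Arc;

// Hypothetical stand-ins for Arrow's `Array` trait and `ArrayRef` alias.
trait Array {
    fn len(&self) -> usize;
}
type ArrayRef = Arc<dyn Array>;

struct Int32Array(Vec<i32>);

impl Array for Int32Array {
    fn len(&self) -> usize {
        self.0.len()
    }
}

// Before: an owned `ArrayRef` forces callers to clone the Arc and cast it.
#[allow(clippy::needless_pass_by_value)]
fn describe_owned(arr: ArrayRef) -> String {
    format!("list of {} values", arr.len())
}

// After: borrowing `&dyn Array` leaves the reference count untouched.
fn describe(arr: &dyn Array) -> String {
    format!("list of {} values", arr.len())
}

fn describe_before(arr: &Arc<Int32Array>) -> String {
    describe_owned(Arc::clone(arr) as ArrayRef)
}

fn describe_after(arr: &Arc<Int32Array>) -> String {
    // `as_ref()` yields `&Int32Array`, which coerces to `&dyn Array`.
    let inner: &Int32Array = arr.as_ref();
    describe(inner)
}
```

Cloning an `Arc` is cheap but not free, since it is an atomic reference-count update, and the borrowed form also drops the cast.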

@2010YOUY01
Contributor Author

  1. Consolidate the clippy configuration into Cargo.toml (there are a bunch of configs in library functions too -- for example
    // Make sure fast / cheap clones on Arc are explicit:
    // https://github.com/apache/datafusion/issues/11143
    #![deny(clippy::clone_on_ref_ptr)]

fixed in dac1004

@2010YOUY01
Contributor Author

Thank you @alamb and @xudong963 for the review and suggestions.

I have opened #18503 to track the follow-up tasks for enforcing this lint rule globally in DataFusion. I’ve opened issues for only two smaller packages first, to see how things go.

#![deny(clippy::clone_on_ref_ptr)]
// This lint rule is enforced in `../Cargo.toml`, but it's okay to skip them in tests
// See details in https://github.com/apache/datafusion/issues/18503
#![cfg_attr(test, allow(clippy::needless_pass_by_value))]
Contributor Author

To turn off this lint rule for all tests

Labels

common (Related to common crate), development-process (Related to development process of DataFusion), logical-expr (Logical plan and expressions)

3 participants