Skip to content

Add HashSet check to Enumerable.Distinct #114237

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

brantburnett
Copy link
Contributor

@brantburnett brantburnett commented Apr 3, 2025

Libraries which are written to receive IEnumerable<T> as inputs may be performing Enumerable.Distinct operations on those inputs. Distinct is currently building a HashSet<T> that gradually fills as the returned IEnumerable<T> is iterated.

If the caller is passes a HashSet<T> that uses the same equality comparer as the IEnumerable<T>, this potentially expensive step is unnecessary. The input HashSet<T> is known to already be distinct based on the given comparer and may be returned directly as a no-op.

@Copilot Copilot bot review requested due to automatic review settings April 3, 2025 18:56
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR optimizes the Enumerable.Distinct method to avoid unnecessary processing when the source is a HashSet with a matching equality comparer. It adds new tests to verify the no-op behavior when using a HashSet and updates the Distinct method accordingly.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
src/libraries/System.Linq/tests/DistinctTests.cs Added tests to ensure the optimization returns the original HashSet when appropriate
src/libraries/System.Linq/src/System/Linq/Distinct.cs Updated Distinct to return the source directly if it is a HashSet with a matching comparer

@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Apr 3, 2025
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-linq
See info in area-owners.md if you want to be subscribed.

@@ -22,6 +22,13 @@ public static IEnumerable<TSource> Distinct<TSource>(this IEnumerable<TSource> s
return [];
}

if (source is HashSet<TSource> hashSet &&
ReferenceEquals(hashSet.Comparer, comparer ?? EqualityComparer<TSource>.Default))
Copy link
Member

@stephentoub stephentoub Apr 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you point to any real-world examples where this will be common, i.e. some function on GitHub that returns a HashSet typed as an enumerable then consumed by a caller using Distinct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you point to any real-world examples where this will be common, i.e. some function on GitHub that returns a HashSet typed as an enumerable then consumed by a caller using Distinct?

The use case I found was actually on a large private repository. One site is in a shared library used for updating a DB table based on a list of IDs. The list is taken as IEnumerable<int> to allow any type of input list, but then the method uses .Distinct() to ensure it doesn't repeat IDs when executing the queries. There are multiple call sites spread across the large repository, but one of the call sites is building the list to pass using HashSet<int> so it's deduplicating while building the list.

Of course, this specific example could be addressed by either expressly taking HashSet<T> as an overload on the method, or type testing in the method for HashSet<T>. But I thought it might be worth a more general optimization as I imagine others may be encountering it without realizing it, especially since the cost of the type check is low. That said, I don't have any other real-world examples and I understand if you think this might be excessive optimization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.Linq community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants