-
Notifications
You must be signed in to change notification settings - Fork 4.9k
Add HashSet check to Enumerable.Distinct #114237
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR optimizes the Enumerable.Distinct method to avoid unnecessary processing when the source is a HashSet with a matching equality comparer. It adds new tests to verify the no-op behavior when using a HashSet and updates the Distinct method accordingly.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
File | Description |
---|---|
src/libraries/System.Linq/tests/DistinctTests.cs | Added tests to ensure the optimization returns the original HashSet when appropriate |
src/libraries/System.Linq/src/System/Linq/Distinct.cs | Updated Distinct to return the source directly if it is a HashSet with a matching comparer |
Tagging subscribers to this area: @dotnet/area-system-linq |
@@ -22,6 +22,13 @@ public static IEnumerable<TSource> Distinct<TSource>(this IEnumerable<TSource> s | |||
return []; | |||
} | |||
|
|||
if (source is HashSet<TSource> hashSet && | |||
ReferenceEquals(hashSet.Comparer, comparer ?? EqualityComparer<TSource>.Default)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you point to any real-world examples where this will be common, i.e. some function on GitHub that returns a HashSet typed as an enumerable then consumed by a caller using Distinct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you point to any real-world examples where this will be common, i.e. some function on GitHub that returns a HashSet typed as an enumerable then consumed by a caller using Distinct?
The use case I found was actually on a large private repository. One site is in a shared library used for updating a DB table based on a list of IDs. The list is taken as IEnumerable<int>
to allow any type of input list, but then the method uses .Distinct()
to ensure it doesn't repeat IDs when executing the queries. There are multiple call sites spread across the large repository, but one of the call sites is building the list to pass using HashSet<int>
so it's deduplicating while building the list.
Of course, this specific example could be addressed by either expressly taking HashSet<T>
as an overload on the method, or type testing in the method for HashSet<T>
. But I thought it might be worth a more general optimization as I imagine others may be encountering it without realizing it, especially since the cost of the type check is low. That said, I don't have any other real-world examples and I understand if you think this might be excessive optimization.
Libraries which are written to receive
IEnumerable<T>
as inputs may be performingEnumerable.Distinct
operations on those inputs.Distinct
is currently building aHashSet<T>
that gradually fills as the returnedIEnumerable<T>
is iterated.If the caller is passes a
HashSet<T>
that uses the same equality comparer as theIEnumerable<T>
, this potentially expensive step is unnecessary. The inputHashSet<T>
is known to already be distinct based on the given comparer and may be returned directly as a no-op.