
[DRAFT] Perceptual Hashing Deduplication #226


Open
wants to merge 6 commits into develop

Conversation

mwoodson1

This PR adds perceptual hashing-based deduplication on top of the existing file hashing and embedding-based deduplication methods. Perceptual hashes are much faster to compute than embeddings and more robust than simple file hashes, so this serves as a good in-between method for finding duplicate images. The following hashes are implemented:

  • Perceptual Hash
  • Average Hash
  • Wavelet Hash
  • Difference Hash

The API is implemented as an extension of fob.compute_exact_duplicates.

dataset = ...
duplicates = fob.compute_exact_duplicates(dataset, hash_method='dhash', threshold=0.01)
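
For illustration, a hash helper along these lines could be built on the third-party imagehash package (sketch only, not necessarily how this PR implements it; names mirror the API above):

import numpy as np
import imagehash
from PIL import Image

# illustrative mapping from method name to imagehash function
_HASH_FUNCS = {
    "phash": imagehash.phash,
    "ahash": imagehash.average_hash,
    "whash": imagehash.whash,
    "dhash": imagehash.dhash,
}

def compute_image_hash(image_path, method="phash", hash_size=8):
    """Returns the image hash as a flat 0/1 numpy vector."""
    with Image.open(image_path) as img:
        h = _HASH_FUNCS[method](img, hash_size=hash_size)

    # h.hash is a boolean (hash_size x hash_size) array; flattened, the Hamming
    # distance between two hashes is just the L1 distance between the vectors
    return np.asarray(h.hash, dtype=np.uint8).flatten()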

@mwoodson1 added the "feature" label (Work on a feature request) on Jan 9, 2025
Contributor

@jacobsela left a comment

Generally looks good. I added some minor notes and one major note about the option of having this in compute_near_duplicates. Let me know what you think about it.

From a correctness standpoint it works great, aside from whash, which I can't seem to get good results out of.

I made a notebook for comparing the hash methods against each other, as well as against the full compute_near_duplicates with CLIP embeddings as a baseline.
Results:

  • all methods find everything file hash finds
  • the choice of threshold of course matters a lot
  • generally the hashes get quite good performance
  • from a cursory look, CLIP embeddings seem to have better P/R tradeoffs, but the hashes were able to find some duplicates that CLIP missed. Pretty interesting.
  • dhash seems to behave the most like CLIP

EDIT: I haven't played around with thresholds too much; more performance can probably be squeezed out.

import scipy


def compute_image_hash(image_path, method="phash", hash_size=8):
Contributor

Not specific to this, not necessarily important to change here but wanted to note: we use this pattern (consolidate many implementations into 1 core function) a lot. It's not great obviously because you have to manually add each new option to the if-else. We should think about moving to registries over time.
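
To illustrate, a registry-based dispatch could look roughly like this (purely illustrative; names are hypothetical and not part of this PR):

_IMAGE_HASH_REGISTRY = {}

def register_image_hash(name):
    """Decorator that registers a hash implementation under the given name."""
    def decorator(func):
        _IMAGE_HASH_REGISTRY[name] = func
        return func
    return decorator

@register_image_hash("dhash")
def _dhash(image_path, hash_size=8):
    raise NotImplementedError  # real implementation would go here

def compute_image_hash(image_path, method="phash", hash_size=8):
    # dispatch via the registry instead of a growing if/else chain
    try:
        hash_func = _IMAGE_HASH_REGISTRY[method]
    except KeyError:
        raise ValueError("Unsupported hash method '%s'" % method)

    return hash_func(image_path, hash_size=hash_size)

New hash methods would then only need the decorator, with no edits to the core function.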

Author

Good point, I typically prefer registries when the set of implementations is unbounded (new model architectures, for example). I believe the set of hash functions for images is relatively bounded, so I wasn't too worried.

@@ -92,7 +96,14 @@ def compute_near_duplicates(
return similarity_index


def compute_exact_duplicates(samples, num_workers, skip_failures, progress):
def compute_exact_duplicates(
Contributor

@jacobsela Jan 9, 2025

Maybe have this in compute_near_duplicates? You can compute hashes into a field and pass that as the embedding field. It should give the same behavior. This serves the following:

  1. Clarity: this still isn't exact duplicates.
  2. This requires a threshold parameter, which is more in line with near duplicates than exact.
  3. This code uses the sklearn.metrics.pairwise.pairwise_distances backend, just like the similarity index w/ duplicates mixin in compute_near_duplicates.
  4. near duplicates returns the underlying similarity index, allowing for easier manipulation of the data afterwards compared with the neighbors map.
  5. Having one point of contact with metric methods is useful long term when we want to work on scaling this code.

What do you think?
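
For concreteness, the flow I have in mind is roughly the following (sketch; the image_hash field name is a placeholder, and it assumes a compute_image_hash helper that returns a flat numeric vector):

import fiftyone.brain as fob

# 1) compute hashes into a per-sample field
for sample in dataset.iter_samples(autosave=True):
    sample["image_hash"] = compute_image_hash(sample.filepath, method="dhash").tolist()

# 2) reuse the existing near-duplicates machinery, treating the hashes as embeddings
index = fob.compute_near_duplicates(
    dataset,
    embeddings="image_hash",
    threshold=0.01,
)
duplicates_view = index.duplicates_view()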

Author

I went back and forth for a while on whether this should be in compute_exact_duplicates vs compute_near_duplicates. I'd be curious to get @prernadh's opinion, since this was a customer request, but in my opinion hash-based deduplication is closer to exact deduplication than it is to near deduplication.

Contributor

Mmm, I would think this is closer to near duplicates than exact, since we are able to use a threshold to find close samples.

for i, _id in enumerate(_ids):
    nearby_indices = np.where(thresholded_distances[i, :])[0]
    duplicate_ids = [_ids[j] for j in nearby_indices]
    neighbors_map[_id] = duplicate_ids
Contributor

This is different from the similarity index's neighbors_map, which has keys of "unique" samples and values of the duplicates of that sample. Not an issue, but it can be misleading.

Keep in mind this counts each pair of neighbors twice. It may be worthwhile to only go over the upper triangle of the thresholded_distances matrix, i.e. nearby_indices = np.where(thresholded_distances[i, i:])[0] + i.
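
Roughly, something like this (sketch; i + 1 also skips the trivial self-match on the diagonal, adjust if the diagonal is already zeroed out):

for i, _id in enumerate(_ids):
    # only look at columns to the right of the diagonal so each pair is recorded once
    nearby_indices = np.where(thresholded_distances[i, i + 1:])[0] + i + 1
    neighbors_map[_id] = [_ids[j] for j in nearby_indices]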

@prernadh
Contributor

Thanks for adding this in, Markus. I do think this technique might fit in better as a near duplicate calculation.

@mwoodson1
Author

@prernadh @jacobsela I have a few concerns with putting this functionality into compute_near_duplicates:

  1. I believe hash-based deduplication is much better suited to the term "exact", though I see where you both are coming from. I define "exact" as very close to pixel matching, modulo some small color or affine transform. Though you technically could use hashes for near deduplication, I would not encourage people to do so since hashes get noisy pretty quickly.
  2. The current implementation of compute_near_duplicates is pretty bloated already, with many hyperparameters and automatic functionality built in. It would require maybe 2 more hyperparameters to support hashing instead of embeddings in that function.
  3. compute_near_duplicates is pretty reliant on embeddings, and we have established that pattern deeply with our customers. If we add hashing as an option there, some customers may never discover that feature as easily as they would a change to compute_exact_duplicates, which I am going to guess is a less commonly used feature.

If you both believe strongly that it should still go into compute_near_duplicates, I can make the change next week.

@mwoodson1
Author

Moved the implementation over to compute_near_duplicates per request.

The following test

import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob

dataset = foz.load_zoo_dataset('caltech101', drop_existing_dataset=True)
duplicates = fob.compute_near_duplicates(
    dataset,
    threshold=0.01,
    hash_method='dhash')

duplicates_view = duplicates.duplicates_view()

dataset.save_view("duplicates", duplicates_view, overwrite=True)
session = fo.launch_app(duplicates_view)

Yields the following view, which is exactly what I got with the previous implementation.
[Screenshot of the resulting duplicates view, 2025-01-15]

@jacobsela self-requested a review on January 16, 2025
Contributor

@jacobsela left a comment

Left some notes about moving the additions (i.e. argument checking and computation of hashes) to compute_near_duplicates rather than having them in similarity.

Other than that LGTM.

@@ -544,6 +544,7 @@ def compute_similarity(
brain_key=None,
model=None,
model_kwargs=None,
hash_method=None,
Contributor

What do you think about removing these arguments from similarity and having them just in compute_near_duplicates? Do you foresee users wanting to use other similarity functionality besides duplicates on the hashes?

compute_near_duplicates can build the appropriate similarity index on the fly based on the arguments given.

Contributor

I now see that this is effectively what you are doing in compute_similarity. The code looks good, but I think compute_near_duplicates is a better home for it. Every compute_near_duplicates is a compute_similarity, but not the other way around, so code that handles the special cases that arise for compute_near_duplicates should be there.

Author

What do you mean by this? The functionality is primarily exposed through compute_near_duplicates, but in order to reuse the code for distance computation, some of the hashing code needs to live in compute_similarity.

Contributor

Both the functionality and the implementation should be in compute_near_duplicates with the current implementation that you have, i.e. within compute_near_duplicates:

  1. verify arguments
  2. compute hashes
  3. call compute_similarity as you would usually, but with modified arguments
similarity_index = fb.compute_similarity(
    samples,
    backend="sklearn",
    roi_field=roi_field,
    embeddings=hashes,  # even better, do `embeddings = hashes` earlier
    model=model,  # should be None because of argument checking
    model_kwargs=model_kwargs,  # should be None because of argument checking
    force_square=force_square,
    alpha=alpha,
    batch_size=batch_size,
    num_workers=num_workers,
    skip_failures=skip_failures,
    progress=progress,
    metric='cosine' if hash_method is None else 'manhattan',  # sklearn similarity already accepts this argument
)
  4. Run duplicates on the index and return it to the user, as currently happens.

Author

I'd argue it makes more sense to keep the current layout for the following reasons:

  1. It matches exactly how compute_near_duplicates already works, with compute_similarity doing both the embedding computation and the index creation.
  2. If we break that pattern, then a user who wants an index on hashes instead of embeddings has to first call compute_near_duplicates, or dig into the code to figure out how to compute hashes and call compute_similarity elsewhere. That seems like too much work from a UX perspective.

Contributor

I think this discussion is getting at how we want to manage the brain from a design-philosophy standpoint. This is probably something we should get together and discuss properly, because I’m sure it will come up a lot more in the future. What do you think?

Regarding the current point at hand:

First I’d like to say that my main concern is keeping this code out of similarity, rather than getting it into compute_near_duplicates specifically. If you’d prefer to have a new function like compute_perceptual_hash_duplicates or something along those lines, that’s also a good solution.

The case for moving to compute_near_duplicates:

  1. Most importantly, similarity needs to stay clean. Similarity is the contact point of all other methods to metric methods in FO. Deduplication with hashes is a niche task, so its code shouldn’t go into the well-encapsulated, very widely-used code of similarity.
  2. Since compute_near_duplicates (and by extension, hash deduplication) is effectively a macro for a specific use-case of similarity, it shouldn’t change the core similarity code. From hash deduplication’s standpoint, compute similarity can be a black box and it will still work fine.
  3. compute_similarity already has all of the functionality and arguments needed to be used for hash deduplication. Why add more code when it’s not needed?

To address your points:

  1. The core functionality of compute_similarity is to create a similarity index for the user. If you could get hash computation to fit into existing embedding computation code (by abstracting the hash as a “model” of sorts), then hash deduplication would transparently fit into compute_similarity. Since the implementation of the “embedding computation” is so different, it belongs elsewhere. On top of all of this, the embedding computation is a convenience (a good one because computing embeddings is very very widely needed), but arguably not the core functionality. This is why compute_similarity allows the user to pass embeddings as a numpy array or by field (in fact, it is better to precompute embeddings and pass them to similarity from a performance standpoint, because similarity can’t make strong assumptions on the user’s setup and as such is not super optimal).
  2. This is why I asked above if you foresee users needing a similarity index with hashes. What is the use case that this serves? If this is in fact a very niche use case, it’s not worth changing very widely used, core code for.
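
To make the "precompute and pass embeddings" route concrete, it would look something like this (sketch only; compute_image_hash stands in for the hash helper from this PR, dataset is assumed to be loaded, and the metric passthrough is the one noted earlier):

import numpy as np
import fiftyone.brain as fob

# precompute the hash vectors once, outside of similarity
hashes = np.stack(
    [compute_image_hash(fp, method="dhash") for fp in dataset.values("filepath")]
)

# similarity stays a black box: it just receives precomputed "embeddings"
index = fob.compute_similarity(
    dataset,
    backend="sklearn",
    embeddings=hashes,
    metric="manhattan",  # forwarded to the sklearn backend
    brain_key="hash_dedup",
)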

Contributor

@prernadh left a comment

Just some comments about function names - otherwise looks good.

Comment on lines +178 to +181
if method is None or method in FILE_HASH_TYPES:
    filehash = fou.compute_filehash(filepath, method=method)
else:
    filehash = fbh.compute_image_hash(filepath, method=method)
Contributor

I find it a bit confusing to add image hash computation under the _compute_filehash function. I would say it's better to rename this function for clarity.

Comment on lines +191 to +194
if method is None or method in FILE_HASH_TYPES:
    filehash = fou.compute_filehash(filepath, method=method)
else:
    filehash = fbh.compute_image_hash(filepath, method=method)
Contributor

Similar comment to earlier: this function name is no longer accurate if we are not just calculating file hashes here anymore.
