
[DRAFT] Perceptual Hashing Deduplication #226


Open
wants to merge 6 commits into develop

Conversation

mwoodson1

This PR adds perceptual hashing-based deduplication on top of the existing file hashing and embedding-based deduplication methods. Perceptual hashes are much faster to compute than embeddings and more robust than simple file hashes, so this serves as a good in-between method for finding duplicate images. The following hashes are implemented:

  • Perceptual Hash
  • Average Hash
  • Wavelet Hash
  • Difference Hash

The API is implemented as an extension of fob.compute_exact_duplicates.

dataset = ...
duplicates = fob.compute_exact_duplicates(dataset, hash_method='dhash', threshold=0.01)
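
For illustration, a hash helper along these lines could be built on the third-party imagehash package (sketch only, not necessarily how this PR implements it; names mirror the API above):

import numpy as np
import imagehash
from PIL import Image

# illustrative mapping from method name to imagehash function
_HASH_FUNCS = {
    "phash": imagehash.phash,
    "ahash": imagehash.average_hash,
    "whash": imagehash.whash,
    "dhash": imagehash.dhash,
}

def compute_image_hash(image_path, method="phash", hash_size=8):
    """Returns the image hash as a flat 0/1 numpy vector."""
    with Image.open(image_path) as img:
        h = _HASH_FUNCS[method](img, hash_size=hash_size)

    # h.hash is a boolean (hash_size x hash_size) array; flattened, the Hamming
    # distance between two hashes is just the L1 distance between the vectors
    return np.asarray(h.hash, dtype=np.uint8).flatten()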

@mwoodson1 added the "feature" label (Work on a feature request) on Jan 9, 2025
Contributor

@jacobsela left a comment

Generally looks good. I added some minor notes and one major note about the option of having this in compute_near_duplicates. Let me know what you think about it.

From a correctness standpoint it works great, aside from whash, which I can't seem to get good results out of.

I made a notebook for comparing the hash methods against each other, as well as against the full compute_near_duplicates with CLIP embeddings as a baseline.
Results:

  • all methods find everything file hash finds
  • the choice of threshold of course matters a lot
  • generally the hashes get quite good performance
  • from a cursory look, CLIP embeddings seem to have better P/R tradeoffs, but the hashes were able to find some duplicates that CLIP missed. Pretty interesting.
  • dhash seems to behave the most like CLIP

EDIT: I haven't played around with thresholds too much; more performance can probably be squeezed out.

import scipy


def compute_image_hash(image_path, method="phash", hash_size=8):
Contributor

Not specific to this, not necessarily important to change here but wanted to note: we use this pattern (consolidate many implementations into 1 core function) a lot. It's not great obviously because you have to manually add each new option to the if-else. We should think about moving to registries over time.
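
To illustrate, a registry-based dispatch could look roughly like this (purely illustrative; names are hypothetical and not part of this PR):

_IMAGE_HASH_REGISTRY = {}

def register_image_hash(name):
    """Decorator that registers a hash implementation under the given name."""
    def decorator(func):
        _IMAGE_HASH_REGISTRY[name] = func
        return func
    return decorator

@register_image_hash("dhash")
def _dhash(image_path, hash_size=8):
    raise NotImplementedError  # real implementation would go here

def compute_image_hash(image_path, method="phash", hash_size=8):
    # dispatch via the registry instead of a growing if/else chain
    try:
        hash_func = _IMAGE_HASH_REGISTRY[method]
    except KeyError:
        raise ValueError("Unsupported hash method '%s'" % method)

    return hash_func(image_path, hash_size=hash_size)

New hash methods would then only need the decorator, with no edits to the core function.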

Author

Good point, I typically prefer registries when the set of implementations is unbounded (new model architectures, for example). I believe the set of hash functions for images is relatively bounded, so I wasn't too worried.

@@ -92,7 +96,14 @@ def compute_near_duplicates(
return similarity_index


def compute_exact_duplicates(samples, num_workers, skip_failures, progress):
def compute_exact_duplicates(
Contributor

@jacobsela Jan 9, 2025

Maybe have this in compute_near_duplicates? You can compute hashes into a field and pass that as the embedding field. It should give the same behavior. This serves the following:

  1. Clarity: this still isn't exact duplicates.
  2. This requires a threshold parameter, which is more in line with near duplicates than exact.
  3. This code uses the sklearn.metrics.pairwise.pairwise_distances backend, just like the similarity index w/ duplicates mixin in compute_near_duplicates.
  4. near duplicates returns the underlying similarity index, allowing for easier manipulation of the data afterwards compared with the neighbors map.
  5. Having one point of contact with metric methods is useful long term when we want to work on scaling this code.

What do you think?
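
For concreteness, the flow I have in mind is roughly the following (sketch; the image_hash field name is a placeholder, and it assumes a compute_image_hash helper that returns a flat numeric vector):

import fiftyone.brain as fob

# 1) compute hashes into a per-sample field
for sample in dataset.iter_samples(autosave=True):
    sample["image_hash"] = compute_image_hash(sample.filepath, method="dhash").tolist()

# 2) reuse the existing near-duplicates machinery, treating the hashes as embeddings
index = fob.compute_near_duplicates(
    dataset,
    embeddings="image_hash",
    threshold=0.01,
)
duplicates_view = index.duplicates_view()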

Author

I went back and forth for a while on whether this should be in compute_exact_duplicates vs compute_near_duplicates. I'd be curious to get @prernadh's opinion, since this was a customer request, but in my opinion hash-based deduplication is closer to exact deduplication than it is to near deduplication.

Contributor

Mmm, I would think this is closer to near duplicates than exact, since we are able to use a threshold to find close samples.

for i, _id in enumerate(_ids):
    nearby_indices = np.where(thresholded_distances[i, :])[0]
    duplicate_ids = [_ids[j] for j in nearby_indices]
    neighbors_map[_id] = duplicate_ids
Contributor

This is different from the similarity index's neighbors_map, which has keys of "unique" samples and values of the duplicates of that sample. Not an issue, but it can be misleading.

Keep in mind this counts each pair of neighbors twice. It may be worthwhile to only go over the upper triangle of the thresholded_distances matrix, i.e. nearby_indices = np.where(thresholded_distances[i, i:])[0] + i.
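
Roughly, something like this (sketch; i + 1 also skips the trivial self-match on the diagonal, adjust if the diagonal is already zeroed out):

for i, _id in enumerate(_ids):
    # only look at columns to the right of the diagonal so each pair is recorded once
    nearby_indices = np.where(thresholded_distances[i, i + 1:])[0] + i + 1
    neighbors_map[_id] = [_ids[j] for j in nearby_indices]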

@prernadh
Contributor

Thanks for adding this in, Markus. I do think this technique might fit in better as a near duplicate calculation.

@mwoodson1
Author

@prernadh @jacobsela I have a few concerns with putting this functionality into compute_near_duplicates:

  1. I believe hash-based deduplication is much better suited to the term "exact", though I see where you both are coming from. I define "exact" as very close to pixel matching, modulo some small color or affine transform. Though you technically could use hashes for near deduplication, I would not encourage people to do so since hashes get noisy pretty quickly.
  2. The current implementation of compute_near_duplicates is pretty bloated already, with many hyperparameters and automatic functionality built in. It would require maybe 2 more hyperparameters to support hashing instead of embeddings in that function.
  3. compute_near_duplicates is pretty reliant on embeddings, and we have established that pattern deeply with our customers. If we add hashing as an option there, some customers may never discover that feature as easily as they would a change to compute_exact_duplicates, which I am going to guess is a less commonly used feature.

If you both believe strongly that it should still go into compute_near_duplicates, I can make the change next week.

@mwoodson1
Author

Moved the implementation over to compute_near_duplicates per request.

The following test

import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob

dataset = foz.load_zoo_dataset('caltech101', drop_existing_dataset=True)
duplicates = fob.compute_near_duplicates(
    dataset,
    threshold=0.01,
    hash_method='dhash')

duplicates_view = duplicates.duplicates_view()

dataset.save_view("duplicates", duplicates_view, overwrite=True)
session = fo.launch_app(duplicates_view)

Yields the following view, which is exactly what I got with the previous implementation.
[Screenshot of the resulting duplicates view, 2025-01-15]

@jacobsela self-requested a review on January 16, 2025
Contributor

@jacobsela left a comment

Left some notes about moving the additions (i.e. argument checking and computation of hashes) to compute_near_duplicates rather than having them in similarity.

Other than that LGTM.

@@ -544,6 +544,7 @@ def compute_similarity(
brain_key=None,
model=None,
model_kwargs=None,
hash_method=None,
Contributor

What do you think about removing these arguments from similarity and having them just in compute_near_duplicates? Do you foresee users wanting to use other similarity functionality besides duplicates on the hashes?

compute_near_duplicates can build the appropriate similarity index on the fly based on the arguments given.

Contributor

I now see that this is effectively what you are doing in compute_similarity. The code looks good, but I think compute_near_duplicates is a better home for it. Every compute_near_duplicates is a compute_similarity, but not the other way around, so code that handles the special cases that arise for compute_near_duplicates should be there.

Author

What do you mean by this? The functionality is primarily exposed through compute_near_duplicates, but in order to reuse the code for distance computation, some of the hashing code needs to live in compute_similarity.

Contributor

Both the functionality and the implementation should be in compute_near_duplicates with the current implementation that you have, i.e. within compute_near_duplicates:

  1. verify arguments
  2. compute hashes
  3. call compute_similarity as you would usually, but with modified arguments
similarity_index = fb.compute_similarity(
    samples,
    backend="sklearn",
    roi_field=roi_field,
    embeddings=hashes,  # even better, do `embeddings = hashes` earlier
    model=model,  # should be None because of argument checking
    model_kwargs=model_kwargs,  # should be None because of argument checking
    force_square=force_square,
    alpha=alpha,
    batch_size=batch_size,
    num_workers=num_workers,
    skip_failures=skip_failures,
    progress=progress,
    metric='cosine' if hash_method is None else 'manhattan',  # sklearn similarity already accepts this argument
)
  4. Run duplicates on the index and return it to the user, as currently happens.

Author

I'd argue it makes more sense to keep the current layout for the following reasons:

  1. It matches exactly how compute_near_duplicates already works, with compute_similarity doing both the embedding computation and the index creation.
  2. If we break that pattern, then a user who wants an index on hashes instead of embeddings has to first call compute_near_duplicates, or dig into the code to figure out how to compute hashes and call compute_similarity elsewhere. That seems like too much work from a UX perspective.

Contributor

I think this discussion is getting at how we want to manage the brain from a design-philosophy standpoint. This is probably something we should get together and discuss properly, because I’m sure it will come up a lot more in the future. What do you think?

Regarding the current point at hand:

First I’d like to say that my main concern is keeping this code out of similarity, rather than getting it into compute_near_duplicates specifically. If you’d prefer to have a new function like compute_perceptual_hash_duplicates or something along those lines, that’s also a good solution.

The case for moving to compute_near_duplicates:

  1. Most importantly, similarity needs to stay clean. Similarity is the contact point of all other methods to metric methods in FO. Deduplication with hashes is a niche task, so its code shouldn’t go into the well-encapsulated, very widely-used code of similarity.
  2. Since compute_near_duplicates (and by extension, hash deduplication) is effectively a macro for a specific use-case of similarity, it shouldn’t change the core similarity code. From hash deduplication’s standpoint, compute similarity can be a black box and it will still work fine.
  3. compute_similarity already has all of the functionality and arguments needed to be used for hash deduplication. Why add more code when it’s not needed?

To address your points:

  1. The core functionality of compute_similarity is to create a similarity index for the user. If you could get hash computation to fit into existing embedding computation code (by abstracting the hash as a “model” of sorts), then hash deduplication would transparently fit into compute_similarity. Since the implementation of the “embedding computation” is so different, it belongs elsewhere. On top of all of this, the embedding computation is a convenience (a good one because computing embeddings is very very widely needed), but arguably not the core functionality. This is why compute_similarity allows the user to pass embeddings as a numpy array or by field (in fact, it is better to precompute embeddings and pass them to similarity from a performance standpoint, because similarity can’t make strong assumptions on the user’s setup and as such is not super optimal).
  2. This is why I asked above if you foresee users needing a similarity index with hashes. What is the use case that this serves? If this is in fact a very niche use case, it’s not worth changing very widely used, core code for.
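
To make the "precompute and pass embeddings" route concrete, it would look something like this (sketch only; compute_image_hash stands in for the hash helper from this PR, dataset is assumed to be loaded, and the metric passthrough is the one noted earlier):

import numpy as np
import fiftyone.brain as fob

# precompute the hash vectors once, outside of similarity
hashes = np.stack(
    [compute_image_hash(fp, method="dhash") for fp in dataset.values("filepath")]
)

# similarity stays a black box: it just receives precomputed "embeddings"
index = fob.compute_similarity(
    dataset,
    backend="sklearn",
    embeddings=hashes,
    metric="manhattan",  # forwarded to the sklearn backend
    brain_key="hash_dedup",
)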

Contributor

@prernadh left a comment

Just some comments about function names - otherwise looks good.

Comment on lines +178 to +181
if method is None or method in FILE_HASH_TYPES:
    filehash = fou.compute_filehash(filepath, method=method)
else:
    filehash = fbh.compute_image_hash(filepath, method=method)
Contributor

I find it a bit confusing to add image hash computation under the _compute_filehash function. I would say it's better to rename this function for clarity.

Comment on lines +191 to +194
if method is None or method in FILE_HASH_TYPES:
    filehash = fou.compute_filehash(filepath, method=method)
else:
    filehash = fbh.compute_image_hash(filepath, method=method)
Contributor

Similar comment to earlier: this function name is no longer accurate if we are not just calculating file hashes here anymore.
