[DRAFT] Perceptual Hashing Deduplication #226
Conversation
Generally looks good. I added some minor notes and one major note about the option of having this in `compute_near_duplicates`. Let me know what you think about it.
From a correctness standpoint it works great, besides `whash`, which I can't seem to get good results out of.
I made a notebook for comparing hash methods, as well as to compare against the full `compute_near_duplicates` with CLIP embeddings as a baseline.
Results:
- all methods find everything that file hashing finds
- the choice of threshold of course matters a lot
- the hashes generally get quite good performance
- from a cursory look, CLIP embeddings seem to have better P/R tradeoffs, but the hashes were able to find some duplicates that CLIP could not. Pretty interesting.
- `dhash` seems to behave the most like CLIP

EDIT: I haven't played around with thresholds too much; more performance can probably be squeezed out.
```python
import scipy

def compute_image_hash(image_path, method="phash", hash_size=8):
```
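For context, here is a minimal sketch of what such a consolidated hash function could look like. This is an illustration only, assuming the `imagehash` and `Pillow` packages; it is not necessarily how the PR actually implements `compute_image_hash`.

```python
import imagehash
import numpy as np
from PIL import Image


def compute_image_hash(image_path, method="phash", hash_size=8):
    """Computes a perceptual hash of an image as a flat binary vector."""
    image = Image.open(image_path)

    if method == "phash":
        h = imagehash.phash(image, hash_size=hash_size)
    elif method == "dhash":
        h = imagehash.dhash(image, hash_size=hash_size)
    elif method == "ahash":
        h = imagehash.average_hash(image, hash_size=hash_size)
    elif method == "whash":
        h = imagehash.whash(image, hash_size=hash_size)
    else:
        raise ValueError("Unsupported hash method '%s'" % method)

    # Flatten to a 0/1 vector so that Manhattan distance between two hashes
    # equals their Hamming distance
    return np.asarray(h.hash, dtype=np.uint8).flatten()
```

Flattening to a numeric vector is what lets the hashes be treated like embeddings downstream (e.g. with a `manhattan` metric), as discussed later in the thread.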
Not specific to this, and not necessarily important to change here, but I wanted to note: we use this pattern (consolidating many implementations into one core function) a lot. It's obviously not great, because you have to manually add each new option to the if/else. We should think about moving to registries over time.
Good point. I typically prefer registries when the set of implementations is unbounded (new model architectures, for example). I believe the set of hash functions for images is relatively bounded, so I wasn't too worried.
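For illustration, here is a minimal sketch of the registry pattern being discussed; all names are hypothetical. The point is that new hash methods register themselves instead of being added to a central if/else.

```python
# Hypothetical registry of image hash functions
_IMAGE_HASH_REGISTRY = {}


def register_image_hash(name):
    """Decorator that registers a hash function under the given name."""

    def decorator(fn):
        _IMAGE_HASH_REGISTRY[name] = fn
        return fn

    return decorator


@register_image_hash("phash")
def _phash(image_path, hash_size=8):
    ...  # compute and return the hash vector for `image_path`


def compute_image_hash(image_path, method="phash", hash_size=8):
    try:
        hash_fn = _IMAGE_HASH_REGISTRY[method]
    except KeyError:
        raise ValueError("Unknown hash method '%s'" % method)

    return hash_fn(image_path, hash_size=hash_size)
```

Either approach works here; the registry mainly pays off when new implementations need to be plugged in without touching core code.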
```diff
@@ -92,7 +96,14 @@ def compute_near_duplicates(
     return similarity_index


-def compute_exact_duplicates(samples, num_workers, skip_failures, progress):
+def compute_exact_duplicates(
```
Maybe have this in `compute_near_duplicates`? You can compute hashes into a field and pass that as the embedding field. It should give the same behavior. This serves the following:
- clarity: this still isn't exact duplicates
- this requires a threshold parameter, which is more in line with near duplicates than exact
- this code uses the `sklearn.metrics.pairwise.pairwise_distances` backend, just like the similarity index with the duplicates mixin in `compute_near_duplicates`
- near duplicates returns the underlying similarity index, allowing for easier manipulation of the data afterwards compared with the neighbors map
- having one point of contact with metric methods is useful long term when we want to work on scaling this code

What do you think?
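A rough sketch of that flow, purely illustrative: the dataset name, field name, and threshold below are made up, and it assumes the hash vectors have already been stored on each sample.

```python
import fiftyone as fo
import fiftyone.brain as fob

dataset = fo.load_dataset("my-dataset")  # hypothetical dataset name

# Hypothetical: each sample already has its hash vector stored in a "phash"
# field, e.g. via compute_image_hash(sample.filepath)
index = fob.compute_near_duplicates(
    dataset,
    embeddings="phash",  # treat the stored hash vectors as the embeddings
    threshold=0.25,      # illustrative distance threshold
)

print(index.neighbors_map)  # duplicates grouped under their "unique" sample
```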
I went back and forth for a while on whether this should be in `compute_exact_duplicates` vs `compute_near_duplicates`. I'd be curious to get @prernadh's opinion since this was a customer request, but in my opinion hash-based deduplication is closer to exact deduplication than it is to near deduplication.
Mmm, I would think this is closer to near duplicates than to exact, since we are able to use a threshold to find close samples.
```python
for i, _id in enumerate(_ids):
    nearby_indices = np.where(thresholded_distances[i, :])[0]
    duplicate_ids = [_ids[j] for j in nearby_indices]
    neighbors_map[_id] = duplicate_ids
```
This is different from the similarity index `neighbors_map`, which has keys of "unique" samples and values of the duplicates of that sample. Not an issue, but it can be misleading.
Keep in mind this counts each pair of neighbors twice, once in each direction. It may be worthwhile to only go over the upper triangle of the `thresholded_distances` matrix, i.e. `nearby_indices = np.where(thresholded_distances[i, i:])[0] + i`.
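A toy sketch of that suggestion on synthetic data, using a variant that also skips the diagonal so a sample is not listed as its own neighbor (the snippet suggested above keeps the diagonal):

```python
import numpy as np

_ids = ["a", "b", "c"]
distances = np.array(
    [
        [0.0, 0.1, 0.9],
        [0.1, 0.0, 0.8],
        [0.9, 0.8, 0.0],
    ]
)
thresholded_distances = distances <= 0.2

# Only scan the strict upper triangle, so each duplicate pair is recorded once
neighbors_map = {}
for i, _id in enumerate(_ids):
    nearby_indices = np.where(thresholded_distances[i, i + 1:])[0] + i + 1
    neighbors_map[_id] = [_ids[j] for j in nearby_indices]

print(neighbors_map)  # {'a': ['b'], 'b': [], 'c': []}
```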
Thanks for adding this in Markus. I do think this technique might fit in better as a near duplicate calculation.

@prernadh @jacobsela I have two concerns with putting this functionality into `compute_near_duplicates` […]

If you guys believe strongly it should still go into `compute_near_duplicates` […]
Left some notes about moving the additions (i.e., argument checking and computation of hashes) to `compute_near_duplicates` rather than having them in `similarity`.
Other than that, LGTM.
```diff
@@ -544,6 +544,7 @@ def compute_similarity(
     brain_key=None,
     model=None,
     model_kwargs=None,
+    hash_method=None,
```
What do you think about removing these arguments from `similarity` and having them just in `compute_near_duplicates`? Do you foresee users wanting to use other similarity functionality besides duplicates on the hashes? `compute_near_duplicates` can build the appropriate similarity index on the fly based on the arguments given.
I now see that this is effectively what you are doing in `compute_similarity`. The code looks good, but I think `compute_near_duplicates` is a better home for it. Every `compute_near_duplicates` is a `compute_similarity`, but not the other way around, so code that handles the special cases that arise for `compute_near_duplicates` should be there.
What do you mean by this? The functionality is primarily exposed through `compute_near_duplicates`, but in order to reuse the code for distance computation, some of the hashing code needs to live in `compute_similarity`.
Both the functionality and the implementation should be in `compute_near_duplicates`, with the current implementation that you have, i.e. within `compute_near_duplicates`:

- verify arguments
- compute hashes
- call `compute_similarity` as you would usually, but with modified arguments:

```python
similarity_index = fb.compute_similarity(
    samples,
    backend="sklearn",
    roi_field=roi_field,
    embeddings=hashes,  # even better, do `embeddings = hashes` earlier
    model=model,  # should be None because of argument checking
    model_kwargs=model_kwargs,  # should be None because of argument checking
    force_square=force_square,
    alpha=alpha,
    batch_size=batch_size,
    num_workers=num_workers,
    skip_failures=skip_failures,
    progress=progress,
    metric='cosine' if hash_method is None else 'manhattan',  # the sklearn similarity backend already accepts this argument
)
```

- run duplicates on the index and return it to the user, as currently happens (sketched below)
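For completeness, a hedged sketch of that last step, assuming the `find_duplicates(thresh=...)` method exposed by the brain's duplicates mixin on the returned index:

```python
def _run_duplicates(similarity_index, threshold):
    # Run duplicate detection on the index, then hand the index itself back
    # to the caller for further inspection
    similarity_index.find_duplicates(thresh=threshold)
    return similarity_index
```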
I'd argue it makes more sense to keep the current layout, for the following reasons:
- It matches exactly how `compute_near_duplicates` already works, with `compute_similarity` doing both the embedding computation and the index creation.
- If we break the above pattern, then a user who wants an index on hashes instead of embeddings has to either first call `compute_near_duplicates` or dig into the code to figure out how to compute hashes and make a separate call to `compute_similarity`. That seems like too much work from a UX perspective.
I think this discussion is getting at how we want to manage the brain from a design philosophy standpoint. This is probably something we should get together and discuss properly, because I'm sure it will come up a lot more in the future. What do you think?

Regarding the current point at hand:

First, I'd like to say that my main concern is this code not being in `similarity`, rather than it being in `compute_near_duplicates`. If you'd prefer to have a new function like `compute_perceptual_hash_duplicates` or something along those lines, that's also a good solution.

The case for moving to `compute_near_duplicates`:
- Most importantly, similarity needs to stay clean. Similarity is the contact point of all other methods to metric methods in FO. Deduplication with hashes is a niche task, so its code shouldn't go into the well-encapsulated, very widely used code of similarity.
- Since `compute_near_duplicates` (and by extension, hash deduplication) is effectively a macro for a specific use case of similarity, it shouldn't change the core similarity code. From hash deduplication's standpoint, compute similarity can be a black box and it will still work fine.
- `compute_similarity` already has all of the functionality and arguments needed for hash deduplication. Why add more code when it's not needed?

To address your points:
- The core functionality of `compute_similarity` is to create a similarity index for the user. If you could get hash computation to fit into the existing embedding computation code (by abstracting the hash as a "model" of sorts), then hash deduplication would transparently fit into `compute_similarity`. Since the implementation of the "embedding computation" is so different, it belongs elsewhere. On top of all of this, the embedding computation is a convenience (a good one, because computing embeddings is very widely needed), but arguably not the core functionality. This is why `compute_similarity` allows the user to pass embeddings as a numpy array or by field; in fact, it is better to precompute embeddings and pass them to similarity from a performance standpoint, because similarity can't make strong assumptions about the user's setup and as such is not super optimal (see the sketch after this list).
- This is why I asked above if you foresee users needing a similarity index with hashes. What is the use case that this serves? If this is in fact a very niche use case, it's not worth changing very widely used, core code for it.
Just some comments about function names - otherwise looks good.
```python
if method is None or method in FILE_HASH_TYPES:
    filehash = fou.compute_filehash(filepath, method=method)
else:
    filehash = fbh.compute_image_hash(filepath, method=method)
```
I find it a bit confusing to add image hash computation under the `_compute_filehash` function. I would say it's better to rename this function for clarity.
```python
if method is None or method in FILE_HASH_TYPES:
    filehash = fou.compute_filehash(filepath, method=method)
else:
    filehash = fbh.compute_image_hash(filepath, method=method)
```
Similar comment to earlier: this function name is no longer accurate if we are not just calculating file hashes here anymore.
This PR adds perceptual hashing-based deduplication on top of the existing file hashing and embedding-based deduplication methods. Perceptual hashes are much faster to compute than embeddings and more robust than simple file hashes, so this serves as a good in-between method for finding duplicate images. The following hashes are implemented:

The API is implemented as an extension of `fob.compute_exact_duplicates`.
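A heavily hedged usage sketch based on this description: the exact keyword for selecting the hash method on `compute_exact_duplicates` is not shown in the excerpts above, so `hash_method` below is an assumption borrowed from the `compute_similarity` diff earlier in the thread.

```python
import fiftyone.zoo as foz
import fiftyone.brain as fob

dataset = foz.load_zoo_dataset("quickstart")  # any image dataset

# Perceptual-hash deduplication via the extended API; `hash_method` is an
# assumed keyword name, not confirmed by the PR excerpts
duplicates_map = fob.compute_exact_duplicates(dataset, hash_method="phash")
print(duplicates_map)
```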