
[DRAFT] Perceptual Hashing Deduplication #226


Open · wants to merge 6 commits into base: develop
10 changes: 9 additions & 1 deletion fiftyone/brain/__init__.py
@@ -4,7 +4,7 @@

See https://github.com/voxel51/fiftyone for more information.

| Copyright 2017-2024, Voxel51, Inc.
| Copyright 2017-2025, Voxel51, Inc.
| `voxel51.com <https://voxel51.com/>`_
|
"""
@@ -544,6 +544,7 @@ def compute_similarity(
brain_key=None,
model=None,
model_kwargs=None,
hash_method=None,
Contributor:
What do you think about removing these arguments from similarity and having them just in compute_near_duplicates? Do you foresee users wanting to use other similarity functionality besides duplicates on the hashes?

compute_near_duplicates can build the appropriate similarity index on the fly based on the arguments given.

Contributor:
I now see that this is effectively what you are doing in compute_similarity. The code looks good, but I think compute_near_duplicates is a better home for it. Every compute_near_duplicates is a compute_similarity, but not the other way around, so code that handles the special cases that arise for compute_near_duplicates should be there.

Author:
What do you mean by this? The functionality is primarily exposed through compute_near_duplicates, but in order to reuse the distance-computation code, some of the hashing code needs to live in compute_similarity.

Contributor:
Both the functionality and the implementation should be in compute_near_duplicates, with the current implementation that you have, i.e. within compute_near_duplicates:

  1. verify arguments
  2. compute hashes
  3. call compute_similarity as you would usually, but with modified arguments
similarity_index = fb.compute_similarity(
    samples,
    backend="sklearn",
    roi_field=roi_field,
    embeddings=hashes,  # even better, do `embeddings = hashes` earlier
    model=model,  # should be None because of argument checking
    model_kwargs=model_kwargs,  # should be None because of argument checking
    force_square=force_square,
    alpha=alpha,
    batch_size=batch_size,
    num_workers=num_workers,
    skip_failures=skip_failures,
    progress=progress,
    metric='cosine' if hash_method is None else 'manhattan',  # sklearn similarity already accepts this argument
)
  4. Run duplicates on the index and return it to the user, as currently happens

Author:
I'd argue it makes more sense to keep the current layout for the following reasons.

  1. It matches exactly how compute_near_duplicates already works, with compute_similarity doing both the embedding computation and the index creation.
  2. If we break the above pattern, then a user who wants an index on hashes instead of embeddings has to either call compute_near_duplicates first or dig into the code to figure out how to compute hashes and then call compute_similarity elsewhere. That seems like too much work from a UX perspective.

Contributor:
I think this discussion is getting at how we want to manage the brain from a design philosophy standpoint. This is probably something we should get together and discuss properly, because I’m sure it will come up a lot more in the future. What do you think?

Regarding the current point at hand:

First I’d like to say that my main concern is this code not being in similarity, rather than it being in compute_near_duplicates. If you’d prefer to have a new function like compute_perceptual_hash_duplicates or something along those lines, that’s also a good solution.

The case for moving to compute_near_duplicates:

  1. Most importantly, similarity needs to stay clean. Similarity is the contact point of all other methods to metric methods in FO. Deduplication with hashes is a niche task, so its code shouldn’t go into the well-encapsulated, very widely used code of similarity.
  2. Since compute_near_duplicates (and by extension, hash deduplication) is effectively a macro for a specific use case of similarity, it shouldn’t change the core similarity code. From hash deduplication’s standpoint, compute_similarity can be a black box and it will still work fine.
  3. compute_similarity already has all of the functionality and arguments needed for hash deduplication. Why add more code when it’s not needed?

To address your points:

  1. The core functionality of compute_similarity is to create a similarity index for the user. If you could get hash computation to fit into the existing embedding computation code (by abstracting the hash as a “model” of sorts), then hash deduplication would transparently fit into compute_similarity. Since the implementation of the “embedding computation” is so different, it belongs elsewhere. On top of all of this, the embedding computation is a convenience (a good one, because computing embeddings is very widely needed), but arguably not the core functionality. This is why compute_similarity allows the user to pass embeddings as a numpy array or by field (in fact, it is better to precompute embeddings and pass them to similarity from a performance standpoint, because similarity can’t make strong assumptions about the user’s setup and as such is not super optimal).
  2. This is why I asked above whether you foresee users needing a similarity index with hashes. What is the use case that this serves? If this is in fact a very niche use case, it’s not worth changing very widely used, core code for it.

force_square=False,
alpha=None,
batch_size=None,
@@ -631,6 +632,8 @@ def compute_similarity(
must expose embeddings (``model.has_embeddings = True``)
model_kwargs (None): a dictionary of optional keyword arguments to pass
to the model's ``Config`` when a model name is provided
hash_method (None): the perceptual hashing method to use in place of
embeddings. The supported values are ``["dhash", "phash", "ahash"]``
force_square (False): whether to minimally manipulate the patch
bounding boxes into squares prior to extraction. Only applicable
when a ``model`` and ``patches_field``/``roi_field`` are specified
@@ -672,6 +675,7 @@ def compute_similarity(
brain_key,
model,
model_kwargs,
hash_method,
force_square,
alpha,
batch_size,
@@ -691,6 +695,7 @@ def compute_near_duplicates(
similarity_index=None,
model=None,
model_kwargs=None,
hash_method=None,
force_square=False,
alpha=None,
batch_size=None,
@@ -745,6 +750,8 @@ def compute_near_duplicates(
(``model.has_embeddings = True``)
model_kwargs (None): a dictionary of optional keyword arguments to pass
to the model's ``Config`` when a model name is provided
hash_method (None): the perceptual hashing method to use in place of
embeddings. The supported values are ``["dhash", "phash", "ahash"]``
force_square (False): whether to minimally manipulate the patch
bounding boxes into squares prior to extraction. Only applicable
when a ``model`` and ``roi_field`` are specified
@@ -779,6 +786,7 @@ def compute_near_duplicates(
similarity_index=similarity_index,
model=model,
model_kwargs=model_kwargs,
hash_method=hash_method,
force_square=force_square,
alpha=alpha,
batch_size=batch_size,
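For context, a minimal usage sketch of the new hash_method argument, assuming the standard near-duplicates workflow from the FiftyOne Brain docs (the quickstart dataset and the printed property are illustrative only):

    import fiftyone.brain as fob
    import fiftyone.zoo as foz

    dataset = foz.load_zoo_dataset("quickstart")

    # Build a near-duplicate index from perceptual hashes instead of deep embeddings
    index = fob.compute_near_duplicates(dataset, hash_method="phash")

    # Inspect the near-duplicate samples found by the index
    print(index.duplicate_ids)

Internally this routes through compute_similarity with the sklearn backend and a Manhattan metric, as discussed in the thread above.
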
19 changes: 17 additions & 2 deletions fiftyone/brain/internal/core/duplicates.py
@@ -19,11 +19,16 @@
import fiftyone.brain as fb
import fiftyone.brain.similarity as fbs
import fiftyone.brain.internal.core.utils as fbu
import fiftyone.brain.internal.core.perceptual_hash as fbh

from sklearn.metrics.pairwise import pairwise_distances
import numpy as np


logger = logging.getLogger(__name__)

_DEFAULT_MODEL = "resnet18-imagenet-torch"
FILE_HASH_TYPES = ["md5", "sha1", "sha256", "sha512"]


def compute_near_duplicates(
@@ -34,6 +39,7 @@ def compute_near_duplicates(
similarity_index=None,
model=None,
model_kwargs=None,
hash_method=None,
force_square=False,
alpha=None,
batch_size=None,
@@ -62,6 +68,7 @@
model is None
and embeddings is None
and similarity_index is None
and hash_method is None
and not embeddings_exist
):
model = _DEFAULT_MODEL
@@ -74,6 +81,7 @@
embeddings=embeddings_field or embeddings,
model=model,
model_kwargs=model_kwargs,
hash_method=hash_method,
force_square=force_square,
alpha=alpha,
batch_size=batch_size,
@@ -139,6 +147,7 @@ def compute_exact_duplicates(samples, num_workers, skip_failures, progress):

def _compute_filehashes(samples, method, progress):
ids, filepaths = samples.values(["id", "filepath"])
# I need embeddings, sample_ids, label_ids

with fou.ProgressBar(total=len(ids), progress=progress) as pb:
return {
@@ -166,7 +175,10 @@ def _compute_filehashes_multi(samples, method, num_workers, progress):

def _compute_filehash(filepath, method):
try:
filehash = fou.compute_filehash(filepath, method=method)
if method is None or method in FILE_HASH_TYPES:
filehash = fou.compute_filehash(filepath, method=method)
else:
filehash = fbh.compute_image_hash(filepath, method=method)
Comment on lines +178 to +181
Contributor:
I find it a bit confusing to add image hash computation under the _compute_filehash function. I would say it's better to rename this function for clarity.

except:
filehash = None

@@ -176,7 +188,10 @@ def _do_compute_filehash(args):
def _do_compute_filehash(args):
_id, filepath, method = args
try:
filehash = fou.compute_filehash(filepath, method=method)
if method is None or method in FILE_HASH_TYPES:
filehash = fou.compute_filehash(filepath, method=method)
else:
filehash = fbh.compute_image_hash(filepath, method=method)
Comment on lines +191 to +194
Contributor:
Similar comment to earlier - this function name is no longer accurate if we are not just calculating file hashes here anymore.

except:
filehash = None

122 changes: 122 additions & 0 deletions fiftyone/brain/internal/core/perceptual_hash.py
@@ -0,0 +1,122 @@
"""
Image hashing methods.

| Copyright 2017-2024, Voxel51, Inc.
| `voxel51.com <https://voxel51.com/>`_
|
"""

import numpy as np
import eta.core.image as etai
import scipy


def compute_image_hash(image_path, method="phash", hash_size=8):
Contributor:
Not specific to this, not necessarily important to change here but wanted to note: we use this pattern (consolidate many implementations into 1 core function) a lot. It's not great obviously because you have to manually add each new option to the if-else. We should think about moving to registries over time.

Author:
Good point. I typically prefer registries when the set of implementations is unbounded (new model architectures, for example). I believe the set of hash functions for images is relatively bounded, so I wasn't too worried.

"""
Computes a hash of the input image.

Args:
image_path: Input image path.
method: The hashing method to use. Supported values are
"ahash", "phash", and "dhash".
hash_size: Size of the hash (default is 8x8).

Returns:
A 1D NumPy array representing the hash.
"""
image = etai.read(image_path)
if method == "ahash":
return ahash(image, hash_size=hash_size)
elif method == "phash":
return phash(image, hash_size=hash_size)
elif method == "dhash":
return dhash(image, hash_size=hash_size)
else:
raise ValueError("Unsupported hashing method '%s'" % method)


def ahash(image, hash_size=8):
"""
Computes the average hash (aHash) of an image.

Args:
image: Input image as a NumPy array.
hash_size: Size of the hash (default is 8x8).

Returns:
A 1D NumPy array representing the hash.
"""
# Step 1: Convert to grayscale
gray = etai.rgb_to_gray(image)

# Step 2: Resize to hash_size x hash_size
resized = etai.resize(gray, hash_size, hash_size)

# Step 3: Compute the mean pixel value
mean = resized.mean()

# Step 4: Create the binary hash
binary_hash = (resized >= mean).astype(np.uint8)

# Step 5: Flatten the hash to 1D
flat_hash = binary_hash.flatten()

return flat_hash


def phash(image, hash_size=8):
"""
Computes the perceptual hash (pHash) of an image.

Args:
image: Input image as a NumPy array.
hash_size: Size of the hash (default is 8x8).

Returns:
A 1D NumPy array representing the hash.
"""
# Step 1: Convert to grayscale
gray = etai.rgb_to_gray(image)

# Step 2: Resize to hash_size x hash_size
resized = etai.resize(gray, hash_size, hash_size)

# Step 3: Compute the Discrete Cosine Transform (DCT)
dct = scipy.fft.dct(resized, norm="ortho")

# Step 4: Extract the top-left hash_size x hash_size values
dct = dct[:hash_size, :hash_size]

# Step 5: Compute the median of the top-left values
median = np.median(dct)

# Step 6: Create the binary hash
binary_hash = (dct >= median).astype(np.uint8)

# Step 7: Flatten the hash to 1D
flat_hash = binary_hash.flatten()

return flat_hash


def dhash(image, hash_size=8):
"""
Computes the difference hash (dHash) of an image.

Args:
    image: Input image as a NumPy array.
    hash_size: Size of the hash (default is 8x8).

Returns:
    A 1D NumPy array representing the hash.
"""
# Convert the image to grayscale
gray = etai.rgb_to_gray(image)

# Resize the image to (hash_size + 1, hash_size)
resized = etai.resize(gray, hash_size + 1, hash_size)

# Compute the differences between adjacent pixels
diff = resized[:, 1:] > resized[:, :-1]

# Convert the difference image to a binary array
binary_array = diff.flatten().astype(int)

return binary_array
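
As a quick illustration of how these helpers behave, a hedged sketch comparing two hashes directly (the image paths are placeholders, and the Hamming computation is just for illustration; it is not part of this module):

    import numpy as np

    import fiftyone.brain.internal.core.perceptual_hash as fbh

    # With the default hash_size=8, each hash is a 64-element binary vector
    hash_a = fbh.compute_image_hash("/path/to/image_a.jpg", method="phash")
    hash_b = fbh.compute_image_hash("/path/to/image_b.jpg", method="phash")

    # Number of differing bits (Hamming distance); smaller means more visually similar
    print(int(np.sum(hash_a != hash_b)))

In the PR these vectors are cast to float64 and handed to the sklearn similarity backend.
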
70 changes: 52 additions & 18 deletions fiftyone/brain/similarity.py
@@ -21,10 +21,12 @@
import fiftyone.core.fields as fof
import fiftyone.core.labels as fol
import fiftyone.core.patches as fop
import fiftyone.core.media as fomm
import fiftyone.core.stages as fos
import fiftyone.core.utils as fou
import fiftyone.core.validation as fov
import fiftyone.zoo as foz
import fiftyone.brain.internal.core.duplicates as fbd
from fiftyone import ViewField as F

fbu = fou.lazy_import("fiftyone.brain.internal.core.utils")
@@ -51,6 +53,7 @@ def compute_similarity(
brain_key,
model,
model_kwargs,
hash_method,
force_square,
alpha,
batch_size,
@@ -69,6 +72,17 @@
samples, roi_field, _ALLOWED_ROI_FIELD_TYPES
)

if hash_method is not None and backend != "sklearn":
raise ValueError(
"The `hash_method` parameter is only supported by the 'sklearn' "
"backend"
)

if hash_method is not None and samples.media_type != fomm.IMAGE:
raise ValueError(
"The `hash_method` parameter is only supported for image datasets"
)

# Allow for `embeddings_field=XXX` and `embeddings=False` together
embeddings_field = kwargs.pop("embeddings_field", None)
if embeddings_field is not None or etau.is_str(embeddings):
@@ -86,9 +100,10 @@
embeddings_exist = None

if model is None and embeddings is None and not embeddings_exist:
model = _DEFAULT_MODEL
if batch_size is None:
batch_size = _DEFAULT_BATCH_SIZE
if hash_method is None:
model = _DEFAULT_MODEL
if batch_size is None:
batch_size = _DEFAULT_BATCH_SIZE

if etau.is_str(model):
_model_kwargs = model_kwargs or {}
@@ -101,13 +116,18 @@
_model = model
supports_prompts = None

metric = "cosine"
if hash_method is not None:
metric = "manhattan"

config = _parse_config(
backend,
embeddings_field=embeddings_field,
patches_field=patches_field,
roi_field=roi_field,
model=model,
model_kwargs=model_kwargs,
metric=metric,
supports_prompts=supports_prompts,
**kwargs,
)
@@ -139,21 +159,35 @@
handle_missing = "skip"
agg_fcn = None

embeddings, sample_ids, label_ids = fbu.get_embeddings(
samples,
model=_model,
patches_field=patches_field or roi_field,
embeddings=embeddings,
embeddings_field=embeddings_field,
force_square=force_square,
alpha=alpha,
handle_missing=handle_missing,
agg_fcn=agg_fcn,
batch_size=batch_size,
num_workers=num_workers,
skip_failures=skip_failures,
progress=progress,
)
if model is not None:
embeddings, sample_ids, label_ids = fbu.get_embeddings(
samples,
model=_model,
patches_field=patches_field or roi_field,
embeddings=embeddings,
embeddings_field=embeddings_field,
force_square=force_square,
alpha=alpha,
handle_missing=handle_missing,
agg_fcn=agg_fcn,
batch_size=batch_size,
num_workers=num_workers,
skip_failures=skip_failures,
progress=progress,
)
else:
assert hash_method is not None
hashes = fbd._compute_filehashes(samples, hash_method, progress)
sample_ids, label_ids = fbu.get_ids(
samples,
patches_field=patches_field or roi_field,
data=hashes,
data_type="hash",
handle_missing=handle_missing,
ref_sample_ids=None,
)
embeddings = np.asarray(list(hashes.values())).astype(np.float64)

else:
embeddings = None

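One note on the metric switch in compute_similarity above: because the hashes are 0/1 vectors cast to float, the Manhattan (L1) distance handed to the sklearn backend is exactly the Hamming distance between hashes. A small standalone sanity check (not part of the PR):

    import numpy as np
    from sklearn.metrics.pairwise import pairwise_distances

    a = np.array([[0, 1, 1, 0, 1, 0, 0, 1]], dtype=np.float64)
    b = np.array([[0, 1, 0, 0, 1, 1, 0, 1]], dtype=np.float64)

    l1 = pairwise_distances(a, b, metric="manhattan")[0, 0]  # 2.0
    hamming = int(np.sum(a != b))  # 2 differing bits

    print(l1, hamming)  # L1 on binary vectors == Hamming distance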