Skip to content

Validity index computation overflows in all-points-core-distance calculation #665

@bapierro

Description

@bapierro

Hi, I encountered numerical overflow issues with the current implementation of the all_points_core_distance function, specifically when raising 1 / d_ij to a large power d. For large d values or very small distances, this can easily lead to extremely large numbers and subsequent overflow. In #197 a comment already mentioned this problem.

I used logs to keep the values at a manageable scale and then exponentiate the final result. Here’s the original code snippet:

def all_points_core_distance(distance_matrix, d=2.0):
    """
    Compute the all-points-core-distance for all the points of a cluster.

    Parameters
    ----------
    distance_matrix : array (cluster_size, cluster_size)
        The pairwise distance matrix between points in the cluster.

    d : integer
        The dimension of the data set, which is used in the computation
        of the all-point-core-distance as per the paper.

    Returns
    -------
    core_distances : array (cluster_size,)
        The all-points-core-distance of each point in the cluster

    References
    ----------
    Moulavi, D., Jaskowiak, P.A., Campello, R.J., Zimek, A. and Sander, J.,
    2014. Density-Based Clustering Validation. In SDM (pp. 839-847).
     #"""
    distance_matrix[distance_matrix != 0] = (1.0 / distance_matrix[
        distance_matrix != 0]) ** d
    result = distance_matrix.sum(axis=1)
    result /= distance_matrix.shape[0] - 1

    if result.sum() == 0:
        result = np.zeros(len(distance_matrix))
    else:
        result **= (-1.0 / d)

    return result

Below is a modified version that uses logarithms to avoid numerical overflow. I tested this on several examples and it produces results identical to the original method:

import numpy as np
from scipy.special import logsumexp

def all_points_core_distance(distance_matrix, d=2.0):
    """
    Compute the all-points-core-distance for all the points of a cluster.

    Parameters
    ----------
    distance_matrix : array (cluster_size, cluster_size)
        The pairwise distance matrix between points in the cluster.

    d : integer
        The dimension of the data set, which is used in the computation
        of the all-point-core-distance as per the paper.

    Returns
    -------
    core_distances : array (cluster_size,)
        The all-points-core-distance of each point in the cluster

    References
    ----------
    Moulavi, D., Jaskowiak, P.A., Campello, R.J., Zimek, A. and Sander, J.,
    2014. Density-Based Clustering Validation. In SDM (pp. 839-847).
    """
    N = distance_matrix.shape[0]
    dists = distance_matrix.copy()
    dists[dists == 0] = np.nan
    s_ij = -d * np.log(dists)
    np.fill_diagonal(s_ij, -np.inf)

    log_S_i = logsumexp(s_ij, axis=1)
    log_m_i = log_S_i - np.log(N - 1)
    log_apcd_i = - (1.0 / d) * log_m_i
    apcd_i = np.exp(log_apcd_i)

    apcd_i[np.isinf(apcd_i)] = 0
    apcd_i[np.isnan(apcd_i)] = 0
    return apcd_i

Activity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @bapierro

        Issue actions

          Validity index computation overflows in all-points-core-distance calculation · Issue #665 · scikit-learn-contrib/hdbscan