Validity index computation overflows in all-points-core-distance calculation

Hi, I encountered numerical overflow issues with the current implementation of the `all_points_core_distance` function, specifically when raising `1 / d_ij` to a large power `d`. For large `d` values or very small distances, this can easily lead to extremely large numbers and subsequent overflow. In #197 a [comment](https://github.com/scikit-learn-contrib/hdbscan/issues/197#issuecomment-430569865) already mentioned this problem. 

I used logs to keep the values at a manageable scale and then exponentiate the final result. Here’s the original code snippet:

```python
def all_points_core_distance(distance_matrix, d=2.0):
    """
    Compute the all-points-core-distance for all the points of a cluster.

    Parameters
    ----------
    distance_matrix : array (cluster_size, cluster_size)
        The pairwise distance matrix between points in the cluster.

    d : integer
        The dimension of the data set, which is used in the computation
        of the all-point-core-distance as per the paper.

    Returns
    -------
    core_distances : array (cluster_size,)
        The all-points-core-distance of each point in the cluster

    References
    ----------
    Moulavi, D., Jaskowiak, P.A., Campello, R.J., Zimek, A. and Sander, J.,
    2014. Density-Based Clustering Validation. In SDM (pp. 839-847).
     #"""
    distance_matrix[distance_matrix != 0] = (1.0 / distance_matrix[
        distance_matrix != 0]) ** d
    result = distance_matrix.sum(axis=1)
    result /= distance_matrix.shape[0] - 1

    if result.sum() == 0:
        result = np.zeros(len(distance_matrix))
    else:
        result **= (-1.0 / d)

    return result
```

Below is a modified version that uses logarithms to avoid numerical overflow. I tested this on several examples and it produces results identical to the original method:

```python
import numpy as np
from scipy.special import logsumexp

def all_points_core_distance(distance_matrix, d=2.0):
    """
    Compute the all-points-core-distance for all the points of a cluster.

    Parameters
    ----------
    distance_matrix : array (cluster_size, cluster_size)
        The pairwise distance matrix between points in the cluster.

    d : integer
        The dimension of the data set, which is used in the computation
        of the all-point-core-distance as per the paper.

    Returns
    -------
    core_distances : array (cluster_size,)
        The all-points-core-distance of each point in the cluster

    References
    ----------
    Moulavi, D., Jaskowiak, P.A., Campello, R.J., Zimek, A. and Sander, J.,
    2014. Density-Based Clustering Validation. In SDM (pp. 839-847).
    """
    N = distance_matrix.shape[0]
    dists = distance_matrix.copy()
    dists[dists == 0] = np.nan
    s_ij = -d * np.log(dists)
    np.fill_diagonal(s_ij, -np.inf)

    log_S_i = logsumexp(s_ij, axis=1)
    log_m_i = log_S_i - np.log(N - 1)
    log_apcd_i = - (1.0 / d) * log_m_i
    apcd_i = np.exp(log_apcd_i)

    apcd_i[np.isinf(apcd_i)] = 0
    apcd_i[np.isnan(apcd_i)] = 0
    return apcd_i
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Validity index computation overflows in all-points-core-distance calculation #665

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

Validity index computation overflows in all-points-core-distance calculation #665

Description

Activity

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

Issue actions