-
Notifications
You must be signed in to change notification settings - Fork 517
Open
Description
Hi, I encountered numerical overflow issues with the current implementation of the all_points_core_distance
function, specifically when raising 1 / d_ij
to a large power d
. For large d
values or very small distances, this can easily lead to extremely large numbers and subsequent overflow. In #197 a comment already mentioned this problem.
I used logs to keep the values at a manageable scale and then exponentiate the final result. Here’s the original code snippet:
def all_points_core_distance(distance_matrix, d=2.0):
"""
Compute the all-points-core-distance for all the points of a cluster.
Parameters
----------
distance_matrix : array (cluster_size, cluster_size)
The pairwise distance matrix between points in the cluster.
d : integer
The dimension of the data set, which is used in the computation
of the all-point-core-distance as per the paper.
Returns
-------
core_distances : array (cluster_size,)
The all-points-core-distance of each point in the cluster
References
----------
Moulavi, D., Jaskowiak, P.A., Campello, R.J., Zimek, A. and Sander, J.,
2014. Density-Based Clustering Validation. In SDM (pp. 839-847).
#"""
distance_matrix[distance_matrix != 0] = (1.0 / distance_matrix[
distance_matrix != 0]) ** d
result = distance_matrix.sum(axis=1)
result /= distance_matrix.shape[0] - 1
if result.sum() == 0:
result = np.zeros(len(distance_matrix))
else:
result **= (-1.0 / d)
return result
Below is a modified version that uses logarithms to avoid numerical overflow. I tested this on several examples and it produces results identical to the original method:
import numpy as np
from scipy.special import logsumexp
def all_points_core_distance(distance_matrix, d=2.0):
"""
Compute the all-points-core-distance for all the points of a cluster.
Parameters
----------
distance_matrix : array (cluster_size, cluster_size)
The pairwise distance matrix between points in the cluster.
d : integer
The dimension of the data set, which is used in the computation
of the all-point-core-distance as per the paper.
Returns
-------
core_distances : array (cluster_size,)
The all-points-core-distance of each point in the cluster
References
----------
Moulavi, D., Jaskowiak, P.A., Campello, R.J., Zimek, A. and Sander, J.,
2014. Density-Based Clustering Validation. In SDM (pp. 839-847).
"""
N = distance_matrix.shape[0]
dists = distance_matrix.copy()
dists[dists == 0] = np.nan
s_ij = -d * np.log(dists)
np.fill_diagonal(s_ij, -np.inf)
log_S_i = logsumexp(s_ij, axis=1)
log_m_i = log_S_i - np.log(N - 1)
log_apcd_i = - (1.0 / d) * log_m_i
apcd_i = np.exp(log_apcd_i)
apcd_i[np.isinf(apcd_i)] = 0
apcd_i[np.isnan(apcd_i)] = 0
return apcd_i
Activity