Semi-metric dissimilarities in gclust and differing branch lengths #81

Open
@mike-kratz

Hi there!

I work with ecological data, in particular microbial ecology. We cannot use Euclidean distance to compare communities (whether in cluster analysis, PCoA, or NMDS) because it performs poorly when datasets contain many zeroes, which is almost always the case with microbial sequencing data. We tend to use Bray-Curtis dissimilarity (also known as percentage difference), which is semi-metric: it does not obey the triangle inequality. Does genieclust work with this type of dissimilarity matrix?
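
To make the semi-metric point concrete, here is a minimal sketch in plain Python. The helper `bray_curtis` and the three toy count vectors are made up for illustration and are not taken from my data:

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator


# Hypothetical abundance counts for two taxa in three samples.
sample_a = [1, 0]
sample_b = [0, 1]
sample_c = [1, 1]

d_ab = bray_curtis(sample_a, sample_b)  # direct route a -> b
d_ac = bray_curtis(sample_a, sample_c)  # detour a -> c
d_cb = bray_curtis(sample_c, sample_b)  # detour c -> b

# A metric would guarantee d_ab <= d_ac + d_cb; here that fails.
violates_triangle_inequality = d_ab > d_ac + d_cb
print(d_ab, d_ac + d_cb, violates_triangle_inequality)
```

Samples a and b share no taxa, so they are maximally dissimilar, yet each is fairly close to c, so the detour through c is shorter than the direct route.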

Also, when I ran genieclust on my environmental data (for which Euclidean distance is fine, since there are no double-zeroes), the branch heights were very different from the original pairwise Euclidean distances in the dissimilarity matrix. That is, the dendrogram showed groups as closer together than the input matrix does, whereas hierarchical clustering with "average" linkage reproduced the original values more closely. See below:

genieclust dendrogram
image

Standard hierarchical clustering with average linkage
image

Snapshot of the original Euclidean dissimilarity matrix (notice that most pairwise dissimilarities are greater than 1, but the genieclust dendrogram shows most of the branch lengths at around 1)
image
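
For what it's worth, here is a toy sketch of the kind of gap I mean. It assumes (I have not confirmed this for genieclust) that a dendrogram's merge height can be the smallest between-group distance, as in single linkage, rather than the mean between-group distance, as in average linkage. The points and variable names are invented for illustration:

```python
from itertools import product
from math import dist

# Two hypothetical groups of 2-D points about to be merged.
group_a = [(0.0, 0.0), (0.0, 1.0)]
group_b = [(2.0, 0.0), (2.0, 3.0)]

# All Euclidean distances between a point in group_a and a point in group_b.
pairwise = [dist(p, q) for p, q in product(group_a, group_b)]

# Merge height under a nearest-neighbour (single-linkage style) rule.
single_height = min(pairwise)

# Merge height under average linkage: the mean between-group distance.
average_height = sum(pairwise) / len(pairwise)

print(single_height, average_height)
```

If the heights are computed the first way, the dendrogram would sit lower than most entries of the pairwise matrix even though the matrix itself is unchanged.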

Thank you for your help,

Mike
