
Clustering high-dimensional data such as language-model embeddings poses unique challenges. Unlike K-Means, which requires you to specify the number of clusters up front, this repository uses the DBSCAN algorithm: a more adaptable approach, especially useful for large and complex datasets.

What Does the Script Do?

This collection of scripts:

  1. Loads the CSV file containing high-dimensional vectors.
  2. Utilizes the DBSCAN algorithm to cluster entities based on their similarity without the need for predefining a specific number of clusters.
  3. Identifies and separates outliers or noise from the main clusters to prevent the merging of dissimilar entities.
  4. Exports the results with assigned cluster IDs for each entity to an Excel file for convenient review and analysis.
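
The steps above can be sketched in a few lines. This is a minimal illustration, not the repository's actual script: the synthetic CSV, the eps value, and the one-vector-per-row column layout are all assumptions made for the demo.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# For a self-contained demo, write a small synthetic embeddings CSV first;
# in practice input_csv_path would point at your own file.
rng = np.random.default_rng(0)
a = rng.normal(0, 0.01, size=(10, 8)) + 1.0          # one tight cluster
b = rng.normal(0, 0.01, size=(10, 8)) - 1.0          # a second cluster
outlier = np.tile([1.0, -1.0], 4).reshape(1, 8)      # orthogonal to both
pd.DataFrame(np.vstack([a, b, outlier])).to_csv("embeddings.csv", index=False)

input_csv_path = "embeddings.csv"
df = pd.read_csv(input_csv_path)
vectors = df.to_numpy()   # assumes every column is one vector dimension

# Cosine distance suits language-model embeddings; with metric="cosine",
# eps is a distance threshold of (1 - cosine similarity).
labels = DBSCAN(eps=0.2, min_samples=3, metric="cosine").fit_predict(vectors)

df["cluster_id"] = labels                   # -1 marks noise/outliers
df.to_excel("clusters.xlsx", index=False)   # Excel export needs openpyxl
```

Points that DBSCAN cannot attach to any dense region receive the label -1, which is how the noise separation in step 3 falls out of the algorithm for free.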

Main Features

  • Automated Cluster Detection: Finds the natural number of clusters in the data without any preset conditions.
  • Customization: Adjustable similarity thresholds and minimum sample sizes to fine-tune the clustering process.
  • High-Dimensional Data Handling: Developed to work with embeddings from models such as OpenAI's text-embedding-3-large.
  • Noise and Outlier Management: Isolates less similar vectors effectively, maintaining cleaner and more meaningful clusters.
  • Silhouette Score Assessment: Provides an option to measure the clustering quality using the silhouette score, where feasible.
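
The silhouette assessment can be sketched as follows; the data and parameters here are illustrative. Note the "where feasible" caveat: the score is only defined when at least two clusters form, and noise points (label -1) are typically excluded before scoring.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs stand in for real embeddings.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0, 0.05, size=(20, 4)) + 1.0,
    rng.normal(0, 0.05, size=(20, 4)) - 1.0,
])
labels = DBSCAN(eps=0.3, min_samples=3, metric="cosine").fit_predict(X)

# Silhouette requires 2+ clusters; drop noise points before scoring.
mask = labels != -1
if len(set(labels[mask])) >= 2:
    score = silhouette_score(X[mask], labels[mask], metric="cosine")
    print(f"silhouette: {score:.3f}")
```

Scores close to 1 indicate tight, well-separated clusters; values near 0 suggest the chosen eps is merging or splitting natural groups.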

How to Use the Main Script

  1. Install the necessary Python packages: pandas, numpy, scikit-learn, and openpyxl.
  2. Set input_csv_path to the path of the CSV file containing your embeddings.
  3. Run the script. It will automatically perform clustering, identify the noise, and save the results.

sweet_spot_finder.py

The sweet_spot_finder.py script assists in finding the optimal DBSCAN parameters by testing different combinations of similarity thresholds and minimum samples. It runs multiple iterations of the clustering process in parallel and reports the number of clusters formed for each configuration. This helps in identifying the "sweet spot," where the clustering logic best aligns with the natural structure of the data.
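
A parameter sweep of this kind can be sketched with joblib (a scikit-learn dependency) handling the parallel runs. The data, the grids, and the count_clusters helper below are illustrative assumptions, not the script's actual values; only the variable names similarity_thresholds and min_samples_values come from the source.

```python
from itertools import product

import numpy as np
from joblib import Parallel, delayed
from sklearn.cluster import DBSCAN

# Illustrative data: two tight, well-separated blobs.
rng = np.random.default_rng(2)
X = np.vstack([
    rng.normal(0, 0.05, size=(15, 6)) + 1.0,
    rng.normal(0, 0.05, size=(15, 6)) - 1.0,
])

similarity_thresholds = [0.90, 0.95, 0.99]   # cosine similarity cut-offs
min_samples_values = [2, 3, 5]

def count_clusters(similarity, min_samples):
    # DBSCAN takes a distance, so convert similarity -> cosine distance.
    labels = DBSCAN(eps=1 - similarity, min_samples=min_samples,
                    metric="cosine").fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    return similarity, min_samples, n_clusters

# Run every configuration in parallel and report the cluster counts.
results = Parallel(n_jobs=-1)(
    delayed(count_clusters)(s, m)
    for s, m in product(similarity_thresholds, min_samples_values)
)
for similarity, min_samples, n_clusters in results:
    print(f"similarity={similarity:.2f} min_samples={min_samples} "
          f"-> {n_clusters} clusters")
```

The "sweet spot" is typically the region of the grid where the cluster count stabilizes: configurations on either side fragment the data into many tiny clusters or collapse it into one.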

How to Use sweet_spot_finder.py

  1. Set the input CSV file path by changing input_csv_path in the script.
  2. Review and adjust the ranges for similarity_thresholds and min_samples_values to fit your dataset and clustering goals.
  3. Execute the script. The output will display different configurations and their corresponding number of clusters, aiding you in selecting the best parameters for DBSCAN.

classify_hdbscan.py

This script adds HDBSCAN clustering with PCA dimensionality reduction, which helps in:

  • Reducing Computation Time: By trimming down the dimensions, PCA speeds up the clustering process while still retaining the essential characteristics of the data.
  • Enhancing Clustering Performance: With fewer dimensions, clustering algorithms like HDBSCAN can perform more efficiently and potentially yield more meaningful clustering results.
  • Facilitating Data Visualization: Lower-dimensional data can be plotted and visualized, aiding in a more intuitive understanding and analysis of the clusters formed.

To take advantage of PCA in your clustering workflow, set the desired number of components in the N_COMPONENTS constant and run the script as usual. This gives you direct control over the trade-off between dimensionality and retained information, keeping the exploration of high-dimensional data both manageable and insightful.