(sec:clustbench-usage)=
# Using *clustbench*

The Python version of the *clustering-benchmarks* package
can be installed from [PyPI](https://pypi.org/project/clustering-benchmarks/),
e.g., via a call to:

```
pip3 install clustering-benchmarks
```

from the command line. Alternatively, please use your favourite Python
package manager.

Once installed, we can import it by calling:

``` python
import clustbench
```

Below, we discuss its basic features and usage.
|
14 | 29 |
|
| 30 | +::::{note} |
| 31 | +*To learn more about Python, check out Marek's open-access (free!) textbook* |
| 32 | +[Minimalist Data Wrangling in Python](https://datawranglingpy.gagolewski.com/) |
| 33 | +{cite}`datawranglingpy`. |
| 34 | +:::: |
15 | 35 |


## Fetching Benchmark Data

The datasets from the {ref}`sec:suite-v1` can be accessed easily.
It is best to [download](https://github.com/gagolews/clustering-data-v1/releases/tag/v1.1.0)
the whole repository onto our disk first.
Let us assume they are available in the following directory:

``` python
# load from a local directory (download the suite there first)
import os.path
data_path = os.path.join("~", "Projects", "clustering-data-v1")
```

Here is the list of the currently available benchmark batteries
(dataset collections):

``` python
print(clustbench.get_battery_names(path=data_path))
## ['fcps', 'g2mg', 'graves', 'h2mg', 'mnist', 'other', 'sipu', 'uci', 'wut']
```

We can list the datasets in an example battery by calling:

``` python
battery = "wut"
print(clustbench.get_dataset_names(battery, path=data_path))
## ['circles', 'cross', 'graph', 'isolation', 'labirynth', 'mk1', 'mk2', 'mk3', 'mk4', 'olympic', 'smile', 'stripes', 'trajectories', 'trapped_lovers', 'twosplashes', 'windows', 'x1', 'x2', 'x3', 'z1', 'z2', 'z3']
```
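
The two listing functions compose naturally if we wish to traverse
the whole suite. For instance, here is a minimal sketch (assuming
`data_path` as defined above) that enumerates every locally available dataset:

``` python
# walk through all locally available batteries and their datasets
for b_name in clustbench.get_battery_names(path=data_path):
    for d_name in clustbench.get_dataset_names(b_name, path=data_path):
        print(f"{b_name}/{d_name}")
```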

For instance, let us load the `wut/x2` dataset:

``` python
dataset = "x2"
b = clustbench.load_dataset(battery, dataset, path=data_path)
```

The above call returned a named tuple. For instance, the corresponding README
file can be inspected by accessing the `description` field:

``` python
print(b.description)
## Author: Eliza Kaczorek (Warsaw University of Technology)
##
## `labels0` come from the Author herself.
## `labels1` were generated by Marek Gagolewski.
## `0` denotes the noise class (if present).
```

Moreover, the `data` field gives the data matrix, `labels` is the list
of all ground truth partitions (encoded as label vectors),
and `n_clusters` gives the corresponding numbers of subsets.
In case of any doubt, we can always consult the official documentation
of the {any}`clustbench.load_dataset` function.

::::{note}
Particular datasets can be retrieved from an online repository directly
(no need to download the whole battery first) by calling:

``` python
data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
b = clustbench.load_dataset("wut", "x2", url=data_url)
```
::::

For instance, here is the shape (*n* and *d*) of the data matrix,
the number of reference partitions, and their cardinalities *k*,
respectively:

``` python
print(b.data.shape, len(b.labels), b.n_clusters)
## (120, 2) 2 [3 4]
```

The following figure (generated via a call to
[`genieclust`](https://genieclust.gagolewski.com/)`.plots.plot_scatter`)
illustrates the benchmark dataset at hand.

``` python
import genieclust
import matplotlib.pyplot as plt
for i in range(len(b.labels)):
    plt.subplot(1, len(b.labels), i+1)
    genieclust.plots.plot_scatter(
        b.data, labels=b.labels[i]-1, axis="equal", title=f"labels{i}"
    )
plt.show()
```

(fig:using-clustbench-example1)=
```{figure} clustbench-usage-figures/using-clustbench-example1-1.*
An example benchmark dataset and the corresponding ground truth labels.
```

## Fetching Precomputed Results

Let us study one of the sets of
[precomputed clustering results](https://github.com/gagolews/clustering-results-v1)
stored in the following directory:

``` python
results_path = os.path.join("~", "Projects", "clustering-results-v1", "original")
```

They can be fetched by calling:

``` python
method_group = "Genie"  # or "*" for everything
res = clustbench.load_results(
    method_group, b.battery, b.dataset, b.n_clusters, path=results_path
)
print(list(res.keys()))
## ['Genie_G0.1', 'Genie_G0.3', 'Genie_G0.5', 'Genie_G0.7', 'Genie_G1.0']
```

We have thus gained access to precomputed partitions
generated by the [*Genie*](https://genieclust.gagolewski.com)
algorithm with different `gini_threshold` parameter settings.
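
Each entry of `res` is itself a dictionary that maps the number of clusters
to the corresponding predicted label vector, as a quick inspection reveals
(a minimal sketch, assuming the structure implied by the indexing used below):

``` python
# peek at the structure of one result set: n_clusters -> label vector
for k, labels in res["Genie_G0.3"].items():
    print(k, labels.shape)
```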


## Computing External Cluster Validity Measures

Different
{ref}`external cluster validity measures <sec:external-validity-measures>`
can be computed by calling {any}`clustbench.get_score`:

``` python
import pandas as pd
pd.Series({  # a pandas Series, for aesthetics
    method: clustbench.get_score(b.labels, res[method])
    for method in res.keys()
})
## Genie_G0.1    0.870000
## Genie_G0.3    0.870000
## Genie_G0.5    0.590909
## Genie_G0.7    0.666667
## Genie_G1.0    0.010000
## dtype: float64
```

By default, normalised clustering accuracy is applied.
As explained in the tutorial, we compare the predicted clusterings against
{ref}`all <sec:many-partitions>` the reference partitions
({ref}`ignoring noise points <sec:noise-points>`)
and report the maximal score.
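
To make this procedure concrete, here is a rough sketch of such a
maximal-score computation for a single set of predictions; it assumes
a recent version of *genieclust* that provides
`compare_partitions.normalized_clustering_accuracy`, and it is not
necessarily identical to what {any}`clustbench.get_score` does internally:

``` python
import genieclust
import numpy as np

def max_score(labels_list, results):
    # For each reference partition, pick the prediction with the matching
    # number of clusters, compare the two while skipping the reference's
    # noise points (label 0), and report the maximal score.
    scores = []
    for y_true in labels_list:
        k = int(y_true.max())  # labels are 1..k; 0 marks noise
        ok = y_true > 0        # ignore noise points
        scores.append(
            genieclust.compare_partitions.normalized_clustering_accuracy(
                y_true[ok], results[k][ok]
            )
        )
    return max(scores)

max_score(b.labels, res["Genie_G0.3"])  # should roughly match get_score
```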

Let us depict the results for the `"Genie_G0.3"` method:

``` python
method = "Genie_G0.3"
for i, k in enumerate(res[method].keys()):
    plt.subplot(1, len(res[method]), i+1)
    genieclust.plots.plot_scatter(
        b.data, labels=res[method][k]-1, axis="equal", title=f"{method}; k={k}"
    )
plt.show()
```

(fig:using-clustbench-example2)=
```{figure} clustbench-usage-figures/using-clustbench-example2-3.*
Results generated by Genie.
```

## Applying Clustering Methods Manually

Naturally, the aim of this benchmark framework is also to test new methods.
We can use {any}`clustbench.fit_predict_many` to generate
all the partitions required for a comparison against the reference labels.

For instance, let us investigate the behaviour of the k-means algorithm:

``` python
import sklearn.cluster
m = sklearn.cluster.KMeans(n_init=10)
res["KMeans"] = clustbench.fit_predict_many(m, b.data, b.n_clusters)
clustbench.get_score(b.labels, res["KMeans"])
## np.float64(0.9848484848484849)
```

We see that k-means (which specialises in detecting symmetric Gaussian-like blobs)
performs better than *Genie* on this particular dataset.

``` python
method = "KMeans"
for i, k in enumerate(res[method].keys()):
    plt.subplot(1, len(res[method]), i+1)
    genieclust.plots.plot_scatter(
        b.data, labels=res[method][k]-1, axis="equal", title=f"{method}; k={k}"
    )
plt.show()
```

(fig:using-clustbench-example3)=
```{figure} clustbench-usage-figures/using-clustbench-example3-5.*
Results generated by k-means.
```

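Any estimator that follows the scikit-learn API and accepts an `n_clusters`
parameter can be plugged in the same way. For example, here is a sketch
benchmarking average-linkage agglomerative clustering (the choice of method
is merely illustrative):

``` python
# benchmark another scikit-learn-compatible algorithm the same way
import sklearn.cluster
m = sklearn.cluster.AgglomerativeClustering(linkage="average")
res["Agglomerative"] = clustbench.fit_predict_many(m, b.data, b.n_clusters)
print(clustbench.get_score(b.labels, res["Agglomerative"]))
```
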

For more functions, please refer to the package's documentation (see the next section).
Moreover, {ref}`sec:colouriser` describes a standalone application
that can be used to prepare our own two-dimensional datasets.

Note that you do not have to use the *clustering-benchmarks* package
to access the benchmark datasets from our repository.
As the {ref}`sec:how-to-access` section mentions, most tasks boil down
to simple operations on files and directories which you can
implement manually. The package was developed merely for the users'
convenience.