(sec:clustbench-usage)=
# Using *clustbench*

The Python version of the *clustering-benchmarks* package
can be installed from [PyPI](https://pypi.org/project/clustering-benchmarks/),
e.g., via a call to:

```
pip3 install clustering-benchmarks
```

from the command line. Alternatively, please use your favourite Python
package manager.

Once installed, we can import it by calling:

```python
import clustbench
```

Below we discuss its basic features/usage.

::::{note}
*To learn more about Python,
check out Marek's open-access (free!) textbook*
[Minimalist Data Wrangling in Python](https://datawranglingpy.gagolewski.com/)
{cite}`datawranglingpy`.
::::

## Fetching Benchmark Data

The datasets from the {ref}`sec:suite-v1` can be accessed easily.
It is best to [download](https://github.com/gagolews/clustering-data-v1)
the whole repository onto our disk first.
Let us assume they are available in the following directory:

```python
# load from a local copy of the suite (downloaded manually)
import os.path
data_path = os.path.join("~", "Projects", "clustering-data-v1")
```

Here is the list of the currently available benchmark batteries
(dataset collections):

```python
print(clustbench.get_battery_names(path=data_path))
## ['fcps', 'g2mg', 'graves', 'h2mg', 'mnist', 'other', 'sipu', 'uci', 'wut']
```

We can list the datasets in an example battery by calling:

```python
battery = "wut"
print(clustbench.get_dataset_names(battery, path=data_path))
## ['circles', 'cross', 'graph', 'isolation', 'labirynth', 'mk1', 'mk2', 'mk3', 'mk4', 'olympic', 'smile', 'stripes', 'trajectories', 'trapped_lovers', 'twosplashes', 'windows', 'x1', 'x2', 'x3', 'z1', 'z2', 'z3']
```

For instance, let us load the `wut/x2` dataset:

```python
dataset = "x2"
b = clustbench.load_dataset(battery, dataset, path=data_path)
```

The above call returns a named tuple.
For instance, the README file can be accessed via the `description`
field:

```python
print(b.description)
## Author: Eliza Kaczorek (Warsaw University of Technology)
##
## `labels0` come from the Author herself.
## `labels1` were generated by Marek Gagolewski.
## `0` denotes the noise class (if present).
```

What is more, the `data` field is the data matrix, `labels` gives the list
of all ground truth partitions (encoded as label vectors),
and `n_clusters` gives the corresponding numbers of subsets.
In case of any doubt, we can always consult the official documentation
of the {any}`clustbench.load_dataset` function.
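
The named tuple also identifies the benchmark it belongs to: as a small
illustration, the `battery` and `dataset` fields (relied upon later when
fetching precomputed results) can be inspected directly:

```python
# the named tuple records the dataset's identity, too:
print(b.battery, b.dataset)
```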

For instance, here is the shape (*n* and *d*) of the data matrix,
the number of reference partitions, and their cardinalities *k*,
respectively:

```python
print(b.data.shape, len(b.labels), b.n_clusters)
## (120, 2) 2 [3 4]
```

The following figure (generated via a call to
[`genieclust`](https://genieclust.gagolewski.com/)`.plots.plot_scatter`)
illustrates the benchmark dataset at hand.

```python
import genieclust
import matplotlib.pyplot as plt
for i in range(len(b.labels)):
    plt.subplot(1, len(b.labels), i+1)
    genieclust.plots.plot_scatter(
        b.data, labels=b.labels[i]-1, axis="equal", title=f"labels{i}"
    )
plt.show()
```

(fig:using-clustbench-example1)=
```{figure} clustbench-usage-figures/using-clustbench-example1-1.*
An example benchmark dataset and the corresponding ground truth labels.
```

## Fetching Precomputed Results

Let us study one of the sets of
[precomputed clustering results](https://github.com/gagolews/clustering-results-v1)
stored in the following directory:

```python
results_path = os.path.join("~", "Projects", "clustering-results-v1", "original")
```

They can be fetched by calling:

```python
method_group = "Genie"  # or "*" for everything
res = clustbench.load_results(
    method_group, b.battery, b.dataset, b.n_clusters, path=results_path
)
print(list(res.keys()))
## ['Genie_G0.1', 'Genie_G0.3', 'Genie_G0.5', 'Genie_G0.7', 'Genie_G1.0']
```

We thus have access to precomputed data
generated by the [*Genie*](https://genieclust.gagolewski.com)
algorithm with different `gini_threshold` parameter settings.
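
Each value in `res` is itself a dictionary that maps a requested number of
clusters *k* to the corresponding vector of predicted labels (one element
per observation); the plotting code below relies on this layout. As a
quick sketch:

```python
# peek at the structure: keys are cluster counts,
# values are the corresponding predicted label vectors
for k, labels in res["Genie_G0.3"].items():
    print(k, len(labels))
```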

## Computing External Cluster Validity Measures

Different
{ref}`external cluster validity measures <sec:external-validity-measures>`
can be computed by calling {any}`clustbench.get_score`:

```python
import pandas as pd
pd.Series({  # for aesthetics
    method: clustbench.get_score(b.labels, res[method])
    for method in res.keys()
})
## Genie_G0.1    0.870000
## Genie_G0.3    0.870000
## Genie_G0.5    0.590909
## Genie_G0.7    0.666667
## Genie_G1.0    0.010000
## dtype: float64
```

By default, normalised clustering accuracy is applied.
As explained in the tutorial, we compare the predicted clusterings against
{ref}`all <sec:many-partitions>` the reference partitions
({ref}`ignoring noise points <sec:noise-points>`)
and report the maximal score.
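
Other scores can be plugged in as well. Here is a hedged sketch, assuming
that {any}`clustbench.get_score` accepts a custom `metric` callable
(consult its documentation), which lets us swap in, e.g., the adjusted
Rand index from `genieclust.compare_partitions`:

```python
# assumption: `metric` may be any callable comparing two label vectors
import genieclust
clustbench.get_score(
    b.labels, res["Genie_G0.3"],
    metric=genieclust.compare_partitions.adjusted_rand_score
)
```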

Let us depict the results for the `"Genie_G0.3"` method:

```python
method = "Genie_G0.3"
for i, k in enumerate(res[method].keys()):
    plt.subplot(1, len(res[method]), i+1)
    genieclust.plots.plot_scatter(
        b.data, labels=res[method][k]-1, axis="equal", title=f"{method}; k={k}"
    )
plt.show()
```

(fig:using-clustbench-example2)=
```{figure} clustbench-usage-figures/using-clustbench-example2-3.*
Results generated by Genie.
```

## Applying Clustering Methods Manually

Naturally, the aim of this benchmark framework is also to test new methods.
We can use {any}`clustbench.fit_predict_many` to generate
all the partitions required for comparing a new method against the reference labels.

For instance, let us investigate the behaviour of the k-means algorithm:

```python
import sklearn.cluster
m = sklearn.cluster.KMeans(n_init=10)
res["KMeans"] = clustbench.fit_predict_many(m, b.data, b.n_clusters)
clustbench.get_score(b.labels, res["KMeans"])
## 0.9848484848484849
```

We see that k-means (which specialises in detecting symmetric Gaussian-like blobs)
performs better than *Genie* on this particular dataset.

```python
method = "KMeans"
for i, k in enumerate(res[method].keys()):
    plt.subplot(1, len(res[method]), i+1)
    genieclust.plots.plot_scatter(
        b.data, labels=res[method][k]-1, axis="equal", title=f"{method}; k={k}"
    )
plt.show()
```

(fig:using-clustbench-example3)=
```{figure} clustbench-usage-figures/using-clustbench-example3-5.*
Results generated by k-means.
```

For more functions, please refer to the package's documentation (in the next section).
Moreover, {ref}`sec:colouriser` describes a standalone application
that can be used to prepare our own two-dimensional datasets.

Note that you do not have to use the *clustering-benchmarks* package
to access the benchmark datasets from our repository.
The {ref}`sec:how-to-access` section mentions that most tasks
involve simple operations on files and directories, which you can
implement manually. The package was developed merely for the users'
convenience.
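
For illustration, here is a minimal sketch of such a manual routine,
assuming the suite's plain-text layout: a whitespace-separated
`<battery>/<dataset>.data.gz` matrix accompanied by
`<battery>/<dataset>.labels0.gz`, `<battery>/<dataset>.labels1.gz`, etc.:

```python
# a minimal sketch: read wut/x2 by hand (no clustbench), assuming
# the clustering-data-v1 file layout described above
import os.path
import numpy as np

base = os.path.expanduser(os.path.join(data_path, "wut"))
X = np.loadtxt(os.path.join(base, "x2.data.gz"), ndmin=2)   # data matrix
y0 = np.loadtxt(os.path.join(base, "x2.labels0.gz"))        # reference labels
print(X.shape, y0.shape)
```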