Commit 50b5b5d

v1.1.6
1 parent 772df8e commit 50b5b5d

75 files changed, +1020 -395 lines changed


.devel/sphinx/news.md

+2 -1
@@ -1,9 +1,10 @@
 # Changelog


-## 1.1.x (2025-xx-xx)
+## 1.1.6 (2025-02-03)

 - [BUGFIX] `numpy.DataSource` is now `numpy.lib.npyio.DataSource`.
+- `load_dataset` now guarantees to return a C-contiguous `data` matrix.


 ## 1.1.5 (2024-08-22)

Binary files not shown (3 files).

.devel/sphinx/weave/clustbench-usage.Rmd

+1 -2
@@ -23,8 +23,7 @@ Below we discuss its basic features/usage.


 ::::{note}
-*To learn more about Python,
-check out Marek's open-access (free!) textbook*
+*To learn more about Python, check out Marek's open-access (free!) textbook*
 [Minimalist Data Wrangling in Python](https://datawranglingpy.gagolewski.com/)
 {cite}`datawranglingpy`.
 ::::

.devel/sphinx/weave/clustbench-usage.md

+227
@@ -2,33 +2,260 @@

(sec:clustbench-usage)=
# Using *clustbench*

The Python version of the *clustering-benchmarks* package
can be installed from [PyPI](https://pypi.org/project/clustering-benchmarks/),
e.g., via a call to:

```
pip3 install clustering-benchmarks
```

from the command line. Alternatively, please use your favourite Python
package manager.

Once installed, we can import it by calling:

``` python
import clustbench
```

Below we discuss its basic features/usage.

::::{note}
*To learn more about Python, check out Marek's open-access (free!) textbook*
[Minimalist Data Wrangling in Python](https://datawranglingpy.gagolewski.com/)
{cite}`datawranglingpy`.
::::

## Fetching Benchmark Data

The datasets from the {ref}`sec:suite-v1` can be accessed easily.
It is best to [download](https://github.com/gagolews/clustering-data-v1/releases/tag/v1.1.0)
the whole repository onto our disk first.
Let us assume they are available in the following directory:

``` python
# load from a local library (download the suite manually)
import os.path
data_path = os.path.join("~", "Projects", "clustering-data-v1")
```
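
Note that `os.path.join` only concatenates path components and leaves a leading `~` untouched. Whether *clustbench* expands it internally is not stated here, so the following minimal sketch (standard library only, not part of the package) expands it explicitly with `os.path.expanduser`. The helper name `resolve_data_path` is hypothetical.

``` python
import os.path


def resolve_data_path(*parts):
    """Join path components and expand a leading "~" to the home directory."""
    return os.path.expanduser(os.path.join(*parts))


raw_path = os.path.join("~", "Projects", "clustering-data-v1")
full_path = resolve_data_path("~", "Projects", "clustering-data-v1")

print(raw_path)                  # still begins with "~"
print(os.path.isdir(full_path))  # True only if the suite was downloaded there
```

Checking `os.path.isdir` before calling any loader gives an early, readable failure if the suite lives elsewhere.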

Here is the list of the currently available benchmark batteries
(dataset collections):

``` python
print(clustbench.get_battery_names(path=data_path))
## ['fcps', 'g2mg', 'graves', 'h2mg', 'mnist', 'other', 'sipu', 'uci', 'wut']
```

We can list the datasets in an example battery by calling:

``` python
battery = "wut"
print(clustbench.get_dataset_names(battery, path=data_path))
## ['circles', 'cross', 'graph', 'isolation', 'labirynth', 'mk1', 'mk2', 'mk3', 'mk4', 'olympic', 'smile', 'stripes', 'trajectories', 'trapped_lovers', 'twosplashes', 'windows', 'x1', 'x2', 'x3', 'z1', 'z2', 'z3']
```

For instance, let us load the `wut/x2` dataset:

``` python
dataset = "x2"
b = clustbench.load_dataset(battery, dataset, path=data_path)
```

The above call returned a named tuple. For instance, the corresponding README
file can be inspected by accessing the `description` field:

``` python
print(b.description)
## Author: Eliza Kaczorek (Warsaw University of Technology)
##
## `labels0` come from the Author herself.
## `labels1` were generated by Marek Gagolewski.
## `0` denotes the noise class (if present).
```

Moreover, the `data` field gives the data matrix, `labels` is the list
of all ground truth partitions (encoded as label vectors),
and `n_clusters` gives the corresponding numbers of subsets.
In case of any doubt, we can always consult the official documentation
of the {any}`clustbench.load_dataset` function.

::::{note}
Particular datasets can be retrieved from an online repository directly
(no need to download the whole battery first) by calling:

``` python
data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
b = clustbench.load_dataset("wut", "x2", url=data_url)
```
::::

For instance, here is the shape (*n* and *d*) of the data matrix,
the number of reference partitions, and their cardinalities *k*,
respectively:

``` python
print(b.data.shape, len(b.labels), b.n_clusters)
## (120, 2) 2 [3 4]
```

The following figure (generated via a call to
[`genieclust`](https://genieclust.gagolewski.com/)`.plots.plot_scatter`)
illustrates the benchmark dataset at hand.

``` python
import genieclust
import matplotlib.pyplot as plt
for i in range(len(b.labels)):
    plt.subplot(1, len(b.labels), i+1)
    genieclust.plots.plot_scatter(
        b.data, labels=b.labels[i]-1, axis="equal", title=f"labels{i}"
    )
plt.show()
```

(fig:using-clustbench-example1)=
```{figure} clustbench-usage-figures/using-clustbench-example1-1.*
An example benchmark dataset and the corresponding ground truth labels.
```


## Fetching Precomputed Results

Let us study one of the sets of
[precomputed clustering results](https://github.com/gagolews/clustering-results-v1)
stored in the following directory:

``` python
results_path = os.path.join("~", "Projects", "clustering-results-v1", "original")
```

They can be fetched by calling:

``` python
method_group = "Genie"  # or "*" for everything
res = clustbench.load_results(
    method_group, b.battery, b.dataset, b.n_clusters, path=results_path
)
print(list(res.keys()))
## ['Genie_G0.1', 'Genie_G0.3', 'Genie_G0.5', 'Genie_G0.7', 'Genie_G1.0']
```

We thus have access to the precomputed results
generated by the [*Genie*](https://genieclust.gagolewski.com)
algorithm with different `gini_threshold` parameter settings.

169+
## Computing External Cluster Validity Measures
170+
171+
172+
Different
173+
{ref}`external cluster validity measures <sec:external-validity-measures>`
174+
can be computed by calling {any}`clustbench.get_score`:
175+
176+
177+
178+
``` python
179+
pd.Series({ # for aesthetics
180+
method: clustbench.get_score(b.labels, res[method])
181+
for method in res.keys()
182+
})
183+
## Genie_G0.1 0.870000
184+
## Genie_G0.3 0.870000
185+
## Genie_G0.5 0.590909
186+
## Genie_G0.7 0.666667
187+
## Genie_G1.0 0.010000
188+
## dtype: float64
189+
```
190+
191+
By default, normalised clustering accuracy is applied.
192+
As explained in the tutorial, we compare the predicted clusterings against
193+
{ref}`all <sec:many-partitions>` the reference partitions
194+
({ref}`ignoring noise points <sec:noise-points>`)
195+
and report the maximal score.
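
The "compare against every reference partition, skip noise points, keep the maximum" logic described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of `clustbench.get_score`: the score used here is a simple brute-force best-match accuracy (a stand-in, not normalised clustering accuracy), predictions are assumed to be stored in a dictionary keyed by the number of clusters, and both function names are hypothetical.

``` python
from itertools import permutations


def best_match_accuracy(y_true, y_pred):
    """Stand-in score: accuracy under the best relabelling of predicted clusters.

    Brute force over label permutations, so suitable for small k only.
    """
    labels = sorted(set(y_true) | set(y_pred))
    best = 0.0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        hits = sum(mapping[p] == t for t, p in zip(y_true, y_pred))
        best = max(best, hits / len(y_true))
    return best


def max_score_over_references(references, predictions):
    """Score each reference partition against the prediction with the same k."""
    scores = []
    for y_true in references:
        k = len(set(y_true) - {0})  # 0 marks noise points (assumed convention)
        y_pred = predictions[k]
        kept = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
        scores.append(
            best_match_accuracy([t for t, _ in kept], [p for _, p in kept])
        )
    return max(scores)


# toy data: two reference partitions (k=2 with one noise point, and k=3)
references = [[1, 1, 2, 2, 0], [1, 1, 2, 3, 3]]
predictions = {2: [2, 2, 1, 1, 1], 3: [1, 2, 3, 3, 3]}
best_score = max_score_over_references(references, predictions)
print(best_score)
```

On this toy input the first reference is matched perfectly once its noise point is dropped, so the maximum is what gets reported even though the second reference is matched less well.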

Let us depict the results for the `"Genie_G0.3"` method:

``` python
method = "Genie_G0.3"
for i, k in enumerate(res[method].keys()):
    plt.subplot(1, len(res[method]), i+1)
    genieclust.plots.plot_scatter(
        b.data, labels=res[method][k]-1, axis="equal", title=f"{method}; k={k}"
    )
plt.show()
```

(fig:using-clustbench-example2)=
```{figure} clustbench-usage-figures/using-clustbench-example2-3.*
Results generated by Genie.
```

## Applying Clustering Methods Manually
217+
218+
Naturally, the aim of this benchmark framework is also to test new methods.
219+
We can use {any}`clustbench.fit_predict_many` to generate
220+
all the partitions required to compare ourselves against the reference labels.
221+
222+
For instance, let us investigate the behaviour of the k-means algorithm:
223+
224+
225+
``` python
226+
import sklearn.cluster
227+
m = sklearn.cluster.KMeans(n_init=10)
228+
res["KMeans"] = clustbench.fit_predict_many(m, b.data, b.n_clusters)
229+
clustbench.get_score(b.labels, res["KMeans"])
230+
## np.float64(0.9848484848484849)
231+
```
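
Conceptually, generating "all the partitions required" means running the method once per requested number of clusters and collecting the label vectors. The sketch below illustrates that idea with a toy estimator. It is an assumption about the general pattern, not the code of `clustbench.fit_predict_many`; `ThresholdClusterer` and `fit_predict_for_each_k` are hypothetical names, and the shift to 1-based labels mirrors the `-1` applied before plotting in the examples here.

``` python
class ThresholdClusterer:
    """Toy stand-in for a scikit-learn-like estimator (hypothetical)."""

    def __init__(self, n_clusters=2):
        self.n_clusters = n_clusters

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit_predict(self, data):
        # split the range of 1-D values into n_clusters equal-width bins;
        # returned labels are 0-based
        lo, hi = min(data), max(data)
        width = (hi - lo) / self.n_clusters or 1.0
        return [min(int((x - lo) / width), self.n_clusters - 1) for x in data]


def fit_predict_for_each_k(model, data, n_clusters):
    """Run the model once per k and return {k: 1-based label vector}."""
    results = {}
    for k in n_clusters:
        labels = model.set_params(n_clusters=k).fit_predict(data)
        results[int(k)] = [label + 1 for label in labels]
    return results


data = [0.0, 0.1, 0.2, 5.0, 5.1, 9.9, 10.0]
res_toy = fit_predict_for_each_k(ThresholdClusterer(), data, [2, 3])
print(sorted(res_toy.keys()))
```

Keying the results by *k* makes it straightforward to pair each prediction with the reference partition of the same cardinality.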

We see that k-means (which specialises in detecting symmetric Gaussian-like blobs)
performs better than *Genie* on this particular dataset.

``` python
method = "KMeans"
for i, k in enumerate(res[method].keys()):
    plt.subplot(1, len(res[method]), i+1)
    genieclust.plots.plot_scatter(
        b.data, labels=res[method][k]-1, axis="equal", title=f"{method}; k={k}"
    )
plt.show()
```

(fig:using-clustbench-example3)=
```{figure} clustbench-usage-figures/using-clustbench-example3-5.*
Results generated by k-means.
```

For more functions, please refer to the package's documentation (in the next section).
Moreover, {ref}`sec:colouriser` describes a standalone application
that can be used to prepare our own two-dimensional datasets.

Note that you do not have to use the *clustering-benchmarks* package
to access the benchmark datasets from our repository.
The {ref}`sec:how-to-access` section mentions that most tasks
involve simple operations on files and directories which you can
implement manually. The package was developed merely for the users'
convenience.
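
As a rough illustration of such manual access, the sketch below reads a data matrix and a label vector from gzipped text files using only the standard library. The file layout (`<battery>/<dataset>.data.gz` and `<battery>/<dataset>.labels0.gz`, whitespace-separated values, one row per line) is an assumption that should be checked against the {ref}`sec:how-to-access` section; the helper names and the `toy/demo` files are hypothetical and created on the fly.

``` python
import gzip
import os.path
import tempfile


def read_matrix(path):
    """Read a gzipped text file of whitespace-separated numbers, one row per line."""
    with gzip.open(path, "rt") as f:
        return [[float(v) for v in line.split()] for line in f if line.strip()]


def read_labels(path):
    """Read a gzipped text file with one integer label per line."""
    with gzip.open(path, "rt") as f:
        return [int(line) for line in f if line.strip()]


# demo on a throwaway directory that mimics the assumed layout
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "toy"))
    data_file = os.path.join(root, "toy", "demo.data.gz")
    labels_file = os.path.join(root, "toy", "demo.labels0.gz")
    with gzip.open(data_file, "wt") as f:
        f.write("0.0 1.0\n2.0 3.0\n4.0 5.0\n")
    with gzip.open(labels_file, "wt") as f:
        f.write("1\n1\n2\n")
    data = read_matrix(data_file)
    labels = read_labels(labels_file)

print(len(data), len(data[0]), labels)
```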

.devel/sphinx/weave/external-validity-measures.Rmd

+2 -2
@@ -419,7 +419,7 @@ C[:, o-1]



-
+<!--
 ### Pair Sets Index

 If the symmetry property is required, the pair sets index
@@ -488,7 +488,7 @@ for negative values are difficult to interpret.


 *Implementation: [`genieclust`](https://genieclust.gagolewski.com/)`.compare_partitions.pair_sets_index`*.
-
+-->

 ## Counting Concordant and Discordant Point Pairs
