Commit 87f0b1d

1 parent fe17ecd commit 87f0b1d

6 files changed: +12 -238 lines changed

.devel/sphinx/news.md

+5
@@ -1,6 +1,11 @@
 # Changelog
 
 
+## 1.1.x (2025-xx-xx)
+
+- [BUGFIX] `numpy.DataSource` is now `numpy.lib.npyio.DataSource`.
+
+
 ## 1.1.5 (2024-08-22)
 
 - Minor extensions/updates.
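
Note on the `[BUGFIX]` entry: NumPy 2.0 no longer exposes `DataSource` in its top-level namespace, which is presumably what prompted the change. Below is a minimal sketch of an import that tolerates both layouts. It is an illustration only, not the code used in *clustbench*; the fallback branch and the temporary file `example.data.gz` are assumptions made for the example.

```python
import gzip
import os
import tempfile

import numpy as np

try:
    # Location named in the changelog entry
    from numpy.lib.npyio import DataSource
except ImportError:
    # Assumption: fall back to the old top-level name if the import above fails
    DataSource = np.DataSource

# Hypothetical gzipped file, created only to exercise the class
tmp_dir = tempfile.mkdtemp()
example_path = os.path.join(tmp_dir, "example.data.gz")
with gzip.open(example_path, "wt") as f:
    f.write("1.0 2.0\n3.0 4.0\n")

ds = DataSource(None)            # None: use a temporary cache directory
found = ds.exists(example_path)  # local paths are looked up on disk
print(found)
```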

.devel/sphinx/weave/clustbench-usage.Rmd

-1
@@ -1,7 +1,6 @@
 (sec:clustbench-usage)=
 # Using *clustbench*
 
-
 The Python version of the *clustering-benchmarks* package
 can be installed from [PyPI](https://pypi.org/project/clustering-benchmarks/),
 e.g., via a call to:

.devel/sphinx/weave/clustbench-usage.md

-233
@@ -2,266 +2,33 @@
 
 
 
-(sec:clustbench-usage)=
-# Using *clustbench*
 
 
-The Python version of the *clustering-benchmarks* package
-can be installed from [PyPI](https://pypi.org/project/clustering-benchmarks/),
-e.g., via a call to:
 
-```
-pip3 install clustering-benchmarks
-```
 
-from the command line. Alternatively, please use your favourite Python
-package manager.
 
 
-Once installed, we can import it by calling:
 
 
 
-```python
-import clustbench
-```
 
-Below we discuss its basic features/usage.
 
 
-::::{note}
-*To learn more about Python,
-check out Marek's open-access (free!) textbook*
-[Minimalist Data Wrangling in Python](https://datawranglingpy.gagolewski.com/)
-{cite}`datawranglingpy`.
-::::
 
 
-## Fetching Benchmark Data
 
-The datasets from the {ref}`sec:suite-v1` can be accessed easily.
-It is best to [download](https://github.com/gagolews/clustering-data-v1)
-the whole repository onto our disk first.
-Let us assume they are available in the following directory:
 
 
 
-```python
-# load from a local library (download the suite manually)
-import os.path
-data_path = os.path.join("~", "Projects", "clustering-data-v1")
-```
 
-Here is the list of the currently available benchmark batteries
-(dataset collections):
 
 
 
-```python
-print(clustbench.get_battery_names(path=data_path))
-## ['fcps', 'g2mg', 'graves', 'h2mg', 'mnist', 'other', 'sipu', 'uci', 'wut']
-```
 
-We can list the datasets in an example battery by calling:
 
 
 
-```python
-battery = "wut"
-print(clustbench.get_dataset_names(battery, path=data_path))
-## ['circles', 'cross', 'graph', 'isolation', 'labirynth', 'mk1', 'mk2', 'mk3', 'mk4', 'olympic', 'smile', 'stripes', 'trajectories', 'trapped_lovers', 'twosplashes', 'windows', 'x1', 'x2', 'x3', 'z1', 'z2', 'z3']
-```
 
-For instance, let us load the `wut/x2` dataset:
 
 
 
-
-```python
-dataset = "x2"
-b = clustbench.load_dataset(battery, dataset, path=data_path)
-```
-
-The above call returned a named tuple.
-For instance, the README file can be accessed via the `description`
-field:
-
-
-
-```python
-print(b.description)
-## Author: Eliza Kaczorek (Warsaw University of Technology)
-##
-## `labels0` come from the Author herself.
-## `labels1` were generated by Marek Gagolewski.
-## `0` denotes the noise class (if present).
-```
-
-What is more, the `data` field is the data matrix, `labels` gives the list
-of all ground truth partitions (encoded as label vectors),
-and `n_clusters` gives the corresponding numbers of subsets.
-In case of any doubt, we can always consult the official documentation
-of the {any}`clustbench.load_dataset` function.
-
-For instance, here is the shape (*n* and *d*) of the data matrix,
-the number of reference partitions, and their cardinalities *k*,
-respectively:
-
-
-
-```python
-print(b.data.shape, len(b.labels), b.n_clusters)
-## (120, 2) 2 [3 4]
-```
-
-The following figure (generated via a call to
-[`genieclust`](https://genieclust.gagolewski.com/)`.plots.plot_scatter`)
-illustrates the benchmark dataset at hand.
-
-
-
-```python
-import genieclust
-for i in range(len(b.labels)):
-    plt.subplot(1, len(b.labels), i+1)
-    genieclust.plots.plot_scatter(
-        b.data, labels=b.labels[i]-1, axis="equal", title=f"labels{i}"
-    )
-plt.show()
-```
-
-(fig:using-clustbench-example1)=
-```{figure} clustbench-usage-figures/using-clustbench-example1-1.*
-An example benchmark dataset and the corresponding ground truth labels.
-```
-
-
-## Fetching Precomputed Results
-
-Let us study one of the sets of
-[precomputed clustering results](https://github.com/gagolews/clustering-results-v1)
-stored in the following directory:
-
-
-
-
-```python
-results_path = os.path.join("~", "Projects", "clustering-results-v1", "original")
-```
-
-They can be fetched by calling:
-
-
-
-```python
-method_group = "Genie" # or "*" for everything
-res = clustbench.load_results(
-    method_group, b.battery, b.dataset, b.n_clusters, path=results_path
-)
-print(list(res.keys()))
-## ['Genie_G0.1', 'Genie_G0.3', 'Genie_G0.5', 'Genie_G0.7', 'Genie_G1.0']
-```
-
-We thus have got access to precomputed data
-generated by the [*Genie*](https://genieclust.gagolewski.com)
-algorithm with different `gini_threshold` parameter settings.
-
-
-
-## Computing External Cluster Validity Measures
-
-
-Different
-{ref}`external cluster validity measures <sec:external-validity-measures>`
-can be computed by calling {any}`clustbench.get_score`:
-
-
-
-
-```python
-pd.Series({ # for aesthetics
-    method: clustbench.get_score(b.labels, res[method])
-    for method in res.keys()
-})
-## Genie_G0.1 0.870000
-## Genie_G0.3 0.870000
-## Genie_G0.5 0.590909
-## Genie_G0.7 0.666667
-## Genie_G1.0 0.010000
-## dtype: float64
-```
-
-By default, normalised clustering accuracy is applied.
-As explained in the tutorial, we compare the predicted clusterings against
-{ref}`all <sec:many-partitions>` the reference partitions
-({ref}`ignoring noise points <sec:noise-points>`)
-and report the maximal score.
-
-Let us depict the results for the `"Genie_G0.3"` method:
-
-
-
-```python
-method = "Genie_G0.3"
-for i, k in enumerate(res[method].keys()):
-    plt.subplot(1, len(res[method]), i+1)
-    genieclust.plots.plot_scatter(
-        b.data, labels=res[method][k]-1, axis="equal", title=f"{method}; k={k}"
-    )
-plt.show()
-```
-
-(fig:using-clustbench-example2)=
-```{figure} clustbench-usage-figures/using-clustbench-example2-3.*
-Results generated by Genie.
-```
-
-
-## Applying Clustering Methods Manually
-
-Naturally, the aim of this benchmark framework is also to test new methods.
-We can use {any}`clustbench.fit_predict_many` to generate
-all the partitions required to compare ourselves against the reference labels.
-
-For instance, let us investigate the behaviour of the k-means algorithm:
-
-
-
-```python
-import sklearn.cluster
-m = sklearn.cluster.KMeans(n_init=10)
-res["KMeans"] = clustbench.fit_predict_many(m, b.data, b.n_clusters)
-clustbench.get_score(b.labels, res["KMeans"])
-## 0.9848484848484849
-```
-
-We see that k-means (which specialises in detecting symmetric Gaussian-like blobs)
-performs better than *Genie* on this particular dataset.
-
-
-
-```python
-method = "KMeans"
-for i, k in enumerate(res[method].keys()):
-    plt.subplot(1, len(res[method]), i+1)
-    genieclust.plots.plot_scatter(
-        b.data, labels=res[method][k]-1, axis="equal", title=f"{method}; k={k}"
-    )
-plt.show()
-```
-
-(fig:using-clustbench-example3)=
-```{figure} clustbench-usage-figures/using-clustbench-example3-5.*
-Results generated by K-Means.
-```
-
-For more functions, please refer to the package's documentation (in the next section).
-Moreover, {ref}`sec:colouriser` describes a standalone application
-that can be used to prepare our own two-dimensional datasets.
-
-Note that you do not have to use the *clustering-benchmark* package
-to access the benchmark datasets from our repository.
-The {ref}`sec:how-to-access` section mentions that most operations
-involve simple operations on files and directories which you can
-implement manually. The package was developed merely for the users'
-convenience.

.devel/sphinx/weave/how-to-access.Rmd

+5-4
@@ -87,8 +87,10 @@ the [*rpy2*](https://pypi.org/project/rpy2/) package.
 
 ## MATLAB
 
-Unfortunately, MATLAB does not seem to be able to un*gzip* files
-on the fly, but these can be decompressed to a temporary folder
+Unfortunately, MATLAB is not free software.
+
+It does not seem to be able to un*gzip* files
+on the fly, but they can be decompressed to a temporary folder
 manually.
 
 ```matlab
@@ -100,10 +102,9 @@ labels = readmatrix(char(gunzip(base_name + ".labels0.gz", t)), FileType="text")
 
 Note that there is also a MATLAB
 [interface](https://au.mathworks.com/products/matlab/matlab-and-python.html)
-for Python. This way, algorithms that have only been implemented in the
+for Python. This way, algorithms that have only been implemented in the
 former can be called from within the latter.
 
-Unfortunately, MATLAB is not free software.
 
 <!--
 ## Julia

NEWS

+1
@@ -4,6 +4,7 @@
 ## 1.1.x (2025-xx-xx)
 
 - [BUGFIX] `numpy.DataSource` is now `numpy.lib.npyio.DataSource`.
+- `load_dataset` now guarantees to return a C-contiguous `data` matrix.
 
 
 ## 1.1.5 (2024-08-22)

clustbench/load_dataset.py

+1
@@ -127,6 +127,7 @@ def load_dataset(
 
     data_file = base_name + ".data.gz"
     data = np.loadtxt(data_file, ndmin=2)
+    data = np.ascontiguousarray(data)
 
     if data.ndim != 2:
         raise ValueError("Not a matrix.")
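
Note on the added line: the effect of `np.ascontiguousarray` can be seen on a small array. This is a hedged sketch that assumes only NumPy itself. The Fortran-ordered array stands in for any input that is not stored row by row, and it makes no claim about the layout `np.loadtxt` actually returns.

```python
import numpy as np

a = np.arange(6.0).reshape(2, 3)   # C-contiguous (row-major) by construction
f = np.asfortranarray(a)           # same values, column-major layout

c = np.ascontiguousarray(f)        # copied into row-major layout
same = np.ascontiguousarray(a)     # already row-major, so no copy is needed

# Layout flags before and after the conversion
print(f.flags.c_contiguous, c.flags.c_contiguous)
# The values are unchanged; the already-contiguous input shares its buffer
print(np.array_equal(c, a), np.shares_memory(same, a))
```

The call is cheap when the array is already row-major, so applying it unconditionally, as the diff does, costs little.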
