(sec:clustbench-usage)=
# Using *clustbench*

The Python version of the *clustering-benchmarks* package
can be installed from [PyPI](https://pypi.org/project/clustering-benchmarks/),
e.g., via a call to:

```
pip3 install clustering-benchmarks
```

from the command line. Alternatively, please use your favourite Python
package manager.

Once installed, we can import it by calling:

```python
import clustbench
```

Below we discuss its basic features/usage.

::::{note}
*To learn more about Python,
check out Marek's open-access (free!) textbook*
[Minimalist Data Wrangling in Python](https://datawranglingpy.gagolewski.com/)
{cite}`datawranglingpy`.
::::

## Fetching Benchmark Data

The datasets from the {ref}`sec:suite-v1` can be accessed easily.
It is best to [download](https://github.com/gagolews/clustering-data-v1)
the whole repository onto our disk first.
Let us assume they are available in the following directory:

```python
# load from a local copy of the suite (downloaded manually)
import os.path
data_path = os.path.join("~", "Projects", "clustering-data-v1")
```

Here is the list of the currently available benchmark batteries
(dataset collections):

```python
print(clustbench.get_battery_names(path=data_path))
## ['fcps', 'g2mg', 'graves', 'h2mg', 'mnist', 'other', 'sipu', 'uci', 'wut']
```

We can list the datasets in an example battery by calling:

```python
battery = "wut"
print(clustbench.get_dataset_names(battery, path=data_path))
## ['circles', 'cross', 'graph', 'isolation', 'labirynth', 'mk1', 'mk2', 'mk3', 'mk4', 'olympic', 'smile', 'stripes', 'trajectories', 'trapped_lovers', 'twosplashes', 'windows', 'x1', 'x2', 'x3', 'z1', 'z2', 'z3']
```

For instance, let us load the `wut/x2` dataset:

```python
dataset = "x2"
b = clustbench.load_dataset(battery, dataset, path=data_path)
```

The above call returns a named tuple.
For instance, the README file can be accessed via the `description`
field:

```python
print(b.description)
## Author: Eliza Kaczorek (Warsaw University of Technology)
##
## `labels0` come from the Author herself.
## `labels1` were generated by Marek Gagolewski.
## `0` denotes the noise class (if present).
```

What is more, the `data` field is the data matrix, `labels` gives the list
of all ground truth partitions (encoded as label vectors),
and `n_clusters` gives the corresponding numbers of subsets.
In case of any doubt, we can always consult the official documentation
of the {any}`clustbench.load_dataset` function.
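
The named tuple also identifies the benchmark it belongs to: as a small
illustration, the `battery` and `dataset` fields (relied upon later when
fetching precomputed results) can be inspected directly:

```python
# the named tuple records the dataset's identity, too:
print(b.battery, b.dataset)
```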

For instance, here is the shape (*n* and *d*) of the data matrix,
the number of reference partitions, and their cardinalities *k*,
respectively:

```python
print(b.data.shape, len(b.labels), b.n_clusters)
## (120, 2) 2 [3 4]
```

The following figure (generated via a call to
[`genieclust`](https://genieclust.gagolewski.com/)`.plots.plot_scatter`)
illustrates the benchmark dataset at hand.

```python
import genieclust
import matplotlib.pyplot as plt
for i in range(len(b.labels)):
    plt.subplot(1, len(b.labels), i+1)
    genieclust.plots.plot_scatter(
        b.data, labels=b.labels[i]-1, axis="equal", title=f"labels{i}"
    )
plt.show()
```

(fig:using-clustbench-example1)=
```{figure} clustbench-usage-figures/using-clustbench-example1-1.*
An example benchmark dataset and the corresponding ground truth labels.
```

## Fetching Precomputed Results

Let us study one of the sets of
[precomputed clustering results](https://github.com/gagolews/clustering-results-v1)
stored in the following directory:

```python
results_path = os.path.join("~", "Projects", "clustering-results-v1", "original")
```

They can be fetched by calling:

```python
method_group = "Genie"  # or "*" for everything
res = clustbench.load_results(
    method_group, b.battery, b.dataset, b.n_clusters, path=results_path
)
print(list(res.keys()))
## ['Genie_G0.1', 'Genie_G0.3', 'Genie_G0.5', 'Genie_G0.7', 'Genie_G1.0']
```

We thus have access to precomputed data
generated by the [*Genie*](https://genieclust.gagolewski.com)
algorithm with different `gini_threshold` parameter settings.
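
Each value in `res` is itself a dictionary that maps a requested number of
clusters *k* to the corresponding vector of predicted labels (one element
per observation); the plotting code below relies on this layout. As a
quick sketch:

```python
# peek at the structure: keys are cluster counts,
# values are the corresponding predicted label vectors
for k, labels in res["Genie_G0.3"].items():
    print(k, len(labels))
```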

## Computing External Cluster Validity Measures

Different
{ref}`external cluster validity measures <sec:external-validity-measures>`
can be computed by calling {any}`clustbench.get_score`:

```python
import pandas as pd
pd.Series({  # for aesthetics
    method: clustbench.get_score(b.labels, res[method])
    for method in res.keys()
})
## Genie_G0.1    0.870000
## Genie_G0.3    0.870000
## Genie_G0.5    0.590909
## Genie_G0.7    0.666667
## Genie_G1.0    0.010000
## dtype: float64
```

By default, normalised clustering accuracy is applied.
As explained in the tutorial, we compare the predicted clusterings against
{ref}`all <sec:many-partitions>` the reference partitions
({ref}`ignoring noise points <sec:noise-points>`)
and report the maximal score.
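
Other scores can be plugged in as well. Here is a hedged sketch, assuming
that {any}`clustbench.get_score` accepts a custom `metric` callable
(consult its documentation), which lets us swap in, e.g., the adjusted
Rand index from `genieclust.compare_partitions`:

```python
# assumption: `metric` may be any callable comparing two label vectors
import genieclust
clustbench.get_score(
    b.labels, res["Genie_G0.3"],
    metric=genieclust.compare_partitions.adjusted_rand_score
)
```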

Let us depict the results for the `"Genie_G0.3"` method:

```python
method = "Genie_G0.3"
for i, k in enumerate(res[method].keys()):
    plt.subplot(1, len(res[method]), i+1)
    genieclust.plots.plot_scatter(
        b.data, labels=res[method][k]-1, axis="equal", title=f"{method}; k={k}"
    )
plt.show()
```

(fig:using-clustbench-example2)=
```{figure} clustbench-usage-figures/using-clustbench-example2-3.*
Results generated by Genie.
```

## Applying Clustering Methods Manually

Naturally, the aim of this benchmark framework is also to test new methods.
We can use {any}`clustbench.fit_predict_many` to generate
all the partitions required for comparing a new method against the reference labels.

For instance, let us investigate the behaviour of the k-means algorithm:

```python
import sklearn.cluster
m = sklearn.cluster.KMeans(n_init=10)
res["KMeans"] = clustbench.fit_predict_many(m, b.data, b.n_clusters)
clustbench.get_score(b.labels, res["KMeans"])
## 0.9848484848484849
```

We see that k-means (which specialises in detecting symmetric Gaussian-like blobs)
performs better than *Genie* on this particular dataset.

```python
method = "KMeans"
for i, k in enumerate(res[method].keys()):
    plt.subplot(1, len(res[method]), i+1)
    genieclust.plots.plot_scatter(
        b.data, labels=res[method][k]-1, axis="equal", title=f"{method}; k={k}"
    )
plt.show()
```

(fig:using-clustbench-example3)=
```{figure} clustbench-usage-figures/using-clustbench-example3-5.*
Results generated by k-means.
```

For more functions, please refer to the package's documentation (in the next section).
Moreover, {ref}`sec:colouriser` describes a standalone application
that can be used to prepare our own two-dimensional datasets.

Note that you do not have to use the *clustering-benchmarks* package
to access the benchmark datasets from our repository.
The {ref}`sec:how-to-access` section mentions that most tasks
involve simple operations on files and directories, which you can
implement manually. The package was developed merely for the users'
convenience.
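
For illustration, here is a minimal sketch of such a manual routine,
assuming the suite's plain-text layout: a whitespace-separated
`<battery>/<dataset>.data.gz` matrix accompanied by
`<battery>/<dataset>.labels0.gz`, `<battery>/<dataset>.labels1.gz`, etc.:

```python
# a minimal sketch: read wut/x2 by hand (no clustbench), assuming
# the clustering-data-v1 file layout described above
import os.path
import numpy as np

base = os.path.expanduser(os.path.join(data_path, "wut"))
X = np.loadtxt(os.path.join(base, "x2.data.gz"), ndmin=2)   # data matrix
y0 = np.loadtxt(os.path.join(base, "x2.labels0.gz"))        # reference labels
print(X.shape, y0.shape)
```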