add some guides (#200)

CarloLucibello · web-flow · commit 342631a10c29 · 2025-02-08T15:46:13.000+01:00
diff --git a/docs/Project.toml b/docs/Project.toml
@@ -1,3 +1,4 @@
 [deps]
+DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
 Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
 MLUtils = "f1d291b0-491e-4a28-83b9-f70985020b54"
diff --git a/docs/make.jl b/docs/make.jl
@@ -16,6 +16,7 @@ makedocs(;
     modules=[MLUtils],
     sitename = "MLUtils.jl",
     pages = ["Home" => "index.md",
+             "Guides" => "guides.md",
              "API" => "api.md"],
 )
 
diff --git a/docs/src/guides.md b/docs/src/guides.md
@@ -0,0 +1,118 @@
+```@meta
+DocTestSetup = quote
+    using MLUtils
+end
+```
+
+# Guides
+
+## Datasets 
+
+### Basic datasets
+
+A dataset in MLUtils.jl is any object `data` that satisfies the following requirements:
+1. Contains a certain number of observations, given by `numobs(data)`.
+2. Observations can be accessed by index, using `getobs(data, i)`, with `i` in `1:numobs(data)`.
+
+Since [`numobs`](@ref) and [`getobs`](@ref) are natively implemented for basic types like arrays, tuples, named tuples, dictionaries and Tables.jl's tables, you can use them as datasets without any further ado.
+
+For arrays, the convention is that the last dimension is the observation dimension (sometimes also called the batch dimension).
+
+Let's see some examples. We begin with a simple array:
+
+```jldoctest
+julia> data = [1 2; 3 4; 5 6]
+3×2 Matrix{Int64}:
+ 1  2
+ 3  4
+ 5  6
+
+julia> numobs(data)
+2
+
+julia> getobs(data, 1)
+3-element Vector{Int64}:
+ 1
+ 3
+ 5
+```
+
+Now let's see an example with named tuples. Notice that the number of observations
+has to be the same for all fields:
+
+```jldoctest
+julia> data = (x = ones(2, 3), y = [1, 2, 3]);
+
+julia> numobs(data)
+3
+
+julia> getobs(data, 2)
+(x = [1.0, 1.0], y = 2)
+```
+Finally, let's consider a table:
+
+```jldoctest
+julia> using DataFrames
+
+julia> data = DataFrame(x = 1:4, y = ["a", "b", "c", "d"])
+4×2 DataFrame
+ Row │ x      y      
+     │ Int64  String 
+─────┼───────────────
+   1 │     1  a
+   2 │     2  b
+   3 │     3  c
+   4 │     4  d
+
+julia> numobs(data)
+4
+
+julia> getobs(data, 3)
+(x = 3, y = "c")
+```
+
+### Custom datasets
+
+If you have a custom dataset type, you can support the MLUtils.jl interface by implementing the `Base.length` and `Base.getindex` functions, since `numobs` and `getobs` fall back to these functions when they are not specifically implemented.
+
+Here is a barebones example of a custom dataset type:
+```jldoctest
+julia> struct DummyDataset
+           length::Int
+       end
+
+julia> Base.length(d::DummyDataset) = d.length
+
+julia> function Base.getindex(d::DummyDataset, i::Int)
+         1 <= i <= d.length || throw(ArgumentError("Index out of bounds"))
+         return 10*i
+       end
+
+julia> data = DummyDataset(10)
+DummyDataset(10)
+
+julia> numobs(data)
+10
+
+julia> getobs(data, 2)
+20
+```
+
+This is all it takes to make your custom type compatible with functionalities such as the [`DataLoader`](@ref) type and the [`splitobs`](@ref) function.
+
+## Observation Views
+
+It is common in machine learning pipelines to transform or split the observations contained in a dataset. 
+In order to avoid unnecessary memory allocations, MLUtils.jl provides the [`obsview`](@ref) function, which creates a view of the observations at the specified indices, without copying the data.
+
+`obsview(data, indices)` can be used with any dataset `data` and a collection of indices `indices`. By default, 
+it returns a wrapper type [`ObsView`](@ref), which behaves like a dataset and can be used with any function that accepts datasets. Users can also specify the behavior of `obsview` on their custom types by implementing the `obsview` method for their type. As an example, for array data, `obsview(data, indices)` will return a subarray:
+
+```jldoctest
+julia> obsview([1 2 3; 4 5 6], 1:2)
+2×2 view(::Matrix{Int64}, :, 1:2) with eltype Int64:
+ 1  2
+ 4  5
+```
+
+
diff --git a/src/Datasets/load_datasets.jl b/src/Datasets/load_datasets.jl
@@ -18,5 +18,5 @@ function load_iris()
     X = convert(Matrix{Float64}, raw_csv[:, 1:4]')
     y = convert(Vector{String}, raw_csv[:, 5])
     vars = ["Sepal length", "Sepal width", "Petal length", "Petal width"]
-    X, y, vars
+    return X, y, vars
 end
diff --git a/src/obsview.jl b/src/obsview.jl
@@ -197,25 +197,33 @@ Base.parent(x::ObsView) = x.data
 """
     obsview(data, [indices])
 
-Returns a lazy view of the observations in `data` that
-correspond to the given `indices`. No data will be copied except
-of the indices. It is similar to constructing an [`ObsView`](@ref), 
-but returns a `SubArray` if the type of
-`data` is `Array` or `SubArray`. Furthermore, this function may
-be extended for custom types of `data` that also want to provide
-their own subset-type.
-
-In case `data` is a tuple, the constructor will be mapped
-over its elements. That means that the constructor returns a
-tuple of `ObsView` instead of a `ObsView` of tuples.
-
-If instead you want to get the subset of observations
-corresponding to the given `indices` in their native type, use
-`getobs`.
-
-See [`ObsView`](@ref) for more information.
+Return a lazy view of the observations in `data` that
+correspond to the given `indices`. No data will be copied. 
+
+By default the return value is an [`ObsView`](@ref), although `obsview` can be
+overloaded for custom types of `data` that want to provide
+their own lazy view.
+
+In case `data` is a tuple or named tuple, `obsview` will be mapped
+over its elements. For array types, a `SubArray` is returned.
+
+The observations in the returned view `ov` can be materialized by calling
+`getobs(ov, i)` on the view, where `i` is an index in `1:length(ov)`.
+
+If `indices` is not provided, it will be assumed to be `1:numobs(data)`.
+
+# Examples
+
+```jldoctest
+julia> obsview([1 2 3; 4 5 6], 1:2)
+2×2 view(::Matrix{Int64}, :, 1:2) with eltype Int64:
+ 1  2
+ 4  5
+```
 """
-obsview(data, indices=1:numobs(data)) = ObsView(data, indices)
+obsview(data, indices) = ObsView(data, indices)
+obsview(data) = obsview(data, 1:numobs(data))
 
 ##### Arrays / SubArrays
 
@@ -230,5 +238,5 @@ getobs(a::SubArray) = getobs(a.parent, last(a.indices))
 
 ##### Tuples / NamedTuples
 function obsview(tup::Union{Tuple, NamedTuple}, indices)
-    map(data -> obsview(data, indices), tup)
+    return map(data -> obsview(data, indices), tup)
 end
diff --git a/src/slidingwindow.jl b/src/slidingwindow.jl
@@ -38,7 +38,7 @@ To actually get a copy of the data at some window use indexing or [`getobs`](@re
 When indexing the data is accessed as `getobs(data, idxs)`, with `idxs` an appropriate range of indexes.
 ```jldoctest
 julia> s = slidingwindow(11:30, size=6)
-slidingwindow(10:30, size=6, stride=1)
+slidingwindow(11:30, size=6, stride=1)
 
 julia> s[1]  # == getobs(data, 1:6)
 11:16
@@ -53,7 +53,7 @@ By default the stride is equal to 1.
 
 ```jldoctest
 julia> s = slidingwindow(11:30, size=6, stride=3)
-slidingwindow(1:20, size=6, stride=3)
+slidingwindow(11:30, size=6, stride=3)
 
 julia> for w in s; println(w); end
 11:16
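
The `obsview` docstring in the diff above states that the function is mapped over the elements of a tuple or named tuple and returns a subarray for arrays. A minimal sketch of that behavior follows; the data and variable names are illustrative and not part of the commit.

```julia
using MLUtils

# Illustrative data: 3 observations, stored along the last dimension.
data = (x = [1 2 3; 4 5 6], y = [10, 20, 30])

# obsview is mapped over the named tuple, so each field becomes a view.
v = obsview(data, 2:3)

v.x isa SubArray      # true: array fields come back as subarrays
v.x == [2 3; 5 6]     # true
v.y == [20, 30]       # true

# The view shares memory with the parent array, so no data is copied.
data.x[1, 2] = 100
v.x[1, 1] == 100      # true

# getobs materializes a single observation from the view.
getobs(v, 1) == (x = [100, 5], y = 20)   # true
```

Because the fields are views, writes to the parent array show through, which is why `getobs` is the call to reach for when an independent copy is needed.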

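
The guide added in this commit says that implementing `Base.length` and `Base.getindex` is enough for a custom type to work with functions such as `obsview` and `splitobs`. The sketch below illustrates this with a hypothetical `SquaresDataset` type (not part of the commit). The extra `getindex` method for index collections is an assumption made here so that several observations can be fetched in a single call.

```julia
using MLUtils

# Hypothetical dataset type in the spirit of `DummyDataset` from the guide.
struct SquaresDataset
    n::Int
end

# numobs and getobs fall back to these two Base functions.
Base.length(d::SquaresDataset) = d.n
Base.getindex(d::SquaresDataset, i::Integer) = i^2
# Assumption: accepting a collection of indices lets several
# observations be fetched at once, e.g. getobs(data, 3:5).
Base.getindex(d::SquaresDataset, is::AbstractVector{<:Integer}) = [d[i] for i in is]

data = SquaresDataset(10)

ov = obsview(data, 3:5)      # generic fallback: an ObsView wrapper
ov isa ObsView               # true
numobs(ov)                   # 3
getobs(ov, 1)                # 9, i.e. data[3]
getobs(ov)                   # [9, 16, 25]

# splitobs builds its parts with obsview, so it works on the custom type too.
train, test = splitobs(data, at = 0.7)
numobs(train), numobs(test)  # (7, 3)
```

Without the vector method, single-index access through `getobs(ov, i)` would still work, but materializing the whole view with `getobs(ov)` would fail with a `MethodError`.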