Commit 342631a

add some guides (#200)
1 parent 4268f30 commit 342631a

6 files changed: +150 -22 lines changed

docs/Project.toml

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
 [deps]
+DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
 Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
 MLUtils = "f1d291b0-491e-4a28-83b9-f70985020b54"

docs/make.jl

Lines changed: 1 addition & 0 deletions
@@ -16,6 +16,7 @@ makedocs(;
     modules=[MLUtils],
     sitename = "MLUtils.jl",
     pages = ["Home" => "index.md",
+             "Guides" => "guides.md",
              "API" => "api.md"],
 )

docs/src/guides.md

Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
+```@meta
+DocTestSetup = quote
+    using MLUtils
+end
+```
+
+# Guides
+
+## Datasets
+
+### Basic datasets
+
+A dataset in MLUtils.jl is any object `data` that satisfies the following requirements:
+1. Contains a certain number of observations, given by `numobs(data)`.
+2. Observations can be accessed by index, using `getobs(data, i)`, with `i` in `1:numobs(data)`.
+
+Since [`numobs`](@ref) and [`getobs`](@ref) are natively implemented for basic types like arrays, tuples, named tuples, dictionaries and Tables.jl's tables, you can use them as datasets without any further ado.
+
+For arrays, the convention is that the last dimension is the observation dimension (sometimes also called the batch dimension).
+
+Let's see some examples. We begin with a simple array:
+
+```jldoctest
+julia> data = [1 2; 3 4; 5 6]
+3×2 Matrix{Int64}:
+ 1  2
+ 3  4
+ 5  6
+
+julia> numobs(data)
+2
+
+julia> getobs(data, 1)
+3-element Vector{Int64}:
+ 1
+ 3
+ 5
+```
+
+Now let's see an example with named tuples. Notice that the number of observations
+has to be the same for all fields:
+
+```jldoctest
+julia> data = (x = ones(2, 3), y = [1, 2, 3]);
+
+julia> numobs(data)
+3
+
+julia> getobs(data, 2)
+(x = [1.0, 1.0], y = 2)
+```
+Finally, let's consider a table:
+
+```jldoctest
+julia> using DataFrames
+
+julia> data = DataFrame(x = 1:4, y = ["a", "b", "c", "d"])
+4×2 DataFrame
+ Row │ x      y
+     │ Int64  String
+─────┼───────────────
+   1 │     1  a
+   2 │     2  b
+   3 │     3  c
+   4 │     4  d
+
+julia> numobs(data)
+4
+
+julia> getobs(data, 3)
+(x = 3, y = "c")
+```
+
+### Custom datasets
+
+If you have a custom dataset type, you can support the MLUtils.jl interface by implementing the `Base.length` and `Base.getindex` functions, since `numobs` and `getobs` fall back to these functions when they are not specifically implemented.
+
+Here is a barebones example of a custom dataset type:
+```jldoctest
+julia> struct DummyDataset
+           length::Int
+       end
+
+julia> Base.length(d::DummyDataset) = d.length
+
+julia> function Base.getindex(d::DummyDataset, i::Int)
+           1 <= i <= d.length || throw(ArgumentError("Index out of bounds"))
+           return 10*i
+       end
+
+julia> data = DummyDataset(10)
+DummyDataset(10)
+
+julia> numobs(data)
+10
+
+julia> getobs(data, 2)
+20
+```
+
+This is all it takes to make your custom type compatible with functionality such as the [`DataLoader`](@ref) type and the [`splitobs`](@ref) function.
+
+## Observation Views
+
+It is common in machine learning pipelines to transform or split the observations contained in a dataset.
+In order to avoid unnecessary memory allocations, MLUtils.jl provides the [`obsview`](@ref) function, which creates a view of the observations at the specified indices, without copying the data.
+
+`obsview(data, indices)` can be used with any dataset `data` and a collection of indices `indices`. By default,
+it returns a wrapper type [`ObsView`](@ref), which behaves like a dataset and can be used with any function that accepts datasets. Users can also specify the behavior of `obsview` on their custom types by implementing the `obsview` method for their type. As an example, for array data, `obsview(data, indices)` will return a subarray:
+
+```jldoctest
+julia> obsview([1 2 3; 4 5 6], 1:2)
+2×2 view(::Matrix{Int64}, :, 1:2) with eltype Int64:
+ 1  2
+ 4  5
+```
+
+

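Note on the array convention in `docs/src/guides.md`: the sketch below uses only Base Julia to mirror what the guide reports for `numobs` and `getobs` on a matrix, and to show why a `view` along the last dimension avoids a copy. The helper names `lastdim_numobs` and `lastdim_getobs` are hypothetical and are not part of MLUtils.jl; this is an illustration of the convention, not the library's implementation.

```julia
# Hypothetical helpers (NOT MLUtils.jl API) illustrating the convention that
# the last array dimension indexes observations.
lastdim_numobs(A::AbstractArray) = size(A, ndims(A))

# Select observation(s) `i` along the last dimension and copy them out.
lastdim_getobs(A::AbstractArray, i) = collect(selectdim(A, ndims(A), i))

data = [1 2; 3 4; 5 6]

n = lastdim_numobs(data)             # number of columns, i.e. observations
first_obs = lastdim_getobs(data, 1)  # first column, materialized as a Vector

# A view over the first two observations shares memory with `data`.
v = view(data, :, 1:2)
data[1, 1] = 100
shared = v[1, 1] == 100              # the view sees the mutation
```

Because `lastdim_getobs` copies, `first_obs` is unaffected by the later mutation, whereas the view `v` reflects it.
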
src/Datasets/load_datasets.jl

Lines changed: 1 addition & 1 deletion
@@ -18,5 +18,5 @@ function load_iris()
     X = convert(Matrix{Float64}, raw_csv[:, 1:4]')
     y = convert(Vector{String}, raw_csv[:, 5])
     vars = ["Sepal length", "Sepal width", "Petal length", "Petal width"]
-    X, y, vars
+    return X, y, vars
 end

src/obsview.jl

Lines changed: 27 additions & 19 deletions
@@ -197,25 +197,33 @@ Base.parent(x::ObsView) = x.data
 """
     obsview(data, [indices])

-Returns a lazy view of the observations in `data` that
-correspond to the given `indices`. No data will be copied except
-of the indices. It is similar to constructing an [`ObsView`](@ref),
-but returns a `SubArray` if the type of
-`data` is `Array` or `SubArray`. Furthermore, this function may
-be extended for custom types of `data` that also want to provide
-their own subset-type.
-
-In case `data` is a tuple, the constructor will be mapped
-over its elements. That means that the constructor returns a
-tuple of `ObsView` instead of a `ObsView` of tuples.
-
-If instead you want to get the subset of observations
-corresponding to the given `indices` in their native type, use
-`getobs`.
-
-See [`ObsView`](@ref) for more information.
+Return a lazy view of the observations in `data` that
+correspond to the given `indices`. No data will be copied.
+
+By default the return is an [`ObsView`](@ref), although this can be
+overloaded for custom types of `data` that want to provide
+their own lazy view.
+
+In case `data` is a tuple or named tuple, the constructor will be mapped
+over its elements. For array types, return a subarray.
+
+The observations in the returned view `ov` can be materialized by calling
+`getobs(ov, i)` on the view, where `i` is an index in `1:length(ov)`.
+
+If `indices` is not provided, it will be assumed to be `1:numobs(data)`.
+```
+
+# Examples
+
+```jldoctest
+julia> obsview([1 2 3; 4 5 6], 1:2)
+2×2 view(::Matrix{Int64}, :, 1:2) with eltype Int64:
+ 1  2
+ 4  5
+```
 """
-obsview(data, indices=1:numobs(data)) = ObsView(data, indices)
+obsview(data, indices) = ObsView(data, indices)
+obsview(data) = obsview(data, 1:numobs(data))

 ##### Arrays / SubArrays

@@ -230,5 +238,5 @@ getobs(a::SubArray) = getobs(a.parent, last(a.indices))

 ##### Tuples / NamedTuples
 function obsview(tup::Union{Tuple, NamedTuple}, indices)
-    map(data -> obsview(data, indices), tup)
+    return map(data -> obsview(data, indices), tup)
 end

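Note on the `obsview` signature change in `src/obsview.jl`: in Julia, a definition with a default positional argument expands into one method per arity, so the old and new spellings should dispatch the same way. The sketch below checks this with stand-in names; `Wrapped`, `wrap_default` and `wrap_split` are hypothetical, and `length` is used in place of `numobs` so that the example needs only Base.

```julia
# Hypothetical stand-in for `ObsView`; not MLUtils.jl code.
struct Wrapped{D,I}
    data::D
    indices::I
end

# Old spelling: a single definition with a default argument.
wrap_default(data, indices=1:length(data)) = Wrapped(data, indices)

# New spelling: two explicit methods.
wrap_split(data, indices) = Wrapped(data, indices)
wrap_split(data) = wrap_split(data, 1:length(data))

# Each spelling yields a one-argument and a two-argument method.
n_default = length(methods(wrap_default))
n_split = length(methods(wrap_split))

# Calling either one without indices selects every element.
x = [10, 20, 30]
same_indices = wrap_default(x).indices == wrap_split(x).indices
```

Under these assumptions the change reads as a readability refactor rather than a behavioral one.
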
src/slidingwindow.jl

Lines changed: 2 additions & 2 deletions
@@ -38,7 +38,7 @@ To actually get a copy of the data at some window use indexing or [`getobs`](@re
 When indexing the data is accessed as `getobs(data, idxs)`, with `idxs` an appropriate range of indexes.
 ```jldoctest
 julia> s = slidingwindow(11:30, size=6)
-slidingwindow(10:30, size=6, stride=1)
+slidingwindow(11:30, size=6, stride=1)

 julia> s[1] # == getobs(data, 1:6)
 11:16
@@ -53,7 +53,7 @@ By default the stride is equal to 1.

 ```jldoctest
 julia> s = slidingwindow(11:30, size=6, stride=3)
-slidingwindow(1:20, size=6, stride=3)
+slidingwindow(11:30, size=6, stride=3)

 julia> for w in s; println(w); end
 11:16
