Description
In the following MWE I successfully create an out-of-memory data source of 20 MNIST images using FileDataset. I can then wrap the source as an MLUtils.DataLoader with the default parallel=false option and collect the result. However, if I specify parallel=true, the collect call hangs.
using Pkg
Pkg.activate("data", shared=true)
import MLDatasets: MNIST
using MLDatasets
using ScientificTypes
using MLUtils
using FileIO
ENV["DATADEPS_ALWAYS_ACCEPT"] = true
images, labels = MNIST(split=:train)[:];
N = 20
images = coerce(images, GrayImage)[1:N];
# save some MNIST images as tiff files:
const dir = tempname()
mkpath(dir)  # tempname() only returns a path; make sure the directory exists
for i in eachindex(images)
    filename = joinpath(dir, "$i.tiff")
    FileIO.save(filename, images[i])
end
# create out-of-memory image source:
X = MLDatasets.FileDataset(dir)
sequential = DataLoader(X, batchsize=2, collate=true)
collect(sequential) # executes as expected
parallel = DataLoader(X, batchsize=2, collate=true, parallel=true);
collect(parallel); # hangs
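One way to confirm the hang without blocking the REPL is to run the call on a separate task and poll it with a timeout. The following is only a minimal sketch: `finishes_within` is a hypothetical helper, not part of MLUtils or MLDatasets, and it relies only on `Threads.@spawn`, `istaskdone` and `timedwait` from Base. It assumes the hung task still lets the polling task run, which may not hold for every kind of deadlock.

```julia
# Hypothetical helper (not part of MLUtils): run `f` on a separate task and
# report whether it finished within `timeout` seconds.
function finishes_within(f, timeout::Real)
    task = Threads.@spawn f()
    # `timedwait` polls the callback and returns :ok or :timed_out.
    return timedwait(() -> istaskdone(task), float(timeout)) === :ok
end

# Quick sanity check with work that finishes immediately.
finishes_within(() -> sum(1:10), 5)

# With the loaders defined in the MWE above, one could compare:
# finishes_within(() -> collect(sequential), 30)
# finishes_within(() -> collect(parallel), 30)
```

If the parallel loader really hangs, the second call would return false while the first returns true. Note that the spawned task is not cancelled on timeout, so it keeps running (or hanging) in the background.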
Here's my setup:
(@data) pkg> status
Status `~/.julia/environments/data/Project.toml`
[5789e2e9] FileIO v1.16.0
[82e4d734] ImageIO v0.6.6
[eb30cadb] MLDatasets v0.7.6
[f1d291b0] MLUtils v0.3.1
[321657f4] ScientificTypes v3.0.2
julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin21.4.0)
  CPU: 12 × Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 5 on 12 virtual cores
Environment:
  JULIA_LTS_PATH = /Applications/Julia-1.6.app/Contents/Resources/julia/bin/julia
  JULIA_PATH = /Applications/Julia-1.8.app/Contents/Resources/julia/bin/julia
  JULIA_EGLOT_PATH = /Applications/Julia-1.6.app/Contents/Resources/julia/bin/julia
  JULIA_NUM_THREADS = 5
  JULIA_NIGHTLY_PATH = /Applications/Julia-1.8.app/Contents/Resources/julia/bin/julia