Description
Here is a minimal training script in pure PyTorch, along with its corresponding datasplit.csv:
minimal_script_pure_pytorch.py
datasplit.csv
It trains on the ecs class only, with an array size of 128x128x128, a batch size of 8, and 8 workers. During training, CPU memory use grows rapidly with each training step and easily exceeds 50GB, which seems far too much considering that the whole dataset is only 37GB. As far as I recall, previous versions did not use this much memory.
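For a rough sense of scale, here is a back-of-the-envelope estimate of how much batch data I would expect to be held in memory at once. This is only a sketch under assumptions that are not taken from the script itself: single-channel float32 arrays (4 bytes per voxel), one input and one target array per sample, and PyTorch's default `prefetch_factor` of 2 batches per worker. The helper names are hypothetical.

```python
def batch_bytes(shape, batch_size, bytes_per_voxel=4, arrays_per_sample=2):
    """Bytes in one batch (assumed: float32, one input + one target array)."""
    voxels = 1
    for dim in shape:
        voxels *= dim
    return voxels * bytes_per_voxel * batch_size * arrays_per_sample


def in_flight_bytes(shape, batch_size, num_workers, prefetch_factor=2):
    """Bytes held by prefetched batches across all DataLoader workers."""
    return batch_bytes(shape, batch_size) * num_workers * prefetch_factor


MIB = 1024 ** 2
GIB = 1024 ** 3

per_batch = batch_bytes((128, 128, 128), batch_size=8)
in_flight = in_flight_bytes((128, 128, 128), batch_size=8, num_workers=8)

print(f"one batch:         {per_batch / MIB:.0f} MiB")   # 128 MiB
print(f"prefetched, total: {in_flight / GIB:.0f} GiB")   # 2 GiB
```

Under these assumptions the prefetched batches account for about 2 GiB, so the observed 50GB+ appears to come from something other than the batches themselves.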
Similarly, when I try to run the included train_3D.py example script, I get this:
...
Processing D:\DanielHuang\cellmap\cellmap-segmentation-challenge\data\jrc_mus-kidney\jrc_mus-kidney.zarr\recon-1\labels\groundtruth\crop221
Number of datasets: 245
Number of training datasets: 209 (85.31%)
Number of validation datasets: 36 (14.69%)
CSV written to datasplit.csv
100%|██████████| 289/289 [00:00<00:00, 1545.89it/s, No classes found]
Training datasets: 100%|██████████| 209/209 [00:04<00:00, 46.70it/s]
Validation datasets: 100%|██████████| 36/36 [00:00<00:00, 43.89it/s]
Training datasets: 100%|██████████| 209/209 [00:46<00:00, 4.46it/s]
Validation datasets: 100%|██████████| 36/36 [00:03<00:00, 11.64it/s]
100%|██████████| 208/208 [01:39<00:00, 2.09it/s]
Training 3d_vnet for 1000 epochs, starting at epoch 1, iteration 0...
Training: 0%| | 0/1000 [00:00<?, ?it/s]
Process finished with exit code -1073740791 (0xC0000409)
From a quick search, 0xC0000409 seems to mean STATUS_STACK_BUFFER_OVERRUN, which I suspect is also related to memory use.
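For reference, the negative exit code and the hexadecimal status are the same 32-bit value, once read as signed and once as unsigned. A minimal sketch of the conversion (the helper name `to_ntstatus` is hypothetical):

```python
def to_ntstatus(exit_code: int) -> str:
    """Reinterpret a signed 32-bit process exit code as unsigned hex."""
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


print(to_ntstatus(-1073740791))  # 0xC0000409
```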
If I reduce the shape of input_array_info and target_array_info from 128x128x128 to 64x64x64 and the batch size from 8 to 2, the run no longer exits immediately with 0xC0000409, but I still observe an extreme memory spike.

I am using the latest commit of the repository (b618dfd).
My environment:
pip list
Package Version Editable project location
------------------------------ -------------- -----------------------------------------------------
absl-py 2.2.2
aiobotocore 2.17.0
aiofiles 24.1.0
aiohappyeyeballs 2.6.1
aiohttp 3.11.18
aioitertools 0.12.0
aiosignal 1.3.2
annotated-types 0.7.0
anyio 4.9.0
asciitree 0.3.3
asttokens 3.0.0
atomicwrites 1.4.1
attrs 25.3.0
blinker 1.9.0
boto3 1.35.81
botocore 1.35.93
cachetools 5.5.2
cellmap-data 2025.7.24.1615
cellmap-flow 0.1.3
cellmap-segmentation-challenge 0.0.1 D:\DanielHuang\cellmap\cellmap-segmentation-challenge
cellpose 4.0.2
certifi 2025.4.26
charset-normalizer 3.4.2
click 8.2.0
cloudpickle 3.1.1
cmake 3.31.6
colorama 0.4.6
contourpy 1.3.2
cycler 0.12.1
daisy 1.2.2
dask 2025.4.1
decorator 5.2.1
Deprecated 1.2.18
dill 0.4.0
eval-type-backport 0.1.3
executing 2.2.0
fastapi 0.115.13
fasteners 0.19
fastremap 1.16.1
ffmpy 0.6.0
filelock 3.13.1
fill_voids 2.0.8
flasgger 0.9.7.1
Flask 3.1.0
flask-cors 5.0.1
flexcache 0.3
flexparser 0.4
fonttools 4.58.0
frozenlist 1.6.0
fsspec 2024.6.1
funlib.geometry 0.3.0
funlib.math 0.1
google-apitools 0.5.32
google-auth 2.40.1
gradio 5.34.2
gradio_client 1.10.3
groovy 0.1.2
grpcio 1.71.0
gunicorn 23.0.0
h11 0.16.0
h5py 3.13.0
httpcore 1.0.9
httplib2 0.22.0
httpx 0.28.1
huggingface-hub 0.33.1
idna 3.10
imagecodecs 2025.3.30
imageio 2.37.0
importlib_metadata 8.7.0
ipython 9.3.0
ipython_pygments_lexers 1.1.1
itsdangerous 2.2.0
jedi 0.19.2
Jinja2 3.1.4
jmespath 1.0.1
joblib 1.5.0
jsonschema 4.23.0
jsonschema-specifications 2025.4.1
kiwisolver 1.4.8
lazy_loader 0.4
lightning 2.5.1.post0
lightning-utilities 0.14.3
lit 18.1.8
llvmlite 0.44.0
locket 1.0.0
Markdown 3.8
markdown-it-py 3.0.0
MarkupSafe 2.1.5
marshmallow 4.0.0
matplotlib 3.10.3
matplotlib-inline 0.1.7
mdurl 0.1.2
mistune 3.1.3
ml_collections 1.1.0
ml_dtypes 0.5.1
mpmath 1.3.0
multidict 6.4.3
natsort 8.4.0
networkx 3.3
neuroglancer 2.40.1
ninja 1.11.1.4
numba 0.61.2
numcodecs 0.15.1
numpy 2.1.2
oauth2client 4.1.3
opencv-python-headless 4.11.0.86
orjson 3.10.18
packaging 24.2
pandas 2.2.3
parso 0.8.4
partd 1.4.2
pillow 11.0.0
Pint 0.24.4
pip 25.1.1
platformdirs 4.3.8
prompt_toolkit 3.0.51
propcache 0.3.1
protobuf 6.30.2
pure_eval 0.2.3
pyasn1 0.6.1
pyasn1_modules 0.4.2
pybind11 2.13.6
pydantic 2.11.4
pydantic_core 2.33.2
pydantic-ome-ngff 0.6.0
pydantic-zarr 0.7.0
pydub 0.25.1
Pygments 2.19.2
pykdtree 1.4.1
pyparsing 3.2.3
pyreadline3 3.5.4
python-dateutil 2.9.0.post0
python-dotenv 1.1.0
python-multipart 0.0.20
pytorch-lightning 2.5.1.post0
pytz 2025.2
PyYAML 6.0.2
referencing 0.36.2
requests 2.32.3
rich 14.0.0
roifile 2025.5.10
rpds-py 0.24.0
rsa 4.9.1
ruff 0.12.0
s3fs 2024.6.1
s3transfer 0.10.4
safehttpx 0.1.6
safetensors 0.5.3
scikit-dimension 0.3.4
scikit-image 0.25.2
scikit-learn 1.6.1
scipy 1.15.3
segment-anything 1.0
semantic-version 2.10.0
setuptools 65.5.0
shellingham 1.5.4
six 1.17.0
sniffio 1.3.1
stack-data 0.6.3
starlette 0.46.2
structlog 25.3.0
sympy 1.13.3
tabulate 0.9.0
tensorboard 2.19.0
tensorboard-data-server 0.7.2
tensorboardX 2.6.2.2
tensorstore 0.1.74
threadpoolctl 3.6.0
tifffile 2025.5.10
tomlkit 0.13.3
toolz 1.0.0
torch 2.7.1+cu128
torchaudio 2.7.1+cu128
torchmetrics 1.7.1
torchvision 0.22.1+cu128
tornado 6.4.2
tqdm 4.67.1
traitlets 5.14.3
triton-windows 3.3.1.post19
typer 0.16.0
typing_extensions 4.12.2
typing-inspection 0.4.0
tzdata 2025.2
universal_pathlib 0.2.6
urllib3 2.4.0
uvicorn 0.34.3
wcwidth 0.2.13
websockets 15.0.1
Werkzeug 3.1.3
wheel 0.45.1
wrapt 1.17.2
xarray 2025.4.0
xarray-ome-ngff 3.1.1
xarray-tensorstore 0.1.5
yarl 1.20.0
zarr 2.18.4
zipp 3.21.0