Why is no trained model saved to best_model after running run-vnet.sh? #71

Lipanw · 2022-04-25T06:04:07Z

Why is no trained model saved to best_model after running run-vnet.sh? train.log is also empty.

Lipanw · 2022-04-25T06:04:46Z

I am running this on Windows 10.

shiyutang · 2022-04-26T02:56:44Z

How long did you run it for? Is there any other information?

Lipanw · 2022-04-27T10:36:00Z

Traceback (most recent call last):
  File "train.py", line 204, in <module>
    main(args)
  File "train.py", line 198, in main
    to_static_training=cfg.to_static_training)
  File "F:\fuxianCode\MedicalSeg-develop\medicalseg\core\train.py", line 233, in train
    save_dir=save_dir)
  File "F:\fuxianCode\MedicalSeg-develop\medicalseg\core\val.py", line 151, in evaluate
    'format': "xyz"
  File "F:\fuxianCode\MedicalSeg-develop\medicalseg\utils\utils.py", line 244, in save_array
    img_itk_new = sitk.GetImageFromArray(val)
  File "D:\mysoftware\Anaconda\lib\site-packages\SimpleITK\extra.py", line 292, in GetImageFromArray
    id = _get_sitk_pixelid(z)
  File "D:\mysoftware\Anaconda\lib\site-packages\SimpleITK\extra.py", line 189, in _get_sitk_pixelid
    raise TypeError('dtype: {0} is not supported.'.format(numpy_array_type.dtype))
TypeError: dtype: int32 is not supported.

Lipanw · 2022-04-27T10:45:46Z

2022-04-27 17:47:02 [INFO] [TRAIN] epoch: 0, iter: 100/15000, loss: 2.4868, DSC: 4.1360, lr: 0.009941, batch_cost: 0.7021, reader_cost: 0.00082, ips: 1.4244 samples/sec | ETA 02:54:20
2022-04-27 17:48:13 [INFO] [TRAIN] epoch: 1, iter: 200/15000, loss: 1.1843, DSC: 4.3465, lr: 0.009881, batch_cost: 0.7081, reader_cost: 0.00062, ips: 1.4123 samples/sec | ETA 02:54:39
2022-04-27 17:49:24 [INFO] [TRAIN] epoch: 2, iter: 300/15000, loss: 1.1282, DSC: 4.3768, lr: 0.009820, batch_cost: 0.7096, reader_cost: 0.00016, ips: 1.4092 samples/sec | ETA 02:53:51
2022-04-27 17:50:35 [INFO] [TRAIN] epoch: 2, iter: 400/15000, loss: 1.1043, DSC: 4.3364, lr: 0.009760, batch_cost: 0.7107, reader_cost: 0.00047, ips: 1.4071 samples/sec | ETA 02:52:56
2022-04-27 17:51:46 [INFO] [TRAIN] epoch: 3, iter: 500/15000, loss: 1.0901, DSC: 4.3506, lr: 0.009700, batch_cost: 0.7109, reader_cost: 0.00047, ips: 1.4066 samples/sec | ETA 02:51:48
2022-04-27 17:51:46 [INFO] Start evaluating (total_samples: 5, total_iters: 5)...

Lipanw · 2022-04-27T10:46:32Z

Every time, it runs to iteration 500 and then stops when model evaluation is about to start.

linhandev · 2022-05-05T00:08:48Z

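It sounds like something is going wrong during validation. We have updated the code since this issue was opened, so you could pull the latest version and try a smaller save_interval.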
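
To illustrate why a smaller save_interval helps reproduce the failure sooner, here is a minimal sketch. It assumes the trainer evaluates whenever the iteration count is a multiple of save_interval, and once more at the final iteration; that is inferred from the log above, where evaluation starts at iter 500. The function name `evaluation_iters` is hypothetical and not part of MedicalSeg.

```python
def evaluation_iters(iters, save_interval):
    """Hypothetical helper: list the iterations at which evaluation would run.

    Assumption: evaluation is triggered when `it % save_interval == 0`
    or when `it == iters` (the last iteration).
    """
    if iters <= 0 or save_interval <= 0:
        raise ValueError("iters and save_interval must be positive")
    # Every multiple of save_interval up to and including iters.
    points = [it for it in range(1, iters + 1) if it % save_interval == 0]
    # The final iteration is always evaluated, even if it is not a multiple.
    if not points or points[-1] != iters:
        points.append(iters)
    return points


if __name__ == "__main__":
    # With the interval seen in the log, the first evaluation is at iter 500.
    print(evaluation_iters(15000, 500)[:3])
    # With a smaller interval, the evaluation code path is reached much earlier.
    print(evaluation_iters(15000, 50)[:3])
```

Under that assumption, lowering save_interval only changes how quickly the evaluation step (and therefore the error) is reached; it does not by itself fix the error.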

shiyutang · 2022-05-05T11:52:42Z

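The problem here is in saving during evaluation. You can comment out the save_array part for now and start training, then post here a link to the complete reproducible code or a description of what you modified.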
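
As an alternative to commenting out save_array, one possible workaround is to cast the array to a dtype that SimpleITK accepts before calling sitk.GetImageFromArray. The sketch below is an assumption, not the fix the maintainers applied: the helper name `to_sitk_compatible` is hypothetical, and the choice of int16 assumes the array holds small integer values such as segmentation labels.

```python
import numpy as np


def to_sitk_compatible(arr):
    """Hypothetical helper: cast 32-bit integer arrays to another dtype.

    Assumption: the traceback's "dtype: int32 is not supported" comes from
    the 32-bit integer dtype itself, so any other supported dtype avoids it.
    """
    arr = np.asarray(arr)
    if arr.dtype.kind in "iu" and arr.dtype.itemsize == 4:
        limits = np.iinfo(np.int16)
        # Small-valued data (e.g. class labels) fits losslessly in int16.
        if arr.size == 0 or (arr.min() >= limits.min and arr.max() <= limits.max):
            return arr.astype(np.int16)
        # float64 represents every 32-bit integer exactly.
        return arr.astype(np.float64)
    # Other dtypes are returned unchanged.
    return arr


# Intended use inside a function like save_array (requires SimpleITK):
#   img_itk_new = sitk.GetImageFromArray(to_sitk_compatible(val))
```

Whether this is appropriate depends on what the saved array is used for afterwards, since the stored pixel type changes.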

Lipanw · 2022-05-06T07:01:39Z

2022-05-06 14:56:46 [INFO] [TRAIN] epoch: 4, iter: 100/15000, loss: 4.4847, DSC: 3.7124, lr: 0.000994, batch_cost: 6.5770, reader_cost: 2.26782, ips: 0.9123 samples/sec | ETA 27:13:17
Hello, the earlier problem is solved. But compared with the lr=0.001 example on your home page, why is the DSC so low, and the loss so high?

Lipanw · 2022-05-06T07:24:28Z

Here is my configuration:
------------Environment Information-------------
platform: Linux-4.15.0-158-generic-x86_64-with-debian-stretch-sid
Python: 3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0]
Paddle compiled with cuda: True
NVCC: Build cuda_11.2.r11.2/compiler.29618528_0
cudnn: 8.2
GPUs used: 1
CUDA_VISIBLE_DEVICES: None
GPU: ['GPU 0: A100-SXM4-40GB (UUID:']
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~16.04) 7.5.0
PaddlePaddle: 2.2.2

2022-05-06 14:45:45 [INFO]
---------------Config Information---------------
batch_size: 6
data_root: tools/data
iters: 15000
loss:
  coef:
  - 1
  types:
  - coef:
    - 1
    - 1
    losses:
    - type: CrossEntropyLoss
      weight: null
    - type: DiceLoss
    type: MixedLoss
lr_scheduler:
  decay_steps: 15000
  end_lr: 0
  learning_rate: 0.001
  power: 0.9
  type: PolynomialDecay
model:
  elu: false
  in_channels: 1
  num_classes: 3
  pretrained: null
  type: VNet
optimizer:
  momentum: 0.9
  type: sgd
  weight_decay: 0.0001
train_dataset:
  dataset_root: lung_coronavirus/lung_coronavirus_phase0
  mode: train
  num_classes: 3
  result_dir: lung_coronavirus/lung_coronavirus_phase1
  transforms:
  - scale:
    - 0.8
    - 1.2
    size: 128
    type: RandomResizedCrop3D
  - degrees: 90
    type: RandomRotation3D
  - type: RandomFlip3D
  type: LungCoronavirus
val_dataset:
  dataset_json_path: lung_coronavirus/lung_coronavirus_raw/dataset.json
  dataset_root: lung_coronavirus/lung_coronavirus_phase0
  mode: val
  num_classes: 3
  result_dir: lung_coronavirus/lung_coronavirus_phase1
  transforms: []
  type: LungCoronavirus

2022-05-06 14:56:46 [INFO] [TRAIN] epoch: 4, iter: 100/15000, loss: 4.4847, DSC: 3.7124, lr: 0.000994, batch_cost: 6.5770, reader_cost: 2.26782, ips: 0.9123 samples/sec | ETA 27:13:17
2022-05-06 15:07:41 [INFO] [TRAIN] epoch: 8, iter: 200/15000, loss: 3.5398, DSC: 3.8685, lr: 0.000988, batch_cost: 6.5488, reader_cost: 2.25564, ips: 0.9162 samples/sec | ETA 26:55:22
2022-05-06 15:18:36 [INFO] [TRAIN] epoch: 12, iter: 300/15000, loss: 2.8668, DSC: 3.9746, lr: 0.000982, batch_cost: 6.5445, reader_cost: 2.25206, ips: 0.9168 samples/sec | ETA 26:43:24

linhandev · 2022-05-08T02:11:58Z

The lr could probably be a bit larger.
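
For reference, the lr values in the logs are consistent with a standard polynomial decay schedule. The sketch below uses the usual non-cycling formula, lr = (lr0 - end_lr) * (1 - step / decay_steps) ** power + end_lr; the function name `poly_lr` is hypothetical, and exactly which step index the trainer uses at a given logged iteration is an assumption.

```python
def poly_lr(base_lr, step, decay_steps, power, end_lr=0.0):
    """Hypothetical helper: polynomial learning-rate decay without cycling."""
    # Past decay_steps the rate stays at end_lr.
    step = min(step, decay_steps)
    frac = 1.0 - step / decay_steps
    return (base_lr - end_lr) * frac ** power + end_lr


if __name__ == "__main__":
    # Values from the posted config: learning_rate 0.001, decay_steps 15000, power 0.9.
    for step in (100, 200, 300):
        print(step, f"{poly_lr(0.001, step, 15000, 0.9):.6f}")
```

With the posted config this gives roughly 0.000994 at step 100, in line with the logged lr. A larger starting learning_rate scales the whole curve by the same factor, which is what trying a bigger lr would change.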

shiyutang · 2022-05-16T06:21:29Z

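You only need to open one issue per problem.
Also, this looks like a data problem. Did you modify the data-processing code? Or could you list all the changes you made?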

shiyutang self-assigned this Apr 26, 2022
