Why is no trained model saved to best_model after running run-vnet.sh? #71

Open
Lipanw opened this issue Apr 25, 2022 · 11 comments
@Lipanw

Lipanw commented Apr 25, 2022

Why is no trained model saved to best_model after I run run-vnet.sh? train.log is also empty.

@Lipanw
Author

Lipanw commented Apr 25, 2022

I am running this on Windows 10.

@shiyutang
Contributor

How long did you let it run? Is there any other information you can share?

@shiyutang shiyutang self-assigned this Apr 26, 2022
@Lipanw
Author

Lipanw commented Apr 27, 2022

Traceback (most recent call last):
  File "train.py", line 204, in <module>
    main(args)
  File "train.py", line 198, in main
    to_static_training=cfg.to_static_training)
  File "F:\fuxianCode\MedicalSeg-develop\medicalseg\core\train.py", line 233, in train
    save_dir=save_dir)
  File "F:\fuxianCode\MedicalSeg-develop\medicalseg\core\val.py", line 151, in evaluate
    'format': "xyz"
  File "F:\fuxianCode\MedicalSeg-develop\medicalseg\utils\utils.py", line 244, in save_array
    img_itk_new = sitk.GetImageFromArray(val)
  File "D:\mysoftware\Anaconda\lib\site-packages\SimpleITK\extra.py", line 292, in GetImageFromArray
    id = _get_sitk_pixelid(z)
  File "D:\mysoftware\Anaconda\lib\site-packages\SimpleITK\extra.py", line 189, in _get_sitk_pixelid
    raise TypeError('dtype: {0} is not supported.'.format(numpy_array_type.dtype))
TypeError: dtype: int32 is not supported.
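
For context, a commonly reported cause of this TypeError on Windows is that NumPy has two distinct 32-bit integer types there (C `int` and C `long`), and some SimpleITK versions recognise only one of them. The sketch below is one possible workaround, not necessarily the fix the maintainers shipped: it casts the prediction to `uint8`, which has no such ambiguity, before the array would be passed to `sitk.GetImageFromArray`. The helper name `to_label_array` and the `num_classes=3` default (taken from the config posted later in this thread) are assumptions for illustration.

```python
import numpy as np


def to_label_array(pred, num_classes=3):
    """Cast an integer label volume to uint8 (hypothetical helper).

    Assumes `pred` holds class indices in [0, num_classes).
    """
    pred = np.asarray(pred)
    if num_classes > 256:
        raise ValueError("uint8 can hold at most 256 distinct labels")
    if pred.size and (pred.min() < 0 or pred.max() >= num_classes):
        raise ValueError("label values fall outside [0, num_classes)")
    # uint8 maps to exactly one C type on every platform, so the dtype
    # lookup no longer depends on how the int32 array was produced.
    return np.ascontiguousarray(pred, dtype=np.uint8)
```

With this in place, the call in the traceback would become something like `sitk.GetImageFromArray(to_label_array(val))`, assuming `val` really is a label map rather than an intensity image.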

@Lipanw
Author

Lipanw commented Apr 27, 2022


2022-04-27 17:47:02 [INFO] [TRAIN] epoch: 0, iter: 100/15000, loss: 2.4868, DSC: 4.1360, lr: 0.009941, batch_cost: 0.7021, reader_cost: 0.00082, ips: 1.4244 samples/sec | ETA 02:54:20
2022-04-27 17:48:13 [INFO] [TRAIN] epoch: 1, iter: 200/15000, loss: 1.1843, DSC: 4.3465, lr: 0.009881, batch_cost: 0.7081, reader_cost: 0.00062, ips: 1.4123 samples/sec | ETA 02:54:39
2022-04-27 17:49:24 [INFO] [TRAIN] epoch: 2, iter: 300/15000, loss: 1.1282, DSC: 4.3768, lr: 0.009820, batch_cost: 0.7096, reader_cost: 0.00016, ips: 1.4092 samples/sec | ETA 02:53:51
2022-04-27 17:50:35 [INFO] [TRAIN] epoch: 2, iter: 400/15000, loss: 1.1043, DSC: 4.3364, lr: 0.009760, batch_cost: 0.7107, reader_cost: 0.00047, ips: 1.4071 samples/sec | ETA 02:52:56
2022-04-27 17:51:46 [INFO] [TRAIN] epoch: 3, iter: 500/15000, loss: 1.0901, DSC: 4.3506, lr: 0.009700, batch_cost: 0.7109, reader_cost: 0.00047, ips: 1.4066 samples/sec | ETA 02:51:48
2022-04-27 17:51:46 [INFO] Start evaluating (total_samples: 5, total_iters: 5)...

@Lipanw
Author

Lipanw commented Apr 27, 2022

Every time, training runs up to iteration 500 and then stops just as model evaluation is about to start.

@linhandev
Member

linhandev commented May 5, 2022

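It sounds like something goes wrong during validation. We have updated the code since this issue was opened, so you could pull the latest version and try again with a smaller save_interval.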
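
As a rough illustration of why a smaller save_interval helps when debugging: the log above starts evaluating at iteration 500, which at roughly 0.71 s per iteration is about six minutes into training. The sketch below assumes evaluation is triggered whenever the iteration count is a multiple of `save_interval` or training reaches its last iteration; that condition is inferred from the log, and `evaluation_iters` is a hypothetical helper, not part of MedicalSeg.

```python
def evaluation_iters(total_iters, save_interval):
    """Return the iterations at which evaluation would run.

    Assumption: evaluation fires when `it % save_interval == 0`
    or on the final iteration.
    """
    return [
        it
        for it in range(1, total_iters + 1)
        if it % save_interval == 0 or it == total_iters
    ]


# With the interval seen in the log, the first evaluation is at iteration 500.
print(evaluation_iters(15000, 500)[:3])
# With a small interval, the evaluation path is exercised almost immediately.
print(evaluation_iters(15000, 20)[:3])
```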

@shiyutang
Contributor

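The failure happens when saving results during evaluation. For now you can comment out the save_array part and start training. Then please post here a link to complete, reproducible code or a description of what you changed.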
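
An alternative to commenting the call out is to make the save step non-fatal, so a failed save is logged and evaluation continues. This is only a sketch: `try_save` and `save_fn` are hypothetical names, and the real `save_array` signature in `medicalseg/utils/utils.py` is not shown in this thread, so arguments are simply passed through.

```python
import logging

logger = logging.getLogger(__name__)


def try_save(save_fn, *args, **kwargs):
    """Call `save_fn`; on TypeError, log a warning and return False.

    Only TypeError is caught because that is the error in the traceback;
    other exceptions still propagate.
    """
    try:
        save_fn(*args, **kwargs)
    except TypeError as err:
        logger.warning("Skipping prediction save: %s", err)
        return False
    return True
```

This hides the symptom rather than fixing it, so it is best treated as a temporary measure while the dtype problem is investigated.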

@Lipanw
Author

Lipanw commented May 6, 2022

2022-05-06 14:56:46 [INFO] [TRAIN] epoch: 4, iter: 100/15000, loss: 4.4847, DSC: 3.7124, lr: 0.000994, batch_cost: 6.5770, reader_cost: 2.26782, ips: 0.9123 samples/sec | ETA 27:13:17
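Hello, the earlier problem has been solved. However, compared with the lr=0.001 example on your home page, why is the DSC so low here, and why is the loss so high?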

@Lipanw
Author

Lipanw commented May 6, 2022

Here is my configuration:
------------Environment Information-------------
platform: Linux-4.15.0-158-generic-x86_64-with-debian-stretch-sid
Python: 3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0]
Paddle compiled with cuda: True
NVCC: Build cuda_11.2.r11.2/compiler.29618528_0
cudnn: 8.2
GPUs used: 1
CUDA_VISIBLE_DEVICES: None
GPU: ['GPU 0: A100-SXM4-40GB (UUID:']
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~16.04) 7.5.0
PaddlePaddle: 2.2.2

2022-05-06 14:45:45 [INFO]
---------------Config Information---------------
batch_size: 6
data_root: tools/data
iters: 15000
loss:
  coef:
  - 1
  types:
  - coef:
    - 1
    - 1
    losses:
    - type: CrossEntropyLoss
      weight: null
    - type: DiceLoss
    type: MixedLoss
lr_scheduler:
  decay_steps: 15000
  end_lr: 0
  learning_rate: 0.001
  power: 0.9
  type: PolynomialDecay
model:
  elu: false
  in_channels: 1
  num_classes: 3
  pretrained: null
  type: VNet
optimizer:
  momentum: 0.9
  type: sgd
  weight_decay: 0.0001
train_dataset:
  dataset_root: lung_coronavirus/lung_coronavirus_phase0
  mode: train
  num_classes: 3
  result_dir: lung_coronavirus/lung_coronavirus_phase1
  transforms:
  - scale:
    - 0.8
    - 1.2
    size: 128
    type: RandomResizedCrop3D
  - degrees: 90
    type: RandomRotation3D
  - type: RandomFlip3D
  type: LungCoronavirus
val_dataset:
  dataset_json_path: lung_coronavirus/lung_coronavirus_raw/dataset.json
  dataset_root: lung_coronavirus/lung_coronavirus_phase0
  mode: val
  num_classes: 3
  result_dir: lung_coronavirus/lung_coronavirus_phase1
  transforms: []
  type: LungCoronavirus

2022-05-06 14:56:46 [INFO] [TRAIN] epoch: 4, iter: 100/15000, loss: 4.4847, DSC: 3.7124, lr: 0.000994, batch_cost: 6.5770, reader_cost: 2.26782, ips: 0.9123 samples/sec | ETA 27:13:17
2022-05-06 15:07:41 [INFO] [TRAIN] epoch: 8, iter: 200/15000, loss: 3.5398, DSC: 3.8685, lr: 0.000988, batch_cost: 6.5488, reader_cost: 2.25564, ips: 0.9162 samples/sec | ETA 26:55:22
2022-05-06 15:18:36 [INFO] [TRAIN] epoch: 12, iter: 300/15000, loss: 2.8668, DSC: 3.9746, lr: 0.000982, batch_cost: 6.5445, reader_cost: 2.25206, ips: 0.9168 samples/sec | ETA 26:43:24

@linhandev
Member

The lr could probably be a bit larger.
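
For reference, the `lr` values in both sets of logs in this thread can be reproduced from the `lr_scheduler` settings, which shows that the first run started from 0.01 and the second from 0.001. The sketch below is a standalone re-implementation of polynomial decay without cycling, lr = (base_lr - end_lr) * (1 - step / decay_steps) ** power + end_lr, and does not call Paddle itself. The assumption that the value logged at iteration N corresponds to scheduler step N - 1 is inferred from the numbers, not confirmed by the source.

```python
def polynomial_decay(step, base_lr=0.001, end_lr=0.0,
                     decay_steps=15000, power=0.9):
    """Polynomial learning-rate decay without cycling (illustrative)."""
    step = min(step, decay_steps)  # hold at end_lr once decay_steps is reached
    return (base_lr - end_lr) * (1 - step / decay_steps) ** power + end_lr


# Logged at iter 100 in the second run (learning_rate: 0.001).
print(f"{polynomial_decay(99):.6f}")
# Logged at iter 100 in the first run, consistent with a base lr of 0.01.
print(f"{polynomial_decay(99, base_lr=0.01):.6f}")
```

Raising `learning_rate` in the config back toward 0.01 would be one way to try the suggestion above.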

@shiyutang
Contributor

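Please keep each issue to a single problem.
Also, this looks like a data problem. Did you modify the data-processing code? If so, could you list all the changes you made?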
