
Problems related to model training #102

Open
@zdy1013

Description


Hello, I am a master's student with a strong interest in autonomous driving, and I would like to reproduce your model. As a first step I want to run through your training code using part of the LMDrive data, with Town01 and Town02 for training and Town03 for validation. I set the number of GPUs and the dataset path in train.sh, as sketched below.
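These are roughly the only two lines I edited (the variable names are as I remember them from my copy of interfuser/scripts/train.sh, and the path is my local value, so please treat this only as an illustration):

    # sketch of my edits to interfuser/scripts/train.sh -- values are local/illustrative
    GPU_NUM=2                              # my machine has two GPUs
    DATASET_ROOT=/home/zdy/lmdrive_data    # illustrative local path to the LMDrive data
    # the ./distributed_train.sh invocation that follows these lines was left unchanged

Running train.sh then did not train normally; the output is as follows: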


(interfuser) zdy@zhaojh:~/InterFuser-main/interfuser/scripts$ bash train.sh 
/home/zdy/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/distributed/launch.py:188: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total 2.
Added key: store_based_barrier_key:1 to store for rank: 1
Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Training in distributed mode with multiple processes, 1 GPU per process. Process 1, total 2.
Loading pretrained weights from url (https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/resnet50d_ra2-464e36ba.pth)
Loading pretrained weights from url (https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/resnet50d_ra2-464e36ba.pth)
Model interfuser_baseline created, param count:52935567
Data processing configuration for current model + dataset:
        input_size: (3, 224, 224)
        interpolation: bicubic
        mean: (0.485, 0.456, 0.406)
        std: (0.229, 0.224, 0.225)
        crop_pct: 0.875
CNN backbone and transformer blocks using different learning rates!
165 weights in the cnn backbone, 274 weights in other modules
AMP not enabled. Training in float32.
Using native Torch DistributedDataParallel.
Sub route dir nums: 0
Scheduled epochs: 35
Sub route dir nums: 0
Sub route dir nums: 0
Sub route dir nums: 0
Current checkpoints:
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-0.pth.tar', 0)

Current checkpoints:
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-0.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-1.pth.tar', 0)

Current checkpoints:
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-0.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-1.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-2.pth.tar', 0)

Current checkpoints:
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-0.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-1.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-2.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-3.pth.tar', 0)

Current checkpoints:
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-0.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-1.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-2.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-3.pth.tar', 0)
 ('./output/20241028-100256-interfuser_baseline-224-interfuser_baseline/checkpoint-4.pth.tar', 0)

*** Best metric: 0 (epoch 0)

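From the log, what worries me is the repeated "Sub route dir nums: 0" line: it looks like no route directories are found under my dataset root, so every epoch finishes immediately and the best metric stays at 0. In case it is relevant, this is how I listed what is under my dataset root (a plain directory listing with my local path, not the loader's actual scanning logic):

    # quick sanity check of my dataset layout; DATASET_ROOT is my local, illustrative path
    DATASET_ROOT=/home/zdy/lmdrive_data

    # list the first two directory levels under the root
    find "$DATASET_ROOT" -maxdepth 2 -type d | sort | head -n 20

    # I believe the loader also expects an index file (something like dataset_index.txt),
    # but I am not sure of the exact name, so I searched for anything index-like
    find "$DATASET_ROOT" -maxdepth 2 -iname '*index*' -print
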
How can I solve this problem? I hope to get your reply.
