Very large loss of "FactoredGeometryScaleRegr3DPlusNormalGMLoss_pts3d_conf_loss" when fine tuning #124

@bowieshi

Hi. Thanks for your great work! I am trying to fine-tune Map-anything on my custom dataset. After several epochs, training failed with the log below. The "FactoredGeometryScaleRegr3DPlusNormalGMLoss_pts3d_conf_loss" term became very large, and training failed the assertion that the loss stays below 1000. What could cause such a large loss, and how can I fix it? Thank you!

[2026-01-09 01:19:05,258][__main__][INFO] - [01:19:05.257972]                                                                                                                             
[2026-01-09 01:19:05,361][__main__][INFO] - Epoch: [3]  [540/745]  eta: 0:21:53  lr: 0.000074  lr_encoder: 0.000004  epoch: 3.7114 (3.3624)  loss: -0.2796 (-0.0543)  FactoredGeometryScaleRegr3DPlusNormalGMLoss_pts3d_conf_loss_avg: -0.2961 (-0.2075)  FactoredGeometryScaleRegr3DPlusNormalGMLoss_cam_pts3d_bot95%_loss_avg: 0.0023 (0.0052)  FactoredGeometryScaleRegr3DPlusNormalGMLoss_depth_along_ray_bot95%_loss_avg: 0.0018 (0.0047)  FactoredGeometryScaleRegr3DPlusNormalGMLoss_pts3d_avg: 0.0804 (0.1372)  FactoredGeometryScaleRegr3DPlusNormalGMLoss_cam_pts3d_avg: 0.0080 (0.0137)  FactoredGeometryScaleRegr3DPlusNormalGMLoss_depth_along_ray_avg: 0.0074 (0.0131)  FactoredGeometryScaleRegr3DPlusNormalGMLoss_ray_directions_avg: 0.0007 (0.0018)  FactoredGeometryScaleRegr3DPlusNormalGMLoss_pose_quats_avg: 0.0000 (0.0000)  FactoredGeometryScaleRegr3DPlusNormalGMLoss_pose_trans_avg: 0.0000 (0.0000)  NonAmbiguousMaskLoss_mask_avg: 0.0475 (0.0494)  time: 5.9609  data: 0.0032  max mem: 39620
[2026-01-09 01:19:36,745][__main__][INFO] - [01:19:36.744978]                                                                                                                             
[2026-01-09 01:19:36,745][__main__][INFO] - Error in datasetWAI.__getitem__ for scene_idx=30528:                                                                                          
[2026-01-09 01:19:36,745][__main__][INFO] - [01:19:36.745607]                                                                                                                             
[2026-01-09 01:19:36,745][__main__][INFO] - Retrying with scene_idx=14533 (1 of 5)                                                       
[2026-01-09 01:20:47,119][__main__][INFO] - [01:20:47.119676]                                                                            
[2026-01-09 01:20:47,176][__main__][INFO] - Loss is 5927.96533203125, stopping training                                                  
[2026-01-09 01:20:47,176][__main__][INFO] - [01:20:47.176251]                                                                            
[2026-01-09 01:20:47,176][__main__][INFO] - Loss Details: {'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pts3d_conf_loss_view1': 2665.58984375, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pts3d_conf_loss_avg': 2963.632568359375, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pts3d_conf_loss_view2': 3261.67529296875, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_cam_pts3d_bot95%_loss_view1': 0.09395701438188553, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_cam_pts3d_bot95%_loss_avg': 0.09373242780566216, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_cam_pts3d_bot95%_loss_view2': 0.09350784122943878, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_depth_along_ray_bot95%_loss_view1': 0.09339836239814758, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_depth_along_ray_bot95%_loss_avg': 0.09323638305068016, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_depth_along_ray_bot95%_loss_view2': 0.09307440370321274, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pts3d_view1': 0.988899827003479, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pts3d_view2': 0.9848102927207947, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pts3d_avg': 0.9868550598621368, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_cam_pts3d_view1': 0.09890376776456833, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_cam_pts3d_view2': 0.09849169105291367, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_cam_pts3d_avg': 0.098697729408741, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_depth_along_ray_view1': 0.09831757843494415, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_depth_along_ray_view2': 0.09802951663732529, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_depth_along_ray_avg': 0.09817354753613472, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_ray_directions_view1': 0.001488085137680173, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_ray_directions_view2': 0.0012360082473605871, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_ray_directions_avg': 0.00136204669252038, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pose_quats_view1': 1.5497208494252845e-07, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pose_quats_view2': 1.6093255794658035e-07, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pose_quats_avg': 1.579523214445544e-07, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pose_trans_view1': 2.920627935054654e-07, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pose_trans_view2': 3.0398371109185973e-07, 'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pose_trans_avg': 2.9802325229866256e-07, 'NonAmbiguousMaskLoss_mask_view1': 0.07778102159500122, 'NonAmbiguousMaskLoss_mask_view2': 0.08098114281892776, 'NonAmbiguousMaskLoss_mask_avg': 0.07938108220696449, 
ll-l40-0:980357:980980 [3] NCCL INFO [Service thread] Connection closed by localRank 3                                                                                                    
ll-l40-0:980357:988679 [3] NCCL INFO comm 0x3a749720 rank 3 nranks 8 cudaDev 3 busId 61000 - Abort COMPLETE                                                                               
W0109 01:21:52.090000 950999 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 980354 closing signal SIGTERM                                            
W0109 01:21:52.094000 950999 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 980355 closing signal SIGTERM                                            
W0109 01:21:52.097000 950999 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 980356 closing signal SIGTERM                                            
W0109 01:21:52.099000 950999 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 980358 closing signal SIGTERM                                            
W0109 01:21:52.101000 950999 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 980359 closing signal SIGTERM                                            
W0109 01:21:52.104000 950999 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 980360 closing signal SIGTERM                                            
W0109 01:21:52.106000 950999 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 980361 closing signal SIGTERM
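
For context on how a `_conf_loss` term can be negative in one step and in the thousands in another, here is a minimal sketch. It assumes the term follows the DUSt3R-style confidence weighting `conf * loss - alpha * log(conf)`; that form, the function name `conf_weighted_loss`, the value `alpha=0.2`, and the sample confidences are all assumptions for illustration and are not taken from the MapAnything code or from the log.

```python
import math


def conf_weighted_loss(reg_loss, conf, alpha=0.2):
    """Hypothetical DUSt3R-style confidence weighting (assumed form).

    The regression loss is scaled by the predicted confidence, and
    -alpha * log(conf) rewards the model for being confident.
    """
    return conf * reg_loss - alpha * math.log(conf)


# Small regression error, moderate confidence: the log term dominates,
# so the combined value can be negative.
small_error = conf_weighted_loss(reg_loss=0.01, conf=5.0)

# Larger regression error where the model was very confident: the product
# conf * reg_loss dominates and the value grows roughly linearly with conf.
large_error = conf_weighted_loss(reg_loss=0.99, conf=3000.0)

print(f"small error, moderate confidence: {small_error:.4f}")
print(f"large error, high confidence:     {large_error:.4f}")
```

Under that assumption, a batch in which the regression error rises by about an order of magnitude (as `pts3d_avg` does between the two log entries above, from 0.0804 to 0.9869) in regions where the predicted confidence is already high would be enough to push the combined term past 1000. The log also shows a dataset retry just before the spike, so inspecting the samples in that batch may be a reasonable first step.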
