Open
Labels: question (Further information is requested)
Description
Hi, thanks for your great work! I am trying to fine-tune MapAnything on my custom dataset. After several epochs, training failed with the log below. The "FactoredGeometryScaleRegr3DPlusNormalGMLoss_pts3d_conf_loss" term became very large, and training hit the assertion that the loss must stay below 1000. What could cause such a large loss, and how can I fix it? Thank you!
[2026-01-09 01:19:05,258][__main__][INFO] - [01:19:05.257972]
[2026-01-09 01:19:05,361][__main__][INFO] - Epoch: [3] [540/745] eta: 0:21:53 lr: 0.000074 lr_encoder: 0.000004 epoch: 3.7114 (3.3624) loss: -0.2796 (-0.0543)
  FactoredGeometryScaleRegr3DPlusNormalGMLoss_pts3d_conf_loss_avg: -0.2961 (-0.2075)
  FactoredGeometryScaleRegr3DPlusNormalGMLoss_cam_pts3d_bot95%_loss_avg: 0.0023 (0.0052)
  FactoredGeometryScaleRegr3DPlusNormalGMLoss_depth_along_ray_bot95%_loss_avg: 0.0018 (0.0047)
  FactoredGeometryScaleRegr3DPlusNormalGMLoss_pts3d_avg: 0.0804 (0.1372)
  FactoredGeometryScaleRegr3DPlusNormalGMLoss_cam_pts3d_avg: 0.0080 (0.0137)
  FactoredGeometryScaleRegr3DPlusNormalGMLoss_depth_along_ray_avg: 0.0074 (0.0131)
  FactoredGeometryScaleRegr3DPlusNormalGMLoss_ray_directions_avg: 0.0007 (0.0018)
  FactoredGeometryScaleRegr3DPlusNormalGMLoss_pose_quats_avg: 0.0000 (0.0000)
  FactoredGeometryScaleRegr3DPlusNormalGMLoss_pose_trans_avg: 0.0000 (0.0000)
  NonAmbiguousMaskLoss_mask_avg: 0.0475 (0.0494)
  time: 5.9609 data: 0.0032 max mem: 39620
[2026-01-09 01:19:36,745][__main__][INFO] - [01:19:36.744978]
[2026-01-09 01:19:36,745][__main__][INFO] - Error in datasetWAI.__getitem__ for scene_idx=30528:
[2026-01-09 01:19:36,745][__main__][INFO] - [01:19:36.745607]
[2026-01-09 01:19:36,745][__main__][INFO] - Retrying with scene_idx=14533 (1 of 5)
[2026-01-09 01:20:47,119][__main__][INFO] - [01:20:47.119676]
[2026-01-09 01:20:47,176][__main__][INFO] - Loss is 5927.96533203125, stopping training
[2026-01-09 01:20:47,176][__main__][INFO] - [01:20:47.176251]
[2026-01-09 01:20:47,176][__main__][INFO] - Loss Details: {
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pts3d_conf_loss_view1': 2665.58984375,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pts3d_conf_loss_avg': 2963.632568359375,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pts3d_conf_loss_view2': 3261.67529296875,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_cam_pts3d_bot95%_loss_view1': 0.09395701438188553,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_cam_pts3d_bot95%_loss_avg': 0.09373242780566216,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_cam_pts3d_bot95%_loss_view2': 0.09350784122943878,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_depth_along_ray_bot95%_loss_view1': 0.09339836239814758,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_depth_along_ray_bot95%_loss_avg': 0.09323638305068016,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_depth_along_ray_bot95%_loss_view2': 0.09307440370321274,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pts3d_view1': 0.988899827003479,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pts3d_view2': 0.9848102927207947,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pts3d_avg': 0.9868550598621368,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_cam_pts3d_view1': 0.09890376776456833,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_cam_pts3d_view2': 0.09849169105291367,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_cam_pts3d_avg': 0.098697729408741,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_depth_along_ray_view1': 0.09831757843494415,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_depth_along_ray_view2': 0.09802951663732529,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_depth_along_ray_avg': 0.09817354753613472,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_ray_directions_view1': 0.001488085137680173,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_ray_directions_view2': 0.0012360082473605871,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_ray_directions_avg': 0.00136204669252038,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pose_quats_view1': 1.5497208494252845e-07,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pose_quats_view2': 1.6093255794658035e-07,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pose_quats_avg': 1.579523214445544e-07,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pose_trans_view1': 2.920627935054654e-07,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pose_trans_view2': 3.0398371109185973e-07,
  'FactoredGeometryScaleRegr3DPlusNormalGMLoss_pose_trans_avg': 2.9802325229866256e-07,
  'NonAmbiguousMaskLoss_mask_view1': 0.07778102159500122,
  'NonAmbiguousMaskLoss_mask_view2': 0.08098114281892776,
  'NonAmbiguousMaskLoss_mask_avg': 0.07938108220696449,
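The Loss Details entry has many terms, so a quick way to see which ones dominate is to rank them by magnitude. The snippet below is only a sketch for reading the log: `top_loss_terms` is a hypothetical helper (not part of the training code), and the dictionary holds a hand-copied subset of the per-view values logged above.

```python
def top_loss_terms(loss_details, k=3):
    """Return the k (name, value) pairs with the largest absolute value, largest first."""
    return sorted(loss_details.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]


PREFIX = "FactoredGeometryScaleRegr3DPlusNormalGMLoss_"

# Subset of the per-view values copied from the Loss Details log entry.
loss_details = {
    PREFIX + "pts3d_conf_loss_view1": 2665.58984375,
    PREFIX + "pts3d_conf_loss_view2": 3261.67529296875,
    PREFIX + "pts3d_view1": 0.988899827003479,
    PREFIX + "pts3d_view2": 0.9848102927207947,
    PREFIX + "cam_pts3d_view1": 0.09890376776456833,
    "NonAmbiguousMaskLoss_mask_view1": 0.07778102159500122,
}

for name, value in top_loss_terms(loss_details):
    print(f"{value:12.4f}  {name}")
```

On this subset the two pts3d_conf_loss entries come out on top by several orders of magnitude, while the pts3d regression terms are close to 1.0, well above the 0.0804 (0.1372) logged at step 540.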
ll-l40-0:980357:980980 [3] NCCL INFO [Service thread] Connection closed by localRank 3
ll-l40-0:980357:988679 [3] NCCL INFO comm 0x3a749720 rank 3 nranks 8 cudaDev 3 busId 61000 - Abort COMPLETE
W0109 01:21:52.090000 950999 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 980354 closing signal SIGTERM
W0109 01:21:52.094000 950999 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 980355 closing signal SIGTERM
W0109 01:21:52.097000 950999 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 980356 closing signal SIGTERM
W0109 01:21:52.099000 950999 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 980358 closing signal SIGTERM
W0109 01:21:52.101000 950999 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 980359 closing signal SIGTERM
W0109 01:21:52.104000 950999 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 980360 closing signal SIGTERM
W0109 01:21:52.106000 950999 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 980361 closing signal SIGTERM
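For anyone reading the numbers above: the negative pts3d_conf_loss values earlier in the run and the jump into the thousands are both consistent with a confidence-weighted loss of the form conf * error - alpha * log(conf), as used in DUSt3R-style training. Whether this code path computes exactly that is an assumption here, so the sketch below is only an illustration of the behaviour, not the repository's implementation. The function name and the `alpha` value are hypothetical.

```python
import math


def conf_weighted_loss(reg_loss, conf, alpha=0.2):
    """Assumed DUSt3R-style form: conf * reg_loss - alpha * log(conf).

    reg_loss: per-pixel regression error (non-negative)
    conf:     predicted confidence (positive)
    alpha:    hypothetical regularisation weight, chosen only for illustration
    """
    return conf * reg_loss - alpha * math.log(conf)


# Small error with moderate confidence: the log term wins and the loss is negative.
healthy = conf_weighted_loss(reg_loss=0.001, conf=10.0)

# Error near 1.0 on pixels where the model is very confident: the product dominates.
spiking = conf_weighted_loss(reg_loss=1.0, conf=3000.0)

print(f"healthy batch: {healthy:.4f}")
print(f"spiking batch: {spiking:.4f}")
```

Under this assumption, a batch whose ground-truth geometry disagrees strongly with confident predictions (for example a sample with bad depth, pose, or scale) could push the term past 1000 in a single step. The dataset retry logged shortly before the failure may be worth inspecting for that reason.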