Description
Bug description
After a first validation epoch with reasonable prediction counts, all subsequent validation epochs produce zero predictions, even though training continues with a low loss and the validation loss keeps decreasing.
I am using the default training/evaluation pipeline provided in the repo.
The original logs did not show any GT/prediction counts, only a warning:
- (Undefined metric value, caused by empty GTs or predictions)
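For context, a ratio-based metric typically becomes undefined when one of its denominators is zero. The sketch below only illustrates that behaviour; it is not the repo's implementation, and `summarize` and its arguments are hypothetical names.

```python
from typing import Optional, Tuple


def summarize(matches: int, num_gts: int, num_preds: int) -> Tuple[Optional[float], Optional[float]]:
    """Return (recall, precision); a value is None when its denominator is zero."""
    # Recall divides by the number of ground-truth boxes
    recall = matches / num_gts if num_gts > 0 else None
    # Precision divides by the number of predicted boxes
    precision = matches / num_preds if num_preds > 0 else None
    return recall, precision


# Ground truths present but no predictions: precision is undefined
print(summarize(matches=0, num_gts=154, num_preds=0))  # (0.0, None)
```

Under this assumption, a single undefined value would be enough to trigger a warning like the one above.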
To debug this, I added print statements in the evaluation loop to log the actual counts:
print(f"GTs: {len(targets)}, Predictions: {len(predictions)}")
This revealed that:
- Epoch 1 had predictions
- Epoch 2 and beyond had Predictions: 0 for all batches, while GTs remained non-zero.
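Instead of reading the counts off the interleaved progress-bar output, they could be accumulated per epoch. This is a minimal sketch; `CountTracker` is a hypothetical helper that is not part of the repo, and the sample numbers are taken from the first validation batch in the logs further down.

```python
from collections import defaultdict
from typing import Dict, List


class CountTracker:
    """Accumulates per-epoch GT and prediction counts (hypothetical helper)."""

    def __init__(self) -> None:
        # epoch -> [total GTs, total predictions]
        self.totals: Dict[int, List[int]] = defaultdict(lambda: [0, 0])

    def update(self, epoch: int, num_gts: int, num_preds: int) -> None:
        self.totals[epoch][0] += num_gts
        self.totals[epoch][1] += num_preds

    def empty_epochs(self) -> List[int]:
        """Epochs that had ground truths but no predictions at all."""
        return sorted(e for e, (g, p) in self.totals.items() if g > 0 and p == 0)


tracker = CountTracker()
tracker.update(epoch=1, num_gts=154, num_preds=867)
tracker.update(epoch=1, num_gts=266, num_preds=616)
tracker.update(epoch=2, num_gts=154, num_preds=0)
tracker.update(epoch=2, num_gts=266, num_preds=0)
print(tracker.empty_epochs())  # [2]
```

In the evaluation loop, `update` would be called where the `print` currently sits, with `len(boxes_gt)` and `len(boxes_pred)` as arguments.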
Code snippet to reproduce the bug
def evaluate(model, val_loader, batch_transforms, val_metric, log=None):
    # Reset val metric
    val_metric.reset()
    # Validation loop
    val_loss, batch_cnt = 0, 0
    val_iter = iter(val_loader)
    pbar = tqdm(val_iter, dynamic_ncols=True)
    for images, targets in pbar:
        images = batch_transforms(images)
        out = model(images, target=targets, training=False, return_preds=True)
        # Compute metric
        loc_preds = out["preds"]
        print("loc_preds", loc_preds)
        for target, loc_pred in zip(targets, loc_preds):
            for boxes_gt, boxes_pred in zip(target.values(), loc_pred.values()):
                if args.rotation and args.eval_straight:
                    # Convert pred to boxes [xmin, ymin, xmax, ymax] N, 5, 2 (with scores) --> N, 4
                    boxes_pred = np.concatenate((boxes_pred[:, :4].min(axis=1), boxes_pred[:, :4].max(axis=1)), axis=-1)
                print(f"GTs: {len(boxes_gt)}, Predictions: {len(boxes_pred)}")
                val_metric.update(gts=boxes_gt, preds=boxes_pred[:, :4])
        pbar.set_description(f"Validation loss: {out['loss'].numpy():.6}")
        log(val_loss=out["loss"].numpy())
        val_loss += out["loss"].numpy()
        batch_cnt += 1
    val_loss /= batch_cnt
    recall, precision, mean_iou = val_metric.summary()
    return val_loss, recall, precision, mean_iou
Error traceback
Training loss: 1.00829 | LR: 0.00080002: 100%|████████████████████████████████████████████████████████████████| 81/81 [01:10<00:00, 1.15it/s]
Epoch 1/5 - Training loss: 1.34785 | LR: 0.00080002
Validation loss: 2.46732e+05: 100%|███████████████████████████████████████████████████████████████████████████| 21/21 [00:05<00:00, 3.67it/s]
Validation loss decreased inf --> 2.44343e+05: saving state...
Epoch 1/5 - Validation loss: 2.44343e+05 (Undefined metric value, caused by empty GTs or predictions)
Training loss: 1.00324 | LR: 0.00060004: 100%|████████████████████████████████████████████████████████████████| 81/81 [00:47<00:00, 1.71it/s]
Epoch 2/5 - Training loss: 1.00499 | LR: 0.00060004
Validation loss: 1.06822: 100%|███████████████████████████████████████████████████████████████████████████████| 21/21 [00:05<00:00, 3.62it/s]
Validation loss decreased 2.44343e+05 --> 1.0682: saving state...
Epoch 2/5 - Validation loss: 1.0682 (Undefined metric value, caused by empty GTs or predictions)
Training loss: 1.00213 | LR: 0.00040006: 100%|████████████████████████████████████████████████████████████████| 81/81 [00:47<00:00, 1.70it/s]
Epoch 3/5 - Training loss: 1.00283 | LR: 0.00040006
Validation loss: 1.01836: 100%|███████████████████████████████████████████████████████████████████████████████| 21/21 [00:05<00:00, 3.60it/s]
Validation loss decreased 1.0682 --> 1.01834: saving state...
Epoch 3/5 - Validation loss: 1.01834 (Undefined metric value, caused by empty GTs or predictions)
Training loss: 1.00214 | LR: 0.00020008: 100%|████████████████████████████████████████████████████████████████| 81/81 [00:47<00:00, 1.71it/s]
Epoch 4/5 - Training loss: 1.00221 | LR: 0.00020008
Validation loss: 1.00613: 100%|███████████████████████████████████████████████████████████████████████████████| 21/21 [00:05<00:00, 3.68it/s]
Validation loss decreased 1.01834 --> 1.00614: saving state...
Epoch 4/5 - Validation loss: 1.00614 (Undefined metric value, caused by empty GTs or predictions)
Training loss: 1.00171 | LR: 1e-07: 100%|█████████████████████████████████████████████████████████████████████| 81/81 [00:47<00:00, 1.72it/s]
Epoch 5/5 - Training loss: 1.00193 | LR: 1e-07
Validation loss: 1.00313: 100%|███████████████████████████████████████████████████████████████████████████████| 21/21 [00:05<00:00, 3.67it/s]
Validation loss decreased 1.00614 --> 1.00321: saving state...
Epoch 5/5 - Validation loss: 1.00321 (Undefined metric value, caused by empty GTs or predictions)
After adding the print statements, I got:
save_interval_epoch: False
Training loss: 1.00212 | LR: 0.00080002: 100%|████████████████████████████████████████████████████████████████| 81/81 [01:58<00:00, 1.46s/it]
Epoch 1/5 - Training loss: 1.21492 | LR: 0.00080002
0%| | 0/21 [00:00<?, ?it/s]GTs: 154, Predictions: 867
GTs: 266, Predictions: 616
Validation loss: 1.32726: 5%|███▊ | 1/21 [00:00<00:12, 1.63it/s]GTs: 157, Predictions: 252
GTs: 155, Predictions: 1000
Validation loss: 1.32641: 10%|███████▌ | 2/21 [00:01<00:09, 1.93it/s]GTs: 465, Predictions: 877
GTs: 141, Predictions: 670
Validation loss: 1.33117: 14%|███████████▍ | 3/21 [00:01<00:08, 2.04it/s]GTs: 138, Predictions: 616
GTs: 150, Predictions: 885
Validation loss: 1.32642: 19%|███████████████▏ | 4/21 [00:02<00:09, 1.86it/s]GTs: 152, Predictions: 620
GTs: 155, Predictions: 853
Validation loss: 1.32649: 24%|███████████████████ | 5/21 [00:02<00:08, 1.87it/s]GTs: 152, Predictions: 801
GTs: 154, Predictions: 739
Validation loss: 1.32666: 29%|██████████████████████▊ | 6/21 [00:03<00:07, 1.90it/s]GTs: 328, Predictions: 146
GTs: 296, Predictions: 771
Validation loss: 1.33071: 33%|██████████████████████████▋ | 7/21 [00:03<00:06, 2.06it/s]GTs: 124, Predictions: 132
GTs: 153, Predictions: 917
Validation loss: 1.32777: 38%|██████████████████████████████▍ | 8/21 [00:04<00:06, 2.02it/s]GTs: 137, Predictions: 1060
GTs: 380, Predictions: 141
Validation loss: 1.33008: 43%|██████████████████████████████████▎ | 9/21 [00:04<00:05, 2.02it/s]GTs: 153, Predictions: 698
GTs: 155, Predictions: 828
Validation loss: 1.32672: 48%|█████████████████████████████████████▌ | 10/21 [00:05<00:05, 1.93it/s]GTs: 151, Predictions: 594
GTs: 154, Predictions: 778
Validation loss: 1.3265: 52%|█████████████████████████████████████████▉ | 11/21 [00:05<00:05, 1.74it/s]GTs: 368, Predictions: 907
GTs: 143, Predictions: 908
Validation loss: 1.32871: 57%|█████████████████████████████████████████████▏ | 12/21 [00:06<00:05, 1.70it/s]GTs: 380, Predictions: 921
GTs: 269, Predictions: 1049
Validation loss: 1.33094: 62%|████████████████████████████████████████████████▉ | 13/21 [00:07<00:04, 1.63it/s]GTs: 137, Predictions: 917
GTs: 153, Predictions: 799
Validation loss: 1.32567: 67%|████████████████████████████████████████████████████▋ | 14/21 [00:07<00:04, 1.69it/s]GTs: 155, Predictions: 1024
GTs: 153, Predictions: 723
Validation loss: 1.32667: 71%|████████████████████████████████████████████████████████▍ | 15/21 [00:08<00:04, 1.38it/s]GTs: 148, Predictions: 618
GTs: 156, Predictions: 932
Validation loss: 1.3263: 76%|████████████████████████████████████████████████████████████▉ | 16/21 [00:09<00:03, 1.48it/s]GTs: 283, Predictions: 1800
GTs: 139, Predictions: 741
Validation loss: 1.32657: 81%|███████████████████████████████████████████████████████████████▉ | 17/21 [00:09<00:02, 1.52it/s]GTs: 153, Predictions: 817
GTs: 153, Predictions: 587
Validation loss: 1.32642: 86%|███████████████████████████████████████████████████████████████████▋ | 18/21 [00:10<00:01, 1.57it/s]GTs: 302, Predictions: 902
GTs: 614, Predictions: 927
Validation loss: 1.33019: 90%|███████████████████████████████████████████████████████████████████████▍ | 19/21 [00:11<00:01, 1.44it/s]GTs: 146, Predictions: 871
GTs: 153, Predictions: 760
Validation loss: 1.3254: 95%|████████████████████████████████████████████████████████████████████████████▏ | 20/21 [00:11<00:00, 1.52it/s]GTs: 151, Predictions: 569
Validation loss: 1.32705: 100%|███████████████████████████████████████████████████████████████████████████████| 21/21 [00:12<00:00, 1.70it/s]
Validation loss decreased inf --> 1.32762: saving state...
Epoch 1/5 - Validation loss: 1.32762 (Recall: 0.00% | Precision: 0.00% | Mean IoU: 0.00%)
Training loss: 1.00112 | LR: 0.00060004: 100%|████████████████████████████████████████████████████████████████| 81/81 [00:50<00:00, 1.60it/s]
Epoch 2/5 - Training loss: 1.00137 | LR: 0.00060004
0%| | 0/21 [00:00<?, ?it/s]GTs: 154, Predictions: 0
GTs: 266, Predictions: 0
Validation loss: 1.03879: 5%|███▊ | 1/21 [00:00<00:05, 3.74it/s]GTs: 157, Predictions: 0
GTs: 155, Predictions: 0
Validation loss: 1.03891: 10%|███████▌ | 2/21 [00:00<00:05, 3.72it/s]GTs: 465, Predictions: 0
GTs: 141, Predictions: 0
Validation loss: 1.03803: 14%|███████████▍ | 3/21 [00:00<00:04, 3.68it/s]GTs: 138, Predictions: 0
GTs: 150, Predictions: 0
Validation loss: 1.03886: 19%|███████████████▏ | 4/21 [00:01<00:04, 3.65it/s]GTs: 152, Predictions: 0
GTs: 155, Predictions: 0
Validation loss: 1.03891: 24%|███████████████████ | 5/21 [00:01<00:04, 3.67it/s]GTs: 152, Predictions: 0
GTs: 154, Predictions: 0
Validation loss: 1.0388: 29%|███████████████████████▏ | 6/21 [00:01<00:04, 3.69it/s]GTs: 328, Predictions: 0
GTs: 296, Predictions: 0
Validation loss: 1.03781: 33%|██████████████████████████▋ | 7/21 [00:01<00:03, 3.66it/s]GTs: 124, Predictions: 0
GTs: 153, Predictions: 0
Validation loss: 1.03864: 38%|██████████████████████████████▍ | 8/21 [00:02<00:03, 3.66it/s]GTs: 137, Predictions: 0
GTs: 380, Predictions: 0
Validation loss: 1.03816: 43%|██████████████████████████████████▎ | 9/21 [00:02<00:03, 3.66it/s]GTs: 153, Predictions: 0
GTs: 155, Predictions: 0
Validation loss: 1.03892: 48%|█████████████████████████████████████▌ | 10/21 [00:02<00:03, 3.66it/s]GTs: 151, Predictions: 0
GTs: 154, Predictions: 0
Validation loss: 1.03882: 52%|█████████████████████████████████████████▍ | 11/21 [00:02<00:02, 3.67it/s]GTs: 368, Predictions: 0
GTs: 143, Predictions: 0
Validation loss: 1.03877: 57%|█████████████████████████████████████████████▏ | 12/21 [00:03<00:02, 3.67it/s]GTs: 380, Predictions: 0
GTs: 269, Predictions: 0
Validation loss: 1.03824: 62%|████████████████████████████████████████████████▉ | 13/21 [00:03<00:02, 3.64it/s]GTs: 137, Predictions: 0
GTs: 153, Predictions: 0
Validation loss: 1.03878: 67%|████████████████████████████████████████████████████▋ | 14/21 [00:03<00:01, 3.67it/s]GTs: 155, Predictions: 0
GTs: 153, Predictions: 0
Validation loss: 1.03888: 71%|████████████████████████████████████████████████████████▍ | 15/21 [00:04<00:01, 3.67it/s]GTs: 148, Predictions: 0
GTs: 156, Predictions: 0
Validation loss: 1.03895: 76%|████████████████████████████████████████████████████████████▏ | 16/21 [00:04<00:01, 3.68it/s]GTs: 283, Predictions: 0
GTs: 139, Predictions: 0
Validation loss: 1.03864: 81%|███████████████████████████████████████████████████████████████▉ | 17/21 [00:04<00:01, 3.71it/s]GTs: 153, Predictions: 0
GTs: 153, Predictions: 0
Validation loss: 1.03885: 86%|███████████████████████████████████████████████████████████████████▋ | 18/21 [00:04<00:00, 3.73it/s]GTs: 302, Predictions: 0
GTs: 614, Predictions: 0
Validation loss: 1.03844: 90%|███████████████████████████████████████████████████████████████████████▍ | 19/21 [00:05<00:00, 3.69it/s]GTs: 146, Predictions: 0
GTs: 153, Predictions: 0
Validation loss: 1.03898: 95%|███████████████████████████████████████████████████████████████████████████▏ | 20/21 [00:05<00:00, 3.70it/s]GTs: 151, Predictions: 0
Validation loss: 1.03875: 100%|███████████████████████████████████████████████████████████████████████████████| 21/21 [00:05<00:00, 3.71it/s]
Validation loss decreased 1.32762 --> 1.03866: saving state...
Epoch 2/5 - Validation loss: 1.03866 (Undefined metric value, caused by empty GTs or predictions)
Training loss: 1.00073 | LR: 0.00040006: 100%|████████████████████████████████████████████████████████████████| 81/81 [00:47<00:00, 1.70it/s]
Environment
DocTR version: 0.12.1a0
TensorFlow version: 2.19.0
PyTorch version: N/A (torchvision N/A)
OpenCV version: 4.11.0
OS: Ubuntu 24.04.1 LTS
Python version: 3.10.9
Is CUDA available (TensorFlow): Yes
Is CUDA available (PyTorch): N/A
CUDA runtime version: 12.0.140
Deep Learning backend
I'm using TF
is_tf_available: True
is_torch_available: False