
Commit 4dbc7cd

Update bottomup inference and add note on num_workers (#361)
This PR updates the docs by adding a note on using `num_workers` on Mac and Windows. It also includes a minor fix that speeds up bottom-up model inference.
1 parent: b495560

File tree

2 files changed: +9, −0 lines

docs/config.md (2 additions, 0 deletions)

```diff
@@ -18,13 +18,15 @@ The config file has three main sections:
     - **`train_labels_path`**: Path(s) to your training label files.
     - **`val_labels_path`**: Path(s) to your validation label files.
     - **`augmentation_config`**: Controls data augmentation settings for training.
+    - **`data_pipeline_fw`**: Method to load data during training. Options: [`torch_dataset`, `torch_dataset_cache_img_memory`, `torch_dataset_cache_img_disk`].
 
 - **`model_config`**
     - **`head_configs`**: Defines the output heads (e.g., for confidence maps, part affinity fields, etc.).
 
 - **`trainer_config`**
     - **`ckpt_dir`**: Directory where checkpoints and logs will be saved.
     - **`run_name`**: Name for this training run (used for organizing outputs and logging). The checkpoints for a specific run would be saved in `<ckpt_dir>/<run_name>` folder.
+    - **`train_data_loader.num_workers`** and **`val_data_loader.num_workers`**: Number of workers for dataloading. (For mac and windows, set this to > 0, ONLY if caching is used for `data_config.data_pipeline_fw`.)
 
 ### Sample configuration format
 
```
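To make the new options concrete, here is a minimal sketch of how they might be set together in a training config. The nesting follows the key paths named in the diff (`data_config.data_pipeline_fw`, `trainer_config.train_data_loader.num_workers`); the file paths, run name, and worker counts are illustrative, not taken from the repo:

```yaml
data_config:
  train_labels_path: train.pkg.slp   # illustrative path
  val_labels_path: val.pkg.slp       # illustrative path
  # Caching pipelines load images once (into memory or onto disk), which is
  # what makes multi-worker loading safe on Mac/Windows per the note above.
  data_pipeline_fw: torch_dataset_cache_img_memory

trainer_config:
  ckpt_dir: checkpoints         # illustrative
  run_name: bottomup_run_1      # illustrative
  train_data_loader:
    num_workers: 4   # on Mac/Windows, keep at 0 unless a caching pipeline is used
  val_data_loader:
    num_workers: 4
```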

sleap_nn/inference/paf_grouping.py (7 additions, 0 deletions)

```diff
@@ -545,6 +545,11 @@ def match_candidates_sample(
 
     See also: match_candidates_batch
     """
+    # Move tensors to CPU once to avoid repeated device<->host synchronizations
+    edge_inds_sample = edge_inds_sample.detach().cpu()
+    edge_peak_inds_sample = edge_peak_inds_sample.detach().cpu()
+    line_scores_sample = line_scores_sample.detach().cpu()
+
     match_edge_inds = []
     match_src_peak_inds = []
     match_dst_peak_inds = []
@@ -572,6 +577,8 @@ def match_candidates_sample(
                 edge_peak_inds_k[:, 1] == dst_ind
             )
             if mask.any():
+                # `line_scores_k` is already on CPU; `.item()` does not trigger
+                # a device synchronization and matches the original behaviour.
                 cost_matrix[i, j] = -line_scores_k[
                     mask
                 ].item()  # Flip sign for maximization.
```
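The speed-up rests on a general PyTorch pattern: calling `.item()` on a CUDA tensor inside a Python loop forces one device-to-host synchronization per call, whereas moving the tensor to the CPU once pays that cost a single time. A self-contained sketch of the two variants (function and variable names are illustrative, not the repo's code):

```python
import torch


def sum_scores_slow(line_scores: torch.Tensor) -> float:
    # Anti-pattern: if `line_scores` lives on the GPU, each `.item()` call
    # below triggers its own device-to-host synchronization.
    total = 0.0
    for k in range(len(line_scores)):
        total += line_scores[k].item()
    return total


def sum_scores_fast(line_scores: torch.Tensor) -> float:
    # Fix mirrored from the commit: detach and copy to the CPU once, so every
    # subsequent `.item()` is a cheap host-side read.
    line_scores = line_scores.detach().cpu()
    total = 0.0
    for k in range(len(line_scores)):
        total += line_scores[k].item()
    return total
```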
