RetrievalResults as sequence of tensors #565

Conversation
```python
distances_b_sorted, retrieved_ids_b = torch.topk(distances_b, k=top_n, largest=False, sorted=True)

# every query may have an arbitrary number of retrieved items, so we are forced to use a loop to store the results
for dist, ids in zip(distances_b_sorted, retrieved_ids_b):
```
I don't know if it simplifies or improves anything, but you can split the tensor into chunks to get a tuple of tensors:

```python
import torch

# inf marks the "nothing retrieved" slots; the second query retrieved nothing at all
distances_b_sorted = torch.tensor(
    [
        [1.0, 2.0, 3.0, float("inf")],
        [float("inf"), float("inf"), float("inf"), float("inf")],
        [3.0, 5.0, 6.0, float("inf")],
    ]
)
retrieved_ids_b = torch.tensor([[10, 1, 2, 7], [3, 14, 5, 6], [4, 8, 9, 0]])

mask_to_keep = ~distances_b_sorted.isinf()
elems_per_query = mask_to_keep.sum(dim=1)

# tuples of per-query tensors with lengths 3, 0, 3 — an empty retrieval is represented naturally
distances = torch.split(distances_b_sorted[mask_to_keep], elems_per_query.tolist())
retrieved_ids = torch.split(retrieved_ids_b[mask_to_keep], elems_per_query.tolist())
```

You can play with different amounts of infs. Or just leave it as is.
I've checked with TQDM that this function is not the bottleneck, so let's do the optimization later (after more urgent stuff).
```python
top_k = _clip_max_with_warning(top_k, gt_tops.shape[1])

def precision_single(is_correct: BoolTensor, n_gt_: int, k_: int) -> float:
    k_ = min(k_, len(is_correct))
```
Might `len(is_correct)` be zero? 🤔 Previously the denominator was defined only by `k` or the shape of `gt_tops`, which are non-zero. But now it might be empty, which is okay after postprocessing.
You are right. But support for this case is added in the next PR: #566
```python
def precision_single(is_correct: BoolTensor, n_gt_: int, k_: int) -> float:
    k_ = min(k_, len(is_correct))
    value = torch.cumsum(is_correct, dim=0)[k_ - 1] / min(n_gt_, k_)
    return float(value)
```
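A quick worked example of what this helper computes (the inputs are made up; `precision_single` is repeated verbatim so the snippet runs standalone):

```python
import torch
from torch import BoolTensor

def precision_single(is_correct: BoolTensor, n_gt_: int, k_: int) -> float:
    k_ = min(k_, len(is_correct))
    value = torch.cumsum(is_correct, dim=0)[k_ - 1] / min(n_gt_, k_)
    return float(value)

# two of the first three retrieved items are correct and there are two ground truths,
# so precision@3 = 2 / min(2, 3) = 1.0
print(precision_single(BoolTensor([True, False, True, False]), n_gt_=2, k_=3))  # 1.0
```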
Not sure that converting `torch.float` (inside the function, via `torch.cumsum()` / division) → `float` (on return) → `torch.float` (outside, after the function) makes sense when using the inner function.
I return a `float` from the inner function because I expect just a single value here, without any extra dimensions which I might have if I kept it as a tensor.
```python
assert retrieved_ids.shape[1] <= len(dataset.get_gallery_ids())
assert len(dataset.get_query_ids()) == len(
    rr.retrieved_ids
), "RetrievalResults and dataset must have the same number of queries."
```
Use `RetrievalResults.__name__` or `rr.__class__.__name__` instead of hard-coding the name in the message.
changed, thx
`oml/retrieval/retrieval_results.py` (outdated)
```diff
-    gt_ids: List[LongTensor] = None,
+    distances: Sequence[FloatTensor],
+    retrieved_ids: Sequence[LongTensor],
+    gt_ids: Sequence[LongTensor] = None,
```
`Optional[Sequence[LongTensor]] = None`
thx, done
```python
assert distances.shape == retrieved_ids.shape
assert distances.ndim == 2
for d, r in zip(distances, retrieved_ids):
    if not (d[:-1] <= d[1:]).all():
```
Logically it's okay to have empty `distances` and `retrieved_ids` for some queries (e.g. after filtering by threshold), and it's funny that all these checks pass 😁 I was not sure how one of the checks should work for empty tensors.
You are right, it's expected. I've added a test for that:

```python
# we retrieved nothing, but it's not an error
RetrievalResults(
    distances=[FloatTensor([])],
    retrieved_ids=[LongTensor([])],
    gt_ids=[LongTensor([1])],
)
```
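For reference, the sortedness check discussed above is vacuously true on empty tensors, which is why such a construction passes:

```python
import torch

d = torch.tensor([])            # empty distances for one query
print((d[:-1] <= d[1:]).all())  # tensor(True): a comparison over zero elements
```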
`oml/retrieval/retrieval_results.py` (outdated)
```diff
     """
     assert len(embeddings) == len(dataset), "Embeddings and dataset must have the same size."

     if SEQUENCE_COLUMN in dataset.extra_data:
-        sequence_ids = LongTensor(pd.factorize(dataset.extra_data[SEQUENCE_COLUMN])[0])
+        sequence = pd.Series(dataset.extra_data[SEQUENCE_COLUMN])
+        sequence_ids = LongTensor(pd.factorize(sequence)[0])
```
Maybe it's not related to this PR, but I recommend adding the parameter `sort=True` to `pd.factorize`. I worked in sim-api with categories or some handcrafted column, and my `df` was recreated at runtime each time, depending on params. Each time the raw values were encoded to different codes (I used them later for debugging); the unique values of such "categories" were the same, just in different amounts and orders. It doesn't happen if the `df` remains the same, but why not ...
```python
import pandas as pd
from random import shuffle

value = [1, 1, 2, 3, 1, 2, 3, 4]
shuffle(value)

df = pd.DataFrame({"raw": value})
# without sort=True, codes follow the order of first appearance,
# so they change from run to run after the shuffle
df["encode"] = pd.factorize(df["raw"])[0]
df = df.sort_values(by="raw")
print(df)
```
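For contrast, a sketch of the same snippet with `sort=True`: the code for each raw value then depends only on its rank among the unique values, not on row order, so the mapping stays stable across shuffles.

```python
import pandas as pd
from random import shuffle

value = [1, 1, 2, 3, 1, 2, 3, 4]
shuffle(value)

df = pd.DataFrame({"raw": value})
# with sort=True, code i always corresponds to the i-th smallest unique value
df["encode"] = pd.factorize(df["raw"], sort=True)[0]
print(df.sort_values(by="raw"))
```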
Nice! I didn't know about this. And thank you for the snippet. Changed.
`oml/retrieval/retrieval_results.py` (outdated)
```diff
@@ -89,17 +105,19 @@ def compute_from_embeddings(
         return RetrievalResults(distances=distances, retrieved_ids=retrieved_ids, gt_ids=gt_ids)

     def __str__(self) -> str:
         max_el_to_show = 100
```
class attribute?
agree, done
also changed the naming: `_max_elements_in_str_repr`
`oml/retrieval/retrieval_results.py` (outdated)
```diff
@@ -125,13 +143,18 @@ def visualize(
         if not isinstance(dataset, IQueryGalleryDataset):
             raise TypeError(f"Dataset has to support {IQueryGalleryDataset.__name__}. Got {type(dataset)}.")

         nq1, nq2 = len(self.retrieved_ids), len(dataset.get_query_ids())
         if nq1 != nq2:
             raise RuntimeError(f"Number of queries in RetrievalResults and Dataset must match: {nq1} != {nq2}")
```
Extract the class names programmatically, as above.
`type(dataset)` will show you the full path to the class, which is not friendly; `dataset.__class__.__name__` shows only the name. And use `self.__class__.__name__` for `RetrievalResults`.
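A tiny sketch of the difference (with a made-up class):

```python
class MyDataset:
    pass

d = MyDataset()
print(type(d))               # <class '__main__.MyDataset'> -- includes the module path
print(d.__class__.__name__)  # MyDataset -- just the short name
```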
done, thx!
CHANGELOG

- `RetrievalResults` uses a Sequence of Tensors which may have different sizes. In other words, it allows us to support the case when queries have different numbers of retrieved items.
- Updated `batched_knn`, `retrieval_metrics` and `PairwiseReranker` to support the new input type.
- Added checks to `RetrievalResults`: retrieved ids are unique, and other checks.

New tests:

- `RetrievalResults` creation.
- `RetrievalResults` where queries have different numbers of retrieved items.
- Reworked `batched_knn` tests to make debugging easier.
- Added `sequence` to datasets so queries have different numbers of retrieved items and we actually test the new functionality.

@leoromanovich and I also checked that using a Sequence of Tensors doesn't lead to poor performance on validation.
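To make the headline change concrete, here is a minimal sketch of constructing `RetrievalResults` with per-query tensors of different lengths (the values are made up; the import path is assumed from `oml/retrieval/retrieval_results.py` above):

```python
from torch import FloatTensor, LongTensor

from oml.retrieval.retrieval_results import RetrievalResults

# query 0 retrieved three items, query 1 retrieved only one;
# per-query distances stay sorted in ascending order
rr = RetrievalResults(
    distances=[FloatTensor([0.1, 0.2, 0.5]), FloatTensor([0.3])],
    retrieved_ids=[LongTensor([4, 2, 7]), LongTensor([0])],
    gt_ids=[LongTensor([2]), LongTensor([0, 5])],
)
```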