
Changes for OML 3.0 #557

Merged
merged 20 commits into from
Jun 7, 2024

Conversation

AlekseySh
Contributor

@AlekseySh AlekseySh commented May 10, 2024

Checklist:

Check Pipelines:

Misc

  • all disabled tests are turned on
  • code examples reworked
  • colab is updated
  • all configs in pipelines are updated
  • check all `sort` and `topk` calls in the code that rebuild predictions; they need to be moved to `RetrievalResults`.
  • CI on oml_3.0_release branch is turned off

**Changelog**

All the functions and classes on the right side are modality agnostic:
* `EmbeddingPairsDataset`, `ImagePairsDataset` -> `PairDataset`
* `pairwise_inference_on_images`,  `pairwise_inference_on_embeddings` -> `pairwise_inference`
* `IDistancesPostprocessor` -> `IRetrievalPostprocessor` (mostly a rename)
* `PairwisePostprocessor`, `PairwiseEmbeddingsPostprocessor`, `PairwiseImagesPostprocessor` ->  `PairwiseReranker`
* `inference_on_images` -> `inference`
* `inference_on_dataframe` -> `inference_cached`
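
For anyone migrating downstream code, the renames above can be collected into a lookup table. The sketch below is only an illustration: `RENAMES` is copied from the list, while `suggest_renames` is a hypothetical helper that is not part of OML. It replaces whole identifiers only, and it does not adapt call signatures, which also changed in several places.

```python
import re

# Old name -> new OML 3.0 name, copied from the changelog list above.
RENAMES = {
    "EmbeddingPairsDataset": "PairDataset",
    "ImagePairsDataset": "PairDataset",
    "pairwise_inference_on_images": "pairwise_inference",
    "pairwise_inference_on_embeddings": "pairwise_inference",
    "IDistancesPostprocessor": "IRetrievalPostprocessor",
    "PairwisePostprocessor": "PairwiseReranker",
    "PairwiseEmbeddingsPostprocessor": "PairwiseReranker",
    "PairwiseImagesPostprocessor": "PairwiseReranker",
    "inference_on_images": "inference",
    "inference_on_dataframe": "inference_cached",
}

# `\b` makes the match whole-identifier: `_` is a word character, so a name
# embedded in a longer identifier (e.g. `my_inference_on_images`) is skipped.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def suggest_renames(source: str) -> str:
    """Hypothetical helper: return `source` with old names replaced by new ones."""
    return _PATTERN.sub(lambda match: RENAMES[match.group(1)], source)


# The call below is a made-up line of user code, not an OML signature.
print(suggest_renames("features = inference_on_images(model, paths)"))
```

The output still needs a manual review, because arguments of the renamed functions and classes are not rewritten.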

Also: 
* `EmbeddingMetrics` takes optional `dataset` argument in order to perform postprocessing. 
* Made postprocessing tests a bit more informative by making dummy models a bit less trivial (added bias to their outputs)

Examples changed:
* `train + val` and `prediction` for postprocessor
* `retrieval usage`
* added `global_paths` parameter to `download_mock_dataset` so it looks nicer
@AlekseySh AlekseySh self-assigned this May 10, 2024
…keys. Changed signature of EmbeddingMetrics.

**CHANGELOG**

* removed keys: `IS_QUERY_KEY`, `IS_GALLERY_KEY`, `CATEGORIES_KEY`, `PATHS_KEY`, `X1_KEY`, `X2_KEY`, `Y1_KEY`, `Y2_KEY`, `SEQUENCE_KEY`. Categories and sequences are passed through `extra_data` instead. The rest is encapsulated in the Dataset.
* Removed `IMetricDDP`, `EmbeddingMetricsDDP`. Reason: having `EmbeddingMetrics` is enough, because we do accumulator sync there anyway.
* Changed the signature of `EmbeddingMetrics`: keys are replaced by providing a dataset, the `.sync()` and `.visualisation()` methods are removed, and so on.
* Updated `.md` examples and `.rst` docs

Minor:
* removed: `calc_distance_matrix`, `validate_dataset`  -- this logic happens in `RetrievalResults`, also removed `find_first_occurrences` -- we have `unique_by_ids` instead.
* removed `DummyDataset` in tests (used `EmbeddingsQueryGalleryLabeledDataset` instead).
@AlekseySh AlekseySh linked an issue May 10, 2024 that may be closed by this pull request
AlekseySh added 11 commits May 12, 2024 20:32
* `MetricValCallback` is enough to handle DDP
* `samples_in_getitem` is not used
Removed visualization.ipynb
**CHANGELOG**

* `RetrievalResults` uses a Sequence of Tensors that may have different sizes. In other words, it allows us to support the case when queries have different numbers of retrieved items.
* Consequently, changed `batched_knn`, `retrieval_metrics` and `PairwiseReranker` to support the new input type.
* Added checks to `RetrievalResults`: distances must arrive sorted, retrieved ids must be unique, and so on.
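
To make the variable-length format and the checks concrete, here is a minimal sketch in plain Python. Lists stand in for tensors, `check_retrieval_inputs` is a hypothetical name, and ascending order (smaller distance means closer) is an assumption. It is not the code inside `RetrievalResults`.

```python
from typing import Sequence


def check_retrieval_inputs(
    distances: Sequence[Sequence[float]],
    retrieved_ids: Sequence[Sequence[int]],
) -> None:
    """Hypothetical checks in the spirit of the ones listed above."""
    if len(distances) != len(retrieved_ids):
        raise ValueError("Expected one list of distances and one list of ids per query.")
    for q, (dists, ids) in enumerate(zip(distances, retrieved_ids)):
        if len(dists) != len(ids):
            raise ValueError(f"Query {q}: {len(dists)} distances but {len(ids)} ids.")
        # Assumption: sorted means ascending, i.e. the closest item comes first.
        if any(a > b for a, b in zip(dists, dists[1:])):
            raise ValueError(f"Query {q}: distances must be sorted in ascending order.")
        if len(set(ids)) != len(ids):
            raise ValueError(f"Query {q}: retrieved ids must be unique.")


# Three queries with 3, 1 and 0 retrieved items: the lengths may differ.
check_retrieval_inputs(
    distances=[[0.1, 0.5, 0.9], [0.3], []],
    retrieved_ids=[[4, 2, 7], [1], []],
)
```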

New tests:
* Added tests on corner cases for `RetrievalResults` creation.
* Added tests on visualization when queries in `RetrievalResults` have different number of retrieved items.
* Added new test with predefined values for `batched_knn` to make debugging easier.
* Changed existing postprocessor tests: used `sequence` in datasets so queries have different number of retrieved items and we actually test new functionality.

@leoromanovich and I also checked that using Sequence of Tensors doesn't lead to poor performance on validation.
**CHANGELOG**

* Added support of empty predictions to the retrieval metrics. For example, it may be useful when we cut retrieval results by a distance threshold.
* Moved categories handling to functional metrics from `EmbeddingMetrics` class, also updated `.md` example to show how to deal with categories.
* Added `calc_fnmr_at_fmr_rr`, removed `extract_pos_neg_dists` and `calc_fnmr_at_fmr_from_matrices`. Brought `fnmr@fmr` back to `EmbeddingMetrics` (there was a todo).
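
As an illustration of what supporting empty predictions can mean for a metric, the sketch below computes CMC@k over plain lists. The function name and the convention that a query with no retrieved items counts as a miss are assumptions made for this example, not a description of OML's functional metrics.

```python
from typing import Sequence


def cmc_at_k(
    retrieved_ids: Sequence[Sequence[int]],
    gt_ids: Sequence[Sequence[int]],
    k: int,
) -> float:
    """Share of queries with at least one correct item among the top-k retrieved."""
    if len(retrieved_ids) != len(gt_ids):
        raise ValueError("Expected one list of retrieved ids and one list of gt ids per query.")
    if len(retrieved_ids) == 0:
        raise ValueError("At least one query is required.")
    hits = []
    for ids, gts in zip(retrieved_ids, gt_ids):
        gt_set = set(gts)
        # An empty prediction gives an empty slice, so `any` returns False.
        # Assumed convention: such a query is scored as a miss (0.0).
        hits.append(float(any(i in gt_set for i in ids[:k])))
    return sum(hits) / len(hits)


# The second query has no retrieved items, e.g. after cutting by a threshold.
score = cmc_at_k(retrieved_ids=[[3, 5], [], [9]], gt_ids=[[5], [1], [9, 2]], k=2)
print(score)  # two of the three queries are hits
```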

TESTS
* Moved tests that use old formats of retrieval metrics to a separate folder: `...test_metrics/test_outdated/...`.
* Added a few new tests on retrieval metrics: test handling categories and empty predictions.
* Added test on `calc_fnmr_at_fmr_rr`.
* Added test that `EmbeddingMetrics` calculates all expected metrics.
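
For readers unfamiliar with `fnmr@fmr`: the idea is to pick a distance threshold at which a given share of negative pairs would be falsely matched, then report the share of positive pairs that this threshold rejects. The sketch below is a simplified plain-Python version with a hypothetical name and signature. It takes `fmr` as a fraction and picks the threshold without interpolation, so it may differ in detail from `calc_fnmr_at_fmr_rr`.

```python
from typing import Sequence


def fnmr_at_fmr(pos_dists: Sequence[float], neg_dists: Sequence[float], fmr: float) -> float:
    """False non-match rate at the threshold implied by the false match rate `fmr`."""
    if len(pos_dists) == 0 or len(neg_dists) == 0:
        raise ValueError("Both positive and negative distances are required.")
    if not 0.0 <= fmr <= 1.0:
        raise ValueError("fmr must be a fraction in [0, 1].")
    neg_sorted = sorted(neg_dists)
    # Threshold chosen so that at most a share `fmr` of negatives lies strictly below it.
    idx = min(int(fmr * len(neg_sorted)), len(neg_sorted) - 1)
    threshold = neg_sorted[idx]
    # A positive pair at or above the threshold is a false non-match.
    n_missed = sum(d >= threshold for d in pos_dists)
    return n_missed / len(pos_dists)


pos = [0.1, 0.3, 0.5, 0.7]
neg = [0.2, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4]
print(fnmr_at_fmr(pos, neg, fmr=0.1))  # threshold is 0.6, so 1 of 4 positives is missed: 0.25
```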
… so on)

Moved outdated matrix functions to tests (mask_gt, mask_to_ignore and so on)
**CHANGELOG**

* Simplified handling nans in bboxes
* Improved typings in training pipelines 
* Added show argument to `RetrievalResults.visualise()`
* Specified reason for skipping cloud logging tests
* Polished md examples
* Added `mode_for_checkpointing` to Pipeline config
* Added `TextBaseDataset`, `TextLabeledDataset`, `TextQueryGalleryLabeledDataset`, `TextQueryGalleryDataset`, `get_mock_texts_dataset`, `visualise_text`
* Added `HFWrapper` to wrap models from HuggingFace library
* `download_mock_dataset` -> `get_mock_images_dataset` (the original name is also kept for back compatibility)
AlekseySh added 4 commits June 4, 2024 20:22
================== Docs, Readme and examples for OML 3.0 ====================

General
* Made imports shorter in all examples (updated the corresponding `__init__.py` files and `__all__` variables)
* `train.md`, `val.md` -> `train_val_img_txt.md`
* Joined example of using pre-trained image models, pre-trained HF text models (just added), and zoo table (moved) into one file.
* Removed links to colab notebooks for all examples except for the `train_val_img_txt.md` code snippet.
* Updated dataset format description: added info about `text` column.
* Updated mock dataset of texts so we have more data in train.
* Added handling categories example (train + val)

Renaming:
* `download_mock_dataset` got a short link `get_mock_images_dataset` so it looks similar to `get_mock_texts_dataset`
* `RetrievalResults.compute_from_embeddings(embeddings, dataset, n_items_to_retrieve=5)` -> `RetrievalResults.from_embeddings(embeddings, dataset, n_items=5)`
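
Downstream code that still uses the old names can bridge the gap with a thin shim. The sketch below shows the pattern on stand-in functions: both bodies are placeholders invented for the demo, and only the names and keyword arguments come from the list above. Whether OML itself keeps `compute_from_embeddings` is not stated here, so treat this as user-side code.

```python
import warnings
from typing import Any, Dict, Sequence


def from_embeddings(embeddings: Sequence[Any], dataset: Any, n_items: int) -> Dict[str, int]:
    """Stand-in for the new-style call; the body is a placeholder for the demo."""
    return {"n_embeddings": len(embeddings), "n_items": n_items}


def compute_from_embeddings(
    embeddings: Sequence[Any], dataset: Any, n_items_to_retrieve: int
) -> Dict[str, int]:
    """Old name and old keyword, forwarded to the new ones with a warning."""
    warnings.warn(
        "compute_from_embeddings(..., n_items_to_retrieve=...) was renamed to "
        "from_embeddings(..., n_items=...)",
        DeprecationWarning,
        stacklevel=2,
    )
    return from_embeddings(embeddings, dataset, n_items=n_items_to_retrieve)


# A pure rename with an unchanged signature only needs a plain alias,
# in the way the old mock dataset name is kept next to the new one.
compute_from_embeddings_alias = compute_from_embeddings

old_style = compute_from_embeddings([0.1, 0.2, 0.3], dataset=None, n_items_to_retrieve=5)
new_style = from_embeddings([0.1, 0.2, 0.3], dataset=None, n_items=5)
```

The wrapper is needed only because the keyword changed as well; a plain alias would break calls that pass `n_items_to_retrieve` by name.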

README:
* Added release notes for OML 3.0
* Added side-by-side example of training and validation text and image models
* Added `OML Features` section
* Zoo section is updated
* Updated FAQ, moved to Documentation section

ReadTheDocs:
* Added new text Datasets to docs
* Added `OML Features` section to the home page 
* Moved getters for mock datasets from utils to Datasets page
* Python examples: hide most of the examples under details. Added an example of handling categories.
* Updated the page about logging.
* Split the post-processing section into re-ranking by model (the old content of post-processing) and algo post-processing (just a placeholder page for the moment).
* Removed `zoo` section from post-processing by model.
MISC

* Updated `check_retrieval_format` so it works with text dataset.
* Added a code example of using `check_retrieval_format` (+ the corresponding test)
* Made `last_logs` a property and added docs for it (for triplet & arcface losses, bank miner)
* Made `distances`, `retrieved_ids`, `gt_ids` documented properties of `RetrievalResults`.
ALGO POST-PROCESSING

* Added `AdaptiveThresholding`, `ConstantThresholding`:
    * classes implementation
    * updated registry and configs
    * readthedocs: contents and algo postprocessing page with a new example
    * pytests and pipelines test
    * added thresholding to a few existing python examples
* Added `is_empty` and `deepcopy` methods to `RetrievalResults`, also updated readthedocs
* Removed `top_n` from `IRetrievalPostprocessor` interface
* Updated the code so it works with both the old NN postprocessing and the new algorithmic postprocessors. Added a todo to refactor it once we have more postprocessors.
* Added drawing test for empty `RetrievalResults`
* Used mock text dataset in categories example instead of the image one (because it has bigger categories)
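
A rough sketch of what constant thresholding does to retrieval results, using plain lists instead of tensors. The function names, the inclusive `<=` comparison, and the meaning of `is_empty` (no query has any item left) are assumptions for illustration; the adaptive variant is not sketched because its rule is not described in this thread.

```python
from typing import List, Sequence, Tuple


def constant_thresholding(
    distances: Sequence[Sequence[float]],
    retrieved_ids: Sequence[Sequence[int]],
    th: float,
) -> Tuple[List[List[float]], List[List[int]]]:
    """Keep, for every query, only the items whose distance is <= th."""
    new_distances, new_ids = [], []
    for dists, ids in zip(distances, retrieved_ids):
        kept = [(d, i) for d, i in zip(dists, ids) if d <= th]
        new_distances.append([d for d, _ in kept])
        new_ids.append([i for _, i in kept])
    return new_distances, new_ids


def is_empty(retrieved_ids: Sequence[Sequence[int]]) -> bool:
    """Assumed meaning: True when no query has any retrieved item left."""
    return all(len(ids) == 0 for ids in retrieved_ids)


dists, ids = constant_thresholding(
    distances=[[0.1, 0.4, 0.9], [0.7, 0.8]],
    retrieved_ids=[[3, 1, 2], [0, 5]],
    th=0.5,
)
print(ids)            # [[3, 1], []]: the second query ends up with no predictions
print(is_empty(ids))  # False, because the first query still has items
```

After such a cut, queries can end up with different numbers of retrieved items, some with none at all.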
Misc:

* Fixed categories handling in pcf metric
* Added docs to calculating metrics by rr
* Added verbose argument
@AlekseySh AlekseySh changed the title OML 3.0 Changes for OML 3.0 Jun 6, 2024
@AlekseySh AlekseySh merged commit ca114d6 into main Jun 7, 2024
11 checks passed
@AlekseySh AlekseySh deleted the oml_3.0_release branch June 7, 2024 12:29
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[EPIC] Release OML 3.0
1 participant