Changes for OML 3.0 #557
Merged
**Changelog**

All the functions and classes on the right side are modality agnostic:
* `EmbeddingPairsDataset`, `ImagePairsDataset` -> `PairDataset`
* `pairwise_inference_on_images`, `pairwise_inference_on_embeddings` -> `pairwise_inference`
* `IDistancesPostprocessor` -> (mostly renamed) -> `IRetrievalPostprocessor`
* `PairwisePostprocessor`, `PairwiseEmbeddingsPostprocessor`, `PairwiseImagesPostprocessor` -> `PairwiseReranker`
* `inference_on_images` -> `inference`
* `inference_on_dataframe` -> `inference_cached`

Also:
* `EmbeddingMetrics` takes an optional `dataset` argument in order to perform postprocessing.
* Made postprocessing tests a bit more informative by making dummy models a bit less trivial (added bias to their outputs)

Examples changed:
* `train + val` and `prediction` for postprocessor
* `retrieval usage`
* added `global_paths` parameter to `download_mock_dataset` so it looks nicer
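When migrating call sites, the renames listed above can be collected into a lookup table. This is only a sketch: the `RENAMES` dict restates the changelog, while `migrate_source` is a hypothetical helper (not part of OML) that rewrites whole-word identifiers in a source string. It does not adapt call arguments, which may also have changed (for example, `IDistancesPostprocessor` is described as only "mostly renamed").

```python
import re

# Old name -> new name, exactly as listed in the changelog above.
RENAMES = {
    "EmbeddingPairsDataset": "PairDataset",
    "ImagePairsDataset": "PairDataset",
    "pairwise_inference_on_images": "pairwise_inference",
    "pairwise_inference_on_embeddings": "pairwise_inference",
    "IDistancesPostprocessor": "IRetrievalPostprocessor",  # "mostly renamed": check the interface too
    "PairwisePostprocessor": "PairwiseReranker",
    "PairwiseEmbeddingsPostprocessor": "PairwiseReranker",
    "PairwiseImagesPostprocessor": "PairwiseReranker",
    "inference_on_images": "inference",
    "inference_on_dataframe": "inference_cached",
}

# \b keeps us from touching longer identifiers that merely contain an old name.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def migrate_source(code: str) -> str:
    """Hypothetical helper: replace pre-3.0 identifiers with their 3.0 names."""
    return _PATTERN.sub(lambda m: RENAMES[m.group(1)], code)


print(migrate_source("rr = PairwiseImagesPostprocessor(top_n=3)"))
```

The helper only renames identifiers; removed arguments or changed signatures still need a manual pass.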
…keys. Changed signature of EmbeddingMetrics.

**CHANGELOG**
* Removed keys: `IS_QUERY_KEY`, `IS_GALLERY_KEY`, `CATEGORIES_KEY`, `PATHS_KEY`, `X1_KEY`, `X2_KEY`, `Y1_KEY`, `Y2_KEY`, `SEQUENCE_KEY`. Categories and sequences are passed through `extra_data` instead. The rest is encapsulated in `Dataset`.
* Removed `IMetricDDP`, `EmbeddingMetricsDDP`. Reason: having `EmbeddingMetrics` is enough, because we do accumulator sync there anyway.
* Changed signatures of `EmbeddingMetrics`: keys are replaced by providing a dataset, removed `.sync()` and `.visualisation()` methods and so on.
* Updated `.md` examples and `.rst` docs

Minor:
* Removed `calc_distance_matrix` and `validate_dataset` -- this logic happens in `RetrievalResults`. Also removed `find_first_occurrences` -- we have `unique_by_ids` instead.
* Removed `DummyDataset` in tests (used `EmbeddingsQueryGalleryLabeledDataset` instead).
* `MetricValCallback` is enough to handle DDP
* `samples_in_getitem` is not used
Removed visualization.ipynb
**CHANGELOG**
* `RetrievalResults` uses a Sequence of Tensors which may have different sizes. In other words, it allows us to support the case when queries have different numbers of retrieved items.
* Consequently, changed `batched_knn`, `retrieval_metrics` and `PairwiseReranker` to support the new input type.
* Added asserts to `RetrievalResults` that distances arrive sorted and retrieved ids are unique, plus other checks.

New tests:
* Added tests on corner cases for `RetrievalResults` creation.
* Added tests on visualization when queries in `RetrievalResults` have different numbers of retrieved items.
* Added a new test with predefined values for `batched_knn` to make debugging easier.
* Changed existing postprocessor tests: used `sequence` in datasets so queries have different numbers of retrieved items and we actually test the new functionality.

@leoromanovich and I also checked that using a Sequence of Tensors doesn't lead to poor performance on validation.
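To make the "different number of retrieved items per query" idea concrete, here is a minimal sketch in plain Python. It is not OML's implementation: `check_retrieval_lists` is a hypothetical function, plain lists stand in for tensors, and ascending order is assumed for "sorted" distances.

```python
from typing import Sequence


def check_retrieval_lists(
    distances: Sequence[Sequence[float]],
    retrieved_ids: Sequence[Sequence[int]],
) -> None:
    """Hypothetical validator mirroring the checks described in the changelog."""
    if len(distances) != len(retrieved_ids):
        raise ValueError("Expected one list of distances and one list of ids per query.")
    for q, (dists, ids) in enumerate(zip(distances, retrieved_ids)):
        if len(dists) != len(ids):
            raise ValueError(f"Query {q}: {len(dists)} distances but {len(ids)} ids.")
        # Assumption: "sorted" means ascending, i.e. the closest item comes first.
        if any(a > b for a, b in zip(dists, dists[1:])):
            raise ValueError(f"Query {q}: distances must arrive sorted.")
        if len(set(ids)) != len(ids):
            raise ValueError(f"Query {q}: retrieved ids must be unique.")


# Three queries with 3, 1 and 0 retrieved items: a ragged structure that a
# single rectangular matrix could not represent without padding.
distances = [[0.1, 0.4, 0.9], [0.2], []]
retrieved_ids = [[5, 2, 7], [3], []]
check_retrieval_lists(distances, retrieved_ids)  # passes silently
```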
**CHANGELOG**
* Added support of empty predictions to the retrieval metrics (for example, it may be useful when we cut retrieval results by a distance threshold).
* Moved categories handling to functional metrics from the `EmbeddingMetrics` class, also updated the `.md` example to show how to deal with categories.
* Added `calc_fnmr_at_fmr_rr`, removed `extract_pos_neg_dists` and `calc_fnmr_at_fmr_from_matrices`. Returned `fnmr@fmr` to `EmbeddingMetrics` (there was a todo).

TESTS
* Moved tests that use old formats of retrieval metrics to a separate folder: `...test_metrics/test_outdated/...`.
* Added a few new tests on retrieval metrics: test handling categories and empty predictions.
* Added a test on `calc_fnmr_at_fmr_rr`.
* Added a test that `EmbeddingMetrics` calculates all expected metrics.
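The effect of empty predictions on a retrieval metric can be sketched as follows. This is an illustration, not OML's code: `cmc_at_k` is a hypothetical function, and scoring a query with no retrieved items as a miss is an assumption, since the changelog does not say how such queries are scored.

```python
from typing import Sequence, Set


def cmc_at_k(
    retrieved_ids: Sequence[Sequence[int]],
    gt_ids: Sequence[Set[int]],
    k: int,
) -> float:
    """Fraction of queries with at least one relevant item among the top k."""
    if k < 1:
        raise ValueError("k must be positive.")
    if len(retrieved_ids) != len(gt_ids) or len(retrieved_ids) == 0:
        raise ValueError("Expected the same non-zero number of queries in both inputs.")
    hits = []
    for ids, gt in zip(retrieved_ids, gt_ids):
        # Assumption: an empty prediction counts as a miss; any() over an
        # empty sequence is False, so no special case is needed.
        hits.append(float(any(i in gt for i in ids[:k])))
    return sum(hits) / len(hits)


retrieved_ids = [[5, 2, 7], [3], []]  # the last query retrieved nothing
gt_ids = [{2}, {4}, {1}]
print(cmc_at_k(retrieved_ids, gt_ids, k=3))
```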
… so on) Moved outdated matrix functions to tests (mask_gt, mask_to_ignore and so on)
**CHANGELOG**
* Simplified handling of NaNs in bboxes
* Improved typings in training pipelines
* Added `show` argument to `RetrievalResults.visualise()`
* Specified reason for skipping cloud logging tests
* Polished md examples
* Added `mode_for_checkpointing` to Pipeline config
…rning into oml_3.0_release
* Added `TextBaseDataset`, `TextLabeledDataset`, `TextQueryGalleryLabeledDataset`, `TextQueryGalleryDataset`, `get_mock_texts_dataset`, `visualise_text`
* Added `HFWrapper` to wrap models from the HuggingFace library
* `download_mock_dataset` -> `get_mock_images_dataset` (the original name is also kept for backward compatibility)
================== Docs, Readme and examples for OML 3.0 ====================

General
* Made imports shorter in all examples (updated the corresponding `__init__.py` files and `__all__` variables)
* `train.md`, `val.md` -> `train_val_img_txt.md`
* Joined the example of using pre-trained image models, pre-trained HF text models (just added), and the zoo table (moved) into one file.
* Removed links to colab notebooks for all examples except for the `train_val_img_txt.md` code snippet.
* Updated dataset format description: added info about the `text` column.
* Updated mock dataset of texts so we have more data in train.
* Added handling categories example (train + val)

Renaming:
* `download_mock_dataset` got a short link `get_mock_images_dataset` so it looks similar to `get_mock_texts_dataset`
* `RetrievalResults.compute_from_embeddings(embeddings, dataset, n_items_to_retrieve=5)` -> `RetrievalResults.from_embeddings(embeddings, dataset, n_items=5)`

README:
* Added release notes for OML 3.0
* Added a side-by-side example of training and validating text and image models
* Added `OML Features` section
* Zoo section is updated
* Updated FAQ, moved to Documentation section

ReadTheDocs:
* Added new text Datasets to docs
* Added `OML Features` section to the home page
* Moved getters for mock datasets from utils to the Datasets page
* Python examples: hid most of the examples under details. Added an example of handling categories.
* Updated the page about logging.
* Split the post-processing section into re-ranking by model (the old content of post-processing) and algo post-processing (just a placeholder page for the moment).
* Removed `zoo` section from post-processing by model.
MISC
* Updated `check_retrieval_format` so it works with text datasets.
* Added a code example of using `check_retrieval_format` (+ the corresponding test)
* Made `last_logs` a property and added docs for it (for triplet & arcface losses, bank miner)
* Made `distances`, `retrieved_ids`, `gt_ids` documented properties of `RetrievalResults`.
ALGO POST-PROCESSING
* Added `AdaptiveThresholding`, `ConstantThresholding`:
  * classes implementation
  * updated registry and configs
  * readthedocs: contents and algo postprocessing page with a new example
  * pytests and pipelines test
  * added thresholding to a few existing python examples
* Added `is_empty` and `deepcopy` methods to `RetrievalResults`, also updated readthedocs
* Removed `top_n` from `IRetrievalPostprocessor` interface
* Updated code so it can work with the old NN postprocessing and the new algorithmic ones. Added a todo so we refactor it in future after we have more postprocessors.
* Added drawing test for empty `RetrievalResults`
* Used mock text dataset in categories example instead of the image one (because it has bigger categories)
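As an illustration of what thresholding by distance does to retrieval results, here is a minimal sketch of the constant variant in plain Python. It is not the `ConstantThresholding` class itself: `constant_thresholding` and its `th` parameter are hypothetical names, and keeping items whose distance is less than or equal to the threshold (rather than strictly less) is an assumption.

```python
from typing import List, Sequence, Tuple


def constant_thresholding(
    distances: Sequence[Sequence[float]],
    retrieved_ids: Sequence[Sequence[int]],
    th: float,
) -> Tuple[List[List[float]], List[List[int]]]:
    """Drop every retrieved item whose distance exceeds `th` (hypothetical sketch)."""
    new_distances: List[List[float]] = []
    new_ids: List[List[int]] = []
    for dists, ids in zip(distances, retrieved_ids):
        # Assumption: items exactly at the threshold are kept.
        kept = [(d, i) for d, i in zip(dists, ids) if d <= th]
        new_distances.append([d for d, _ in kept])
        new_ids.append([i for _, i in kept])
    return new_distances, new_ids


distances = [[0.1, 0.4, 0.9], [0.7]]
retrieved_ids = [[5, 2, 7], [3]]
new_distances, new_ids = constant_thresholding(distances, retrieved_ids, th=0.5)
print(new_distances, new_ids)
```

After the cut, the second query has no retrieved items left, which is why the support for empty predictions and the ability to test for emptiness matter for this kind of postprocessing.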
Misc:
* Fixed categories handling in pcf metric
* Added docs to calculating metrics by rr
* Added `verbose` argument
Checklist:

Check Pipelines:

Misc
* `sort` and `topk` functions in the code that rebuild predictions. They need to be moved to `RetrievalResults`.
* … `oml_3.0_release` branch is turned off