Inconsistencies in recrawling/reindexing framework #82
Description
Problem
I discovered a large-ish number of duplicate recipes in the search index today, and have been tracing back through the (re)crawling and (re)indexing logic to figure out how this could have happened.
I think that there are perhaps three logical inconsistencies in the code that cause problems:
- Unlike the `crawl_recipe` worker method, the `index_recipe` worker method does not check for duplicate recipes before (re)indexing content -- and as a result, manual reindexing of recipes may insert duplicates if the operator is not careful when selecting a set of recipes.
- We had what looked to me like a bug that meant that `crawl_recipe` may not have been matching recipes to their earliest-known crawled equivalent. This has been addressed by commit d1e0901 (and follow-up bugfix b2e0ef3, suggesting the need for some code refactoring here).
- There seems to be a tight coupling between the URL requested for recrawling/reindexing by the `crawler` service's `reciperadar/recipes.py` and the `RecipeURL` constructed by the `backend` recipe worker(s). This doesn't seem great; we should be able to accept the oldest-known URL, the latest-known URL, or anything in between, and expect reliable behaviour.
Analysis
Deduplication
The Celery task workflow during recrawling is `crawl_recipe -> index_recipe`, compared to the reindexing workflow of simply `index_recipe`. Sometimes it's more convenient and performant to use the latter, especially because it does not involve any calls to the `crawler` service or (potentially) external HTTP requests, generally meaning that it can be much faster.
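To make the difference between the two workflows concrete, here is a minimal sketch in plain Python; the function names mirror the worker methods mentioned above, but the signatures and return values are illustrative assumptions, not the project's actual API (the real tasks are Celery workers).

```python
def crawl_recipe(url):
    # Stand-in for the crawler-service call: fetches the page at the given
    # URL and returns a recipe document. (Hypothetical return shape.)
    return {"url": url, "title": "example"}

def index_recipe(recipe):
    # Stand-in for writing the recipe into the search index.
    return {"indexed": recipe["url"]}

def recrawl(url):
    # Recrawling workflow: crawl first, then index the crawled result.
    return index_recipe(crawl_recipe(url))

def reindex(recipe):
    # Reindexing workflow: index directly from stored data, with no crawler
    # involvement -- faster, but it bypasses any checks crawl_recipe performs.
    return index_recipe(recipe)
```

The key point is that anything enforced only in `crawl_recipe` is silently skipped on the `reindex` path.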
However, it's difficult to strictly enforce the set of recipes that may be selected for reindexing by operators, and we should make the backend robust against mistakes there (because they will happen). This implies that we should move the de-duplication logic (which detects duplicates and remaps the `src` and `dst` on the `Recipe` so that duplicates merge) into the `index_recipe` worker method.
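A minimal sketch of the proposed change, with duplicate detection performed inside `index_recipe` itself so that both workflows pass through it. The in-memory `index` dict, the `id` field, and the exact merge semantics are assumptions for illustration; the real implementation would query the data model.

```python
def index_recipe(index, recipe):
    # Before inserting, look for an already-indexed recipe that resolves to
    # the same destination URL. If one exists, remap this recipe's src and
    # identity onto it so the two entries merge rather than appearing as
    # duplicates in the search index. (Merge semantics are a sketch.)
    for existing in index.values():
        if existing["dst"] == recipe["dst"] and existing["id"] != recipe["id"]:
            recipe["src"] = existing["src"]
            recipe["id"] = existing["id"]
            break
    index[recipe["id"]] = recipe
    return recipe
```

Because the check lives in `index_recipe`, an operator who selects an overly broad set of recipes for reindexing can no longer introduce duplicates.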
Refactoring
It may make sense to relocate the `find_earliest_crawl` and `find_latest_crawl` utility methods from `RecipeURL` to `Recipe`, and to always drive the latter based on the resolved URL of the recipe.
The goals here are multiple, roughly in this priority order:
- To ensure that latest-and-earliest URL lookups for a recipe are handled correctly, robustly, and with the latest-known information at indexing time, so that recipe duplication does not occur.
- To minimize the number of database queries required to perform the lookups (currently this requires query/construction of a `RecipeURL` object, for example, and the lookups themselves query the `CrawlURL` model).
- To organize the code in a comprehensible way.
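The relocation described above might look roughly like the following. This is a sketch under stated assumptions: the `CrawlURL` fields, the crawl-history argument, and the `dst` attribute on `Recipe` are illustrative stand-ins for the real models, and the real lookups would query the `CrawlURL` model rather than filter an in-memory list.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CrawlURL:
    url: str           # the URL that was requested during the crawl
    resolves_to: str   # the final URL the crawl resolved to (assumed field)
    crawled_at: int    # crawl timestamp (assumed field)

@dataclass
class Recipe:
    dst: str  # resolved URL of the recipe

    def _crawls(self, history: List[CrawlURL]) -> List[CrawlURL]:
        # All crawl records whose resolved URL matches this recipe's resolved
        # URL -- the lookup is driven by the recipe, not by a RecipeURL object.
        return [c for c in history if c.resolves_to == self.dst]

    def find_earliest_crawl(self, history: List[CrawlURL]) -> Optional[CrawlURL]:
        crawls = self._crawls(history)
        return min(crawls, key=lambda c: c.crawled_at) if crawls else None

    def find_latest_crawl(self, history: List[CrawlURL]) -> Optional[CrawlURL]:
        crawls = self._crawls(history)
        return max(crawls, key=lambda c: c.crawled_at) if crawls else None
```

With both lookups anchored to the recipe's resolved URL, either the oldest-known or latest-known crawl URL can be requested and the result stays consistent.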