Incremental Algolia Indexing
============================


Status
------
Draft


Context
-------
The Enterprise Catalog Service produces an Algolia-based search index of its Content Metadata and Course Catalog
database. This index is rebuilt at least nightly from a compendium of content records, resulting in a wholesale
replacement of the prior Algolia index. The rebuild job is time-consuming and memory-intensive. It also relies heavily
on separate but required processes responsible for retrieving filtered subsets of content from external sources of
truth, primarily Course Discovery, where synchronous tasks must be run regularly and in a specific order. The result
is a brittle system: each run is either entirely successful or entirely unsuccessful.


Solution Approach
-----------------
The goals of the new approach include:

- Implement new tasks that run alongside and augment the existing indexer until we are able to cut over entirely.
- Support all current metadata types, though they do not all need to be supported on day one.
- Support multiple methods of triggering: event bus, on-demand from Django admin, on a schedule, from the existing
  update_content_metadata job, etc.

  - Invocation of the new indexing process should not rely on separate processes being run synchronously beforehand.

- Achieve a higher parallelization factor, i.e. one content item per celery task worker (with no task group
  coordination required).
- Provide a content-oriented method of determining content catalog membership that is not reliant on external services.


Decision
--------
We want to follow updates to content with individual, incremental updates to Algolia. To do this we will both create
new functionality and reuse existing pieces of our Algolia indexing infrastructure.

First, the existing indexing process begins by executing catalog queries against ``search/all`` to determine which
courses exist and which catalogs they belong to. In order for incremental updates to work we first need to provide the
opposite semantic: the ability to determine catalog membership from a given course (rather than courses from a given
catalog). We can make use of the new ``apps.catalog.filters`` Python implementation, which can take a catalog query
and a piece of content metadata and determine whether the content matches the query (without the use of Course
Discovery).
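
As a rough illustration, a reverse-membership helper might look something like the sketch below. The
``does_query_match_content`` function name and the ``content_filter``/``json_metadata`` field names are assumptions
about the filter module and the existing models, not confirmed APIs.

.. code-block:: python

    from enterprise_catalog.apps.catalog import filters
    from enterprise_catalog.apps.catalog.models import CatalogQuery


    def catalog_queries_matching_content(content_metadata):
        """
        Return the catalog queries whose filter matches a single ContentMetadata
        record, without calling course-discovery's search/all endpoint.
        """
        return [
            catalog_query
            for catalog_query in CatalogQuery.objects.all()
            # Assumed helper: evaluates one content record against one catalog
            # query's content filter, entirely within this service.
            if filters.does_query_match_content(
                catalog_query.content_filter,
                content_metadata.json_metadata,
            )
        ]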

Second, we need to address the way in which, and the moments when, we invoke the indexing process. Previously, the
bulk indexing logic relied on a completely separate task completing synchronously: in order to bulk index, content
records needed to be bulk updated. The update_content_metadata job's purpose is twofold: first, to ingest content
metadata from external service providers and standardize its format and enterprise representation; and second, to
build associations between those metadata records and customer catalogs by way of catalog query inclusion. Once this
information is entirely read and saved within the catalog service, the system is then ready to snapshot the state of
content in the form of Algolia objects and entirely rebuild and replace our Algolia index.

This "first A, then B" approach to wholesale rebuilding our indices is time and resource intensive, as well as brittle
and prone to outages. The system is also slow to fix should a partial or full error occur, because everything must be
rerun in a specific order.

To remediate these symptoms, content records will be indexed on an individual object-shard/content metadata object
basis, at the moment a record is saved to the ContentMetadata table. Tying the indexing process to the model's
``post_save()`` signal decouples the task from any other time-consuming bulk job. To combat redundant or unneeded
requests, the record will be evaluated on two levels before an indexing task is kicked off: first, the content's
metadata (``modified_at``) must be newer than what was previously stored; second, the content must have associations
with catalog queries within the service.
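
As a sketch of this trigger, a signal receiver along the following lines could enqueue one Celery task per saved
record. The ``index_content`` task, the ``has_newer_modified_timestamp`` helper, and the ``catalog_queries`` related
name are illustrative assumptions, not existing APIs.

.. code-block:: python

    from django.db.models.signals import post_save
    from django.dispatch import receiver

    from enterprise_catalog.apps.catalog.models import ContentMetadata
    # Hypothetical single-record indexing task (one content item per worker).
    from enterprise_catalog.apps.api.tasks import index_content


    @receiver(post_save, sender=ContentMetadata)
    def enqueue_incremental_indexing(sender, instance, **kwargs):
        """
        Enqueue a single-record Algolia indexing task whenever a ContentMetadata
        row is saved, guarded against redundant work.
        """
        # Guard 1 (hypothetical helper): skip saves where the record's
        # ``modified_at`` has not advanced past what was previously stored.
        if not instance.has_newer_modified_timestamp():
            return
        # Guard 2 (assumed related name): skip content with no catalog query
        # associations, since it cannot appear in any catalog.
        if not instance.catalog_queries.exists():
            return
        # One content item per celery worker; no task group coordination needed.
        index_content.delay(instance.content_key)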

In order to incrementally update the Algolia index we need to introduce the ability to replace individual
object-shard documents in the index (today we simply replace the whole index). This can be implemented by creating
methods to determine which Algolia object-shards exist for a piece of content. Once we have the relevant IDs we can
determine whether a create, update, or delete is required, and can reuse the existing processes that bulk-construct
our Algolia objects, only on an individual basis. For simplicity's sake, an update will likely be a delete followed by
the creation of new objects.
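
A minimal sketch of the per-content replacement, assuming the ``algoliasearch`` Python client (v2/v3 API); the index
name, objectIDs, and object payloads below are placeholders standing in for the shard-discovery and per-record
object-construction helpers described above.

.. code-block:: python

    from algoliasearch.search_client import SearchClient


    def replace_object_shards(algolia_index, existing_object_ids, new_objects):
        """
        Replace the Algolia object-shards for one piece of content: delete the
        shards currently in the index, then save freshly built replacements.
        """
        if existing_object_ids:
            algolia_index.delete_objects(existing_object_ids)
        if new_objects:
            algolia_index.save_objects(new_objects)


    # Usage sketch with placeholder credentials, index name, and payloads.
    client = SearchClient.create('APP_ID', 'API_KEY')
    index = client.init_index('enterprise_catalog')
    replace_object_shards(
        index,
        existing_object_ids=['course-1234-catalog-uuids-0', 'course-1234-catalog-uuids-1'],
        new_objects=[{'objectID': 'course-1234-catalog-uuids-0', 'title': 'Example Course'}],
    )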

Incremental updates, through the act of saving individual records, will need to be triggered by something - such as
polling Course Discovery for updated content, consuming event-bus events, and/or triggering from a nightly Course
Discovery crawl or a Django Admin button. However, it is not the responsibility of the indexer, nor of this ADR, to
determine when those events should occur; the indexing process should be able to handle updates from any source of
content metadata record changes.
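
For example, an on-demand Django admin action (one of the triggering methods listed above) might simply enqueue the
per-record task for each selected row; the action and task names here are illustrative only.

.. code-block:: python

    from django.contrib import admin

    from enterprise_catalog.apps.catalog.models import ContentMetadata
    # Hypothetical single-record indexing task from the earlier sketch.
    from enterprise_catalog.apps.api.tasks import index_content


    @admin.action(description='Re-index selected content in Algolia')
    def reindex_selected_content(modeladmin, request, queryset):
        # Enqueue one incremental indexing task per selected ContentMetadata record.
        for record in queryset:
            index_content.delay(record.content_key)


    class ContentMetadataAdmin(admin.ModelAdmin):
        actions = [reindex_selected_content]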


Consequences
------------
Ideally this incremental process will allow us to provide a closer-to-real-time index using fewer resources. It will
also give us more flexibility to include non-course-discovery content in catalogs, because we will no longer rely on a
query to course-discovery's ``search/all`` endpoint and will instead rely on the metadata records in the catalog
service, regardless of their source.


Alternatives Considered
-----------------------
No alternatives were considered.