Commit d1146b2: "feat: ADR for incremental algolia indexing"
Authored by johnnagro, committed by alex-sheehan-edx (1 parent: 8c5b50d). 2 files changed, +157/-0 lines.

Incremental Algolia Indexing
============================


Status
------
Draft


Context
-------
The Enterprise Catalog Service produces an Algolia-based search index of its Content Metadata and Course Catalog
database. This index is rebuilt from scratch at least nightly: a compendium of content records is assembled, and the
prior Algolia index is replaced wholesale. This job is time-consuming and memory-intensive. It also relies heavily on
separate but required processes that retrieve filtered subsets of content from external sources of truth, primarily
Course Discovery, and these synchronous tasks must be run regularly and in a specific order. The result is a brittle
system that is either entirely successful or entirely unsuccessful.


Solution Approach
-----------------
The goals of this effort include:

- Implement new tasks that run alongside and augment the existing indexer until we're able to cut over entirely.
- Support all current metadata types, though not necessarily all of them on day one.
- Support multiple triggering methods: event bus, on-demand from Django admin, on a schedule, from the existing
  ``update_content_metadata`` job, etc.
- Do not make invocation of the new indexing process reliant on separate processes being run synchronously beforehand.
- Achieve a higher parallelization factor, i.e. one content item per celery task worker, with no task-group
  coordination required.
- Provide a content-oriented method of determining content catalog membership that does not rely on external services.


Decision
--------
We want to follow updates to content with individual, incremental updates to Algolia. To do this we will both create
new functionality and reuse some existing functionality of our Algolia indexing infrastructure.

First, the existing indexing process begins by executing catalog queries against `search/all` to determine which
courses exist and which catalogs they belong to. For incremental updates to work we first need the opposite semantic:
the ability to determine catalog membership from a given course (rather than courses from a given catalog). We can
make use of the new `apps.catalog.filters` python implementation, which takes a catalog query and a piece of content
metadata and determines whether the content matches the query, without the use of course-discovery.
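
The inverted lookup described above can be sketched as follows. This is a minimal illustration assuming a dict-shaped
content filter; ``matches_query`` and ``catalog_queries_for_content`` are illustrative stand-ins, not the actual
`apps.catalog.filters` API.

```python
def matches_query(content_filter: dict, metadata: dict) -> bool:
    """Return True if every field in the catalog query's content filter is
    satisfied by the content metadata record (exact match or list membership)."""
    for field, expected in content_filter.items():
        actual = metadata.get(field)
        if isinstance(expected, list):
            # list-valued filters match if the record's value is any listed value
            if actual not in expected:
                return False
        elif actual != expected:
            return False
    return True


def catalog_queries_for_content(metadata: dict, all_queries: dict) -> set:
    """Invert the old semantic: given one content record, find which catalog
    queries (and therefore which catalogs) it belongs to."""
    return {
        query_id
        for query_id, content_filter in all_queries.items()
        if matches_query(content_filter, metadata)
    }


queries = {
    "exec-ed": {"content_type": "course", "availability": ["Current", "Upcoming"]},
    "all-programs": {"content_type": "program"},
}
course = {"content_type": "course", "availability": "Current"}
print(catalog_queries_for_content(course, queries))  # {'exec-ed'}
```

The key property is that membership is computed from a single record plus the locally stored queries, with no call to
an external service.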

Next, we need to address how, and when, we choose to invoke the indexing process. Previously, the bulk indexing logic
relied on a completely separate task completing synchronously: in order to bulk index, content records first needed to
be bulk updated. The ``update_content_metadata`` job's purpose is twofold: first, to ingest content metadata from
external service providers and standardize its format and enterprise representation; and second, to build associations
between those metadata records and customer catalogs by way of catalog query inclusion. Only once this information is
entirely read and saved within the catalog service is the system ready to snapshot the state of content in the form of
Algolia objects and entirely rebuild and replace our Algolia index.

This "first A, then B" approach to wholesale index rebuilding is time- and resource-intensive as well as brittle and
prone to outages. It is also slow to fix should a partial or full failure occur, since everything must be rerun in a
specific order.

To remediate these symptoms, content records will be indexed on an individual object-shard/content metadata object
basis, at the moment a record is saved to the ContentMetadata table. Tying the indexing process to the model's
``post_save()`` signal decouples the task from any other time-consuming bulk job. To combat redundant or unneeded
requests, a record will be evaluated on two levels before an indexing task is kicked off: first, the content's
``modified_at`` timestamp must be newer than what was previously stored; second, the content must have associations
with catalog queries within the service.
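
The two save-time guards could look roughly like the following. ``should_reindex`` is a hypothetical helper name; in
practice the check would run inside a Django ``post_save`` receiver on the ContentMetadata model before enqueuing a
celery task.

```python
from datetime import datetime
from typing import Optional


def should_reindex(
    stored_modified_at: Optional[datetime],
    incoming_modified_at: datetime,
    associated_query_ids: set,
) -> bool:
    """Decide whether a saved ContentMetadata record warrants an indexing task.

    Two guards, per the ADR:
    1. The record's ``modified_at`` must be newer than what was previously stored.
    2. The record must be associated with at least one catalog query.
    """
    is_newer = stored_modified_at is None or incoming_modified_at > stored_modified_at
    return is_newer and bool(associated_query_ids)


now = datetime(2023, 5, 1)
earlier = datetime(2023, 4, 1)
print(should_reindex(earlier, now, {"query-1"}))  # True: newer and associated
print(should_reindex(now, now, {"query-1"}))      # False: not newer
print(should_reindex(earlier, now, set()))        # False: no query associations
```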

To incrementally update the Algolia index we need to introduce the ability to replace individual object-shard
documents in the index (today we simply replace the whole index). This can be implemented by creating methods to
determine which Algolia object-shards exist for a piece of content. Once we have the relevant IDs, we can determine
whether a create, update, or delete is required, and can repurpose the existing processes that bulk-construct our
Algolia objects to operate on an individual basis. For simplicity's sake, an update will likely be a delete followed
by the creation of new objects.
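
The delete-then-create update can be sketched with an in-memory dict standing in for the real Algolia index;
``replace_content_shards`` is an assumed name, and with the actual client the same shape would map to deleting the
stale object IDs and saving the freshly built objects.

```python
def replace_content_shards(index: dict, content_key: str, new_shards: list) -> dict:
    """Replace every object-shard for one piece of content.

    1. Find the shard IDs currently indexed for this content key.
    2. Delete them.
    3. Create the freshly built shards.
    """
    stale_ids = [
        object_id for object_id, obj in index.items()
        if obj["content_key"] == content_key
    ]
    for object_id in stale_ids:
        del index[object_id]
    for shard in new_shards:
        index[shard["objectID"]] = shard
    return index


index = {
    "course-1-shard-0": {"content_key": "course-1", "catalog_uuids": ["a"]},
    "course-2-shard-0": {"content_key": "course-2", "catalog_uuids": ["b"]},
}
replace_content_shards(index, "course-1", [
    {"objectID": "course-1-shard-0", "content_key": "course-1", "catalog_uuids": ["a", "c"]},
])
print(sorted(index))  # ['course-1-shard-0', 'course-2-shard-0']
```

A delete is the same operation with an empty ``new_shards`` list; only the shards belonging to the one content key are
touched, so the rest of the index is never rebuilt.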

Incremental updates, driven by the saving of individual records, will still need to be triggered by something: polling
Course Discovery for updated content, consuming event-bus events, a nightly Course Discovery crawl, or a Django Admin
button. However, it is not the responsibility of the indexer, nor of this ADR, to determine when those events occur;
the indexing process should be able to handle updates from any source of content metadata record updates.


Consequences
------------
Ideally this incremental process will allow us to provide a closer-to-real-time index using fewer resources. It will
also give us more flexibility to include non-course-discovery content in catalogs, because we will no longer rely on a
query to course-discovery's `search/all` endpoint and will instead rely on the metadata records in the catalog
service, regardless of their source.


Alternatives Considered
-----------------------
No alternatives were considered.

Incremental Content Metadata Updating
=====================================


Status
------
Draft


Context
-------
The Enterprise Catalog Service implicitly relies on external services as sources of truth for content surfaced to
organizations within the suite of enterprise products and tools. For the most part this external source of truth has
been assumed to be the `course-discovery` service. The ``update_content_metadata`` job has relied on
`course-discovery` not only to expose the content metadata of courses, programs, and pathways, but also to determine
customer catalog associations with specific subsets of content. This means enterprise-curated content filters are
evaluated externally, as a black-box answer to which content belongs to which customers. That arrangement is
burdensome both to the catalog service, which has little control over how the underlying content-filtering logic
functions, and to the external service, which must serve redundant data for each and every query filter. If the
catalog service instead owned the responsibility of determining the associations between a single piece of content and
any of the customers' catalogs, we would only have to request all data a single time from external sources for bulk
jobs, and we could also easily support creates, updates, and deletes of single pieces of content communicated to the
catalog service on an individual basis.


Decision
--------
The existing indexing process begins by executing catalog queries against `search/all` to determine which courses
exist and which catalogs they belong to. For incremental updates to work we first need the opposite semantic: the
ability to determine catalog membership from a given piece of content (rather than courses from a given catalog). We
can make use of the new `apps.catalog.filters` python implementation, which takes a catalog query and a piece of
content metadata and determines whether the content matches the query, without the use of course-discovery.

We will implement a two-sided approach to content updating, introduced as parallel work alongside the existing
``update_content_metadata`` tasks, that can eventually replace the old infrastructure. The first method is a bulk job,
similar to the current ``update_content_metadata`` task, that queries external sources of content and updates any
mismatched records, using `apps.catalog.filters` to determine the query-content association sets. The second is an
event signal receiver that processes any individual content update events that are received. The intention is for the
majority of updates in the catalog service to happen at the moment the content is updated in its external source and
the signal is fired, with the bulk job later verifying and cleaning up should something go wrong.
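
A rough sketch of the individual-update path: a receiver applies one event as soon as it arrives, and the bulk job
reconciles anything missed later. The event shape and the ``handle_content_update_event`` name are assumptions for
illustration, not the real event-bus payload.

```python
def handle_content_update_event(store: dict, event: dict) -> None:
    """Apply a single create/update/delete event to the local metadata store."""
    key = event["content_key"]
    if event["action"] == "deleted":
        store.pop(key, None)
    else:
        # creates and updates look the same locally: upsert the latest metadata
        store[key] = event["metadata"]


store = {}
handle_content_update_event(
    store, {"action": "created", "content_key": "course-1", "metadata": {"title": "Intro"}}
)
handle_content_update_event(
    store, {"action": "updated", "content_key": "course-1", "metadata": {"title": "Intro v2"}}
)
handle_content_update_event(
    store, {"action": "deleted", "content_key": "course-1", "metadata": None}
)
print(store)  # {}
```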

While this new process will remove the need to constantly query and burden `course-discovery`'s `search/all` endpoint,
we will still most likely need to request the full metadata of each course/content object, similar to how the current
task handles that flow.

An event-receiver-based approach to individual content updates also opens up the possibility of ingesting content from
other sources of truth that are hooked up to the edX event-bus. This will make it easier for enterprise to ingest
content from many sources, instead of relying on those services first going through course-discovery.


Consequences
------------
As alluded to earlier, this change means we will no longer have to repeatedly request data from course-discovery's
`search/all` endpoint, because we won't need to rely on that service for our filtering logic; that reliance was one of
the main contributors to the long run time of the ``update_content_metadata`` task. Additionally, housing our own
filtering logic will allow us to maintain, tweak, and improve the functionality should we want additional features.
57+
58+
The signal based individual updates will also mean that we will have a significantly smaller window of lag for content
59+
updates propagating throughout the enterprise system.
60+
61+
62+
Alternatives Considered
63+
-----------------------
64+
There are a number of ways that individual content updates could be communicated to the catalog service. Event-bus
65+
based signal handling restricts the catalog service to sources of truth that have integrated with the event bus
66+
service/software. We considered instead exposing an api endpoint that would take in a content update event and process
67+
the data as needed, however it was decided that this approach is brittle and prone to losing updates in transit as
68+
it would be difficult to ensure the update was fully communicated and processed by the catalog service should anything
69+
go wrong.
