importer: for scaling, process individual sources concurrently #2862

andrewpollock · 2024-11-13T03:47:55Z

Problem statement
The importer currently processes each configured source serially, every 15 minutes. When no source has more than a handful of new records, this is fine, because each source is quick to evaluate and import from. However, many of these sources are quite slow to process if a full reimport is required, and this causes a run to blow out substantially. Sources later in the list are penalized because they are stuck behind the slow source that is being reimported. As the number of sources continues to grow, serial processing will inherently become slower and more prone to this problem.

Describe the solution you'd like
Instead, process the sources concurrently, to avoid the problems described above.

Describe alternatives you've considered
In varying degrees of complexity with various tradeoffs:

- rely on Kubernetes to run the importer on each source as its own discrete Pod
- have the importer spawn each source as its own child process
  - potentially have a worker pool type of architecture to avoid having all the sources run in parallel all at once. This will become particularly important as the number of sources continues to grow.
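
As a rough illustration of the worker pool idea, here is a minimal sketch using Python's standard `concurrent.futures`. It is not the importer's actual code: `process_source`, `run_import`, the `sources` mapping and the `max_workers` value are all hypothetical names chosen for the example, and it uses threads rather than child processes to keep the sketch short. `ProcessPoolExecutor` exposes the same `submit` interface if process isolation is preferred.

```python
import concurrent.futures
import time
from collections.abc import Callable


def process_source(name: str, delay: float) -> None:
    """Hypothetical stand-in for importing one source; sleeps to simulate work."""
    time.sleep(delay)


def run_import(
    sources: dict[str, float],
    worker: Callable[[str, float], None] = process_source,
    max_workers: int = 4,
) -> dict[str, str]:
    """Process every source on a bounded pool and report a status per source."""
    results: dict[str, str] = {}
    # max_workers caps how many sources run at once (assumed value, tune as needed).
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(worker, name, delay): name
            for name, delay in sources.items()
        }
        # Handle sources in completion order, so a slow source delays nobody else.
        for future in concurrent.futures.as_completed(futures):
            name = futures[future]
            try:
                future.result()
                results[name] = "ok"
            except Exception as exc:  # one failing source must not abort the run
                results[name] = f"failed: {exc}"
    return results


if __name__ == "__main__":
    # One slow "full reimport" next to two fast sources.
    print(run_import({"slow-source": 0.5, "fast-a": 0.01, "fast-b": 0.01}, max_workers=2))
```

With a pool size of 2, the two fast sources finish on the second worker while the slow one is still running, instead of queuing behind it as they would in a serial loop.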

andrewpollock added the enhancement and backlog labels Nov 13, 2024
