Automate a flush of the name lookup cache #1059

timrobertson100 · 2024-04-26T14:08:24Z

(Not a pipeline specific issue, but tracking in this repo as it's connected to pipeline function)

Now that the name lookup cache holds decorated lookup records (e.g. IUCN) and supports the short-circuit lookup by scientificNameID etc., we need to flush the cache. It can go stale when the IUCN Red List, the backbone, or any checklist configured for short-circuit ID lookup (e.g. WoRMS LSIDs) changes.

portal-feedback/#5239 is an example of cached responses causing confusion.

I suggest simply running truncate_preserve on the HBase table weekly (edit: or daily) after verifying that nothing is running.
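
A minimal sketch of the scheduled flush suggested here, assuming a hypothetical table name (`name_usage_kv`) and caller-supplied hooks for "is anything running?" and "run this shell statement"; none of these names come from this issue. `truncate_preserve` is the HBase shell command that empties a table while keeping its region boundaries. The sketch only builds the statement and gates it on the idle check, so how it is scheduled and executed stays open.

```python
from typing import Callable


def build_flush_statement(table: str) -> str:
    """Return the HBase shell statement that empties `table` but keeps its region splits."""
    return f"truncate_preserve '{table}'"


def flush_if_idle(table: str,
                  anything_running: Callable[[], bool],
                  run_statement: Callable[[str], None]) -> bool:
    """Flush the cache table only when no job is using it.

    Both callables are hypothetical hooks: `anything_running` would ask the
    scheduler whether pipelines are active, and `run_statement` would pass
    the statement to an HBase shell.
    """
    if anything_running():
        return False  # skip this run; the next scheduled run retries
    run_statement(build_flush_statement(table))
    return True
```

Run from a daily or weekly scheduler, a skipped flush costs little because the next run tries again.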

mdoering · 2024-04-26T15:39:06Z

Alternatively we could create a new RabbitMQ message and a flushing listener, and emit these messages from the CLB CLIs, to be configured with a list of dataset keys to be flushed.

timrobertson100 · 2024-04-27T06:16:23Z

If we do the message approach, I'd suggest the subscriber decide if the cache should be flushed, as it's the pipeline that decides if the short circuit IDs should be used, not CLB.

Knowing we're moving off this edition of CLB and it'll all change soon, I wonder if it'd be sufficient to simply flush the cache daily knowing the lookup is now fast so rebuilding is not a big cost. What do you think?

muttcg · 2024-04-27T07:58:15Z

Other options:

- Add an extra flag when we process a dataset: ignore/repopulate cache
- Add a lifetime for each record in the cache

There are obvious pros and cons, and questions about data consistency.
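
Both options can be sketched in one small in-memory model. This is an illustration only, not the HBase-backed cache: the class name, the `ignore_cache` flag and the `loader` callback are hypothetical. Each record stores the time it was written; a read reloads the record when it is older than the lifetime or when the caller asks to bypass the cache.

```python
import time
from typing import Callable, Dict, Tuple


class ExpiringLookupCache:
    """In-memory stand-in for a lookup cache with a per-record lifetime."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str, loader: Callable[[str], str],
            ignore_cache: bool = False) -> str:
        """Return the cached value, reloading if expired or if `ignore_cache` is set."""
        now = self._clock()
        entry = self._entries.get(key)
        if not ignore_cache and entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh enough: serve from cache
        value = loader(key)  # hypothetical call to the real name lookup service
        self._entries[key] = (now, value)  # repopulate with a new timestamp
        return value
```

HBase column families also accept a TTL attribute, so the lifetime option might be achievable through table configuration rather than application code. The consistency question remains either way: records written at different times can reflect different checklist versions.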

mdoering · 2024-04-27T10:15:40Z

I wonder if we would not face the same problems in the future too. We will still integrate COL, IUCN, WoRMS and maybe more checklists supplying identifiers, just through a different system. But we'll be doing this more regularly, on a monthly basis, so that might simply be the time to flush and start from scratch without the need to inform about changed lists?

Actually we have the messaging already in place: the CLB CLIs emit a ChecklistSyncedMessage once done. The only thing needed would be a listener that knows which datasets to watch out for and then flushes the cache.
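
A rough sketch of such a listener, with the subscriber deciding whether to flush, as suggested earlier in the thread. The fields of the real ChecklistSyncedMessage are not shown in this issue, so the `ChecklistSynced` stand-in, its `dataset_key` field and the `flush` callback are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class ChecklistSynced:
    """Stand-in for ChecklistSyncedMessage; only a dataset key is modeled."""
    dataset_key: str


class CacheFlushListener:
    """Flushes the lookup cache when a watched checklist finishes syncing."""

    def __init__(self, watched: FrozenSet[str], flush: Callable[[], None]):
        self._watched = watched  # dataset keys the pipeline short-circuits on
        self._flush = flush      # hypothetical hook, e.g. a table truncate

    def handle(self, message: ChecklistSynced) -> bool:
        """Return True if the message triggered a flush, False if it was ignored."""
        if message.dataset_key not in self._watched:
            return False  # a sync of an unrelated checklist leaves the cache alone
        self._flush()
        return True
```

Keeping the watched keys on the listener side means the CLB CLIs need no knowledge of which datasets the pipeline treats as short-circuit sources.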
