Automate a flush of the name lookup cache #1059

Open
timrobertson100 opened this issue Apr 26, 2024 · 4 comments

Comments

@timrobertson100
Member

timrobertson100 commented Apr 26, 2024

(Not a pipeline-specific issue, but tracking it in this repo as it's connected to pipeline function)

Now that the name lookup cache holds decorated records of lookups (e.g. IUCN) and has the short-circuit lookup on scientificNameID etc., we need to flush the cache. It can go stale on changes in the IUCN Red List, the backbone, or any checklist configured to short-circuit with an ID lookup (e.g. WoRMS LSIDs).

portal-feedback/#5239 is an example of cached responses causing confusion.

I suggest simply running truncate_preserve on the HBase table weekly (edit: or daily) after verifying that nothing is running.
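
A minimal sketch of that guarded flush, in Python for brevity. The table name `name_lookup_cache` and the `is_busy` and `run` callables are hypothetical placeholders: how to detect running pipelines and how to reach the HBase shell depend on the deployment. `truncate_preserve` is the HBase shell command that truncates a table while keeping its region boundaries.

```python
from typing import Callable


def build_flush_statement(table: str) -> str:
    """Build the HBase shell statement that empties `table` but keeps its region splits."""
    if not table or "'" in table:
        raise ValueError(f"unsafe table name: {table!r}")
    return f"truncate_preserve '{table}'"


def flush_if_idle(
    table: str,
    is_busy: Callable[[], bool],  # hypothetical check, e.g. "are any pipelines running?"
    run: Callable[[str], None],   # hypothetical executor, e.g. pipes the statement to the HBase shell
) -> bool:
    """Flush the cache table only when nothing is running; return True if flushed."""
    if is_busy():
        return False  # skip this run; the next scheduled run will try again
    run(build_flush_statement(table))
    return True
```

Run on a daily or weekly schedule, a skipped flush simply waits for the next slot, so a long-running job delays the flush rather than racing it.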

@mdoering
Member

Alternatively, we could create a new RabbitMQ message and a flushing listener, and emit these messages from the CLB CLIs, to be configured with a list of dataset keys to be flushed.

@timrobertson100
Member Author

If we do the message approach, I'd suggest the subscriber decide if the cache should be flushed, as it's the pipeline that decides if the short circuit IDs should be used, not CLB.

Given that we're moving off this edition of CLB and it'll all change soon, I wonder if it'd be sufficient to simply flush the cache daily, since the lookup is now fast and rebuilding is not a big cost. What do you think?

@muttcg
Member

muttcg commented Apr 27, 2024

Other options:

  1. Add an extra flag when we process a dataset: ignore/repopulate the cache
  2. Add a lifetime for each record in the cache
    There are obvious pros and cons, and questions about data consistency.
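
Both options can be sketched together. This is an in-memory stand-in, not the HBase-backed cache: `TtlCache`, `loader` and the `refresh` flag are hypothetical names, with `refresh=True` playing the role of the "ignore/repopulate" flag and `ttl_seconds` the record lifetime.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCache:
    """In-memory stand-in for the lookup cache with a per-record lifetime."""

    def __init__(
        self,
        loader: Callable[[str], str],  # hypothetical: performs the real name lookup
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # key -> (value, stored_at)

    def get(self, key: str, refresh: bool = False) -> str:
        """Return the cached value unless it is missing, expired, or refresh is set."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and not refresh and now - entry[1] < self._ttl:
            return entry[0]
        value = self._loader(key)  # fall through to the real lookup
        self._entries[key] = (value, now)
        return value
```

HBase column families also accept a TTL attribute, which might cover option 2 without application code. The consistency question remains either way: records cached at different times can reflect different checklist versions until they all expire.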

@mdoering
Member

I wonder if we would not face the same problems in the future too. We will still integrate COL, IUCN, WoRMS and maybe more checklists supplying identifiers, just through a different system. But we'll be doing this more regularly, on a monthly basis, so that might simply be the time to flush and start from scratch, without the need to inform about changed lists?

Actually, we have the messaging already in place: the CLB CLIs emit a ChecklistSyncedMessage once done. The only thing needed would be a listener that knows which datasets to watch for and then flushes the cache.
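
A rough sketch of such a listener, in Python for brevity. `SyncedEvent` is a hypothetical stand-in for ChecklistSyncedMessage, and the only thing assumed about it is that it carries a dataset key; `CacheFlushListener` and the `flush` callable are likewise illustrative. The watched list lives on the subscriber side, so the pipeline stays in charge of deciding which checklists matter.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class SyncedEvent:
    """Hypothetical stand-in for ChecklistSyncedMessage; only a dataset key is assumed."""
    dataset_key: str


class CacheFlushListener:
    """Flushes the lookup cache when a watched checklist has been synced."""

    def __init__(self, watched: FrozenSet[str], flush: Callable[[], None]) -> None:
        self._watched = watched  # dataset keys configured on the pipeline side
        self._flush = flush      # hypothetical hook, e.g. truncates the cache table

    def on_message(self, event: SyncedEvent) -> bool:
        """Return True if the event triggered a flush."""
        if event.dataset_key not in self._watched:
            return False  # unrelated checklist; keep the cache as it is
        self._flush()
        return True
```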


3 participants