Push Solr Indexed Flavors into a Frictionless data package #14

DiegoPino · 2020-12-07T14:30:42Z

Why?

Our Strawberryfield data source is totally virtual. During a processing chain we use local key/value storage to let Search API fetch the recently ingested data. But for a longer/complete reindex we want to have that data in a more stable place, especially for longer running/expensive operations like HOCR.

The logic we want is that, after a Processor's output has been tracked, we push the data into a (new or existing) Frictionless data package file managed by us. The idea is that if the file exists and the content for a certain Flavor ID is already inside, we update it; if not, we create the package and add the content.

Then the Flavor data source can always try to fetch from the less expensive key/value store first and, if nothing is found there, check whether the Node itself has one of the packages corresponding to the same FLV ID.

Flavors indexed into Solr have the following ID pattern (Flavor ID):
"ss_search_api_id":"strawberryfield_flavor_datasource/2017:1:en:1d9ae1cd-b3d0-477c-8061-313bb1bc9273:ocr",
Which means:
strawberryfield_flavor_datasource => the data source
2017 => the Node ID
1 => the sequence (remember this is one Node to many files to many sequences)
1d9ae1cd-b3d0-477c-8061-313bb1bc9273 => the File UUID that was processed
ocr => the Plugin type that generated this
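
To make the pattern concrete, here is a minimal Python sketch that splits a Flavor ID into the parts listed above. It is only an illustration, not the module's actual code: the `FlavorId` class and `parse_flavor_id` function are hypothetical names, and the `en` segment (which is not explained above) is assumed to be a language code.

```python
from dataclasses import dataclass
from uuid import UUID


@dataclass(frozen=True)
class FlavorId:
    datasource: str
    node_id: int
    sequence: int
    langcode: str  # assumption: the "en" segment is a language code
    file_uuid: str
    plugin: str


def parse_flavor_id(search_api_id: str) -> FlavorId:
    """Split a Flavor ID into its parts, raising ValueError if it is malformed."""
    datasource, sep, raw = search_api_id.partition("/")
    if not sep:
        raise ValueError(f"missing data source prefix: {search_api_id!r}")
    parts = raw.split(":")
    if len(parts) != 5:
        raise ValueError(f"expected 5 colon-separated parts, got {len(parts)}")
    node_id, sequence, langcode, file_uuid, plugin = parts
    UUID(file_uuid)  # raises ValueError when the UUID is not well formed
    return FlavorId(datasource, int(node_id), int(sequence), langcode, file_uuid, plugin)
```

Parsing the example ID above would give `node_id == 2017`, `sequence == 1` and `plugin == "ocr"`.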

Depending on how well I can deal with issue esmero/strawberryfield#115, we may want to have many Frictionless Data Packages or a single one.

The operation would be (buggy pseudocode):

Post processor output (Flavor) is tracked into the index // already do this
Post processor checks whether the Node (source) already has a data package for that FLV type (e.g. ocr)
If yes, it checks the manifest.json of the ZIP; if the same Flavor ID is already there it replaces it, otherwise it adds it
If not, it creates the data package and initializes it, adds the first post processor output and attaches it to the Node.
This happens for every sequence, etc.
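
The update-or-create step above could look roughly like this in Python. This is a sketch under stated assumptions, not the planned implementation: only the `manifest.json` name comes from the steps above. The manifest layout (a `flavors` map from Flavor ID to member name), the `data/` member naming and the `upsert_flavor` function are made up for illustration. Attaching the ZIP to the Node is left out.

```python
import json
import os
import tempfile
import zipfile

MANIFEST = "manifest.json"  # name from the issue text; the layout used here is assumed


def upsert_flavor(zip_path: str, flavor_id: str, payload: str) -> bool:
    """Add or replace one Flavor in the package. Returns True if it was replaced."""
    entries = {}  # member name -> bytes for everything already in the package
    manifest = {}
    if os.path.exists(zip_path):
        with zipfile.ZipFile(zip_path) as old:
            for name in old.namelist():
                entries[name] = old.read(name)
        if MANIFEST in entries:
            manifest = json.loads(entries[MANIFEST])
    flavors = manifest.setdefault("flavors", {})

    replaced = flavor_id in flavors
    # Hypothetical member naming: avoid "/" and ":" inside ZIP member names.
    member = "data/" + flavor_id.replace("/", "_").replace(":", "_") + ".json"
    flavors[flavor_id] = member
    entries[member] = payload.encode("utf-8")
    entries[MANIFEST] = json.dumps(manifest, indent=2).encode("utf-8")

    # zipfile cannot overwrite a member in place, so the archive is rewritten
    # to a temporary file and then swapped in.
    directory = os.path.dirname(os.path.abspath(zip_path))
    fd, tmp_path = tempfile.mkstemp(suffix=".zip", dir=directory)
    os.close(fd)
    try:
        with zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as new:
            for name, data in entries.items():
                new.writestr(name, data)
        os.replace(tmp_path, zip_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return replaced
```

Rewriting the whole archive on every update is simple but gets expensive for large packages, which is one reason the choice between many packages or a single one matters.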

On reindexing/indexing/update from Search API:

We get a Flavor ID. // already do this
We check if the pattern makes sense and validate the data // already do this
We check if the ID is in the key/value store // already do this
If yes -> great, we add the data again // already do this
If no -> we check if the Node has the data package and it contains the Flavor ID; if so, we fetch the data, rebuild the data structure Search API needs (because it's more than just the HOCR) and pass that back to Search API
If neither has it, it does not exist anymore: the processing was deleted or the original files are gone, and the Solr document is removed.
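
The lookup order above can be sketched in Python as a simple fallback chain. This is illustrative only: the `load_flavor` function, the plain dict standing in for the key/value store, and the manifest layout (a hypothetical `flavors` map from Flavor ID to ZIP member name inside `manifest.json`) are all assumptions.

```python
import json
import zipfile
from typing import Optional


def load_flavor(flavor_id: str, key_value: dict, package_path: Optional[str]) -> Optional[dict]:
    """Return the Flavor data, or None when the Solr document should be removed."""
    # 1. Cheap path: the key/value store filled during the processing chain.
    if flavor_id in key_value:
        return key_value[flavor_id]
    # 2. Stable path: the data package attached to the Node, if there is one.
    if package_path is not None:
        try:
            with zipfile.ZipFile(package_path) as package:
                manifest = json.loads(package.read("manifest.json"))
                member = manifest.get("flavors", {}).get(flavor_id)
                if member is not None:
                    return json.loads(package.read(member))
        except (FileNotFoundError, KeyError, zipfile.BadZipFile):
            pass  # missing or broken package: treat it as not found
    # 3. Neither place has it: the caller removes the Solr document.
    return None
```

Rebuilding the full Search API data structure from the returned data is a separate step that this sketch does not cover.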

@giancarlobi ideas/thoughts?

giancarlobi · 2020-12-08T13:58:29Z

@DiegoPino some "philosophical" thoughts about this, with the premise that I agree with all you wrote and that I probably have to analyze the single steps more deeply.
Archipelago also has to satisfy long term preservation requirements, I think, so which piece of our architecture is the most reliable for preserving data? Obviously the storage (filesystem/S3/...), which can be mirrored and backed up with all those tricks that make storage last "forever without data loss".
Which pieces of our architecture are less reliable? For many reasons, I think that the MySQL DB (Drupal) and the Solr data, in addition to the servers (but that is another topic), are the pieces that can lose data, and they are more complex to back up (where? to storage, so ...).
Conclusion: we have to be ready to restore MySQL and Solr from the data in storage.
This involves the data package and all you wrote above: we need to store SBF-JSONs, flavor data and Solr docs (or something that allows us to reindex Solr) in a data package that will allow us to rebuild the whole Archipelago if something goes wrong.
I know this is the idea, not code, and I know that we need code to make this work, so take this as a rainy day thought.

DiegoPino · 2020-12-08T14:15:43Z

@giancarlobi I totally agree. Interestingly enough, we have almost all the data; we just need to code it, and since we are building AMI this may be a good moment to start doing that. PS: later this week I will copy your thoughts and this post into their own issues to complement what is missing:

Original Data:

We have: the DOStore folder in the persistent storage, with every file keyed by its checksum. All of these are simple full dumps of the Node and the pure JSON data.
We are missing:
- Right now we only deposit on Entity Insert events. Making the same happen on Update (Save event) is easy, but it needs to be done. We need to keep track of every change to an object.
- We can traverse every folder and reinsert, but there is something tricky: our top level JSON keys that refer to files store File Entity IDs (numeric). In a full restore those are not very useful (we cannot ask Drupal to respect them), so we need to use the as:document, etc. structures that we have there to recreate the file entities from storage using the given UUIDs, and then replace the existing IDs in the JSON with the newly created ones. It's not complex, but it adds some overhead and the logic needs to be perfect. We may also want to do fixity checks while doing this. Eventually we can also use Frictionless Data packages to generate a ZIP with the metadata JSON and all the corresponding files, so objects can be shared between repositories or put into cheap long term storage.
We really want to make ID to UUID to ID and back a service that can be used for anything. Entity to Entity relationships are the most complex topic in Drupal.
We may want to have time based restores, and also item level restores, all via the UI.
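
As a rough illustration of the file ID remapping described above, here is a Python sketch. It assumes each as:* structure is a map whose entries carry the old numeric ID under `dr:fid` and the file UUID under `dr:uuid`; those key names, the `remap_file_ids` function and the `create_file` callback (which would recreate the file entity from storage and return its new ID) are assumptions for illustration, not confirmed by this thread.

```python
import copy
from typing import Callable


def remap_file_ids(metadata: dict, create_file: Callable[[str], int]) -> dict:
    """Return a copy of the JSON with old numeric file IDs swapped for new ones."""
    restored = copy.deepcopy(metadata)
    id_map = {}  # old numeric file ID -> newly created file ID

    # Pass 1: recreate each file from its UUID and record old -> new.
    for key, files in restored.items():
        if key.startswith("as:") and isinstance(files, dict):
            for info in files.values():
                old_id = info["dr:fid"]  # assumed key name
                new_id = create_file(info["dr:uuid"])  # assumed key name
                id_map[old_id] = new_id
                info["dr:fid"] = new_id

    # Pass 2: rewrite top level keys whose values are lists of known old IDs.
    for key, value in list(restored.items()):
        if key.startswith("as:"):
            continue
        if isinstance(value, list) and value and all(
            isinstance(v, int) and v in id_map for v in value
        ):
            restored[key] = [id_map[v] for v in value]
    return restored
```

Note that pass 2 guesses which top level keys hold file IDs purely from their values, so a list of integers that happens to match old file IDs would also be rewritten. A real service would need an explicit list of file-bearing keys.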

Solr: if we have data packages, reindexing is a breeze. It is also important that we have kill switches for the SBR processors; we do not want to re-process data when doing a full restore.

We are missing: writing the data package and reading it back. I also feel we should at least plan for the remote ZIP ranged reader at some point. We want to be efficient as hell! That is this issue: Create a remote Zip reader Class strawberryfield#115

@alliomeria @dmer any thoughts on this? All this looks like code we can get rolling quite fast but then again we have our hands so full. Should we add this to the roadmap in a more concrete fashion?

dmer · 2020-12-08T18:11:48Z

@DiegoPino I really like the ideas here (as far as I understand them :) It sounds like this would make a much more secure and robust backup (and more importantly) restore. Having time-based and/or single item level restore available via the UI would be a huge improvement on the current restore capabilities that I'm used to w/ Islandora.

As to your question about when. My main input is that I'm wanting to start working with the AMI tools asap so anything that sounds like it might delay that I'm suspicious of! - perhaps I'll be able to better answer after our briefing.

DiegoPino self-assigned this Dec 7, 2020

DiegoPino added the labels Datapackage / Frictionless (Packaging and wrapping in Xmas paper), enhancement (New feature or request), and Solr Indexing (Putting things where they can be found) on Dec 7, 2020
