Move datasets to delete first in line #261
jbrown-xentity wants to merge 2 commits into ckan:master
Conversation
We have reports at data.gov of datasets that get re-harvested with an extra `1` appended to the URL, and we have confirmed these reports. The harvester does its best to work out whether an incoming dataset is new, but it still fails in some circumstances. This change probably won't fix the underlying bug, but it will mitigate it: by running the dataset removals first, if the spatial harvester is essentially doing a "delete and add" when it should be replacing, the name of the new dataset won't collide with the one that is marked for deletion but still in the system. That keeps the URL the same and breaks fewer workflows.
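The reordering described above can be sketched as follows. This is a minimal illustration of the idea, not the harvester's actual code; the object shape and field names are hypothetical:

```python
def order_harvest_objects(objects):
    """Return harvest objects with deletions first.

    `objects` is a list of dicts like {"name": ..., "action": ...},
    where action is "delete", "create", or "update" (hypothetical shape).
    Processing deletions first means a replacement dataset can reuse
    the old name instead of colliding and getting a "1" suffix.
    """
    deletions = [o for o in objects if o["action"] == "delete"]
    others = [o for o in objects if o["action"] != "delete"]
    return deletions + others
```

Within each group the original order is preserved, so only the delete/create relationship changes.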
@jbrown-xentity It's been a long time since I worked on this, but IIRC the harvesters call
If the harvester is managing the datasets in ckan, then the harvest source should be the "source of truth". If so, we shouldn't need the "revive" capability that comes from soft-deleting packages/datasets in ckan, and I propose we actually purge the dataset within ckan. Since it's difficult or nearly impossible to track these files without a unique id, the harvester will sometimes delete and create a new item when the WAF or files change in any way. Purging would keep that behind the scenes and let the end user reach the same dataset at the old URL.
@amercader no, I believe you're right: we would need to purge the dataset. I forgot about that functionality. I believe we actually should be purging; I don't see a likely scenario where a user would want to keep or "revive" a dataset that was harvested and has been removed from the source. I updated the PR to use the "purge" command instead of "delete".
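The distinction behind this change: `package_delete` only marks a dataset as deleted (state='deleted'), so its name stays taken in the database, while `dataset_purge` removes it entirely and frees the name. A minimal sketch of what a purge call against CKAN's action API looks like; the site URL and API key here are placeholders, not real values:

```python
import json

CKAN_URL = "https://catalog.example.gov"  # placeholder site URL


def purge_request(dataset_id, api_key):
    """Build the pieces of a CKAN dataset_purge API call.

    Unlike package_delete, which soft-deletes and keeps the name
    reserved, dataset_purge removes the dataset outright, so a
    re-harvested replacement can reuse the old name (and URL).
    """
    return {
        "url": f"{CKAN_URL}/api/3/action/dataset_purge",
        "headers": {
            "Authorization": api_key,  # purging requires sysadmin rights
            "Content-Type": "application/json",
        },
        "body": json.dumps({"id": dataset_id}),
    }
```

Sending this with any HTTP client (e.g. `requests.post`) against a real CKAN instance would purge the dataset; note that a purge is irreversible.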
I'm experiencing a problem after having purged a harvested dataset. I think the purge may eventually need to take care of the harvest object too, or (since core can't depend on an extension) we have to provide purge support for the harvest object table.