Support keeping only jobs in database (while removing all logs and test results sooner)#6603
Conversation
okurz
left a comment
There was a problem hiding this comment.
I consider the most important change actually to be in documentation. Please see http://open.qa/docs/#basic_cleanup
okurz
left a comment
There was a problem hiding this comment.
I wonder if we actually need to distinguish "storing jobs" and "storing results" on so many levels. I know that already the distinction between "logs" and "results" is not well understood for many users. Instead can we just review the cleanup algorithm and only clean actual job entries as last resort after all other relevant job results have been cleaned in a job group? Like first delete all results and if still more space is needed only then remove the job entry (in the database) itself?
|
I actually did this "documentation driven" implementing the help popover changes that ended up in Extend UI for configuring job storage duration as one of the first changes. I will of course also extend the real documentation. I just have to think how to handle existing instances where e.g. You can still review the other parts of the PR. Considering your 2nd comment I will not put anymore effort into this until we clarified what we actually want. (Although I probably won't have enough time in the spike solution for a 2nd attempt so I'd just close this PR if this approach is not wanted.)
I guess users are only exposed to this at the UI-level (and perhaps API-level but this is really just the same again from a user perspective). So I don't see how dealing with this on many levels internally will confuse users. Note that I have already added help for users in the UI in form of changes to the help popover. Users can also still keep the defaults for this new setting if they don't want to be bothered with it. The new setting is also inline with the existing ones and doesn't bring much more to the table in terms of complexity. If you think our users cannot take another form field I can remove the UI changes so this becomes a hidden feature. I wouldn't do that because I think our users can take another form field and we should be as transparent about the cleanup as possible to help users understand better how it actually works. (Maybe that's just my preference as I don't like dumbed down UIs.)
This makes no sense. The results are usually not on the same storage as the database. So if more storage on the disk where results are stored is needed deleting the job entry from the database has no effect. Even if both were on the same disk then deleting rows from the database will most likely not have an immediate effect. And even if it has an effect at some point then it is probably very little compared to what we gain by deleting results. I think this feature was intended with the opposite thought in mind: The database entry of a job needs not much disk space but results potentially a lot. So we can store much more database entries than results. It seems unnecessary to delete jobs from the database (and thus not having them anymore on dashboard pages) when we really only run out of result storage. This change helps preserving jobs in the database longer. It is not meant to delete data more aggressively. |
|
Oh, and I would really like to avoid repeating the approach we took for archiving. Here we archive jobs which are "preserved because they are important". While it works for archiving I must say that adding this kind of "condition" does add more complexity and makes things therefore harder to understand. Adding just another configurable duration is on the other hand just more of the same and doesn't add complexity and thus doesn't make things harder to understand. |
Exactly. But I think a "job group owner" neither understands this distinction nor would be able to derive what that could mean for their job group. Obviously with the configurations for "keep results" and "keep jobs" next to each other I would assume that users compare those values and try to select something sensible but they can't know what effect on available database storage that actually has.
I think "archiving" is a good example of leaving the decision to the administrators of an instance which are also the ones need to ensure that storage space is not depleted. How about a simpler approach but still leaving the choice to the administrator: Similar to our implementation for "space aware cleanup" only clean jobs from the database when it's actually necessary. As you explained it can be expected that the database are residing on different storage than results so how about two config settings for number of jobs as well as a threshold to keep free storage configured in the ini file with sane default and only remove jobs when any of those two limits are reached? |
|
This pull request is now in conflicts. Could you fix it? 🙏 |
Users and admins would both need to play with these settings to know the effect on available database storage. I think the current approach is the following (with the existing retentions and the new one would not change that): Users put in values that make sense for them without knowing whether it is too much. They might also just leave the defaults we put in there as admins if they don't really care about this. If we find that retentions are too long we need to reduce them from our side in coordination with users. So these settings aren't just for users but also for us as admins. And with the newly introduced setting we would have the chance of preserving jobs at least longer in the database when we will have to reduce the storage duration for results. I think the current approach isn't that bad and this change makes it better. So ok, this means that this setting is likely more for us as admins to have it easier in case we need to reduce retentions (e.g. after receiving a filesystem alert). Users will probably just stick to the defaults here for the newly introduced settings. I don't think our users will be overwhelmed by the new settings (as they would probably just ignore them) but I could hide/collapse these settings by default. By the way, I think our documentation in the popover is very bad. So maybe that's also just the reason why users are confused. I improved it a little bit as part of this change. I hope it is now clear what the difference between "logs" and "results" (and now also "database entry") actually is. What's still missing is a clarification on what "important" actually means.
This is not true. The archiving feature doesn't allow anybody to make a real decision on when to archive jobs except for setting a "binary" yes/no flag on the admin side to enable/disable the feature at all. If anybody can still influence this in a more nuanced way then that are the users and admins via the other retentions but that's of course just very indirectly and also a bit intransparent (as not everyone reads the documentation and the behavior is generally more complex as mentioned before).
This makes the whole thing much more complicated because for this we would need to define "necessary". Note that we are not just talking about disk space being used here but also about operations taking longer in the presence of many jobs. And also just the space aspect alone would be not super easy to implement as we would probably have to ask PostgreSQL about the figures (as a plain Additionally, the database is storing much more than just jobs so it would make much more sense to decide case by case what to store less of if we receive an alert, e.g. we might decide to store less audit events instead. (By the way, we completely lack cleanup of scheduled products which I would find much more important than making cleanup of jobs "super intelligent/automatic". Limiting scheduled products would also help us filter/sort this table faster (and fast enough in the future) without any additional indexes - these kinds of cleanups are really not only about storage capacity.)
As mentioned, it won't be as straight forward for the database. In case of the space-aware cleanup we really just call Considering I'm not making things really more complex here (by just adding more of the same) I wouldn't have expected that this change is so problematic. I guess if I changed it so these settings are collapsed in the UI by default then this really shouldn't bother users but still gives admins more control. And not even the function to delete only results (but keeping the jobs in the database) is new because this function is already used by our space aware cleanup. |
Yes, that is the current approach and I find it suboptimal as it is a constant back and forth between users and admins. I would like to find a way where users and admins don't need to argue over the same values but all use different parameters that they care about. Think of a scenario where we would have a ten times bigger openQA instance or compare to github. There are also no settings in github projects that you and me set and sometimes github admins would contact us and say "hey, we need to lower those values you set".
Please take a look into recent problems we had with overuse of results and logs and related discussions with job group owners in the following SUSE internal message threads for context:
What those threads should show is that job group owners in multiple cases did not understand the meaning of results, logs, important or not important. And they chose values like 180/180/180/180 so all the same.
That might be true and I like your improvements.
ok. So as this PR is for a spike solution how about doing the "important" clarification first and then think about how to do "keeping only jobs" in a better way.
Yes. I am aware. That's why I created the ticket assuming that we have many building blocks in place already. Your changes to the help text are also good and I like those. But I am still concerned about asking users to make choices which would be more suitable for admins.
We could simply leave out all job group attributes and UI changes for now and only keep the database rows and default values as they can still be set over the config and should have a good default. Then we can collect experience, improve the definition of "important" and then come back to deciding about which UI integration is useful. If after reading all of the above you still think that openQA would scale well - "think ten times bigger" - with this current implementation including the UI integration then I will approve. |
Ok - although I don't see how adding another parameter makes things worse now.
For 10x bigger we would probably need a different database setup and probably a different setup overall. Having everything served by a single host probably won't cut it anymore. I think GitHub and other services can grow in such an unlimited way because they can just add more servers. I don't think we should make very advanced/big/expensive setups like these the next step for openQA, though. We can still help users/admins with simple setups and more limited resources. By the way, I was recently curious how much disk space one has on GitHub as I upload assets for my releases there. It turns out GitHub actually does ask their users to stay within limits to avoid performance problems. And another by the way, many other big platforms directly confront users with limits simply to ask them to pay at some point. I'm not sure whether big platforms are generally a good comparison but this just shows that users often are confronted with limits. So it is not like there's "a perfect world out there" and we need to reach the same level - as it seems you want me to believe.
Ok, I'm not reading these different threads right now. However, I think I know some of them anyway. And that is also why I propose to improve the help popover here. (And because I simply couldn't add anything to the existing popover without breaking its currently terrible structure.)
Ok, I can create a separate PR for UI improvements first. It'll be easier and more straight forward it I was able to base further changes on this PR, though.
But considering we already delete only results first as part of the space aware cleanup and you are against the additions I am proposing here I'm not really sure what you are asking for. Some "magic solution" to make things better probably. Sorry, but all your suggestions so far seem impractical and really leading much further than anyone reading the ticket you've created could have guessed. (At least we didn't guess that when moving the ticked into the backlog. At this point @kalikiana and @perlpunk and me actually briefly talked about this.)
I think collapsing these settings into an advanced/admin section will be good enough. Of course it isn't ideal that all these settings are mixed on a single page even though different user groups probably care about them to varying degrees. There is probably no sharp line that can be drawn between admins and users in our case so it also makes kind of sense. (And it also makes sense because distributing these settings would make it much harder to find them in general.)
I think openQA will not scale 10x bigger regardless of this particular change. However, I don't think this change will get in the way of scalability. As I mentioned before, this is really not adding anything new. Just another setting on a level where we already have very similar settings.
Having to change this in the database manually (and having to use the database just to check the current values) seems very cumbersome. Would it be ok for you if I collapse these two new settings as mentioned before? If not, would it be ok for you if I hid these settings by default but show them only when a certain query parameter is specified? If the answer to any of those questions is yes than I would implement that and also additional improvements to the help popover only top of that (because it is easier). |
Yes, ok. After you provided your counter points on all previous statements and still think it's good to bring in the UI changes I can also accept the current state of this PR as is without additional folding considering that by default all the job group properties are anyway already "folded away".
Not necessary, see above
Fine with the order of implementing that. As soon as conflicts and CI checks are resolved I will approve the PR without needing further changes to the UI |
That sounds good but maybe I will look into additional folding because that's still would make sense and I have still some time in the timebox for that. (If the UI becomes just more bulky/complicated due to the folding I will leave it out, though.)
Thanks. And if we can come up with more concrete ideas (that seem doable with reasonable effort/complexity) I think we can still follow-up on them. I will still have to add one more change. So far, if one puts a |
|
Great PR! Please pay attention to the following items before merging: Files matching
This is an automatically generated QA checklist based on modified files. |
|
I pushed a new version. This should have everything except the folding on the UI as I haven't had the time to look into it yet. (Improving the documentation took a little bit longer than expected and we also have these npm issues to deal with.) |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #6603 +/- ##
=======================================
Coverage 99.12% 99.12%
=======================================
Files 400 400
Lines 40885 40940 +55
=======================================
+ Hits 40526 40581 +55
Misses 359 359 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Very nice. The only suggestion I have is to change "Toggle advanced cleanup settings" to "Show advanced cleanup settings". "Toggle" sounds like pressing this would actually change settings which it doesn't, right? It just shows settings |
| KEEP_IMPORTANT_LOGS_IN_DAYS => 120, | ||
| KEEP_RESULTS_IN_DAYS => 365, | ||
| KEEP_IMPORTANT_RESULTS_IN_DAYS => 0, | ||
| KEEP_JOBS_IN_DAYS => 365, |
There was a problem hiding this comment.
I assume you selected 365 so that the behaviour (by default) is effectively the same as in before, right?
I think that is a good idea.
So for later how about making use of the new feature by default by selecting an according number higher than "KEEP_RESULTS_IN_DAYS"
| KEEP_JOBS_IN_DAYS => 365, | |
| KEEP_JOBS_IN_DAYS => 730, |
If you think this is too much then I suggest to lower KEEP_RESULTS_IN_DAYS to something like 200 and set KEEP_JOBS_IN_DAYS to 365 then.
|
I changed the name of the button. I also made the advanced fields expand automatically if an error occurs when saving that relates to those fields. When doing further testing I also noticed a few other details (with the error handling in general, not with the new fields) which needed fixing so I added two more commits for this as well.
Yes, that was the idea.
I think putting in a higher number as default would make sense. I assume with "later" you mean in a follow-up PR? |
| push @errors, "'$log_key' must be <= '$result_key'" if $result_value != 0 && $log_value > $result_value; | ||
| push @errors, "'$result_key' must be <= '$job_key'" if $job_value != 0 && $result_value > $job_value; |
There was a problem hiding this comment.
By the way, depending on user feedback we could also simply extend the too low values here. This way users would be confronted with fewer errors and openQA would just do what they had probably in mind anyway. We could still show a reworded version of these messages as information to make it clear what happened.
Yes, please. |
|
Ok, I'll create a follow up once this PR has been merged. |
if you go with this I would also say the |
This will allow configuring the storage duration of the job itself in the database, see https://progress.opensuse.org/issues/179263.
This will allow configuring the storage duration of the job itself in the database, see https://progress.opensuse.org/issues/179263.
This will allow configuring the storage duration of the job itself in the database, see https://progress.opensuse.org/issues/179263.
This keeps jobs in the database as long as configured specifically via `keep_jobs_in_days`/`keep_important_jobs_in_days`. The configurations `keep_results_in_days`/`keep_important_results_in_days` are now really only used for deleting results. Related ticket: https://progress.opensuse.org/issues/179263.
It is possible to configure only one of the limits/durations explicitly, e.g. `log_storage_duration = 400`. So far the entire results would still be deleted after 365 days as this is our default for `result_storage_duration`. So the configuration has no effect. This is probably not what an admin configuring this had in mind. With the introduction of `job_storage_duration` this problem is more apparent as one might have put e.g. `result_storage_duration = 400` or even `result_storage_duration = 0` into their config file but now we would still delete the whole job after 365 days as this is the default limit for `job_storage_duration`. This change extends the default durations so these settings are still effective. So if someone put `result_storage_duration = 400` or `result_storage_duration = 0` in their config file the results would be kept for 400 days or forever - just as before the introduction of `job_storage_duration`. Note that we introduced a validation for this kind of ordering between the retentions in 026d72a. This change did not cover the configuration. We should perhaps print a warning in case the values are invalid. This could be done in the function `_set_default_storage_durations` introduced by this commit if that is wanted. Related ticket: https://progress.opensuse.org/issues/179263
* Improve formatting and wrapping * Mention the differences between the different settings * Mention that also a link label can be used to mark a job as important * Update the contents after introducing "Keep (important) jobs for"
These descriptions were too long for the getting started section - also before introducing "Keep (important) jobs for". Ideally new users don't need this detailed explanation as there is now also a help popover on the relevant page with a more compact explanation. This way there's also less cross-referencing necessary. Related ticket: https://progress.opensuse.org/issues/179263
* Show the description directly after the name instead of at the end * Place all cleanup related properties at the end * See https://progress.opensuse.org/issues/179263
* Mention the different settings in the same order as they appear in the form * Move code into a separate file fore easier editing * See https://progress.opensuse.org/issues/179263
This looks nicer, saves space and avoids a separate paragraph/sentence just for a link to the documentation. Related ticket: https://progress.opensuse.org/issues/179263
* Add missing colon * Use an icon that makes sense
Otherwise, when an error message that relates to advanced fields is shown, the advanced field itself might be hidden which is probably confusing. Related ticket: https://progress.opensuse.org/issues/179263
This is a follow-up of 026d72a which introduced this kind of validation but overlooked fields for important jobs. Related ticket: https://progress.opensuse.org/issues/64830
So the name is more in-line with what it actually contains.
35709c9 to
6b61de8
Compare
|
Actually, with the database migration in place that assigns the current value of I'm not sure whether that's what we want. I could change the assignment in that migration to happen only if Note that if we merge this as-is we can still easily keep jobs longer in the database for all groups afterwards. Just bumping |
Sounds like a plan. I will approve |

See https://progress.opensuse.org/issues/179263 and particular commit messages