[cleaner] Make cleaner concurrent inside one archive #3988
Conversation
Congratulations! One of the builds has completed. 🍾 You can install the built RPMs by following these steps: …
Please note that the RPMs should be used only in a testing environment.

How to get the sqlite DB locks: in an unpacked sosreport, run …
From a pure review standpoint, I like this approach and the code changes overall. We'd want to ensure that the `--jobs` option gets exposed to `report` for leveraging in-line cleaning, but that can come later.

I am unsure of the need to carry both `files` and `sqlite` as concurrency mechanisms though, especially if the results are largely the same. I have a preference to only carry `files`, but am certainly open to hearing why we should prefer the `sqlite` approach (if we only carry one forward).

That said, on python 3.13 (my Fedora daily driver install), I get this error when trying to clean an archive:
sosreport-terra-2025-04-22-xicbmuo : Beginning obfuscation...
Exception while processing sosreport-terra-2025-04-22-xicbmuo: cannot pickle 'BufferedReader' instances
No reports obfuscated, aborting...
and it is not immediately obvious to me where that `BufferedReader` is coming into play here, but I suspect it is from the `ProcessPoolExecutor` instantiation.
What python version(s) are you able to get successful executions on?
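For context, here is a minimal standalone repro of this class of failure - the names are hypothetical, not the actual sos code path: any object holding an open file handle cannot be pickled, and `ProcessPoolExecutor` pickles every argument before shipping it to a worker process.

```python
from concurrent.futures import ProcessPoolExecutor

class Cleaner:
    def __init__(self, path):
        # An open file handle kept as an attribute is a BufferedReader,
        # which the pickle module refuses to serialize.
        self.archive = open(path, 'rb')

    def obfuscate(self):
        return self.archive.name

def work(cleaner):
    return cleaner.obfuscate()

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:
        fut = pool.submit(work, Cleaner('/etc/hostname'))
        # Arguments are pickled before being sent to the child, so this
        # raises: TypeError: cannot pickle 'BufferedReader' instances
        print(fut.result())
```

The usual fix is to pass only picklable primitives (paths, option values) to the worker and open the files inside the child process.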
One more bug: Because …
force-pushed from bd314b2 to 93fe8a3
The concurrency mechanism does not need to be configurable; I think sticking to either …
force-pushed from cc00d15 to 72664af
force-pushed from c777375 to 6f4c2c2
Allow running cleaner concurrently via child processes. They synchronize on the ordering of items added to the dataset of each mapper by creating numbered files in a directory specific to each mapper. Together with deterministic generation of obfuscated values, this ensures the individual processes end up with identical mappings.
Resolves: sosreport#3097
Closes: sosreport#3988
Signed-off-by: Pavel Moravec <[email protected]>
force-pushed from 6f4c2c2 to 516c61a
force-pushed from 516c61a to 7eec966
force-pushed from 7eec966 to 938a85a
force-pushed from 938a85a to 841ca5b
force-pushed from 841ca5b to 888926a
force-pushed from 888926a to ee19b28
force-pushed from ee19b28 to 3915edd
force-pushed from 3915edd to 7305ce2
force-pushed from 7305ce2 to d52df32
While the code still has a few TODOs, they are rather minor ones and I will process them soon. Otherwise, the PR is ready for review. As the change affects various areas, I would prefer both a code review and also some independent testing(*). So I am kindly requesting @TurboTurtle @arif-ali @jcastill and also @mhradile / @tiredPotato for a review or testing.

(*) I tested both the standalone cleaner on a sosreport and sos-cleaner, as well as hooked …
@@ -26,8 +26,8 @@ class SoSUsernameParser(SoSCleanerParser):
     map_file_key = 'username_map'
     regex_patterns = []

-    def __init__(self, config, skip_cleaning_files=[]):
-        self.mapping = SoSUsernameMap()
+    def __init__(self, config, skip_cleaning_files=[], workdir='/tmp'):
The `workdir='/tmp'` is here due to https://github.com/sosreport/sos/blob/main/tests/unittests/cleaner_tests.py#L33-L38 where we need to pass some value. Calls from the cleaner itself always use `os.path.dirname(self.opts.map_file)`, so we just need some safe fallback for the tests.

I am not sure if `/tmp` is a good dir for all distros, or whether we should set it here or in the tests - let me know your thoughts.
It's OK from the Debian/Ubuntu side, as we use `/tmp` already for all our configs via a patch (although I'll be changing that in the policy instead).
We had the config option specified in sos.conf and it was being placed in `/tmp` already. I only just moved it into the policy instead, which made more sense.

I'd agree we ought to use the policy configuration (if we can); then you know it will be consistent based on those parameters rather than a static entry in the code.
So we would need to load the Policy and use its `get_tmp_dir()` as a fallback. Sadly, neither `SoSCleanerParser` nor the cleaner tests have direct access to the Policy - I will enhance the tests accordingly.
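A hypothetical sketch of the fallback being discussed here (the exact import path and signatures may differ in the actual tree):

```python
import os

from sos.policies import load as load_policy

def parser_workdir(opts=None):
    # The cleaner itself always passes os.path.dirname(self.opts.map_file);
    # fall back to the loaded Policy's tmp dir only when no map file is known.
    if opts is not None and getattr(opts, 'map_file', None):
        return os.path.dirname(opts.map_file)
    return load_policy().get_tmp_dir(None)
```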
I'll have a look at testing next week. As you mentioned, there are a few things still to look at, as the unit tests are failing.
I've run some quick tests, and the results are looking good. Here's an example on RHEL 10 (but I'm testing 8 and 9 as well), with 3000 IP addresses on dummy network devices and the same number of entries in /etc/hosts. Posting two runs here, but I've run around 20 with very similar results.

RHEL 10 with the new code (i.e. the code in this PR):
Test 1:
Test 2:

RHEL 10 with the current code (i.e. what exists in SoS at the moment):
Test 1:
Test 2:

I'll run the same tests on RHEL 8, 9, and Fedora 42 over the weekend. Also, no errors or issues at all with the code proposed here.
I got a bit better results - see the tail of the first post here / grep for "When speaking about performance,". I ran these tests on real sosreports. The "10k IP addresses" test is IMHO worth running to verify the scalability of the solution rather than to compare performance. Also, you can play with …
But similarly, repeated runs of the avocado tests cause that (to really test that something is not yet obfuscated, we would have to remove the cleaner's cache, which could interfere with other tests by pulling the cache out from under them - so I would not add such tests).

Going through the comments now; I will apply them together with the test fixes in the next push.
force-pushed from d52df32 to 2efca94
force-pushed from 2efca94 to dc4cec6
The failing tests are due to a missing rebase of my PR to pick up the Debian default workdir commit. Will do tomorrow.
force-pushed from dc4cec6 to 1b96e72
force-pushed from 1b96e72 to f691e37
force-pushed from f691e37 to 8fb933b
-MOCK_FILE = '/tmp/sos-test-ipv6.txt'
+MOCK_FILE = '/sos-test-ipv6.txt'
This change (and the same in `ipv6_test.py`) is tricky. Some previous test obfuscated `tmp` into `obfuscatedword1`. We replaced `/etc/sos/cleaner/default_mapping` by an empty file, yet the cleaner's cache kept the mapping. So `self.assertFileCollected(MOCK_FILE)` fails, as the filename is changed by the cleaner in the tarball.
A draft version adding sqlite3-based and file-based concurrent backends alongside the traditional sequential one.
TL;DR: please review the idea, test it, and comment or ack or nack the approach. Then I will fix the TODOs.
Let me explain various factors and reasons that affected the chosen implementation.
As the very first thing, I implemented both the sqlite3-based and the file-based approaches (see the reasoning below) and left the original behaviour just for (performance) comparison (it must be run with `-j 1`).

First, mappings are very independent objects that maintain their dataset on their own. Even adding one item to obfuscate (say, an FQDN) means the dataset is updated a few times (for the host, the FQDN and/or the domain). I tried to respect this, as I like that independence (and also, changing it would require a lot of changes). A toy illustration follows.
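This hypothetical mapper (not the actual sos classes) shows how a single added item can grow the dataset by several entries, which matters for the ordering numbers discussed below:

```python
# Toy illustration: one added item can create several dataset entries.
class ToyHostnameMapper:
    def __init__(self):
        self.dataset = {}

    def add(self, fqdn):
        host, _, domain = fqdn.partition('.')
        # the FQDN, the short hostname, and the domain each get a mapping
        for item in (fqdn, host, domain):
            if item and item not in self.dataset:
                self.dataset[item] = f'obfuscated{len(self.dataset)}'
        return self.dataset[fqdn]

m = ToyHostnameMapper()
m.add('terra.example.com')
print(len(m.dataset))   # 3 entries from one add() call
```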
Also, the dataset grows over time (as we discover more domains or IP networks), which means some instances of a domain discovered late might not be obfuscated, or that changing the ordering of the files to obfuscate results in a different final mapping (and also a different size of the final map). You can check this yourself if you reorder the list from `get_file_list`. So running the cleaner concurrently - which means the dataset is populated non-deterministically - can end up with a different final map. Don't be confused by this as I was.

I chose to have the individual processes sync over the pieces of information "we obfuscate item X as the first one, and item Y as the next". An option to exchange or sync the whole mappings/datasets would mean 1) more data to sync and 2) altering the "independent" behaviour of the mapping classes.
This also means there are gaps in the numbers, and that is fine. The ordering number is the size of the dataset, which can be incremented by more than one when adding just one item. This is fine, as each and every process replays the same sequence and adds the same items to its dataset. It also allows a smart replay of the whole "how was the dataset created?" process at the end - see the `archive.load_parser_entries()` call. A sketch of the file-based claim step follows.
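A minimal sketch of the file-based claim step described above (illustrative names only; the real implementation lives in this PR):

```python
import os

def claim_ordinal(mapper_dir, ordinal, item):
    """Atomically publish `item` under number `ordinal` in the mapper's
    directory. Returns True if this process won the race; on False the
    caller re-reads the published files in numeric order and replays the
    additions, so every process converges to the same dataset."""
    path = os.path.join(mapper_dir, str(ordinal))
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, 'w') as f:
        f.write(item)
    return True
```

`O_CREAT | O_EXCL` makes the create-if-absent step atomic on a local filesystem, which is what lets independent processes agree on an ordering without any shared daemon.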
Then, `ProcessPoolExecutor` has two limitations for us that I described directly in the code, plus one substantial one: passing the whole `SoSCleaner` object is not easily feasible. Doing so would prevent some code movement among classes and would be nicer to hook new code into, but it does not work easily. Trying that, I hit issues like "oh, cloning `SoSCleaner` into a child process means cloning the sos argument parsing there, which overrides some `__*` method that blocks successful process spawning and iteration over a list". It is possible to fix or hack those traps in our code, but after iteratively doing that for five such traps, I gave up on that rabbit-hole journey.

Please consider this a draft only - see the number of TODO points, plus various methods need better names. Anyway, the code is functionally ready, works well(*) and scales fairly well.
(*) The only concern is that the sqlite DB can lock itself. I did a few changes there, but I still seldom hit a live-lock behaviour over a locked DB in some artificial test cases (my "favourite" one: put 8 identical files, each with 100 unique IP addresses, inside a sosreport and run the cleaner with `-j 8`).

While I like the sqlite approach more (it looks so professional!), we can't choose it until we fix these live-locks / DB locks. The usual mitigations are sketched below.
Gladly, the file-based approach works smoothly and provides comparable (or even slightly better) performance.
When speaking about performance, some benchmark tests are running. Results from an 8-core RHEL 10 beta system, median time of 3 runs each:
- a small, 11MB packed sosreport:
- a 220MB packed sosreport:
- a 350MB packed sosreport:
- a 770MB packed sosreport:
- a 1GB packed sosreport:
Closes: #3097
Please place an 'X' inside each '[]' to confirm you adhere to our Contributor Guidelines