[cleaner] Make cleaner concurrent inside one archive #3988
Conversation
Congratulations! One of the builds has completed. 🍾 You can install the built RPMs by following these steps: …
Please note that the RPMs should be used only in a testing environment.

How to get the sqlite DB locks: in an unpacked sosreport, run …
From a pure review standpoint, I like this approach and the code changes overall. We'd want to ensure that the `--jobs` option gets exposed to `report` for leveraging in-line cleaning, but that can come later.

I am unsure of the need to carry both `files` and `sqlite` as concurrency mechanisms though, especially if the results are largely the same. I have a preference to only carry `files`, but am certainly open to hearing why we should prefer the `sqlite` approach (if we only carry one forward).

That said, on python 3.13 (my Fedora daily driver install), I get this error when trying to clean an archive:
sosreport-terra-2025-04-22-xicbmuo : Beginning obfuscation...
Exception while processing sosreport-terra-2025-04-22-xicbmuo: cannot pickle 'BufferedReader' instances
No reports obfuscated, aborting...
and it is not immediately obvious to me where that `BufferedReader` is coming into play here, but I suspect it is from the `ProcessPoolExecutor` instantiation.
What python version(s) are you able to get successful executions on?
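For context, here is a minimal standalone repro of this class of failure - the names are hypothetical, not the actual sos code path: any object holding an open file handle cannot be pickled, and `ProcessPoolExecutor` pickles every argument before shipping it to a worker process.

```python
from concurrent.futures import ProcessPoolExecutor

class Cleaner:
    def __init__(self, path):
        # An open file handle kept as an attribute is a BufferedReader,
        # which the pickle module refuses to serialize.
        self.archive = open(path, 'rb')

    def obfuscate(self):
        return self.archive.name

def work(cleaner):
    return cleaner.obfuscate()

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:
        fut = pool.submit(work, Cleaner('/etc/hostname'))
        # Arguments are pickled before being sent to the child, so this
        # raises: TypeError: cannot pickle 'BufferedReader' instances
        print(fut.result())
```

The usual fix is to pass only picklable primitives (paths, option values) to the worker and open the files inside the child process.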
One more bug: Because …
force-pushed from bd314b2 to 93fe8a3
The concurrency mechanism does not need to be configurable; I think sticking to either …
force-pushed from cc00d15 to 72664af
force-pushed from c777375 to 6f4c2c2
Allow running cleaner concurrently via child processes. They synchronize on the ordering of items added to the dataset of each mapper by creating numbered files in a directory specific to each mapper. Together with deterministic generation of obfuscated values, this ensures the individual processes end up with identical mappings.
Resolves: sosreport#3097
Closes: sosreport#3988
Signed-off-by: Pavel Moravec <[email protected]>
force-pushed from 6f4c2c2 to 516c61a
force-pushed from 516c61a to 7eec966
force-pushed from 7eec966 to 938a85a
force-pushed from 938a85a to 841ca5b
force-pushed from 841ca5b to 888926a
force-pushed from 888926a to ee19b28
force-pushed from ee19b28 to 3915edd
force-pushed from 3915edd to 7305ce2
force-pushed from 7305ce2 to d52df32
While the code still has a few TODOs, they are rather minor ones and I will process them soon. Otherwise, the PR is ready for review. As the change affects various areas, I would prefer both a code review and also some independent testing(*). So I am kindly requesting @TurboTurtle @arif-ali @jcastill and also @mhradile / @tiredPotato for a review or testing.

(*) I tested both the standalone cleaner on a sosreport and sos-cleaner, as well as hooked …
@@ -26,8 +26,8 @@ class SoSUsernameParser(SoSCleanerParser):
     map_file_key = 'username_map'
     regex_patterns = []

-    def __init__(self, config, skip_cleaning_files=[]):
-        self.mapping = SoSUsernameMap()
+    def __init__(self, config, skip_cleaning_files=[], workdir='/tmp'):
The `workdir='/tmp'` is here due to https://github.com/sosreport/sos/blob/main/tests/unittests/cleaner_tests.py#L33-L38 where we need to pass some value. Calls from the cleaner itself always use `os.path.dirname(self.opts.map_file)`, so we just need some safe fallback for the tests.

I am not sure if `/tmp` is a good dir for all distros, or whether we should set it here or in the tests - let me know your thoughts.
It's OK from the Debian/Ubuntu side, as we use `/tmp` already for all our configs via a patch (although I'll be changing that in the policy instead).
We had the config option specified in sos.conf and it was being placed in `/tmp` already. I only just moved it into the policy instead, which made more sense.

I'd agree we ought to use the policy configuration (if we can); then you know it will be consistent based on those parameters rather than a static entry in the code.
So we would need to load the Policy and use its `get_tmp_dir()` as a fallback. Sadly, neither `SoSCleanerParser` nor the cleaner tests have direct access to the Policy - I will enhance the tests accordingly.
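A hypothetical sketch of the fallback being discussed here (the exact import path and signatures may differ in the actual tree):

```python
import os

from sos.policies import load as load_policy

def parser_workdir(opts=None):
    # The cleaner itself always passes os.path.dirname(self.opts.map_file);
    # fall back to the loaded Policy's tmp dir only when no map file is known.
    if opts is not None and getattr(opts, 'map_file', None):
        return os.path.dirname(opts.map_file)
    return load_policy().get_tmp_dir(None)
```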
I'll have a look at testing next week. As you mentioned, there are a few things still to look at, as the unit tests are failing.
I've run some quick tests, and the results are looking good. Here's an example on RHEL 10 (but I'm testing 8 and 9 as well), with 3000 IP addresses on dummy network devices and the same number of entries in /etc/hosts. Posting two runs here, but I've run around 20 with very similar results.

RHEL 10 with the new code (i.e. the code in this PR):
Test 1:
Test 2:

RHEL 10 with the current code (i.e. what exists in SoS at the moment):
Test 1:
Test 2:

I'll run the same tests on RHEL 8, 9, and Fedora 42 over the weekend. Also, no errors or issues at all with the code proposed here.
I got a bit better results - see the tail of the first post here / grep for "When speaking about performance,". I ran these tests on real sosreports. The "10k IP addresses" test is IMHO worth running to verify the scalability of the solution rather than to compare performance. Also, you can play with …
But similarly, repeated runs of the avocado tests cause that (to really test that something is not yet obfuscated, we would have to remove the cleaner's cache, which could interfere with other tests by pulling the cache out from under them - so I would not add such tests).

Going through the comments now; I will apply them together with the test fixes in the next push.
force-pushed from d52df32 to 2efca94
force-pushed from 2efca94 to dc4cec6
The failing tests are due to a missing rebase of my PR to pick up the Debian default workdir commit. Will do tomorrow.
force-pushed from dc4cec6 to 1b96e72
force-pushed from 1b96e72 to f691e37
force-pushed from f691e37 to 8fb933b
-MOCK_FILE = '/tmp/sos-test-ipv6.txt'
+MOCK_FILE = '/sos-test-ipv6.txt'
This change (and the same in `ipv6_test.py`) is tricky. Some previous test obfuscated `tmp` into `obfuscatedword1`. We replaced `/etc/sos/cleaner/default_mapping` by an empty file, yet the cleaner's cache kept the mapping. So `self.assertFileCollected(MOCK_FILE)` fails, as the filename is changed by the cleaner in the tarball.
A draft version adding sqlite3-based and file-based concurrent backends alongside the traditional sequential one.
TL;DR: please review the idea, test it, and comment or ack or nack the approach. Then I will fix the TODOs.
Let me explain various factors and reasons that affected the chosen implementation.
As the very first thing, I implemented both the sqlite3-based and the file-based approaches (see the reasoning below) and left the original behaviour just for (performance) comparison (it must be run with `-j 1`).

First, mappings are very independent objects that maintain their dataset on their own. Even adding one item to obfuscate (say, an FQDN) means the dataset is updated a few times (for the host, the FQDN and/or the domain). I tried to respect this, as I like that independence (and also, changing it would require a lot of changes). A toy illustration follows.
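This hypothetical mapper (not the actual sos classes) shows how a single added item can grow the dataset by several entries, which matters for the ordering numbers discussed below:

```python
# Toy illustration: one added item can create several dataset entries.
class ToyHostnameMapper:
    def __init__(self):
        self.dataset = {}

    def add(self, fqdn):
        host, _, domain = fqdn.partition('.')
        # the FQDN, the short hostname, and the domain each get a mapping
        for item in (fqdn, host, domain):
            if item and item not in self.dataset:
                self.dataset[item] = f'obfuscated{len(self.dataset)}'
        return self.dataset[fqdn]

m = ToyHostnameMapper()
m.add('terra.example.com')
print(len(m.dataset))   # 3 entries from one add() call
```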
Also, the dataset grows over time (as we discover more domains or IP networks), which means some instances of a domain discovered late might not be obfuscated, or that changing the ordering of the files to obfuscate results in a different final mapping (and also a different size of the final map). You can check this yourself if you reorder the list from `get_file_list`. So running the cleaner concurrently - which means the dataset is populated non-deterministically - can end up with a different final map. Don't be confused by this as I was.

I chose to have the individual processes sync over the pieces of information "we obfuscate item X as the first one, and item Y as the next". An option to exchange or sync the whole mappings/datasets would mean 1) more data to sync and 2) altering the "independent" behaviour of the mapping classes.
This also means there are gaps in the numbers, and that is fine. The ordering number is the size of the dataset, which can be incremented by more than one when adding just one item. This is fine, as each and every process replays the same sequence and adds the same items to its dataset. It also allows a smart replay of the whole "how was the dataset created?" process at the end - see the `archive.load_parser_entries()` call. A sketch of the file-based claim step follows.
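A minimal sketch of the file-based claim step described above (illustrative names only; the real implementation lives in this PR):

```python
import os

def claim_ordinal(mapper_dir, ordinal, item):
    """Atomically publish `item` under number `ordinal` in the mapper's
    directory. Returns True if this process won the race; on False the
    caller re-reads the published files in numeric order and replays the
    additions, so every process converges to the same dataset."""
    path = os.path.join(mapper_dir, str(ordinal))
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, 'w') as f:
        f.write(item)
    return True
```

`O_CREAT | O_EXCL` makes the create-if-absent step atomic on a local filesystem, which is what lets independent processes agree on an ordering without any shared daemon.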
Then, `ProcessPoolExecutor` has two limitations for us that I described directly in the code, plus one substantial one: passing the whole `SoSCleaner` object is not easily feasible. Doing so would prevent some code movement among classes and would be nicer to hook new code into, but it does not work easily. Trying that, I hit issues like "oh, cloning `SoSCleaner` into a child process means cloning the sos argument parsing there, which overrides some `__*` method that blocks successful process spawning and iteration over a list". It is possible to fix or hack those traps in our code, but after iteratively doing that for five such traps, I gave up on that rabbit-hole journey.

Please consider this a draft only - see the number of TODO points, plus various methods need better names. Anyway, the code is functionally ready, works well(*) and scales fairly well.
(*) The only concern is that the sqlite DB can lock itself. I did a few changes there, but I still seldom hit a live-lock behaviour over a locked DB in some artificial test cases (my "favourite" one: put 8 identical files, each with 100 unique IP addresses, inside a sosreport and run the cleaner with `-j 8`).

While I like the sqlite approach more (it looks so professional!), we can't choose it until we fix these live-locks / DB locks. The usual mitigations are sketched below.
Gladly, the file-based approach works smoothly and provides comparable (or even slightly better) performance.
When speaking about performance, some benchmark tests are running. Results from an 8-core RHEL 10 beta system, median time of 3 runs each:
- a small, 11MB packed sosreport:
- a 220MB packed sosreport:
- a 350MB packed sosreport:
- a 770MB packed sosreport:
- a 1GB packed sosreport:
Closes: #3097
Please place an 'X' inside each '[]' to confirm you adhere to our Contributor Guidelines