Skip to content

fix(SSHLoggerBase): switch to use threads #11394

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

fruch
Copy link
Contributor

@fruch fruch commented Jul 15, 2025

switch to threads, so we can stop creating a new remoter for process
creating new remoter break some of the integration tests
that pass in LoaclCmdRunner

there no real reason for using a process here, while some of the classes
inheriting from this one switch already to using a thread

Fixes: #11368

Testing

  • tested locally

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add New configuration option and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

Copilot

This comment was marked as outdated.

@soyacz soyacz added the test-integration Enable running the integration tests suite label Jul 15, 2025
@vponomaryov
Copy link
Contributor

vponomaryov commented Jul 15, 2025

https://jenkins.scylladb.com/job/sct-github-PRs-scan/job/scylla-cluster-tests/job/PR-11394/2/consoleFull

12:21:59  unit_tests/test_cassandra_stress_thread.py::test_01_cassandra_stress FAILED [  2%]
12:22:04  unit_tests/test_cassandra_stress_thread.py::test_02_cassandra_stress_user_profile FAILED [  2%]
12:22:25  unit_tests/test_cassandra_stress_thread.py::test_03_cassandra_stress_client_encrypt FAILED [  3%]
12:22:31  unit_tests/test_cassandra_stress_thread.py::test_04_cassandra_stress_multi_region FAILED [  4%]
12:22:40  unit_tests/test_cassandra_stress_thread.py::test_05_cassandra_stress_compression[Deflate-none] FAILED [  4%]
12:22:51  unit_tests/test_cassandra_stress_thread.py::test_05_cassandra_stress_compression[LZ4-lz4] FAILED [  5%]
12:22:58  unit_tests/test_cassandra_stress_thread.py::test_05_cassandra_stress_compression[Snappy-snappy] FAILED [  6%]
12:23:11  unit_tests/test_cassandra_stress_thread.py::test_05_cassandra_stress_compression[Zstd-none] FAILED [  6%]
...

@soyacz
Copy link
Contributor

soyacz commented Jul 15, 2025

but it's different error: PermissionError: [Errno 13] Permission denied: '/tmp/sctfxfk1x3x/tmp.f5cXTvYnCX'
I've tried this fix locally and worked.

@dimakr
Copy link
Contributor

dimakr commented Jul 15, 2025

Maybe it is due to locally (on dev machine) sudo is usually configured to be passwordless? And this leads to the statement with remote_file(remoter=remoter, remote_path=password_file, sudo=use_sudo) as fobj: behave differently on local setup?

@dimakr
Copy link
Contributor

dimakr commented Jul 15, 2025

Maybe it is due to locally (on dev machine) sudo is usually configured to be passwordless? And this leads to the statement with remote_file(remoter=remoter, remote_path=password_file, sudo=use_sudo) as fobj: behave differently on local setup?

I will disable passwordless sudo on my machine and check this PR locally

@fruch
Copy link
Contributor Author

fruch commented Jul 15, 2025

This missing the fix for dockerhub login, I'll rebase it later

@fruch fruch force-pushed the fix_11368 branch 2 times, most recently from 2923840 to 0907996 Compare July 17, 2025 13:39
@fruch
Copy link
Contributor Author

fruch commented Jul 17, 2025

@dimakr
this sage isn't done

this fix doesn't work with you changes.

i.e. we still have issue with docker_hub_login if it's thinks it's docker, and try to use sudo

why do we need sudo within the docker ? for setting the password for docker ?

@dimakr
Copy link
Contributor

dimakr commented Jul 17, 2025

why do we need sudo within the docker ? for setting the password for docker ?

at some point in the past the state of the code of RemoteDocker.pull_image was like this:

    def pull_image(node, image):
        # Login docker-hub before pull, in case node authentication is expired or not logged-in.
        docker_hub_login(remoter=node.remoter)
        prefix = "sudo" if node.is_docker else ""
        node.remoter.run(
            f'{prefix} docker pull {image}', verbose=True, retry=3)

This led to the situation when login to DockerHub was performed under regular user, but image pull was done under sudo, for docker based nodes. So basically for docker nodes image pull was done by unauthenticated user.
At some point we started hitting unauthenticated pull rate limit too often, after Docker reduced rate limits.

@fruch
Copy link
Contributor Author

fruch commented Jul 17, 2025

why do we need sudo within the docker ? for setting the password for docker ?

at some point in the past the state of the code of RemoteDocker.pull_image was like this:

    def pull_image(node, image):
        # Login docker-hub before pull, in case node authentication is expired or not logged-in.
        docker_hub_login(remoter=node.remoter)
        prefix = "sudo" if node.is_docker else ""
        node.remoter.run(
            f'{prefix} docker pull {image}', verbose=True, retry=3)

This led to the situation when login to DockerHub was performed under regular user, but image pull was done under sudo, for docker based nodes. So basically for docker nodes image pull was done by unauthenticated user. At some point we started hitting unauthenticated pull rate limit too often, after Docker reduced rate limits.

long run we don't want this command to be run within a docker instance
and if we don't need sudo support in it for VMs, let's just skip this login for docker instances, until we fix the loader setup for docker backend.

@fruch
Copy link
Contributor Author

fruch commented Jul 17, 2025

meanwhile I'll try fixing this in a different way, removing the remoter recreate code that is the root cause, and this change was intended to bypass it.

@fruch fruch changed the title fix(intergation-tests): add DummyNode.is_docker() fix(SSHLoggerBase): switch to use threads Jul 17, 2025
@fruch fruch removed the test-integration Enable running the integration tests suite label Jul 17, 2025
@fruch fruch requested a review from Copilot July 17, 2025 20:45
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR switches the SSHLoggerBase class from using multiprocessing to threading to resolve integration test failures. The change eliminates the need to create new remoter instances for processes, which was breaking tests that pass in LocalCmdRunner.

  • Replace multiprocessing.Process with ThreadPoolExecutor for journal logging
  • Remove remoter creation logic and reuse the existing node remoter
  • Update start/stop methods to work with thread-based execution
Comments suppressed due to low confidence (1)

sdcm/utils/remote_logger.py:81

  • The variable name _child_thread is misleading since it's actually a ThreadPoolExecutor, not a thread. Consider renaming it to _thread_executor or _executor.
        self._child_thread = ThreadPoolExecutor(max_workers=1)

swith to threads, so we can stop creating a new remoter for process
creating new remoter break some of the integration tests
that pass in `LoaclCmdRunner`

there no real reason for using a process here, while some of the classes
inheriting from this one switch already to using a thread
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

test_01_cassandra_stress integration test is failing in multiple PRs
4 participants