Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent SEGFAULTs on consecutive exec_command() invocations #658

Open
wants to merge 3 commits into
base: devel
Choose a base branch
from

Conversation

Jakuje
Copy link
Contributor

@Jakuje Jakuje commented Nov 14, 2024

SUMMARY

The function exec_command() keeps the callbacks as a local variable before assigning them to the created channel. The channel is not guaranteed to be completely freed when ssh_channel_free() is called because there might be some leftover messages or responses to process (close confirmation, exit code ...).

Calling the exec_command() as done previously in the test from the same function without anything in between (except assert) will likely map the second function call to the same memory on the call stack so it was working most of the time. But calling it from different functions or contexts will likely change the call stack and processing of outstanding callbacks is more likely to result in addressing wrong memory location.

Likely fixes #57, #645 and #657

ISSUE TYPE
  • Bugfix Pull Request
ADDITIONAL INFORMATION

I was not able to reproduce the issue locally so pushing to see if the CI will be able to crash.

This is also introducing memory leaks as the callback structure is never freed. We should probably store it somewhere in the python code before returning to make sure it is not garbage collected (or can the python GC track the callback pointer is still stored on the libssh side?).

@psf-chronographer psf-chronographer bot added the bot:chronographer:provided There is a change note present in this PR label Nov 14, 2024
Copy link

Congratulations! One of the builds has completed. 🍾

You can install the built RPMs by following these steps:

  • sudo yum install -y dnf-plugins-core on RHEL 8
  • sudo dnf install -y dnf-plugins-core on Fedora
  • dnf copr enable packit/ansible-pylibssh-658
  • And now you can install the packages.

Please note that the RPMs should be used only in a testing environment.

@Jakuje Jakuje force-pushed the crash-exec branch 2 times, most recently from 6d6711a to b94734d Compare November 15, 2024 20:41
@Jakuje Jakuje marked this pull request as ready for review November 19, 2024 10:12
Signed-off-by: Jakub Jelen <[email protected]>
Copy link

sonarcloud bot commented Nov 19, 2024

strict=False,
)
def exec_second_command(ssh_channel):
"""Call exec_command() from different context in the call stack."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling from a different context is a part of the motivation. Within this function, there's nothing that changes the context. You're probably referring to the calling test but this is not something that would go into the docstring because it's not what the function does, it's what that test does, probably.

Copy link
Member

@webknjaz webknjaz Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Call exec_command() from different context in the call stack."""
"""Return ``exec_command()`` stdout as a text string."""

cb_size = sizeof(callbacks.ssh_channel_callbacks_struct)
cdef callbacks.ssh_channel_callbacks_struct *cb = <callbacks.ssh_channel_callbacks_struct *>PyMem_Malloc(cb_size)
if cb is NULL:
raise LibsshChannelException("Memory allocation error")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codecov says this line is uncovered: https://app.codecov.io/gh/ansible/pylibssh/pull/658#17ffa324129dc304109f4a66c79769d7-R173.

Could you add a test covering this newly added line so that all the new lines in the patch are covered?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a simple way to make the malloc fail to add to the test coverage?

u_cmd_out = ssh_channel.exec_command('echo -n Hello Again').stdout.decode()
assert u_cmd_out == u'Hello Again' # noqa: WPS302

# randomize the stack a bit more by calling this from yet another function
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this accurate enough?

Suggested change
# randomize the stack a bit more by calling this from yet another function
# NOTE: Call `exec_command()` once again from another function to
# NOTE: force it to happen in another place of the call stack,
# NOTE: making sure that the context is different from one in this
# NOTE: this test function. The resulting call stack will end up
# NOTE: being more random.


libssh.ssh_channel_send_eof(channel)
result.returncode = libssh.ssh_channel_get_exit_status(channel)
if channel is not NULL:
libssh.ssh_channel_close(channel)
libssh.ssh_channel_free(channel)

# XXX leaking the memory of the callbacks
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you expand on what this means? Is this a FIXME? Does it have to remain in the patch? Why not use a # FIXME: comment instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this is a side-effect that might be considered as a bug, I would prefer to keep a reference to that. Other option might be to keep some list of callbacks on the session layer to free when we free session, but I wanted first PoC to confirm this approach was working. This will be an issue for older libssh versions before we will fix it with the following change in libssh: https://gitlab.com/libssh/libssh-mirror/-/merge_requests/549

Given that I see this works, I can probably implement more cleaner approach without memory leaks, but probably only tomorrow.

@webknjaz webknjaz changed the title Prevent crash on consecutive exec_command() Prevent SEGFAULTs on consecutive exec_command() invocations Nov 19, 2024
@webknjaz
Copy link
Member

@Jakuje it looks like this is making some CI jobs get stuck: https://github.com/ansible/pylibssh/actions/runs/11910858999/job/33210235668?pr=658 / https://github.com/ansible/pylibssh/actions/runs/11910858999/job/33210009382?pr=658.
(I cancelled the 3.9 one after 39 minutes, and the 3.10 one was killed by xdist after a 30-second timeout)

@webknjaz
Copy link
Member

I restarted said jobs, but this is something to look into, as it'll probably make the CI flakier if merged.

@webknjaz
Copy link
Member

@webknjaz
Copy link
Member

OTOH, it's also unstable on devel: https://github.com/ansible/pylibssh/actions/runs/11917417576/job/33212843276

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bot:chronographer:provided There is a change note present in this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Flaky SEGFAULT while testing exec_command
2 participants