Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Forward signals to child process with micromamba run #3152

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

MihaiBSony
Copy link

Hi! This MR tries to fix issue #1820, in which signals are not forwarded to the child process. I think this should count as a first draft - in the final version it's probably a good idea to save the state of the signal handlers and restore it once the call is over, instead of just overwriting them. Looking forward to your feedback!

P.S.: I don't know much about Windows, so I didn't even try to handle that case.

@jonashaag
Copy link
Contributor

Do have more understanding of why signal forwarding works on Linux and macOS but not Windows? Should this patch be Windows-only instead?

@MihaiBSony
Copy link
Author

MihaiBSony commented Jan 25, 2024

My changes don't touch windows at all (almost all changes are inside a #ifndef _WIN32 guard). Previously, the only signal that was somewhat working on Linux was SIGTERM, because micromamba was adding a signal handler for it (see removed code in the PR), which in turn was stopping the child process with a SIGTERM. This PR attempts to forward all signals to the child, except for SIGCHLD.

@MihaiBSony
Copy link
Author

The context where I encountered this as an issue is a setup involving SLURM and pytorch-lightning. There I needed the python interpreter to receive the SIGUSR1 signal, but, whenever I ran it as micromamba run -n env python script.py, sending SIGUSR1 to micromamba would just terminate the micromamba process and the child, since that is the default action for SIGUSR1 according to the signal(7) manpage.

While working on the issue I got some inspiration from the source code of the Singularity/Apptainer containerization tool, since micromamba run seems to roughly have a similar purpose as singularity exec. There they seem to handle signals in a similar way as this PR, forwarding them to the child process (if I understood the code correctly). See here.

@jonashaag
Copy link
Contributor

jonashaag commented Jan 25, 2024

Thanks a lot for the context!

We definitely need to add tests for this feature. I guess it could be as simple as using micromamba run to start a Python script that prints whatever signals it receives, and using that output in the test.

That could also help understanding the Windows status quo :)

We save and restore the signal mask and the signal handlers, as well as
stop and restart our `default_signal_handler`.
@MihaiBSony
Copy link
Author

Hi @jonashaag!

I made the code more robust and added unit tests for signal forwarding and process suspension (with CTRL-Z). Unfortunately the code is using a lot of Linux syscalls, so not even the unit tests are usable for windows. Could you take a look at it and maybe tag other people who have more knowledge of signal handling or linux processes? In particular, I'm not sure if the way I'm dealing with the previous signal handler and the signal mask is correct, and if I'm waiting correctly for the child process. Also, it might make sense to move some of the new code to separate functions to keep things tidy.

On the plus side, I already tested the binary in my setup and it works for my use case! 😀

Copy link
Contributor

@jonashaag jonashaag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work <3

}
}

// Compute the return status mimicking the private logic in reproc++
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a link to that logic as a comment please

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a link to the main part of the logic, which is here. Do you want a second link for the wpid < 1 part? That would be here

@@ -442,24 +477,38 @@ namespace mamba
}

#ifndef _WIN32
MainExecutor::instance().schedule(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I forgot this change from one of the experiments I did. I'll revert it.

Do you know why the signal handler was being set with the MainExecutor instead of directly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No 🤷

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I traced this back to 60d4b3a, but still don't know why the signal was being set in a separate thread.

With the new changes, it seems more difficult to do the modifications correctly in a separate thread. The MainExecutor doesn't seem to have functionality to check if a scheduled task has finished, meaning that it wouldn't be possible to check that modifying and restoring the signal masks happen before the end of the run_in_environment call without using some sort of synchronization primitives. At the same time, the signal mask is set on a per-thread basis, so the calls to pthread_sigmask would have to anyhow be done on the main thread. What do you think about leaving this as is?

Copy link
Contributor

@jonashaag jonashaag Feb 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wolfv do you recall why?

libmamba/src/core/run.cpp Outdated Show resolved Hide resolved
@@ -470,12 +519,23 @@ namespace mamba

ec = reproc::drain(proc, reproc::sink::null, reproc::sink::null);

std::tie(status, ec) = proc.stop(opt.stop);
#ifndef _WIN32
status = wait_for_process(pid);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Between the place where we installed the signal handlers up until after this line, is it possible to get any exceptions? If so, should we put the cleanup/restoring code into a finally?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not check, but that sounds like a good idea. Unfortunately I'm not sure how to do that in C++

- add link to relevant reproc code
- move Unix-only code into `#ifndef _WIN32` block
- fix comment
- extend suspension unit test sleep time
Comment on lines +308 to +309
// Compute the return status mimicing the private logic in reproc++
// https://github.com/DaanDeMeyer/reproc/blob/1c07bdbec3f2ecba7125b9499b9a8a77bf9aa8c7/reproc/src/process.posix.c#L455
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Compute the return status mimicing the private logic in reproc++
// https://github.com/DaanDeMeyer/reproc/blob/1c07bdbec3f2ecba7125b9499b9a8a77bf9aa8c7/reproc/src/process.posix.c#L455
// Compute the return status mimicing the private logic in reproc++:
// https://github.com/DaanDeMeyer/reproc/blob/1c07bdbec3f2ecba7125b9499b9a8a77bf9aa8c7/reproc/src/process.posix.c#L471
// https://github.com/DaanDeMeyer/reproc/blob/1c07bdbec3f2ecba7125b9499b9a8a77bf9aa8c7/reproc/src/process.posix.c#L455

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or

Suggested change
// Compute the return status mimicing the private logic in reproc++
// https://github.com/DaanDeMeyer/reproc/blob/1c07bdbec3f2ecba7125b9499b9a8a77bf9aa8c7/reproc/src/process.posix.c#L455
// Compute the return status mimicing the private logic in reproc++
// https://github.com/DaanDeMeyer/reproc/blob/1c07bdbec3f2ecba7125b9499b9a8a77bf9aa8c7/reproc/src/process.posix.c#L465-L478

@jonashaag
Copy link
Contributor

What signal handler do we have installed in this loop?

// Set handlers for all signals and save the previous state.

Ie. what are the previous handlers? If this always the fixed list from set_signal_handler? If so, I guess we could simplify the code by making it less generic?

@MihaiBSony
Copy link
Author

What signal handler do we have installed in this loop?

// Set handlers for all signals and save the previous state.

Ie. what are the previous handlers? If this always the fixed list from set_signal_handler? If so, I guess we could simplify the code by making it less generic?

When used in our code, I think only the so-called default handler is set. If, on the other hand, run_in_environment is used from third-party code (which could be the case, since it's in libmamba?), then we can't make any assumptions about the signals.

Also, there's actually one corner case which I don't handle and which seems impossible to handle without reworking other parts of the code. If someone uses set_signal_handler to set a non-default handler, I can't restore it at the end of the run_in_environment call, and my code just sets the default handler instead.

The so-called signal handlers from set_signal_handler are actually just threads blocking on a sigwait call. We could save the function that is passed as an argument, but if it's third-party code and has any side-effects before getting to sigwait, the behaviour would be pretty hard to predict. This seems to rather be a design flaw in the set_signal_handler functionality, but maybe there's a reason why we don't just use the normal linux signal handlers instead?

@jonashaag
Copy link
Contributor

It seems that set_signal_handler was introduced in #1193. Can you please check @MihaiBSony's last comment and let us know if there is a way to simplify the signal handling here? Especially for the run_in_environment API. @adriendelsalle @AntoinePrv

@AntoinePrv
Copy link
Member

I don't know much about the signal handlers but I know they must be forwarded to python when used in libmambapy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants