Slave has Died issue #243

Open
5angjun opened this issue Oct 11, 2023 · 5 comments

@5angjun

5angjun commented Oct 11, 2023

Hello, I'm Sangjun, and I'm very interested in this project.

I'd like to know how to fix an error in the communication between the Manager and the Workers.

When a slave (Worker) dies, I want the dead process to be replaced: start a new QEMU instance and reconnect it to the fuzzing process.

Dying slaves are a critical problem when I fuzz for many hours, e.g. more than 6 hours.

So I think what needs to be improved is re-engaging dead Workers in the fuzzing process.

Does anyone have an idea how to do this? This is the connection-handling code I am looking at:

    # Excerpt; the surrounding module needs `import select` and `import msgpack`.
    def wait(self, timeout=None):
        results = []
        # self.clients holds the listener plus one connection per Worker.
        r, w, e = select.select(self.clients, (), (), timeout)
        for sock_ready in r:
            if sock_ready == self.listener:
                # A new Worker is connecting.
                c = self.listener.accept()
                self.clients.append(c)
                self.clients_seen += 1
            else:
                try:
                    msg = sock_ready.recv_bytes()
                    msg = msgpack.unpackb(msg, strict_map_key=False)
                    results.append((sock_ready, msg))
                except (EOFError, IOError):
                    # The Worker's end of the connection is gone.
                    sock_ready.close()
                    self.clients.remove(sock_ready)
                    self.logger.info("Worker disconnected (remaining %d/%d)." % (len(self.clients)-1, self.clients_seen))
                    # Only the listener is left: no Workers remain.
                    if len(self.clients) == 1:
                        raise SystemExit("All Workers exited.")
        return results

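
For illustration, here is a minimal, self-contained sketch of the same disconnect detection that records dead workers instead of exiting. It is not kAFL code: `poll_workers` and the `conns` dictionary are hypothetical names, and `multiprocessing.Pipe` stands in for the real manager/worker sockets.

```python
from multiprocessing import Pipe
from multiprocessing.connection import wait


def poll_workers(conns, timeout=0.1):
    """Return (messages, dead) for the given worker connections.

    conns maps a Connection to a worker id (hypothetical bookkeeping,
    not kAFL's data structure).
    """
    messages, dead = [], []
    for conn in wait(list(conns), timeout):
        try:
            messages.append((conns[conn], conn.recv_bytes()))
        except (EOFError, OSError):
            # The peer closed its end: remember the worker instead of exiting.
            conn.close()
            dead.append(conns.pop(conn))
    return messages, dead


# Two simulated workers: one sends a message, the other "dies".
mgr0, wrk0 = Pipe()
mgr1, wrk1 = Pipe()
conns = {mgr0: 0, mgr1: 1}
wrk0.send_bytes(b"ready")
wrk1.close()

messages, dead = poll_workers(conns)
print(messages, dead)
```

The `dead` list is the piece of information a restart mechanism would need; what to do with it (respawn, give up, log) is a separate decision.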
@Wenzel
Contributor

Wenzel commented Oct 11, 2023

Hi @5angjun,

I think we need to understand why the slaves (or Workers) are dying in the first place.
That shouldn't happen.

You can get more logging information with --log, and combine it with --debug to extract useful debug output.

cc @il-steffen, can we expect dying Workers during a fuzzing campaign? Is there something I'm missing?

@il-steffen
Collaborator

A worker exit can happen on a QEMU segfault or an unhandled exception in the worker / mutation logic. The logic above only handles the loss of the socket connection; you need to look at why the worker exited.

@5angjun
Author

5angjun commented Oct 11, 2023

I think the error occurred when QEMU died.

The code where it last died is this:

    def run_qemu(self):
        self.control.send(b'x')
        self.control.recv(1)

So I think it would be nice if the fuzzing campaign could restart when QEMU dies.
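
A hedged sketch of how a send/receive handshake like the one above could report a dead peer explicitly instead of failing somewhere later. It assumes `control` behaves like a regular stream socket; `QemuGone` and `run_handshake` are hypothetical names, and `socket.socketpair()` only simulates the QEMU side.

```python
import socket


class QemuGone(Exception):
    """Hypothetical exception: the control peer closed or stopped responding."""


def run_handshake(control, timeout=5.0):
    """Send the one-byte request and wait for the one-byte reply."""
    control.settimeout(timeout)
    try:
        control.send(b"x")
        reply = control.recv(1)
    except OSError as exc:
        # Covers timeouts, broken pipes and connection resets.
        raise QemuGone(f"control socket error: {exc!r}") from exc
    if reply == b"":
        # recv() returning empty bytes means the peer closed the socket.
        raise QemuGone("control socket closed by peer")
    return reply


# Simulate a QEMU process that exits: the peer end is closed before replying.
worker_end, qemu_end = socket.socketpair()
qemu_end.close()
try:
    run_handshake(worker_end, timeout=1.0)
    outcome = "alive"
except QemuGone:
    outcome = "dead"
finally:
    worker_end.close()
print(outcome)
```

With a dedicated exception, the worker can exit in a controlled way and whoever supervises it can tell a dead QEMU apart from other failures.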

@il-steffen
Collaborator

Please have a look at why this is happening. In general, we want to fix anything that causes workers to die during a fuzzing campaign.
There are cases where restarting won't help. For instance, if the disk is full, QEMU will just exit again on the next file/log write.

In some cases there may be a QEMU segfault that is not easy to fix. For instance, we had bugs related to specific virtio fuzzing harnesses where fixing QEMU did not make much sense. In such a case it would make sense to catch the failure and restart the worker. This should be possible from the manager, and then the fuzzing campaign can just continue running.
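
As a sketch of the "restarting won't always help" point, a restart decision could first check a restart budget and the free disk space. `may_restart`, its parameters and the default thresholds are assumptions made for illustration, not kAFL settings.

```python
import shutil
import tempfile


def may_restart(workdir, restarts, max_restarts=5, min_free_bytes=1 << 30):
    """Decide whether restarting a dead worker is worthwhile."""
    if restarts >= max_restarts:
        return False  # keeps crashing: needs a human, not another restart
    free = shutil.disk_usage(workdir).free
    return free >= min_free_bytes  # a full disk would make QEMU exit again


with tempfile.TemporaryDirectory() as workdir:
    ok = may_restart(workdir, restarts=0, min_free_bytes=0)
    exhausted = may_restart(workdir, restarts=5, min_free_bytes=0)
    disk_full = may_restart(workdir, restarts=0, min_free_bytes=1 << 62)
print(ok, exhausted, disk_full)
```

The very large `min_free_bytes` in the last call only simulates a full disk; a real threshold would depend on how much the campaign writes.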

The manager main loop is here: https://github.com/IntelLabs/kafl.fuzzer/blob/master/kafl_fuzzer/manager/manager.py#L85
We enter this just after launching the workers: https://github.com/IntelLabs/kafl.fuzzer/blob/master/kafl_fuzzer/manager/core.py#L104
The workers are Python threads which in turn launch QEMU sub-processes. The threads should abort normally on a QEMU communication error or an uncaught exception, so you should be able to detect this and restart the thread with the same settings.

With some luck, the socket connection code you referenced above should detect the new worker and the main loop will start dispatching jobs again.
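
The detect-and-restart idea described above could look roughly like the sketch below. It uses `threading` because workers are described here as Python threads. `Supervisor` and `fake_worker` are hypothetical; real worker start-up (configuration, launching QEMU, reconnecting to the manager socket) is not shown.

```python
import threading


class Supervisor:
    """Restart worker threads that exit (hypothetical helper, not part of kAFL)."""

    def __init__(self, target, num_workers, max_restarts=3):
        self.target = target              # callable taking the worker id
        self.max_restarts = max_restarts  # per-worker restart budget
        self.restarts = {wid: 0 for wid in range(num_workers)}
        self.threads = {wid: self._spawn(wid) for wid in range(num_workers)}

    def _spawn(self, wid):
        t = threading.Thread(target=self.target, args=(wid,), daemon=True)
        t.start()
        return t

    def check(self):
        """Restart dead workers; return the ids restarted in this pass."""
        restarted = []
        for wid, t in list(self.threads.items()):
            if t.is_alive() or self.restarts[wid] >= self.max_restarts:
                continue
            self.restarts[wid] += 1
            self.threads[wid] = self._spawn(wid)  # same settings as before
            restarted.append(wid)
        return restarted


stop = threading.Event()
attempts = {0: 0, 1: 0}


def fake_worker(wid):
    attempts[wid] += 1
    if wid == 0 and attempts[wid] == 1:
        return   # simulate a worker that dies on its first run
    stop.wait()  # healthy worker: run until asked to stop


sup = Supervisor(fake_worker, num_workers=2)
sup.threads[0].join()   # wait for the simulated crash
first = sup.check()     # worker 0 is restarted
second = sup.check()    # nothing left to restart
stop.set()
print(first, second, sup.restarts)
```

In a manager, `check()` would be called periodically from the main loop. The restart budget keeps a worker that crashes immediately every time from being respawned forever.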

@5angjun
Author

5angjun commented Oct 11, 2023

This situation appears when a lot of RAM is allocated to each VM image and fuzzing runs in parallel. In my case, the problem appeared while fuzzing a Windows built-in driver for a long time.

For example, my host computer has 84G of RAM. When I allocated 10G of RAM to each VM and fuzzed with 8 cores (using almost 82G of the 84G), QEMU or the worker died (most likely QEMU).

But the manager process is still alive. I am thinking about how to modify the code so that the manager process can revive dead workers.
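
As a rough check of the memory numbers above: 8 workers with 10G each commit 80G of guest RAM and leave about 4G for the host, the manager and QEMU's own overhead. Whether memory pressure is the actual cause is not confirmed in this thread. The sketch below estimates a more conservative worker count; `max_workers`, `reserve_gb` and `overhead_gb` are assumed, illustrative names and values.

```python
def max_workers(host_ram_gb, vm_ram_gb, reserve_gb=8.0, overhead_gb=0.5):
    """Estimate how many VMs fit into host RAM.

    reserve_gb (host OS, manager, page cache) and overhead_gb (per-QEMU
    overhead beyond guest RAM) are assumed values; measure your own.
    """
    usable = host_ram_gb - reserve_gb
    if usable <= 0:
        return 0
    return int(usable // (vm_ram_gb + overhead_gb))


guest_total = 8 * 10          # guest RAM committed by 8 workers, in GB
headroom = 84 - guest_total   # what is left for everything else
print(guest_total, headroom)
print(max_workers(84, 10))
```

With these assumed margins the estimate is 7 workers rather than 8; lowering the per-VM RAM is the other obvious knob.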

As a person who loves kAFL, I will also think about how to modify kAFL to make it a masterpiece. 😀😀😀

Thanks
