Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug risk: ctx for DialContext was reused everywhere #75

Open
gaukas opened this issue Jul 19, 2024 · 3 comments
Open

bug risk: ctx for DialContext was reused everywhere #75

gaukas opened this issue Jul 19, 2024 · 3 comments
Assignees
Labels
bug Something isn't working enhancement New feature or request

Comments

@gaukas
Copy link
Contributor

gaukas commented Jul 19, 2024

Problem

If the ctx used in DialContext is cancelled AFTER the function returns, the returned water.Conn will be closed.

See https://github.com/hwh33/water-context-bug for an example.

Cause

Two causes contributed to this bug risk.

WATER reuse the ctx from DialContext in all calls to WebAssembly-exported functions

Configured the water.Core with the ctx passed to DialContext:

func (d *Dialer) DialContext(ctx context.Context, network, address string) (conn water.Conn, err error) {
if d.config == nil {
return nil, fmt.Errorf("water: dialing with nil config is not allowed")
}
ctxReady, dialReady := context.WithCancel(context.Background())
go func() {
defer dialReady()
var core water.Core
core, err = water.NewCoreWithContext(ctx, d.config)
if err != nil {
return
}
conn, err = dial(core, network, address)
}()
select {
case <-ctx.Done():
return nil, ctx.Err()
case <-ctxReady.Done():
return conn, err
}
}

Then reused this ctx when invoking every single WebAssembly-exported function

coreCtx := tm.Core().Context()

_init:

ret, err := init.Call(coreCtx)

_dial_fixed (to be merged with _dial):

ret, err := dial_fixed.Call(coreCtx, api.EncodeI32(callerFd))

_dial:

ret, err := dial.Call(coreCtx, api.EncodeI32(callerFd))

_accept (not with DialContext, but would suffer similar bug risks):

ret, err := accept.Call(coreCtx, api.EncodeI32(callerFd))

_associate (not with DialContext, but would suffer similar bug risks):

ret, err := associate.Call(coreCtx)

_ctrlpipe:

ret, err := ctrlPipe.Call(coreCtx, api.EncodeI32(fd))

_start:

ret, err := start.Call(coreCtx)

At least some function calls, IF NOT ALL, should not be dependent on the context passed in by DialContext.

Functions which MUST return before DialContext can return (_init, _dial, _ctrlpipe) should depend on the ctx passed to DialContext for no issue. However, the blocking main loop, _start, should depend on a different context, probably context.Background().

WATER explicitly configured wazero to terminate the execution when ctx is canceled or timed out

(wazero.RuntimeConfig).WithCloseOnContextDone(true) is invoked by default:

water/wazero_config.go

Lines 153 to 159 in d4faf1b

// NewWazeroRuntimeConfigFactory creates a new WazeroRuntimeConfigFactory.
func NewWazeroRuntimeConfigFactory() *WazeroRuntimeConfigFactory {
return &WazeroRuntimeConfigFactory{
runtimeConfig: wazero.NewRuntimeConfig().WithCloseOnContextDone(true),
compilationCache: nil,
}
}

And wazero/config.go#L151-L172 noted that:

	// WithCloseOnContextDone ensures the executions of functions to be terminated under one of the following circumstances:
	//
	// 	- context.Context passed to the Call method of api.Function is canceled during execution. (i.e. ctx by context.WithCancel)
	// 	- context.Context passed to the Call method of api.Function reaches timeout during execution. (i.e. ctx by context.WithTimeout or context.WithDeadline)
	// 	- Close or CloseWithExitCode of api.Module is explicitly called during execution.
	//
	// This is especially useful when one wants to run untrusted Wasm binaries since otherwise, any invocation of
	// api.Function can potentially block the corresponding Goroutine forever. Moreover, it might block the
	// entire underlying OS thread which runs the api.Function call. See "Why it's safe to execute runtime-generated
	// machine codes against async Goroutine preemption" section in RATIONALE.md for detail.
	//
	// Upon the termination of the function executions, api.Module is closed.
	//
	// Note that this comes with a bit of extra cost when enabled. The reason is that internally this forces
	// interpreter and compiler runtimes to insert the periodical checks on the conditions above. For that reason,
	// this is disabled by default.
	//
	// See examples in context_done_example_test.go for the end-to-end demonstrations.
	//
	// When the invocations of api.Function are closed due to this, sys.ExitError is raised to the callers and
	// the api.Module from which the functions are derived is made closed.
	WithCloseOnContextDone(bool) RuntimeConfig
@gaukas gaukas added bug Something isn't working enhancement New feature or request labels Jul 19, 2024
@gaukas gaukas self-assigned this Jul 19, 2024
@gaukas
Copy link
Contributor Author

gaukas commented Jul 19, 2024

There would be two potential trivia fix to this problem.

First of all, it is obvious that all function calls which MUST return before DialContext does should depend on the context passed by DialContext. Which leaves out only _start, which is used to run the worker thread for WATM v0 and v1 spec and blocks until the WATM can no longer make any progress and must exit (e.g., peer/user closes the connection).

We can either simply use context.Background() for start.Call(ctx) (recommended for its simplicity) or create another way for caller to provide a context ahead of time to cancel/unblock the WATM worker thread. The latter way might be extraneous and redundant since a caller may simply call (Conn).Close() to terminate a Conn once it is established.

@gaukas
Copy link
Contributor Author

gaukas commented Jul 19, 2024

We must note that currently (Conn).Close() (which calls (*TransportModule).Close()) MAY still block forever without a proper timeout for (*TransportModule).Cancel(timeout time.Duration).

func (tm *TransportModule) Close() error {
var err error
tm.closeOnce.Do(func() {
tm.DeferAll()
err = tm.Cancel(0) // may wait forever
tm.Cleanup()
tm.coreMutex.Lock()
if tm.core != nil {
tm.core.Close()
tm.core = nil
}
tm.coreMutex.Unlock()
})
return err
}

Instead of wait forever by default, assume a reasonable default timeout and allow user to override it might be proper.

@gaukas
Copy link
Contributor Author

gaukas commented Jul 19, 2024

Instead of doing this:

tm.core.ContextCancel() // if not exited before timeout, cancel the context to terminate the WebAssembly execution

tm.core.Close() would be more straightforward.

@gaukas gaukas pinned this issue Jul 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

When branches are created from issues, their pull requests are automatically linked.

1 participant