
GPG stalls on database_open lock when using pass driver for secrets #2433

Open
@donovan-maas-gs

Description


When trying to use secrets with the pass driver, Podman fails with the opaque error "exit status 2".

This happens both when creating the secret, and (assuming you manually create the secret) when you try to load the secret.

The crux of the issue appears to be this message:

gpg: Note: database_open 134217901 waiting for lock (held by 6970) ...

where PID 6970 belongs to keyboxd:

admin 6970 0.0 0.0 231492 4276 ? Ss 12:18 0:00 keyboxd --homedir /home/admin/.gnupg --daemon
admin 7041 0.0 0.0 305224 2376 ? Ss 12:18 0:00 keyboxd --homedir /home/admin/.gnupg --daemon

Repro

I am running Fedora 42 (Server Edition), on a PowerEdge 2950. It has 2 2-core CPUs, 32 GiB of RAM, and these issues are occurring with files on the boot XFS SSD. Podman is being run rootlessly, but these errors appear when running the command with sudo.

You of course will need a GPG key installed.

It's worth noting here that if you haven't trusted the key, you get the same "exit status 2" error, but it happens immediately; I haven't diagnosed what error messages it produces in that case.

Once the key is trusted you can attempt to create the secret. In this case I used this test command

printf password123 | podman secret create test_secret --driver=pass --driver-opts=key=ABCD1234 -

This will then stall and eventually report

Error: creating secret test_secret: exit status 2

If we run ps auxf while it is stalled we should see something to the effect of

admin      46016  1.4  0.1 1942312 44588 pts/0   Sl+  16:07   0:00      |   \_ podman secret create test_secret --driver=pass --driver-opts=key=ABCD1234 -
admin      46028  0.0  0.0  11704  4500 pts/0    SL+  16:07   0:00      |       \_ gpg --encrypt -r ABCD1234 -o /home/admin/02595e1876d7790afb3eedb42.gpg

This shows that the command that is stalling is the GPG command. Running this command manually works without a problem.

The code in this repo discards stderr, but we can still read it via procfs with the command

ps auxf | grep -E 'gpg +--' | awk '{print $2}' | xargs -I{} cat /proc/{}/fd/2

This command has four parts: ps auxf prints the list of processes; grep -E 'gpg +--' finds the GPG process (the regex is needed to exclude the grep process that would otherwise also match); awk '{print $2}' (courtesy of ChatGPT) extracts the PID; and xargs -I{} cat /proc/{}/fd/2 (likewise from ChatGPT) substitutes the PID into the procfs path so cat can read the file descriptor for stderr.

This will produce at least one log message along the lines of

gpg: Note: database_open 134217901 waiting for lock (held by 6970) ...

The PID 6970 that is holding the lock belongs to one of the instances of keyboxd

admin 6970 0.0 0.0 231492 4276 ? Ss 12:18 0:00 keyboxd --homedir /home/admin/.gnupg --daemon
admin 7041 0.0 0.0 305224 2376 ? Ss 12:18 0:00 keyboxd --homedir /home/admin/.gnupg --daemon

Code Sources

The vague error appears to come from secrets.go#L237

err = driver.Store(secr.ID, data)
if err != nil {
	return "", fmt.Errorf("creating secret %s: %w", name, err)
}

There is another location that could also be the source, but it is unlikely since the default driver works.

This then traces to passdriver.go#L139

func (d *Driver) Store(id string, data []byte) error {
	if _, err := d.Lookup(id); err == nil {
		return fmt.Errorf("%s: %w", id, define.ErrSecretIDExists)
	}
	in := bytes.NewReader(data)
	key, err := d.getPath(id)
	if err != nil {
		return err
	}
	return d.gpg(context.TODO(), in, nil, "--encrypt", "-r", d.KeyID, "-o", key)
}

which in turn executes the GPG command at passdriver.go#L166.

func (d *Driver) gpg(ctx context.Context, in io.Reader, out io.Writer, args ...string) error {
	if d.GPGHomedir != "" {
		args = append([]string{"--homedir", d.GPGHomedir}, args...)
	}
	cmd := exec.CommandContext(ctx, "gpg", args...)
	cmd.Env = os.Environ()
	cmd.Stdin = in
	cmd.Stdout = out
	cmd.Stderr = io.Discard
	return cmd.Run()
}

It seems important to note that shortly before the GPG encrypt command is called, Lookup is invoked:

func (d *Driver) Lookup(id string) ([]byte, error) {
	out := &bytes.Buffer{}
	key, err := d.getPath(id)
	if err != nil {
		return nil, err
	}
	if err := d.gpg(context.TODO(), nil, out, "--decrypt", key); err != nil {
		return nil, fmt.Errorf("%s: %w", id, define.ErrNoSuchSecret)
	}
	if out.Len() == 0 {
		return nil, fmt.Errorf("%s: %w", id, define.ErrNoSuchSecret)
	}
	return out.Bytes(), nil
}

Additional Testing

I did clone this repo and do some manual testing, though it was limited and had to be done through unit tests.

I essentially recreated the exact command I would be using, but made sure that stdout and stderr were redirected. Additionally, I logged the GPG commands before they ran. This produced the output

Running command: gpg --decrypt /home/admin/test_secret.gpg
gpg: can't open '/home/admin/test_secret.gpg': No such file or directory
gpg: decrypt_message failed: No such file or directory
Running command: gpg --encrypt -r ABCD1234 -o /home/admin/test_secret.gpg

Based on what I can find, this indicates to me that the decrypt GPG command could be locking the database, and that the lock is not released by the time the encrypt command runs. I'm not entirely sure why this works in the unit test but not with Podman; if I had to guess, I'd wager that go test rebuilds the code without optimizations, introducing enough of a delay that it just works. A pitfall with this hypothesis is that the "waiting for lock" message can appear multiple times, indicating that GPG at least retries the lock several times.

I also manually tried tossing the secret into secrets.json, and running a container with

podman run --secret=test_secret debian cat "/run/secrets/test_secret"

Which likewise stalls and errors out with

Error: 02595e1876d7790afb3eedb42: no such secret

But intercepting StdErr shows that it is actually the same problem as when creating secrets.

Final Notes

I'm not the only one to have this issue. My friend is running a similar setup, albeit with very different server specs, and was encountering this issue, which led me here.

I've tried running the GPG commands manually many times and don't encounter this error. It only happens when they run as a result of the podman command.

It would also be nice if stdout and stderr weren't simply discarded for the GPG command. I would prefer that stderr always be written to the console, but having it written only under a verbose flag would still be better than the debugging I had to do. I'd even accept an obscure --driver-opt.
