snap.broken files filled up the etcd disk space after a few restarts

### Bug report criteria

- [ ] This bug report is not security related, security issues should be disclosed privately via security@etcd.io.
- [ ] This is not a support request or question, support requests or questions should be raised in the etcd [discussion forums](https://github.com/etcd-io/etcd/discussions).
- [ ] You have read the etcd [bug reporting guidelines](https://github.com/etcd-io/etcd/blob/main/Documentation/contributor-guide/reporting_bugs.md).
- [ ] Existing open issues along with etcd [frequently asked questions](https://etcd.io/docs/latest/faq) have been checked and this is not a duplicate.

### What happened?

We run etcd v3.5 in Kubernetes Pods whose attached disks have limited size. The etcd cluster has 5 members, each running in a separate Pod on a different Kubernetes node.
We restarted the Pods a few times. After that, several .snap.broken files had been created and filled up the disk, so the etcd services can no longer start.

====
#: etcd --version
etcd Version: 3.5.21
Git SHA: a17edfd
Go Version: go1.23.7
Go OS/Arch: linux/amd64

# du -sh *
3.2G snap
800M wal
# cd snap/
# du -sh *
482M 0000000000003c97-0000000005ff12da.snap
487M 0000000000003c97-000000000600997b.snap
483M 0000000000003c97-0000000006022053.snap
489M 0000000000003c9f-000000000603a784.snap
211M 0000000000003c9f-000000000603a784.snap.broken
483M 0000000000003ca0-0000000006052e25.snap
490M 0000000000003ca0-000000000606b4c6.snap.broken
59M 00000000000003a1-000000000606b4f9.snap.broken
68K db

### What did you expect to happen?

If I understand the etcd source code that creates the snap files correctly (server/etcdserver/api/snap/snapshotter.go), it can leave incomplete snap files behind under some conditions. For example, if the process is stopped while a new snap file is being written, the partial file is left on disk. The next time the etcd process starts, it cannot load the partial file, so it sets it aside as a .snap.broken file. The .snap.broken files are skipped from then on and are never purged.

In enterprise-class software, a tmp+rename approach is commonly used to create critical files. That is, the file content is first written to a temporary file on the same filesystem, and the creation is then committed by renaming the temporary file to the destination name. This way an incomplete file is never loaded, and a leftover temporary file can simply be discarded on the next start. I expect this would significantly reduce the number of .snap.broken files.

I'm not sure whether this should be considered an enhancement request or a bug fix. Either way, it causes trouble when the disk space for etcd is small, especially when the etcd server is containerized.

Hopefully this approach can be applied to both v3.5 and newer versions.

### How can we reproduce it (as minimally and precisely as possible)?

Forcibly restart the etcd members repeatedly.

### Anything else we need to know?

_No response_

### Etcd version (please run commands below)

<details>

```console
$ etcd --version
etcd Version: 3.5.21
Git SHA: a17edfd
Go Version: go1.23.7
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.5.21
```

</details>


### Etcd configuration (command line flags or environment variables)

These are the command-line flags for the first etcd member. The etcd cluster has 5 members.

<details>
/usr/bin/etcd --data-dir=/etcd-data --name=sample-mds-1 --listen-peer-urls=https://0.0.0.0:2380 --listen-client-urls=https://0.0.0.0:2379 --advertise-client-urls=https://sample-mds-1.sample-mds.labsys:2379 --initial-advertise-peer-urls=https://sample-mds-1.sample-mds.labsys:2380 --initial-cluster=sample-mds-1=https://sample-mds-1.sample-mds.labsys:2380 --initial-cluster-state=new --initial-cluster-token=sample-mds-tok --peer-cert-file=/sample-config/certificates/tls.crt --peer-key-file=/sample-config/certificates/tls.key --peer-trusted-ca-file=/sample-config/certificates/ca.crt --peer-client-cert-auth --cert-file=/sample-config/certificates/tls.crt --key-file=/sample-config/certificates/tls.key --client-cert-auth --trusted-ca-file=/sample-config/certificates/ca.crt --enable-v2
</details>


### Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

<details>

```console
$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here
```

</details>


### Relevant log output

```Shell

```
