Skip to content

[Bug]: MessageID roll back when calling Woodpecker append operation #41842

@chyezh

Description

@chyezh

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: #41824
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    woodpecker
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

wal write message into underlying wal, and return 6/7xx, but after recovering at another streamingnode,
wal write a new message into underlying wal but return 2/0, and the consume operation from 6/5xx can not read any entry. Break the consistent of wal, the system cannot recovered anymore.

2025-05-14 18:32:12.586	[2025/05/14 10:32:12.586 +00:00] [INFO] [adaptor/wal_adaptor.go:242] ["wal graceful close done, wait for operation to be finished..."] [module=streamingnode] [component=wal] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@6]
2025-05-14 18:32:12.586	[2025/05/14 10:32:12.586 +00:00] [INFO] [adaptor/wal_adaptor.go:250] ["wal begin to close flusher..."] [module=streamingnode] [component=wal] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@6]
2025-05-14 18:32:12.586	[2025/05/14 10:32:12.586 +00:00] [INFO] [adaptor/scanner_adaptor.go:146] ["the produce event loop of scanner is closed"] [module=streamingnode] [component=scanner] [name=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0/6/2] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0]
2025-05-14 18:32:12.586	[2025/05/14 10:32:12.586 +00:00] [WARN] [producer/produce_server.go:93] ["send arm of stream closed by unexpected error"] [module=streamingnode] [component=producer-server] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0] [term=6] [error="send loop is stopped for close of wal"]
2025-05-14 18:32:12.586	[2025/05/14 10:32:12.586 +00:00] [INFO] [adaptor/scanner_adaptor.go:154] ["the consuming event loop of scanner is closed"] [module=streamingnode] [component=scanner] [name=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0/6/2] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0]
2025-05-14 18:32:12.586	[2025/05/14 10:32:12.586 +00:00] [INFO] [adaptor/scanner_adaptor.go:133] ["scanner is closed"] [module=streamingnode] [component=scanner] [name=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0/6/2] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0]
2025-05-14 18:32:12.586	[2025/05/14 10:32:12.586 +00:00] [WARN] [producer/produce_server.go:93] ["send arm of stream closed by unexpected error"] [module=streamingnode] [component=producer-server] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0] [term=6] [error="send loop is stopped for close of wal"]
2025-05-14 18:32:12.586	[2025/05/14 10:32:12.586 +00:00] [WARN] [producer/produce_server.go:145] ["recv arm of stream closed by unexpected error"] [module=streamingnode] [component=producer-server] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0] [term=6] [error="rpc error: code = Canceled desc = context canceled"]
2025-05-14 18:32:12.586	[2025/05/14 10:32:12.586 +00:00] [WARN] [producer/produce_server.go:145] ["recv arm of stream closed by unexpected error"] [module=streamingnode] [component=producer-server] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0] [term=6] [error="rpc error: code = Canceled desc = context canceled"]
2025-05-14 18:32:12.586	[2025/05/14 10:32:12.586 +00:00] [INFO] [flusherimpl/flusher_components.go:146] ["data sync service closed for flusher closing"] [module=streamingnode] [component=flusher] [pchannel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@6] [vchannel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0_458022832186351001v0]
2025-05-14 18:32:12.586	[2025/05/14 10:32:12.586 +00:00] [INFO] [flusherimpl/flusher_components.go:146] ["data sync service closed for flusher closing"] [module=streamingnode] [component=flusher] [pchannel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@6] [vchannel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0_458022021334041933v0]
2025-05-14 18:32:12.586	[2025/05/14 10:32:12.586 +00:00] [INFO] [flusherimpl/flusher_components.go:146] ["data sync service closed for flusher closing"] [module=streamingnode] [component=flusher] [pchannel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@6] [vchannel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0_458022738515455585v0]
2025-05-14 18:32:12.586	[2025/05/14 10:32:12.586 +00:00] [INFO] [flusherimpl/wal_flusher.go:63] ["wal flusher stop"] [module=streamingnode] [component=flusher] [pchannel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@6]
2025-05-14 18:32:12.586	[2025/05/14 10:32:12.586 +00:00] [INFO] [flusherimpl/wal_flusher.go:120] ["wal flusher start to close the recovery storage..."] [module=streamingnode] [component=flusher] [pchannel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@6]
2025-05-14 18:32:12.586	[2025/05/14 10:32:12.586 +00:00] [INFO] [recovery/recovery_background_task.go:40] ["recovery storage background task, perform a graceful exit..."] [module=streamingnode] [component=recovery-storage] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@6] [state=working]
2025-05-14 18:32:12.631	[2025/05/14 10:32:12.630 +00:00] [INFO] [recovery/recovery_background_task.go:77] ["persist dirty snapshot"] [module=streamingnode] [component=recovery-storage] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@6] [state=working] [checkpoint=6/493] [checkpointTimeTick=458022906520928279] [vchannelCount=0] [segmentCount=16]
2025-05-14 18:32:12.631	[2025/05/14 10:32:12.631 +00:00] [INFO] [recovery/recovery_background_task.go:32] ["recovery storage background task exit"] [module=streamingnode] [component=recovery-storage] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@6] [state=working]
2025-05-14 18:32:12.631	[2025/05/14 10:32:12.631 +00:00] [INFO] [recovery/wal_truncator.go:69] ["sampling truncator background task exit"] [module=streamingnode] [component=recovery-storage] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@6] [state=working]
2025-05-14 18:32:12.631	[2025/05/14 10:32:12.631 +00:00] [INFO] [flusherimpl/wal_flusher.go:122] ["recovery storage closed"] [module=streamingnode] [component=flusher] [pchannel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@6]
2025-05-14 18:32:12.631	[2025/05/14 10:32:12.631 +00:00] [INFO] [adaptor/wal_adaptor.go:256] ["wal begin to close scanners..."] [module=streamingnode] [component=wal] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@6]
2025-05-14 18:32:12.631	[2025/05/14 10:32:12.631 +00:00] [INFO] [adaptor/scanner_adaptor.go:146] ["the produce event loop of scanner is closed"] [module=streamingnode] [component=scanner] [name=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0/6/1] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0]
2025-05-14 18:32:12.631	[2025/05/14 10:32:12.631 +00:00] [INFO] [adaptor/scanner_adaptor.go:154] ["the consuming event loop of scanner is closed"] [module=streamingnode] [component=scanner] [name=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0/6/1] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0]
2025-05-14 18:32:12.631	[2025/05/14 10:32:12.631 +00:00] [INFO] [adaptor/scanner_adaptor.go:133] ["scanner is closed"] [module=streamingnode] [component=scanner] [name=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0/6/1] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0]
2025-05-14 18:32:12.631	[2025/05/14 10:32:12.631 +00:00] [INFO] [adaptor/wal_adaptor.go:265] ["scanner close done, close inner wal..."] [module=streamingnode] [component=wal] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@6]
2025-05-14 18:32:12.665	[2025/05/14 10:32:12.665 +00:00] [INFO] [adaptor/wal_adaptor.go:268] ["wal close done, close interceptors..."] [module=streamingnode] [component=wal] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@6]
2025-05-14 18:32:12.665	[2025/05/14 10:32:12.665 +00:00] [INFO] [adaptor/wal_adaptor.go:272] ["close the write ahead buffer..."] [module=streamingnode] [component=wal] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@6]
2025-05-14 18:32:12.665	[2025/05/14 10:32:12.665 +00:00] [INFO] [adaptor/wal_adaptor.go:275] ["close the segment assignment manager..."] [module=streamingnode] [component=wal] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@6]
2025-05-14 18:32:12.665	[2025/05/14 10:32:12.665 +00:00] [INFO] [adaptor/wal_adaptor.go:278] ["call wal cleanup function..."] [module=streamingnode] [component=wal] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@6]
2025-05-14 18:32:12.666	[2025/05/14 10:32:12.665 +00:00] [INFO] [adaptor/wal_adaptor.go:280] ["wal closed"] [module=streamingnode] [component=wal] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@6]
2025-05-14 18:32:12.666	[2025/05/14 10:32:12.666 +00:00] [INFO] [walmanager/wal_lifetime.go:143] ["close current term wal done"] [module=streamingnode] [component=wal-manager] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0] [expected=(6,false)] [current=(6,true)]
2025-05-14 18:32:12.666	[2025/05/14 10:32:12.666 +00:00] [INFO] [walmanager/manager_impl.go:89] ["remove wal success"] [module=streamingnode] [component=wal-manager] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0] [term=6]
2025-05-14 18:32:12.704	[2025/05/14 10:32:12.704 +00:00] [INFO] [walmanager/wal_lifetime.go:118] ["expected state changed, do a life cycle"] [module=streamingnode] [component=wal-manager] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0] [expected=(7,true)]
2025-05-14 18:32:12.712	[2025/05/14 10:32:12.712 +00:00] [INFO] [adaptor/initializing.go:53] ["start to sync first time tick"] [module=streamingnode] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@7]
2025-05-14 18:32:13.772	[2025/05/14 10:32:13.772 +00:00] [INFO] [adaptor/initializing.go:59] ["sync first time tick done"] [module=streamingnode] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@7] [msgID=2/0] [timetick=458022907451540458]
2025-05-14 18:32:13.773	[2025/05/14 10:32:13.773 +00:00] [INFO] [recovery/recovery_persisted.go:41] ["recover checkpoint done"] [module=streamingnode] [component=recovery-storage] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@7] [state=persist-recovering] [checkpoint=6/493] [timetick=458022906520928279] [magic=1]
2025-05-14 18:32:13.775	[2025/05/14 10:32:13.775 +00:00] [INFO] [recovery/recovery_persisted.go:54] ["recovery vchannel info done"] [module=streamingnode] [component=recovery-storage] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@7] [state=persist-recovering] [vchannels=3]
2025-05-14 18:32:13.776	[2025/05/14 10:32:13.775 +00:00] [INFO] [recovery/recovery_persisted.go:65] ["recover segment info done"] [module=streamingnode] [component=recovery-storage] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@7] [state=persist-recovering] [segments=16]
2025-05-14 18:32:13.776	[2025/05/14 10:32:13.776 +00:00] [INFO] [recovery/recovery_stream.go:33] ["recover from wal stream..."] [module=streamingnode] [component=recovery-storage] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@7] [startMessageID=6/493] [fromTimeTick=458022906520928279] [toTimeTick=458022907451540458] [state=stream-recovering]
2025-05-14 18:32:13.776	[2025/05/14 10:32:13.776 +00:00] [INFO] [adaptor/scanner_adaptor.go:135] ["scanner start background task"] [module=streamingnode] [component=scanner] [name=recovery] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@7] [startMessageID=6/493]
2025-05-14 18:32:13.776	[2025/05/14 10:32:13.776 +00:00] [INFO] [adaptor/scanner_adaptor.go:177] ["start produce loop of scanner at model"] [module=streamingnode] [component=scanner] [name=recovery] [channel=yz-chaos-clu-wp-mix-fix71-rootcoord-dml_0:rw@7] [startMessageID=6/493] [model=catchup]

Expected Behavior

No response

Steps To Reproduce

Milvus Log

No response

Anything else?

No response

Metadata

Metadata

Assignees

Labels

kind/bugIssues or changes related a bugseverity/criticalCritical, lead to crash, data missing, wrong result, function totally doesn't work.triage/acceptedIndicates an issue or PR is ready to be actively worked on.

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions