
feat: add concurrency limit for WAL replay #26483


Merged: 1 commit, Jun 3, 2025


praveen-influx (Contributor) commented Jun 2, 2025

WAL replay currently loads all WAL files concurrently, which can run into OOM. This commit adds a CLI parameter, --wal-replay-concurrency-limit, that lets the user set a lower limit and run WAL replay again.
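
The idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `WalContents`, `load_wal_file`, and `replay_in_batches` are hypothetical names, the loader is a stub, and plain OS threads stand in for whatever async machinery the real replay uses. It only shows how a limit bounds the number of files held in memory at once while preserving file order.

```rust
use std::thread;

/// Hypothetical stand-in for a decoded WAL file (not the real `influxdb3_wal` type).
#[derive(Debug, Clone, PartialEq)]
struct WalContents {
    wal_file_number: u64,
}

/// Hypothetical loader stub; a real one would fetch and deserialize the file
/// from the object store.
fn load_wal_file(wal_file_number: u64) -> WalContents {
    WalContents { wal_file_number }
}

/// Loads WAL files in batches of at most `limit` files.
/// `None` means no limit: every file is loaded in a single batch.
/// Returns the loaded files in their original order and the size of each batch.
fn replay_in_batches(
    file_numbers: &[u64],
    limit: Option<usize>,
) -> (Vec<WalContents>, Vec<usize>) {
    // Assumption: "no limit" (or a zero limit) means one batch covering everything.
    let batch_size = limit
        .filter(|n| *n > 0)
        .unwrap_or(file_numbers.len())
        .max(1);
    let mut replayed = Vec::with_capacity(file_numbers.len());
    let mut batch_lens = Vec::new();

    for batch in file_numbers.chunks(batch_size) {
        // Only the files of the current batch are loaded concurrently.
        let loaded: Vec<WalContents> = thread::scope(|s| {
            let handles: Vec<_> = batch
                .iter()
                .map(|&n| s.spawn(move || load_wal_file(n)))
                .collect();
            // Joining in spawn order keeps the files in sequence-number order.
            handles
                .into_iter()
                .map(|h| h.join().expect("loader thread panicked"))
                .collect()
        });
        batch_lens.push(loaded.len());
        replayed.extend(loaded);
    }
    (replayed, batch_lens)
}
```

With 39 files and a limit of 10, this sketch produces batches of 10, 10, 10, and 9, which is the same shape as the `batch_len` values in the test logs further down; with no limit it produces a single batch of 39.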

It also includes a small refactor that introduces an args struct to hold all the parameters of the WalObjectStore::new method, which removes the need for #[allow(clippy::too_many_arguments)].
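
The args-struct pattern looks roughly like the sketch below. The names here (`CreateWalArgs`, `WalSketch`, `ObjectStore`, and every field) are hypothetical and chosen only for illustration; the actual struct and fields in this PR may differ.

```rust
use std::sync::Arc;
use std::time::Duration;

/// Hypothetical stand-in for an object store handle.
struct ObjectStore;

/// Hypothetical args struct: one named field per former positional parameter.
struct CreateWalArgs {
    object_store: Arc<ObjectStore>,
    node_id: String,
    flush_interval: Duration,
    snapshot_size: usize,
    /// `None` means no limit on replay concurrency.
    replay_concurrency_limit: Option<usize>,
}

/// Hypothetical WAL type used only to show the constructor shape.
struct WalSketch {
    object_store: Arc<ObjectStore>,
    node_id: String,
    flush_interval: Duration,
    snapshot_size: usize,
    replay_concurrency_limit: Option<usize>,
}

impl WalSketch {
    /// A single argument keeps clippy's `too_many_arguments` lint quiet
    /// without an `#[allow(...)]` attribute.
    fn new(args: CreateWalArgs) -> Self {
        // Destructuring makes the compiler flag any field that is forgotten.
        let CreateWalArgs {
            object_store,
            node_id,
            flush_interval,
            snapshot_size,
            replay_concurrency_limit,
        } = args;
        Self {
            object_store,
            node_id,
            flush_interval,
            snapshot_size,
            replay_concurrency_limit,
        }
    }
}
```

Call sites then name each value explicitly, so adding a parameter such as the replay limit only touches the struct definition and the places that build it.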

The CLI changes are added to the WAL section, as shown below:

WAL Configuration:
  --wal-flush-interval <INTERVAL>  Interval to flush data to WAL file [default: 1s]
                                  [env: INFLUXDB3_WAL_FLUSH_INTERVAL=]
  --wal-snapshot-size <SIZE>       Number of WAL files per snapshot [default: 600]
                                  [env: INFLUXDB3_WAL_SNAPSHOT_SIZE=]
  --wal-max-write-buffer-size <SIZE>
                                  Max write requests in buffer [default: 100000]
                                  [env: INFLUXDB3_WAL_MAX_WRITE_BUFFER_SIZE=]
  --wal-replay-concurrency-limit <LIMIT>
                                  Concurrency limit during WAL replay [default: no_limit]
                                  If replay runs into OOM, set this to a lower number eg. 10
                                  [env: INFLUXDB3_WAL_REPLAY_CONCURRENCY_LIMIT=]
  --snapshotted-wal-files-to-keep <N>
                                  Number of snapshotted WAL files to retain [default: 300]
                                  [env: INFLUXDB3_NUM_WAL_FILES_TO_KEEP=]
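
One way to model a limit whose default is "no limit" is sketched below. This is an assumption-heavy illustration: the `ReplayConcurrency` type is hypothetical, and whether the real CLI accepts the literal string `no_limit` or rejects zero is not stated in this PR.

```rust
use std::str::FromStr;

/// Hypothetical representation of the replay concurrency setting.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ReplayConcurrency {
    NoLimit,
    Limit(usize),
}

impl FromStr for ReplayConcurrency {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim() {
            // Assumption: the default shown in the help text is also a valid input.
            "no_limit" => Ok(Self::NoLimit),
            other => match other.parse::<usize>() {
                Ok(n) if n > 0 => Ok(Self::Limit(n)),
                // Assumption: a limit of zero would make no progress, so reject it.
                Ok(_) => Err("concurrency limit must be greater than zero".to_string()),
                Err(e) => Err(format!("invalid concurrency limit {other:?}: {e}")),
            },
        }
    }
}
```

For example, `"10".parse::<ReplayConcurrency>()` yields `Ok(ReplayConcurrency::Limit(10))` in this sketch.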

closes: #26481

Tests

  • Setup: replay of 64 WAL files (approx. 3.8 MB each)
  • Against main: replay runs into OOM after about 20s
❯ time systemd-run --scope  -p MemoryMax=1000M -p CPUQuota=25% ./target-main/quick-release/influxdb3 serve --node-id node-1 --object-store file --data-dir /home/praveen/projects/influx/test-data/core-perf  --disable-telemetry-upload --snapshotted-wal-files-to-keep 10 --force-snapshot-mem-threshold 500 --exec-mem-pool-bytes 200 --log-filter 'info,iox_query=debug,influxdb3_server::query_executor=warn,influxdb3_server::http=warn,influxdb3_wal=debug,influxdb3_write::write_buffer::queryable_buffer=debug,influxdb3_write::write_buffer::table_buffer=debug,influxdb3::write_buffer=debug,influxdb3_enterprise=debug' --gen1-duration 10m --without-auth
==== AUTHENTICATING FOR org.freedesktop.systemd1.manage-units ====
Authentication is required to start transient unit 'run-p174171-i174471.scope'.
Authenticating as: praveen
Password:
==== AUTHENTICATION COMPLETE ====
Running as unit: run-p174171-i174471.scope; invocation ID: 19de343d39924f7b90ad104732243f3d
2025-06-03T13:09:33.294905Z  INFO influxdb3::commands::serve: InfluxDB 3 Core server starting node_id=node-1 git_hash=be25c6f52b046e57ec909b815e5471d4c6bb4f19 version=3.2.0-nightly uuid=8dfafa78-71fe-4b8d-86ee-326019983044 num_cpus=1
2025-06-03T13:09:33.296720Z  INFO influxdb3_clap_blocks::object_store: Object Store db_dir="/home/praveen/projects/influx/test-data/core-perf" object_store_type="Directory"
2025-06-03T13:09:33.298653Z  INFO influxdb3::commands::serve: Creating shared query executor num_threads=1
2025-06-03T13:09:33.374024Z  INFO influxdb3_catalog::catalog::update: create database name="_internal"
2025-06-03T13:09:33.374086Z  INFO influxdb3::commands::serve: catalog initialized catalog_uuid=447f1558-23b3-496d-b129-474ab8cb0dfc
2025-06-03T13:09:33.374102Z  INFO influxdb3_catalog::catalog::update: register node node_id="node-1" core_count=1 mode=[Core]
2025-06-03T13:09:33.374127Z  INFO influxdb3_catalog::catalog::update: registering node to catalog that was not previously de-registered node_id="node-1" instance_id="4837d978-a131-418b-a5e3-bf8bb773dbb7"
2025-06-03T13:09:33.472809Z  INFO influxdb3_catalog::object_store: persisted next catalog sequence put_result=PutResult { e_tag: Some("3a27213-636aa97728f58-165"), version: None } object_path=CatalogFilePath(Path { raw: "node-1/catalogs/00000000000000000013.catalog" })
2025-06-03T13:09:33.473500Z  INFO influxdb3::commands::serve: catalog initialized instance_id="4837d978-a131-418b-a5e3-bf8bb773dbb7"
2025-06-03T13:09:33.478538Z DEBUG influxdb3_wal::object_store: file name path and wal file name file_name_with_path="00000000001.wal" wal_file_name="00000000001"
2025-06-03T13:09:33.478577Z DEBUG influxdb3_wal::object_store: replaying
2025-06-03T13:09:35.474071Z  INFO influxdb3_wal::snapshot_tracker: timestamps passed in and wal file num min_time=Timestamp(1748953567410460370) max_time=Timestamp(1748953567899190867) wal_file_number=WalFileSequenceNumber(1)
2025-06-03T13:09:35.474131Z  INFO influxdb3_wal::object_store: replaying WAL file n_ops=1 min_timestamp_ns=1748953567410460370 max_timestamp_ns=1748953567899190867 wal_file_number=1 snapshot_details=None
2025-06-03T13:09:35.777740Z  INFO influxdb3_wal::snapshot_tracker: timestamps passed in and wal file num min_time=Timestamp(1748953568019636338) max_time=Timestamp(1748953568495005790) wal_file_number=WalFileSequenceNumber(2)
2025-06-03T13:09:35.777769Z  INFO influxdb3_wal::object_store: replaying WAL file n_ops=1 min_timestamp_ns=1748953568019636338 max_timestamp_ns=1748953568495005790 wal_file_number=2 snapshot_details=None
2025-06-03T13:09:35.970735Z  INFO influxdb3_wal::snapshot_tracker: timestamps passed in and wal file num min_time=Timestamp(1748953569022454358) max_time=Timestamp(1748953569311694350) wal_file_number=WalFileSequenceNumber(3)
2025-06-03T13:09:35.970804Z  INFO influxdb3_wal::object_store: replaying WAL file n_ops=1 min_timestamp_ns=1748953569022454358 max_timestamp_ns=1748953569311694350 wal_file_number=3 snapshot_details=None
2025-06-03T13:09:36.075188Z  INFO influxdb3_wal::snapshot_tracker: timestamps passed in and wal file num min_time=Timestamp(1748953570024425579) max_time=Timestamp(1748953570309178036) wal_file_number=WalFileSequenceNumber(4)
2025-06-03T13:09:36.075217Z  INFO influxdb3_wal::object_store: replaying WAL file n_ops=1 min_timestamp_ns=1748953570024425579 max_timestamp_ns=1748953570309178036 wal_file_number=4 snapshot_details=None

________________________________________________________
Executed in   20.76 secs    fish           external
   usr time    0.57 secs    0.00 millis    0.57 secs
   sys time    3.55 secs    1.25 millis    3.54 secs

  • Against the branch without setting a concurrency limit: replay runs into OOM after about 7s
❯ time systemd-run --scope  -p MemoryMax=1000M -p CPUQuota=25% ./target/quick-release/influxdb3 serve --node-id node-1 --object-store file --data-dir /home/praveen/projects/influx/test-data/core-perf  --disable-telemetry-upload --snapshotted-wal-files-to-keep 10 --force-snapshot-mem-threshold 500 --exec-mem-pool-bytes 200 --log-filter 'info,iox_query=debug,influxdb3_server::query_executor=warn,influxdb3_server::http=warn,influxdb3_wal=debug,influxdb3_write::write_buffer::queryable_buffer=debug,influxdb3_write::write_buffer::table_buffer=debug,influxdb3::write_buffer=debug,influxdb3_enterprise=debug' --gen1-duration 10m --without-auth
==== AUTHENTICATING FOR org.freedesktop.systemd1.manage-units ====
Authentication is required to start transient unit 'run-p174452-i174752.scope'.
Authenticating as: praveen
Password:
==== AUTHENTICATION COMPLETE ====
Running as unit: run-p174452-i174752.scope; invocation ID: 64cd6b8a6d5b496bbe85faecd0d60fb0
2025-06-03T13:10:26.270485Z  INFO influxdb3::commands::serve: InfluxDB 3 Core server starting node_id=node-1 git_hash=58c6039c97bf74db8728dc0fd7421c288e387c88 version=3.2.0-nightly uuid=e2c639eb-60e3-4fc8-9589-250e0504ca44 num_cpus=1
2025-06-03T13:10:26.270605Z  INFO influxdb3_clap_blocks::object_store: Object Store db_dir="/home/praveen/projects/influx/test-data/core-perf" object_store_type="Directory"
2025-06-03T13:10:26.270792Z  INFO influxdb3::commands::serve: Creating shared query executor num_threads=1
2025-06-03T13:10:26.287371Z  INFO influxdb3_catalog::catalog::update: create database name="_internal"
2025-06-03T13:10:26.287413Z  INFO influxdb3::commands::serve: catalog initialized catalog_uuid=447f1558-23b3-496d-b129-474ab8cb0dfc
2025-06-03T13:10:26.287422Z  INFO influxdb3_catalog::catalog::update: register node node_id="node-1" core_count=1 mode=[Core]
2025-06-03T13:10:26.287432Z  INFO influxdb3_catalog::catalog::update: registering node to catalog that was not previously de-registered node_id="node-1" instance_id="4837d978-a131-418b-a5e3-bf8bb773dbb7"
2025-06-03T13:10:26.287625Z  INFO influxdb3_catalog::object_store: persisted next catalog sequence put_result=PutResult { e_tag: Some("3a27377-636aa9a987373-165"), version: None } object_path=CatalogFilePath(Path { raw: "node-1/catalogs/00000000000000000014.catalog" })
2025-06-03T13:10:26.287689Z  INFO influxdb3::commands::serve: catalog initialized instance_id="4837d978-a131-418b-a5e3-bf8bb773dbb7"
2025-06-03T13:10:26.288574Z DEBUG influxdb3_wal::object_store: file name path and wal file name file_name_with_path="00000000001.wal" wal_file_name="00000000001"
2025-06-03T13:10:26.288590Z DEBUG influxdb3_wal::object_store: replaying

________________________________________________________
Executed in    7.13 secs      fish           external
   usr time  298.38 millis  478.00 micros  297.90 millis
   sys time  454.42 millis  726.00 micros  453.70 millis

  • Against the branch with the concurrency limit set to 10: all WAL files load in about 9s and the server starts in 9.3s
❯ time systemd-run --scope  -p MemoryMax=1000M -p CPUQuota=25% ./target/quick-release/influxdb3 serve --node-id node-1 --object-store file --data-dir /home/praveen/projects/influx/test-data/core-perf  --disable-telemetry-upload --snapshotted-wal-files-to-keep 10 --force-snapshot-mem-threshold 500 --exec-mem-pool-bytes 200 --log-filter 'info,iox_query=debug,influxdb3_server::query_executor=warn,influxdb3_server::http=warn,influxdb3_wal=debug,influxdb3_write::write_buffer::queryable_buffer=debug,influxdb3_write::write_buffer::table_buffer=debug,influxdb3::write_buffer=debug,influxdb3_enterprise=debug' --gen1-duration 10m --without-auth --wal-replay-concurrency-limit 10
==== AUTHENTICATING FOR org.freedesktop.systemd1.manage-units ====
Authentication is required to start transient unit 'run-p174626-i174926.scope'.
Authenticating as: praveen
Password:
==== AUTHENTICATION COMPLETE ====
Running as unit: run-p174626-i174926.scope; invocation ID: f7fa9cda27b546869fd765f54181cb84
2025-06-03T13:10:55.018808Z  INFO influxdb3::commands::serve: InfluxDB 3 Core server starting node_id=node-1 git_hash=58c6039c97bf74db8728dc0fd7421c288e387c88 version=3.2.0-nightly uuid=a81a1558-25d0-43ea-83ef-92da6d4efe26 num_cpus=1
2025-06-03T13:10:55.018898Z  INFO influxdb3_clap_blocks::object_store: Object Store db_dir="/home/praveen/projects/influx/test-data/core-perf" object_store_type="Directory"
2025-06-03T13:10:55.019021Z  INFO influxdb3::commands::serve: Creating shared query executor num_threads=1
2025-06-03T13:10:55.038164Z  INFO influxdb3_catalog::catalog::update: create database name="_internal"
2025-06-03T13:10:55.038214Z  INFO influxdb3::commands::serve: catalog initialized catalog_uuid=447f1558-23b3-496d-b129-474ab8cb0dfc
2025-06-03T13:10:55.038223Z  INFO influxdb3_catalog::catalog::update: register node node_id="node-1" core_count=1 mode=[Core]
2025-06-03T13:10:55.038235Z  INFO influxdb3_catalog::catalog::update: registering node to catalog that was not previously de-registered node_id="node-1" instance_id="4837d978-a131-418b-a5e3-bf8bb773dbb7"
2025-06-03T13:10:55.038402Z  INFO influxdb3_catalog::object_store: persisted next catalog sequence put_result=PutResult { e_tag: Some("3a27378-636aa9c4f27ba-165"), version: None } object_path=CatalogFilePath(Path { raw: "node-1/catalogs/00000000000000000015.catalog" })
2025-06-03T13:10:55.038462Z  INFO influxdb3::commands::serve: catalog initialized instance_id="4837d978-a131-418b-a5e3-bf8bb773dbb7"
2025-06-03T13:10:55.039205Z DEBUG influxdb3_wal::object_store: file name path and wal file name file_name_with_path="00000000001.wal" wal_file_name="00000000001"
2025-06-03T13:10:55.039218Z DEBUG influxdb3_wal::object_store: replaying
2025-06-03T13:10:56.615728Z DEBUG influxdb3_wal::object_store: replaying batch completed time_taken=1.576492836s batch_len=10
2025-06-03T13:10:57.926043Z DEBUG influxdb3_wal::object_store: replaying batch completed time_taken=1.310273005s batch_len=10
2025-06-03T13:10:59.320192Z DEBUG influxdb3_wal::object_store: replaying batch completed time_taken=1.394112687s batch_len=10
2025-06-03T13:11:00.519033Z DEBUG influxdb3_wal::object_store: replaying batch completed time_taken=1.198800107s batch_len=10
2025-06-03T13:11:01.818391Z DEBUG influxdb3_wal::object_store: replaying batch completed time_taken=1.299318334s batch_len=10
2025-06-03T13:11:03.225094Z DEBUG influxdb3_wal::object_store: replaying batch completed time_taken=1.406665535s batch_len=10
2025-06-03T13:11:03.727570Z DEBUG influxdb3_wal::object_store: replaying batch completed time_taken=502.446051ms batch_len=5
2025-06-03T13:11:03.727612Z  INFO influxdb3::commands::serve: setting up background mem check for query buffer
2025-06-03T13:11:03.727618Z  INFO influxdb3::commands::serve: setting up telemetry store
2025-06-03T13:11:03.727632Z  WARN influxdb3::commands::serve: server started without auth (`--without-auth` switch), all token creation and regeneration of admin token endpoints are disabled
2025-06-03T13:11:04.333884Z  INFO influxdb3::commands::serve: setting up server with authz disabled for paths paths_without_authz=[]
2025-06-03T13:11:04.333973Z  INFO influxdb3_server: startup time: 9315ms address=0.0.0.0:8181
^C2025-06-03T13:11:04.993871Z  INFO influxdb3_shutdown: Received SIGINT
2025-06-03T13:11:04.993910Z  INFO influxdb3::commands::serve: shutdown requested
2025-06-03T13:11:04.994008Z  INFO influxdb3_catalog::catalog: updating node state to stopped in catalog node_id="node-1"
2025-06-03T13:11:04.994049Z  INFO influxdb3_catalog::catalog::update: updating node state to Stopped in catalog node_id="node-1" process_uuid=a81a1558-25d0-43ea-83ef-92da6d4efe26
2025-06-03T13:11:04.994511Z  INFO influxdb3_catalog::object_store: persisted next catalog sequence put_result=PutResult { e_tag: Some("3a27387-636aa9ce7123c-10a"), version: None } object_path=CatalogFilePath(Path { raw: "node-1/catalogs/00000000000000000016.catalog" })
2025-06-03T13:11:04.994747Z  INFO influxdb3::commands::serve: frontend shutdown completed
2025-06-03T13:11:04.994771Z  INFO influxdb3::commands::serve: backend shutdown completed

________________________________________________________
Executed in   14.23 secs    fish           external
   usr time    1.86 secs    0.00 millis    1.86 secs
   sys time    0.51 secs    1.74 millis    0.51 secs

A second set of tests shows that, without a concurrency limit, the branch still loads all the WAL files and takes roughly the same time as main. In this setup there were 39 files (~3.8 MB each), and they all loaded in about 6s.

  • Against main
❯ time systemd-run --scope  -p MemoryMax=1000M -p CPUQuota=25% ./target-main/quick-release/influxdb3 serve --node-id node-1 --object-store file --data-dir /home/praveen/projects/influx/test-data/core-perf  --disable-telemetry-upload --snapshotted-wal-files-to-keep 10 --force-snapshot-mem-threshold 500 --exec-mem-pool-bytes 200 --log-filter 'info,iox_query=debug,influxdb3_server::query_executor=warn,influxdb3_server::http=warn,influxdb3_wal=debug,influxdb3_write::write_buffer::queryable_buffer=debug,influxdb3_write::write_buffer::table_buffer=debug,influxdb3::write_buffer=debug,influxdb3_enterprise=debug' --gen1-duration 10m --without-auth
==== AUTHENTICATING FOR org.freedesktop.systemd1.manage-units ====
Authentication is required to start transient unit 'run-p193384-i193684.scope'.
Authenticating as: praveen
Password:
==== AUTHENTICATION COMPLETE ====
Running as unit: run-p193384-i193684.scope; invocation ID: e3b68a0be22440c8a1651bee03b56665
2025-06-03T13:57:11.569507Z  INFO influxdb3::commands::serve: InfluxDB 3 Core server starting node_id=node-1 git_hash=be25c6f52b046e57ec909b815e5471d4c6bb4f19 version=3.2.0-nightly uuid=8643ed3b-3132-4ca5-8171-e84cc049d1b8 num_cpus=1
2025-06-03T13:57:11.570215Z  INFO influxdb3_clap_blocks::object_store: Object Store db_dir="/home/praveen/projects/influx/test-data/core-perf" object_store_type="Directory"
2025-06-03T13:57:11.570566Z  INFO influxdb3::commands::serve: Creating shared query executor num_threads=1
2025-06-03T13:57:11.596158Z  INFO influxdb3_catalog::catalog::update: create database name="_internal"
2025-06-03T13:57:11.596192Z  INFO influxdb3::commands::serve: catalog initialized catalog_uuid=6da27f5c-c31e-4a06-b7d3-695bedccd7c8
2025-06-03T13:57:11.596199Z  INFO influxdb3_catalog::catalog::update: register node node_id="node-1" core_count=1 mode=[Core]
2025-06-03T13:57:11.606365Z  INFO influxdb3_catalog::object_store: persisted next catalog sequence put_result=PutResult { e_tag: Some("3a27225-636ab41ce0f56-165"), version: None } object_path=CatalogFilePath(Path { raw: "node-1/catalogs/00000000000000000011.catalog" })
2025-06-03T13:57:11.606504Z  INFO influxdb3::commands::serve: catalog initialized instance_id="e6924667-4e43-4448-b415-1db2bf1947ff"
2025-06-03T13:57:11.608105Z DEBUG influxdb3_wal::object_store: file name path and wal file name file_name_with_path="00000000001.wal" wal_file_name="00000000001"
2025-06-03T13:57:11.608142Z DEBUG influxdb3_wal::object_store: replaying
2025-06-03T13:57:12.709919Z  INFO influxdb3_wal::snapshot_tracker: timestamps passed in and wal file num min_time=Timestamp(1748958535005913635) max_time=Timestamp(1748958535385503195) wal_file_number=WalFileSequenceNumber(1)
2025-06-03T13:57:12.709967Z  INFO influxdb3_wal::object_store: replaying WAL file n_ops=1 min_timestamp_ns=1748958535005913635 max_timestamp_ns=1748958535385503195 wal_file_number=1 snapshot_details=None
2025-06-03T13:57:14.024660Z  INFO influxdb3_wal::snapshot_tracker: timestamps passed in and wal file num min_time=Timestamp(1748958536010776420) max_time=Timestamp(1748958536381581570) wal_file_number=WalFileSequenceNumber(2)
...
2025-06-03T13:57:17.421149Z  INFO influxdb3_wal::object_store: replaying WAL file n_ops=1 min_timestamp_ns=1748958570316101350 max_timestamp_ns=1748958570675102202 wal_file_number=37 snapshot_details=None
2025-06-03T13:57:17.509454Z  INFO influxdb3_wal::snapshot_tracker: timestamps passed in and wal file num min_time=Timestamp(1748958571319291536) max_time=Timestamp(1748958571595634555) wal_file_number=WalFileSequenceNumber(38)
2025-06-03T13:57:17.509501Z  INFO influxdb3_wal::object_store: replaying WAL file n_ops=1 min_timestamp_ns=1748958571319291536 max_timestamp_ns=1748958571595634555 wal_file_number=38 snapshot_details=None
2025-06-03T13:57:17.521490Z  INFO influxdb3_wal::snapshot_tracker: timestamps passed in and wal file num min_time=Timestamp(1748958572324018085) max_time=Timestamp(1748958572676922851) wal_file_number=WalFileSequenceNumber(39)
2025-06-03T13:57:17.521504Z  INFO influxdb3_wal::object_store: replaying WAL file n_ops=1 min_timestamp_ns=1748958572324018085 max_timestamp_ns=1748958572676922851 wal_file_number=39 snapshot_details=None
2025-06-03T13:57:17.614194Z  INFO influxdb3::commands::serve: setting up background mem check for query buffer
2025-06-03T13:57:17.614250Z  INFO influxdb3::commands::serve: setting up telemetry store
2025-06-03T13:57:17.614318Z  WARN influxdb3::commands::serve: server started without auth (`--without-auth` switch), all token creation and regeneration of admin token endpoints are disabled
2025-06-03T13:57:18.128214Z  INFO influxdb3::commands::serve: setting up server with authz disabled for paths paths_without_authz=[]
2025-06-03T13:57:18.128414Z  INFO influxdb3_server: startup time: 6559ms address=0.0.0.0:8181

  • Against the branch without a concurrency limit
❯ time systemd-run --scope  -p MemoryMax=1000M -p CPUQuota=25% ./target/quick-release/influxdb3 serve --node-id node-1 --object-store file --data-dir /home/praveen/projects/influx/test-data/core-perf  --disable-telemetry-upload --snapshotted-wal-files-to-keep 10 --force-snapshot-mem-threshold 500 --exec-mem-pool-bytes 200 --log-filter 'info,iox_query=debug,influxdb3_server::query_executor=warn,influxdb3_server::http=warn,influxdb3_wal=debug,influxdb3_write::write_buffer::queryable_buffer=debug,influxdb3_write::write_buffer::table_buffer=debug,influxdb3::write_buffer=debug,influxdb3_enterprise=debug' --gen1-duration 10m --without-auth
==== AUTHENTICATING FOR org.freedesktop.systemd1.manage-units ====
Authentication is required to start transient unit 'run-p193646-i193946.scope'.
Authenticating as: praveen
Password:
==== AUTHENTICATION COMPLETE ====
Running as unit: run-p193646-i193946.scope; invocation ID: fa403ef5c0e144e7adb3b89154fecd3c
2025-06-03T13:57:51.416803Z  INFO influxdb3::commands::serve: InfluxDB 3 Core server starting node_id=node-1 git_hash=58c6039c97bf74db8728dc0fd7421c288e387c88 version=3.2.0-nightly uuid=9256a158-d0bc-4704-89f8-d703d84f172c num_cpus=1
2025-06-03T13:57:51.416898Z  INFO influxdb3_clap_blocks::object_store: Object Store db_dir="/home/praveen/projects/influx/test-data/core-perf" object_store_type="Directory"
2025-06-03T13:57:51.417014Z  INFO influxdb3::commands::serve: Creating shared query executor num_threads=1
2025-06-03T13:57:51.486100Z  INFO influxdb3_catalog::catalog::update: create database name="_internal"
2025-06-03T13:57:51.486158Z  INFO influxdb3::commands::serve: catalog initialized catalog_uuid=6da27f5c-c31e-4a06-b7d3-695bedccd7c8
2025-06-03T13:57:51.486169Z  INFO influxdb3_catalog::catalog::update: register node node_id="node-1" core_count=1 mode=[Core]
2025-06-03T13:57:51.486645Z  INFO influxdb3_catalog::object_store: persisted next catalog sequence put_result=PutResult { e_tag: Some("3a274b5-636ab442ebd91-165"), version: None } object_path=CatalogFilePath(Path { raw: "node-1/catalogs/00000000000000000013.catalog" })
2025-06-03T13:57:51.486913Z  INFO influxdb3::commands::serve: catalog initialized instance_id="e6924667-4e43-4448-b415-1db2bf1947ff"
2025-06-03T13:57:51.488498Z DEBUG influxdb3_wal::object_store: file name path and wal file name file_name_with_path="00000000001.wal" wal_file_name="00000000001"
2025-06-03T13:57:51.488544Z  INFO influxdb3_wal::object_store: replaying WAL files
2025-06-03T13:57:52.390484Z  INFO influxdb3_wal::object_store: replaying WAL file with details n_ops=1 min_timestamp_ns=1748958535005913635 max_timestamp_ns=1748958535385503195 wal_file_number=1 snapshot_details=None
...
2025-06-03T13:57:56.983994Z  INFO influxdb3_wal::object_store: replaying WAL file with details n_ops=1 min_timestamp_ns=1748958570316101350 max_timestamp_ns=1748958570675102202 wal_file_number=37 snapshot_details=None
2025-06-03T13:57:56.996816Z  INFO influxdb3_wal::object_store: replaying WAL file with details n_ops=1 min_timestamp_ns=1748958571319291536 max_timestamp_ns=1748958571595634555 wal_file_number=38 snapshot_details=None
2025-06-03T13:57:57.083212Z  INFO influxdb3_wal::object_store: replaying WAL file with details n_ops=1 min_timestamp_ns=1748958572324018085 max_timestamp_ns=1748958572676922851 wal_file_number=39 snapshot_details=None
2025-06-03T13:57:57.098379Z DEBUG influxdb3_wal::object_store: replaying batch completed time_taken=5.609814321s batch_len=39
2025-06-03T13:57:57.098434Z  INFO influxdb3_wal::object_store: completed replaying wal files time_taken=5.609893357s
2025-06-03T13:57:57.098490Z  INFO influxdb3::commands::serve: setting up background mem check for query buffer
2025-06-03T13:57:57.098504Z  INFO influxdb3::commands::serve: setting up telemetry store
2025-06-03T13:57:57.098559Z  WARN influxdb3::commands::serve: server started without auth (`--without-auth` switch), all token creation and regeneration of admin token endpoints are disabled
2025-06-03T13:57:57.695835Z  INFO influxdb3::commands::serve: setting up server with authz disabled for paths paths_without_authz=[]
2025-06-03T13:57:57.695924Z  INFO influxdb3_server: startup time: 6279ms address=0.0.0.0:8181

  • Against the branch with the concurrency limit set to 10
❯ time systemd-run --scope  -p MemoryMax=1000M -p CPUQuota=25% ./target/quick-release/influxdb3 serve --node-id node-1 --object-store file --data-dir /home/praveen/projects/influx/test-data/core-perf  --disable-telemetry-upload --snapshotted-wal-files-to-keep 10 --force-snapshot-mem-threshold 500 --exec-mem-pool-bytes 200 --log-filter 'info,iox_query=debug,influxdb3_server::query_executor=warn,influxdb3_server::http=warn,influxdb3_wal=debug,influxdb3_write::write_buffer::queryable_buffer=debug,influxdb3_write::write_buffer::table_buffer=debug,influxdb3::write_buffer=debug,influxdb3_enterprise=debug' --gen1-duration 10m --without-auth --wal-replay-concurrency-limit 10
==== AUTHENTICATING FOR org.freedesktop.systemd1.manage-units ====
Authentication is required to start transient unit 'run-p193755-i194055.scope'.
Authenticating as: praveen
Password:
==== AUTHENTICATION COMPLETE ====
Running as unit: run-p193755-i194055.scope; invocation ID: f19c613040664a83b158bbba5a1827ce
2025-06-03T13:58:16.621875Z  INFO influxdb3::commands::serve: InfluxDB 3 Core server starting node_id=node-1 git_hash=58c6039c97bf74db8728dc0fd7421c288e387c88 version=3.2.0-nightly uuid=b1b2c6a2-f140-488a-a0fc-53bfb325f3f2 num_cpus=1
2025-06-03T13:58:16.621961Z  INFO influxdb3_clap_blocks::object_store: Object Store db_dir="/home/praveen/projects/influx/test-data/core-perf" object_store_type="Directory"
2025-06-03T13:58:16.622083Z  INFO influxdb3::commands::serve: Creating shared query executor num_threads=1
2025-06-03T13:58:16.704073Z  INFO influxdb3_catalog::catalog::update: create database name="_internal"
2025-06-03T13:58:16.704146Z  INFO influxdb3::commands::serve: catalog initialized catalog_uuid=6da27f5c-c31e-4a06-b7d3-695bedccd7c8
2025-06-03T13:58:16.704159Z  INFO influxdb3_catalog::catalog::update: register node node_id="node-1" core_count=1 mode=[Core]
2025-06-03T13:58:16.704545Z  INFO influxdb3_catalog::object_store: persisted next catalog sequence put_result=PutResult { e_tag: Some("3a274b7-636ab45af8821-165"), version: None } object_path=CatalogFilePath(Path { raw: "node-1/catalogs/00000000000000000015.catalog" })
2025-06-03T13:58:16.704747Z  INFO influxdb3::commands::serve: catalog initialized instance_id="e6924667-4e43-4448-b415-1db2bf1947ff"
2025-06-03T13:58:16.706511Z DEBUG influxdb3_wal::object_store: file name path and wal file name file_name_with_path="00000000001.wal" wal_file_name="00000000001"
2025-06-03T13:58:16.706526Z  INFO influxdb3_wal::object_store: replaying WAL files
2025-06-03T13:58:17.425994Z  INFO influxdb3_wal::object_store: replaying WAL file with details n_ops=1 min_timestamp_ns=1748958535005913635 max_timestamp_ns=1748958535385503195 wal_file_number=1 snapshot_details=None
..
2025-06-03T13:58:18.014564Z DEBUG influxdb3_wal::object_store: replaying batch completed time_taken=1.308013331s batch_len=10
...
2025-06-03T13:58:19.207792Z DEBUG influxdb3_wal::object_store: replaying batch completed time_taken=1.193187331s batch_len=10
...
2025-06-03T13:58:20.518162Z DEBUG influxdb3_wal::object_store: replaying batch completed time_taken=1.310334895s batch_len=10
...
2025-06-03T13:58:21.502265Z  INFO influxdb3_wal::object_store: replaying WAL file with details n_ops=1 min_timestamp_ns=1748958570316101350 max_timestamp_ns=1748958570675102202 wal_file_number=37 snapshot_details=None
2025-06-03T13:58:21.515776Z  INFO influxdb3_wal::object_store: replaying WAL file with details n_ops=1 min_timestamp_ns=1748958571319291536 max_timestamp_ns=1748958571595634555 wal_file_number=38 snapshot_details=None
2025-06-03T13:58:21.527686Z  INFO influxdb3_wal::object_store: replaying WAL file with details n_ops=1 min_timestamp_ns=1748958572324018085 max_timestamp_ns=1748958572676922851 wal_file_number=39 snapshot_details=None
2025-06-03T13:58:21.617081Z DEBUG influxdb3_wal::object_store: replaying batch completed time_taken=1.098898418s batch_len=9
2025-06-03T13:58:21.617114Z  INFO influxdb3_wal::object_store: completed replaying wal files time_taken=4.910589488s
2025-06-03T13:58:21.617139Z  INFO influxdb3::commands::serve: setting up background mem check for query buffer
2025-06-03T13:58:21.617144Z  INFO influxdb3::commands::serve: setting up telemetry store
2025-06-03T13:58:21.617155Z  WARN influxdb3::commands::serve: server started without auth (`--without-auth` switch), all token creation and regeneration of admin token endpoints are disabled
2025-06-03T13:58:22.205161Z  INFO influxdb3::commands::serve: setting up server with authz disabled for paths paths_without_authz=[]
2025-06-03T13:58:22.205264Z  INFO influxdb3_server: startup time: 5583ms address=0.0.0.0:8181

praveen-influx force-pushed the pk/wal-replay-concurrency-limit branch 5 times, most recently from d24d593 to c17f448 (June 3, 2025 14:06)
praveen-influx marked this pull request as ready for review (June 3, 2025 14:08)
praveen-influx requested a review from a team (June 3, 2025 14:08)
praveen-influx force-pushed the pk/wal-replay-concurrency-limit branch from c17f448 to ba20225 (June 3, 2025 14:13)

WAL replay currently loads _all_ WAL files concurrently running into
OOM. This commit adds a CLI parameter `--wal-replay-concurrency-limit`
that would allow the user to set a lower limit and run WAL replay again.

closes: #26481

praveen-influx force-pushed the pk/wal-replay-concurrency-limit branch from ba20225 to 29a73cd (June 3, 2025 14:14)
praveen-influx merged commit a67b50d into main (Jun 3, 2025)
13 checks passed
Successfully merging this pull request may close these issues.

Stop WAL replay from running into OOM