optimize WACZ creation: #976

Draft
ikreymer wants to merge 2 commits into main from optimize-wacz

Conversation

@ikreymer
Member

  • pipe cdx lines -> gzip -> hasher -> output instead of joining into one string buffer
  • finish cdx block if string array somehow exceeds 10MB to avoid very large cdx blocks
  • use native finished() function in streamFinish(), catch exceptions

(Was initially done to address large CDX lines, but seems like a better approach overall. Not super high pri now that warcio 2.4.10 limits CDX line length)
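
The streaming approach described in the bullets could look roughly like the sketch below. It is not the code from this PR: `splitIntoBlocks`, `writeCompressedBlock`, `MAX_BLOCK_LINES`, and the use of `pipeline()` from `node:stream/promises` (in place of manual `.pipe()` calls) are assumptions made for illustration, and only the 10MB cap comes from the description.

```typescript
import { createHash } from "node:crypto";
import { Readable, Transform, Writable } from "node:stream";
import { pipeline } from "node:stream/promises";
import { createGzip } from "node:zlib";

// Assumed limits: 10MB comes from the description above; the line count
// per block is a placeholder, not a value taken from the crawler.
const MAX_BLOCK_BYTES = 10 * 1024 * 1024;
const MAX_BLOCK_LINES = 1024;

// Group CDX lines into blocks, closing a block early once its total
// size exceeds maxBytes.
function splitIntoBlocks(
  lines: string[],
  maxLines: number = MAX_BLOCK_LINES,
  maxBytes: number = MAX_BLOCK_BYTES,
): string[][] {
  const blocks: string[][] = [];
  let block: string[] = [];
  let size = 0;
  for (const line of lines) {
    block.push(line);
    size += Buffer.byteLength(line);
    if (block.length >= maxLines || size > maxBytes) {
      blocks.push(block);
      block = [];
      size = 0;
    }
  }
  if (block.length > 0) blocks.push(block);
  return blocks;
}

// Stream one block through gzip and a hashing pass-through into `output`
// without joining the lines into a single string first.
// Returns the SHA-256 (hex) of the compressed bytes.
async function writeCompressedBlock(
  lines: string[],
  output: Writable,
): Promise<string> {
  const hash = createHash("sha256");

  // Forwards each compressed chunk unchanged while updating the hash.
  const hasher = new Transform({
    transform(chunk, _encoding, callback) {
      hash.update(chunk);
      callback(null, chunk);
    },
  });

  // pipeline() handles backpressure and rejects if any stage errors.
  await pipeline(Readable.from(lines), createGzip(), hasher, output);

  // Finalize the hash only after every chunk has passed through.
  return hash.digest("hex");
}
```

Note that `pipeline()` ends `output` when the source is exhausted, so this sketch writes a single block per output stream. A writer that appends many blocks to one file would need to keep the destination open between blocks.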

ikreymer requested review from emma-sg and tw4l on February 17, 2026 19:37
@tw4l
Member

tw4l commented Feb 18, 2026

Seeing the following messages in testing:

{"timestamp":"2026-02-18T16:27:01.457Z","logLevel":"info","context":"general","message":"Merging CDX","details":{}}
(node:1) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 error listeners added to [WriteStream]. MaxListeners is 10. Use emitter.setMaxListeners() to increase limit
(node:1) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 close listeners added to [WriteStream]. MaxListeners is 10. Use emitter.setMaxListeners() to increase limit
(node:1) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 finish listeners added to [WriteStream]. MaxListeners is 10. Use emitter.setMaxListeners() to increase limit
(node:1) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 end listeners added to [WriteStream]. MaxListeners is 10. Use emitter.setMaxListeners() to increase limit
{"timestamp":"2026-02-18T16:27:01.739Z","logLevel":"info","context":"general","message":"Generating WACZ","details":{}}
{"timestamp":"2026-02-18T16:27:01.740Z","logLevel":"info","context":"general","message":"Num WARC Files: 4","details":{}}
{"timestamp":"2026-02-18T16:27:24.253Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: done","details":{}}
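
The four warnings name the error, close, finish and end events, which are the events Node's `finished()` subscribes to. One possible explanation, which is an assumption and not confirmed in this thread, is that `finished()` is being called repeatedly on the same long-lived WriteStream without the earlier listeners being removed. The sketch below shows that effect on a stand-in stream; `countFinishListeners` is a hypothetical helper, not code from the crawler.

```typescript
import { PassThrough, finished } from "node:stream";

// Count the "finish" listeners that repeated finished() calls leave on
// one stream, before and after running the returned cleanup functions.
function countFinishListeners(calls: number): {
  leaked: number;
  afterCleanup: number;
} {
  const out = new PassThrough(); // stand-in for the long-lived WriteStream
  const base = out.listenerCount("finish");

  // The callback form of finished() returns a function that removes
  // the listeners it registered.
  const cleanups: Array<() => void> = [];
  for (let i = 0; i < calls; i++) {
    cleanups.push(finished(out, () => {}));
  }
  const leaked = out.listenerCount("finish") - base;

  for (const cleanup of cleanups) cleanup();
  const afterCleanup = out.listenerCount("finish") - base;

  return { leaked, afterCleanup };
}
```

If this is the cause, calling the cleanup function after each wait (or passing `cleanup: true` to the promise variant in `node:stream/promises`, available in recent Node.js releases) should keep the listener count flat rather than raising the limit with `setMaxListeners()`.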

@tw4l
Member

tw4l commented Feb 18, 2026

Post change, now seeing a more serious error:

{"timestamp":"2026-02-18T20:03:31.759Z","logLevel":"info","context":"general","message":"Merging CDX","details":{}}
node:internal/crypto/hash:135
    throw new ERR_CRYPTO_HASH_FINALIZED();
    ^

Error [ERR_CRYPTO_HASH_FINALIZED]: Digest already called
    at Hash.update (node:internal/crypto/hash:135:11)
    at Transform.transform [as _transform] (file:///app/dist/util/wacz.js:36:31)
    at Transform._write (node:internal/streams/transform:171:8)
    at writeOrBuffer (node:internal/streams/writable:572:12)
    at _write (node:internal/streams/writable:501:10)
    at Writable.write (node:internal/streams/writable:510:10)
    at Gzip.ondata (node:internal/streams/readable:1009:22)
    at Gzip.emit (node:events:524:28)
    at addChunk (node:internal/streams/readable:561:12)
    at readableAddChunkPushByteMode (node:internal/streams/readable:512:3) {
  code: 'ERR_CRYPTO_HASH_FINALIZED'
}

Node.js v20.20.0

This was also uncaught, so it stopped execution of the crawler at that point.
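
The stack trace shows `Hash.update` being called from the transform after `digest()` had already run, and the exception escaping because it was thrown synchronously inside `_transform`. A minimal sketch of one way to make such a failure catchable is below. `safeHasher` and `demoFinalizedHash` are hypothetical names, and the early `digest()` call only reproduces the error for demonstration; it is not a claim about where the real bug is.

```typescript
import { createHash, Hash } from "node:crypto";
import { Readable, Transform, Writable } from "node:stream";
import { pipeline } from "node:stream/promises";

// Hashing pass-through that reports hash errors through the stream
// callback instead of throwing out of transform().
function safeHasher(hash: Hash): Transform {
  return new Transform({
    transform(chunk, _encoding, callback) {
      try {
        hash.update(chunk);
      } catch (err) {
        callback(err as Error);
        return;
      }
      callback(null, chunk);
    },
  });
}

// Reproduce "Digest already called" and show that it now surfaces as a
// rejected pipeline() promise rather than an uncaught exception.
async function demoFinalizedHash(): Promise<string> {
  const hash = createHash("sha256");
  hash.digest(); // finalize too early, on purpose

  const sink = new Writable({
    write(_chunk, _encoding, callback) {
      callback();
    },
  });

  try {
    await pipeline(
      Readable.from([Buffer.from("cdx line\n")]),
      safeHasher(hash),
      sink,
    );
    return "ok";
  } catch (err) {
    return (err as { code?: string }).code ?? "unknown";
  }
}
```

With the error routed through the callback, the caller can log it and continue or abort cleanly. The underlying fix would still be to call `digest()` only after the last chunk has been written, for example after the pipeline has finished.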

ikreymer marked this pull request as draft on February 20, 2026 07:22