Do not use "http/2" protocol version in HTTP headers in WARC files #42

sebastian-nagel · 2020-10-04T20:15:37Z

340 WARC files of the news crawl data set, from 2020-09-12 until 2020-10-04, were captured using HTTP/2 after a Java security upgrade which included ALPN and therefore allowed HTTP/2. The crawler started to use HTTP/2 after an automatic restart.

The affected WARC files may cause WARC readers (e.g. jwarc) to fail while parsing the HTTP headers:

request
```
GET /2020/09/12/business/brexit-no-deal-uk-economy/index.html HTTP/2
...
```

response
```
HTTP/2 200 
```
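To illustrate why such a line can trip a parser, here is a minimal sketch in Python. It is not jwarc's actual code: the two regular expressions and the function `parse_status_line` are hypothetical, written only to contrast a strict HTTP/1.x grammar (`HTTP/` digit `.` digit) with a lenient one that also accepts a version without a minor number and an empty reason phrase.

```python
import re

# Strict form, following the HTTP/1.x grammar: "HTTP/" DIGIT "." DIGIT
STRICT_STATUS_LINE = re.compile(r"^HTTP/(\d)\.(\d) (\d{3}) (.*)$")
# Lenient form (an assumption, not any library's behavior): the minor
# version and the reason phrase are both optional
LENIENT_STATUS_LINE = re.compile(r"^HTTP/(\d)(?:\.(\d))? (\d{3})(?: (.*))?$")


def parse_status_line(line, lenient=False):
    """Return (major, minor, status, reason) or raise ValueError."""
    pattern = LENIENT_STATUS_LINE if lenient else STRICT_STATUS_LINE
    match = pattern.match(line)
    if match is None:
        raise ValueError(f"invalid HTTP status line: {line!r}")
    major, minor, status, reason = match.groups()
    # A missing minor version is reported as 0
    return int(major), int(minor or 0), int(status), reason or ""
```

With the strict pattern, `parse_status_line("HTTP/2 200 ")` raises `ValueError`, while `parse_status_line("HTTP/2 200 ", lenient=True)` accepts the line.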

To address the issue:

- for now, block usage of HTTP/2
- test which WARC parsers fail
- enable the WARC bolt to write failure-proof files when using HTTP/2 (cf. "WARC revision 1.1 (modification): support of HTTP 2.X protocol in WARC format", iipc/warc-specifications#15, and "WARC-Protocol field proposal", iipc/warc-specifications#42)
- push fixes to the WARC parser libs or rewrite the WARC files so that they're compatible
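One way the "rewrite" option could look is sketched below in Python. This is an assumption about a possible approach, not the fix that was eventually applied: the hypothetical function `downgrade_first_line` replaces the `HTTP/2` token in the first line of a captured HTTP header block with `HTTP/1.1` and leaves everything else untouched.

```python
import re

# Matches "HTTP/2" or "HTTP/2.0" as the protocol token
_HTTP2_TOKEN = rb"HTTP/2(?:\.0)?"
# Request line: method, target, then the protocol token at the end
_REQUEST_LINE = re.compile(rb"^(\S+ \S+ )" + _HTTP2_TOKEN + rb"$")
# Status line: protocol token first, then the status code and the rest
_STATUS_LINE = re.compile(rb"^" + _HTTP2_TOKEN + rb"( \d{3}.*)$")


def downgrade_first_line(http_headers: bytes) -> bytes:
    """Rewrite an HTTP/2 request or status line to HTTP/1.1.

    Only the first line is inspected; header fields and any body bytes
    that follow are returned unchanged.
    """
    first, sep, rest = http_headers.partition(b"\r\n")
    if _REQUEST_LINE.match(first):
        first = _REQUEST_LINE.sub(rb"\g<1>HTTP/1.1", first)
    elif _STATUS_LINE.match(first):
        first = _STATUS_LINE.sub(rb"HTTP/1.1\g<1>", first)
    return first + sep + rest
```

Because the rewritten line is two bytes longer, a real rewrite would also have to update the record's `Content-Length` and recompute any block digest; the sketch does not do that.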

Affected files:

s3://commoncrawl/crawl-data/CC-NEWS/2020/09/CC-NEWS-20200912083952-00000.warc.gz
...
s3://commoncrawl/crawl-data/CC-NEWS/2020/10/CC-NEWS-20201004110027-00339.warc.gz
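
For readers who want to check whether a given file falls into this range, the following Python sketch compares the timestamp embedded in the file name against the first and last names listed above. The function `is_affected` is hypothetical, and it assumes that every CC-NEWS file whose timestamp lies between the two bounds belongs to the affected set.

```python
import re
from datetime import datetime

# File name pattern as seen in the two paths above:
# CC-NEWS-<14-digit timestamp>-<5-digit serial>.warc.gz
_NAME = re.compile(r"CC-NEWS-(\d{14})-(\d{5})\.warc\.gz$")

# Bounds taken from the first and last affected file names
FIRST = datetime(2020, 9, 12, 8, 39, 52)
LAST = datetime(2020, 10, 4, 11, 0, 27)


def is_affected(path):
    """Return True if the file name's timestamp lies within the range.

    Assumption: all files between the two bounds are affected.
    """
    match = _NAME.search(path)
    if match is None:
        return False
    timestamp = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return FIRST <= timestamp <= LAST
```

Called with either of the two paths above, `is_affected` returns `True`; a name with an earlier timestamp returns `False`.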

More than 80% of the records were captured using HTTP/2.

jnioche · 2023-11-15T15:04:36Z

This will be fixed when the NewsCrawler is ported to StormCrawler 2.x. The fix is available since StormCrawler 2.7.

Signed-off-by: Julien Nioche <[email protected]>

sebastian-nagel · 2024-07-09T10:37:15Z

> The fix is available since StormCrawler 2.7.

See apache/incubator-stormcrawler#1010

sebastian-nagel mentioned this issue Oct 5, 2020

HTTP protocol implementation: allow to configure which protocol version(s) to use apache/incubator-stormcrawler#827

Closed

sebastian-nagel mentioned this issue Mar 6, 2023

invalid HTTP message at byte position 6: HTTP/2<-- HERE --> 200 iipc/jwarc#70

Closed

jnioche added a commit that referenced this issue Nov 15, 2023

Temporary workaround for #42: use http/1.1

0bfccf5

Signed-off-by: Julien Nioche <[email protected]>

sebastian-nagel mentioned this issue Jul 9, 2024

WARC writer support HTTP/2 commoncrawl/nutch#29

Closed

7 tasks

sebastian-nagel mentioned this issue Oct 23, 2024

WARCBolt support for HTTP/2 #66

Open
