Do not use "http/2" protocol version in HTTP headers in WARC files #42

Open
sebastian-nagel opened this issue Oct 4, 2020 · 2 comments


@sebastian-nagel
Collaborator

340 WARC files of the news crawl data set, from 2020-09-12 through 2020-10-04, were captured using HTTP/2. A Java security upgrade included ALPN and therefore made HTTP/2 available, and the crawler started to use HTTP/2 after an automatic restart.

These WARC files may cause WARC readers (e.g. jwarc) to fail while parsing the HTTP headers:

  • request
    GET /2020/09/12/business/brexit-no-deal-uk-economy/index.html HTTP/2
    ...
    
  • response
    HTTP/2 200 
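
Until the files are fixed, a reader-side workaround could rewrite the start line before handing the HTTP block to a parser that only accepts HTTP/1.x. The following is a minimal sketch, not the fix applied in the crawler: the function names are hypothetical, and it assumes the HTTP block is available as raw bytes and that only the first line differs from HTTP/1.1 syntax.

```python
import re

# Request line such as b"GET /path HTTP/2" (line ending already stripped).
_REQUEST_LINE = re.compile(rb"^([A-Z]+ \S+ )HTTP/2(?:\.0)?[ \t]*$")
# Status line such as b"HTTP/2 200 " (the reason phrase may be empty).
_STATUS_LINE = re.compile(rb"^HTTP/2(?:\.0)?( \d{3}.*)$")


def normalize_start_line(line: bytes) -> bytes:
    """Rewrite an HTTP/2 start line to HTTP/1.1; return other lines unchanged."""
    stripped = line.rstrip(b"\r\n")
    eol = line[len(stripped):]  # preserve the original line ending
    m = _REQUEST_LINE.match(stripped)
    if m:
        return m.group(1) + b"HTTP/1.1" + eol
    m = _STATUS_LINE.match(stripped)
    if m:
        return b"HTTP/1.1" + m.group(1) + eol
    return line


def normalize_http_block(block: bytes) -> bytes:
    """Normalize only the first line of an HTTP header block (hypothetical helper)."""
    first, sep, rest = block.partition(b"\n")
    return normalize_start_line(first + sep) + rest
```

Note that the rewritten block is two bytes longer than the original, so the WARC record's `Content-Length` and any block digest no longer match. The sketch is suitable for in-memory parsing only, not for rewriting the files in place.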
    

To address the issue:

Affected files:

s3://commoncrawl/crawl-data/CC-NEWS/2020/09/CC-NEWS-20200912083952-00000.warc.gz
...
s3://commoncrawl/crawl-data/CC-NEWS/2020/10/CC-NEWS-20201004110027-00339.warc.gz

More than 80% of the records in these files were captured using HTTP/2.
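
To decide whether a given CC-NEWS file falls into the affected range, the capture timestamp in the file name can be compared against the first and last file names listed above. This is a rough sketch under the assumption that every file whose timestamp lies between those two bounds is affected; `is_affected` is a hypothetical helper name.

```python
import re

# Timestamps of the first and last affected files, taken from the listing above.
FIRST_AFFECTED = "20200912083952"
LAST_AFFECTED = "20201004110027"

# CC-NEWS-<14-digit timestamp>-<5-digit serial>.warc.gz
_NAME = re.compile(r"CC-NEWS-(\d{14})-(\d{5})\.warc\.gz$")


def is_affected(path: str) -> bool:
    """Return True if the file name's timestamp lies within the affected range."""
    m = _NAME.search(path)
    if m is None:
        return False
    # Fixed-width digit strings compare chronologically as plain strings.
    return FIRST_AFFECTED <= m.group(1) <= LAST_AFFECTED
```

Readers that cannot handle HTTP/2 start lines could use such a check to skip these files or route them through a more lenient parsing path.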

@jnioche
Contributor

jnioche commented Nov 15, 2023

This will be fixed when the NewsCrawler is ported to StormCrawler 2.x. The fix is available since StormCrawler 2.7.

jnioche added a commit that referenced this issue Nov 15, 2023
@sebastian-nagel
Collaborator Author

> The fix is available since StormCrawler 2.7.

See apache/incubator-stormcrawler#1010
