-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Try to avoid intermittent errors when checking links #86
Try to avoid intermittent errors when checking links #86
Conversation
It's a global tranpsort and could be changed by any dependency. Signed-off-by: Douglas Camata <[email protected]>
This is an effort to avoid random errors when checking a bunch of links across many files: - Considering this is a tool often ran in resource constrained environments, like Github Actions, the HTTP client timeout was bumped to 30 seconds. - Colly was configured with a depth of 1, to avoid crawling too much information from the websites - The parallelism configuration of `colly` has been brought down from 100 to 10. This setting means that at most 10 requests will be sent in parallel for the same matching domain. This way we are more friendly to the servers and hopefully they will be happier with us too, returning more responses with 200 status code. - Added a random delay of up to 1 second to create new requests to matching domains, again to be more friendly with the servers, etc, etc. Signed-off-by: Douglas Camata <[email protected]>
`DialContext` is the new way to go. Signed-off-by: Douglas Camata <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, some comments - it's though tuning play we have to until we do proper caching with ~hours/days jitter as we have planned in #54
Thanks for helping here! 💪🏽
@@ -241,7 +257,8 @@ func NewValidator(ctx context.Context, logger log.Logger, linksValidateConfig [] | |||
} | |||
if err := v.c.Limit(&colly.LimitRule{ | |||
DomainGlob: "*", | |||
Parallelism: 100, | |||
Parallelism: 10, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parallelism configuration of colly has been brought down from 100
to 10. This setting means that at most 10 requests will be sent in
parallel for the matching domains, instead of 100. This way we are
more friendly to the servers and hopefully they will be happier with us too,
returning more responses with 200 status code faster.
Hope based development 🙈
I think this makes sense generally, but I wonder if it makes sense to perhaps expose those options so we can tune it to every environment. Right now we are bit guessing with programmatic change that takes time to modify and release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can only hope when it comes to checking third-party links (on remote servers we don't own) from third-party servers (Github Actions runners), sending many concurrent requests . 😂
I'll try to make some important parameters configurable via the command line 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better, I am moving them to the config file, like the timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some parameters in the config file via 23f1dec.
@@ -241,7 +257,8 @@ func NewValidator(ctx context.Context, logger log.Logger, linksValidateConfig [] | |||
} | |||
if err := v.c.Limit(&colly.LimitRule{ | |||
DomainGlob: "*", | |||
Parallelism: 100, | |||
Parallelism: 10, | |||
RandomDelay: 1 * time.Second, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might take forever to validate things 🤔 Again, useful to expose an option?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't always take forever to validate things. The delays are a random number between 0 and 1s, not always 1s. I have some runs that are just 1 minute slower than without the random delay, but again, it's hard to measure and say with confidence why a run was slow (was my network slower this time? was somebody else's server slower? which one? how slow?).
collector.SetClient(&http.Client{ | ||
Timeout: 30 * time.Second, | ||
}) | ||
transport := &http.Transport{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, but also before tuning we should probably expose the options we want to tune to outside of binary. The problem is that running it locally almost never reproduces what's will happen on GH actions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which configuration from the transport do you think would be valuable to expose as configuration, @bwplotka? I think the MaxIdleConns
could be useful. But the other ones all seem like too much/too deep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I don't think tuning MaxIdleConns
to different values will be useful. But I found out MaxConnsPerHost
(by default 2), which can be useful. For example, Colly might be trying to concurrently send 10 requests to Github, but the transport will only hold 2 connections to Github.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 23f1dec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for taking this up! 💪🏻
Some tiny comments.
linktransformerMetrics := newLinktransformerMetrics(reg) | ||
collector := colly.NewCollector(colly.Async(), colly.StdlibContext(ctx), colly.MaxDepth(1)) | ||
collector.SetClient(&http.Client{ | ||
Timeout: 30 * time.Second, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@saswatamcode if the user doesn't provide a timeout, we would end up relying on the default timeout of Colly: 10 seconds. Are we fine with that? Or do we prefer to have a higher default timeout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in a02848d
Signed-off-by: Douglas Camata <[email protected]>
These options are: * Colly's HTTP parallelism * Colly's random delay between HTTP requests * HTTP transport's max connections per host
Signed-off-by: Douglas Camata <[email protected]>
Signed-off-by: Douglas Camata <[email protected]>
I made some parameters configurable via the validate config file and updated the example in the README to include them and the |
Signed-off-by: Douglas Camata <[email protected]>
Signed-off-by: Douglas Camata <[email protected]>
Bad editor! Signed-off-by: Douglas Camata <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! 🎉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing, thanks!
Related to #22.
Stop relying on the
http.DefaultHTTPTransport
: it's effectively global state. It can bemodified by any other package. It has different default configuration depending on the
Go compiler version and we cannot trust that these defaults are good for us.
Considering this is a tool often ran in resource constrained
environments, like Github Actions, the HTTP client timeout was bumped
to 30 seconds.
Colly was configured with a depth of 1, to avoid crawling too much
information from the websites
The parallelism configuration of
colly
has been brought down from 100to 10. This setting means that at most 10 requests will be sent in
parallel for the matching domains, instead of 100. This way we are
more friendly to the servers and hopefully they will be happier with us too,
returning more responses with 200 status code faster.
Added a random delay of up to 1 second to create new requests to
matching domains, again to be more friendly with the servers, etc, etc.
How did I test this?
I compiled a binary and ran the
make check-docs
target of the https://github.com/thanos-io/thanos project a couple of times without my changes and with them.Without the changes it's easy to reproduce the failures indicated by #22.
With the changes, I couldn't reproduce them.
How does it affect run time?
Locals runs were taking:
There are a lot of variables involved though, it isn't a reliable measurement.
I think I could be able to get rid of the random delay to make this faster while still avoiding the intermittent failures, if that's desirable. WDYT, @bwplotka?