Reactive prewarm pool #4725

Open
tysonnorris opened this issue Nov 14, 2019 · 3 comments


@tysonnorris
Contributor

Feature suggestion: instead of defining the prewarm config statically at deployment, make the pool behavior reactive to load. This may be more applicable to cluster-managed resources (mesos/k8s/yarn), where each invoker is not restricted to local resources for launching actions.

Since reliance on prewarm containers is a key point in improving performance, we should consider ways to route as many would-be "cold start" containers as possible through the "prewarm" workflow.

As an example, take the current implementation, where a fixed number of prewarms is launched at start and the prewarm pool is replenished each time one is taken for use:

  • start with 10 prewarms
  • load of 15 actions will use 10 prewarms + 5 cold containers
  • 10 prewarms are replaced

If this happens once for a burst of traffic it may be an anomaly, but if it is a pattern every few minutes, we will often be running at a deficit of prewarms.
It would be nice to allow operators to define rules around prewarms like:

nodejs:10-256MB
  period - 1 minute
  threshold - 4
  miss-count - 2
  prewarm-increment - 3

With this rule, if there are 2 consecutive 1-minute periods in which the number of prewarm "misses" (i.e. cold starts) for 256MB nodejs:10 activations exceeds 4, then 3 additional prewarms are added to the system.
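
A minimal sketch of how such a rule could be evaluated, not taken from the OpenWhisk code base. The names `ReactiveRule`, `MissTracker`, `record_miss` and `end_period` are hypothetical, and the sketch assumes some external timer calls `end_period()` once per configured period (1 minute in the example):

```python
from dataclasses import dataclass


@dataclass
class ReactiveRule:
    """Hypothetical rule mirroring the fields in the example above."""
    threshold: int          # misses per period that mark the period as "bad"
    miss_count: int         # consecutive bad periods required to trigger
    prewarm_increment: int  # prewarms to add when the rule triggers


@dataclass
class MissTracker:
    rule: ReactiveRule
    current_misses: int = 0  # cold starts seen in the period in progress
    bad_periods: int = 0     # consecutive periods that exceeded the threshold

    def record_miss(self) -> None:
        """Call once per prewarm miss (cold start) for this config."""
        self.current_misses += 1

    def end_period(self) -> int:
        """Close the current period and return how many prewarms to add."""
        if self.current_misses > self.rule.threshold:
            self.bad_periods += 1
        else:
            self.bad_periods = 0  # the streak is broken
        self.current_misses = 0
        if self.bad_periods >= self.rule.miss_count:
            self.bad_periods = 0  # assumption: start a fresh streak after scaling
            return self.rule.prewarm_increment
        return 0
```

With `ReactiveRule(threshold=4, miss_count=2, prewarm_increment=3)`, two consecutive periods of 5 misses each make the second `end_period()` call return 3, while a period with exactly 4 misses resets the streak.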

We would also need some form of "prewarm idle release" process, similar to the existing idle timeout, to reduce the number of prewarms once they sit unused.

@style95
Member

style95 commented Nov 15, 2019

It reminds me of this:
#4225 (comment)

Since it is related to "estimation", it is highly likely related to machine learning.
And we might want to delegate the estimation process to something out of the box in the future.

@tysonnorris
Contributor Author

Yes, agreed, this is a similar idea. The differences, I would say, are:

  • currently there is no "admin API" at the controller that exposes operator-specific APIs for adjusting configs after deployment - I think this type of API is required for exposing these controls externally; for now I would avoid this by keeping the logic internal to the invoker
  • while adding a more sophisticated prediction approach would be great, the reactive approach specifically does not make guesses at all; it only reacts to data from the past, more like a health check - you would (typically) not attempt to predict whether a health check will fail, but you do want precise behavior when a series of health check failures occurs. With prewarms, this would simply trigger loading of additional prewarms once some "prewarm misses" have occurred. And if we allow loading of additional prewarms, we also need some form of "prewarm idle timeout", since we don't want prewarms occupying excess resources when they are not getting used.

@tysonnorris
Contributor Author

In a simple initial version this might be something like:

  • add a startTime and (configurable) TTL to prewarm containers in ContainerPool
  • add a scheduled task to periodically delete unused prewarms that have passed TTL expiration
  • add a configurable "prewarm scaleup step": if a cold start container matches a prewarm config AND no prewarm was available AND the prewarm count for that config is below the runtime manifest config, then launch up to "prewarm scaleup step" prewarms to replenish the pool in case load has increased
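
The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual ContainerPool code: `Prewarm`, `PrewarmPool`, `launch`, `take` and `expire` are hypothetical names, the pool covers a single prewarm config, and the existing replenish-on-use behavior is left out to keep the sketch short.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Prewarm:
    start_time: float  # when this prewarm container was launched


@dataclass
class PrewarmPool:
    """Hypothetical pool for one prewarm config (e.g. nodejs:10, 256MB)."""
    max_count: int      # cap taken from the runtime manifest config
    ttl_seconds: float  # configurable TTL for unused prewarms
    scaleup_step: int   # the "prewarm scaleup step"
    clock: Callable[[], float] = time.monotonic  # injectable for testing
    prewarms: List[Prewarm] = field(default_factory=list)

    def launch(self, n: int) -> int:
        """Launch up to n prewarms without exceeding max_count."""
        n = max(0, min(n, self.max_count - len(self.prewarms)))
        now = self.clock()
        self.prewarms.extend(Prewarm(now) for _ in range(n))
        return n

    def take(self) -> Optional[Prewarm]:
        """Hand out a prewarm; on a miss, scale up and return None."""
        if self.prewarms:
            return self.prewarms.pop(0)
        # No prewarm available: the caller cold-starts this activation,
        # and we launch up to scaleup_step prewarms for the next ones.
        self.launch(self.scaleup_step)
        return None

    def expire(self) -> int:
        """Scheduled task: drop unused prewarms that have passed their TTL."""
        now = self.clock()
        before = len(self.prewarms)
        self.prewarms = [
            p for p in self.prewarms if now - p.start_time < self.ttl_seconds
        ]
        return before - len(self.prewarms)
```

In this sketch a scheduler would call `expire()` periodically, and `take()` returning `None` is the signal that the activation was a prewarm miss.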

In addition, startup behavior may be worth changing as well (to avoid starting a bunch of prewarms just to delete them on TTL), by changing the manifest to use "initCount" and "maxCount" instead of "count": "initCount" sets the number of prewarms at invoker startup, and "maxCount" caps the number of prewarm containers allowed over time.
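
As a rough sketch of how an invoker might read such an entry while still accepting the existing "count" form: the helper `prewarm_bounds` is hypothetical, the entry is assumed to be an already-parsed dictionary, and the fallback defaults are assumptions rather than anything decided in this thread.

```python
from typing import Tuple


def prewarm_bounds(entry: dict) -> Tuple[int, int]:
    """Return (init_count, max_count) for one prewarm config entry."""
    if "initCount" in entry or "maxCount" in entry:
        init = entry.get("initCount", 0)
        # Assumption: without an explicit cap, the pool never grows past init.
        cap = entry.get("maxCount", init)
    else:
        # Existing form: a fixed pool, so startup size and cap coincide.
        init = cap = entry["count"]
    if not 0 <= init <= cap:
        raise ValueError(f"invalid prewarm bounds: init={init}, max={cap}")
    return init, cap
```

An entry with only "count" keeps today's fixed-size behavior, while `{"initCount": 1, "maxCount": 4}` would start one prewarm and allow reactive growth up to four.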
