-
Notifications
You must be signed in to change notification settings - Fork 20
Improve BackbeatAPI pause/resume logic #2644
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: development/9.1
Are you sure you want to change the base?
Conversation
Hello benzekrimaha,My role is to assist you with the merge of this Available options
Available commands
Status report is not available. |
5c752cc
to
8bdad47
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files
... and 3 files with indirect coverage changes
@@ Coverage Diff @@
## development/9.1 #2644 +/- ##
===================================================
+ Coverage 73.91% 74.25% +0.34%
===================================================
Files 201 201
Lines 13446 13591 +145
===================================================
+ Hits 9938 10092 +154
+ Misses 3498 3489 -9
Partials 10 10
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
Request integration branchesWaiting for integration branch creation to be requested by the user. To request integration branches, please comment on this pull request with the following command:
Alternatively, the |
lib/util/LocationStatusManager.js
Outdated
@@ -438,7 +434,9 @@ class LocationStatusManager { | |||
date.setMinutes(date.getMinutes() + 1); | |||
this._scheduledResumeJobs[service][location] = schedule.scheduleJob(date, | |||
triggerResume.bind(this)); | |||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should add a test for this case
lib/util/LocationStatusManager.js
Outdated
this._deleteResumeJob(location, service); | ||
this._locationStatusStore[location].setServiceResumeSchedule(service, null); | ||
return this._updateServiceStatusForLocation(location, next); | ||
this._deleteResumeJob(location, service); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here _deleteResumeJob()
will be applied even if we fail to update the state in mongo : leading to inconsistent state...
lib/util/LocationStatusManager.js
Outdated
} | ||
this._locationStatusStore[location].resumeLocation(service); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
setting the "in-memory" value afterwards allows to handle errors, but does not actually solve everything: there are still many race conditions here...
- First and foremost, since we set the value afterwards, changing the value quickly will not "toggle" but keep trying to do the same transition
- In addition, the state is changed asynchronously: so we may make multiple changes to mongo in parallel, and the results may not be processed in the same order...and so the (in-mem) state may not match the state in mongo (i.e. send "Pause" to mongo ; send "Resume" to mongo ; process the end of the "Resume" call ; process of the "Suspend" call → in mongo we may have state Resume, but store "suspend" in memory)
→ the way to handle this is to actually have a proper automaton, so we can detect the "conflicting" operations and deal with them (reject, retry, ignore/wait for completion...)
lib/util/LocationStatusManager.js
Outdated
this._deleteResumeJob(location, service); | ||
return this._updateServiceStatusForLocation(location, err => { | ||
if (err) { | ||
this._logger.debug('failed to delete scheduled resume', { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not tested
lib/util/LocationStatusManager.js
Outdated
if (!paused) { | ||
this._updateServiceStatusForLocation(location, err => { | ||
if (err) { | ||
this._logger.debug('failed to pause service', { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this log should be moved to _updateServiceStatusForLocation
, may be less redundant than having it after each call...
Waiting for approvalThe following approvals are needed before I can proceed with the merge:
|
c0d3dcc
to
ff585d7
Compare
ff585d7
to
c054a62
Compare
This commit fixes the callback error format in the LocationStatusManager class. It also ensures that the status is stored in memory only if it has been updated in MongoDB successfully, ensuring thus consistensy between both. Issue: BB-658
c054a62
to
a22cfb0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to review everything but due to some changes I discuss, some of my comments will likely end up irrelevant, I suggest reading the whole review and work on the "big part"s first, so you don't waste time addressing some of my comments I wrote with the current code
|
||
/** | ||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/** | |
/** |
@@ -343,7 +924,7 @@ class LocationStatusManager { | |||
return cb(null, statuses); | |||
}); | |||
} | |||
|
|||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lib/util/LocationStatusManager.js
Outdated
@@ -12,20 +12,156 @@ const actions = { | |||
resume: 'resumeService', | |||
}; | |||
|
|||
// Operation types for the state machine | |||
const OPERATION_TYPES = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const OPERATION_TYPES = { | |
const OPERATION_TYPES = Object.freeze({ |
if we want to use enum-like values without TS I suggest using this to ensure it will never be changed in any way
lib/util/LocationStatusManager.js
Outdated
} | ||
|
||
canTransitionTo(newState) { | ||
const validTransitions = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const validTransitions = { | |
const validTransitions = Object.freeze({ |
lib/util/LocationStatusManager.js
Outdated
/** | ||
* Represents a queued operation | ||
*/ | ||
class QueuedOperation { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a generic class name that should:
- Have a more specific name if we can only handle specific operations, here it seems it's related to location, so it should be in the name (maybe
ServiceOperation
?) - Be fully unit tested
- Be in a dedicated file, as we tend to avoid having multiple classes in the same file
if (operation.retryCount < operation.maxRetries) { | ||
setTimeout(() => { | ||
this._operationQueue.push(operation); | ||
}, 1000 * Math.pow(2, operation.retryCount)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have support for exponentional backoff in arsenal?
We should add a comment maybe stating the periods it gives given the number of retries we have right now, maybe it will be too big or too small
{ key, oldState: stateMachineInstance.state, newState: newSmState, | ||
oldScheduledResume: stateMachineInstance.scheduledResumeDate, | ||
newScheduledResume: newScheduledResumeDate, | ||
zkDataUsed: currentZkData }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ key, oldState: stateMachineInstance.state, newState: newSmState, | |
oldScheduledResume: stateMachineInstance.scheduledResumeDate, | |
newScheduledResume: newScheduledResumeDate, | |
zkDataUsed: currentZkData }); | |
{ | |
key, oldState: stateMachineInstance.state, | |
newState: newSmState, | |
oldScheduledResume: stateMachineInstance.scheduledResumeDate, | |
newScheduledResume: newScheduledResumeDate, | |
zkDataUsed: currentZkData, | |
}); |
lib/util/LocationStatusManager.js
Outdated
(stateMachineInstance.scheduledResumeDate ? | ||
stateMachineInstance.scheduledResumeDate.toISOString() : null) !== |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(stateMachineInstance.scheduledResumeDate ? | |
stateMachineInstance.scheduledResumeDate.toISOString() : null) !== | |
(stateMachineInstance.scheduledResumeDate?.toISOString?.() |
Also what if we get two null
(or undefined
with my suggestion) -> the condition will pass, is that what we want?
if (this._serviceConfig[service] && !this._serviceConfig[service].isMongo) { | ||
this._logger.debug('Initializing state machine for ZK-based service', { key, service, location }); | ||
try { | ||
sourceData = await this._getZkDataForLocation(service, location); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This calls ZK every time we process an operation, is that OK? I saw we increased the timeouts in tests from 1000ms to 2000ms, is this change fine if we pause/resume at high rate?
lib/util/LocationStatusManager.js
Outdated
if (transition) { | ||
transition(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (transition) { | |
transition(); | |
} | |
transition?.(); |
- A log in case there is no transition?
Issue: BB-658