Implement operations status polling in backend pod #749

Open

mbarnes wants to merge 7 commits into main from backend-operations-scan
Conversation

@mbarnes (Collaborator) commented Oct 21, 2024:

What this PR does

This adds functionality to the backend pod which updates Cosmos DB items with the status of active (non-terminal) asynchronous operations for clusters. Status polling for node pools is not currently possible; this part will be completed when the new /api/aro_hcp/v1 OCM endpoint becomes available.

This runs on two independent polling intervals, each of which can be overridden through environment variables.

The first polling, which by default runs every 30s, scans the Cosmos DB "Operations" container for items with a non-terminal status and stores the results in an internal list.

The second polling, which by default runs every 10s, iterates over that list and queries Cluster Service for the status of each resource. It then translates the Cluster Service status to a ProvisioningState value, captures an error message for failed operations, and, with the subscription locked, updates the appropriate "Operations" item and "Resources" item in Cosmos DB.

Additionally, if a deletion operation has completed successfully, it deletes the corresponding "Resources" item from Cosmos DB.
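
Roughly, the backend's main loop has the following shape. This is only a sketch to illustrate the two independent intervals: the environment variable names, the Run/collectActiveOperations/pollActiveOperations method names, and the durationFromEnv helper are illustrative rather than the PR's actual code; only the OperationsScanner name and the 30s/10s defaults appear in this PR.

package backend

import (
    "context"
    "os"
    "time"
)

// Hypothetical stub; the real OperationsScanner in this PR also holds the
// Cosmos DB client, the Cluster Service client, and the cached operation list.
type OperationsScanner struct{}

// collectActiveOperations would scan the "Operations" container for
// non-terminal items and refresh the internal list (first loop, ~30s).
func (s *OperationsScanner) collectActiveOperations(ctx context.Context) {}

// pollActiveOperations would query Cluster Service for each cached operation,
// translate its status to a ProvisioningState, and, under the subscription
// lock, update the "Operations" and "Resources" items (second loop, ~10s).
func (s *OperationsScanner) pollActiveOperations(ctx context.Context) {}

func (s *OperationsScanner) Run(ctx context.Context) {
    // Environment variable names here are illustrative.
    collectInterval := durationFromEnv("BACKEND_COLLECT_INTERVAL", 30*time.Second)
    pollInterval := durationFromEnv("BACKEND_POLL_INTERVAL", 10*time.Second)

    collectTicker := time.NewTicker(collectInterval)
    defer collectTicker.Stop()
    pollTicker := time.NewTicker(pollInterval)
    defer pollTicker.Stop()

    for {
        select {
        case <-ctx.Done():
            return
        case <-collectTicker.C:
            s.collectActiveOperations(ctx)
        case <-pollTicker.C:
            s.pollActiveOperations(ctx)
        }
    }
}

// durationFromEnv returns def unless the named environment variable holds a
// parseable duration such as "45s".
func durationFromEnv(name string, def time.Duration) time.Duration {
    if v := os.Getenv(name); v != "" {
        if d, err := time.ParseDuration(v); err == nil {
            return d
        }
    }
    return def
}

With a structure like this, overriding either interval is just a matter of setting the corresponding environment variable on the backend pod.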

Jira: ARO-8598 - Prototype async op status notification mechanism for RP
Link to demo recording:

Special notes for your reviewer

For the record, I'm not a fan of this design and consider it a "Mark I" iteration. Polling introduces a lot of latency in the statuses reported by the Microsoft.RedHatOpenShift/hcpOpenShiftClusters API. But we're stuck with polling for now for two reasons:

  1. Microsoft's own Go SDK for Cosmos is minimally functional and lacks any support for change feeds. So notification of changes to Cosmos containers through the SDK is currently not possible, meaning the backend cannot respond immediately to new operations. (There are other ways to address this, but the team agreed to just poll initially for simplicity's sake.)
  2. Cluster Service today lacks any kind of publish-subscribe mechanism and is, by design, unaware of the ARO-HCP RP. This is something that could more easily be addressed if the recently posted ADR is approved. For example, a dedicated Cluster Service backend for ARO-HCP could update Cosmos DB directly and eliminate the need for polling altogether.

@SudoBrendan (Collaborator) left a comment:

At a glance, this appears to do what we need. This feels like code that should get some tests because it is foundational; can we add them to solidify our requirements as code? E.g. happy paths, error paths, "no more than one call is made to Items since it is not concurrency safe", etc.

Comment on lines +32 to +53
func (iter operationCacheIterator) Items(ctx context.Context) iter.Seq[[]byte] {
    return func(yield func([]byte) bool) {
        for _, doc := range iter.operation {
            // Marshalling the document struct only to immediately unmarshal
            // it back to a document struct is a little silly but this is to
            // conform to the DBClientIterator interface.
            item, err := json.Marshal(doc)
            if err != nil {
                iter.err = err
                return
            }

            if !yield(item) {
                return
            }
        }
    }
}

func (iter operationCacheIterator) GetError() error {
    return iter.err
}

@SudoBrendan (Collaborator) commented:

This doesn't look concurrency safe; I'm not sure whether that's a requirement. Should we at least log err as we set it? How would we know which call to Items failed?

@mbarnes (Collaborator, Author) replied Oct 23, 2024:

Iterators are short-lived. They're not meant to be concurrency safe; they're meant to be used as part of a range-over-function pattern, which is new in Go 1.23. (Think generator functions in Python, if you're familiar with that.)

The pattern as described in the blog post doesn't address the possibility of iteration failing (such as when paging over results from a remote source), so I came up with this iterator interface as a way to stash an iterator error.

The idiom looks like:

for item := range iterator.Items(ctx) {
    ...
}

// Find out if iteration completed or aborted early.
err := iterator.GetError()

An iterator error would immediately break the "for" loop.
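
In other words, the iterator interface has roughly this shape. This is a sketch assuming the standard "context" package and the Go 1.23 "iter" package; the exact definition in this PR may differ.

// Sketch of the DBClientIterator interface described above.
type DBClientIterator interface {
    // Items yields raw JSON items until the sequence is exhausted, the
    // caller breaks out of the range loop, or an error occurs.
    Items(ctx context.Context) iter.Seq[[]byte]

    // GetError reports whether the most recent iteration aborted early.
    GetError() error
}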

@mbarnes (Collaborator, Author) added:

Also maybe worth mentioning, since I didn't realize you were highlighting the in-memory cache...

I don't believe this particular function is currently used. I wrote it for the in-memory cache in order to fulfill the DBClient interface, but the in-memory cache is nowadays only used to mock database operations in unit tests.

@mbarnes (Collaborator, Author) commented Oct 23, 2024:

> This feels like code that should get some tests because it is foundational - can we add them to solidify our requirements as code?

Are you referring to the new database functions, the backend code paths... or both? The backend might be difficult to unit test right now since parts of it rely on database locking (see #680), which I don't yet have a way to mock. (Mocking should be doable; it's just not done.)

@mbarnes force-pushed the backend-operations-scan branch 4 times, most recently from a8d253e to 8e280fe on October 28, 2024 at 12:09.

@SudoBrendan (Collaborator) left a comment:

Sorry, I still haven't had time for a complete review here; a few more comments for now. I will take time early next week for something comprehensive. Thank you!

backend/operations_scanner.go: outdated comment thread (resolved)

Matthew Barnes added 6 commits October 29, 2024 21:45

  - Converts the result of azcosmos.ContainerClient.NewQueryItemsPager to a failable push iterator (push iterators are new in Go 1.23). This also defines a DBClientIterator interface so the in-memory cache can mimic QueryItemsIterator.
  - Will add something similar for node pools once the new "aro_hcp" API is available from Cluster Service.
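
For context, wrapping the azcosmos pager in such an iterator might look roughly like the sketch below. This is not the PR's code: the queryItemsIterator name, fields, and package name are illustrative, while runtime.Pager, ContainerClient.NewQueryItemsPager, and QueryItemsResponse come from the Azure SDK for Go.

package database

import (
    "context"
    "iter"

    "github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime"
    "github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos"
)

// queryItemsIterator wraps the pager returned by
// azcosmos.ContainerClient.NewQueryItemsPager in a failable push iterator.
// Names are illustrative; the PR's QueryItemsIterator may differ.
type queryItemsIterator struct {
    pager *runtime.Pager[azcosmos.QueryItemsResponse]
    err   error
}

func (i *queryItemsIterator) Items(ctx context.Context) iter.Seq[[]byte] {
    return func(yield func([]byte) bool) {
        for i.pager.More() {
            page, err := i.pager.NextPage(ctx)
            if err != nil {
                // Stash the error; the caller checks GetError after ranging.
                i.err = err
                return
            }
            for _, item := range page.Items {
                if !yield(item) {
                    return
                }
            }
        }
    }
}

func (i *queryItemsIterator) GetError() error {
    return i.err
}
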
@mbarnes force-pushed the backend-operations-scan branch 2 times, most recently from f4c965c to 36f0c43 on November 1, 2024 at 16:05.

@mbarnes (Collaborator, Author) commented Nov 1, 2024:

@SudoBrendan Added unit tests to the backend in the "Add OperationsScanner" commit.

We don't currently have subscription locking mocked because it's very Cosmos DB-centric, so these unit tests only cover the backend functions that run while a subscription lock is already held (the lock acquisition itself is not exercised). But that still covers what I consider to be the core logic of the backend.

Periodically scans the "Operations" Cosmos DB container.
// to the values given, updates the LastTransitionTime, and returns true. This
// is intended to be used with DBClient.UpdateOperationDoc.
func (doc *OperationDocument) UpdateStatus(status arm.ProvisioningState, err *arm.CloudErrorBody) bool {
    if doc.Status != status {

@SudoBrendan (Collaborator) commented:

Is there ever a case where the status stays the same but err changes? Is that worth adding to this conditional?

@mbarnes (Collaborator, Author) replied:

I can't say for certain one way or another, but currently the only time we include error information in the operation status is when the provisioning state is "Failed", and I wouldn't expect the error code/message returned from Cluster Service to keep changing once a failure state is reached.
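
Concretely, if the method follows the shape suggested by the excerpt above, a changed err with an unchanged status would not be recorded. The following is only a plausible reading of that excerpt, not necessarily the PR's exact implementation: the Error field name is an assumption, and it assumes the excerpt's arm package plus the standard time package.

// UpdateStatus sets Status and Error to the values given, updates the
// LastTransitionTime, and returns true, but only when the status actually
// changes. Sketch only; field names other than Status are assumptions.
func (doc *OperationDocument) UpdateStatus(status arm.ProvisioningState, err *arm.CloudErrorBody) bool {
    if doc.Status != status {
        doc.LastTransitionTime = time.Now().UTC()
        doc.Status = status
        doc.Error = err
        return true
    }
    return false
}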
