Implement operations status polling in backend pod #749

Open

mbarnes wants to merge 7 commits into main from backend-operations-scan
Conversation

@mbarnes (Collaborator) commented Oct 21, 2024:

What this PR does

This adds functionality to the backend pod which updates Cosmos DB items with the status of active (non-terminal) asynchronous operations for clusters. Status polling for node pools is not currently possible; this part will be completed when the new /api/aro_hcp/v1 OCM endpoint becomes available.

This runs on two independent polling intervals, each of which can be overridden through environment variables.

The first polling, which by default runs every 30s, scans the Cosmos DB "Operations" container for items with a non-terminal status and stores the results in an internal list.

The second polling, which by default runs every 10s, iterates over that list and queries Cluster Service for the status of each resource. It then translates the Cluster Service status to a ProvisioningState value, captures an error message for failed operations, and, with the subscription locked, updates the appropriate "Operations" item and "Resources" item in Cosmos DB.

Additionally, if a deletion operation has completed successfully, it deletes the corresponding "Resources" item from Cosmos DB.
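
Roughly, the backend's main loop has the following shape. This is only a sketch to illustrate the two independent intervals: the environment variable names, the Run/collectActiveOperations/pollActiveOperations method names, and the durationFromEnv helper are illustrative rather than the PR's actual code; only the OperationsScanner name and the 30s/10s defaults appear in this PR.

package backend

import (
    "context"
    "os"
    "time"
)

// Hypothetical stub; the real OperationsScanner in this PR also holds the
// Cosmos DB client, the Cluster Service client, and the cached operation list.
type OperationsScanner struct{}

// collectActiveOperations would scan the "Operations" container for
// non-terminal items and refresh the internal list (first loop, ~30s).
func (s *OperationsScanner) collectActiveOperations(ctx context.Context) {}

// pollActiveOperations would query Cluster Service for each cached operation,
// translate its status to a ProvisioningState, and, under the subscription
// lock, update the "Operations" and "Resources" items (second loop, ~10s).
func (s *OperationsScanner) pollActiveOperations(ctx context.Context) {}

func (s *OperationsScanner) Run(ctx context.Context) {
    // Environment variable names here are illustrative.
    collectInterval := durationFromEnv("BACKEND_COLLECT_INTERVAL", 30*time.Second)
    pollInterval := durationFromEnv("BACKEND_POLL_INTERVAL", 10*time.Second)

    collectTicker := time.NewTicker(collectInterval)
    defer collectTicker.Stop()
    pollTicker := time.NewTicker(pollInterval)
    defer pollTicker.Stop()

    for {
        select {
        case <-ctx.Done():
            return
        case <-collectTicker.C:
            s.collectActiveOperations(ctx)
        case <-pollTicker.C:
            s.pollActiveOperations(ctx)
        }
    }
}

// durationFromEnv returns def unless the named environment variable holds a
// parseable duration such as "45s".
func durationFromEnv(name string, def time.Duration) time.Duration {
    if v := os.Getenv(name); v != "" {
        if d, err := time.ParseDuration(v); err == nil {
            return d
        }
    }
    return def
}

With a structure like this, overriding either interval is just a matter of setting the corresponding environment variable on the backend pod.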

Jira: ARO-8598 - Prototype async op status notification mechanism for RP
Link to demo recording:

Special notes for your reviewer

For the record, I'm not a fan of this design and consider it a "Mark I" iteration. Polling introduces a lot of latency in the statuses reported by the Microsoft.RedHatOpenShift/hcpOpenShiftClusters API. But we're stuck with polling for now for two reasons:

  1. Microsoft's own Go SDK for Cosmos is minimally functional and lacks any support for change feeds. So notification of changes to Cosmos containers through the SDK is currently not possible, meaning the backend cannot respond immediately to new operations. (There are other ways to address this, but the team agreed to just poll initially for simplicity's sake.)
  2. Cluster Service today lacks any kind of publish-subscribe mechanism and is, by design, unaware of the ARO-HCP RP. This is something that could more easily be addressed if the recently posted ADR is approved. For example, a dedicated Cluster Service backend for ARO-HCP could update Cosmos DB directly and eliminate the need for polling altogether.

@SudoBrendan (Collaborator) left a comment:

At a glance, this appears to do what we need. This feels like code that should get some tests because it is foundational; can we add them to solidify our requirements as code? E.g. happy paths, error paths, "no more than one call is made to Items since it is not concurrency safe", etc.

Comment on lines +32 to +53
func (iter operationCacheIterator) Items(ctx context.Context) iter.Seq[[]byte] {
    return func(yield func([]byte) bool) {
        for _, doc := range iter.operation {
            // Marshalling the document struct only to immediately unmarshal
            // it back to a document struct is a little silly but this is to
            // conform to the DBClientIterator interface.
            item, err := json.Marshal(doc)
            if err != nil {
                iter.err = err
                return
            }

            if !yield(item) {
                return
            }
        }
    }
}

func (iter operationCacheIterator) GetError() error {
    return iter.err
}

@SudoBrendan (Collaborator) commented:

This doesn't look concurrency safe; I'm not sure whether that's a requirement. Should we at least log err as we set it? How would we know which call to Items failed?

@mbarnes (Collaborator, Author) replied Oct 23, 2024:

Iterators are short-lived. They're not meant to be concurrency safe; they're meant to be used as part of a range-over-function pattern, which is new in Go 1.23. (Think generator functions in Python, if you're familiar with that.)

The pattern as described in the blog post doesn't address the possibility of iteration failing (such as when paging over results from a remote source), so I came up with this iterator interface as a way to stash an iterator error.

The idiom looks like:

for item := range iterator.Items(ctx) {
    ...
}

// Find out if iteration completed or aborted early.
err := iterator.GetError()

An iterator error would immediately break the "for" loop.
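
In other words, the iterator interface has roughly this shape. This is a sketch assuming the standard "context" package and the Go 1.23 "iter" package; the exact definition in this PR may differ.

// Sketch of the DBClientIterator interface described above.
type DBClientIterator interface {
    // Items yields raw JSON items until the sequence is exhausted, the
    // caller breaks out of the range loop, or an error occurs.
    Items(ctx context.Context) iter.Seq[[]byte]

    // GetError reports whether the most recent iteration aborted early.
    GetError() error
}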

@mbarnes (Collaborator, Author) added:

Also maybe worth mentioning, since I didn't realize you were highlighting the in-memory cache...

I don't believe this particular function is currently used. I wrote it for the in-memory cache in order to fulfill the DBClient interface, but the in-memory cache is nowadays only used to mock database operations in unit tests.

@mbarnes (Collaborator, Author) commented Oct 23, 2024:

> This feels like code that should get some tests because it is foundational - can we add them to solidify our requirements as code?

Are you referring to the new database functions, the backend code paths... or both? The backend might be difficult to unit test right now since parts of it rely on database locking (see #680), which I don't yet have a way to mock. (Mocking should be doable; it's just not done.)

@mbarnes force-pushed the backend-operations-scan branch 4 times, most recently from a8d253e to 8e280fe on October 28, 2024 at 12:09.

@SudoBrendan (Collaborator) left a comment:

Sorry, I still haven't had time for a complete review here; a few more comments for now. I will take time early next week for something comprehensive. Thank you!

backend/operations_scanner.go: outdated comment thread (resolved)

Matthew Barnes added 6 commits October 29, 2024 21:45

  - Converts the result of azcosmos.ContainerClient.NewQueryItemsPager to a failable push iterator (push iterators are new in Go 1.23). This also defines a DBClientIterator interface so the in-memory cache can mimic QueryItemsIterator.
  - Will add something similar for node pools once the new "aro_hcp" API is available from Cluster Service.
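
For context, wrapping the azcosmos pager in such an iterator might look roughly like the sketch below. This is not the PR's code: the queryItemsIterator name, fields, and package name are illustrative, while runtime.Pager, ContainerClient.NewQueryItemsPager, and QueryItemsResponse come from the Azure SDK for Go.

package database

import (
    "context"
    "iter"

    "github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime"
    "github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos"
)

// queryItemsIterator wraps the pager returned by
// azcosmos.ContainerClient.NewQueryItemsPager in a failable push iterator.
// Names are illustrative; the PR's QueryItemsIterator may differ.
type queryItemsIterator struct {
    pager *runtime.Pager[azcosmos.QueryItemsResponse]
    err   error
}

func (i *queryItemsIterator) Items(ctx context.Context) iter.Seq[[]byte] {
    return func(yield func([]byte) bool) {
        for i.pager.More() {
            page, err := i.pager.NextPage(ctx)
            if err != nil {
                // Stash the error; the caller checks GetError after ranging.
                i.err = err
                return
            }
            for _, item := range page.Items {
                if !yield(item) {
                    return
                }
            }
        }
    }
}

func (i *queryItemsIterator) GetError() error {
    return i.err
}
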
@mbarnes force-pushed the backend-operations-scan branch 2 times, most recently from f4c965c to 36f0c43 on November 1, 2024 at 16:05.

@mbarnes (Collaborator, Author) commented Nov 1, 2024:

@SudoBrendan Added unit tests to the backend in the "Add OperationsScanner" commit.

We don't currently have subscription locking mocked because it's very Cosmos DB-centric, so these unit tests only cover the backend functions that run while a subscription lock is already held (the lock acquisition itself is not exercised). But that still covers what I consider to be the core logic of the backend.

Periodically scans the "Operations" Cosmos DB container.
// to the values given, updates the LastTransitionTime, and returns true. This
// is intended to be used with DBClient.UpdateOperationDoc.
func (doc *OperationDocument) UpdateStatus(status arm.ProvisioningState, err *arm.CloudErrorBody) bool {
    if doc.Status != status {

@SudoBrendan (Collaborator) commented:

Is there ever a case where the status stays the same but err changes? Is that worth adding to this conditional?

@mbarnes (Collaborator, Author) replied:

I can't say for certain one way or another, but currently the only time we include error information in the operation status is when the provisioning state is "Failed", and I wouldn't expect the error code/message returned from Cluster Service to keep changing once a failure state is reached.
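
Concretely, if the method follows the shape suggested by the excerpt above, a changed err with an unchanged status would not be recorded. The following is only a plausible reading of that excerpt, not necessarily the PR's exact implementation: the Error field name is an assumption, and it assumes the excerpt's arm package plus the standard time package.

// UpdateStatus sets Status and Error to the values given, updates the
// LastTransitionTime, and returns true, but only when the status actually
// changes. Sketch only; field names other than Status are assumptions.
func (doc *OperationDocument) UpdateStatus(status arm.ProvisioningState, err *arm.CloudErrorBody) bool {
    if doc.Status != status {
        doc.LastTransitionTime = time.Now().UTC()
        doc.Status = status
        doc.Error = err
        return true
    }
    return false
}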
