Skip to content

feat(persistence): update replica health condition as atomic operation #1879

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

dsharma-dc
Copy link
Contributor

@dsharma-dc dsharma-dc commented Jun 20, 2025

Fail the transaction operation and shutdown the nexus to avoid succeeding IO via remaining replicas. This ensures data integrity in case the replica being marked here has been picked up as source of truth elsewhere for the volume.

@dsharma-dc dsharma-dc requested a review from a team as a code owner June 20, 2025 04:23
@dsharma-dc
Copy link
Contributor Author

bors try

bors-openebs-mayastor bot pushed a commit that referenced this pull request Jun 20, 2025
@bors-openebs-mayastor
Copy link

try

Build succeeded:

/// Information about children.
pub children: Vec<ChildInfo>,
}

impl Display for NexusInfo {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{{children: [")?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you just call the debug impl here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you just call the debug impl here?

Sure, will remove this, was using during initial testing.

nexus_info.children.iter_mut().for_each(|c| {
if c.uuid == uuid {
c.healthy = *healthy;
}
});

let mut txn = NexusInfoTxn {
key_info: &mut persistent_nexus_info,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this cross call mutation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mutating this in the failure case to set the new flag's value in the in-core object key_info here, just in case required to debug.

@dsharma-dc
Copy link
Contributor Author

bors try

bors-openebs-mayastor bot pushed a commit that referenced this pull request Jun 23, 2025
@bors-openebs-mayastor
Copy link

try

Build succeeded:

@dsharma-dc
Copy link
Contributor Author

bors try

bors-openebs-mayastor bot pushed a commit that referenced this pull request Jun 24, 2025
@bors-openebs-mayastor
Copy link

try

Build succeeded:

Fail the transaction operation and shutdown the nexus to avoid succeeding
IO via remaining replicas. This ensures data integrity in case the replica
being marked here has been picked up as source of truth elsewhere for
this volume.

Signed-off-by: Diwakar Sharma <[email protected]>
@dsharma-dc
Copy link
Contributor Author

bors try

bors-openebs-mayastor bot pushed a commit that referenced this pull request Jul 15, 2025
@bors-openebs-mayastor
Copy link

try

Build succeeded:

Comment on lines +49 to +65
//impl Display for NexusInfo {
// fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
// write!(f, "{{children: [")?;
//
// for (i, cinfo) in self.children.iter().enumerate() {
// if i != 0 {
// write!(f, ", ")?;
// }
// write!(f, "uuid: {}, healthy: {}", cinfo.uuid, cinfo.healthy)?;
// }
// write!(
// f,
// "], clean_shutdown: {}, do_self_shutdown: {}}}",
// self.clean_shutdown, self.do_self_shutdown
// )
// }
//}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this?

})?;

if !txn_resp.succeeded() {
if let TxnOpResponse::Get(g) = &txn_resp.op_responses()[0] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to check if len > 1?


if !txn_resp.succeeded() {
if let TxnOpResponse::Get(g) = &txn_resp.op_responses()[0] {
if let Some(kv) = g.kvs().first() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to take here rather copying the value on return

// This is needed because absence of above response data shouldn't get
// implicitly assumed as success. With this the top level caller can
// compare this returned blank value with some expected value.
return Ok(Some(vec![]));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we return a StoreError here as this seems like an etcd bug if we hit this?

if let Some(current_value) = txn_resp {
let val = serde_json::from_slice::<NexusInfo>(&current_value).unwrap();

warn!("current state found: key - {key}, value - {val:?}");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this to after the if statement?

name: self.name.clone(),
});
}
error!("{self:?}: failed to persist nexus info transaction: {err}");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the same log once strategy we have in save

});
} else {
// Don't need to check individual op responses.
debug!(?key, "{self:?}: state save transaction successful");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as save

Suggested change
debug!(?key, "{self:?}: state save transaction successful");
trace!(?key, "{self:?}: the state was saved successfully");

@@ -1,7 +1,7 @@
//! Definition of a trait for a key-value store together with its error codes.

use async_trait::async_trait;
use etcd_client::{Compare, Error, TxnOp, TxnResponse};
use etcd_client::Error;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed now I hope?

ops_success: Vec<TxnOp>,
ops_failure: Option<Vec<TxnOp>>,
) -> Result<TxnResponse, StoreError> {
new_value: &[u8],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be &impl StoreValue ?

Comment on lines +67 to +71
pub struct NexusInfoTxn<'a> {
pub key_info: &'a mut PersistentNexusInfo,
// Expected value for the key.
pub expected: NexusInfo,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do any of these need to be pub?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants