Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Running argocd sync app --dry-run commands affect metrics #21899

Open
3 tasks done
jsolana opened this issue Feb 18, 2025 · 20 comments
Open
3 tasks done

Running argocd sync app --dry-run commands affect metrics #21899

jsolana opened this issue Feb 18, 2025 · 20 comments
Labels
bug Something isn't working

Comments

@jsolana
Copy link
Contributor

jsolana commented Feb 18, 2025

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

Related to #21661

Currently, executing the command argocd app sync --dry-run affects both the application’s state and the internal metrics exposed by ArgoCD (eg: argocd_app_info).

The main issue is that if there are alerts based on these metrics, and the dry-run execution identifies an error (e.g., a change that violates a Kyverno policy or an invalid CRD schema), the application state changes to SyncErr. This also updates the metrics, which can potentially trigger alerts based on these metrics.

For example:

# alert manager alert definition
- alert: ArgoCDApplicationUnknown
  expr: sum by (cluster) (argocd_app_info{sync_status="Unknown"}) > 0
  for: 15m
...

Since the definition of a dry-run is to execute requests without persistence, I wonder if it makes sense to handle it in a way that ensures no changes are made.

Alternatively, adding a dry_run label to differentiate operational requests from dry-run requests could also be an option (a similar change has been proposed for kyverno link).

To Reproduce

Run argocd app sync --dry-run command.

Expected behavior

There are different proposals:

  1. Add a dryrun label to distinguish dryrun activity from real ones.
  2. Ignore dryrun executions (not affecting metrics).

Initially I fond of the 1 because dryrun has a cost associated in terms of resources consumption / performance. Ignoring activity related to dryrun make extremely hard to identify the reason of performance issues.

Version

It is affecting whatever version because currently dryrun executions are not distinguished.

@jsolana jsolana added the bug Something isn't working label Feb 18, 2025
@harshitSkyscanner
Copy link

harshitSkyscanner commented Feb 25, 2025

I am trying to get more exposure to argocd codebase and develop some understanding. It looks like an easy fix if it is just to add a label.
I can help contribute to it.

@jsolana
Copy link
Contributor Author

jsolana commented Feb 25, 2025

Thanks mate, I ve just create this MR to fix it. Let me know wdyt

@agaudreault
Copy link
Member

Have you investigated why the sync status changes for dry runs? This seems more like it should be fixed by not modifying the persisted sync status on dry runs rather than adding it to the metrics.

@harshitSkyscanner
Copy link

harshitSkyscanner commented Feb 25, 2025

Thanks mate, I ve just create this MR to fix it. Let me know wdyt

Ahh, I just saw this. I created the PR too. I did more or less the same thing but also updated some other metrics as well. I wasn't sure about my approach so I didn't added tests just yet.

#22014

@harshitSkyscanner
Copy link

Have you investigated why the sync status changes for dry runs? This seems more like it should be fixed by not modifying the persisted sync status on dry runs rather than adding it to the metrics.

@agaudreault - That is something that can be investigated in parallel but should not affect this PR. Metrics should have an indicator if they were generated via dry run or a normal operation.

@jsolana
Copy link
Contributor Author

jsolana commented Feb 25, 2025

Thanks mate, I ve just create this MR to fix it. Let me know wdyt

Ahh, I just saw this. I created the PR too. I did more or less the same thing but also updated some other metrics as well. I wasn't sure about my approach so I didn't added tests just yet.

#22014

Taking in count this comment not sure if this is the way to go.

Have you investigated why the sync status changes for dry runs? This seems more like it should be fixed by not modifying the persisted sync status on dry runs rather than adding it to the metrics.

The idea is to split this issue in two different ones:

  • This one related to metrics
  • Another one related to change and persist the app status running a dryrun op

Wdyt?

Want to know more about how repo server persists app status, any idea where to look?
Thanks!

@harshitSkyscanner
Copy link

harshitSkyscanner commented Feb 27, 2025

Want to know more about how repo server persists app status, any idea where to look?
Thanks!

I have been looking into this today. Shouldn't this be somewhere in controller/sync.go?
So maybe a fix like this - #22064

I am happy to add more stuff to it if I am even remotely on the right track here.

Also, just to clarify what happens here is during dry run sync, it is that updates the application sync state incorrectly right?
(Any logs here would be helpful as well to see the flow of code)

@crenshaw-dev
Copy link
Member

I agree that it's weird to update the sync state for a dry run... but if we don't do that, does the controller have a way to communicate the success/failure of the dry run up to the UI/CLI? If not, maybe instead of not persisting the state, we just need to persist it to a different field so that we preserve both the real sync state and the dry run sync state?

@harshitSkyscanner
Copy link

I agree that it's weird to update the sync state for a dry run... but if we don't do that, does the controller have a way to communicate the success/failure of the dry run up to the UI/CLI? If not, maybe instead of not persisting the state, we just need to persist it to a different field so that we preserve both the real sync state and the dry run sync state?

I was just trying to reproduce this in my local with argo-cd version 2.14.2.

The full behaviour that I can see is -
SYNC STATUS - remains unchanged
LAST SYNC - Changes to whatever the result of dry run sync was

I kind of agree with @crenshaw-dev here. Looking at the code, if we don't update anything, we won't be able to return the result back correctly. So we need to handle the results of dry run sync in a different manner

@crenshaw-dev
Copy link
Member

crenshaw-dev commented Feb 28, 2025

I think @jsolana's PR probably solves the metrics problem. But yeah, a more general solution would probably involve CRD, controller, API, CLI, and UI work.

@harshitSkyscanner
Copy link

@crenshaw-dev - Any idea on where to start?
Currently at this - #22064

@crenshaw-dev
Copy link
Member

I think we need to start by enumerating the user experiences that would be impacted by not updating the sync status field, whether those experiences are via the UI, CLI, or API. Then we need to figure out how to rebuild those using a new dry run sync status field.

@jsolana
Copy link
Contributor Author

jsolana commented Mar 4, 2025

I'm wondering if this is more of a UI feedback issue—where "Last Sync" isn't clearly communicated—rather than a fundamental change in how Sync and DryRun are handled internally.

Perhaps it would be useful if the UI included information on whether the "Last Sync" operation was a DryRun or not.

Related with #22059

I think @jsolana's PR probably solves the metrics problem

For us this is the "most urgent" issue cause impact developers experience with false positives.
@crenshaw-dev The PR is completed, in case you have time and could take a look. It would super nice 🙏 🙇

@crenshaw-dev
Copy link
Member

crenshaw-dev commented Mar 4, 2025

@jsolana agreed, adding clarifying information about whether it was a dry run might be a sufficient fix with relatively light CR/code changes.

@jsolana
Copy link
Contributor Author

jsolana commented Mar 19, 2025

Thanks to @agaudreault , I now have a much clearer idea of the path to follow (in theory 😅 ):

  • Persist DryRun in the Status of the operation (which corresponds to the State of the Last Operation).
  • Update the metrics to take the data from Status.DryRun (and not from Operation, which reflects the current operation).
  • Improve UI to reflect Last Sync was a dryrun

Make sense?

@jsolana
Copy link
Contributor Author

jsolana commented Mar 28, 2025

Hi!

Do you think SyncOperationResult is the proper place to extend and persist the DryRun value? (code)

// SyncOperationResult represent result of sync operation
type SyncOperationResult struct {
	// Resources contains a list of sync result items for each individual resource in a sync operation
	Resources ResourceResults `json:"resources,omitempty" protobuf:"bytes,1,opt,name=resources"`
	// Revision holds the revision this sync operation was performed to
	Revision string `json:"revision" protobuf:"bytes,2,opt,name=revision"`
	// Source records the application source information of the sync, used for comparing auto-sync
	Source ApplicationSource `json:"source,omitempty" protobuf:"bytes,3,opt,name=source"`
	// Source records the application source information of the sync, used for comparing auto-sync
	Sources ApplicationSources `json:"sources,omitempty" protobuf:"bytes,4,opt,name=sources"`
	// Revisions holds the revision this sync operation was performed for respective indexed source in sources field
	Revisions []string `json:"revisions,omitempty" protobuf:"bytes,5,opt,name=revisions"`
	// ManagedNamespaceMetadata contains the current sync state of managed namespace metadata
	ManagedNamespaceMetadata *ManagedNamespaceMetadata `json:"managedNamespaceMetadata,omitempty" protobuf:"bytes,6,opt,name=managedNamespaceMetadata"`
	// If true, the sync will be performed as a dry-run  without actually applying any changes.
	DryRun bool `json:"dryRun,omitempty" protobuf:"bytes,3,opt,name=dryRun"`
}

Thanks!

@crenshaw-dev
Copy link
Member

My recommendation would be to persist any dry run result in a completely different field (e.g. status.dryRunOperationResult). This would allow the UI/CLI to present the result of both the most recent actual operation and the most recent dry run at the same time.

@agaudreault
Copy link
Member

Currently, the SyncResult is persisted in the OperationState. Knowing that there can only be 1 operation in progress at the same time, there should only be 1 OperationState. The problem is that the status.operationState represent the result of the ongoing operation, but also the lastOperation when completed.

The way I see to solve this is to have 1 new fields status.lastCompletedNonDryRunOperation (for sure another name).

  • When an operation is triggered. It updates the status.operationState.
  • When it completes, if it is NOT a dry run, it copies status.operationState to status.lastCompletedNonDryRunOperation.

Most code stays unaffected because the behavior of status.operationState stays the same. However, it is easy to make a method to return the last effective operation:

getLastNonDryRunOperationResult() {
  if app.status.operationState.operation.DryRun() {
    return app.status.lastCompletedNonDryRunOperation
  }
 // Complete or ongoing non-dry run operation
 return app.status.operationState.operation
}

We could also persist the last dry-run operation in the status, but I see little value to do that.

@crenshaw-dev
Copy link
Member

I think the prohibition of concurrent dry-run and non-dry-run operations is a matter of implementation, not API. I could definitely imagine breaking the dry-run operations into their own work queue in the future.

The UI/CLI currently just pull the latest operation state. Having to toggle between lastCompletedNonDryRunOperation and operationState complicates the client code. Seems like it would be nicer to just fetch the state they want: dry run or non-dry run.

@jsolana
Copy link
Contributor Author

jsolana commented Mar 31, 2025

Hi @agaudreault @crenshaw-dev , In case you can take a look to this PR to confirm I understand properly the proposal

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants