-
Notifications
You must be signed in to change notification settings - Fork 2.7k
fix(txt-registry): skip creation of already-existing TXT records (#4914) #5459
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Welcome @u-kai! |
Hi @u-kai. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Thanks. |
What happens the txt record has also been deleted ? |
@mloiseleur Proposed updatefunc (r *TXTRegistry) ApplyChanges(ctx context.Context, c *plan.Changes) error {
...
if err := r.provider.ApplyChanges(ctx, c); err != nil {
return err
}
// ✨ NEW: drop the snapshot so the next Records() call rebuilds it
r.existingTXT = txtSnapshot{} // formerly existingTXT
return nil
}
Let me know if this per-loop cleanup addresses your concerns; I can push the change right away. 🙇♂️ |
This comment was marked as off-topic.
This comment was marked as off-topic.
This comment was marked as off-topic.
This comment was marked as off-topic.
This comment was marked as off-topic.
This comment was marked as off-topic.
There are concerns Regarding TXT Caching as a Solution I'm unclear how caching will resolve the underlying issue. From my perspective, this looks like a bug in either the CRD source, the TXT registry, or how records and plans are being handled. Instead of identifying and fixing that root cause, the current proposal is to introduce a new feature – TXT caching. The issue missing how-to reproduce, what the arguments are and version for reported bug is ~0.13, when latest version is 0.17, which means, the issue may no longer even exist. The approach could actually worsen the situation for TXT records. Implementing a cache on a stateless service means the problem is likely to reappear during restarts or new service startups. This proposed caching mechanism isn't a fix. We need to find the root cause of the problem. The expectation should be to reproduce the issue in a controlled environment and then try to pinpoint where the bug lies. Relevant issue #4998 It appears there's a bug within either the CRD source, the TXT registry, or the mechanism for pulling records for a specific provider and handling the plan. Instead of addressing this underlying bug, the decision has been made to introduce a new feature: TXT caching. |
TXT registry already have a cache Line 46 in 36bc7d6
This is a high-impact issue, and without being able to reproduce it, it's tough to pinpoint the exact cause. It's even possible that I propose to start with:
First two could be done in any order, but with focus on reproducing the problem and providing a unit test, that reveals the bug. This is currently just an assumption, and without reproducing it, it's hard to be certain. We need to confirm that if a TXT registry record exists without a corresponding managed record (like A, AAAA, or CNAME), we either:
|
Fixes #5340 |
Fixes #5003 ^ this issue is related, but not sure where scope of the fix will cover it |
Thank you for your feedback. Initially, I approached this by introducing a caching mechanism. The primary issue I'm addressing is ensuring that ExternalDNS can recreate an A record if it was accidentally deleted. Currently, when attempting to create an A record, ExternalDNS also tries to create a corresponding TXT record via the To resolve this, I implemented a mechanism to store existing TXT records during the Regarding I’ve pushed the updated implementation based on this approach. Please let me know if I misunderstood any part of your suggestion — happy to revise if needed! |
Hi @mloiseleur, I mentioned earlier that I would wait for your confirmation before pushing the changes, but I received a detailed comment from @ivankatliarchuk and pushed the revised implementation to help clarify the current behavior. The latest code uses a temporary structure between Let me know if this direction still looks good to you — I’m happy to adjust if needed! |
Have you managed to validate that issue is still present or it's related to specific flags
|
relates: #3977 |
Yes, I was able to reproduce the issue with both Environment
Steps to Reproduce
Wait for the next reconciliation loop. ResultOn the next loop, ExternalDNS tries to recreate the A record and the associated TXT record.
|
@ivankatliarchuk I'm currently reviewing the test cases and will follow up once everything is ready. |
registry/txt.go
Outdated
generatedTXTs := im.generateTXTRecord(r) | ||
for _, txt := range generatedTXTs { | ||
// If the TXT record is already managed by this instance, skip it | ||
if im.existingTXTs.isManaged(txt) { | ||
continue | ||
} | ||
filteredChanges.Create = append(filteredChanges.Create, txt) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand this logic behind the solution.
Why we could not have something simple like?
for _, r := range filteredChanges.Create {
if im.existingTXTs.is(not)Managed(r) {
continue
}
if r.Labels == nil {
r.Labels = make(map[string]string)
}
r.Labels[endpoint.OwnerLabelKey] = im.ownerID
if im.cacheInterval > 0 {
im.addToCache(r)
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion!
The reason the logic is structured this way is because filteredChanges.Create
does not initially contain TXT records.
The TXT records are generated on the fly using im.generateTXTRecord(r)
, and those are what I intend to filter using existingTXTs
.
Checking r
directly (as in your example) wouldn’t work in this case, since it’s typically an A or CNAME record, not the TXT itself.
Let me know if I misunderstood your suggestion!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current implementation seems quite expensive in terms for computation.
It should be possible to pass any record to im.existingTXTs.is(not)Managed
and validate it somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just pushed a refactor — the generateTXTRecordWithFilter part is a bit more involved, but the main loop is now cleaner and should be more efficient overall.
Open to feedback, let me know what you think!
@u-kai Would you please rebase this PR on top of master, where the flaky test has been fixed ? |
We don't want to recreate TXT records in this case! |
} | ||
|
||
// isNotManaged reports whether the given endpoint's TXT record is absent from the existing set. | ||
// Used to determine whether a new TXT record needs to be created. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the doc string is wrong, but we do not want to recreate the TXT record!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant here was that this is used to determine whether a TXT record should be created — I didn’t intend to say that it is about recreating the record.
But if this is confusing, I’m happy to remove it.
func (im *existingTXTs) reset() { | ||
// Reset the existing TXT records for the next reconciliation loop. | ||
// This is necessary because the existing TXT records are only relevant for the current reconciliation cycle. | ||
im.entries = make(map[recordKey]struct{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a possible data race, I think we should guard all access to im.entries by a Mutex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need a Mutex here. From what I can see, TXTRegister is only used inside Controller.RunOnce, which doesn't seem to use any goroutines.
Also, other cache fields in TXTRegister are being updated without a Mutex as well, so it seems consistent not to add one here.
registry/txt.go
Outdated
@@ -244,7 +301,9 @@ func (im *TXTRegistry) generateTXTRecord(r *endpoint.Endpoint) []*endpoint.Endpo | |||
txt.WithSetIdentifier(r.SetIdentifier) | |||
txt.Labels[endpoint.OwnedRecordLabelKey] = r.DNSName | |||
txt.ProviderSpecific = r.ProviderSpecific | |||
endpoints = append(endpoints, txt) | |||
if filter(txt) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's basically if true
as far as I understand https://github.com/kubernetes-sigs/external-dns/pull/5459/files#r2107584832 , right?
If so please drop the added complexity. If we need filtering than have it but other than that I don't think we should have an if
branch here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we still need the if here because we use isNotManaged for filtering.
The purpose of this change is to filter out existing TXT records when an A record was mistakenly deleted and later recreated by External-DNS.
This filtering is necessary because trying to create a TXT record that already exists would result in an error. So if a TXT record already exists, we avoid creating it again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also addresses the comment raised here: #5459 (comment).
@@ -259,14 +318,18 @@ func (im *TXTRegistry) generateTXTRecord(r *endpoint.Endpoint) []*endpoint.Endpo | |||
txtNew.WithSetIdentifier(r.SetIdentifier) | |||
txtNew.Labels[endpoint.OwnedRecordLabelKey] = r.DNSName | |||
txtNew.ProviderSpecific = r.ProviderSpecific | |||
endpoints = append(endpoints, txtNew) | |||
if filter(txtNew) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see hear 🙇 #5459 (comment)
…cords and ApplyChanges
…to reduce memory usage
@@ -279,7 +342,7 @@ func (im *TXTRegistry) ApplyChanges(ctx context.Context, changes *plan.Changes) | |||
} | |||
r.Labels[endpoint.OwnerLabelKey] = im.ownerID | |||
|
|||
filteredChanges.Create = append(filteredChanges.Create, im.generateTXTRecord(r)...) | |||
filteredChanges.Create = append(filteredChanges.Create, im.generateTXTRecordWithFilter(r, im.existingTXTs.isNotManaged)...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if this is correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see hear 🙇 #5459 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit of cleanup and missing negative tests
For sure we do not want to recreate ownership records.
newEndpointWithOwner("record-1.test-zone.example.org", "1.1.1.1", endpoint.RecordTypeA, ownerId), | ||
}, | ||
}, | ||
// TODO: Test TXT record regeneration when only A/CNAME records exist. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No!
expectedCreate: []*endpoint.Endpoint{ | ||
newEndpointWithOwner("record-1.test-zone.example.org", "1.1.1.1", endpoint.RecordTypeA, ownerId), | ||
}, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are tests missing for:
- do not recreate $RR, if another external-dns instance owns the record
- do not recreate $RR, if there is no owner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you clarify in what situation you think this case would occur? 🙇
In this PR, the tests are intended to verify that records (such as A or CNAME) are properly created when they are desired (e.g. via Ingress host settings).
At this point, the records do not yet exist, so naturally they will have no owner at that stage.
In that sense, the "no owner" case is simply the expected initial state, not an exceptional condition we should test for.
Regarding ownership conflicts: I believe this would happen if multiple external-dns instances were deployed in the same cluster and watching the same resources (e.g. the same Ingress), without proper isolation (such as domainFilter).
However, in my view, this is a misconfiguration, and not something we should test for here.
@szuecs
You're right — there are cases where deleting the TXT record is intentional to relinquish ownership. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rest seems legit to me
We just w8 for @szuecs view
} | ||
} | ||
|
||
func (im *existingTXTs) add(r *endpoint.Endpoint) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But agree, we need to be super careful, as pretty much every deployment rely on txt registry
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What does it do ?
Adds a cache-based filter to TXTRegistry so that
External-DNS
no longertries to recreate
TXT
records that already exist in the provider.Fixes #4914
Motivation
When only the data record (
A/CNAME
) is removed,External-DNS
next sync looprecreates that record and also tries to create the unchanged
TXT
record,which providers like
Route53
reject with “record already exists”.Caching existing
TXT
entries and skipping them inApplyChanges()
preventsthis failure.
More