Description
Current Behavior
Under some conditions, when the IP address of a domain changes, APISIX keeps using the old IP, which causes 504 Gateway Timeout errors.
It never recovers until apisix reload is run.
Meanwhile, the dig and nslookup commands return the newest IP.
Expected Behavior
APISIX should detect that the IP has changed.
Error Logs
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:65: parse_domain_for_nodes(): parse_domain_for_nodes: [{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}], client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.example.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:69: parse_domain_for_nodes(): parse_domain_for_nodes: host=10.105.226.135, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.example.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:84: parse_domain_for_nodes(): parse_domain_for_nodes: add the node back, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.example.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] init.lua:213: parse_domain_in_route(): parse_domain_in_route | new_nodes=[{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}], client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.example.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] init.lua:219: parse_domain_in_route(): parse_domain_in_route | up_conf:{"timeout":{"send":30,"connect":30,"read":30},"hash_on":"vars","type":"roundrobin","parent":{"update_count":0,"modifiedIndex":5360,"orig_modifiedIndex":5360,"clean_handlers":{},"createdIndex":5360,"has_domain":true,"key":"/bk-gateway-apisix/routes/apigw.prod.2347","value":{"timeout":{"send":30,"connect":30,"read":30},"desc":"Returns anything passed in request data.","name":"apigw-prod-anything-get","labels":{"gateway.bk.tencent.com/stage":"prod","gateway.bk.tencent.com/gateway":"apigw"},"update_time":1752566944,"plugins":{"bk-proxy-rewrite":{"match_subpath":false,"uri":"/anything","subpath_param_name":":ext","method":"GET","use_real_request_uri_unsafe":false},"bk-resource-context":{"bk_resource_name":"anything_get","bk_resource_id":2347,"bk_resource_auth":{"verified_user_required":false,"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false},"bk_resource_auth_obj":{"verified_user_required":false,"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false}}},"uris":["/api/apigw/prod/anything","/api/apigw/prod/anything/"],"upstream":{"timeout":"table: 0x7f119b810dd0","hash_on":"vars","type":"roundrobin","parent":"table: 0x7f1199322a98","original_nodes":[{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}],"nodes":"table: 0x7f11693587e0","pass_host":"node","scheme":"http","nodes_ref":"table: 0x7f11693587e0"},"status":1,"id":"apigw.prod.2347","service_id":"apigw.prod.stage-4","priority":0,"methods":["GET"],"create_time":1752566944}},"original_nodes":"table: 0x7f11693587e0","nodes":"table: 0x7f11693587e0","pass_host":"node","scheme":"http","nodes_ref":"table: 0x7f11693587e0"}, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.example.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] init.lua:221: parse_domain_in_route(): parse_domain_in_route | compare result:true, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.example.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] init.lua:223: parse_domain_in_route(): parse_domain_in_route | no change, use old route, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.example.com"
Steps to Reproduce
- Add a route with route.upstream.nodes where nodes[0].host = httpbin, which is a Service (svc) in k8s that routes to the httpbin service.
$ curl -H "X-API-KEY: $admin_key" http://127.0.0.1:9180/apisix/admin/routes/apigw.prod.2347 | jq
{
  "key": "/bk-gateway-apisix/routes/apigw.prod.2347",
  "modifiedIndex": 5360,
  "createdIndex": 5360,
  "value": {
    "timeout": {
      "send": 30,
      "connect": 30,
      "read": 30
    },
    "desc": "Returns anything passed in request data.",
    "name": "apigw-prod-anything-get",
    "update_time": 1752566944,
    "plugins": {
      "proxy-rewrite": {
        "method": "GET",
        "uri": "/anything"
      }
    },
    "create_time": 1752566944,
    "upstream": {
      "timeout": {
        "send": 30,
        "connect": 30,
        "read": 30
      },
      "nodes": [
        {
          "weight": 100,
          "priority": 1,
          "port": 80,
          "host": "httpbin"
        }
      ],
      "pass_host": "node",
      "scheme": "http",
      "type": "roundrobin"
    },
    "labels": {
      "gateway.bk.tencent.com/stage": "prod",
      "gateway.bk.tencent.com/gateway": "apigw"
    },
    "id": "apigw.prod.2347",
    "service_id": "apigw.prod.stage-4",
    "status": 1,
    "methods": [
      "GET"
    ],
    "uris": [
      "/api/apigw/prod/anything",
      "/api/apigw/prod/anything/"
    ]
  }
}
Here, route.upstream.nodes[0].host = httpbin.
- Add core.log.error calls for debugging.
apisix/init.lua
local function parse_domain_in_route(route)
    local nodes = route.value.upstream.nodes
    local new_nodes, err = upstream_util.parse_domain_for_nodes(nodes)
    core.log.error("parse_domain_in_route | new_nodes=", core.json.delay_encode(new_nodes, true))
    if not new_nodes then
        return nil, err
    end

    local up_conf = route.dns_value and route.dns_value.upstream
    core.log.error("parse_domain_in_route | up_conf:", core.json.delay_encode(up_conf, true))
    local ok = upstream_util.compare_upstream_node(up_conf, new_nodes)
    core.log.error("parse_domain_in_route | compare result:", ok)
    if ok then
        core.log.error("parse_domain_in_route | no change, use old route")
        return route
    end

    -- don't modify the modifiedIndex to avoid plugin cache miss because of DNS resolve result
    -- has changed

    -- Here we copy the whole route instead of part of it,
    -- so that we can avoid going back from route.value to route during copying.
    route.dns_value = core.table.deepcopy(route).value
    route.dns_value.upstream.nodes = new_nodes
    core.log.info("parse route which contain domain: ",
                  core.json.delay_encode(route, true))
    return route
end
and
apisix/utils/upstream.lua
local function parse_domain_for_nodes(nodes)
    core.log.error("parse_domain_for_nodes: ", core.json.delay_encode(nodes, true))
    local new_nodes = core.table.new(#nodes, 0)
    for _, node in ipairs(nodes) do
        local host = node.host
        core.log.error("parse_domain_for_nodes: host=", host)
        if not ipmatcher.parse_ipv4(host) and
                not ipmatcher.parse_ipv6(host) then
            local ip, err = core.resolver.parse_domain(host)
            if ip then
                local new_node = core.table.clone(node)
                new_node.host = ip
                new_node.domain = host
                core.table.insert(new_nodes, new_node)
            end

            if err then
                core.log.error("dns resolver domain: ", host, " error: ", err)
            end
        else
            core.log.error("parse_domain_for_nodes: add the node back")
            core.table.insert(new_nodes, node)
        end
    end
    return new_nodes
end
_M.parse_domain_for_nodes = parse_domain_for_nodes
- Run apisix reload and update routes in etcd, which triggers config_etcd.lua:389: sync_data()
- At the same time, delete the httpbin service and kubectl apply it again (the cluster IP changes). [Not 100% reproducible]
- Send a request to the route with curl.
According to error.log:
- The first argument passed to parse_domain_for_nodes is [{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}]; the host is already an IP here.
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:65: parse_domain_for_nodes(): parse_domain_for_nodes: [{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}], client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.example.com"
- Because the host is an IP rather than a domain, core.resolver.parse_domain(host) is never called.
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:69: parse_domain_for_nodes(): parse_domain_for_nodes: host=10.105.226.135, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.example.com"
- The node is then added back unchanged.
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:84: parse_domain_for_nodes(): parse_domain_for_nodes: add the node back, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.example.com"
As a result, the worker never detects the IP change until apisix reload is run.
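The behavior described above can be reproduced outside APISIX with a small standalone sketch. It is only an illustration of the reported symptom, not a patch: is_ipv4, resolve, the dns table, and both parse functions are hypothetical stand-ins rather than APISIX APIs, and 10.105.1.2 is a made-up new cluster IP. The first function mirrors the branching of parse_domain_for_nodes shown earlier; the second is one possible variant that looks at node.domain first, assuming that field is still present on the already-resolved node.

```lua
-- Hypothetical helper: a simplified IPv4 check, standing in for ipmatcher.parse_ipv4.
local function is_ipv4(host)
    local a, b, c, d = host:match("^(%d+)%.(%d+)%.(%d+)%.(%d+)$")
    if not a then
        return false
    end
    for _, part in ipairs({a, b, c, d}) do
        if tonumber(part) > 255 then
            return false
        end
    end
    return true
end

-- Hypothetical fake DNS table; mutate it to simulate a cluster IP change.
local dns = { httpbin = "10.105.226.135" }

local function resolve(host)
    local ip = dns[host]
    if not ip then
        return nil, "not found"
    end
    return ip
end

-- Mirrors the reported branching: a host that is already an IP is passed through.
local function parse_nodes_as_reported(nodes)
    local new_nodes = {}
    for _, node in ipairs(nodes) do
        if is_ipv4(node.host) then
            table.insert(new_nodes, node)
        else
            local ip = resolve(node.host)
            if ip then
                table.insert(new_nodes, { host = ip, domain = node.host, port = node.port })
            end
        end
    end
    return new_nodes
end

-- Assumed variant: re-resolve from node.domain when it is present.
local function parse_nodes_by_domain(nodes)
    local new_nodes = {}
    for _, node in ipairs(nodes) do
        local name = node.domain or node.host
        if is_ipv4(name) then
            table.insert(new_nodes, node)
        else
            local ip = resolve(name)
            if ip then
                table.insert(new_nodes, { host = ip, domain = name, port = node.port })
            end
        end
    end
    return new_nodes
end

local original = { { host = "httpbin", port = 80 } }
local resolved = parse_nodes_as_reported(original) -- host becomes the current IP

dns.httpbin = "10.105.1.2" -- hypothetical new cluster IP after the service is recreated

local stale = parse_nodes_as_reported(resolved) -- input already holds an IP, so no lookup
local fresh = parse_nodes_by_domain(resolved)   -- lookup goes through the domain field
print(stale[1].host, fresh[1].host)
```

Once the resolved node list is fed back in as input, the first function has no domain name left to look up, which matches the "add the node back" log line. Whether the real fix belongs in parse_domain_for_nodes or in whatever replaces route.value.upstream.nodes with resolved nodes is not something this sketch can settle.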
Environment
- APISIX version (run apisix version): 3.2.1
- Operating system (run uname -a):
- OpenResty / Nginx version (run openresty -V or nginx -V):
- etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info):
- APISIX Dashboard version, if relevant:
- Plugin runner version, for issues related to plugin runners:
- LuaRocks version, for installation issues (run luarocks --version):