bug: IP change of an upstream domain in a route is not detected and the old IP is still used #12436

@wklken

Current Behavior

Under some conditions, after the IP of an upstream domain changes, APISIX keeps using the old IP, which causes 504 Gateway Timeout responses.

It never recovers until APISIX is reloaded.

At the same time, the dig and nslookup commands return the newest IP.

Expected Behavior

APISIX should detect that the IP has changed.
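
To make the expectation concrete, here is a minimal standalone sketch of the comparison step. `nodes_equal` is a hypothetical helper written for this illustration (it is not APISIX's `compare_upstream_node`), it compares nodes by position only, and the new IP `10.105.0.99` is made up. After the service is recreated, the freshly resolved node list should differ from the cached one, so the comparison should report a change.

```lua
-- Hypothetical helper: positional comparison of two node lists.
local function nodes_equal(old_nodes, new_nodes)
    if #old_nodes ~= #new_nodes then
        return false
    end
    for i, old in ipairs(old_nodes) do
        local new = new_nodes[i]
        if old.host ~= new.host or old.port ~= new.port
                or old.weight ~= new.weight then
            return false
        end
    end
    return true
end

-- Cached result from before the service was recreated (IP taken from the log).
local cached = { { host = "10.105.226.135", domain = "httpbin", port = 80, weight = 100 } }
-- What a fresh DNS lookup would yield afterwards (made-up new cluster IP).
local fresh = { { host = "10.105.0.99", domain = "httpbin", port = 80, weight = 100 } }

local unchanged = nodes_equal(cached, fresh)
print("unchanged:", unchanged)
```

In the failing case reported here, the equivalent comparison logs `compare result:true`, because the "new" nodes are not freshly resolved at all.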

Error Logs

2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:65: parse_domain_for_nodes(): parse_domain_for_nodes: [{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}], client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.example.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:69: parse_domain_for_nodes(): parse_domain_for_nodes: host=10.105.226.135, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.example.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:84: parse_domain_for_nodes(): parse_domain_for_nodes: add the node back, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.example.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] init.lua:213: parse_domain_in_route(): parse_domain_in_route | new_nodes=[{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}], client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.example.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] init.lua:219: parse_domain_in_route(): parse_domain_in_route | up_conf:{"timeout":{"send":30,"connect":30,"read":30},"hash_on":"vars","type":"roundrobin","parent":{"update_count":0,"modifiedIndex":5360,"orig_modifiedIndex":5360,"clean_handlers":{},"createdIndex":5360,"has_domain":true,"key":"/bk-gateway-apisix/routes/apigw.prod.2347","value":{"timeout":{"send":30,"connect":30,"read":30},"desc":"Returns anything passed in request data.","name":"apigw-prod-anything-get","labels":{"gateway.bk.tencent.com/stage":"prod","gateway.bk.tencent.com/gateway":"apigw"},"update_time":1752566944,"plugins":{"bk-proxy-rewrite":{"match_subpath":false,"uri":"/anything","subpath_param_name":":ext","method":"GET","use_real_request_uri_unsafe":false},"bk-resource-context":{"bk_resource_name":"anything_get","bk_resource_id":2347,"bk_resource_auth":{"verified_user_required":false,"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false},"bk_resource_auth_obj":{"verified_user_required":false,"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false}}},"uris":["/api/apigw/prod/anything","/api/apigw/prod/anything/"],"upstream":{"timeout":"table: 0x7f119b810dd0","hash_on":"vars","type":"roundrobin","parent":"table: 0x7f1199322a98","original_nodes":[{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}],"nodes":"table: 0x7f11693587e0","pass_host":"node","scheme":"http","nodes_ref":"table: 0x7f11693587e0"},"status":1,"id":"apigw.prod.2347","service_id":"apigw.prod.stage-4","priority":0,"methods":["GET"],"create_time":1752566944}},"original_nodes":"table: 0x7f11693587e0","nodes":"table: 0x7f11693587e0","pass_host":"node","scheme":"http","nodes_ref":"table: 0x7f11693587e0"}, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.example.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] init.lua:221: parse_domain_in_route(): parse_domain_in_route | compare result:true, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.example.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] init.lua:223: parse_domain_in_route(): parse_domain_in_route | no change, use old route, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.example.com"

Steps to Reproduce

  1. Add a route whose route.upstream.nodes[0].host is httpbin, a Service (svc) in k8s, so that the route forwards to the httpbin service:
$ curl -H "X-API-KEY: $admin_key"  http://127.0.0.1:9180/apisix/admin/routes/apigw.prod.2347 | jq
{
  "key": "/bk-gateway-apisix/routes/apigw.prod.2347",
  "modifiedIndex": 5360,
  "createdIndex": 5360,
  "value": {
    "timeout": {
      "send": 30,
      "connect": 30,
      "read": 30
    },
    "desc": "Returns anything passed in request data.",
    "name": "apigw-prod-anything-get",
    "update_time": 1752566944,
    "plugins": {
      "proxy-rewrite": {
        "method": "GET",
        "uri": "/anything"
      }
    },
    "create_time": 1752566944,
    "upstream": {
      "timeout": {
        "send": 30,
        "connect": 30,
        "read": 30
      },
      "nodes": [
        {
          "weight": 100,
          "priority": 1,
          "port": 80,
          "host": "httpbin"
        }
      ],
      "pass_host": "node",
      "scheme": "http",
      "type": "roundrobin"
    },
    "labels": {
      "gateway.bk.tencent.com/stage": "prod",
      "gateway.bk.tencent.com/gateway": "apigw"
    },
    "id": "apigw.prod.2347",
    "service_id": "apigw.prod.stage-4",
    "status": 1,
    "methods": [
      "GET"
    ],
    "uris": [
      "/api/apigw/prod/anything",
      "/api/apigw/prod/anything/"
    ]
  }
}

Here, route.upstream.nodes[0].host = httpbin.

  2. Add core.log.error calls for debugging:

apisix/init.lua

local function parse_domain_in_route(route)
    local nodes = route.value.upstream.nodes
    local new_nodes, err = upstream_util.parse_domain_for_nodes(nodes)
    core.log.error("parse_domain_in_route | new_nodes=", core.json.delay_encode(new_nodes, true))
    if not new_nodes then
        return nil, err
    end

    local up_conf = route.dns_value and route.dns_value.upstream
    core.log.error("parse_domain_in_route | up_conf:", core.json.delay_encode(up_conf, true))
    local ok = upstream_util.compare_upstream_node(up_conf, new_nodes)
    core.log.error("parse_domain_in_route | compare result:", ok)
    if ok then
        core.log.error("parse_domain_in_route | no change, use old route")
        return route
    end

    -- don't modify the modifiedIndex to avoid plugin cache miss because of DNS resolve result
    -- has changed

    -- Here we copy the whole route instead of part of it,
    -- so that we can avoid going back from route.value to route during copying.
    route.dns_value = core.table.deepcopy(route).value
    route.dns_value.upstream.nodes = new_nodes
    core.log.info("parse route which contain domain: ",
                  core.json.delay_encode(route, true))
    return route
end

and

apisix/utils/upstream.lua

local function parse_domain_for_nodes(nodes)
    core.log.error("parse_domain_for_nodes: ", core.json.delay_encode(nodes, true))
    local new_nodes = core.table.new(#nodes, 0)
    for _, node in ipairs(nodes) do
        local host = node.host
        core.log.error("parse_domain_for_nodes: host=", host)
        if not ipmatcher.parse_ipv4(host) and
                not ipmatcher.parse_ipv6(host) then
            local ip, err = core.resolver.parse_domain(host)
            if ip then
                local new_node = core.table.clone(node)
                new_node.host = ip
                new_node.domain = host
                core.table.insert(new_nodes, new_node)
            end

            if err then
                core.log.error("dns resolver domain: ", host, " error: ", err)
            end
        else
            core.log.error("parse_domain_for_nodes: add the node back")
            core.table.insert(new_nodes, node)
        end
    end

    return new_nodes
end
_M.parse_domain_for_nodes = parse_domain_for_nodes

  3. Run apisix reload and update the routes in etcd to trigger config_etcd.lua:389: sync_data().
  4. At the same time, delete the httpbin service and kubectl apply it again (the cluster IP will change). [Not 100% reproducible]
  5. curl the route.
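
The stale state can be modelled outside APISIX. The sketch below is a simplified standalone imitation of the resolution loop, not the real module: `fake_dns`, `resolve`, and `is_ipv4` are hypothetical stand-ins for the DNS server, `core.resolver.parse_domain`, and `ipmatcher.parse_ipv4`, and the new IP is made up. It shows that once the function receives nodes whose host is already an IP, the resolver is not consulted again.

```lua
-- Hypothetical DNS table; the first IP is the one seen in the error log.
local fake_dns = { httpbin = "10.105.226.135" }
local resolve_calls = 0

-- Stand-in for core.resolver.parse_domain; counts how often it is used.
local function resolve(host)
    resolve_calls = resolve_calls + 1
    return fake_dns[host]
end

-- Simplified IPv4 check (IPv6 is omitted in this sketch).
local function is_ipv4(host)
    local a, b, c, d = host:match("^(%d+)%.(%d+)%.(%d+)%.(%d+)$")
    if not a then
        return false
    end
    for _, part in ipairs({ a, b, c, d }) do
        if tonumber(part) > 255 then
            return false
        end
    end
    return true
end

-- Same branching as the instrumented function: IPs are added back as-is.
local function parse_domain_for_nodes(nodes)
    local new_nodes = {}
    for _, node in ipairs(nodes) do
        if is_ipv4(node.host) then
            new_nodes[#new_nodes + 1] = node
        else
            local ip = resolve(node.host)
            if ip then
                new_nodes[#new_nodes + 1] = {
                    host = ip, domain = node.host,
                    port = node.port, weight = node.weight,
                }
            end
        end
    end
    return new_nodes
end

-- First pass: the configured node holds a domain, so DNS is queried.
local configured = { { host = "httpbin", port = 80, weight = 100 } }
local resolved = parse_domain_for_nodes(configured)

-- The service is recreated and gets a new (made-up) cluster IP.
fake_dns.httpbin = "10.105.0.99"

-- Second pass: the input is the already-resolved list, as in the log.
local calls_before = resolve_calls
local stale = parse_domain_for_nodes(resolved)
local resolver_called_again = resolve_calls > calls_before
print(stale[1].host, resolver_called_again)
```

How the already-resolved list ends up as the input in the real worker is not established by this sketch; it only reproduces the branch that the log shows being taken.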

According to the error.log:

  1. The first argument of parse_domain_for_nodes is [{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}]; the host is already an IP here.

2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:65: parse_domain_for_nodes(): parse_domain_for_nodes: [{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}], client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.example.com"

  2. Since the host is not a domain, core.resolver.parse_domain(host) is not called.

2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:69: parse_domain_for_nodes(): parse_domain_for_nodes: host=10.105.226.135, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.example.com"

  3. The node is then added back unchanged.

2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:84: parse_domain_for_nodes(): parse_domain_for_nodes: add the node back, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.example.com"


So the worker never detects the IP change until APISIX is reloaded.
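
One possible direction, offered only as a sketch and not as the actual APISIX fix: since the resolved node in the log still carries the original name in its `domain` field, the loop could resolve `node.domain` when it is present instead of trusting `node.host`. Everything below is standalone; `parse_nodes_preferring_domain`, `is_ipv4`, the injected `resolve` callback, and the new IP are hypothetical names and values for illustration.

```lua
-- Simplified IPv4 check (IPv6 is omitted in this sketch).
local function is_ipv4(host)
    local a, b, c, d = host:match("^(%d+)%.(%d+)%.(%d+)%.(%d+)$")
    if not a then
        return false
    end
    for _, part in ipairs({ a, b, c, d }) do
        if tonumber(part) > 255 then
            return false
        end
    end
    return true
end

-- Hypothetical variant: prefer the recorded domain over a possibly stale host.
-- `resolve` is injected so the sketch does not depend on a real resolver.
local function parse_nodes_preferring_domain(nodes, resolve)
    local new_nodes = {}
    for _, node in ipairs(nodes) do
        local name = node.domain or node.host
        if is_ipv4(name) then
            -- A literal IP with no recorded domain: nothing to resolve.
            new_nodes[#new_nodes + 1] = node
        else
            local ip = resolve(name)
            if ip then
                -- Shallow copy so the input node is left untouched.
                local new_node = {}
                for k, v in pairs(node) do
                    new_node[k] = v
                end
                new_node.host = ip
                new_node.domain = name
                new_nodes[#new_nodes + 1] = new_node
            end
        end
    end
    return new_nodes
end

-- Input as seen in the log: host is already an IP, domain is still known.
local logged = {
    { weight = 100, host = "10.105.226.135", domain = "httpbin",
      priority = 1, upstream_host = "httpbin", port = 80 },
}
-- Made-up DNS answer after the service was recreated.
local fake_dns = { httpbin = "10.105.0.99" }
local fresh = parse_nodes_preferring_domain(logged, function(name)
    return fake_dns[name]
end)
print(fresh[1].host, fresh[1].domain)
```

With this variant the result differs from the cached nodes, so a subsequent comparison would report a change. Whether this is the right place to fix the problem depends on why resolved nodes are passed in at all, which the log alone does not settle.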

Environment

  • APISIX version (run apisix version): 3.2.1
  • Operating system (run uname -a):
  • OpenResty / Nginx version (run openresty -V or nginx -V):
  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info):
  • APISIX Dashboard version, if relevant:
  • Plugin runner version, for issues related to plugin runners:
  • LuaRocks version, for installation issues (run luarocks --version):
