[BUG] ovs-ovn intermittently fails to start with the error "br-int.mgmt: connection failed", and the startup retry takes a long time #4506

Open
mengyu1987 opened this issue Sep 11, 2024 · 2 comments
Labels
bug Something isn't working no-issue-activity

Comments

@mengyu1987
Kube-OVN Version

v1.12.22

Kubernetes Version

v1.27.6

Operation-system/Kernel Version

5.10.0-136.12.0.86.4.hl202.x86_64

Description

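When the ovs-ovn pod on the same node is restarted, it sometimes starts normally within 20s, and sometimes needs about 2m and one failed attempt plus a retry before it comes up. The errors are shown below:
[image: ovs-ovn error log]
[image: ovs-ovn error log-1]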

Steps To Reproduce

Restart ovs-ovn.

Current Behavior

Restarting ovs-ovn reports an error and startup takes a long time, which disrupts VM network connectivity.

Expected Behavior

Restarting ovs-ovn should not report an error, and the pod should start reasonably quickly.

@mengyu1987 mengyu1987 added the bug Something isn't working label Sep 11, 2024
dosubot bot commented Sep 11, 2024

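@mengyu1987 Hello! I'm Dosu, a bot that can help you solve problems, answer questions, and even become a contributor. If you have any questions or run into difficulties, I can help while you wait for a human maintainer.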

I found some similar issues that might be relevant to the problem you're experiencing:

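  1. After deleting and recreating the ovs-ovn pod, the error /var/run/openvswitch/br-int.mgmt: connection failed (No such file or directory) is reported frequently and the pod takes a long time to recover, causing a long interruption to VM workloads. - This issue was closed and the suggested solution was to upgrade to version v1.12.18, as it was fixed in v1.12.17.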

  2. ovs-ovn has been crashed randomly - This issue was resolved by downgrading to version 1.11.10.

  3. OVS reports errors on one node, making that node's network unusable - The suggested solution was to check if the issue was caused by insufficient memory leading to OOM (Out of Memory) errors.
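
Before acting on the upgrade suggestion in the first item, it may be worth comparing the version in this report (v1.12.22) with the release said to contain the fix (v1.12.17). The following is a minimal sketch of such a comparison. `version_ge` is a hypothetical helper, not part of Kube-OVN, and it assumes GNU `sort -V` is available and that releases on the same branch are cumulative.

```shell
#!/usr/bin/env bash
# version_ge is a hypothetical helper: it succeeds when version $1 >= version $2.
# A leading "v" is stripped; ordering relies on GNU `sort -V` (version sort).
version_ge() {
  local a="${1#v}" b="${2#v}"
  [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n1)" = "$b" ]
}

running="v1.12.22"   # Kube-OVN version reported in this issue
fixed_in="v1.12.17"  # release named in the similar issue as containing the fix

if version_ge "$running" "$fixed_in"; then
  echo "$running is at or after $fixed_in"
else
  echo "$running is older than $fixed_in"
fi
```

If the assumption about cumulative releases holds, the reported version would already be past the fix named in the first item, so that suggestion alone may not explain the behavior described here.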

For your specific issue with intermittent startup failures and the br-int.mgmt: connection failed error, you might want to consider the following recommended configurations for Open vSwitch (OVS) and OVN to ensure stable and quick startup times:

  1. OVN Remote Probe Interval:

    OVN_REMOTE_PROBE_INTERVAL=${OVN_REMOTE_PROBE_INTERVAL:-10000}
  2. OVN Remote OpenFlow Interval:

    OVN_REMOTE_OPENFLOW_INTERVAL=${OVN_REMOTE_OPENFLOW_INTERVAL:-180}
  3. Restricting the Number of Threads:
    For hosts with more than 12 cores:

    if [[ $(nproc) -gt 12 ]]; then
        ovs-vsctl --no-wait set Open_vSwitch . other_config:n-revalidator-threads=4
        ovs-vsctl --no-wait set Open_vSwitch . other_config:n-handler-threads=10
    fi
  4. Setting Remote OVN-SB for OVN-Controller:

    function gen_conn_str {
      if [[ -z "${OVN_DB_IPS}" ]]; then
        if [[ "$ENABLE_SSL" == "false" ]]; then
          x="tcp:[${OVN_SB_SERVICE_HOST}]:${OVN_SB_SERVICE_PORT}"
        else
          x="ssl:[${OVN_SB_SERVICE_HOST}]:${OVN_SB_SERVICE_PORT}"
        fi
      else
        t=$(echo -n "${OVN_DB_IPS}" | sed 's/[[:space:]]//g' | sed 's/,/ /g')
        if [[ "$ENABLE_SSL" == "false" ]]; then
          x=$(for i in ${t}; do echo -n "tcp:[$i]:$1",; done | sed 's/,$//')
        else
          x=$(for i in ${t}; do echo -n "ssl:[$i]:$1",; done | sed 's/,$//')
        fi
      fi
      echo "$x"
    }
    ovs-vsctl set open . external-ids:ovn-remote="$(gen_conn_str 6642)"
    ovs-vsctl set open . external-ids:ovn-remote-probe-interval="${OVN_REMOTE_PROBE_INTERVAL}"
    ovs-vsctl set open . external-ids:ovn-openflow-probe-interval="${OVN_REMOTE_OPENFLOW_INTERVAL}"
    ovs-vsctl set open . external-ids:ovn-encap-type="${TUNNEL_TYPE}"
    ovs-vsctl set open . external-ids:hostname="${KUBE_NODE_NAME}"
  5. DPDK Tunnel Interface Configuration:

    DPDK_TUNNEL_IFACE=${DPDK_TUNNEL_IFACE:-br-phy}
    TUNNEL_TYPE=${TUNNEL_TYPE:-geneve}
  6. OVS DPDK Configuration File:

    OVS_DPDK_CONFIG_FILE=/opt/ovs-config/ovs-dpdk-config
    if ! test -f "$OVS_DPDK_CONFIG_FILE"; then
        echo "missing ovs dpdk config"
        exit 1
    fi
    source "$OVS_DPDK_CONFIG_FILE"
  7. Link Socket Directory:

    mkdir -p /usr/local/var/run
    if [ -L /usr/local/var/run/openvswitch ]; then
        echo "sock exist"
    else
        echo "link sock"
        ln -s /var/run/openvswitch /usr/local/var/run/openvswitch
    fi
  8. Kernel Module Checks and Workarounds:

    modinfo -m openvswitch
    modinfo -m geneve
    
    if modinfo -m ip_tables; then
      modprobe ip_tables
    fi
    
    if grep -q "3.10.0-862" /proc/version; then
        echo "kernel version 3.10.0-862 has a nat related bug that will affect ovs function, please update to a version greater than 3.10.0-898"
        exit 1
    fi
    
    if [ ! -f "/proc/net/if_inet6" ] && grep -q "3.10" /proc/version; then
        echo "geneve requires ipv6, please add ipv6.disable=0 to kernel follow the instruction below:"
        echo "
    vi /etc/default/grub
    find GRUB_CMDLINE_LINUX=  and change ipv6.disable=1 to ipv6.disable=0
    grub2-mkconfig -o /boot/grub2/grub.cfg
    reboot
    cat /proc/cmdline"
        exit 1
    fi
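
To make the `gen_conn_str` function in item 4 easier to follow, here is a short usage sketch showing the connection strings it builds. The function body is copied from the snippet above. The IP addresses and the service host and port values are made-up examples, not values from this issue.

```shell
#!/usr/bin/env bash
# gen_conn_str is copied from item 4; all addresses below are made-up examples.
function gen_conn_str {
  if [[ -z "${OVN_DB_IPS}" ]]; then
    if [[ "$ENABLE_SSL" == "false" ]]; then
      x="tcp:[${OVN_SB_SERVICE_HOST}]:${OVN_SB_SERVICE_PORT}"
    else
      x="ssl:[${OVN_SB_SERVICE_HOST}]:${OVN_SB_SERVICE_PORT}"
    fi
  else
    t=$(echo -n "${OVN_DB_IPS}" | sed 's/[[:space:]]//g' | sed 's/,/ /g')
    if [[ "$ENABLE_SSL" == "false" ]]; then
      x=$(for i in ${t}; do echo -n "tcp:[$i]:$1",; done | sed 's/,$//')
    else
      x=$(for i in ${t}; do echo -n "ssl:[$i]:$1",; done | sed 's/,$//')
    fi
  fi
  echo "$x"
}

# Case 1: explicit comma-separated database IPs (whitespace is stripped).
OVN_DB_IPS="10.0.0.1, 10.0.0.2"
ENABLE_SSL=false
gen_conn_str 6642
# prints: tcp:[10.0.0.1]:6642,tcp:[10.0.0.2]:6642

# Case 2: no OVN_DB_IPS, so the service host and port variables are used
# and the port argument is ignored.
OVN_DB_IPS=""
OVN_SB_SERVICE_HOST=10.96.0.10
OVN_SB_SERVICE_PORT=6642
gen_conn_str 6642
# prints: tcp:[10.96.0.10]:6642

# Case 3: any ENABLE_SSL value other than "false" selects the ssl: scheme.
ENABLE_SSL=true
gen_conn_str 6642
# prints: ssl:[10.96.0.10]:6642
```

Note that an unset `ENABLE_SSL` also takes the `ssl:` branch, because the function only tests for the literal string `false`.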

These configurations can help in ensuring that OVS and OVN start up quickly and remain stable [1][2][3].
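
To see how long the management socket takes to appear after a restart, a small polling helper could be run inside the ovs-ovn container. This is only a sketch: `wait_for_path` is a hypothetical helper, not part of Kube-OVN or Open vSwitch, and the socket path is assumed to be the one quoted in the first similar issue above (/var/run/openvswitch/br-int.mgmt).

```shell
#!/usr/bin/env bash
# wait_for_path is a hypothetical helper, not part of Kube-OVN or OVS.
# It polls once per second until the given path exists, or gives up
# after the timeout (in seconds, default 30).
wait_for_path() {
  local path="$1" timeout="${2:-30}" waited=0
  while [ ! -e "$path" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${waited}s waiting for $path" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "$path present after ${waited}s"
}

# Example call (assumed socket path, 120s timeout):
#   wait_for_path /var/run/openvswitch/br-int.mgmt 120
```

The helper only reports when the path shows up. It does not check that ovs-vswitchd is actually answering on the socket, so it is a timing aid rather than a health check.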

Contributor

Issues go stale after 60d of inactivity. Please comment or re-open the issue if you are still interested in getting this issue fixed.
