Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VLAB random setup-vpcs errors when configuring server VMs #292

Open
pau-hedgehog opened this issue Dec 30, 2024 · 6 comments
Open

VLAB random setup-vpcs errors when configuring server VMs #292

pau-hedgehog opened this issue Dec 30, 2024 · 6 comments
Assignees

Comments

@pau-hedgehog
Copy link
Contributor

pau-hedgehog commented Dec 30, 2024

There are known issues with the hhnet script: https://github.com/githedgehog/fabricator/blob/master/pkg/hhfab/hhnet.sh

23:54:56 WRN Error running on-ready commands err="setting up VPCs: configuring servers: configuring server \"server-08\": running hhnet \"bond 1008 enp2s1 enp2s2\": Process exited with status 1: out: Failed to get IP address for bond0.1008\n"

But as it's hitting the CI from time to time it needs to be addressed to make the CI more reliable and avoid having to retry the job:
https://github.com/githedgehog/fabricator/actions/runs/12530586269/job/34947281228

@pau-hedgehog pau-hedgehog self-assigned this Dec 30, 2024
@pau-hedgehog
Copy link
Contributor Author

Another hit:
https://github.com/githedgehog/fabricator/actions/runs/12589851894/job/35090547613#step:7:489

00:01:41 WRN Error running on-ready commands err="setting up VPCs: configuring servers: configuring server \"server-02\": running hhnet \"bond 1002 enp2s1 enp2s2\": Process exited with status 1: out: Failed to get IP address for bond0.1002\n"

@pau-hedgehog
Copy link
Contributor Author

pau-hedgehog commented Jan 4, 2025

So the error happens during this line execution:

bin/hhfab vlab up -v --ready setup-vpcs --ready test-connectivity --ready exit --mode=${{ matrix.buildmode }}

It would be helpful to capture the state (logs, outputs) of the VLAB just after the failure to be able to analyse the root cause.

As hhfab vlab up and successive commands run in the same context, currently there is no way to gather logs unless we implement the artifact gathering inside fabricator.

@Frostman , to progress on this one, what's your preferred approach? Expand hhfab helpers or allow hhfab to (also) run "detached" (eg. a Service interface) and allow artifact gathering externally?

@pau-hedgehog
Copy link
Contributor Author

pau-hedgehog commented Jan 20, 2025

I could reproduce this in env-3

14:45:47 INF Configuring VPCs and VPCAttachments
14:45:47 INF Configuring networking on servers
14:45:47 DBG Verified server=server-7
...
14:45:47 DBG Verified server=server-9
14:45:59 INF Configured server=server-5 addr=10.30.5.2/24 netconf="vlan 1005 enp2s1"
14:45:59 INF Configured server=server-7 addr=10.30.7.2/24 netconf="vlan 1007 enp2s1"
14:46:01 DBG Show tech collected successfully entry=server-4 output=/home/ubuntu/hhfab/show-tech-output/server-4-show-tech.log
14:46:01 DBG Show tech collected successfully entry=server-8 output=/home/ubuntu/hhfab/show-tech-output/server-8-show-tech.log
...
14:46:34 DBG Show tech collected successfully entry=as7712-01 output=/home/ubuntu/hhfab/show-tech-output/as7712-01-show-tech.log
14:46:36 DBG Show tech collected successfully entry=as7712-02 output=/home/ubuntu/hhfab/show-tech-output/as7712-02-show-tech.log
14:46:36 INF Show tech files saved in folder=/home/ubuntu/hhfab/show-tech-output
14:46:36 WRN Error running on-ready commands err="setting up VPCs: configuring servers: configuring server \"server-2\": running hhnet \"bond 1002 enp2s1 enp2s2\": context deadline exceeded: out: "

I could see this in the server logs:

-- Boot 247ddf04ef154360af24543a8535b9d6 --
Jan 20 14:30:38 server-2 systemd[1]: Starting systemd-networkd.service - Network Configuration...
Jan 20 14:30:38 server-2 systemd-networkd[1250]: lo: Link UP
Jan 20 14:30:38 server-2 systemd-networkd[1250]: lo: Gained carrier
Jan 20 14:30:38 server-2 systemd-networkd[1250]: Enumeration completed
Jan 20 14:30:38 server-2 systemd[1]: Started systemd-networkd.service - Network Configuration.
Jan 20 14:30:38 server-2 systemd-networkd[1250]: enp2s0: Configuring with /etc/systemd/network/99-lldp.network.
Jan 20 14:30:38 server-2 systemd-networkd[1250]: enp2s0: Link UP
Jan 20 14:30:38 server-2 systemd-networkd[1250]: enp2s0: Gained carrier
Jan 20 14:30:38 server-2 systemd-networkd[1250]: enp2s2: Configuring with /etc/systemd/network/99-lldp.network.
Jan 20 14:30:38 server-2 systemd-networkd[1250]: enp2s2: Link UP
Jan 20 14:30:38 server-2 systemd-networkd[1250]: enp2s1: Configuring with /etc/systemd/network/99-lldp.network.
Jan 20 14:30:38 server-2 systemd-networkd[1250]: enp2s1: Link UP
Jan 20 14:30:39 server-2 systemd-networkd[1250]: enp2s0: DHCPv4 address 172.31.3.10/24 acquired from 172.31.3.2
Jan 20 14:30:40 server-2 systemd-networkd[1250]: enp2s0: Gained IPv6LL
Jan 20 14:42:38 server-2 systemd-networkd[1250]: enp2s2: Gained carrier
Jan 20 14:42:39 server-2 systemd-networkd[1250]: enp2s2: Gained IPv6LL
Jan 20 14:43:44 server-2 systemd-networkd[1250]: enp2s1: Gained carrier
Jan 20 14:43:46 server-2 systemd-networkd[1250]: enp2s1: Gained IPv6LL
Jan 20 14:45:47 server-2 systemd-networkd[1250]: enp2s1: Link DOWN
Jan 20 14:45:47 server-2 systemd-networkd[1250]: enp2s1: Lost carrier
Jan 20 14:45:47 server-2 systemd-networkd[1250]: enp2s1: DHCPv6 lease lost
Jan 20 14:45:48 server-2 systemd-networkd[1250]: enp2s2: Link DOWN
Jan 20 14:45:48 server-2 systemd-networkd[1250]: enp2s2: Lost carrier
Jan 20 14:45:48 server-2 systemd-networkd[1250]: enp2s2: DHCPv6 lease lost
Jan 20 14:45:53 server-2 systemd-networkd[1250]: bond0: Configuring with /usr/lib/systemd/network/zz-default.network.
Jan 20 14:45:53 server-2 systemd-networkd[1250]: bond0: Link UP
Jan 20 14:45:53 server-2 systemd-networkd[1250]: enp2s1: Link UP
Jan 20 14:45:53 server-2 systemd-networkd[1250]: enp2s2: Link UP
Jan 20 14:45:53 server-2 systemd-networkd[1250]: enp2s1: Gained carrier
Jan 20 14:45:53 server-2 systemd-networkd[1250]: bond0: Gained carrier
Jan 20 14:45:53 server-2 systemd-networkd[1250]: enp2s2: Gained carrier
Jan 20 14:45:54 server-2 systemd-networkd[1250]: bond0.1002: Configuring with /usr/lib/systemd/network/zz-default.network.
Jan 20 14:45:54 server-2 systemd-networkd[1250]: bond0.1002: Link UP
Jan 20 14:45:54 server-2 systemd-networkd[1250]: bond0.1002: Gained carrier
Jan 20 14:45:55 server-2 systemd-networkd[1250]: bond0: Gained IPv6LL
Jan 20 14:45:56 server-2 systemd-networkd[1250]: bond0.1002: Gained IPv6LL

Around that time I see an error in the Agent of (s5248-05) and it looks like it restarts:

time=2025-01-20T14:44:23.745Z level=DEBUG msg=Action idx=269 weight=69 summary="Create DHCP relay Vlan1001" command=update path="relay-agent/dhcp/interfaces/interface[id=Vlan1001]"
time=2025-01-20T14:44:23.870Z level=INFO msg="Action applied" idx=269 summary="Create DHCP relay Vlan1001"
time=2025-01-20T14:44:23.870Z level=DEBUG msg=Action idx=270 weight=69 summary="Create DHCP relay Vlan1002" command=update path="relay-agent/dhcp/interfaces/interface[id=Vlan1002]"
time=2025-01-20T14:44:24.079Z level=INFO msg="Action applied" idx=270 summary="Create DHCP relay Vlan1002"
time=2025-01-20T14:44:24.079Z level=DEBUG msg=Action idx=271 weight=69 summary="Create DHCP relay Vlan1003" command=update path="relay-agent/dhcp/interfaces/interface[id=Vlan1003]"
time=2025-01-20T14:44:24.227Z level=INFO msg="Action applied" idx=271 summary="Create DHCP relay Vlan1003"
time=2025-01-20T14:44:24.227Z level=DEBUG msg=Action idx=272 weight=69 summary="Create DHCP relay Vlan1010" command=update path="relay-agent/dhcp/interfaces/interface[id=Vlan1010]"
time=2025-01-20T14:44:24.327Z level=INFO msg="Action applied" idx=272 summary="Create DHCP relay Vlan1010"
time=2025-01-20T14:44:24.328Z level=INFO msg="Config applied" name=s5248-05 gen=3 res=10712 took=1m30.48444858s
time=2025-01-20T14:44:24.328Z level=DEBUG msg="Ensuring alloy is installed and running"
time=2025-01-20T14:44:28.108Z level=DEBUG msg="Alloy ensured" took=946.87283ms
time=2025-01-20T14:44:28.109Z level=INFO msg="IDs ready" install=d2764927-9f79-4ac4-94e3-afd9f4739169 run=e9a891d4-4994-48fa-84ba-61ab66af4d4a
time=2025-01-20T14:44:28.109Z level=INFO msg="Starting watch for config changes in K8s"
time=2025-01-20T14:44:33.912Z level=DEBUG msg="Switch state updated" took=5.749965271s
time=2025-01-20T14:44:34.040Z level=DEBUG msg="Sending heartbeat" name=s5248-05
time=2025-01-20T14:44:39.766Z level=DEBUG msg="Switch state updated" took=5.726302612s
time=2025-01-20T14:44:39.900Z level=DEBUG msg="Sending heartbeat" name=s5248-05
time=2025-01-20T14:44:45.828Z level=DEBUG msg="Switch state updated" took=5.927379521s
time=2025-01-20T14:44:45.911Z level=ERROR msg=Failed err="failed to run agent: failed to heartbeat after processing: failed to update agent heartbeat: Operation cannot be fulfilled on agents.agent.githedgehog.com \"s5248-05\": th
e object has been modified; please apply your changes to the latest version and try again"
time=2025-01-20T14:44:52.275Z level=INFO msg=Starting version=v0.58.0 basedir=/etc/sonic/hedgehog/
time=2025-01-20T14:44:52.311Z level=INFO msg="Config loaded from file" name=s5248-05 gen=3 res=10712
time=2025-01-20T14:44:52.433Z level=INFO msg="Processing agent config" name=s5248-05 gen=3 res=10712
time=2025-01-20T14:44:52.434Z level=INFO msg="Control link configuration applied"
time=2025-01-20T14:44:52.794Z level=DEBUG msg="Desired state generated"
time=2025-01-20T14:45:01.047Z level=DEBUG msg="Actual state loaded" took=8.252908701s
time=2025-01-20T14:45:01.049Z level=DEBUG msg="Actions calculated" count=19
time=2025-01-20T14:45:01.077Z level=DEBUG msg="Actual <> Desired" diff="--- Actual State"
time=2025-01-20T14:45:01.077Z level=DEBUG msg="Actual <> Desired" diff="+++ Desired State"

Does this look like any known issue @Frostman ?

@edipascale
Copy link
Contributor

I just saw this exact same issue in env-1 while testing the VRF scaling, for what it's worth:

admin@s5248-03:~$ grep -C 3 "has been modified" /var/log/agent.log
time=2025-01-20T16:37:30.491Z level=DEBUG msg="Switch state updated" took=57.59832286s
time=2025-01-20T16:37:30.764Z level=DEBUG msg="Sending heartbeat" name=s5248-03
time=2025-01-20T16:38:28.449Z level=DEBUG msg="Switch state updated" took=57.68468957s
time=2025-01-20T16:38:28.678Z level=ERROR msg=Failed err="failed to run agent: failed to heartbeat after processing: failed to update agent heartbeat: Operation cannot be fulfilled on agents.agent.githedgehog.com \"s5248-03\": the object has been modified; please apply your changes to the latest version and try again"
time=2025-01-20T16:38:35.124Z level=INFO msg=Starting version=v0.58.0 basedir=/etc/sonic/hedgehog/
time=2025-01-20T16:38:35.257Z level=INFO msg="Config loaded from file" name=s5248-03 gen=278 res=1883321
time=2025-01-20T16:38:35.424Z level=INFO msg="Processing agent config" name=s5248-03 gen=278 res=1883321
--
time=2025-01-20T16:44:26.493Z level=DEBUG msg="Switch state updated" took=58.052304011s
time=2025-01-20T16:44:26.853Z level=DEBUG msg="Sending heartbeat" name=s5248-03
time=2025-01-20T16:45:23.611Z level=DEBUG msg="Switch state updated" took=56.757426164s
time=2025-01-20T16:45:23.780Z level=ERROR msg=Failed err="failed to run agent: failed to heartbeat after processing: failed to update agent heartbeat: Operation cannot be fulfilled on agents.agent.githedgehog.com \"s5248-03\": the object has been modified; please apply your changes to the latest version and try again"
time=2025-01-20T16:45:30.096Z level=INFO msg=Starting version=v0.58.0 basedir=/etc/sonic/hedgehog/
time=2025-01-20T16:45:30.185Z level=INFO msg="Config loaded from file" name=s5248-03 gen=278 res=1883321
time=2025-01-20T16:45:30.338Z level=INFO msg="Processing agent config" name=s5248-03 gen=278 res=1883321

@pau-hedgehog
Copy link
Contributor Author

pau-hedgehog commented Jan 27, 2025

18:38:17 WRN Error running on-ready commands err="setting up VPCs: configuring servers: configuring server \"server-8\": running hhnet \"vlan 1008 enp2s1\": Process exited with status 1: out: Failed to get IP address for enp2s1.1008\n"

First hit with show-tech captured:
https://github.com/githedgehog/fabricator/actions/runs/12993975054/job/36237967537?pr=349

show-tech-hlab-spine-leaf-true-iso.zip

@pau-hedgehog
Copy link
Contributor Author

show-tech-hlab-spine-leaf-true-iso.zip

	=== systemd-networkd logs ===
Jan 27 18:13:02 localhost systemd[1]: Starting systemd-networkd.service - Network Configuration...
Jan 27 18:13:02 localhost systemd-networkd[748]: lo: Link UP
Jan 27 18:13:02 localhost systemd-networkd[748]: lo: Gained carrier
Jan 27 18:13:02 localhost systemd-networkd[748]: Enumeration completed
Jan 27 18:13:02 localhost systemd[1]: Started systemd-networkd.service - Network Configuration.
Jan 27 18:13:02 localhost systemd-networkd[748]: enp2s0: Configuring with /usr/lib/systemd/network/zz-default.network.
Jan 27 18:13:02 localhost systemd-networkd[748]: enp2s1: Configuring with /usr/lib/systemd/network/zz-default.network.
Jan 27 18:13:02 localhost systemd-networkd[748]: enp2s0: Link UP
Jan 27 18:13:02 localhost systemd-networkd[748]: enp2s1: Link UP
Jan 27 18:13:02 localhost systemd-networkd[748]: enp2s0: Gained carrier
Jan 27 18:13:02 localhost systemd-networkd[748]: enp2s1: Gained carrier
Jan 27 18:13:03 localhost systemd-networkd[748]: enp2s0: DHCPv4 address 172.31.8.10/24 acquired from 172.31.8.2
Jan 27 18:13:03 localhost systemd-networkd[748]: enp2s0: Gained IPv6LL
Jan 27 18:13:04 localhost systemd-networkd[748]: enp2s1: Gained IPv6LL
Jan 27 18:13:04 localhost systemd[1]: Stopping systemd-networkd.service - Network Configuration...
Jan 27 18:13:04 localhost systemd-networkd[748]: enp2s1: DHCPv6 lease lost
Jan 27 18:13:04 localhost systemd-networkd[748]: enp2s0: DHCPv6 lease lost
Jan 27 18:13:04 localhost systemd[1]: systemd-networkd.service: Deactivated successfully.
Jan 27 18:13:04 localhost systemd[1]: Stopped systemd-networkd.service - Network Configuration.
Jan 27 18:13:07 server-8 systemd[1]: Starting systemd-networkd.service - Network Configuration...
Jan 27 18:13:08 server-8 systemd-networkd[1338]: lo: Link UP
Jan 27 18:13:08 server-8 systemd-networkd[1338]: lo: Gained carrier
Jan 27 18:13:08 server-8 systemd-networkd[1338]: Enumeration completed
Jan 27 18:13:08 server-8 systemd[1]: Started systemd-networkd.service - Network Configuration.
Jan 27 18:13:08 server-8 systemd-networkd[1338]: enp2s0: Configuring with /etc/systemd/network/99-lldp.network.
Jan 27 18:13:08 server-8 systemd-networkd[1338]: enp2s1: Configuring with /etc/systemd/network/99-lldp.network.
Jan 27 18:13:08 server-8 systemd-networkd[1338]: enp2s0: Link UP
Jan 27 18:13:08 server-8 systemd-networkd[1338]: enp2s0: Gained carrier
Jan 27 18:13:08 server-8 systemd-networkd[1338]: enp2s1: Link UP
Jan 27 18:13:08 server-8 systemd-networkd[1338]: enp2s1: Gained carrier
Jan 27 18:13:09 server-8 systemd-networkd[1338]: enp2s0: DHCPv4 address 172.31.8.10/24 acquired from 172.31.8.2
Jan 27 18:13:10 server-8 systemd-networkd[1338]: enp2s0: Gained IPv6LL
Jan 27 18:13:10 server-8 systemd-networkd[1338]: enp2s1: Gained IPv6LL
Jan 27 18:25:37 server-8 systemd-networkd[1338]: enp2s1: Lost carrier
Jan 27 18:25:37 server-8 systemd-networkd[1338]: enp2s1: DHCPv6 lease lost
Jan 27 18:36:34 server-8 systemd-networkd[1338]: enp2s1: Link DOWN
Jan 27 18:36:39 server-8 systemd-networkd[1338]: enp2s1: Link UP
Jan 27 18:36:39 server-8 systemd-networkd[1338]: enp2s1.1008: Link UP
Jan 27 18:36:39 server-8 systemd-networkd[1338]: enp2s1.1008: Configuring with /usr/lib/systemd/network/zz-default.network.
Jan 27 18:35:46.411363+00:00 2025 ds3000-03 NOTICE swss#orchagent: :- doVlanTask: Vlan Vlan1008 already present
Jan 27 18:35:46.411384+00:00 2025 ds3000-03 NOTICE swss#orchagent: :- doVlanTask: autostate enable vl.m_autostate enable
Jan 27 18:35:46.411723+00:00 2025 ds3000-03 NOTICE swss#orchagent: :- setRouterIntfsMtu: Set router interface Vlan1008 MTU to 9100
Jan 27 18:35:46.411723+00:00 2025 ds3000-03 NOTICE swss#orchagent: :- findVlanOperState: admin_state UP vl.m_autostate enable up_mmbr_cnt 0 isUp DOWN
Jan 27 18:35:46.411723+00:00 2025 ds3000-03 NOTICE swss#orchagent: :- updateDbVlanOperStatus: vlan Vlan1008 status down
Jan 27 18:35:46.411743+00:00 2025 ds3000-03 NOTICE swss#intfmgrd: :- doTask: Intf Vlan1008 mac 00:00:00:11:11:11 op: SET
Jan 27 18:35:46.411763+00:00 2025 ds3000-03 NOTICE swss#intfmgrd: :- doTask: STATE VLAN TBL Vlan1008 oper down
Jan 27 18:35:46.412554+00:00 2025 ds3000-03 NOTICE swss#orchagent: :- ifChangeInformNextHop: ifChangeInformNextHop Vlan1008 up 0
Jan 27 18:35:46.412554+00:00 2025 ds3000-03 NOTICE swss#orchagent: :- flushFdbByPort: flushFdbByPort for Vlan1008 ; flush_static:0 flush_mclag/evpn mh 
Jan 27 18:35:46.412554+00:00 2025 ds3000-03 NOTICE swss#intfmgrd: :- doTask: Intf Vlan1008 mac 00:00:00:11:11:11 op: SET
Jan 27 18:35:46.412581+00:00 2025 ds3000-03 NOTICE swss#intfmgrd: :- doTask: STATE VLAN TBL Vlan1008 oper down
Jan 27 18:35:46.412603+00:00 2025 ds3000-03 NOTICE swss#orchagent: :- IntfTrackOrch:IntfTrackGroup:update:105 PORT_OPER_STATE_CHANGE Port:Vlan1008 Up:0
Jan 27 18:35:46.412829+00:00 2025 ds3000-03 NOTICE swss#vlanmgrd: :- adjustVlanMtu: No change in MTU of Vlan1008 required, it continues to be oper_mtu = 9036 cfg_mtu = 9100 new_mtu=9100
Jan 27 18:35:46.412829+00:00 2025 ds3000-03 NOTICE swss#vlanmgrd: :- doVlanStateNotificationTask: vlanmgr received port status notification state down vlan Vlan1008 id 1008

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants