Missing ECMP nexthops for OSPFv3 inter-area routes #16197

@gromit1811

Description

Since 217e505, topotest ospf6_ecmp_inter_area intermittently fails due to a wrong number of nexthops for certain routes. See the comments on #16055, where this was first mentioned, and the further discussion in the comments on #15899.

Failure rates are ~10% in my tests but vary wildly (I also saw 200 successful runs in a row). @acooks-at-bda reported a 100% failure rate in his tests, but I haven't been able to get anywhere near that, even when I tried to reproduce his environment.

When the error occurs, the initial pre-condition nexthop check in

expect_num_nexthops("r1", [1, 1, 1, 1, 2, 3, 3, 3, 3], 4)
fails.

Note: This report is mostly a placeholder to record the fact that I'm investigating this. I can't dedicate too much time to it, so if somebody wants to help or is faster than me, be my guest 😉

Version

Git master 217e505a67df1ac03483f7c9a97cf4947dd40707

How to reproduce

pytest tests/topotests/ospf6_ecmp_inter_area
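
Because the failure is intermittent, a single run says little about whether it reproduces. The following is a minimal sketch for repeating the command above and computing a failure rate. `run_repeatedly` and `failure_rate` are hypothetical helper names, not part of topotests, and the sketch assumes the command is started from the FRR source root with whatever privileges topotests need.

```python
import subprocess


def run_repeatedly(cmd, runs):
    """Run cmd `runs` times in sequence and return the list of exit codes."""
    exit_codes = []
    for _ in range(runs):
        # Output is discarded here; drop the DEVNULL redirects to keep logs.
        proc = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        exit_codes.append(proc.returncode)
    return exit_codes


def failure_rate(exit_codes):
    """Fraction of runs with a non-zero exit code (0.0 if there were no runs)."""
    if not exit_codes:
        return 0.0
    failed = sum(1 for code in exit_codes if code != 0)
    return failed / len(exit_codes)
```

For example, `failure_rate(run_repeatedly(["pytest", "tests/topotests/ospf6_ecmp_inter_area"], 50))` would give the fraction of 50 runs that failed. Since pytest also exits non-zero for collection or usage errors, the exit codes are only a rough proxy for this particular assertion failing.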

Expected behavior

Test succeeds

Actual behavior

Test sometimes fails with errors like this (nexthop pattern is not always exactly the same):

======================================= test session starts ========================================
platform linux -- Python 3.12.3, pytest-7.4.3, pluggy-1.3.0
rootdir: /home/frr/frr/tests/topotests
configfile: pytest.ini
collected 3 items                                                                                  

tests/topotests/ospf6_ecmp_inter_area/test_ospf6_ecmp_inter_area.py .Fs                      [100%]

============================================= FAILURES =============================================
_______________________________________ test_ecmp_inter_area _______________________________________

    def test_ecmp_inter_area():
        "Test whether OSPFv3 ECMP nexthops are properly updated for inter-area routes after link down"
        tgen = get_topogen()
        if tgen.routers_have_failure():
            pytest.skip(tgen.errors)
    
        def num_nexthops(router):
            # Careful: "show ipv6 ospf6 route json" doesn't work here. It will
            # only list one route type per prefix and that might not necessarily
            # be the best/selected route. "show ipv6 route ospf6 json" only
            # lists selected routes, so that's more useful in this case.
            routes = tgen.gears[router].vtysh_cmd("show ipv6 route ospf6 json", isjson=True)
            route_prefixes_infos = sorted(routes.items())
            # Note: ri may contain one entry per routing protocol, but since
            # we've explicitly requested only ospf6 above, we can count on ri[0]
            # being the entry we're looking for.
            return [ri[0]["internalNextHopActiveNum"] for rp, ri in route_prefixes_infos]
    
        def expect_num_nexthops(router, expected_num_nexthops, count):
            "Wait until number of nexthops for routes matches expectation"
            logger.info(
                "waiting for OSPFv3 router '{}' nexthops {}".format(
                    router, expected_num_nexthops
                )
            )
            test_func = partial(num_nexthops, router)
            _, result = topotest.run_and_expect(
                test_func, expected_num_nexthops, count=count, wait=3
            )
            assert (
                result == expected_num_nexthops
            ), "'{}' wrong number of route nexthops".format(router)
    
        # Check nexthops pre link-down
        # tgen.mininet_cli()
>       expect_num_nexthops("r1", [1, 1, 1, 1, 2, 3, 3, 3, 3], 4)

/home/frr/frr/tests/topotests/ospf6_ecmp_inter_area/test_ospf6_ecmp_inter_area.py:195: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

router = 'r1', expected_num_nexthops = [1, 1, 1, 1, 2, 3, ...], count = 4

    def expect_num_nexthops(router, expected_num_nexthops, count):
        "Wait until number of nexthops for routes matches expectation"
        logger.info(
            "waiting for OSPFv3 router '{}' nexthops {}".format(
                router, expected_num_nexthops
            )
        )
        test_func = partial(num_nexthops, router)
        _, result = topotest.run_and_expect(
            test_func, expected_num_nexthops, count=count, wait=3
        )
>       assert (
            result == expected_num_nexthops
        ), "'{}' wrong number of route nexthops".format(router)
E       AssertionError: 'r1' wrong number of route nexthops
E       assert [1, 1, 1, 1, 2, 1, ...] == [1, 1, 1, 1, 2, 3, ...]
E         At index 5 diff: 1 != 3
E         Use -v to get more diff

/home/frr/frr/tests/topotests/ospf6_ecmp_inter_area/test_ospf6_ecmp_inter_area.py:189: AssertionError
---------------------------------------- Captured log call -----------------------------------------
2024-06-11 18:11:15,253 ERROR: topo: 'num_nexthops' failed after 12.30 seconds
------------------------- generated xml file: /tmp/topotests/topotests.xml -------------------------
===================================== short test summary info ======================================
FAILED tests/topotests/ospf6_ecmp_inter_area/test_ospf6_ecmp_inter_area.py::test_ecmp_inter_area - AssertionError: 'r1' wrong number of route nexthops
============================= 1 failed, 1 passed, 1 skipped in 34.32s ==============================
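
The assertion output above only names the first differing list index, not the prefix behind it. A minimal sketch for mapping mismatches back to prefixes follows. It assumes the JSON shape that `num_nexthops` in the test relies on (prefix mapped to a list of route entries, with entry 0 carrying `internalNextHopActiveNum`). `nexthop_mismatches` is a hypothetical helper and the sample prefixes are made up for illustration, not taken from the test topology.

```python
def nexthop_mismatches(routes, expected_num_nexthops):
    """Return {prefix: (actual, expected)} for routes whose nexthop count differs.

    `routes` is assumed to look like the parsed output of
    "show ipv6 route ospf6 json": prefix -> list of route entries.
    """
    # Sort prefixes as strings, matching sorted(routes.items()) in the test.
    prefixes = sorted(routes)
    if len(prefixes) != len(expected_num_nexthops):
        raise ValueError(
            "got {} routes, expected {}".format(
                len(prefixes), len(expected_num_nexthops)
            )
        )
    mismatches = {}
    for prefix, expected in zip(prefixes, expected_num_nexthops):
        actual = routes[prefix][0]["internalNextHopActiveNum"]
        if actual != expected:
            mismatches[prefix] = (actual, expected)
    return mismatches


# Made-up sample data, only to show the call.
sample = {
    "2001:db8:1::/64": [{"internalNextHopActiveNum": 1}],
    "2001:db8:2::/64": [{"internalNextHopActiveNum": 1}],
    "2001:db8:3::/64": [{"internalNextHopActiveNum": 3}],
}
print(nexthop_mismatches(sample, [1, 3, 3]))
# prints {'2001:db8:2::/64': (1, 3)}
```

Raising on a length mismatch is a deliberate choice in this sketch: a missing route would otherwise shift every later index and produce misleading pairs.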

Additional context

The actual issue seems to be that sometimes one of the two ABRs (R5 and R6) doesn't originate an Inter-Router LSA (type 4) when it should, so one path to the destination (R7) is lost. I don't know yet why that happens.

Note: The problem is most likely caused neither by 217e505 nor by b925570 (the bugfix which the topotest update is trying to verify) but existed before. It was only noticed now because there was no testcase for inter-area ECMP routes before.

Checklist

  • I have searched the open issues for this bug.
  • I have not included sensitive information in this report.

Metadata

Assignees

No one assigned

    Labels

    triage: Needs further investigation
