Role not idempotent for infiniband interface #277

Open
mjrasobarnett opened this issue Aug 27, 2020 · 12 comments
@mjrasobarnett

Hello,
I've been trying out this role against a server that has both ethernet and infiniband interfaces, and I've found that the infiniband interface is continually updated on every run.

I am using the role with the following variables:

  roles:
    - role: "network"
      network_provider: "{{ network_provider_os_default }}"
      network_allow_restart: no
      network_connections:
        - name: 'em4'
          type: 'ethernet'
          interface_name: 'em4'
          state: 'up'
          ip:
            address:
              - "192.168.0.1/24"
        - name: 'ib0'
          type: 'infiniband'
          state: 'up'
          interface_name: 'ib0'
          ip:
            address:
              - "10.0.0.1/16"

The remote host is a RHEL 7.8 system, and I'm using the latest tag of this repo, v1.2.0, with Ansible 2.9.10.

The 'Configure networking connection profiles' task always shows as changed, with the following in the stderr returned:

  stderr: |-
    [005] <info>  #0, state:up persistent_state:present, 'em4': connection em4, d1e89262-2a9a-4a05-b551-f867a1dee0a4 already up to date
    [006] <info>  #1, state:up persistent_state:present, 'ib0': update connection ib0, ef06f091-c51b-43fa-91e8-b3691a085506
    [007] <info>  #0, state:up persistent_state:present, 'em4': up connection em4, d1e89262-2a9a-4a05-b551-f867a1dee0a4 skipped because already active
    [008] <info>  #1, state:up persistent_state:present, 'ib0': up connection ib0, ef06f091-c51b-43fa-91e8-b3691a085506 (is-modified)
    [009] <info>  #1, state:up persistent_state:present, 'ib0': connection reapplied
  stderr_lines: <omitted>

I've been attempting to trace the behaviour through the module code on the remote host, and I can see that the key check is inside the 'run_action_present' function of the Cmd_nm class, where self.nmutil.connection_compare is returning false.

I've been trying to observe why this is, but it's becoming difficult as I think this function is then using python bindings from NetworkManager itself to compare the two connection objects, and I haven't been able to follow what happens at this point.

I was wondering if someone might be able to offer any guidance on what might be the issue here, and how I could observe the state perhaps to see why it keeps trying to update this IB connection continuously?

Many thanks,
Matt

@tyll
Member

tyll commented Aug 27, 2020

@thom311 do you have an idea?

@mjrasobarnett
Author

I'd like to be able to trace down the error myself here to try and give a more precise description of the problem (if any), but I've gotten stuck right now coming up against function calls out to the libnm python interface.

I was trying to drill into the function:

return not (not (con_a.compare(con_b, compare_flags)))

and I can see from the docs for the libnm python interface there is the useful looking function: https://lazka.github.io/pgi-docs/#NM-1.0/classes/Connection.html#NM.Connection.diff

so I've been naively trying to use this to figure out what is different for this IB interface, by adding the following lines to the 'connection_compare' function:

        out_settings = {}
        print("Diff connections")
        diff_result = con_a.diff(con_b, compare_flags, out_settings)
        compare_result = con_a.compare(con_b, compare_flags)
        print("Diff result: {}".format(diff_result))
        print("Compare result: {}".format(compare_result))
        pdb.set_trace()
        return not (not (con_a.compare(con_b, compare_flags)))

however this gives me an error:

Diff connections
(process:24687): libnm-CRITICAL **: 12:23:40.086: ((libnm-core/nm-connection.c:656)): assertion '<dropped>' failed
Diff result: False
Compare result: False
> /home/cloud-user/.ansible/tmp/ansible-tmp-1598548135.701105-14411-223849403578804/debug_dir/ansible/modules/network_connections.py(757)connection_compare()
-> return not (not (con_a.compare(con_b, compare_flags)))

so I suspect I'm not using this function correctly, but I'm not sure what I should be passing for 'out_settings' here.

I wondered if you might have a hint for me on how I can debug this possibly?

Thanks again!

@thom311
Contributor

thom311 commented Sep 1, 2020

@mjrasobarnett it seems that nm_connection_diff() doesn't have the correct gtk-annotations for the @out_settings argument. So I think it will be hard (impossible?) to call it in its current form.

See https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/blob/f15c7bbe8d5dc6278248ae3b5938d968cbf31b43/libnm-core/nm-connection.c#L613

@thom311
Contributor

thom311 commented Sep 1, 2020

I failed to annotate nm_connection_diff() in a way that works with pygobject. Something like

@out_settings: (type GLib.HashTable(utf8,GLib.HashTable(utf8,utf8))) (out) (allow-none) (transfer full):

should work, but it doesn't.

That means the function cannot be called from Python.

@thom311
Contributor

thom311 commented Sep 1, 2020

> I was wondering if someone might be able to offer any guidance on what might be the issue here, and how I could observe the state perhaps to see why it keeps trying to update this IB connection continuously?

Enable level=TRACE logging, and see what NetworkManager logs when the profile gets updated. See https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/contrib/fedora/rpm/NetworkManager.conf#n28 for hints about logging.

@mjrasobarnett
Author

Thanks a lot for looking into this for me.

I've enabled TRACE logging, and I've captured the following for my server when this is executed and the profile is updated:

Sep 01 16:57:53 rds-oss39 NetworkManager[50605]: <trace> [1598975873.7744] auth: call[70]: CheckAuthorization(org.freedesktop.NetworkManager.settings.modify.system), subject=unix-process[pid=51721, uid=0, start=43556992] (succeeding for root)
Sep 01 16:57:53 rds-oss39 NetworkManager[50605]: <trace> [1598975873.7744] auth: call[70]: completed: authorized=1, challenge=0 (simulated)
Sep 01 16:57:53 rds-oss39 NetworkManager[50605]: <trace> [1598975873.7749] ifcfg-rh: write: write connection ib0 (9ea65e9e-afb1-4f0b-bb5c-efd1db53dd52) to file "/etc/sysconfig/network-scripts/ifcfg-ib0"
Sep 01 16:57:53 rds-oss39 NetworkManager[50605]: <debug> [1598975873.7751] ifcfg-rh: write: connection ib0 (9ea65e9e-afb1-4f0b-bb5c-efd1db53dd52) was modified by persisting it to "/etc/sysconfig/network-scripts/ifcfg-ib0"
Sep 01 16:57:53 rds-oss39 NetworkManager[50605]: <info>  [1598975873.7755] settings-connection[0x559ee828d0e0,9ea65e9e-afb1-4f0b-bb5c-efd1db53dd52]: write: successfully updated (ifcfg-rh: update /etc/sysconfig/network-scripts/ifcfg-ib0), connection was modified in the process
Sep 01 16:57:53 rds-oss39 NetworkManager[50605]: <debug> [1598975873.7756] Saving secrets for connection /org/freedesktop/NetworkManager/Settings/1 (ib0)
Sep 01 16:57:53 rds-oss39 NetworkManager[50605]: <info>  [1598975873.7756] audit: op="connection-update" uuid="9ea65e9e-afb1-4f0b-bb5c-efd1db53dd52" name="ib0" args="802-3-ethernet.s390-options,802-3-ethernet.mac-address-blacklist" pid=51721 uid=0 result="success"
Sep 01 16:57:53 rds-oss39 NetworkManager[50605]: <debug> [1598975873.7759] device[0x559ee82bad30] (em2): add_pending_action (1): 'autoactivate'
Sep 01 16:57:53 rds-oss39 NetworkManager[50605]: <debug> [1598975873.7759] device[0x559ee82c1ba0] (em3): add_pending_action (1): 'autoactivate'
Sep 01 16:57:53 rds-oss39 NetworkManager[50605]: <debug> [1598975873.7760] device[0x559ee82bad30] (em2): remove_pending_action (0): 'autoactivate'
Sep 01 16:57:53 rds-oss39 NetworkManager[50605]: <debug> [1598975873.7760] device[0x559ee82c1ba0] (em3): remove_pending_action (0): 'autoactivate'
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <trace> [1598975883.6851] auth: call[71]: CheckAuthorization(org.freedesktop.NetworkManager.network-control), subject=unix-process[pid=51721, uid=0, start=43556992] (succeeding for root)
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <trace> [1598975883.6851] auth: call[71]: completed: authorized=1, challenge=0 (simulated)
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <debug> [1598975883.6853] device[0x559ee82d1bb0] (ib0): reapply (version-id 3 (unmodified))
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <debug> [1598975883.6853] firewall: [0x559ee8297400,change*:"ib0"]: firewall zone change ib0:default (not running, simulate success)
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <debug> [1598975883.6853] device[0x559ee82d1bb0] (ib0): ip4-config: update (commit=1, new-config=0x559ee8273130)
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <debug> [1598975883.6853] platform: (ib0) address: adding or updating IPv4 address: 10.44.241.5/16 lft forever pref forever lifetime 588-0[4294967295,4294967295] dev 6 flags noprefixroute src unknown
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <trace> [1598975883.6854] platform-linux: event-notification: RTM_NEWADDR, flags 0, seq 73: 10.44.241.5/16 lft forever pref forever lifetime 588-0[4294967295,4294967295] dev 6 flags permanent,noprefixroute src kernel
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <debug> [1598975883.6854] platform-linux: do-add-ip4-address[6: 10.44.241.5/16]: success
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <trace> [1598975883.6854] platform: ip4-dev-route: register 10.44.0.0/16 via 0.0.0.0 dev 6 metric 0 mss 0 rt-src rt-kernel scope link pref-src 10.44.241.5
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <debug> [1598975883.6854] device[0x559ee82d1bb0] (ib0): ip6-config: update (commit=1, new-config=0x559ee8241a50)
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <debug> [1598975883.6855] device[0x559ee82d1bb0] (ib0): ip6-config: update IP Config instance (/org/freedesktop/NetworkManager/IP6Config/5)
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <debug> [1598975883.6855] dns-mgr: (device_ip_config_changed): queueing DNS updates (1)
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <trace> [1598975883.6855] policy: set-hostname: updating hostname (ip6 conf)
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <trace> [1598975883.6855] hostname: transient hostname retrieval failed
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <trace> [1598975883.6855] policy: get-hostname: "rds-oss39"
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <trace> [1598975883.6855] hostname: transient hostname retrieval failed
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <trace> [1598975883.6855] policy: get-hostname: "rds-oss39"
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <trace> [1598975883.6855] policy: set-hostname: hostname already set to 'rds-oss39' (from system configuration)
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <debug> [1598975883.6855] dns-mgr: (device_ip_config_changed): DNS configuration did not change
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <debug> [1598975883.6855] dns-mgr: (device_ip_config_changed): no DNS changes to commit (0)
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <trace> [1598975883.6855] pacrunner: call[0x559ee82d3c90]: removing...
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <debug> [1598975883.6855] pacrunner: call[0x559ee83072a0]: send: new config ({'Interface': <'ib0'>, 'Method': <'direct'>, 'BrowserOnly': <false>},)
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <trace> [1598975883.6855] pacrunner: call[0x559ee83072a0]: sending...
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <info>  [1598975883.6856] audit: op="device-reapply" interface="ib0" ifindex=6 pid=51721 uid=0 result="success"
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <debug> [1598975883.6856] firewall: [0x559ee8297400,change*:"ib0"]: complete: fake success
Sep 01 16:58:03 rds-oss39 NetworkManager[50605]: <debug> [1598975883.6857] pacrunner: call[0x559ee83072a0]: sending failed: GDBus.Error:org.freedesktop.DBus.Error.NameHasNoOwner: Name "org.pacrunner" does not exist

One line that caught my eye is:

Sep 01 16:57:53 rds-oss39 NetworkManager[50605]: <info>  [1598975873.7756] audit: op="connection-update" uuid="9ea65e9e-afb1-4f0b-bb5c-efd1db53dd52" name="ib0" args="802-3-ethernet.s390-options,802-3-ethernet.mac-address-blacklist" pid=51721 uid=0 result="success"

I'm not quite sure why though, as the log doesn't show any values changing. I'll test some more with other interfaces to see what I can find.

@thom311
Contributor

thom311 commented Sep 1, 2020

It tells you that NM thinks the properties "802-3-ethernet.s390-options,802-3-ethernet.mac-address-blacklist" differ. This is probably also what nm_connection_diff() would tell you. There is a bug here somewhere.
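
As a rough aid for reading such logs, here is a minimal Python sketch that pulls the property names out of the args= field of a "connection-update" audit line. The line format is taken only from the log quoted above, and the helper name changed_properties is made up for this illustration; it is not part of the role or of NetworkManager.

```python
import re

# Matches key="value" pairs as they appear in the audit line quoted above.
# Unquoted fields such as pid=51721 are deliberately ignored.
_AUDIT_FIELD = re.compile(r'(\w+)="([^"]*)"')


def changed_properties(audit_line):
    """Return the property names listed in args= of a connection-update line.

    Hypothetical helper: returns an empty list for any other audit operation.
    """
    fields = dict(_AUDIT_FIELD.findall(audit_line))
    if fields.get("op") != "connection-update":
        return []
    return [name for name in fields.get("args", "").split(",") if name]


line = (
    'audit: op="connection-update" uuid="9ea65e9e-afb1-4f0b-bb5c-efd1db53dd52" '
    'name="ib0" args="802-3-ethernet.s390-options,802-3-ethernet.mac-address-blacklist" '
    'pid=51721 uid=0 result="success"'
)
for prop in changed_properties(line):
    print(prop)
```

Run against the journal output of a playbook run, something like this would list which properties NM considered changed on each update.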

@mjrasobarnett
Author

Yes, this looks to be the issue. I added a couple of lines to dump all the parameters of the 'con_cur' and 'con_new' connections, and comparing them shows:

$ diff con_cur con_new
23c23
<   connection.timestamp: 1598550808
---
>   connection.timestamp: 0
25,39d24
<   802-3-ethernet.auto-negotiate: False
<   802-3-ethernet.cloned-mac-address: None
<   802-3-ethernet.duplex: None
<   802-3-ethernet.generate-mac-address-mask: None
<   802-3-ethernet.mac-address: None
<   802-3-ethernet.mac-address-blacklist: []
<   802-3-ethernet.mtu: 0
<   802-3-ethernet.name: 802-3-ethernet
<   802-3-ethernet.port: None
<   802-3-ethernet.s390-nettype: None
<   802-3-ethernet.s390-options: <GLib.HashTable object at 0x7f8bf874b738 (GHashTable at 0x1848360)>
<   802-3-ethernet.s390-subchannels: []
<   802-3-ethernet.speed: 0
<   802-3-ethernet.wake-on-lan: 1
<   802-3-ethernet.wake-on-lan-password: None
46c31
<   ipv4.addresses: <GLib.PtrArray object at 0x7f8bf874b738 (GPtrArray at 0x1889920)>
---
>   ipv4.addresses: <GLib.PtrArray object at 0x7f8bf874b738 (GPtrArray at 0x1889a40)>
66c51
<   ipv4.routes: <GLib.PtrArray object at 0x7f8bf874b738 (GPtrArray at 0x1889960)>
---
>   ipv4.routes: <GLib.PtrArray object at 0x7f8bf874b738 (GPtrArray at 0x1889a80)>
69c54
<   ipv6.addresses: <GLib.PtrArray object at 0x7f8bf874b738 (GPtrArray at 0x1889980)>
---
>   ipv6.addresses: <GLib.PtrArray object at 0x7f8bf874b738 (GPtrArray at 0x1889aa0)>
84c69
<   ipv6.method: ignore
---
>   ipv6.method: auto
89c74
<   ipv6.routes: <GLib.PtrArray object at 0x7f8bf874b738 (GPtrArray at 0x18899a0)>
---
>   ipv6.routes: <GLib.PtrArray object at 0x7f8bf874b738 (GPtrArray at 0x1889ac0)>

So when it's doing 'connection_create', for some reason these 802-3-ethernet fields are not created.
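
For anyone wanting to repeat this kind of comparison, below is a minimal sketch of a per-property diff in plain Python. It assumes each connection has already been flattened into a nested dict of the form {setting_name: {property: value}}; one way to get that from libnm might be con.to_dbus(NM.ConnectionSerializationFlags.ALL).unpack(), but that is an assumption and is not exercised here (it would also leave out properties that are at their default value). The function name diff_settings and the sample data are hypothetical, loosely modelled on the diff above.

```python
def diff_settings(cur, new):
    """Compare two {setting_name: {property: value}} dicts.

    Returns a sorted list of ("setting.property", cur_value, new_value)
    tuples for every property that differs or exists on one side only.
    """
    missing = object()  # sentinel, so a real None value is not read as "absent"
    differences = []
    for setting in sorted(set(cur) | set(new)):
        cur_props = cur.get(setting, {})
        new_props = new.get(setting, {})
        for prop in sorted(set(cur_props) | set(new_props)):
            a = cur_props.get(prop, missing)
            b = new_props.get(prop, missing)
            if a != b:
                differences.append((
                    "{}.{}".format(setting, prop),
                    "<absent>" if a is missing else a,
                    "<absent>" if b is missing else b,
                ))
    return differences


# Hypothetical sample data, not the real profiles from this host.
con_cur = {
    "connection": {"id": "ib0", "type": "infiniband"},
    "802-3-ethernet": {"mtu": 0, "mac-address-blacklist": []},
    "ipv6": {"method": "ignore"},
}
con_new = {
    "connection": {"id": "ib0", "type": "infiniband"},
    "ipv6": {"method": "auto"},
}

for name, a, b in diff_settings(con_cur, con_new):
    print("{}: {!r} -> {!r}".format(name, a, b))
```

Comparing plain values rather than printed object representations also avoids spurious differences such as the GPtrArray lines above, which appear to differ only in the pointer address shown.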

I can't see anything in the ansible module setting these fields directly though, so perhaps this is an issue in libnm? From observing what happens when changing values on an ethernet interface on the same host, I can see that these settings appear in the connection object in 'connection_create' after the lines:

        if connection["type"] == "ethernet":
            s_con.set_property(
                NM.SETTING_CONNECTION_TYPE, NM.SETTING_WIRED_SETTING_NAME
            )
            s_wired = self.connection_ensure_setting(con, NM.SettingWired)
            s_wired.set_property(NM.SETTING_WIRED_MAC_ADDRESS, connection["mac"])

@mjrasobarnett
Author

Ah, adding a quick workaround of this form:

        elif connection["type"] == "infiniband":
            s_con.set_property(
                NM.SETTING_CONNECTION_TYPE, NM.SETTING_INFINIBAND_SETTING_NAME
            )
            s_infiniband = self.connection_ensure_setting(con, NM.SettingInfiniband)
+           s_wired = self.connection_ensure_setting(con, NM.SettingWired)
            s_infiniband.set_property(
                NM.SETTING_INFINIBAND_MAC_ADDRESS, connection["mac"]
            )
            s_infiniband.set_property(
                NM.SETTING_INFINIBAND_TRANSPORT_MODE,
                connection["infiniband"]["transport_mode"],
            )
            if connection["infiniband"]["p_key"] != -1:
                s_infiniband.set_property(
                    NM.SETTING_INFINIBAND_P_KEY, connection["infiniband"]["p_key"]
                )
                if connection["parent"]:
                    s_infiniband.set_property(
                        NM.SETTING_INFINIBAND_PARENT,
                        ArgUtil.connection_find_master(
                            connection["parent"], connections, idx
                        ),
                    )

Appears to fix the behaviour.

@thom311
Contributor

thom311 commented Sep 3, 2020

This is indeed very likely a bug in NetworkManager. Still, maybe the role should try to work around it. The role's main goal is to create a well-defined profile. Always adding an ethernet section to an infiniband profile seems generally desirable to me.

mjrasobarnett added a commit to mjrasobarnett/network that referenced this issue Sep 8, 2020
@ffmancera ffmancera self-assigned this Oct 22, 2020
@cathay4t
Collaborator

cathay4t commented Oct 22, 2020

From my testing, NM.SettingWired should not be included on IPoIB interfaces, as they don't have layer 2.

The speed and MAC address of an IPoIB interface are not changeable. The MTU should be set via NM.SettingInfiniband.props.mtu.

@thom311
Contributor

thom311 commented Oct 23, 2020

From the information here, I don't understand who adds the [ethernet] setting, or where.

I agree with Gris that the infiniband profile probably should not have such a setting, but where does it come from?

mjrasobarnett added a commit to mjrasobarnett/network that referenced this issue Oct 22, 2021