Contributor

@BenAtAmazon BenAtAmazon commented Oct 7, 2025

Proposed Changes

This PR adds support for multiple hostname paths in the AWS peer discovery plugin to enable zero-downtime rolling upgrades during hostname migration scenarios. The implementation allows RabbitMQ nodes to discover peers using multiple hostname paths, ensuring cluster formation succeeds even when nodes are configured with different hostname paths during rolling upgrades.

Backward Compatibility

Existing single hostname_path configuration continues to work (unchanged).
We fall back to single-path behavior when no numbered paths are configured.

Configuration Examples

Multiple paths for zero-downtime migration
cluster_formation.aws.hostname_path.1 = networkInterfaceSet,2,privateIpAddressesSet,1,privateDnsName
cluster_formation.aws.hostname_path.2 = privateDnsName
cluster_formation.aws.hostname_path.3 = networkInterfaceSet,1,privateIpAddressesSet,2,privateIpAddress

Note: This follows the existing pattern we have for classic_config:

cluster_formation.classic_config.nodes.1 = rabbit@<hostnameA>
cluster_formation.classic_config.nodes.2 = rabbit@<hostnameB>
cluster_formation.classic_config.nodes.3 = rabbit@<hostnameC>

Single path (backward compatible)
cluster_formation.aws.hostname_path = privateDnsName
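
As a rough illustration of the behaviour described above (not the plugin's actual code), the sketch below applies every configured path to every discovered instance and keeps the distinct, non-empty results. The module name `multi_path_sketch`, the function `unique_hostnames/3`, and the `Resolve` argument are assumptions made for this example; `Resolve` stands in for a function like `get_hostname/2` that returns `""` when a path does not apply to an instance.

```erlang
-module(multi_path_sketch).
-export([unique_hostnames/3]).

%% Paths:     list of hostname paths, in configuration order.
%% Instances: list of per-instance property lists.
%% Resolve:   fun(Path, Props) -> Hostname | "" (hypothetical stand-in
%%            for the plugin's own path resolution).
unique_hostnames(Paths, Instances, Resolve) ->
    All = [Resolve(Path, Props) || Props <- Instances, Path <- Paths],
    dedup([H || H <- All, H =/= ""]).

%% Remove duplicates while keeping first-seen order.
dedup(Hostnames) ->
    lists:foldl(fun(H, Acc) ->
                        case lists:member(H, Acc) of
                            true -> Acc;
                            false -> Acc ++ [H]
                        end
                end, [], Hostnames).
```

With two paths configured, an instance that only matches one of them still contributes a hostname, which is the property the migration scenario relies on.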

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
This is simply a reminder of what we are going to look for before merging your code.

  • Mandatory: I (or my employer/client) have signed the CA (see https://github.com/rabbitmq/cla)
  • I have read the CONTRIBUTING.md document
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Paths when is_list(Paths) ->
?LOG_DEBUG("AWS peer discovery using multiple hostname paths"),
Paths;
_Invalid ->
Collaborator

Why would we get a non-list configuration? Can this check be pushed to the cuttlefish schema?

Contributor Author

This check is already handled in the cuttlefish schema, but I was trying to be extra defensive here, which is probably unnecessary. In theory, if the env variable were somehow updated to an invalid value (e.g. a non-list) after the cuttlefish checks but before peer discovery starts, this would allow a fallback instead of a failure. But that's quite impractical, so I will remove the _Invalid case to simplify.

true -> List;
_ -> ""
end
catch
Collaborator

Why do you think there would be a need for a try/catch here?

Contributor Author

I'm trying to guard against get_value/2, not latin1_char_list. I'll update the try to scope this correctly like this (sorry for the confusion):

-spec get_hostname(path(), props()) -> string().
get_hostname([], _Props) ->
    "";  %% Handle empty paths gracefully
get_hostname(Path, Props) ->
    List = try
        lists:foldl(fun get_value/2, Props, Path)
    catch
        _:_ -> ""
    end,
    case io_lib:latin1_char_list(List) of
        true -> List;
        _ -> ""
    end.

The reason a catch around get_value/2 is needed: I noticed that if an invalid path is provided (e.g. this one, when a third network interface doesn't exist on the instance)

cluster_formation.aws.hostname_path.2 = networkInterfaceSet,3,privateIpAddressesSet,1,privateIpAddress

I see this startup failure:

[error] <0.208.0> BOOT FAILED
[error] <0.208.0> ===========
[error] <0.208.0> Exception during startup:
[error] <0.208.0>
[error] <0.208.0> error:function_clause
[error] <0.208.0>
[error] <0.208.0>     lists:nth_1/2, line 301
[error] <0.208.0>         args: [1,[]]
[error] <0.208.0>     rabbit_peer_discovery_aws:get_value/2, line 416
[error] <0.208.0>     lists:foldl_1/3, line 2151
[error] <0.208.0>     rabbit_peer_discovery_aws:get_hostname/2, line 406
[error] <0.208.0>     rabbit_peer_discovery_aws:-extract_unique_hostnames/2-lc$^1/1-1-/4, line 465
[error] <0.208.0>     rabbit_peer_discovery_aws:-extract_unique_hostnames/2-lc$^1/1-1-/4, line 465
[error] <0.208.0>     rabbit_peer_discovery_aws:extract_unique_hostnames/2, line 465
[error] <0.208.0>     rabbit_peer_discovery_aws:get_hostname_name_from_reservation_set/2, line 327

During hostname migration scenarios we may encounter invalid paths like this.
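
The failure in the trace above can be reproduced in isolation. The sketch below is a minimal stand-in, assuming the integer segment of a path is resolved with `lists:nth/2` as the trace suggests; the module name `nth_repro` and the function `third_interface/1` are made up for this example. The `try` here exists only to report the error class for the demonstration.

```erlang
-module(nth_repro).
-export([third_interface/1]).

%% Ask for the third {"item", Props} entry of an interface list.
%% lists:nth/2 raises error:function_clause when the index is past
%% the end of the list, which is what the boot log shows.
third_interface(Interfaces) ->
    try lists:nth(3, Interfaces) of
        {"item", Props} -> {ok, Props}
    catch
        error:function_clause -> {error, function_clause}
    end.
```

Calling `nth_repro:third_interface/1` with a two-element list returns `{error, function_clause}`, while a three-element list returns the third item's properties.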

Collaborator

Why not just account for that case in get_value/2 if it is a possibility?

https://github.com/rabbitmq/rabbitmq-server/pull/14705/files#r2417353623

try/catch should be very, very rarely used.

Collaborator

Try/catch blocks like that are a bit dangerous, as you are catching 'everything' that could go wrong there, even cases where a crash would be preferable because it exposes a bug. Train yourself to think of ways to avoid them, except in cases where you are forced to use them.

Contributor Author

Definitely, in this case I wonder if there's a better way that doesn't add significant complexity though 🤔
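
To make the concern in this thread concrete, here is a small sketch contrasting a catch-all with an explicit check. It is an illustration only, not code from the PR; the module `catch_scope_sketch` and both function names are assumptions.

```erlang
-module(catch_scope_sketch).
-export([nth_catch_all/2, nth_checked/2]).

%% Catch-all: an out-of-range index and a caller bug (for example a
%% string passed where an integer is expected) both become undefined.
nth_catch_all(Index, Items) ->
    try
        lists:nth(Index, Items)
    catch
        _:_ -> undefined
    end.

%% Explicit check: only the anticipated case (index past the end of the
%% list) is handled. A non-integer index or a non-list does not match
%% this clause and crashes with function_clause, exposing the bug.
nth_checked(Index, Items) when is_integer(Index), Index >= 1, is_list(Items) ->
    case Index =< length(Items) of
        true -> lists:nth(Index, Items);
        false -> undefined
    end.
```

Both functions return `undefined` for index 3 on a two-element list, but only `nth_checked/2` fails loudly when handed `"3"` instead of `3`.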

Collaborator

@lukebakken lukebakken left a comment

Please be aware that the genie is extremely lazy with whitespace and will gladly add space characters on blank lines, at the end of lines, etc.

get_hostname_paths() ->
M = ?CONFIG_MODULE:config_map(?BACKEND_CONFIG_KEY),
UsePrivateIP = get_config_key(aws_use_private_ip, M),

Collaborator

There is extraneous whitespace here - in the diff, I see four space characters here.

Adds support for multiple hostname paths in the AWS peer discovery plugin to enable zero-downtime rolling upgrades during hostname migration scenarios.

The implementation allows RabbitMQ nodes to discover peers using multiple hostname paths, ensuring cluster formation succeeds even when nodes are configured with different hostname paths during rolling upgrades.

Example usage:

cluster_formation.aws.hostname_path.1 = privateDnsName
cluster_formation.aws.hostname_path.2 = privateIpAddress

@BenAtAmazon BenAtAmazon force-pushed the aws/add-peer-discovery-multi-hostname-path branch from ca6b404 to 175a956 on October 10, 2025 03:06

catch
_:_ -> false
end
end,
Collaborator

@SimonUnge SimonUnge Oct 10, 2025

Try to avoid try/catch statements, e.g. you could do:

case string:to_integer(Str) of
    {_, ""} -> true;
    _ -> false
end
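
The suggestion works because `string:to_integer/1` returns `{Int, Rest}` on success and `{error, Reason}` otherwise, so an empty `Rest` means the whole string was consumed as an integer. The sketch below wraps that check and, as a guess at how it might be used, splits a comma-separated path value into string and integer segments. The module `int_string_sketch` and `parse_path/1` are hypothetical and are not taken from the plugin; note that signed input such as "-1" also counts as an integer here.

```erlang
-module(int_string_sketch).
-export([is_integer_string/1, parse_path/1]).

%% true only when the entire string parses as an integer.
is_integer_string(Str) ->
    case string:to_integer(Str) of
        {Int, ""} when is_integer(Int) -> true;
        _ -> false
    end.

%% Hypothetical helper: "a,2,b" -> ["a", 2, "b"].
parse_path(Value) ->
    [segment(S) || S <- string:lexemes(Value, ",")].

segment(S) ->
    case string:to_integer(S) of
        {Int, ""} when is_integer(Int) -> Int;
        _ -> S
    end.
```

No exception handling is needed: strings that are not integers fall through to the second clause.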

end.

-spec get_value(string()|integer(), props()) -> props().
get_value(_, []) ->
Collaborator

@lukebakken lukebakken Oct 9, 2025

Suggested change
get_value(Key, []) when is_integer(Key) ->
    [];
get_value(Key, Props) when is_integer(Key) ->
    {"item", Props2} = lists:nth(Key, Props),
    Props2;
get_value(Key, Props) ->
    Value = proplists:get_value(Key, Props),
    sort_ec2_hostname_path_set_members(Key, Value).

Collaborator

@the-mikedavis the-mikedavis Oct 10, 2025

The second clause would be redundant, no? It should always be covered by the first clause.
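
One detail from the stack trace earlier in this conversation may be relevant here: `lists:nth_1/2` failed with `args: [1,[]]` for a path segment of 3, which suggests the list held two items rather than none, so a clause matching only `[]` might not cover that case. The sketch below shows one possible shape that handles any out-of-range index without try/catch. It is an assumption-laden illustration: the module name is made up, the `undefined` clause is an addition so that later segments do not crash, and the real function's call to `sort_ec2_hostname_path_set_members/2` is omitted.

```erlang
-module(get_value_sketch).
-export([get_value/2]).

%% A previous segment found nothing: keep propagating undefined.
get_value(_Key, undefined) ->
    undefined;
%% Index past the end of the list (this includes the empty list).
get_value(Key, Props) when is_integer(Key), is_list(Props),
                           Key > length(Props) ->
    [];
get_value(Key, Props) when is_integer(Key) ->
    {"item", Props2} = lists:nth(Key, Props),
    Props2;
get_value(Key, Props) ->
    %% The real module also sorts set members here; omitted in this sketch.
    proplists:get_value(Key, Props).
```

Folding the failing path from this conversation over an instance with two interfaces ends in `undefined`, which a caller checking `io_lib:latin1_char_list/1` would then turn into `""`.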

(suggestion edited)
