Skip to content

Conversation

@nmorey
Copy link

@nmorey nmorey commented Nov 7, 2025

While debugging mpich 4.3.2 with ch4:ucx on s390x, several issues popped up.
Found, and tested on 1.19.0

First:

sles@ucx-debug:~/mpich/mpich-4.3.2/src/mpi/romio/test> mpirun -np 2 ./file_info -fname test
[1762529195.399132] [ucx-debug:464075:0]     ucp_context.c:2055 UCX  ERROR UCX_DYNAMIC_TL_PROGRESS_FACTOR must be > 0
[1762529195.399238] [ucx-debug:464074:0]     ucp_context.c:2055 UCX  ERROR UCX_DYNAMIC_TL_PROGRESS_FACTOR must be > 0

This fixed by the first patch. Config parser was wrongly looking for a time instead of a regular int. Not sure why I only see this on s390 though.

Second:

sles@ucx-debug:~/mpich/mpich-4.3.2/src/mpi/romio/test> ./file_info -fname test
[1762533043.448147] [ucx-debug:488518:0]            self.c:276  UCX  ERROR failed to allocate device resource

This is caused by UCT/SELF scaning the option to a uint while it is a size_t in memory.

Summary by CodeRabbit

  • Chores
    • Updated several configuration parameter types to improve parsing and validation consistency for runtime settings.
  • Documentation
    • Added a new contributor to the AUTHORS list.

@coderabbitai
Copy link

coderabbitai bot commented Nov 7, 2025

Walkthrough

Two configuration table entries had their config value types changed: UCP's DYNAMIC_TL_PROGRESS_FACTOR switched from time-units to uint, and UCT self's NUM_DEVICES switched from int to unsigned long. AUTHORS gained one new contributor.

Changes

Cohort / File(s) Summary
UCP Core Configuration
src/ucp/core/ucp_context.c
Changed DYNAMIC_TL_PROGRESS_FACTOR config entry type from UCS_CONFIG_TYPE_TIME_UNITS to UCS_CONFIG_TYPE_UINT.
UCT Self Transport Configuration
src/uct/sm/self/self.c
Changed NUM_DEVICES config entry type from UCS_CONFIG_TYPE_INT to UCS_CONFIG_TYPE_ULONG (affects num_devices representation in config parsing).
Project authorship
AUTHORS
Added Nicolas Morey <[email protected]> to the AUTHORS list.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Check parsing/validation behavior for DYNAMIC_TL_PROGRESS_FACTOR and NUM_DEVICES (type parsing, defaults, and error messages).
  • Verify downstream uses assume the updated numeric types and that any casts are safe.
  • Confirm AUTHORS entry formatting.

Poem

🐇 I nibble bytes beneath the tree,
Types hopped places, small and spry,
Time to uint, int to ulong tune,
A new name added under the moon,
My whiskers twitch — the commit's a sigh.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Config type fixes' accurately reflects the main changes: fixing configuration type mismatches in two modules (UCP and UCT/SELF) that caused runtime errors.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1a484a0 and 87cf007.

📒 Files selected for processing (3)
  • AUTHORS (1 hunks)
  • src/ucp/core/ucp_context.c (1 hunks)
  • src/uct/sm/self/self.c (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/ucp/core/ucp_context.c
  • AUTHORS
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: UCX PR (Static_check Static checks)
  • GitHub Check: UCX PR (Codestyle codespell check)
  • GitHub Check: UCX PR (Codestyle ctags check)
  • GitHub Check: UCX PR (Codestyle commit title)
  • GitHub Check: UCX PR (Codestyle AUTHORS file update check)
  • GitHub Check: UCX PR (Codestyle format code)
  • GitHub Check: UCX release DRP (Prepare CheckRelease)
  • GitHub Check: UCX release (Prepare CheckRelease)
  • GitHub Check: UCX snapshot (Prepare Check)
🔇 Additional comments (2)
src/uct/sm/self/self.c (2)

55-63: LGTM! The config type change correctly addresses the type mismatch.

The change from UCS_CONFIG_TYPE_INT to UCS_CONFIG_TYPE_ULONG properly aligns the configuration parser with the in-memory representation, fixing the "failed to allocate device resource" error mentioned in the PR description.


269-294: I need to search the codebase directly to find the struct definition.

Review comment verification inconclusive—unable to locate struct definition.

The codebase search revealed that the configuration uses UCS_CONFIG_TYPE_INT for the NUM_DEVICES field, but the exact field type definition of num_devices within uct_self_md_t could not be located to fully verify the stated concerns. The function parameter num_tl_devices_p is confirmed to be unsigned *.

Without access to the struct definition or the specific commit showing the "config type fix" referenced in the review, I cannot definitively confirm whether:

  • self_md->num_devices is now unsigned long, size_t, or remains int
  • Whether line 270's loop comparison poses an actual signed/unsigned mismatch
  • Whether line 293's assignment causes truncation on 64-bit platforms

Given that typical device counts are small positive integers and the practical risk is minimal even if type mismatches exist, but verification cannot be completed with available codebase access, further manual inspection of the header file (src/uct/sm/self/self.h) is needed to confirm the actual field types and assess whether changes are necessary.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@nmorey nmorey force-pushed the dev/master/type-fixes branch from da4da6d to 1a484a0 Compare November 18, 2025 17:30
size_t may be larger than an int. This causes issue on big endian systems

Signed-off-by: Nicolas Morey <[email protected]>
@nmorey nmorey force-pushed the dev/master/type-fixes branch from 1a484a0 to 87cf007 Compare November 19, 2025 09:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants