Skip to content

Conversation

@ddeclerck
Copy link
Collaborator

@ddeclerck ddeclerck commented Jul 21, 2025

Just experiments for now.

Works "just fine" on ppc64le & ppc64le-p10.
A couple of failures and an unexpected pass on s390x:

31: signal handling SIGTERM in COBOL                FAILED (used_binaries.at:1445)                     note: fails randomly
800: CALL C with callback, PROCEDURE DIVISION EXTERN FAILED (run_misc.at:4958)
801: CALL C with callback, ENTRY-CONVENTION EXTERN   FAILED (run_misc.at:5034)
869: runtime check: write to internal storage (1)    FAILED (run_misc.at:14337)
928: System routines for files                       FAILED (run_file.at:4174)
985: INDEXED file numeric keys ordering              UNEXPECTED PASS
1188: CALL BY VALUE numeric literal with SIZE IS      FAILED (run_extensions.at:2001)
1189: CALL BY VALUE to C                              FAILED (run_extensions.at:2088)
1252: EXHIBIT statement                               FAILED (run_extensions.at:5395)
1264: EXAMINE TALLYING                                FAILED (run_extensions.at:6221)

@ddeclerck ddeclerck force-pushed the ibm_ci branch 4 times, most recently from 0bd1054 to 7987b46 Compare July 21, 2025 18:54
@ddeclerck
Copy link
Collaborator Author

ddeclerck commented Jul 23, 2025

Most of the failures seem to be related to the use of big-endian.

Let's explore the failure on test 800 (801 is similar).

The relevant part is this COBOL "extern" procedure:

       IDENTIFICATION   DIVISION.
       PROGRAM-ID.      callback.
       ENVIRONMENT DIVISION.
       CONFIGURATION SECTION.
       SPECIAL-NAMES.
           CALL-CONVENTION 0 IS EXTERN.
       DATA             DIVISION.
       LINKAGE          SECTION.
       01 P1            USAGE POINTER.
       01 P2            USAGE BINARY-LONG.
       01 P3            PIC X(8).
       PROCEDURE        DIVISION EXTERN USING
                        BY VALUE P1 P2 BY REFERENCE P3.
           IF P1 NOT EQUAL ADDRESS OF P3
              DISPLAY "P1 != ADDRESS OF P3: " P1.
           IF P2 NOT EQUAL 42
              DISPLAY "P2 != 42: " P2.
           IF P3 NOT EQUAL "CALLBACK"
              DISPLAY "P3 != CALLBACK: " P3.
           EXIT PROGRAM.

The generated C code for this procedure is (relevant parts only):

int
callback (cob_u8_t *b_35, cob_s64_t b_36, cob_u8_t *b_37)
{
  ...
  return callback_ (0, (cob_u8_ptr)&b_35, (cob_u8_ptr)&b_36, b_37);
}

static int
callback_ (const int entry, cob_u8_t *b_35, cob_u8_t *b_36,  cob_u8_t *b_37)
{
  ...
  /* Line: 29        : IF                 : prog.cob */
  module->statement = STMT_IF;
  module->module_stmt = 0x0010001D;
  if (((int)cob_cmp_s32 (b_36, COB_S64_C(42)) != 0))
  {
    /* Line: 30        : DISPLAY            : prog.cob */
    module->statement = STMT_DISPLAY;
    module->module_stmt = 0x0010001E;
    cob_display (0, 1, 2, &c_3, COB_SET_DATA (f_36, b_36));
  }
  ...
}

So we have a COBOL procedure that takes a 32-bit integer as second argument, and the corresponding C argument (callback function) is declared as a 64-bit integer. When calling callback_, we take a pointer to the memory location of this argument, and use that pointer when calling cob_cmp_s32. This only considers the first 32-bits / 4 bytes of the given memory location. Works fine on little-endian, where the LSB are stored first, but not on big-endian, since the LSB are stored last.

A possible solution would be to have C declarations match the COBOL argument sizes more closely, but that would probably break the current ABI. Or the pointers could be adjusted according the the actual size when performing the intermediate call (for instance in the case above, we could generate return callback_ (0, (cob_u8_ptr)&b_35, (cob_u8_ptr)&b_36 + 4, b_37);).

Test 928 is also an endian issue. We are calling the CBL_READ_FILE routine with flag 128.
The C prototype for this function is:

int
cob_sys_read_file (unsigned char *file_handle, unsigned char *file_offset, unsigned char *file_len, unsigned char *flags, unsigned char *buf)

The generated code that calls this function is:

    cob_content content_4;
    ...
    content_4.dataint = 128;
    ...
    b_2 = cob_sys_read_file (b_19, b_20, b_21, content_4.data, b_22);

cob_sys_read_file considers only the first byte of the flags argument, but we've stored a 32-bit integer in big-endian order, so the relevant byte is not the first byte (it would be on little-endian).

For test 1252, the problem comes from different generated field types for TALLY: on a little-endian platform, it is a COB_TYPE_NUMERIC_COMP5 (with flag COB_FLAG_BINARY_SWAP), while on big endian it is a COB_TYPE_NUMERIC_BINARY. Thie yields different code paths being executed in termio.c:cob_display_common, with different ouput sizes.

@GitMensch
Copy link
Collaborator

GitMensch commented Jul 28, 2025

cob_sys_read_file considers only the first byte of the flags argument, but we've stored a 32-bit integer in big-endian order, so the relevant byte is not the first byte (it would be on little-endian).

please add a first/last position code here using the precompiler and the WORDS_BIGENDIAN defintion

A possible solution would be to have C declarations match the COBOL argument sizes more closely, but that would probably break the current ABI.

As the type is "BY VALUE", the 64bit integer should be used in all cases, the cast is wrong (that's an "unfinished" part so breaking ABI here is OK and if we don't believe this will have a big impact even does not need a NEWS entry).
Can you adjust codegen here?`

I guess rebasing will fix the non-ibm CIs, no?

@ddeclerck
Copy link
Collaborator Author

I guess rebasing will fix the non-ibm CIs, no?

Once PR #237 is merged, yes.

@GitMensch
Copy link
Collaborator

For test 1252, the problem comes from different generated field types for TALLY: on a little-endian platform, it is a COB_TYPE_NUMERIC_COMP5 (with flag COB_FLAG_BINARY_SWAP), while on big endian it is a COB_TYPE_NUMERIC_BINARY. Thie yields different code paths being executed in termio.c:cob_display_common, with different ouput sizes.

hm... that part is about "expectations" how non-pretty/binary is printed...
Depending on the flow in that function it may help to explicit enable -fpretty-display ... if not then we can use two expected outputs or sed them...

@GitMensch
Copy link
Collaborator

BTW: Can we debug via ssh on the IBM provided CI environments?

@ddeclerck
Copy link
Collaborator Author

ddeclerck commented Jul 28, 2025

BTW: Can we debug via ssh on the IBM provided CI environments?

Not sure, I'll ask them.

I set up an Ubuntu/S390x QEMU VM on my machine for my investigations, but it's (obvisouly) very slow.

Copy link
Collaborator

@GitMensch GitMensch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, everything "just works" :-)

I'm a bit confused why the big-endian issues "just vanished"?!?

It would be interesting to see a run that uses VBISAM to check how the result looks there (we may disable it later if there are too much errrors).

Comment on lines 69 to 72
run: |
sed -i '/AT_SETUP(\[runtime check: write to internal storage (1)\])/a AT_SKIP_IF(\[true\])' tests/testsuite.src/run_misc.at
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you possibly want to add the reason as a comment, I think that's "hardened builds break on failure, so libcob cannot detect the error directly after"

@ddeclerck
Copy link
Collaborator Author

Wow, everything "just works" :-)

I'm a bit confused why the big-endian issues "just vanished"?!?

Sorry for the false joy - the testsuite step in th S390x CI is set to "continue-on-error" ; the issues are still there 😅

@GitMensch
Copy link
Collaborator

Ah, I was already questioning myself... then please let it fail ... and possibly add those to expected failure via sed, get the CI in the repo, then do a follow-up PR (may just use this one if directly committed to the 3x workflows, otherwise a new one) which drops the "sed" and fixes the issues.
... I'm quite sure I've fixed the last one (cobc failure), I just cannot remember what that was...

@ddeclerck ddeclerck force-pushed the ibm_ci branch 2 times, most recently from 1261d77 to 4a9d7eb Compare July 29, 2025 22:56
@ddeclerck ddeclerck changed the title [WIP] Add IBM CI [WIP] Fix IBM CI Jul 29, 2025
@ddeclerck ddeclerck changed the base branch from gcos4gnucobol-3.x to gitside-gnucobol-3.x November 20, 2025 16:23
@ddeclerck ddeclerck force-pushed the ibm_ci branch 2 times, most recently from 1c24029 to d770655 Compare November 21, 2025 15:52
@ddeclerck
Copy link
Collaborator Author

@GitMensch We've started to tackle these big-endian issues (since they also occur on AIX, which we need to handle for our new customer).

For now I'm attempting to fix this one:

1266: EXAMINE TALLYING                                FAILED (run_extensions.at:6221)

This one occus because we hit CB_TREE_TAG_UNEXPECTED_ABORT (x); in codegen.c:output_integer (the unhandled case being FIELD). This occurs because on big-endian machines, EXAMINE X TALLYING ... generates an assign where the destination is a field (the register TALLY), as you can see in typeck.c:cb_build_move_num_zero (the flag flag_binary_swap is false on this case). On little-endian machine, we instead build a call to memset, which does not cause this bug.

My quick fix is to wrap the destination under a reference in tree.c:cb_build_assign if I detect it is a field. I don't know if this is the "right" way to do it, but it "just works".

@GitMensch
Copy link
Collaborator

please add Changelogs and check if/how this is solved in 4.x (a merge of a change that includes even more is often useful, if possible);
Ron did run his tests also on AIX and Solaris, if I remember correctly.

@ddeclerck
Copy link
Collaborator Author

please add Changelogs and check if/how this is solved in 4.x (a merge of a change that includes even more is often useful, if possible); Ron did run his tests also on AIX and Solaris, if I remember correctly.

Good point indeed.
Testing, 4.x has more failures in general on AIX.
But for the specific test about calling CBL_READ_FILE, 4.x just passes a properly sized field instead of a numeric literal, ie:

        01 FLAGS        PIC X(1)   USAGE COMP-X VALUE 128.
        ...
        CALL 'CBL_READ_FILE' USING
        FHANDLE OFFSET NBYTES FLAGS READ-BUFFER

The code itself for the cob_sys_*** functions in 4.x does not contain the big-endian tweaks that are present in 3.x.
I made some tests on MF on AIX and indeed numeric literals passed by reference are always 32-bit. 4.x matches this behavior.
Should we make 3.x behave the same ?

@GitMensch
Copy link
Collaborator

You cannot pass a literal BY REFERENCE, can you?
It would be either BY CONTENT (generating a temporary buffer in the caller [COMP-5]?) or BY VALUE.

Can you please check what MF actually does?
Is that a 32bit environment passing 32bit integers or is it a 64bit one?
What happens if you pass a 64bit integer (binary truncation?)

... and how do GC3 + GC4 handle that?

@ddeclerck
Copy link
Collaborator Author

You cannot pass a literal BY REFERENCE, can you? It would be either BY CONTENT (generating a temporary buffer in the caller [COMP-5]?) or BY VALUE.

MF on AIX (Server Express 5.1) allows BY REFERENCE on a numeric literal, without warning/error. I belive this just behaves as BY CONTENT...

Can you please check what MF actually does? Is that a 32bit environment passing 32bit integers or is it a 64bit one? What happens if you pass a 64bit integer (binary truncation?)

We're on 64-bit. I made some tests. Passing BY VALUE a numeric literal larger than 32-bit gives a compile-time error. If you pass it BY CONTENT, the literal may be larger than 32-bit, but truncated to 32-bit at run time.

@GitMensch
Copy link
Collaborator

Please share your test programs (ideally as a commit in the testsuite of this PR ;-) even if those currently "fail" in GC) and I'll rerun them with VisualCOBOL 32+64 bit on RHEL.

@ddeclerck
Copy link
Collaborator Author

ddeclerck commented Nov 27, 2025

Please share your test programs (ideally as a commit in the testsuite of this PR ;-) even if those currently "fail" in GC) and I'll rerun them with VisualCOBOL 32+64 bit on RHEL.

I had in fact two variations of the following program, one for GnuCOBOL and one for AIX (not sure yet what "expected" output we would want in the test suite). This is the GnuCOBOL variant.
cprog.c
prog.cob.txt

The AIX variant has a 32-bit constant instead in the 4th call, and has a different output on the 5th et 6th calls (the value is truncated to 32-bit, if we use print_p64 it is shown twice, but if we use print_p32 then we can't distinguish the GnuCOBOL and MF behaviors).

I.e the AIX output (if we adjust the literal in the 4th call):

Received value (64-bit): 123456789abcdef0
Pointed value (64-bit): 123456789abcdef0
Pointed value (64-bit): 123456789abcdef0
Received value (64-bit): 12345678
Pointed value (64-bit): 9abcdef09abcdef0
Pointed value (64-bit): 9abcdef09abcdef0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants