-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dialect option: normalize BCD "on-the-fly" when moving from ALPHANUMERIC to NUMERIC #200
base: gcos4gnucobol-3.x
Are you sure you want to change the base?
Conversation
2918afc
to
675824c
Compare
This seems to work. Should probably be improved. MSYS2 CI failure is caused by different output format for floats (eg. where we have '-1.2345679E+9' in most environments, in MSYS2 we have '-1.2345679E+09') - any idea where this could be adjusted ? |
We had a similar issue with COMP-2 printing before, the solution was to never "printf" but instead use snprintf and post-adjust, if necessary. Just FYI: That's the result with MF, after dropping the extension COMP-N (and the invalid index use):
... with explicit dropping the numeric checks: MF VC
MF VC
|
I see. So it seems BCD is also "normalized" in MF and IBM, except that the sign is mostly ignored (except in MF when mving to NUMERIC-EDITED). Also I don't get why the results for binary in MF have exactly the opposite sign compared to GCOS... And then there are also differences in the way the fields are displayed - but I think this is not related to normalization. I'm thinking maybe we should have two options, one to normalize the digits, and one for the sign (but then it would have to be more than a YES/NO flag, to account for MF keeping the sign only when moving to NUMEDIC-EDITED). |
If I don't miss something, the most visible thing is that this is NOT about BCD (COMP-3/COMP-6) normalization, but about normalization of I'll do the "review" after some meeting, to comment on the actual changes. (just FYI - added the ACU results which are different and took too long to get in the first place; I'd ignore that output for now) |
Well, I'd still say it's BCD-related, since we're trying to interpret DISPLAY data as unpacked BCD. BTW, it could be interesting to try with an ASCII sign ("p") instead of the EBCDIC sign ("}") in SRC-ALNUM. |
Note: I've added the ASCII p variant above as well. Concerning the PR: that's effectively a general issue. I've thought that I have broken something there, but this code goes back to at least OC 1.1, my changes only produces the same "intended" result with less instructions.
The question is how to go on. Using COB_D2I does work (is a bit more expensive, of course), so I tend to use it unconditionally in this function. Another approach would be to try following COBOL 202x MOVE statement, general rule 7 d 4, which may (I'm not sure) should only be done if the target is neither numeric-display nor numeric-edited (or its national variants if the source is national):
That possibly yields nearly the same result as this PR (nearly because the data will likely always be handled as unsigned). What do you think? |
Would that imply doing this unconditionally instead of using a flag ?
Would this be configurabled through a dialect flag ? Our customer expects the sign to be preserved, as it is the case on GCOS. |
For both options there would be an option to make this depending on a dialect (or optimization) flag. If you like the first and go down that route I think there's no option necessary, is it? It may be good to have that test include a performance check option, as in gnucobol/tests/testsuite.src/run_misc.at Lines 5609 to 5625 in 675824c
This allows to easily verify both correctness and speed if we change that later on. |
I do not have the same results with MF VC (ASCII):
The two differences I notice are:
Now, I barely know how MF works, so maybe I'm doing something wrong ? Or maybe it's something that behaves differently depending on the version... |
a58b6f1
to
9115c8c
Compare
@ddeclerck Given 44c96d2 - what is the current state of this PR (and is it still high-priority)? If it is high, please rebase after upstreaming #204. |
Also note that this change, together with a minimal test case, should be upstreamed as well, potentially with a comment /* skip leading zeroes + */. |
Well, that commit you mention (from PR #201) was about fixing stuff and adding a bit of normalization on moves to NUMERCI-EDITED. This PR is about normalizing when moving from ALPHANUMERIC to NUMERIC.
It's still kind of important, although not as critical as the other issues reported in the past few days (and a new one that just arrived tonight 😭). |
You'd want this in a separate PR ? |
As you want, either a separate one or directly upstream. |
I went through a PR just to ensure the CI passes on all platforms. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Friendly ping @ddeclerck for the current state (a rebase can be done as well ;-)
cobc/config.def
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wasn't our conclusion that we did not need any config option for that as all environments do a normalization and we consider that a fix to a very old issue going back to OpenCOBOL?
Note: for both the dialect option and the general change we need a NEWS entry because of the difference to before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure, since different dialects do not always normalize the same way ; some consider the sign, some ignore it, some don't normalize on some targets (for instance when moving to NUMERIC-EDITED). Depends how close we want to be to the original environment. It's true that a non-normalized result is not really useful in practice, and we also have warnings and/or exceptions to detect those invalid data.
int sign; | ||
cob_field_attr attr; | ||
cob_field field; | ||
COB_FIELD_INIT (COB_FIELD_SIZE (f), COB_FIELD_DATA (f), &attr); | ||
COB_ATTR_INIT (COB_TYPE_NUMERIC_DISPLAY, COB_FIELD_SIZE (f), 0, COB_FLAG_HAVE_SIGN, NULL); | ||
sign = cob_real_get_sign (&field, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be more reasonable to duplicate the code of the called functions here? This way we don't need an intermediate field definition, just getting const char *p last_data = f->data + f->size - 1;
and check p
as in the function above?
@@ -309,6 +309,7 @@ cob_move_alphanum_to_display (cob_field *f1, cob_field *f2) | |||
const unsigned char *e2 = s2 + COB_FIELD_SIZE (f2); | |||
const unsigned char dec_pt = COB_MODULE_PTR->decimal_point; | |||
const unsigned char num_sep = COB_MODULE_PTR->numeric_separator; | |||
unsigned char last; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do we need that for? The called function doesn't change the data, does it?
(numeric.c (cob_decimal_set_display) may change that, so that's a different thing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this function (indirectly) calls cob_real_get_sign
, which alters the byte holding the sign (always trailing when normalizing), we have to save/restore it (or maybe we could use cob_put_sign
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we drop the "normalize bcd function" and get the sign "directly", then we don't need to "unpunch" anything and therefore don't need to store/reset that position, do we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, probably. I'll dive further into this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want an ascii test here for normalization as well.
Haven't looked at that in the past few days (focused on the GC3/GC4 merge ;) ), but that's really a PR I'd like to complete. I added comments to some of yours. There was also an unanswered questions in a former comment: #200 (comment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted before I suggest to drop the dialect configuration for now and always:
- check for leading separate sign (adjusting the data offset as we did before)
- iterate over the digits using
COB_D2I
- check for trailing embedded sign (as done with GCOS)
This won't give us exact results for "real" invalid data, but should be good in most cases, no?
It does change the result for "valid" (with embedded sign) to be as with other compilers, preserves the current behavior with "valid" (leading separate) and mostly changes that "invalid" (unexpected) data is only used as zero for real bad data (half-byte > 9, if not a sign).
The part that seems the most tricky to me is how to handle "embedded" spaces; this part may or may not still need a dialect configuration - we'd possibly need to check what compilers do here and if it is anywhere but in GnuCOBOL handled as zero - in this case we could adjust the result (we'll need a NEWS entry in any case).
What do you think?
#ifndef COB_EBCDIC_MACHINE | ||
*s2++ = (d | 0x30); | ||
#else | ||
*s2++ = (d | 0xF0); | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not using COB_I2D here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, COB_I2D
should be used.
@@ -309,6 +309,7 @@ cob_move_alphanum_to_display (cob_field *f1, cob_field *f2) | |||
const unsigned char *e2 = s2 + COB_FIELD_SIZE (f2); | |||
const unsigned char dec_pt = COB_MODULE_PTR->decimal_point; | |||
const unsigned char num_sep = COB_MODULE_PTR->numeric_separator; | |||
unsigned char last; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we drop the "normalize bcd function" and get the sign "directly", then we don't need to "unpunch" anything and therefore don't need to store/reset that position, do we?
if (count++ > 0) { | ||
goto error; | ||
} | ||
} else if (!(isspace (*s1) || *s1 == num_sep)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we will never get into the isspace
case here, as that would be the integer 0 - so the code should be adjusted to either check for space first or drop the check completely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True indeed.
} else { | ||
for (p = s1; p < e1 && *p != dec_pt; ++p) { | ||
const char d = COB_D2I (*p); | ||
if (d >= 0 && d <= 9) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code also counts (not skip) spaces that way - does this match the expected result?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually what happens on GCOS is very different. I had not taken into account what happens when there is a decimal point... While the current GnuCOBOL only moves digits that are before the decimal point, GCOS tries to move and normalize all digits. If it encounters a decimal point, it raises an exception because the decimal point - no matter if it is a comma (0x6B) or a dot (0x4B) - does not normalize to a valid digit...
In fact, GnuCOBOL tries to be smart - skips leading spaces, interprets the sign and the decimal point, while GCOS (and others) more or less boldly convert whatever is there. But how much do we want to keep the original GnuCOBOL behavior ? If we do want to keep it (to not break existing programs), we might as well just have two different normalization functions.
@@ -325,21 +326,35 @@ cob_move_alphanum_to_display (cob_field *f1, cob_field *f2) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
directly above this code there is a skipping of leading spaces; if we always check the half-byte only, then this should be checked via COB_D2I
instead (otherwise it should be done that way depending on the dialect configuration) ... but somehow care would have to be taken to explicit match the leading +/- as character (I don't mind if this only happens after real space or also after zero; in which case we can handle both sign and leading space/zero in one loop; I also wouldn't mind iterating over everything until we don't find +/-/non-zero COB_D2I
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing is, something like ' -123'
(where the blank before 123 is a tab - code 0x05 in EBCDIC, and considering minus has code 0x60) normalizes as 50123
on GCOS (checked). The same with a space instead of the tab normalizes as 00123
. Depending on whether we use isspace
or COB_D2I
, and whether we consider the possibility of a leading sign, we'll get different results. I'm not sure what would be the best thing to do. Maybe we could have an option (not a dialect one) to specify which kind of sign to expect (leading separate, trailing embedded, none...) ?
This PR adds a new dialect option to allow on-the-fly "normalization" of BCD when moving from ALPHANUMERIC to NUMERIC. This is needed to properly mimic the GCOS COBOL behavior, for instance moving an ALPHANUMERIC containing "ABC456789}" to a NUMERIC yields "-1234567890" in this implementation (it tries to interpret the source as a PIC S9). See the new test at the end of
run_misc.at
.