khwilliamson
Contributor

This code can't work properly:

if (UTF ? isIDFIRST_utf8((U8*)s+1) : isWORDCHAR_A(s[1]))

Suppose you have a string composed entirely of ASCII characters beginning with a digit. If the string isn't encoded in UTF-8, the condition is true, but it is false if the string happens to have the UTF-8 flag set for whatever reason. One such reason is simply that the Perl program is being compiled under 'use utf8'.

The UTF-8 flag should not change the behavior of ASCII strings.
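
To make the asymmetry concrete, here is a minimal C sketch; it is not the code in toke.c. The helpers `wordchar_a` and `idfirst_ascii` are hypothetical ASCII-only stand-ins for isWORDCHAR_A and isIDFIRST_utf8, the `utf8` parameter stands in for the UTF macro, and `old_condition` is a name made up for this illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-in for isWORDCHAR_A: matches [A-Za-z0-9_]. */
static bool wordchar_a(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
        || (c >= '0' && c <= '9') || c == '_';
}

/* Hypothetical stand-in for isIDFIRST_utf8, restricted to ASCII input:
 * matches [A-Za-z_], so digits are excluded. */
static bool idfirst_ascii(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
}

/* Models the condition quoted above. `s` points at the sigil, so s[1]
 * is the first character of the candidate name. */
static bool old_condition(const char *s, bool utf8)
{
    unsigned char next = (unsigned char)s[1];
    return utf8 ? idfirst_ascii(next) : wordchar_a(next);
}
```

With these stand-ins, `old_condition("@1", false)` is true while `old_condition("@1", true)` is false, even though the bytes examined are identical.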

The code was introduced in 9d58dbc in 2015, to fix [perl #123963] "@<fullwidth digit>". The line it replaced was

if (isWORDCHAR_lazy_if(s+1,UTF))

(The code was modified in 2016 by fac0f7a, as part of a global substitution to use isIDFIRST_utf8_safe() so that there is no possibility of reading off the end of the buffer, but that did not affect the logic.)

The problem the original commit was trying to solve was that fullwidth digits (U+FF10 etc.) were accepted when they shouldn't be, whereas [0-9] should have remained accepted. The defect is that [0-9] stopped being accepted when the UTF-8 flag is on. The solution adopted here is to change the condition to

if (isDIGIT_A(s[1]) || isIDFIRST_lazy_if_safe(s+1, send, UTF))

This causes [0-9] to remain accepted regardless of the UTF-8 flag. So when the flag is on, the only difference in behavior before and after this commit is that [0-9] are now accepted.

In the ASCII range, the only difference between \w and IDFirst is that the former includes the digits 0-9, so when the UTF-8 flag is off this evaluates to the same thing as isWORDCHAR_A, as before.
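
Both claims about the ASCII range can be checked exhaustively. The following is a sketch under stated assumptions, not the real macros: `digit_a`, `idfirst_ascii`, and `wordchar_a` are hypothetical ASCII-only stand-ins for isDIGIT_A, the isIDFIRST family, and isWORDCHAR_A, and the sketch assumes that for ASCII input the IDFirst test gives the same answer whether or not the UTF-8 flag is set.

```c
#include <stdbool.h>

/* Hypothetical ASCII-only stand-ins for the handy.h macros. */
static bool digit_a(unsigned char c)        /* models isDIGIT_A */
{
    return c >= '0' && c <= '9';
}

static bool idfirst_ascii(unsigned char c)  /* models isIDFIRST for ASCII */
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
}

static bool wordchar_a(unsigned char c)     /* models isWORDCHAR_A */
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
        || (c >= '0' && c <= '9') || c == '_';
}

/* Condition before this commit, applied to the character after the sigil. */
static bool old_condition(unsigned char c, bool utf8)
{
    return utf8 ? idfirst_ascii(c) : wordchar_a(c);
}

/* Condition after this commit. The flag is unused here only because the
 * sketch is limited to ASCII (see the assumption stated above). */
static bool new_condition(unsigned char c, bool utf8)
{
    (void)utf8;
    return digit_a(c) || idfirst_ascii(c);
}

/* Counts ASCII characters where, with the flag off, the new condition
 * disagrees with the isWORDCHAR_A stand-in. */
static int mismatches_flag_off(void)
{
    int count = 0;
    for (int c = 0; c < 128; c++)
        if (new_condition((unsigned char)c, false) != wordchar_a((unsigned char)c))
            count++;
    return count;
}

/* True if, with the flag on, old and new conditions differ on exactly
 * the digits 0-9 and nowhere else in ASCII. */
static bool flag_on_diff_is_digits_only(void)
{
    for (int c = 0; c < 128; c++) {
        bool differs = old_condition((unsigned char)c, true)
                    != new_condition((unsigned char)c, true);
        if (differs != digit_a((unsigned char)c))
            return false;
    }
    return true;
}
```

Under these assumptions, `mismatches_flag_off()` returns 0 and `flag_on_diff_is_digits_only()` returns true.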

(Changing to isIDFIRST from isWORDCHAR in the original commit did solve a bunch of other cases where a \w is not supposed to be the first character in a name. There are about 4K such characters currently in Unicode.)

  • This set of changes does not require a perldelta entry. I don't think we should draw attention to the possibility of having identifiers whose names begin with digits

@tonycoz
Contributor

tonycoz commented Oct 12, 2025

  • I don't think we should draw attention to the possibility of having identifiers whose names begin with digits

Identifiers whose names start with digits are really common in Perl, if not with @. I don't see a reason to hide a fix.

Perhaps:

Fixed parsing of array names starting with a digit in double-quotish context under C<use utf8;>.

@khwilliamson
Contributor Author

I misspoke. Certainly all-numeric scalars are very common. But other experienced porters were surprised that @12345 is a legal array name. I couldn't find explicit mention of it in our docs.

@tonycoz
Contributor

tonycoz commented Oct 13, 2025

I could see it going the other way: someone has used "@1" in their code, it outputs a literal "@1", and it's in production. They upgrade and wonder why it's now outputting nothing.

@khwilliamson khwilliamson force-pushed the toke_bad_word_vs_idstart branch from d359590 to 6667ae1 Compare October 17, 2025 15:30
@khwilliamson
Contributor Author

I added a delta entry
