Implement UAX31's quick check algorithm #16383

Status: Closed (1 commit)
14 changes: 14 additions & 0 deletions changelog/dmd.identifier-quick-check.dd
@@ -0,0 +1,14 @@
Identifiers are verifiable against normalization form C

It is now possible to verify identifiers against normalization form C, as per UAX31 and C23.

The quick check algorithm in use is an incomplete form that does not handle the "maybe" case and can therefore produce false positives when reporting errors.
The purpose of shipping the algorithm in this incomplete state is to minimize the amount of normalization work needed in a future update.

It is configurable for both D and ImportC using ``-normalization=<strategy>`` and ``-normalization-importc=<strategy>``, where the strategy is one of ``ignore``, ``deprecate``, ``warn``, and ``error``.

Unless you use non-ASCII characters, this change will not affect you.
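To illustrate the kind of mismatch normalization form C guards against, here is a minimal Python sketch (using the standard ``unicodedata`` module, not dmd's implementation): two spellings of an identifier that render identically but differ in code points until normalized.

```python
import unicodedata

# Two spellings of what renders as the identifier "café":
composed = "caf\u00e9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)                        # False: different code points
print(unicodedata.is_normalized("NFC", composed))    # True: already in form C
print(unicodedata.is_normalized("NFC", decomposed))  # False: would be flagged

# After NFC normalization the two spellings compare equal:
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

This is the equivalence C23 defines identifier comparison against; identifiers not in form C are the ones the new switches diagnose.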

Due to bug [24507](https://issues.dlang.org/show_bug.cgi?id=24507), if you use GCC as your preprocessor, any identifier that is not normalized to form C will by default produce a warning.
Member:

Can't we just support C23 and leave it at that?

Contributor Author:

We are; the problem is that GCC picks different behaviors on failure that we want to override.

Contributor Author:

As for the rest of the text: normalization form C is the big breaking change in this PR that people need to be aware of, and it is what C23 is defined against.

D.5 Equivalent Normalized Identifiers
1 UAX #31 requires that implementations describe how identifiers are compared and considered
equivalent.
2 C requires that identifiers be in Normalization Form C and therefore identifiers that compare the
same under NFC are equivalent. This is described in 6.4.2.

Member:

I'd be surprised if anyone even uses normalized identifiers. If we support various incompatible normalized-identifier schemes, the user will end up creating code that won't link with other ImportC code.

I suggest picking the C23 scheme and not doing any of these other bug accommodations.

Member:

Does gcc have some flag to pick different schemes?

Contributor Author:

Yes. They can pick between a few, but the important ones are C and KC.

I do not want to offer users any normalization form other than C.
D form is easily done once you've got C (it's a subset), so I'll write the wrapper function for it during the next PR, which adds the normalization function itself.
KC/KD, however, require extra tables; while I'm not against seeing them added, I don't see a reason for dmd to need them.

The compiler switches introduced here pick what to do when an identifier is not normalized.
Right now that is to accept it as-is (with varying levels of diagnostic severity); later it would be to normalize it.
Without this, I currently cannot verify the quick check algorithm as part of the test suite.

The goal of the switches is to give a roughly ±10-release window in which people can transition forward and backward as required.
That window can be changed, but I believe it may work well for getting one GDC release on either side of changing the defaults to C23.

We may not keep any of these switches long-term.
Before removing them, ImportC should be defined against C23 and we need to transition everyone over to C23.
Once that happens, picking tables becomes unnecessary and it can all be cleaned up.

Until then, switching 100% over right now would be a breaking change with no way to transition backwards, hence the need for the switches.
Unless, of course, you're telling me to do the hard break, at which point I'll do a different PR removing all that infrastructure.

Iain may have opinions on switches, as well as if we support NFKC. So we do need to hear from him.
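The per-strategy behavior described above (accept as-is, with varying diagnostic severity, until normalization lands) could be sketched roughly as follows. This is a hypothetical Python illustration, not dmd's code: the enum values mirror the proposed ``-normalization=<strategy>`` options, and a complete NFC check stands in for the quick check.

```python
import unicodedata
from enum import Enum

class NormalizationStrategy(Enum):
    # Names mirror the proposed -normalization=<strategy> values.
    IGNORE = "ignore"
    DEPRECATE = "deprecate"
    WARN = "warn"
    ERROR = "error"

def check_identifier(ident: str, strategy: NormalizationStrategy) -> list:
    """Return the diagnostics a front end might emit for `ident`.

    A real implementation would run the quick check (with false
    positives for the unhandled "maybe" case); here we stand in for
    it with Python's complete NFC check.
    """
    if unicodedata.is_normalized("NFC", ident):
        return []
    msg = "identifier is not in normalization form C"
    if strategy is NormalizationStrategy.IGNORE:
        return []                          # silently accept
    if strategy is NormalizationStrategy.DEPRECATE:
        return ["Deprecation: " + msg]     # accept with deprecation
    if strategy is NormalizationStrategy.WARN:
        return ["Warning: " + msg]         # accept with warning
    return ["Error: " + msg]               # reject

print(check_identifier("cafe\u0301", NormalizationStrategy.WARN))
```

In every strategy except ``error`` the identifier is still accepted unchanged; only the diagnostic differs, which is what makes the transition window possible.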

Use the GCC option ``-Wnormalized=none`` to disable this [warning](https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wnormalized_003d), and the D compiler will take over.
18 changes: 18 additions & 0 deletions compiler/src/dmd/cli.d
@@ -639,6 +639,24 @@ dmd -cov -unittest myprog.d
`Turns off all array bounds checking, even for safe functions. $(RED Deprecated
(use $(TT $(SWLINK -boundscheck)=off) instead).)`,
),
Option("normalization=<strategy>",
Member:

Why do we need all these strategies? Doesn't C23 specify behavior?

Contributor Author:

C23 does not specify this, and neither does UAX31; as long as we handle it in some stated (or user-pickable) way, we are good.

We don't need the error case, as that is already handled by warnings-as-errors. We also don't need to expose deprecation externally.

The next PR would add normalization, which would then get exposed (currently it is internal only) and later made the default.

Member:

Pick the simplest way and leave it at that. Why should we do normalization at all? I'm beginning to like Nim's strategy of treating every non-ASCII code unit as an identifier character.

Contributor Author:

If we don't handle normalization in some form, a D identifier may look the same as a C identifier but won't actually match it.

Erroring out basically requires a subset of the normalization process, except less well documented. I haven't implemented the full algorithm, only enough to decrease the cost of normalization to "don't worry about it" levels.

Performing normalization autocorrects the identifier, and the pseudocode in idPoolNormalize means you only have to do it once per string representation.

Note: merely having a larger table doesn't handle normalization; as long as we accept Unicode, this will always be a potential issue for someone.
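The once-per-string-representation idea behind the idPoolNormalize pseudocode can be sketched as a small interning pool. This Python sketch is illustrative only; the class and field names are invented here, not dmd's actual data structures.

```python
import unicodedata

class IdentifierPool:
    """Interning pool that normalizes each distinct spelling at most once."""

    def __init__(self):
        self._interned = {}       # raw spelling -> canonical (NFC) string
        self.normalizations = 0   # counts how often we actually normalized

    def intern(self, raw):
        canonical = self._interned.get(raw)
        if canonical is None:
            # First time we see this exact byte sequence: normalize and cache.
            self.normalizations += 1
            canonical = unicodedata.normalize("NFC", raw)
            self._interned[raw] = canonical
        return canonical

pool = IdentifierPool()
a = pool.intern("cafe\u0301")  # normalized on first sight
b = pool.intern("cafe\u0301")  # cache hit: no second normalization
print(a == b == "caf\u00e9", pool.normalizations)
```

Every later occurrence of the same spelling is a dictionary hit, so the normalization cost is paid once per distinct representation rather than once per token.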

Contributor Author:

I'm looking at Nim's documentation.

Case-insensitive comparison of identifiers: https://nim-lang.org/docs/manual.html#lexical-analysis-identifier-equality
So that would be a "fun" idea to propose.

Anyway, Nim transpiles to C or C++. We don't.

All their identifiers get verified by the C/C++ compiler, which means that in practice they are already defined against UAX31 with normalization (erroring), even if their language manual doesn't say so.

We would not be copying them if we changed our tables; they still have a lot more logic going on than we have.

Member:

I suggest we don't do normalization. If a user needs it, he can run a filter over his source files.

Contributor Author:

I cannot recommend this.

You have to implement a check for normalization, and it has to be a complete isNFC check to be C23 compliant, unless you want code that is required to compile to fail to compile.

From C23:

C requires that identifiers be in Normalization Form C and therefore identifiers that compare the
same under NFC are equivalent. This is described in 6.4.2.

Detecting whether a string is in form C requires that you essentially perform normalization, except you do not store the result: https://unicode.org/reports/tr15/#Detecting_Normalization_Forms

That includes having all the same tables you would have had otherwise. Not a great tradeoff, especially when you consider that code formatters should be doing this only on identifiers anyway.
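The point that detection essentially performs normalization can be seen directly: without the quick-check property tables, the naive way to test for NFC is to normalize and compare, which needs the full decomposition and composition tables. A Python sketch (UAX #15 describes the table-driven shortcut that avoids building the normalized string):

```python
import unicodedata

def is_nfc_by_normalizing(s):
    # The naive detector: normalize and compare. This needs the full
    # normalization tables, which is exactly the tradeoff described above.
    return s == unicodedata.normalize("NFC", s)

# The stdlib's dedicated check agrees, but can often bail out early
# using the NFC_Quick_Check property instead of normalizing.
for s in ("caf\u00e9", "cafe\u0301", "ascii_only"):
    assert is_nfc_by_normalizing(s) == unicodedata.is_normalized("NFC", s)
```

So an "error on unnormalized identifiers" mode carries nearly the full cost of normalizing while throwing away the corrected result.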

"Specify what to do when an identifier is not normalized",
`The strategy that will be used when an identifier may not be normalized (quick check algorithm).
The quick check algorithm will give off false positives as it does not handle the maybe case.
$(UL
$(LI $(I ignore): Silently accept any failing identifiers (default))
$(LI $(I warn): Emit a warning but accept)
)`
),
Option("normalization-importc=<strategy>",
"Specify what to do when an identifier is not normalized",
`The strategy that will be used with ImportC when an identifier may not be normalized (quick check algorithm).
The quick check algorithm will give off false positives as it does not handle the maybe case.
$(UL
$(LI $(I ignore): Silently accept any failing identifiers (default))
$(LI $(I warn): Emit a warning but accept)
)`
),
Option("nothrow",
"assume no Exceptions will be thrown",
`Turns off generation of exception stack unwinding code, enables