-
-
Notifications
You must be signed in to change notification settings - Fork 606
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement UAX31's quick check algorithm #16383
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
Identifiers are verifiable against normalization form C | ||
|
||
It is now possible to verify identifiers against normalization form C, as per UAX31 and C23. | ||
|
||
The quick check algorithm that is in use is an incomplete form that does not handle maybes and can result in false positives in errors. | ||
The purpose of this algorithm in its incomplete state is to minimize normalization in a future update. | ||
|
||
It is configurable for both D and ImportC using ``-normalization=<strategy>`` and ``-normalization-importc=<strategy>``. | ||
Using the options ``ignore``, ``deprecate``, ``warn``, and ``error``. | ||
|
||
Unless you use non-ASCII characters, this change will not affect you. | ||
|
||
Due to bug [24507](https://issues.dlang.org/show_bug.cgi?id=24507) if you use GCC as your preprocessor with any identifier that is not normalized to form C, by default it will produce a warning. | ||
Use the GCC option ``-Wnormalized=none`` to disable this [warning](https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wnormalized_003d), and the D compiler will take over. |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -639,6 +639,24 @@ dmd -cov -unittest myprog.d | |
`Turns off all array bounds checking, even for safe functions. $(RED Deprecated | ||
(use $(TT $(SWLINK -boundscheck)=off) instead).)`, | ||
), | ||
Option("normalization=<strategy>", | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why do we need all these strategies? Doesn't C23 specify behavior? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. C23 does not specify. Neither does UAX31, as long as we handle it in some stated way (or pickable) we are good. We don't need the error case as that is already handled with warning as error. We also don't need to expose deprecation externally. Next PR would be normalization that would get exposed (currently internal only) and later made the default. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Pick the simplest way and leave it at that. Why should we do normalization at all? I'm beginning to like Nim's strategy of every non-ASCII code unit is an identifier character. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we don't handle normalization in some form a D identifier may look the same, but won't actually match to a C identifier. Erroring out basically requires a subset of the normalization process except less well-documented. I haven't implemented the full algorithm, only enough to decrease the cost of normalization to "don't worry about it" levels. Performing normalization autocorrects, and the pseudo code in Note: just having a larger table doesn't handle normalization, as long as we accept Unicode it will always be able to be an issue for someone. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm looking at Nim's documentation. Case insensitive comparisons of identifiers: https://nim-lang.org/docs/manual.html#lexical-analysis-identifier-equality Anyway, Nim transcompiles to C or C++. We don't. All their identifiers get verified by the C/C++ compiler, which also means in practice they are defined against UAX31 already with normalization (erroring). Even if their language manual doesn't. We would not be copying them if we changed our tables. They still have a lot more logic going on than we have had. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I suggest we don't do normalization. If a user needs it, he can run a filter over his source files. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I cannot recommend this. You have to implement a check for normalization, and it has to be a complete From C23:
To detect if you are in form C requires that you basically perform normalization, except you do not store the result. https://unicode.org/reports/tr15/#Detecting_Normalization_Forms That includes having all the same tables as you would've had otherwise. Not a great tradeoff. Especially when you consider code formatters should be doing it only on identifiers anyway. |
||
"Specify what to do when an identifier is not normalized", | ||
`The strategy that will be used when an identifier may not be normalized (quick check algorithm). | ||
The quick check algorithm will give off false positives as it does not handle the maybe case. | ||
$(UL | ||
$(LI $(I ignore): Silently accept any failing identifiers (default)) | ||
$(LI $(I warn): Emit a warning but accept) | ||
)` | ||
), | ||
Option("normalization-importc=<strategy>", | ||
"Specify what to do when an identifier is not normalized", | ||
`The strategy that will be used with ImportC when an identifier may not be normalized (quick check algorithm). | ||
The quick check algorithm will give off false positives as it does not handle the maybe case. | ||
$(UL | ||
$(LI $(I ignore): Silently accept any failing identifiers (default)) | ||
$(LI $(I warn): Emit a warning but accept) | ||
)` | ||
), | ||
Option("nothrow", | ||
"assume no Exceptions will be thrown", | ||
`Turns off generation of exception stack unwinding code, enables | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we just support C23 and leave it at that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are, the problem is GCC is picking different behaviors on failure that we want to override.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for the rest of the text, normalization form C is the big breaking change that people need to be aware of in this PR.
Which is what C23 is defined against.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be surprised if anyone even uses normalized identifiers. If we support various incompatible normalized identifier schemes, then the user will be creating code that won't link with other ImportC code.
I suggest pick the C23 scheme and not do any of these other bug accommodations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does gcc have some flag to pick different schemes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. They can pick between a few, but the important ones are C and KC.
I do not want to offer any other normalization forms than C to a user.
D form is easily done once you've got C (it's a subset) so I'll write the wrapper function for it during the next PR that adds the normalization function itself.
But the KC/KD require extra tables, while I'm not against seeing them added I don't see a reason for dmd to need it.
The compiler switches introduced here are to pick what to do when it is not normalized.
Right now that is to accept it as-is (with varying levels of errorness) and later that would be to normalize it.
Without this, I cannot verify the quick check algorithm as part of the test suite currently.
The goal of the switches is to give a ± 10 release window when people can transition forward and backward as required.
That window can be changed, but I believe that may work well so as to try and get one GDC release on either side of changing defaults to C23.
We may not keep any of these switches long-term.
Before removing them happens ImportC should be defined against C23 and we need to transition everyone over to C23.
Once that happens picking tables becomes unnecessary and can all be cleaned up.
But until then to switch 100% over right now would be a breaking change without a transition backwards hence the need for switches.
Unless of course, you're telling me to do the hard break, at which point I'll do a different PR removing all that infrastructure.
Iain may have opinions on switches, as well as if we support NFKC. So we do need to hear from him.