-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: turn macro output into untyped AST #508
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted some comments for the sanitizer core branches.
Thanks again, this is awesome, it inspired me to write some thoughts down, I'll share them shortly via discussions.
# can't be expressed in untyped AST. Also only present in post-transform | ||
# AST | ||
# XXX: we could throw away the 'env' part and only use the 'prc' -- I'm | ||
# not sure if that's a good idea however | ||
invalid() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this PR, I think dropping the env var might not be the way to go, maybe we want to go with:
- write out locally the:
- env type
- procdef
- call it to simulate it
my thinking
There are two paths here when it comes to defining what a "raisings" means:
- go back in time
- use "reverse" transforms that emit roughly equivalent untyped code
I think 1 is nice where we can reasonably rewind thing, but I don't believe this is a property we can preserve unless we keep a) a full history of transforms, b) forward transforms need to preserve things, c) that last point discounts fusing, and d) I don't think the user always wants a time machine.
I believe the correct answer is phase dependent, so from post-transf
is likely too big a transform (hlo
has run) and history rewind is likely what people do not want. We can't handle this sort of decisioning right now. If it's early sem (macro/template expansion), back to untyped is better as a history rewind.
We don't track enough data to do this, hence my suggestion above for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that there is no equivalent for nkClosure
in the untyped AST. A nkClosure
is essentially a nkTupleConstr
that always has two elements and is always of tyProc
type. Consider the following code:
proc aux(x: proc(p: int)) = discard
proc prc() =
var x = 0
var y: int
proc inner(p: int) =
x = 1
y = p
inner(1) # 1
aux(inner) # 2
let cl = inner # 3
cl(1)
After transf
the body of prc
will look like this:
var :env
internalNew(:env)
var :env.x1 = 0
var :env.y2: int
proc inner(p: int; :envP) = ...
(inner, :env)(1) # 1
# ^ `nkClosure`
aux((inner, :env)) # 2
# ^ `nkClosure`
let cl = (inner, :env) # 3
# ^ `nkClosure`
cl(1)
For 1 and 3 it's possible to turn the nkClosure
back into untyped AST (knowledge of sibling nodes is required, however), bit for 2 it's not.
Right now, what a node unwinds to depends only on the node kind plus the context it is used in. Supporting unwinding the above transformed code would mean that what a node translates to is also dependent on the sub-nodes (and sometimes even sibling nodes).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My terrible idea here was that we wouldn't do that much recovery. Instead we'd emit the creation of the env etc... but that runs into all sorts of issues, with internalNew
and a bunch more stuff leaking out.
Keeping it invalid for now makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed explanation with the before and after transformation.
nkRecCase, | ||
nkRecWhen: | ||
# context dependent | ||
invalid() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the when
is eliminated then no worries, but if not, wouldn't we continue sanitizing recursively?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sanitizer logic for everything type definition related is not yet implemented, but once it is, both the condition and body AST of the nkRecWhen
will also be considered.
of nkStmtList: | ||
result = stmts(c, n) | ||
of nkImportStmt, nkImportExceptStmt, nkExportStmt, nkExportExceptStmt, nkFromStmt, nkIncludeStmt: | ||
# QUESTION: should we perform any additional validation here or leave that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
most of the validation should remain in sem, doing some basics here for sanity is fine, but I wouldn't "hurt" ourselves to do it here.
of nkForStmt: | ||
checkMinSonsLen(n, 3) | ||
result = prepareFrom(n) | ||
# note: this allows for code that the parser doesn't |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, the syntax ast and the untyped ast spec diverging is concerning. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With how it's currently implemented, the sanitizer allows the following code:
for (a, b), (c, d), (e, f) in items: ...
while the parser doesn't. For what it's worth, with the recent semForVar
refactoring, this would also be correctly analysed by sem.
checkSonsLen(n, 1) | ||
result = expr(c, n[0]) | ||
of nkObjDownConv, nkObjUpConv: | ||
# XXX: these don't exist in typed AST prior to ``transf``. How should they |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The types might change so the conversions should be deduced again, IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I'm going to check how well the following unwinding works then:
ObjDownConv (x) -> Call (type-of-destination) (x)
of nkElifExpr, nkElifBranch: | ||
parseBranch(c, it, n.kind == nkIfExpr) | ||
of nkElseExpr, nkElse: | ||
# XXX: hm, also make sure that the 'else' is the last item? Or leave |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think this is the type of validation that we likely want to do here.
Although, this level of validation is likely at the edge of complexity we want to entertain, however.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parser also only allows nkElse
/nkElseExpr
as the last item, so it makes additional sense to enforce that here too.
result = newTreeI(nkInfix, n.info, ident(c, ".."), expr(c, n[0]), expr(c, n[1])) | ||
of nkDotExpr: | ||
result = parseDotExpr(c, n) | ||
of nkCheckedFieldExpr: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be deduced again by sem, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. The comment is a remnant of the time were it wasn't fully clear yet what the sanitizer should actually do, i.e. whether it should perform unwinding or translate into untyped AST that is roughly semantically equivalent.
of nkNilLit: | ||
result = newNodeI(nkNilLit, n.info) | ||
of nkDotCall: | ||
# XXX: hm, turn into an ``nkDotExpr`` + ``nkCall`` instead? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that would be a more true unwind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I hesitated here is because nkDotCall
can appear in neither typed nor untyped AST a macro gets from sem. Looking at nkDotCall
a bit closer, I think the more proper unwind would be also be only nkDotExpr
. That is:
DotCall (x, y) -> DotExpr (y, x)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, that's indeed a truer reversal based on what dotTransformation
does.
if it.kind == nkExprColonExpr: parseColonExpr(c, it) | ||
else: expr(c, it) | ||
|
||
of nkError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"technically" this could be:
- if user error, return to
error
pragma - otherwise, extract the untyped
wrongNode
as that should always be a straight swap
If we enable error
pragma blocks, then we might have an easier time:
{.error.}:
this is not valid syntax without commas
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it.
Reports is a blight, there I've made my position absolute clear. Right now this PR is stuck, because said trash has gummed up the works with Concepts and explain reporting, and the overly far reaching structured report hook. I have no idea why help messages and general CLI output really need to go through this overwrought machinery. So, that means no further major changes that make this PR hard/impossible to merge because they poke at too much of the existing "reports" shite in the code base. Sorry. I hate type this out as much as you do reading it, I'm really not happy. The problem is that this blight is literally everywhere and has completely locked everything together. I'm not a fan of doing of pulling the andon cord, but at present I'm not seeing any non-big bang changes to unblock things. In my latest commit I've written a bit of a plan for how to push through, along with starting down that road. The options for what to do next are as follows:
I'll cover the second option first as it's fairly straightforward: there are two or three concepts and The first one means:
Next steps, ideally we can squeeze some time to discuss tomorrow which strategy to take. I'm good to unblock merges as long as there is a path forward. So just need agreement on which option to take, 1, 2 or mystery option 3 which I'm hoping someone else can suggest. If we can't discuss it tomorrow, I'm likely to go with Option 2 as I don't like blocking for any lengthy period of time. @zerbina this is the message I alluded to in the other PR. |
You lose nothing by disabling the concepts stuff. Why not take that route in the service of progress; you can still take the same time to re-enable it later. Does it really matter how many PRs this work goes into or how long it takes? The main thing is that you're improving the code for everyone. Good job. |
Apologies, I posted the above, comment in the wrong PR -- 1AM tiredness. Should have posted it here. Nevertheless, thank you all very much for the encouragement and support. Correct PR: #505 |
# IDEA: the new node kind ``nkSymSuggestion`` (or similar) could be | ||
# introduced. It communicates to sem that the symbol must be checked | ||
# for reachability (i.e. that the defining scope is reachable from the | ||
# usage site) first. This would allow to move the scope check out of the | ||
# sanitizer, in which case it no longer needs access to an ``PContext`` | ||
# instance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check my Understanding of the Problem
If I understand things correctly, what a macro can do to break visibility is around shadowed or overloaded symbols, some examples:
# example 1
let foo = 10
block:
let foo = 11
echo foo # <-- a macro could choose to make this the outer foo
In the above a macro wants to emit an ast thingy, that is some path from the position of foo
in echo foo
to it's destination (relative to it) and/or some way to define the path from a top level/containing scope module/proc body, etc.
# example 2
proc foo(a: int) = echo "singular"
proc foo(aaaaaa: varargs[int]) = echo "plural"
foo 1 # <-- a macro could make this dispatch on the varargs and get plural
In the above a macro wants to emit an ast thingy, that is some way to unambiguously refer to a symbol that may or may not be overloaded.
Solution Idea(s)
A solution for the first example might look like an nkIdentQuery "start" "path" "parts" "to" "navigate"
. This might turn into something very akin to jq
, xpath
, etc... but I don't mind. To be useful the macro API likely needs a facility to get paths, eg:
proc getIdentPath(tree, child: NimNode): NimNode{nkIdentPath}
, andproc getIdentPathFrom(tree, start, dest: NimNode): NimNode{nkIdentPath}
In both those cases all nodes must be contained within tree
.
A few "special" paths can be considered if we find them useful, like "{module}"
(I just randomly chose {}
to mean interpolate) for the containing module. Or {proc}
for the containing proc, etc. Bonus points if we can use labels.
A solution for the the second example might be always give stable identifiers for routines that are derivable, even for a human running the algorithm by hand in at least simple cases, and then a macro author can refer to it either by formulating the identifier manually or querying for it. So perhaps it could be an nkIdentQuery <to be determined>
as well, but it'll need a bit more of a think on the algorithm, it's basically name mangling. 😆
Aside, Kotlin has a facility like this, I didn't bring it up until now because I don't want to get caught up in syntax, it might not cover all cases, but does a fair few:
- https://kotlinlang.org/spec/expressions.html#this-expressions (even though it's
this
I'm just focusing on the identifier resolution) - https://kotlinlang.org/docs/returns.html (note the 3rd example under
Return to labels
where things get implicit labels) - erlang is untyped, so they get away with disambiguating functions based on arity, so their imports look like
-import(somemodule, [foo/1, foo/2]).
; that's not specific enough for us I'm afraid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is my understanding correct in that the examples refer to AST that the macro would generate and that the problem lies in how to allow the macro to refer to the mentioned entities (which are not the ones sem would normally choose)? If yes, then the I think the current way would be genSym
, which has the problem of producing a symbol instead of an unresolved identifier. genSym
'ed symbols are a blocker for the sanitizer right now.
My original idea was to make genSym
produce a unique identifier instead of an sfGenSym
symbol, but maybe that's a solution orthogonal to nkIdentQuery
.
If you're referring to the IDEA
comment: that was me thinking about ways to get around the PContext
dependency. Apologies if I'm completely misunderstanding what you mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is my understanding correct in that the examples refer to AST that the macro would generate and that the problem lies in how to allow the macro to refer to the mentioned entities (which are not the ones sem would normally choose)? If yes, then the I think the current way would be
genSym
, which has the problem of producing a symbol instead of an unresolved identifier.genSym
'ed symbols are a blocker for the sanitizer right now.
Not just AST a macro generates, but I think example 1 is better solved by not having the issue in the first place. In other words, genSym
is bad, but generating a unique identifier is what we want. At which point, we only get generated symbols for things the compiler has already figured out.
For example 2, that's a user (import, call site, etc) or macro trying to unambiguously refer to the symbol. The unique identifier doesn't solve that.
My original idea was to make
genSym
produce a unique identifier instead of ansfGenSym
symbol, but maybe that's a solution orthogonal tonkIdentQuery
.
After the unique identifier part which solves example 1 well, outside of letting the user disambiguate, but let's not bother yet. I think we're down to example 2, where query or "unambiguous fully qualified path" is required.
If you're referring to the
IDEA
comment: that was me thinking about ways to get around thePContext
dependency. Apologies if I'm completely misunderstanding what you mean.
It's related, but I think that's an extra bonus from solving other issues.
In order to be able to eventually remove `considerQuotedIdent`, the sanitizer does a bit more than just sanitize and evaluates identifier construction expressions.
In addition, explicitly allow the type slot to be empty (`parseTypeExpr` is going to reject empty nodes).
In addition, `process` now dispatches to `parsePragmaList` for processing `nkPragma`.
* basic processing for formal and generic parameter lists * lambda-like expressions work
The implementation currently supports: * records (list, case, when) * enums * tuples * type-classes * var, ptr, ref, distinct, proc
While not entirely correct (yet), doing so is better than calling `process`.
The result is now properly initialized and `parsePragmaList` is used instead of the outdated `parsePragma`.
A typed `nkEmpty` was previously put into the name slot for `nkObjConstr` nodes. While `semTypeNode` allows this (it likely shouldn't), the sanitizer does not.
I don't think I'll continue with this, as I no longer think it's the right direction. Using a dedicated pass to make sure the AST adheres to the grammar makes sense, as it allows |
This PR implements a transformation pass that turns untrusted AST as output from a macro into untyped AST. In this context, "untyped" currently means that the resulting AST:
nkSym
,nkOpenSymChoice
, andnkClosedSymChoice
contains only nodes that are output by the parserThe translation tries to approximate constructs present in typed AST with their untyped AST counterpart where possible. For example a
Conv (Type "int") (Sym "x")
is turned intoCall (Sym "int") (Sym "x")
.For each input symbol node, it is validated that the symbol is in scope (reachable from the current scope). If it is, the symbol is also used in the output AST -- it's turned into an identifier node otherwise, so that semantic analysis can decide what to do with it.
The overall goal is to move syntax and grammar checking out of the semantic analysis procedures by preventing ill-formed AST from ever reaching them. This should simplify sem a bit, as it can now focus on semantic analysis and no longer has to also validate the structure of the AST (separation of concerns).
Illformed AST
error can now also be encoded asnkError
nodes and thecheckSonsLen
procedure becomes obsolete.Removing the syntax/grammar checks from sem is not done as part of this PR and should happen as a follow up.
Details
For the symbol reachability validation, it is first checked if the symbol is present in the current scope (i.e. the one where the macro is expanded) or one of the scopes enclosing it. Then, it is tested whether the symbol is part of the top-level scope of the module it belongs to. If that's also not the case, the symbol is treated as not reachable. Since instantiated generics are not added to the symbol table, their symbols are never treated as reachable, and thus turned into identifier by the sanitizer.
As part of fixing the issues that are blocking the MIR, I've looked a bit more into
sem
recently, and one thing that I identified as adding a significant amount of complexity is that due to macros,sem
cannot trust the AST it operates on. This inspired me to look into solutions, which lead me to experimenting with this sanitizer layer.Notes for Reviewers
gensym
handling is not figured out yet