-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Editorial: Use the dfs stack instead of [[DFSAncestorIndex]] #3637
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
First some background - strongly connected components affect execution in the module system in only subtle ways:
That's it! Otherwise strongly connected components are indeed relatively unobservable to end users. But correctly classifying strongly connected components is a fundamental invariant. Now for (1) we don't formally propagate errors through the entire strongly connected component, we actually only propagate errors up the current stack of a strongly connected component. Your simplification based on this observation does seem to be unobservable so far as (1) is concerned. But, as mentioned when we discused this, your change very much does break the Tarjan algorithm's detection in being able to identify the strongly connected components correctly per Tarjan's algorithm. Simple counter example where A [0, 0] Under your scheme you would not correctly classify the two separate strongly connected components { B, C } and { D, E }, and instead classify both B and D as having the same ancestor index 1 since you are incorrectly conflating the component root index with the stack index. That is you would transition all 4 modules together instead of handling them separately. This would also incorrectly set the cycle root in the evaluation code. The comparison between sccAncestorIndex and moduleIndex is also invalid as the stack indexing diverges from the module indexing for large parallel graphs. As a side note - the onus is on you here to prove your modification or Tarjan's algorithm handles the edge cases, not on me to point out the counter examples. |
I understand, that's why I split the change in small commits each of them trying to carefully explain why the change is correct 😅 Assuming that the example you mean is the following, which yields those [dfs index, dfs ancestor index] pairs and those two SCCs: ![]() This is what happens with the updated algorithm:
The updated algorithm correctly detects the three SCCs: BC, DE, and A. Note that, both before and after this PR, the SCC detection rule is not "a set of module with the same associated number is a SCC". Already before these changes, multiple modules in the same SCC can have different [[DFSAncestorIndex]] values. For example (interactive):
Even though D is in the same SCC as B/C/E. Instead, the rule to tell that a module X is not a SCC root is that from X's outgoing edges you can reach a node with index lower than X. Then, if a module is a SCC root (so if the lowest index you can reach from X is X's index), the contents of the SCC is the list of modules in the stack starting from X.
Yes, and that's fine. When you have large parallel graphs (that are not in the same SCC), one of them is going to be processed before the other. What happens in the first of them has no effect at all on what SCCs there are in the second one. We can forget that we actually already used some indexes for that first completed graph part, and we can reuse them. Edit: I'll publish a forked version of https://nicolo-ribaudo.github.io/es-module-evaluation with these changes. |
I have a more precise proof that using the stack depth is equivalent to using the discovery index. Consider a module graph that contains a module M, with the stack at the time M is first discovered (right after pushing M to it) being « s0, s1, ..., sn-1, M = sn ». For a given module X, the discovery index of X is D(X) and the depth of the stack right before pushing it is S(X). Next(X) is the module that is discovered right after X (that is, D(Next(X)) = D(X)+1). [*] There are two possible cases. One is the trivial one, when we are going down the left-most branch of the DFS tree, so D(s0) = 0, D(s1) = 1, ..., D(sn) = n. In this case obviously D(si)=S(si) for each i. If we are not in that situation, then there is a smallest j such that D(sj) ≠ S(sj) (that is, D(sj) ≠ j). j > 0, because s0 is the root module of the graph, which is pushed in the stack first and removed at the end of the whole process. Thus we have D(s0)=S(s0)=0, D(s1)=S(s1)=1, ..., D(sj-1)=S(sj-1)=j-1. Now, we know that sj ≠ Next(sj-1) (otherwise D(sj)=D(sj-1)+1=(j-1)+1=j, which violates the condition on which we chose j). We also know that sj is not reachable from Next(sj-1), otherwise Next(sj-1) would be on the stack between sj-1 and sj (which is impossible, since they are by definition consecutive in the stack). Let { O1...Ok } be the set of all k modules reachable from Next(sj-1), Next(sj-1) included, ordered such that D(O1) < ... < D(Ok).
O1...Ok are all discovered before sj, so
Given [1] and [2], Evaluation(s0) is equivalent to first doing Evaluation(O1) (which evaluates all and only the modules { O1...Ok }) followed then by Evaluation(s0). This has no effect on evaluation order, SCC detection, or on the modules in the stack when discovering a module not in the { O1...Ok } set. The effect of Evaluation(O1) is that when we will then do Evaluation(s0) all the modules { O1...Ok } will have already been evaluated, thus we skip them in the DFS tree. Let's call D'(...) the discovery indexes of this Evaluation(s0) that happens after Evaluation(O1). We have that for each module P:
Thus we have that:
Now, we repeat from [*]:
|
Took a brief look at this today, didn't have time for a full review again but some initial thoughts:
This isn't true - in the simplest case if sj has a dependency on sj-1, then it cyclically can reach back to Next(sj-1). Unless you have another definition of reachability in mind? What would really help me is to see how your algorithm can classify strongly connected components such that it fully detects them all despite not tracking dfs ancestor index. It might also help here to implement the algorithm and have it spit out the list of strongly connected components and we could then fuzz on that various edge cases. Specifically the problem is not the execution order - it is correctly identifying the strongly connected components. It does seem to me likely that execution order is likely unaffected here, while I'm still pretty sure strongly connected component detection is. Otherwise happy to do it the slow way as well in carefully reviewing your approach, it just will take me more time to dig into this review thoroughly. |
Thinking about this some more... the important point is that per your first counter example case, because we skip over modules already linked, any reaching back to formerly visited graphs is always excluded from cycle analysis and correctly by strongly connected component isolation. That was the part I missed - that linked modules don't reflect their numberings anymore, since they are already discarded as we're done with their strongly connected component detection per your mathematical argument. Definitely important to review further carefully, but this may well actually be correct. I don't mean to sound so surprised, but it would be a very impressive simplification...! |
@guybedford Thanks for taking a look again :) I updated https://nicolo-ribaudo.github.io/es-module-evaluation to match this PR, and to list the SCCs as they are "completed". This is an example with some interesting cycles. That demo uses the The version in the
In that case we would have that Next(sj-1) is potentially reachable from sj, not that sj is potentially reachable from Next(sj-1): ![]()
![]() |
To clarify - I mean in the combined cycle cases where both cycles in your images appear. Another interesting cycle case is when successive direct dependencies each import a higher stack item. So say A -> B -> C -> D, and then D -> E1, E2, E3, E4, where E1 -> D, E2 -> C, E3 -> B. The stack just continues to append along the dependencies so they still get transitioned together so I think that proves out the last of it? |
Even in this case, when we visit sj we would still have Next(sj-1) on the stack.
|
To detect strongly connected components in module graphs, we need to use a number that increases as we go further away from the graph root on any branch of the DFS tree. This is so that when a node has a child that (1) is still not finalized yet, and (2) has a lower number than the node itself, we know that the child is an ancestor of the node and thus we are in a strongly connected component. Strongly connected components roots are thus nodes from which it is not possible to reach a non-finalized node with a lower number. For this Tarjan's algorithm uses the discovery index of each node, but other indexes that have the same property are: - the depth of the node in the DFS tree - the index of the node in the stack that contains yet-to-be-finalized nodes. This commit updates the SCC discovery logic to use this last option, rather than carrying around the discovery index.
…ower values Instead, only update it at the end of InnerModuleLinking if the current module is a non-root node of a SCC. We do not need to update its value in the loop through a module's dependencies because, even if there was a loop leading back to the node currently being processed, the [[DFSAncestorIndex]] value propagated back across the loop would not cause the module's [[DFSAncestorIndex]] to decrease.
Rather than returning ~unused~ and then reading it from the [[DFSAncestorIndex]] slot. Also return the index for modules that are not participating in cycles (returning the module index itself), to avoid having a separate path in the InnerModuleLinking loop.
The lowest possible [[DFSAncestorIndex]] will already be propagated back to the SCC root through _one_ of the SCC branches. InnerModuleLinking does not need modules to have their actually lowest possible [[DFSAncestorIndex]] set to know that they are a non-root node of a SCC: it's enough to know that from that node it's possible to reach _one_ other node with a lower index, and that's signaled by the return value.
[[DFSAncestorIndex]] index is only set once per module during a given traversal process, and it represents the module's index in _stack_. Instead explicitly storing it, we can compute it given _stack_ when needed (i.e. when we process an edge that "closes" a loop).
347b9ca
to
6649775
Compare
This is an attempt at removing [[DFSAncestorIndex]], similarly to how #3625 removed [[DFSIndex]], by slightly modifying Tarjan's algorithm to use the stack depth instead of the module discovery index.
Even though they do not represent how I came up with this different approach, I tried to split the changes in 5 separate commits that can be reviewed one-by-one for correctness. The rest of the commits to the same change for InnerModuleEvaluation.
This changes the editorial aspects but does not simplify implementations: they'd still need to keep track of a per-traversal number (just now it's the stack length rather than the discovery length) and a per-module number (unless they prefer to do an
O(number of nodes in a SCC)
loop for each cycle detected while traversing the graph, to find an index in a list).Again, like for #3635, this change is not objectively good and I'm curious to se what editors think about it.
You can find a demo of the updated algorithm at https://nicolo-ribaudo.github.io/es-module-evaluation/#s=ICBBCiBCIEMKICBECiBFIEkKRiAgSgpIIEc%3D&c=QSAtPiBCCkIgLT4gQwpCIC0%2BIEQKQyAtPiBBCkMgLT4gRApEIC0%2BIEUKRSAtPiBGCkYgLT4gRwpHIC0%2BIEgKRyAtPiBFCkggLT4gRgpEIC0%2BIEkKSSAtPiBKCkogLT4gSQ%3D%3D&a=&f=
(Note: this first commit is the most complex one. A proof of its correctness is in #3637 (comment))