documentation: Explain solution for match case deduplication

The document also explains other solutions and the motivation for all of those to exist or have existed in the form of constraints. Co-authored-by: Ali Mohammad Pur <[email protected]>
SerenityOS · Oct 23, 2024 · b1163d6 · b1163d6
1 parent 1ae4f72
commit b1163d6
Showing 1 changed file with 293 additions and 0 deletions.
diff --git a/documentation/pattern-match-avoid-duplication.md b/documentation/pattern-match-avoid-duplication.md
@@ -0,0 +1,293 @@
+# How Jakt avoids code duplication in pattern matching codegen
+
+## Current solution
+
+Typechecker walks the patterns to get the type information out of them as it
+did before, but instead of adding binding variables to a scope, it adds
+them to a `BindingKeyBuilder`. Typechecker "submits" bindings to the builder,
+which then produces a `BindingKey` that tells us whether the set of bindings
+that were declared already exist within the same pattern group.
+
+Since we don't want to be passing a lot of state regarding the match to helper
+functions like `.typecheck_match_variant`, `BindingKeyBuilder` only contains a
+reference to the slice of cases that have been declared and refer to the same
+`ParsedMatchBody`. Remember that it could be multiple due to Constraint 1.
+
+To enforce that the set of cases checked by `BindingKeyBuilder` are the correct
+set, and that context is not lost when reading through the long loop bodies of
+`.typecheck_match`, all the state regarding match typecheck is in
+`MatchBuilder`. This struct has two functionalities:
+
+1.  It will not let us create a `BindingKeyBuilder` until we provide a proof
+    that we have started a case. Such proof tracks the start of the case
+    slice that `BindingKeyBuilder` should use. This is in place to avoid copy-paste
+    mishaps, since the logic for traversing the matches is big and changes depending on
+    the type to be matched.
+
+2.  When finishing a pattern, we give it a `BindingKey` along a match body and
+    said pattern, and it will register the pattern into the `CheckedMatchCase`
+    that the `BindingKey` tells it to. The parsed body will be (re-)typechecked
+    against the bindings for this finished pattern only when the bindings of the
+    new pattern have different types than the rest of the previously checked
+    patterns.
+
+`BindingKeyBuilder` keeps track of what bindings were declared by the
+typechecker, in case we have to generate a new `CheckedMatchBody`, and the set
+of candidates that could match the set of bindings that have been submitted.
+
+Since `BindingKeyBuilder` works in a binding-per-binding basis, we could have
+multiple sets that match the partial set of bindings. The `finish()` method is
+called when the build is done, so we know that there are no more bindings
+flying in and we can make a final pass on our candidates, selecting only the
+`CheckedMatchCase` that has the same amount of bindings as what we've gathered
+so far.
+
+The usage of `BindingKeyBuilder` helps us reduce the amount of dictionaries
+that are created and deleted during the typechecking phase, since builders that
+stay in their `Empty` state do not contain references to empty
+dictionaries.
+
+
+## Constraints
+
+The solution above is designed to feed codegen with more information on the
+structure of matches, avoid duplicate work whenever possible and keep
+correctness while doing so.
+
+We can summarize these features in two constraints, discussed in this section
+after a brief on pattern matching terminology.
+
+### Terminology
+
+
+```jakt
+match <expr> {
+    <pat1>(<binding1>...) | <pat2>(<binding2>...) => <body1>
+    <pat3> => <body2>
+}
+```
+
+- `expr` is the "matched value", or "value". It is the value we wish to branch
+on.
+- `pat1`, `pat2` and `pat3` are "patterns". They are different structures that
+the matched value may conform to.
+- `binding1` and `binding2` are "bindings". They are variables that extract
+different parts of the structure that the value is checked to conform to. We
+can then use those variables inside `<body1>`.
+- `body1` and `body2` are "bodies". They are the pieces of code that we want to
+execute on certain cases.
+- `pat1` and `pat2` form a "pattern group". Matching either of them will branch
+to the code specified in `<body1>`.
+- Pattern groups and bodies together form a "case". Each case has a single body
+and a group of patterns.
+
+
+### Constraint 1: Multiple binding types
+
+Jakt allows us to use bindings of different types in the same pattern group, as
+long as the body typechecks against both "versions" of the bindings. For an
+illustration, take this valid Jakt sample:
+
+```jakt
+enum TypedPair {
+    I32(a: i32, b: i32)
+    U64(a: u64, b: u64)
+}
+
+
+fn get_pair() -> TypedPair { ... }
+
+fn main() {
+    match get_pair() {
+        I32(a, b) | U64(a, b) => {
+            let result = a + b
+            eprintln("result: {}", result)
+        }
+    }
+}
+```
+
+Notice that the first body of the match accepts both types of `a` and `b`
+(`i32` and `u64`):
+
+- `a + b` typechecks as long as the type of `a` implements `Add` with `b` as its RHS.
+- `eprintln` accepts both result types.
+
+According to the rule above, this sample should typecheck and always print the
+added result.
+
+We want to generate the minimal amount of code that contains both versions of
+the branching, meaning that paths for both `I32` and `U64` should get their
+`result` to be `i32` and `u64` respectively.
+
+When generating the code for that to work, even though we're lenient on the
+user, we still have to separate the code path that uses `u64` from the code
+path that uses `i32`. Since we generate C++ code, we can rely on C++'s
+substitution power for things like calls to overloaded functions, but wherever
+we have to fully specify the type of `a`, `b` or `result` we will have to
+generate two versions of the bodies.
+
+### Constraint 2: Multiple patterns, same binding types, only one body
+
+Take the following samples:
+
+#### Constraint 2.1: Multiple patterns with bindings
+```jakt
+enum Foo {
+    Bar(i32)
+    Baz
+}
+
+fn get_foo() -> Foo { ... }
+
+fn main() {
+    match get_foo() {
+        Bar(x) | Baz default(x = 1i32) {
+            // ... long code using `x`
+        }
+    }
+}
+```
+
+#### Constraint 2.2: Multiple patterns with no bindings
+```jakt
+enum Answer {
+    Is42
+    IsNot42(actual_value: u32)
+    IsNot42AndNoValue
+}
+
+fn get_answer() -> Answer { ... }
+
+fn main() {
+    match get_answer() {
+        Is42 => {
+            // long body 1
+        }
+        IsNot42 | IsNot42AndNoValue => {
+            // long body 2
+        }
+    }
+}
+```
+
+In both cases, we want Jakt to ideally only generate the long bodies once.
+
+## Previous solutions
+
+This is not the first iteration of pattern matching codegen, and the
+constraints above are modelled against features and problems of previous
+iterations. In this section we explain how each iteration works at a high level
+and how they motivate the constraints.
+
+### First solution
+
+The first implementation ran through all of the cases, typechecking the body
+for each of the patterns and submitting a new case per pattern, with their
+respective body.
+
+When generating code for a `match`, codegen sees a `CheckedExpression::Match`
+containing a list of `CheckedMatchCase`s. Each `CheckedMatchCase` contains exactly one
+pattern and one body that uses the bindings declared within said pattern.
+
+The problem with this solution is that it effectively duplicates the body as
+many times as patterns exist in the same group. However, the overhead on
+typechecking the body was not the main issue: Codegen would receive the bodies
+already duplicated, severely limiting what could be done at the codegen stage.
+Take into account that the codegen stage has the most complete version of the
+program that Jakt can have, since it needs all the concrete types that are used
+in order to generate working C++.
+In #1597 we can see that a ~300 line Jakt function
+(`typecheck_binary_operation`) is compiled to ~5500 lines of C++ code! That
+happened because the bodies for each case are large, and the bodies were
+duplicated many times.
+
+Constraint 2 ensures that similar functions to `typecheck_binary_operation` do
+not make the C++ compiler sift through repeating bodies if it's not
+necessary.
+
+### A partial improvement (#1597)
+
+PR #1597 addressed Constraint 2.1 by keeping the shared relationship between
+the patterns inside a pattern group and the body they jump to. This enables
+codegen to find pattern groups that have no bindings on them and easily group
+them by concatenating `case` labels or concatenating boolean expressions with
+`||`.
+
+Now each `CheckedMatchCase` contains a list of `CheckedMatchPattern` and one
+body that uses the bindings declared within said pattern. The implementation
+typechecks all bodies like the last version, but only selects the last checked
+body.
+
+The problem with this approach is that now the typechecker portion of this
+solution does not respect Constraint 1, since the last checked body is not
+enough to represent the case if bindings from the same parsed pattern group
+have different binding types.
+
+This causes `tests/codegen/match_enum_same_body_different_impl.jakt` to fail,
+since it codegens a call to `maybe_templated<JaktArray<i32>>` on all cases, when
+the first two cases of the enum should use the overloaded versions of
+`maybe_templated`.
+
+### Latest solution
+
+With this iteration, Constraint 2.1 is not addressed yet, but it satisfies
+Constraint 1 and Constraint 2.2. Typechecker now keeps the grouping, but
+re-generates the bodies when the types don't match.
+
+Each `CheckedMatchCase` contains a list of `CheckedMatchPattern`, a set of
+bindings `[String:VarId]` and a body where those bindings are used. All the
+patterns in this `CheckedMatchCase` are known to have the same set of bindings
+as the case.
+
+#### Why is this solution better?
+
+From the typechecker's side, it's obvious: it avoids creating `Scope`s and
+`CheckedMatchBody`s and `CheckedExpressions` and you name it. A quick glance to
+performance counters while typechecking selfhost [^1] reveals that this
+solution has a memory usage peak that is 20 MB lower than the previous
+solution, and is around ~14% faster.
+The memory usage improvement happens partly because `Scope`s are refcounted
+(`class`) and they are eternally stored in their respective `Module.scopes`
+array. Since we don't create them, they don't linger anymore.
+
+From codegen's side, it's certainly better than #1597 since it addresses
+Constraint 1, which is more important than Constraint 2 because now codegen is
+correct. As discussed below, this solution will generate "false duplicates"
+because it backs off without analyzing the exact problem at hand. The amount of
+duplicates that it will generate with respect to #1597 due to having to comply
+with Constraint 1 is not easy to estimate.
+
+#### Why is it not ideal?
+
+The reason Constraint 1 fails with #1597's solution is very specific: The types
+of the bindings are mentioned inside the body.
+
+An ideal solution would not duplicate bodies that don't mention the types
+inside the body implementation when driving them into codegen, or at least
+avoid duplicating the parts that don't depend on the binding types
+syntactically. That will generate even less C++ code, although the final result
+will still have to go through code duplication through template instantiations
+or overloads.
+
+This solution would require a lot of machinery that is not yet in place to
+typecheck the body and extract what variables (external to the body) each
+statement depends on, and decide how much of it has to be duplicated in either
+codegen or typechecker, or a mix of both. I didn't go with this solution
+because it seemed too complex of a change with not that big of a benefit.
+Perhaps we can place it in the backburner and take its potential into account
+when doing a similar extraction system in other places.
+
+I still haven't measured or calculated how many "false duplicates" the
+non-ideal algorithm will generate, but I can estimate that whatever amount that
+is it will not be very different from the ideal solution, since the difference
+boils down to the cases that have patterns with different binding types but
+could use some method of overloading in codegen (`auto` in lambda and/or
+function overloading) that is significant to the compile time of that piece of
+code, meaning that the separation made things worse.
+
+---
+
+[^1]: Used the program from https://github.com/andrewrk/poop with the following command:
+    `poop -d 15000 './stage1_main -R runtime selfhost/main.jakt' './build/bin/jakt_stage2 -R runtime selfhost/main.jakt'`.
+    `./stage1_main` is stage1 from https://github.com/SerenityOS/jakt/commit/d65f014cc54b986f629fe676d914af01d442b9f7