Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Caseing on values of built-in types #6602

Open
effectfully opened this issue Oct 28, 2024 · 5 comments
Open

Caseing on values of built-in types #6602

effectfully opened this issue Oct 28, 2024 · 5 comments

Comments

@effectfully
Copy link
Contributor

effectfully commented Oct 28, 2024

We really want to have faster Data processing, but as per #6225 this is quite hard. Matching on integers is very inefficient too as one needs to perform repeated calls to IfThenElse and EqualsInteger (the possibility of using SOPs for matching on integers to solve the issue was brought up here). Matching on built-in booleans is also inefficient, particularly from the size point of view.

I feel like we can kill all these birds with a single stone: allow case to work on values of built-in types.

Bool

Take booleans, we could allow

case b
  branchFalse
  branchTrue

where

  • b is a Bool
  • branchFalse is executed if b is False
  • branchTrue is executed if b is True

(I'm making branchFalse come before branchTrue not only because false corresponds to 0 and true corresponds to 1, but also because the false case is more likely to terminate execution, hence being pretty-printed an average program would be more readable with the throwing clause appearing first. But we can make the ordering the same as in IfThenElse instead if we feel strongly about it, I personally don't care much)

This is as close as it gets to extending the UPLC AST with a constructor for IfThenElse without actually doing that. It should also be easy to implement, since Case is currently defined as

    | Case !ann !(Term name uni fun ann) !(Vector (Term name uni fun ann))

and all we need is to teach the CEK machine how to extract Terms from a Vector (plus check its length) and how to handle them depending on the type tag of the given constant.

Note how we would be able to address everything from #6598 without introducing any new builtins. For example if we had

assertAndContinueOrTrace :: string -> bool -> a -> a

and wanted to write force (assertAndContinueOrTrace msg b (delay x)), we could instead simply write

case b
  (trace msg error)
  x

This AST is of roughly the same size, but no force-delay bookkeeping is required, just like the additional builtin. The latter code is concise and on-point.

Integer

For integers we'll allow

case i
  branch0
  branch1
  branch2
  ...
  branchN

reducing to branch_i where i is the scrutinee or failing if i is not in [0, N].

This alone will optimize Data processing, because one will be able to write (pseudocode)

let (i, args) = unConstrData d
in case i
  (applyToList branch0 args)
  (applyToList branch1 args)
  ...
  (applyToList branchN args)

This is clearly unideal, since we very much want to bypass the unConstrData and applyToList noise, but this is a good start and an improvement in the right direction.

Note that associating case branches with hardcoded integers (0, 1 ... n) isn't much of a limitation, since one can recover full generality by adding an elemIndex builtin:

-- Same as `elemIndex` in Haskell except the first argument is a function taking the length of the
-- given list and returning something -- it's only called when the given element isn't found and
-- will normally be either `\n -> n` or `\_ -> -1`.
elemIndex :: (Integer -> Integer) -> a -> [a] -> Integer

For example the following:

case (elemIndex id 42 [5, 42, 1])
  branch5
  branch42
  branch1
  branchElemNotFound

will evaluate to branch42.

Functioning case on integers with the addition of elemIndex will also allow us to have Deterministic universal almost-unique Plutus Constructors if we ever feel like it.

Data

Ultimately we want case to handle Data.Constr the same way that it handles Term.Constr. There's still the limitation discussed in the "What exactly breaks if we add the unsafe constrTermFromConstrData?" section of #5777, but that one we should be able to address via multi-lambdas/multi-applications (see #6225). Which will help SOPs as well I believe, since multi-lambdas should be faster than iterated lambdas, plus UPLC is untyped, so there's still a possibility of getting the same mismatched semantics issue with SOPs as with Data.

So in the end we want

case (Constr 1 [a, b, c])
  branch0
  (\[x, y, z] -> body1)
  branch2

to evaluate to

(\[x, y, z] -> body) [a, b, c]

which is a multi-lambda \[x, y, z] -> multi-applied to [a, b, c].

I.e. that case means (pseudocode)

let x = a
    y = b
    z = c
in body1

and branch selection with Data will work the same way as with SOPs: for the Constr i args scrutinee the ith branch gets picked, so given that i is 1 in the example above, we pick the 1th branch and drop the 0th and the 2th ones (we only have the 2th one to make the example a bit more rich, nothing would change if it wasn't there). Unlike SOPs we

Note that casing on Data is orthogonal to pattern matching builtins, since the latter allow one to handle all constructors of Data, while case over Data only handles the Constr constructor in a specific SOP-like way.

Lists and pairs

I'm not sure how we should handle those in case-expressions. Just replicate pattern matching builtins? Or allow to match against multiple patterns like in Haskell (e.g. for lists [], [x], [x, y], x:y:z:xs etc)? Do we then want to change Case to contain proper patterns rather than arbitrary terms?

All of that is to be figured out, but this is way less important than speeding up processing of Data, so it can definitely wait.

Elaborate case to builtin calls?

Why reimplement case for booleans if we already have IfThenElse? Can we just elaborate case to IfThenElse under the hood? I believe the answer is yes, but it's awkward: IfThenElse expects values while we'll need to pass terms in there, which is doable as we can simply wrap terms into values (using VDelay for example), but then we need to unwrap the result, plus we want all the BuiltinRuntime nonsense (that is required for builtins but not case) to get inlined away and persuading GHC is always non-trivial, so I feel like it's much easier to just reimplement a few lines of logic than attempt a clever general solution requiring an order of magnitude more effort to implement and maintain.

Summary

It seems like this path will allow us to iteratively improve the status quo w.r.t. performance and ultimately resolve the long-standing issue of Data processing being slow, which is among the most pressing issues with UPLC at the moment.

@MicroProofs
Copy link

Great this is exactly what I would like to make compilation simpler and more efficient.

@colll78
Copy link
Contributor

colll78 commented Oct 30, 2024

Very exciting stuff! It's nice to see more support around directly interacting with BuiltinData.

@wadler
Copy link

wadler commented Nov 9, 2024

(i) Thanks for thinking along these lines. Obviously, additional instructions to speed up data processing are a good idea.

(ii) Since there are a finite number of built-in types, an alternative would be to add a new case for each: caseBool, caseInteger, caseData, etc. Presumably that would be slightly more efficient, because it will not be necessary at runtime to determine what kind of data is the motive of the case. (Here I use the following terminology: case motive branch_0 ... branch_n. Everyone uses branch, but motive is not so widely known.) Is there some reason for lumping them all into one instruction?

(iii) Your explanation of the meaning of case for Integer is incomplete. In case motive branch_0 ... branch_n what happens if the motive is less than zero or greater than n? Do you intend for branch_n to be a default?

(iv) Your explanation of the meaning of case for Data is incomplete. You gave only this example

 case (Constr 1 [a, b, c])
    branch0
    (\[x, y, z] -> body1)
    branch2

You don't explain the role of branch0 or branch2, nor how the term is supposed to behave in general.

@effectfully
Copy link
Contributor Author

Since there are a finite number of built-in types, an alternative would be to add a new case for each: caseBool, caseInteger, caseData, etc. <...> Is there some reason for lumping them all into one instruction?

Yes:

  • we currently only have 4 bits to encode AST constructors, so going beyond that will either require complicated gymnastics with reserving, say, 1111 to mean "and now expect more bits for the extended set of AST constructors" -- which would be somewhat annoying to maintain and slower than necessary. Or it'll require us to create a new Plutus version with a completely different binary encoding and then we need to maintain two different binary encodings, so again annoying to maintain
  • if we add caseBool, caseInteger, caseData to the AST, we'll have to update all the code doing something with the AST, of which there's a lot: all transformation passes, all analysis passes (e.g. uniques, purity), serialization/deserialization, parsing/pretty-printing etc. This is all doable, but it'll further delay the change, for which users have been asking us for years already probably. Plus it'll again increase maintenance costs
  • adding all those functions as separate constructors is pretty much giving up on the original design of UPLC being an extensible but conservative untyped lambda calculus. Which is perhaps fine, but then we should do it consistently and give up on the current extensibility entirely, because it's weird to have caseBool in the AST and not booleans themselves

Presumably that would be slightly more efficient, because it will not be necessary at runtime to determine what kind of data is the motive of the case.

Slightly faster at runtime, slightly slower to decode, who knows if it'll pay out. But dispatching on type of the scrutinee is just a bit of constant overhead, the problem that we're addressing here is a ridiculous amount of existing overhead. So I don't think we need a perfect solution here, just one that solves the problem.

If it some point we decide that we want to hardcode builtins into the AST, we can do it one fell swoop and create an entire new UPLC, I don't personally see much value in breaking the current design just to save some constant costs here and there. We can get much more out of giving up on extensibility if we want to.

(iii) Your explanation of the meaning of case for Integer is incomplete. In case motive branch_0 ... branch_n what happens if the motive is less than zero or greater than n? Do you intend for branch_n to be a default?

My bad, sorry, I've now specified that the match should fail.

(iv) Your explanation of the meaning of case for Data is incomplete. You gave only this example

 case (Constr 1 [a, b, c])
    branch0
    (\[x, y, z] -> body1)
    branch2

You don't explain the role of branch0 or branch2, nor how the term is supposed to behave in general.

Made it this:

So in the end we want

case (Constr 1 [a, b, c])
  branch0
  (\[x, y, z] -> body1)
  branch2

to evaluate to

(\[x, y, z] -> body) [a, b, c]

which is a multi-lambda \[x, y, z] -> multi-applied to [a, b, c].

I.e. that case means (pseudocode)

let x = a
    y = b
    z = c
in body1

and branch selection with Data will work the same way as with SOPs: for the Constr i args scrutinee the ith branch gets picked, so given that i is 1 in the example above, we pick the 1th branch and drop the 0th and the 2th ones (we only have the 2th one to make the example a bit more rich, nothing would change if it wasn't there). Except SOPs don't use saturated lambdas and pattern matching on Data has to (for reasons discussed in #5777 and #6225).

Hope this is clearer. Some of that context is in the #5777 and #6225 and I didn't want to duplicate it here. Sorry for any confusion caused!


Just to nitpick:

(Here I use the following terminology: case motive branch_0 ... branch_n. Everyone uses branch, but motive is not so widely known.)

We normally refer to x in case x <...> as "scrutinee" in docs and papers, which seems to be common. A motive, as defined by McBride at least, is the type you're eliminating into, i.e. in our case the type of the body of each of the branches (which has to be the same for all branches).

@wadler
Copy link

wadler commented Nov 11, 2024

Thanks for the response.

The name "scrutinee" is specific and easier to understand than "motive"; good choice. You may be right that I've misunderstood the latter term.

Thanks for the clarification of why you are not defining separate operators for each sort of scrutinee. That should make its way into the design document. I agree with your reasoning, especially the point about not wanting to overuse the four bits available to encode AST constructors.

Thanks also for the clarification of the meaning of the various operators.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants