This repository implements a technique for embedding pattern matching into a Haskell EDSL, with the intention of integrating the ideas into the Haski compiler.
EDIT (2021-06-25): Precise implementation details in this repository may differ slightly from that of the Haski compiler. In general though, the technique used is the same.
Specifically, a framework is provided that allows the user to:
-
Define how to convert some input type (e.g.
Int
) into a user-defined ADT. -
Use the above definition to convert into, and deconstruct using pattern matching, values of the user-defined ADT.
The technique is demonstrated in a simple expression language E
along with
a compiler targeting C.
Example programs can be found in the src/E/Examples
directory.
The library can be built with Stack:
stack build
To generate documentation:
stack haddock
To compile one of the example programs, first run GHCi:
stack repl
Then, apply the printProg
function to an E
expression:
# A dummy argument may need to be provided to the program; the compiled
# programs don't read arguments.
GHCi> printProg (needsWatering 0)
The expression language E
is a straightforward deep embedding, supporting
simple arithmetics and some basic bitwise operations. The constructors are
given by the GADT found in E.Core. Functions for compiling
E
programs can be found in E.Compile. The generated C
does not handle input; its main()
function simply runs with a hardcoded
value. E
is intended to represent a generic embedded language, so that
techniques that work for E
should also work for other embedded languages.
This section will walk through the hypothetical example use case presented in E.Examples.Watering. The example is chosen with the target domain of Haski in mind (IoT-devices).
An example of a typical smart device is a sensor which can collect data from its environment, and send the data to some application which can process it. Imagine that we are growing plants, and that we want to hook up a sensor to collect some information about the air temperature and soil humidity, so that a connected application can determine whether the plants need watering.
For low-level devices such as this, it is common for the sent data to be encoded in a format that might be a bit inconvenient to work with. For this example, imagine the data is encoded within a 16-bit integer, and that it can take one of two forms:
If control bit C is set:
| C = 1 | 7 | 8 |
^ ^ ^------------ Temperature (degrees Celsius)
| +-------------------------- Humidity (percentage)
+---------------------------------- Control bit
If control bit C is not set:
| C = 0 | 15 |
^ ^------------------------- Error code
+---------------------------------- Control bit
The data is encoded within the 16-bit integer, but has two intepretations (given by the specification of the sensor). Which one to use depends on whether the most significant bit, the control bit, is set:
-
If the control bit is set, then the first 8 bits of the input represent the temperature in degrees Celsius as a signed 8-bit integer. The following 7 bits represent the humidity percentage of the soil as a 7-bit unsigned integer (a bit of a nonsensical representation to be fair, as it only spans the 1% - 99% range).
-
If the control bit is not set, then the sensor is signals an error, and the remaining 15 bits represent some sort of error code.
Now, say we wanted to perform some logic based on the input sensor data. For
example, we might want a function to return True
if the soil humidity is
sufficiently low, or if the temperature is sufficiently high.
The type signature for such a function could look something like this:
needsWater :: E Word16 -> E Bool
The problem is that we need to perform a bunch of bitwise masking and shifting to extract the data we want, and things could clutter up the function body quite a bit, which might obscure our application logic. Besides, the form of the sensor data will not change while the program runs; it is always going to be one of the two encodings.
Therefore, we might want to define a Haskell data type to model the two encodings of the sensor data.
type Temp = E Word8 -- Unsigned 8-bit integer
type Humidity = E Int8 -- Signed 8-bit integer
type ErrorCode = E Word16 -- Unsgined 16-bit integer
data SensorData
= Sensor Temp Humidity
| Error ErrorCode
toSensorData :: E Word16 -> SensorData
{- ... -}
Our data type SensorData
has two constructors, corresponding to the two
different encodings for our sensor data. The fields of the constructors
correspond to the encoded data; the integer types used (Int8
, Word8
,
Word16
) are the smallest sizes of multiples of 8 that can fit the data. We
could now define a toSensorData
conversion function to contain most of the
dirty bitwise operations, so that the application logic stays clean.
Now that we have our custom data type, which is more convenient to work with
than the 16-bit binary encoding, we'd like to define needsWater
similar
to this:
needsWater :: E Word16 -> E Bool
needsWater x = case toSensorData x of
Sensor temp humidity -> temp >. 30 ||. humidity <. 25
Error _errorCode -> valE False -- Placeholder; do something sensible here
The operators >.
, ||.
, and <.
are variants of the normal >
, ||
, and
<
, but operating on values in the expression language. valE :: a -> E a
simply brings a value into the expression languge.
The problem with an implementation like this is that our pattern matching in
the case
-expression is not on the embedded level, like our temperature- and
humidity-checking logic is. When we compile needsWater
, the result of
toSensorData x
will be either a Sensor
value or an Error
value. What
we want is to inspect the value of toSensorData x
during the runtime of the
generated target language program (C), but currently we instead inspect it
during the runtime of our host language program (Haskell). What we need is
a representation of the case
-expression inside the expression language,
as a constructor of the E
type.
To work around this problem, we use the framework provided and modify our
needsWater
program to the following:
{-# LANGUAGE LambdaCase #-}
needsWatering :: E Word16 -> Estate (E Bool)
needsWatering x = match x $ \case
Sensor temp humidity -> temp >. 30 ||. humidity <. 25
Error _errorCode -> valE False -- Do something sensible here
We note that we call a function match
in place of toSensorData
and that
we no longer use the built-in Haskell case of
-construct, instead
providing a function defined using a lambda-case (enabled by the
LambdaCase
languge pragma). The return type is also slightly different; Estate
is
simply a specialization of the State
monad and is used to generate unique
identifiers when compiling the E
AST:
type Estate = State Env
data Env = {- ... -}
To understand how this works, we first take a look at the signature for the
match
function, here slightly simplified:
match :: Partition a p => E a -> (p -> E b) -> Estate (E b)
match
takes value of type E a
as its first argument, this is the
scrutinee. The second argument is the function that can perform pattern
matching, but it takes a value of type p
, not E a
. When match
is
applied, it will transform the E a
value to a p
value; this
transformation is indicated by the Partition a p
constraint and
corresponds to the toSensorData
function from the earlier example.
{-# LANGUAGE MultiParamTypeClasses #-}
class Partition a p where
partition :: [E a -> (E Bool, p)]
A Partition a p
instance defines how to transform values from some input
type a
to a sum type p
. Since normal Haskell data
-style types are sum
types, this means that we can define instances of Partition
to transform
values of other types into instances of our user-defined types, such as
SensorData
. Because sum types can have multiple constructors, an instance
of Partition a p
must also define conditions to determine which
constructor to use, depending on the value of the input (of type a
).
Note that there are no actual constraints on the type variables a
and p
;
they can be any types, in theory. However, seeing as the point of Partition
is to
define transformation and construction of data, it might be hard though to
implement anything useful for uninhabited types (such as Void
), for example.
partition :: [E a -> (E Bool, Estate p)]
takes no arguments and returns a
list of functions where each function represents one constructor of the sum
type p
, or one "branch" of a case of
-construct. Each function takes a
value of the input type a
and returns a pair:
-
The first half of the pair is a predicate on the input value and affects program control-flow in the target language; the
E Bool
expression corresponds to the condition in the generated if-statement. -
The second half of the pair constructs a value of the sum type
p
which can be dependent on the input value. This provides the functionality oftoSensorData
from earlier.
The following would be a possible definition of a Partition
instance for
our example use case:
{-# LANGUAGE BinaryLiterals #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE NumericUnderscores #-}
instance Partition Word16 SensorData where
partition =
[ \ v -> ( testBitE 15 v
, Sensor (castE (v &. tempMask))
(castE (v &. humidityMask >>. 8))
)
, \ v -> ( notE (testBitE 15 v)
, Error (v &. errorMask)
)
]
where
-- Language extensions BinaryLiterals and NumericUnderscores enables
-- nicer literals for binary values.
tempMask, humidityMask, errorMask :: E Word16
tempMask = 0b0000_0000_1111_1111
humidityMask = 0b0111_1111_0000_0000
errorMask = 0b0111_1111_1111_1111
In each of the corresponding E Bool
fields in our pairs, we implement the
check for the control bit in our input data, and apply constructors Sensor
and Error
in the second half of the pair accordingly. testBitE
checks
whether a given bit is set, and castE
and >>.
are straightforward
C translations of type-casting and bitwise shift-right operations.
When constructing our needsWatering
program, match
will apply all the
functions given by partition
to a placeholder value ESym
, thus creating
values of SensorData
using both our constructors:
-- Simplified snippet from the definition of 'match`
let branches = map ($ ESym) $ partition @a @p
Then, when generating the C code, the compiler can figure out how to replace
the ESym
values with that of the scrutinee.
Below is the corresponding C output (slightly edited for clarity) for the
pattern match and match
usage in needsWatering
:
uint16_t _scrut1;
bool v0(uint16_t arg) {
_scrut1 = arg;
bool res;
if (((1 << 15) & _scrut1) != 0) {
res = ((int8_t) (_scrut1 & 255)) > 30 || ((uint8_t) ((_scrut1 & 32512) >> 8)) < 25;
} else if (!(((1 << 15) & _scrut1) != 0)) {
res = false;
} else {
fprintf(stderr, "No match on: `_scrut1`\n");
exit(1);
}
return res;
}
We can see that the right-hand sides of the lambda-case function from
needsWatering
has been applied in some form on the scrutinee, in combination
with whatever transformation was applied by partition
.
On a quick note, one might wonder why the variable _scrut
is necessary,
seeing as it immediately gets assigned the function argument arg
. Why not
use arg
directly? The reason is only an implementation convenience to avoid
scoping errors that can otherwise occur when nestling calls to match
. The
current compiler creates single-parameter function defintions to handle each
call to match
. In the generated C code, this means that a variable from an
outer match
is invisible inside the function body generated from an
inner match, even though this differs from the Haskell scoping:
f :: E Int -> Estate (E Bool)
f x = matchM x $ \case
A2 b n -> b
A1 n -> match n $ \case
A1 _ -> b -- <- b refers to a field in the outer match.
A2 t _ -> t
Our needsWatering
example looks fairly OK at this point, but there exists a
potential issue related to our technique of pre-applying both partition
and
the lambda function on which match
is applied. The issue manifests when
fields of constructors are used multiple times in the body of the match
call.
Consider a different (contrived) example, and the generated C code (again, edited for clarity):
contrived :: E Int -> Estate (E Bool)
contrived v = match v $ \case
T1 n -> n + n >. n
T2 -> valE False
data T = T1 (E Int) | T2
instance Partition Int T where
partition =
[ \ v -> (v >=. 0, T1 (v * 10 - 2 * 10 - 2))
, \ v -> (v <. 0, T2)
]
int _scrut1;
bool v0(int arg) {
_scrut1 = arg;
bool res;
if (_scrut1 >= 0) {
// Duplicate computations!
res = (((((_scrut1 * 10) - (2 * 10)) - 2)
+ (((_scrut1 * 10) - (2 * 10)) - 2))
> (((_scrut1 * 10) - (2 * 10)) - 2));
} else if (_scrut1 < 0) {
res = false;
} else {
fprintf(stderr, "No match on: `_scrut1`\n");
exit(1);
}
return res;
}
In the body of the match
call, we only performed two operations, +
and
>.
. Yet, because we reference the expression n
multiple times, its entire
definition (as defined by partition
) shows up multiple times as well! This
is a problem in several aspects. For one, it clutters the code and
highlights the lack of a relation between the fields of our constructors and
corresponding variables in the generated code. More importantly, however, it
leads to extra, unnecessary computation in the generated code.
What we would like is for each field of our constructors to be assigned to a
variable in the generated code, so that they can be reused. For this to
happen, the compiler needs to know when an expression corresponds to a field
of a constructor, for example by having expressions tagged with some kind of
dedicated "tag" constructor. Implementing this is a bit tricky though; the
match
function itself does not know enough about the type of the value it
operates on to meaningfully inspect each constructor field, though it might
be possible through generic programming techniques such as GHC generics or
similar.
Instead, the simplest way would be to apply a tag constructor EField
in the
defintion of partition
, where the value is constructed:
EField :: ArgId -> E a -> E a
-- Simplified type of ArgId to demonstrate the idea; the actual type is
-- slightly different, but conceptually the same.
type ArgId = String
instance Partition Int T where
partition =
[ \ v -> (v >=. 0, T1 (EField "T1_1" (v * 10 - 2 * 10 - 2)))
, \ v -> (v <. 0, T2) -- ^ -- Tag constructor
]
EField
wraps an expression with a tag, indicating to the compiler that it
should assign the entire sub-expression to a variable, and use that variable
in the future if the expression is referenced again. This works, but it has
the downside of introducing additional boilerplate to the user, which is both
unergonomic and potentially error-prone, especially so for larger types and
for constructor with more fields. The EField
tag only serves its purpose
internally anyway, so it should be ideally be hidden away from the user.
Here is where we can use smart constructors to hide the tagging of field expressions. A smart constructor is a normal function that wraps the construction of a value, possibly performing additional work that can be hidden away from the user.
With a smart constructor, we could instead implement partition
like this:
instance Partition Int T where
partition =
[ \ v -> (v >=. 0, _T1 (v * 10 - 2 * 10 - 2))
, \ v -> (v <. 0, T2)
]
-- Smart constructor for constructor T1.
_T1 :: E Int -> T
_T1 a = let t1 = EField "T1_1"
in T1 (t1 a)
Now the implementation of partition
is minimally impacted, as we switch
from writing T1
to _T1
. Of course, we now instead have an entire separate
smart constructor function that needs to be written instead, so we have only
really moved the problem elsewhere.
Thankfully, these smart constructors follow a pattern, and the smart
constructor name, type, number of parameters, and number tags to create, can
all be derived from the type definition of value that it constructs. While
the tagging would be difficult to do in match
due to operating on a
polymorphc type, we can generate our smart constructors using Template
Haskell.
With this, we can add a single line to our data type declaration to generate smart constructors for it:
{-# LANGUAGE TemplateHaskell #-}
data T = T1 (E Int) | T2
$(mkConstructors ''T)
instance Partition Int T where
partition =
[ \ v -> (v >=. 0, _T1 (v * 10 - 2 * 10 - 2))
, \ v -> (v <. 0, _T2) -- Not necessary (0 fields) but looks uniform :)
]
As a small detail, the generated code from the TH function is slightly different from our hand-written one:
_T1 :: E Int -> Estate T
_T1 v0 = do
tag1 <- newFieldTag
pure $ T1 (tag1 a)
newFieldTag
generates a unique tag name, which will be unique in the entire
program. This results in the smart constructor having the return type
Estate a
instead of simply a
, which also changes the definition of
partition
accordingly:
partition :: [E a -> (E Bool, Estate p)]
-- ^^^^^^
Using the smart constructors, our generated code now looks like this instead:
int _scrut1;
int _scrut1_field1;
bool v0(int arg) {
_scrut1 = arg;
bool res;
_scrut1_field1 = ((_scrut1 * 10) - (2 * 10)) - 2;
if (_scrut1 >= 0) {
res = (_scrut1_field1 * _scrut1_field1) > _scrut1_field1;
} else if (_scrut1 < 0) {
res = false;
} else {
fprintf(stderr, "No match on: `_scrut1`\n");
exit(1);
}
return res;
}
The duplicate computations from earlier are gone and the relation to the
original E
program is a bit clearer as well!