Here you can find definitions for words that are commonly used in the compiler along with links to the codebase. Check https://www.roc-lang.org/tutorial if you want to know about general Roc terms. Feel free to ask for a term to be added or add one yourself!
Contributor note: definitions should be roughly ordered as in a tutorial, e.g. Parser should be explained before Canonicalization.
Command Line Interface. The entrypoint of the compiler that brings together all functionality in the Roc toolset and makes it accessible to the user through the terminal, e.g. `roc build main.roc`.
- new compiler: src/main.zig
- old compiler: crates/cli/src/main.rs
A .roc file forms one module.
Types of modules:
- app (example): Applications are combined with a platform and compiled into an executable.
- module (example): Provide types and functions which can be imported into other modules.
- package (example): Organises modules to share functionality across applications and platforms.
- platform (example): Provides memory management and effects, like writing to files and network communication, to interface with the outside world. Detailed explanation.
- hosted (example): Lists all Roc types and functions provided by the platform.
Implementation:
- new compiler:
- old compiler:
IR (Intermediate Representation)
A memory optimization technique where only one copy of each distinct value is stored in memory, regardless of how many times it appears in a program or IR. For example, a function named `foo` may be called many times in a Roc file, but we store `foo` once and use an index to refer to `foo` at the call sites.
Uses of interning:
- new compiler: collections/SmallStringInterner.zig, ident.zig, ModuleEnv.zig, tokenize.zig, ...
- old compiler: small_string_interner.rs, mono_module.rs, format.rs, ...
- There are many more uses of interning, I recommend searching for "interner" (case-insensitive).
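The idea behind interning can be sketched in a few lines. This is a hypothetical illustration, not the real `SmallStringInterner`: each distinct string is stored exactly once, and callers hold a small integer index instead of the string itself.

```python
# Minimal sketch of string interning (illustrative only; the real
# interner lives in collections/SmallStringInterner.zig).

class Interner:
    def __init__(self):
        self.strings = []   # index -> string, each stored once
        self.indices = {}   # string -> index, for deduplication

    def intern(self, s):
        """Return the index for s, storing s only on first sight."""
        if s not in self.indices:
            self.indices[s] = len(self.strings)
            self.strings.append(s)
        return self.indices[s]

    def lookup(self, idx):
        """Recover the original string from its index."""
        return self.strings[idx]

interner = Interner()
a = interner.intern("foo")
b = interner.intern("foo")   # second occurrence: no new storage
c = interner.intern("bar")
print(a == b, a != c, interner.lookup(a))  # True True foo
```

Because indices are small fixed-size integers, IR nodes that would otherwise carry many copies of the same string stay compact and comparisons become integer comparisons.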
Any text in a Roc source file that has significant content, but is not a Roc Str like "Hello". Used for variable names, record field names, type names, etc.
During tokenization all identifiers are put into a deduplicated collection and given an ID. That ID is used in IRs instead of the actual text to save memory.
Identifier in the compiler:
- new compiler:
  - `Ident`
  - Ident tokenization: check the functions `chompIdentLower` and `chompIdentGeneral`, and their uses.
  - Ident parsing: search `Ident`
- old compiler:
  - `IdentStr`
  - module/ident.rs
  - parsing: search "identifier" (case-insensitive)
A specific word that has a predefined meaning in the language, like `crash`, `if`, `when`, ... .
Many keywords cannot be used as a variable name.
We have an overview of all Roc keywords.
Keywords in the compiler:
An operator is a symbol or keyword that performs a specific operation on one or more operands (values or variables) to produce a result.
Some examples: `+`, `=`, `==`, `>`. A table of all operators in Roc.
`+` is an example of a binary operator because it works with two operands, e.g. `1 + 1`. Similarly, `!` (e.g. `!Bool.false`) is a unary operator.
Operators in the compiler:
- New compiler: search `Op` in tokenize.zig
- Old compiler: search `operator_help` in expr.rs
Syntax within a programming language that is designed to make things easier to read or express. It allows developers to write code in a more concise, readable, or convenient way without adding new functionality to the language itself.
Desugaring converts syntax sugar (like `x + 1`) into more fundamental operations (like `Num.add(x, 1)`).
A table of all operators in Roc and what they desugar to.
Desugaring in the compiler:
- New compiler: canonicalize.zig (WIP)
- Old compiler: desugar.rs
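Desugaring is just a tree rewrite. Below is a hypothetical sketch on a tiny tuple-based AST; the node shapes and the `OP_TO_FN` table are invented for illustration and do not match the real compiler's IR.

```python
# Minimal desugaring sketch: rewrite a sugared binary-operator node
# (x + 1) into a plain function call (Num.add(x, 1)).
# Node shapes here are invented, not the compiler's real AST.

OP_TO_FN = {"+": ("Num", "add"), "-": ("Num", "sub")}

def desugar(node):
    kind = node[0]
    if kind == "binop":
        _, op, left, right = node
        module, fn = OP_TO_FN[op]
        # Recurse so nested sugar like (x + 1) - 2 is also rewritten.
        return ("call", (module, fn), [desugar(left), desugar(right)])
    return node  # literals and variables need no desugaring

sugared = ("binop", "+", ("var", "x"), ("int", 1))
print(desugar(sugared))
# ('call', ('Num', 'add'), [('var', 'x'), ('int', 1)])
```

After this pass, later phases only ever see function calls, so they need no special handling for operators.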
A compiler phase is a distinct stage in the process the compiler goes through to translate high-level source code into machine code that a computer can execute. Compilers don't do this in one big step; they break it down into several phases, each handling a specific task. Some examples of phases: tokenization, parsing, code generation, ... .
The process of breaking down source code into smaller units called tokens. These tokens are the basic building blocks of a programming language, such as keywords, identifiers, operators, and symbols. The input code is scanned character by character and is grouped into meaningful sequences based on the language's syntax rules. This step makes parsing simpler.
Example source code:

```
module []

foo : U64
```

Corresponding tokens:

```
KwModule(1:1-1:7),OpenSquare(1:8-1:9),CloseSquare(1:9-1:10),Newline(1:1-1:1),
Newline(1:1-1:1),
LowerIdent(3:1-3:4),OpColon(3:5-3:6),UpperIdent(3:7-3:10),Newline(1:1-1:1)
```
New compiler:
Old compiler:
- We did not have a separate tokenization step; everything happened in the parser.
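The character-by-character grouping described above can be sketched with a toy tokenizer. This is purely illustrative: the token names mimic the example output, but the rules and structure are invented and far simpler than the real tokenizer in tokenize.zig.

```python
# Toy tokenizer sketch (illustrative only). It scans the source left to
# right and matches the first rule that fits, producing (name, text) pairs.
import re

TOKEN_SPEC = [
    ("KwModule",    r"module\b"),
    ("UpperIdent",  r"[A-Z][A-Za-z0-9]*"),
    ("LowerIdent",  r"[a-z][A-Za-z0-9]*"),
    ("OpColon",     r":"),
    ("OpenSquare",  r"\["),
    ("CloseSquare", r"\]"),
    ("Skip",        r"[ \t\n]+"),   # whitespace produces no token here
]

def tokenize(src):
    tokens, pos = [], 0
    while pos < len(src):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, src[pos:])
            if m:
                if name != "Skip":
                    tokens.append((name, m.group()))
                pos += m.end()
                break
        else:
            raise SyntaxError(f"unexpected character {src[pos]!r}")
    return tokens

print(tokenize("module []\n\nfoo : U64"))
# [('KwModule', 'module'), ('OpenSquare', '['), ('CloseSquare', ']'),
#  ('LowerIdent', 'foo'), ('OpColon', ':'), ('UpperIdent', 'U64')]
```

Note the rule order matters: `KwModule` must be tried before `LowerIdent`, otherwise `module` would be tokenized as an ordinary identifier. The real tokenizer also records source positions (the `1:1-1:7` ranges in the example above), which this sketch omits.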
AST (Abstract Syntax Tree)
An AST organizes and represents the source code as a tree-like structure. So for the code below:
```
module []

foo : U64
```

The AST is:

```
(file
    (module (1:1-1:10))
    (type_anno (3:1-4:4)
        "foo"
        (tag (3:7-3:10) "U64")))
```
It captures the meaning of the code, while ignoring purely syntactic details like parentheses, commas, semicolons,... . Compared to raw source code, this structured format is much easier to analyze and manipulate programmatically by the next compiler phase.
The AST is created by the parser.
New compiler:
- See the `Node` struct in this file.
- You can see examples of ASTs in the .txt files in this folder.
Old compiler:
- See `FullAst` here
- Some tests
- Many snapshot tests
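A tree like the one in the example above can be modeled as nodes with a kind, an optional value, and children. This is a hypothetical sketch: the real `Node` struct stores indices into flat arrays rather than nested Python objects.

```python
# Hypothetical AST sketch mirroring the example above
# (illustrative only; not the compiler's real representation).
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                 # e.g. "file", "module", "type_anno", "tag"
    value: str = ""           # e.g. the identifier "foo" or the tag "U64"
    children: list = field(default_factory=list)

# module []
#
# foo : U64
ast = Node("file", children=[
    Node("module"),
    Node("type_anno", "foo", [Node("tag", "U64")]),
])

def names(node):
    """Collect every non-empty value in the tree, depth-first."""
    out = [node.value] if node.value else []
    for child in node.children:
        out += names(child)
    return out

print(names(ast))  # ['foo', 'U64']
```

Walking the tree like `names` does is the basic pattern later phases use: each phase traverses the AST (or a derived IR) and either analyzes it or produces a transformed tree.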
Monomorphization (mono, specialization)
Monomorphization, also known as type specialization, is the process of creating a distinct copy of each instance of a generic function or value based on all specific usages in a program.
For example, a function with the type `Num a -> Num a` may only be called in the program with a `U64` and an `I64`. Specialization will then create two functions, with the types `U64 -> U64` and `I64 -> I64`.
This trades off some compile time for much better runtime performance, since we don't need to look up which implementation to call at runtime (AKA dynamic dispatch).
Related Files:
- new compiler:
- old compiler: