implement a new CGIR and C code generator #1625

zerbina · 2025-10-02T02:10:04Z

Summary

Add an all-new CGIR, together with a new code generation architecture
built on it and a new C code generator.

Details

New CGIR

Core design decisions for the language:

* structured control-flow: makes translation to Wasm/asm.js easier
* syntax that resembles NimSkull/C: familiar to someone who already knows NimSkull and its AST
* statically and strongly typed: makes debugging and analysis easier
* narrow(er) operations: keep decision making out of the code generators
* built-in simple exception handling

Core design decisions for the IR:

* packed AST for storage; terminals are stored out-of-band and are interned
* every node carries source location information, to make high-precision debug information possible
* one IR for everything: types, statements, expressions

The IR also comes with a grammar and type checker, to help with
debugging, troubleshooting, and codifying the static semantics. It's
always built into the compiler, but due to its overhead, has to be
enabled at run-time by passing -d:validateCgir to the compiler.

The old CGIR is still used for the JS and VM code generators, and thus
has to be kept for now.

New Architecture

The intertwined MIR -> CGIR -> C translation is replaced with
separate MIR -> CGIR and CGIR -> C translation steps. This allows
reusing the MIR -> CGIR parts for other code generators.

A code generator may support only a subset/dialect of the CGIR, which the
MIR -> CGIR translation facilitates by accepting a set of code generator
capabilities.
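As a rough illustration of the capability idea, the sketch below models a capability set as bit flags and lets a translation step pick a lowering strategy from it. All names (`CAP_BY_VALUE_ARRAYS`, `CAP_NATIVE_EXCEPTIONS`, `pick_array_return`) are hypothetical and chosen for illustration; the capabilities the MIR -> CGIR translation actually accepts are not listed in this description.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability flags; the names are illustrative only. */
typedef enum {
    CAP_BY_VALUE_ARRAYS   = 1u << 0, /* target can pass/return arrays by value */
    CAP_NATIVE_EXCEPTIONS = 1u << 1  /* target has built-in exception handling */
} Capability;

typedef uint32_t CapabilitySet;

/* The two lowering strategies this sketch chooses between. */
typedef enum { RETURN_DIRECT, RETURN_VIA_OUT_PARAM } ArrayReturnStrategy;

bool has_capability(CapabilitySet caps, Capability c) {
    return (caps & (CapabilitySet)c) != 0;
}

/* A target without by-value arrays gets the out-parameter lowering. */
ArrayReturnStrategy pick_array_return(CapabilitySet caps) {
    return has_capability(caps, CAP_BY_VALUE_ARRAYS) ? RETURN_DIRECT
                                                     : RETURN_VIA_OUT_PARAM;
}
```

With an empty set, `pick_array_return(0)` selects the out-parameter strategy, which is what a C-like target would need.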

As a preparation for incremental compilation, the CGIR has support for
being split into multiple units (i.e., modules), though this feature is not
actually used right now.

Breaking Changes

* features: TLS emulation is not supported anymore. The switch still exists, but enabling emulation now causes an error
* features: header generation is not supported anymore
* C FFI: field symbols cannot be used in emit statements anymore

Changes To The Produced C Code

* C scopes reflect the NimSkull-level scopes
* immutable pass-by-reference parameters are marked with NIM_NOALIAS
* error handling in .compilerprocs is not omitted anymore
* except for returning array values, out parameters are not used anymore
* large set operations are implemented as runtime procedures
* updating the TFrame instances doesn't use C macros anymore
* non-inline routines are emitted in the order they're processed by sem (which is roughly the order they appear in the source code)
* RTTI for nominal types is defined in the nominal type's home module
* RTTI for structural types is defined in the project's entry module
* RTTI is initialized using static initializer expressions
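To make the no-alias marking more concrete, here is a minimal sketch of an immutable pass-by-reference parameter. `SKETCH_NOALIAS` is a stand-in for `NIM_NOALIAS`; that it expands to `restrict` is an assumption of this sketch, and `Big` and `add_into` are hypothetical names.

```c
#include <stddef.h>

/* Stand-in for NIM_NOALIAS; expanding to `restrict` is an assumption. */
#define SKETCH_NOALIAS restrict

typedef struct { long values[4]; } Big;

/* `src` is an immutable by-reference parameter: the callee only reads
   through it. The qualifier promises that the object it points to is not
   modified through any other pointer (such as `dst`) during the call. */
void add_into(long *dst, const Big *SKETCH_NOALIAS src) {
    for (size_t i = 0; i < 4; i++) {
        *dst += src->values[i];
    }
}
```

The promise lets the C compiler keep `src->values` in registers across the writes to `*dst`, instead of reloading after each store.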

To-Do

* upstream all changes that can be moved out into separate PRs
* make a separate fix for some of the --expandArc-output related test failures

Notes For Reviewers

The code is quite old and went through multiple major refactors (for example, mir2cg once used the subTree approach to tree construction). I've made multiple Q/A passes over it, but given the time I've spent working on the changes, it's likely that I've become blind to some issues.

Points Of Interest:

* cgir2; contains the type definitions and traversal code for the CGIR
* validation; implements all validation logic. It's not pretty, as both the grammar and type checker are crammed into the module
* mir2cg; implements the MIR -> CGIR translation. Given the size of the language that falls out of the MIR stage, this module is enormous, with some of its logic moved into companion modules (rtti, mirflow, and mirtypes2cg)
* cgen; implements both the CGIR -> C translation and the pass for generating a C translation unit description from a CGIR module
* cgbackend; implements the generic backend

* remove the array-in-struct wrapping; only emit array typedefs
* remove the obsolete `cnkPtrToArrayTy` and `cnkFlexField` handling
* translate pointers-to-inline-array to pointers-to-element types; taking the address of an array lets the lvalue "decay"
* handle inline array types properly (they only appear in field declarations)
* use a common procedure for emitting non-function C declarations

A proper memcopy can only be emitted when the source expression operand is a proper `Expr` (otherwise taking the address is not possible), so the `genAsgn` overload taking an `Expr` value has to be used whenever a memcopy might be necessary.

zerbina · 2025-10-14T01:27:18Z

In C, arrays are not first class types: they're implicitly converted to pointers to their first element, array declarations decay to pointers in parameter positions, and returning arrays from a procedure (without a pointer indirection) is not possible.
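The contrast can be sketched in a few lines of C. `set_first`, `IntPair`, and `with_first` are hypothetical names used only for this illustration: an array parameter is adjusted to a pointer, so no copy is made, whereas an array wrapped in a struct gets value semantics.

```c
/* The parameter is declared as an array, but is adjusted to `int *`:
   the callee works on the caller's storage, no copy is made. */
void set_first(int arr[2], int value) {
    arr[0] = value;
}

/* Wrapping the array in a struct gives it value semantics. */
typedef struct { int data[2]; } IntPair;

IntPair with_first(IntPair p, int value) {
    p.data[0] = value; /* modifies the callee's copy only */
    return p;          /* structs, unlike arrays, can be returned by value */
}
```

Calling `set_first` on a local array changes that array in the caller, while `with_first` leaves its argument untouched and hands back a modified copy.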

To keep mirgen simpler, I had previously opted to implement them as C arrays wrapped in structs in cgen (with mir2cg always emitting identified array types, never inlining them). This worked very well, but has one major flaw that I didn't consider: imported types.

{.emit: "/*TYPESECTION*/ struct Foreign { int x[2]; };".}

type Foreign {.importc: "struct Foreign", nodecl.} = object
  x: array[2, cint]

In the declaration of x, array[2, cint] actually refers to a C array. There's no distinction between NimSkull and C arrays in either the source language or the MIR, so C arrays can be used where NimSkull arrays are expected, and vice versa.
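On the C side, the mismatch looks roughly like the sketch below. `WrappedArr`, `sum`, and `sum_foreign` are hypothetical names; the exact shape of the wrapper type the code generator used is an assumption.

```c
#include <string.h>

/* The foreign declaration from the example above. */
struct Foreign { int x[2]; };

/* Hypothetical shape of a struct-wrapped array type. */
typedef struct { int data[2]; } WrappedArr;

/* A routine compiled against the wrapped representation. */
int sum(WrappedArr a) {
    return a.data[0] + a.data[1];
}

/* `f->x` has type `int[2]`, not `WrappedArr`, so it cannot be passed to
   `sum` directly; a conversion (here: a copy) is needed at the boundary. */
int sum_foreign(const struct Foreign *f) {
    WrappedArr tmp;
    memcpy(tmp.data, f->x, sizeof tmp.data);
    return sum(tmp);
}
```

Since nothing in the source language or the MIR marks `x` as a C array, the code generator would have no reliable way to know where such boundary conversions belong.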

There are multiple ways to address this problem, but for now, I've simply opted for the same approach as the previous code generator, namely to translate NimSkull/MIR arrays directly to C arrays and to accommodate the array limitations in the code generator (mostly mir2cg). Having pass-by-value arrays would have allowed for some sink improvements, and the struct wrapping also fixed:

type Obj = object
  x: ptr array[2, array[2, Obj]]
# the C code generated for `Obj` doesn't compile

so this is quite unfortunate.
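The two main accommodations for plain C arrays can be sketched as follows. `Arr2`, `assign_arr`, and `make_pair` are hypothetical names, and `memcpy` stands in for whatever copy routine the generated code calls.

```c
#include <string.h>

typedef int Arr2[2];

/* `*dst = *src` does not compile for array types, so a memory copy
   is used in place of an assignment. */
void assign_arr(Arr2 *dst, Arr2 *src) {
    memcpy(*dst, *src, sizeof(Arr2));
}

/* A function cannot return `Arr2`, so the result is written through
   an out parameter instead. */
void make_pair(int a, int b, Arr2 *result) {
    (*result)[0] = a;
    (*result)[1] = b;
}
```

A caller declares the destination array itself and passes its address, for example `Arr2 r; make_pair(1, 2, &r);`.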

Still, the changes also allowed for simplifying the CGIR type system a bit, by removing the dedicated pointer-to-array type and flexible struct fields (both are subsumed by zero-length arrays, inspired by LLVM).
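For readers more familiar with C than with LLVM, the closest C-level analogues are flexible array members and pointers to arrays of unknown length. The sketch below shows both; `IntSeq`, `new_seq`, and `elem_at` are hypothetical names, and how the CGIR's zero-length arrays are actually rendered to C is not stated here.

```c
#include <stdlib.h>

/* A flexible array member (C99): the trailing array has no fixed length. */
typedef struct {
    size_t len;
    int data[];
} IntSeq;

/* Allocates a sequence with room for `len` elements and fills it with
   0, 1, 2, ...; returns NULL if the allocation fails. */
IntSeq *new_seq(size_t len) {
    IntSeq *s = malloc(sizeof(IntSeq) + len * sizeof(int));
    if (s != NULL) {
        s->len = len;
        for (size_t i = 0; i < len; i++) {
            s->data[i] = (int)i;
        }
    }
    return s;
}

/* A pointer to an array of unknown length accepts arrays of any length. */
int elem_at(int (*arr)[], size_t i) {
    return (*arr)[i];
}
```

In both cases the length lives outside the array type, which is why a single "array without a static length" concept can cover them.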

zerbina · 2025-10-14T02:01:06Z

I've split out all commits tagged with [upstream], except for 4ff9ec3, e9e50c3, and 4a185e3. These changes are either too small, require too many changes to code made obsolete by this PR, or cannot be easily tested/explained.

An `if` followed by a `scope` can still "contain" defs that are not "scoped", as it's possible for there to still be defs between the end of the scope and the end of the `if`. Aside: using a pseudo basic-block representation instead of a real one was a major mistake. Oh well.

zerbina · 2025-10-15T02:32:01Z

Some time measurements performed on Windows using

hyperfine --warmup 1 "<exe> --compileOnly --verbosity:0 --hints:off --warnings:off compiler/nim.nim"

on the compiler sources at devel. The C compiler used is MinGW GCC 11.1.0.

| Compiler | Built By | Time Taken |
| --- | --- | --- |
| `3550429` | devel | 14.574 s ± 0.143 s |
| devel | devel | 15.337 s ± 0.144 s |
| `3550429` | `3550429` | 15.240 s ± 0.167 s |

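These results show that the new backend is significantly faster at producing C code (14.574 s vs. 15.337 s when both compilers are built by devel, about 5% less time), but with the produced C code being a lot slower (15.240 s vs. 14.574 s for the same compiler built by itself instead of by devel, about 4.6% more time). A surface-level profiling at some earlier point during development only yielded that everything became a little slower, and I haven't taken a deeper look at it yet.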

My guess is that the .compilerproc error handling or the out parameter change is the culprit.

zerbina added 30 commits October 1, 2025 23:56

[upstream] typemaps: properly hash proc types

01730f9

[upstream] mirtypes: properly handle recursive imported types

f1e1ca6

Imported object types that contain a pointer to themselves weren't handled properly, leading to the interior types symbols pointing to the internal object type, not the imported type.

[upstream] mirgen: handle views into varargs properly

4ff9ec3

[upstream] sigmatch: use proper container for all varargs

4a302fc

[upstream] unreachable_elim: fix bug

b5dac87

[upstream] modulelowering: split top-level emits

bad38cc

[upstream] tests: update tstatic_with_converter

e0faeeb

The section specifier wasn't at the start of the string, meaning it was placed into the procedure section.

[upstream] manual: update the top-level emit documentation

7696fbe

Also adjust the related specification test.

[upstream] mirtypes: remove offset tracking

0a09a45

It's not used for anything, nor will it be needed.

[upstream] mirtypes: remove tagged union types

90d7653

Instead of a dedicated tagged union type, tagged union support is now provided by allowing union fields to be associated with tag fields.

[upstream] mirtypes: make the embedded state part of fields

94e06a2

[upstream] mirpasses: implement type header init injection

e9e50c3

It's misplaced, but injecting the initialization earlier is currently not possible.

[upstream] mirtypes: implement tagged union branch query

978d26a

[upstream/finish] rtchecks: implement bound checks

ef88353

utils: fix MIR rendering failing for type header access

17f22ea

There's no field at position -1 in PType-based type representation, which previously caused rendering to crash.

[upstream] extccomp: fix CC hint not working as intended

ddb8ac2

[upstream] vm: fix adding openArray to NimNode crashing

7851c59

[upstream] liftdestructors: use proper nil check

e9f173f

mirtypes: add paramType function

d961854

[upstream] sem: fix typing with tuples containing views

7772f6f

[upstream] lambdalifting: mark env as final

a267332

Allows for static cleanup of the environment in some cases.

[upstream] mirtypes: add dedicated proc signature types

c0f1070

[upstream] tailcall_elim: fix callConv lookup

4e4da5a

[upstream] proto_mir: consider C aliases for conversions

4a185e3

[upstream] mirgen: use a better shape for mOffsetOf

7d64ccd

The new shape is a lot more robust and also easier to parse.

[upstream] types: consider resolved type classes

e3f8eab

mirtypes: don't use canonical type in lowerField

622540c

Lookup in generic instance types won't return the expected field ID otherwise.

containers: fix parameter types of []=

2b95ac4

remove the previous C code generator

b95b383

remove ndi.nim

146bb07

Nim Debug Information files are not generated anymore, making the module for creating them obsolete.

zerbina added 12 commits October 14, 2025 00:32

mir2cg: remove leftover debug code

bdedb8e

cgir: remove flexible members and the ptr-to-array type

863d5c6

Both concepts are now represented via zero-length arrays

mirtypes2cg: use anon array types as pointer targets

3b793d5

This is meant as an accommodation for the C code generator.

types: also use pass-by-reference for large sets

3e49148

They're arrays underneath, which cannot be passed by value in C (without extra code generator support).

mir2cg: use nimCopyMem for array assignments

0a8da13

Now that CGIR arrays are translated to C arrays, using normal assignments for CGIR arrays no longer works.

mir2cg: return arrays via an out parameter

03a51db

Not pretty, but it's required now that CGIR arrays are translated directly to C arrays.

cgen: translate nullary procs correctly

88af3fd

The parameter list for functions without any parameters must be `void` prior to C23.

cboot: add a missing semicolon

0188271

validation: fix equal for array types

6912fc3

validation: rename elemType to pointeeType

ef18c68

zerbina added 3 commits October 14, 2025 01:32

Merge branch 'devel' into new-code-gen-architecture

d081ee7

post merge fix-up

fef9ec6

mirtypes: use Continuation type as .tailcall return type

7d91561

Replacing the return type of .tailcall procedures with the `Continuation` is tricky to do in `mir2cg`, as it would happen during type translation, which doesn't have access to a mutable type environment (and neither should it).

zerbina added 8 commits October 14, 2025 15:31

more post-merge fixups

333996a

backends: run discovery after bekEmit events

fe81c12

Emit event handling may cause new entities to be registered with the MIR environment, which also have to be queued for processing.

tests: disable ttlsemulation.nim

1aead73

Only parts of the test matrix fail (the ones using `--tlsemulation:on`), so the test is simply disabled wholesale.

cgir: only allow left-shift for uint values

9ef76b2

mir2cg: update mShlI translation

21c6350

mir2cg: fix mnkToSlice translation

a2a5346

When the slice length is zero, the array pointer must not be accessed.

chcks: use sysAssert directly in chckBounds

81ca324

`.compilerproc`s are treated as never raising when used in source code, leading to the necessary error handling being omitted.

zerbina force-pushed the new-code-gen-architecture branch from ae5ed40 to 453370a on October 14, 2025 21:07

zerbina added 3 commits October 14, 2025 23:58

mir2cg: support constant union constructions

7478e03

Quite an edge case, with undefined behaviour.

mir2cg: add missing ptrcast to procval->closure lowering

7e84d69

mir2cg: fix wrong alignment being passed to alignedDealloc

3550429
