WIP: rewrite the C code generator #1333

zerbina · 2024-06-05T00:45:08Z

This PR is a from-scratch rewrite of the C code generator. It's the
successor to #424, incorporating many ideas and lessons learned from
the latter. The concrete goals are to:

* significantly speed up the C code generation phase
* have a significantly simpler and more modular implementation
* get rid of the architectural concessions that had to be made for
  the legacy C code generator
* improve the output C code (less of it, more efficient, easier
  to inspect)

A big focus is on using a data-oriented design. The planned
architecture is as follows:

* the code generator gathers the MIR bodies of all alive entities
* each MIR body is translated to an intermediate representation (the
  CIR)
* the intermediate representation is an AST-like IR closely resembling
  C's syntax
* the code generator itself doesn't handle tracking of dependencies
  (types, procs, etc.) -- the orchestrator does
* once the CIR for all bodies in the program is generated, they're
  rendered into textual C code and assembled into complete C files,
  which are then written to disk

Using an intermediate representation has multiple benefits over
directly translating to text:

* smaller memory footprint, especially for larger expressions with many
  duplicate identifiers
* fewer temporary string allocations; one single character buffer can be
  used at the end
* the IR can store semantic information (e.g., referenced procedures,
  types, etc.)
* rendering is decoupled from translation (more modular, and the code
  generator doesn't have to worry about formatting)
* small peephole syntax optimizations are possible (e.g., collapsing
  `(*a).x` to `a->x`)
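
To make the listed benefits a bit more concrete, here is a minimal sketch in Python (the real implementation lives in the compiler and will look different). All names in it (`Kind`, `Node`, `Idents`, `peephole`, `render`, `to_c`) are made up for illustration and are not the PR's actual CIR types. It shows identifiers being interned once, rendering into a single shared buffer, and the `(*a).x` to `a->x` collapse as a tree rewrite:

```python
from dataclasses import dataclass
from enum import Enum, auto


class Kind(Enum):
    IDENT = auto()   # a plain identifier
    DEREF = auto()   # *operand
    MEMBER = auto()  # operand.field
    ARROW = auto()   # operand->field


@dataclass(frozen=True)
class Node:
    kind: Kind
    ident: int = -1              # index into the identifier table
    operand: "Node | None" = None


class Idents:
    """Interning table: every distinct name is stored exactly once."""

    def __init__(self):
        self.names = []
        self.index = {}

    def intern(self, name):
        if name not in self.index:
            self.index[name] = len(self.names)
            self.names.append(name)
        return self.index[name]


def peephole(n):
    """Rewrite `(*a).x` into `a->x`, working bottom-up."""
    if n.operand is None:
        return n
    inner = peephole(n.operand)
    if n.kind is Kind.MEMBER and inner.kind is Kind.DEREF:
        return Node(Kind.ARROW, n.ident, inner.operand)
    return Node(n.kind, n.ident, inner)


def render(n, idents, buf):
    """Append the C text for `n` to the shared buffer `buf`."""
    if n.kind is Kind.IDENT:
        buf.append(idents.names[n.ident])
    elif n.kind is Kind.DEREF:
        buf.append("(*")
        render(n.operand, idents, buf)
        buf.append(")")
    else:  # MEMBER or ARROW
        render(n.operand, idents, buf)
        buf.append("." if n.kind is Kind.MEMBER else "->")
        buf.append(idents.names[n.ident])


def to_c(n, idents):
    buf = []  # one buffer for the whole expression
    render(n, idents, buf)
    return "".join(buf)


idents = Idents()
a, x = idents.intern("a"), idents.intern("x")
expr = Node(Kind.MEMBER, x, Node(Kind.DEREF, operand=Node(Kind.IDENT, a)))
print(to_c(expr, idents))
print(to_c(peephole(expr), idents))
```

Because the rewrite works on the tree rather than on text, it needs no string matching, and the renderer stays unaware of it.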

The new code generator operates directly on the MIR; no full in-between
IR (like the CGIR) is used. Prior to code generation, the MIR is
lowered to a degree where:

* translating it to C code is straightforward
* the code generator doesn't have to discover or synthesize new alive
  entities

The PR is a work in progress. While the broad design and direction are
likely final already, many details are most likely going to change.

The general structure is similar to the old `cbackend`, but with two important differences:

* the global and per-module types are now owned by the orchestrator, not `cgendata`
* the output (i.e., the C files) is funnelled through a dedicated type (`Output`)

It works much like the previous version, but with more generalized support for header files. Compared to before, all the write-to-disk management is now fully handled by the orchestrator, not the code generator (i.e., `cgen`). The compiler compiles again (but the result cannot compile the compiler, for obvious reasons).

zerbina · 2024-06-05T00:50:47Z

To help me during development, I've added a small profiling utility: the `measure` module, which provides the `measure` template. Once the PR is finished, I'll remove it again.

Various key procedures are instrumented with the `measure` template, which counts the number of runs, the time taken, and (optionally) the number of allocations/deallocations. When compilation is done, the counters are both echoed and written to an SQLite database.
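
As a rough illustration of the idea (not the actual `measure` module, which is a template in the compiler), the following Python sketch counts runs and accumulates wall-clock time per name, then echoes the counters and stores them in SQLite. The `counters` dictionary, the `report` function, and the table layout are assumptions made for this example; allocation counting is left out.

```python
import sqlite3
import time
from contextlib import contextmanager

# name -> [number of runs, accumulated seconds]
counters = {}


@contextmanager
def measure(name):
    """Count one run of the enclosed block and add its wall-clock time."""
    start = time.perf_counter()
    try:
        yield
    finally:
        entry = counters.setdefault(name, [0, 0.0])
        entry[0] += 1
        entry[1] += time.perf_counter() - start


def report(db_path=":memory:"):
    """Echo all counters and store them in an SQLite database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS measurements "
        "(name TEXT, runs INTEGER, seconds REAL)"
    )
    for name, (runs, seconds) in sorted(counters.items()):
        print(f"{name}: {runs} runs, {seconds:.6f}s")
        conn.execute(
            "INSERT INTO measurements VALUES (?, ?, ?)", (name, runs, seconds)
        )
    conn.commit()
    return conn
```

Instrumenting a procedure then amounts to wrapping its body in `with measure("someName"):`, and calling `report()` once at the end of the run.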

Everything that is only needed within a single module is stored in `BModule`; things that are shared are stored globally (in `BModuleList`). This keeps the scopes of local entities small, and will make it easy to free memory early (by destroying a `BModule` instance once the C code for it has been generated).

Some details are still missing, but the general flow is there. CIR is generated for the various entities and then put into either the global or the module-local AST. Once all CIR has been generated, `assemble` gathers everything the TU needs into a single place and renders the result.
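
The split between per-module and shared state, and the role of `assemble`, can be sketched roughly as follows. This is a simplified Python model, not the compiler's code: the fields of `BModule` and `BModuleList`, the `finish_module` helper, and the use of plain strings in place of CIR are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class BModule:
    """State that is only needed while one module is being processed."""
    name: str
    local_code: list = field(default_factory=list)  # module-local code (text here)
    uses: set = field(default_factory=set)           # shared entities the TU needs


@dataclass
class BModuleList:
    """State shared between all modules."""
    shared_code: dict = field(default_factory=dict)  # entity name -> code
    modules: dict = field(default_factory=dict)       # module name -> BModule


def assemble(g, m):
    """Gather everything the TU for `m` needs and render it."""
    parts = [g.shared_code[name] for name in sorted(m.uses)]
    parts.extend(m.local_code)
    return "\n".join(parts)


def finish_module(g, name, output):
    """Assemble one module and keep the result only if it is non-empty."""
    m = g.modules.pop(name)  # the per-module state can be dropped right away
    code = assemble(g, m)
    if code:
        output.append((name + ".c", code))
```

A module that contributes no code leaves `output` untouched, so no C file would be written for it.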

zerbina · 2024-06-14T19:07:37Z

I've implemented the basic code orchestration flow. It's rather simple, especially compared to before: the orchestrator runs common backend processing (`backends`), which produces the events driving code generation (`processEvent`). Once all events have been processed (and thus all code generated), the TU's dependencies (types, procedure declarations, inline procedures, etc.) are gathered, and everything is rendered as C code.

Except for registering some new identifiers, the new C code generator itself doesn't modify any global state -- it simply takes a MIR body and outputs the CIR for it. No new entities are registered with the MIR environment. This makes it possible to handle inline procedures fully within the orchestrator, which in turn renders the complicated "first seen in module" tracking in backends obsolete.
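
A toy model of this flow, written in Python purely for illustration: `Event`, `gen_proc`, and `Orchestrator` are invented names, MIR bodies are replaced by plain strings, and inline procedures are assumed to have no dependencies of their own. The point is only that the code generator stand-in is a pure function, while the orchestrator alone decides where declarations and inline procedures end up.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    module: str        # module the procedure is attached to
    proc: str          # procedure name
    body: str          # stand-in for a MIR body
    calls: tuple = ()  # names of the procedures the body references
    inline: bool = False


def gen_proc(ev):
    """Stand-in for the code generator: body in, C text out, no global state."""
    return f"void {ev.proc}(void) {{ {ev.body} }}"


@dataclass
class Orchestrator:
    defs: dict = field(default_factory=dict)    # module -> procedure definitions
    deps: dict = field(default_factory=dict)    # module -> referenced procedures
    inline: dict = field(default_factory=dict)  # inline procedure -> definition

    def process_event(self, ev):
        code = gen_proc(ev)
        if ev.inline:
            # not tied to a module; placed when dependencies are gathered
            self.inline[ev.proc] = "static inline " + code
        else:
            self.defs.setdefault(ev.module, []).append(code)
            self.deps.setdefault(ev.module, set()).update(ev.calls)

    def render(self, module):
        """Gather the TU's dependencies, then render everything as C text."""
        lines = []
        for name in sorted(self.deps.get(module, ())):
            if name in self.inline:
                lines.append(self.inline[name])     # copy into each user
            else:
                lines.append(f"void {name}(void);")  # plain declaration
        lines.extend(self.defs.get(module, []))
        return "\n".join(lines)
```

Since an inline procedure is simply copied into every TU that references it at render time, nothing in this model needs to remember which module saw the procedure first.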

Finally, the new orchestrator also addresses the issue of ostensibly small changes in one module causing many modules (sometimes the whole project) to be recompiled, something which got exacerbated when the code generation orchestrator architecture was first introduced.

zerbina · 2024-06-24T19:40:31Z

The next big blocker is the missing type IR for the MIR. It's the basis for lowering seq, string, openArray, and set, and there are also some type-related architectural unknowns (e.g., how type dependencies of a procedure are tracked exactly) that can only be figured out with the MIR type IR in place.

With the MIR's type IR in place, further work on both cgen and the various type-related lowerings that the new cgen needs can happen concurrently. Lowering seq, string, etc. doesn't depend on the new code generator, meaning that it can be implemented in separate PRs.

zerbina · 2024-08-08T22:39:01Z

The list of things that need to be moved outside the code generator (via separate PRs) before the rewrite can be completed:

* creation of RTTI objects
* RTTI header initialization for objects and refs
* lowering complex finallys into gotos (mirgen: lower finally #1468)
* turning non-integer casts into memcpys (mir: lower cast with MIR pass #1411)
* lowering of set operations
* lowering of seq and string operations
* lowering of openArray operations
* stack-trace handling (i.e., injecting nimln, nimFrame, and popFrame calls)

zerbina added 7 commits June 4, 2024 23:46

sketch out the initial design for the C IR

03b1de1

implement the CIR formatter

2ba2693

There's not much to it. The code could be shortened a bit using templates, but that can happen at a later point. The definition of `CodeGenEnv` is hand-waved into the future.

add some temporary profiling facilities

6db0d2a

They're meant to be easy to use and have low overhead.

get a clean slate

daa5e54

All relevant C code generator modules are suffixed with a "2", in order to make room for the new modules. They're not yet removed, so that their code can still be referenced easily.

restore the IC integration

a014eb6

zerbina added the refactor, compiler/backend, and simplification labels Jun 5, 2024

zerbina added this to the C backend rework milestone Jun 5, 2024

zerbina mentioned this pull request Jun 9, 2024

mir: remove the tree delimiter nodes #1334

Merged

zerbina added 12 commits June 14, 2024 14:37

Merge branch 'devel' into rework-the-c-code-generator

9ed6bf0

cir: use CIdentifier

ecd2b30

`CNode` erroneously used a raw `uint32` for `ident`.

sketch out the basic cgen interface

f72ef94

cgendata: store the entity names in CodeGenEnv

0c6761d

The simplest solution for now. Moving them to a separate type might be better, but that can happen later.

cformat: make the module compile

3f25ef2

Some field names were outdated.

instrument some key procedures

17398c3

This also includes some mid-end processing, like destructor call optimizations, in order to get a better relative feel for where time is spent.

mirbodies: implement append

4383a81

The orchestrator will need it to concatenate partial MIR bodies.

cbackend: remove MirEnv instance from BModuleList

351d099

The MIR environment is owned by the `CodeGenEnv` now.

cbackend: append assembled C code to output

d6dec74

Simple: if assembling produced some code, append it to output list, otherwise don't. In other words, much like before, no C file is created for modules that don't result in any code.

cgen: emit some placeholder AST

8fde6c1

The genX procedures are expected to output at least *something*, otherwise sadness ensues, so an empty block is temporarily emitted.

This was referenced Jun 26, 2024

lower string/float case statements with MIR pass #1360

Merged

remove the .goto pragma/feature #1363

Merged

mirpasses: split assignments with MIR pass #1366

Merged

Merge branch 'devel' into rework-the-c-code-generator

9d4d5b2

zerbina mentioned this pull request Oct 17, 2024

WIP: a re-implementation of the compiler backend #424

Closed