Super efficient way of embedding files or binary data in OCaml executables #21

Open
kayceesrk opened this issue Feb 19, 2024 · 4 comments


@kayceesrk
Contributor

  • There is ocaml-crunch, which takes files, splits them into chunks, and generates an OCaml source module with an API. It's not very efficient: the files have to be parsed at compile time and recomposed when accessed. Room for improvement: avoid generating OCaml code that contains the data to be embedded.

  • Idea: implement that with magic from the compiler, Dune, or cppo. Make files available as Bytes, in a module created at compile time.

  • C23 will have an #embed preprocessor directive (under the hood, it's the linker's job). The data is not parsed, and it is available as a static const array of bytes. Check out how Rust or Golang do it.
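
To make the cost described above concrete, here is a minimal sketch of the two shapes being compared. The `Generated` and `Direct` module names and their contents are hypothetical; this is not ocaml-crunch's actual output, only an illustration of "chunks recomposed on access" versus "data available directly".

```ocaml
(* Hypothetical shape of a generated module: the file is stored as a list
   of string-literal chunks. This is NOT ocaml-crunch's real output. *)
module Generated = struct
  let chunks = [ "Hello, "; "embedded "; "world!" ]

  (* Reading the file back allocates a fresh string and copies every chunk. *)
  let read () = String.concat "" chunks
end

(* What the issue asks for instead: the bytes available as one value, with
   no recomposition step. A plain literal stands in here for data that the
   compiler or linker would place in the executable. *)
module Direct = struct
  let data : string = "Hello, embedded world!"
end

(* Both views expose the same contents; only the access cost differs. *)
let () = assert (String.equal (Generated.read ()) Direct.data)
```

In the chunked version every access pays for a concatenation, and the compiler still has to lex each literal; the direct version only avoids the first cost unless the data also bypasses the parser.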

@kayceesrk kayceesrk converted this from a draft issue Feb 19, 2024
@hannesm

hannesm commented May 18, 2024

I somehow came across this issue, and there are some other utilities available in OCaml:

I use(d) both (actually all three approaches, so crunch as well) in different applications, and they seem to work nicely. With caravan you have to hope that the linker (or strip) isn't working against you.

As for your description of ocaml-crunch, I'd appreciate a PR that removes the chunking. The API exists because crunch is meant for embedding entire directories (not only a single file). It also satisfies the mirage-kv API, so it can act as a key-value store.
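
As a rough illustration of why an embedded directory wants an API rather than a single value, here is a minimal sketch of a read-only key-value view. The `Assets` module, its function names, and the file contents are all made up for this example; the real mirage-kv interface is different (it is not reproduced here).

```ocaml
(* Hypothetical embedded directory: each entry maps a path to its contents. *)
module Assets = struct
  let files =
    [ ("index.html", "<h1>hi</h1>");
      ("css/site.css", "body{}") ]

  (* Look up one file by path; None if it was not embedded. *)
  let get (key : string) : string option = List.assoc_opt key files

  (* List every embedded path. *)
  let list () : string list = List.map fst files

  (* Size in bytes of one embedded file, if present. *)
  let size (key : string) : int option = Option.map String.length (get key)
end
```

A single embedded file needs none of this, which is presumably where a chunk-free fast path would fit.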

That said, I don't know your use case, or why "super efficient" is crucial.

@kayceesrk
Contributor Author

CC @MisterDA who suggested the task originally.

@reynir

reynir commented May 21, 2024

I experimented with a malfunction-based replacement of crunch. My motivation was that crunch could, IIRC, be a bit slow due to the parsing of very large string literals. I got stuck on how to use it with, e.g., dune. Unfortunately, I think the code is on the drive of my now-dead laptop (possibly recoverable).

@MisterDA

caravan isn't portable to macOS or Windows. ppx_blob is more-or-less conceptually equivalent to crunch.

A first step could be to expose binary data as string or bytes (think embedded assets, CSS, JS, images…). Without special support, the compiler has to lex and parse a potentially long string literal, handling escape sequences along the way.
A second step could be to expose it as an array of integers (think neural network weights, numerical data…). This is more interesting, because usually each integer is parsed by the compiler and generates a node in the AST with location information, which easily becomes extremely costly.
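
Until such compiler support exists, one way to avoid one AST node per integer is to embed the numbers as a single byte string and decode them at run time. The sketch below assumes the data is a sequence of little-endian 32-bit integers; the `blob` literal and the `ints_of_blob` helper are hypothetical, and `String.get_int32_le` requires OCaml 4.13 or later.

```ocaml
(* Stand-in for an embedded blob: three little-endian 32-bit integers
   (1, 2 and 256). In the proposal, these bytes would come from a file. *)
let blob = "\x01\x00\x00\x00\x02\x00\x00\x00\x00\x01\x00\x00"

(* Decode the blob into an int array at run time. The compiler only sees
   one string literal, not one AST node per integer. Trailing bytes that
   do not form a full 32-bit word are ignored. *)
let ints_of_blob (s : string) : int array =
  let n = String.length s / 4 in
  Array.init n (fun i -> Int32.to_int (String.get_int32_le s (i * 4)))

let weights = ints_of_blob blob
```

This trades compile-time parsing for a run-time decoding pass and a second copy of the data, so it is a workaround rather than the "super efficient" solution the issue is after.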

Loosely related, there was ocaml/ocaml#10654, which used incbin to add debug information to complete bytecode executables, sidestepping the C compiler, which can handle debug info neither as strings nor as integer arrays.

The interface could be exposed as an extension node taking a filename as a parameter, just like ppx_blob, but the compiler should not process the binary data as part of its AST. The C standard defines the parameters limit, prefix, suffix and if_empty (#embed). We'd need another parameter selecting the type as which the data should be exposed, say string, bytes, int array, …
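
To make the four C23 parameters concrete, here is a small run-time model of how they could combine, written as an ordinary function over a string. The `embed` function is hypothetical and is not a proposed compiler interface. In C the prefix, suffix and if_empty arguments are token sequences, whereas this sketch treats them as plain strings. It also assumes that emptiness is decided after the limit is applied, so a limit of 0 selects if_empty.

```ocaml
(* Run-time model of the #embed parameters, under the assumptions above. *)
let embed ?limit ?(prefix = "") ?(suffix = "") ?(if_empty = "")
    (data : string) : string =
  (* limit: keep at most [n] bytes of the resource. *)
  let data =
    match limit with
    | Some n when n >= 0 && n < String.length data -> String.sub data 0 n
    | _ -> data
  in
  (* if_empty replaces the whole result when nothing is left;
     prefix and suffix only apply to a non-empty resource. *)
  if String.length data = 0 then if_empty else prefix ^ data ^ suffix
```

For example, `embed ~limit:3 "abcdef"` yields `"abc"`, and `embed ~limit:0 ~if_empty:"none" "abc"` yields `"none"`. The extra type-selection parameter mentioned above is not modelled here.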
