Pronounced like fiasco, but with a t instead of a c
(F)ormulas (I)n (AST) (O)ut
A language-agnostic, modern Wilkinson's formula parser and lexer.
This library is in testing and actively changing.
Formula parsing and materialization (making a model matrix from a formula) are normally done in a single library.
Python, for example, has patsy/formulaic/formulae which all do parsing & materialization.
R's model.matrix also handles formula parsing and design matrix creation.
There is nothing wrong with this coupling, but I wanted to try decoupling parsing from materialization. A focused parsing library could then be reused across multiple languages and dataframe libraries. This package has one clear job: parse and/or lex formulas and return structured JSON metadata. Note: technically an AST is not returned; a simplified, structured intermediate representation (IR) in the form of JSON is returned. This JSON IR ought to be easy for many language bindings to consume.
- Make a Rust sandbox:

```shell
cargo new try_fiasto
cd try_fiasto
cargo add fiasto
cargo add serde_json
```

- Try an example formula, then `cargo run`:
```rust
use fiasto::parse_formula;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Test with a simpler formula
    let simple_input = "y ~ x + z";
    println!("Testing with simpler formula:");
    println!("Input: {}", simple_input);
    println!();

    let simple_result = parse_formula(simple_input)?;
    println!("FORMULA METADATA (as JSON):");
    println!("{}", serde_json::to_string_pretty(&simple_result)?);
    Ok(())
}
```

Running the example prints:

Testing with simpler formula:
Input: y ~ x + z
FORMULA METADATA (as JSON):

```json
{
  "all_generated_columns": [
    "y",
    "intercept",
    "x",
    "z"
  ],
  "all_generated_columns_formula_order": {
    "1": "y",
    "2": "intercept",
    "3": "x",
    "4": "z"
  },
  "columns": {
    "x": {
      "generated_columns": [
        "x"
      ],
      "id": 2,
      "interactions": [],
      "random_effects": [],
      "roles": [
        "Identity"
      ],
      "transformations": []
    },
    "y": {
      "generated_columns": [
        "y"
      ],
      "id": 1,
      "interactions": [],
      "random_effects": [],
      "roles": [
        "Response"
      ],
      "transformations": []
    },
    "z": {
      "generated_columns": [
        "z"
      ],
      "id": 3,
      "interactions": [],
      "random_effects": [],
      "roles": [
        "Identity"
      ],
      "transformations": []
    }
  },
  "formula": "y ~ x + z",
  "metadata": {
    "family": null,
    "has_intercept": true,
    "has_uncorrelated_slopes_and_intercepts": false,
    "is_random_effects_model": false,
    "response_variable_count": 1
  }
}
```

The library exposes a clean, focused API:
- `parse_formula()`: Takes a Wilkinson's formula string and returns structured JSON metadata
- `lex_formula()`: Tokenizes a formula string and returns JSON describing each token

"Only two functions?! What kind of library is this?!" An easy-to-maintain library with a small surface area. The best kind.
The parser returns a variable-centric JSON structure where each variable is described with its roles, transformations, interactions, and random effects. This makes it easy to understand the complete model structure and generate appropriate design matrices. wayne is a Python package that can take this JSON and generate design matrices for use in statistical modeling.
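To illustrate the decoupling, here is a minimal sketch (Python, standard library only) of how a downstream materializer could consume the JSON metadata. It hard-codes an abbreviated version of the `y ~ x + z` output shown above; this is an illustration, not wayne's actual API:

```python
import json

# Abbreviated parse_formula() metadata for "y ~ x + z" (see the full
# output above); hard-coded here so the sketch is self-contained.
metadata = json.loads("""
{
  "all_generated_columns_formula_order": {
    "1": "y", "2": "intercept", "3": "x", "4": "z"
  },
  "columns": {
    "x": {"roles": ["Identity"]},
    "y": {"roles": ["Response"]},
    "z": {"roles": ["Identity"]}
  },
  "metadata": {"has_intercept": true, "response_variable_count": 1}
}
""")

# Column order for the design matrix, recovered from the formula-order
# map (keys are stringified positions, so sort them numerically).
order = metadata["all_generated_columns_formula_order"]
columns = [order[k] for k in sorted(order, key=int)]

# Split response vs. predictor columns using each variable's roles.
responses = [name for name, col in metadata["columns"].items()
             if "Response" in col.get("roles", [])]

print(columns)    # ['y', 'intercept', 'x', 'z']
print(responses)  # ['y']
```

Any language with a JSON parser can do the same, which is the point of keeping the parser separate from the materializer.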
- Comprehensive Formula Support: Full R/Wilkinson notation including complex random effects and intercept-only models
- Variable-Centric Output: Variables are first-class citizens with detailed metadata
- Advanced Random Effects: brms-style syntax with correlation control and grouping options
- Intercept-Only Models: Full support for `y ~ 1` and `y ~ 0` formulas with proper metadata generation
- Multivariate Models: Full support for `bind(y1, y2) ~ x` formulas with multiple response variables
- Pretty Error Messages: Colored, contextual error reporting with syntax highlighting
- Robust Error Recovery: Graceful handling of malformed formulas with specific error types
- Language Agnostic Output: JSON format for easy integration with various programming languages
- Comprehensive Documentation: Detailed usage examples and grammar rules
- Comprehensive Metadata: Variable roles, transformations, interactions, and relationships
- Automatic Naming For Generated Columns: Consistent, descriptive names for transformed and interaction terms
- Efficient tokenization: using one of the fastest lexer generators for Rust (logos crate)
- Fast pattern matching: using match statements and enum-based token handling. Rust match statements are zero-cost abstractions.
- Minimal string copying: with extensive use of string slices (`&str`) where possible
- Cross-Platform Model Specs: Define models once, implement in multiple frameworks
I can't think of every kind of formula that could be parsed, but I do have a checklist to start with.
To my knowledge, the brms formula syntax is the most complex and possibly the most complete.
I would like to start with it as a baseline and continue to extend as needed.
I also offer a clean_name for each parameter. This allows a materializer to use a simpler name for the parameter.
Polynomials, for example, result in names like x1_poly_1 or x1_poly_2 as opposed to raw expressions like [s]^2. clean_names are kept in snake case.
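As a sketch of the naming idea (`make_clean_names` is a hypothetical helper for illustration, not part of fiasto's API):

```python
# Hypothetical helper illustrating the clean-name scheme described above:
# a degree-n polynomial of a variable expands into one column per degree,
# named <var>_poly_<degree> instead of a raw expression.
def make_clean_names(var: str, degree: int) -> list[str]:
    return [f"{var}_poly_{d}" for d in range(1, degree + 1)]

print(make_clean_names("x1", 2))  # ['x1_poly_1', 'x1_poly_2']
```

Names like these are valid identifiers in most languages and dataframe libraries, which keeps downstream column handling simple.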
y ~ 1 -> y ~ 1 (null model with intercept)
y ~ 0 -> y ~ 0 (null model without intercept)
bind(y1, y2) ~ x -> bind(y1, y2) ~ x (multivariate response model)
y ~ x1*x2 + s(z) + (1+x1|1) + (1|g2) - 1 -> y ~ x1 * x2 + s(z) + (1 + x1 | 1) + (1 | g2) - 1
y ~ x1*x2 + s(z) + (1+x1|1) + (1|g2), sigma ~ x1 + (1|g2) -> y ~ x1 * x2 + s(z) + (1 + x1 | 1) + (1 | g2) and sigma ~ x1 + (1 | g2)
bf(y ~ a1 - a2^x, a1 + a2 ~ 1, nl = TRUE)
y ~ a1 - a2^x
a1 ~ 1
a2 ~ 1
bf(y ~ a1 - a2^x, a1 ~ 1, a2 ~ x + (x|g), nl = TRUE)
y ~ a1 - a2^x
a1 ~ 1
a2 ~ x + (x | g)
bf(y ~ a1 - a2^x, a1 ~ 1 + (1 |2| g), a2 ~ x + (x |2| g), nl = TRUE)
y ~ a1 - a2^x
a1 ~ 1 + (1 | 2 | g)
a2 ~ x + (x | 2 | g)
bf(y ~ a1 - a2^x, a1 ~ 1 + (1 | gr(g, id = 2)), a2 ~ x + (x | gr(g, id = 2)), nl = TRUE)
y ~ a1 - a2^x
a1 ~ 1 + (1 | gr(g, id = 2))
a2 ~ x + (x | gr(g, id = 2))
mvbind(y1, y2) ~ x * z + (1|g)
y1 ~ x * z + (1 | g)
y2 ~ x * z + (1 | g)
bf(y ~ x * z + (1+x|ID1|g), zi ~ x + (1|ID1|g))
y ~ x * z + (1 + x | ID1 | g)
zi ~ x + (1 | ID1 | g)
bf(y ~ mo(x) + more_predictors)
y ~ mo(x) + more_predictors
bf(y ~ cs(x) + more_predictors)
y ~ cs(x) + more_predictors
bf(y ~ cs(x) + (cs(1)|g))
y ~ cs(x) + (cs(1) | g)
bf(y ~ person + item, disc ~ item)
y ~ person + item
disc ~ item
bf(y ~ me(x, sdx))
y ~ me(x, sdx)
Specify predictors on all parameters of the Wiener diffusion model; the main formula models the drift rate 'delta':
bf(rt | dec(decision) ~ x, bs ~ x, ndt ~ x, bias ~ x)
rt | dec(decision) ~ x
bs ~ x
ndt ~ x
bias ~ x
bf(rt | dec(decision) ~ x, bias = 0.5)
rt | dec(decision) ~ x
bias = 0.5
mix <- mixture(gaussian, gaussian)
bf(y ~ 1, mu1 ~ x, mu2 ~ z, family = mix)
y ~ 1
mu1 ~ x
mu2 ~ z
bf(y ~ x, sigma2 = "sigma1", family = mix)
y ~ x
sigma2 = sigma1
bf(y ~ 1) + nlf(sigma ~ a * exp(b * x), a ~ x) + lf(b ~ z + (1|g), dpar = "sigma") + gaussian()
y ~ 1
sigma ~ a * exp(b * x)
a ~ x
b ~ z + (1 | g)
bf(y1 ~ x + (1|g)) + gaussian() + cor_ar(~1|g) + bf(y2 ~ z) + poisson()
y1 ~ x + (1 | g)
autocor ~ arma(time = NA, gr = g, p = 1, q = 0, cov = FALSE)
y2 ~ z
bf(y1 ~ 1 + x + (1|c|obs), sigma = 1) + gaussian()
bf(y2 ~ 1 + x + (1|c|obs)) + poisson()
bf(bmi ~ age * mi(chl)) + bf(chl | mi() ~ age) + set_rescor(FALSE)
bmi ~ age * mi(chl)
chl | mi() ~ age
bf(y ~ eta, nl = TRUE) + lf(eta ~ 1 + x) + nlf(sigma ~ tau * sqrt(eta)) + lf(tau ~ 1)
y ~ eta
eta ~ 1 + x
sigma ~ tau * sqrt(eta)
tau ~ 1
bf(y1 ~ x + (1|g)) + bf(y2 ~ s(z))
y1 ~ x + (1 | g)
y2 ~ s(z)
y ~ x + (1 | g), fill = "mean"
For detailed documentation, see gr() Function Documentation.