Pronounced like fiasco, but with a t instead of a c
(F)ormulas (I)n (AST) (O)ut
A language-agnostic, modern Wilkinson's formula parser and lexer.
This library is in testing and actively changing.
Formula parsing and materialization (making a model matrix from a formula) are normally done in a single library.
Python, for example, has patsy/formulaic/formulae which all do parsing & materialization.
R's model.matrix also handles formula parsing and design matrix creation.
There is nothing wrong with this coupling, but I wanted to try decoupling parsing from materialization. A focused parsing library could then be reused across multiple languages and dataframe libraries. This package has one clear job: parse and/or lex formulas and return structured JSON metadata. Note: technically an AST is not returned; a simplified, structured intermediate representation (IR) in the form of JSON is returned. This JSON IR ought to be easy for many language bindings to consume.
- Make a Rust sandbox:

```shell
cargo new try_fiasto
cd try_fiasto
cargo add fiasto
cargo add serde_json
```

- Try an example formula, then `cargo run`:
```rust
use fiasto::parse_formula;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Test with a simpler formula
    let simple_input = "y ~ x + z";
    println!("Testing with simpler formula:");
    println!("Input: {}", simple_input);
    println!();

    let simple_result = parse_formula(simple_input)?;
    println!("FORMULA METADATA (as JSON):");
    println!("{}", serde_json::to_string_pretty(&simple_result)?);
    Ok(())
}
```

Running the example prints:

Testing with simpler formula:
Input: y ~ x + z
FORMULA METADATA (as JSON):

```json
{
  "all_generated_columns": [
    "y",
    "intercept",
    "x",
    "z"
  ],
  "all_generated_columns_formula_order": {
    "1": "y",
    "2": "intercept",
    "3": "x",
    "4": "z"
  },
  "columns": {
    "x": {
      "generated_columns": [
        "x"
      ],
      "id": 2,
      "interactions": [],
      "random_effects": [],
      "roles": [
        "Identity"
      ],
      "transformations": []
    },
    "y": {
      "generated_columns": [
        "y"
      ],
      "id": 1,
      "interactions": [],
      "random_effects": [],
      "roles": [
        "Response"
      ],
      "transformations": []
    },
    "z": {
      "generated_columns": [
        "z"
      ],
      "id": 3,
      "interactions": [],
      "random_effects": [],
      "roles": [
        "Identity"
      ],
      "transformations": []
    }
  },
  "formula": "y ~ x + z",
  "metadata": {
    "family": null,
    "has_intercept": true,
    "has_uncorrelated_slopes_and_intercepts": false,
    "is_random_effects_model": false,
    "response_variable_count": 1
  }
}
```

The library exposes a clean, focused API:
- `parse_formula()`: Takes a Wilkinson's formula string and returns structured JSON metadata
- `lex_formula()`: Tokenizes a formula string and returns JSON describing each token

"Only two functions?! What kind of library is this?!" An easy-to-maintain library with a small surface area. The best kind.
The parser returns a variable-centric JSON structure where each variable is described with its roles, transformations, interactions, and random effects. This makes it easy to understand the complete model structure and generate appropriate design matrices. wayne is a Python package that can take this JSON and generate design matrices for use in statistical modeling.
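To illustrate the decoupling, here is a minimal sketch (Python, standard library only) of how a downstream materializer could consume the JSON metadata. It hard-codes an abbreviated version of the `y ~ x + z` output shown above; this is an illustration, not wayne's actual API:

```python
import json

# Abbreviated parse_formula() metadata for "y ~ x + z" (see the full
# output above); hard-coded here so the sketch is self-contained.
metadata = json.loads("""
{
  "all_generated_columns_formula_order": {
    "1": "y", "2": "intercept", "3": "x", "4": "z"
  },
  "columns": {
    "x": {"roles": ["Identity"]},
    "y": {"roles": ["Response"]},
    "z": {"roles": ["Identity"]}
  },
  "metadata": {"has_intercept": true, "response_variable_count": 1}
}
""")

# Column order for the design matrix, recovered from the formula-order
# map (keys are stringified positions, so sort them numerically).
order = metadata["all_generated_columns_formula_order"]
columns = [order[k] for k in sorted(order, key=int)]

# Split response vs. predictor columns using each variable's roles.
responses = [name for name, col in metadata["columns"].items()
             if "Response" in col.get("roles", [])]

print(columns)    # ['y', 'intercept', 'x', 'z']
print(responses)  # ['y']
```

Any language with a JSON parser can do the same, which is the point of keeping the parser separate from the materializer.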
- Comprehensive Formula Support: Full R/Wilkinson notation including complex random effects and intercept-only models
- Variable-Centric Output: Variables are first-class citizens with detailed metadata
- Advanced Random Effects: brms-style syntax with correlation control and grouping options
- Intercept-Only Models: Full support for `y ~ 1` and `y ~ 0` formulas with proper metadata generation
- Multivariate Models: Full support for `bind(y1, y2) ~ x` formulas with multiple response variables
- Pretty Error Messages: Colored, contextual error reporting with syntax highlighting
- Robust Error Recovery: Graceful handling of malformed formulas with specific error types
- Language Agnostic Output: JSON format for easy integration with various programming languages
- Comprehensive Documentation: Detailed usage examples and grammar rules
- Comprehensive Metadata: Variable roles, transformations, interactions, and relationships
- Automatic Naming For Generated Columns: Consistent, descriptive names for transformed and interaction terms
- Efficient tokenization: using one of the fastest lexer generators for Rust (logos crate)
- Fast pattern matching: using match statements and enum-based token handling. Rust match statements are zero-cost abstractions.
- Minimal string copying: with extensive use of string slices (`&str`) where possible
- Cross-Platform Model Specs: Define models once, implement in multiple frameworks
I can't think of every kind of formula that could be parsed, but I do have a checklist to start with.
To my knowledge, the brms formula syntax is the most complex and possibly the most complete.
I would like to start with it as a baseline and continue to extend as needed.
I also offer a clean_name for each parameter. This allows a materializer to use a simpler name for the parameter.
Polynomials, for example, result in names like x1_poly_1 or x1_poly_2 as opposed to raw expressions like [s]^2. clean_names are kept in snake case.
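As a sketch of the naming idea (`make_clean_names` is a hypothetical helper for illustration, not part of fiasto's API):

```python
# Hypothetical helper illustrating the clean-name scheme described above:
# a degree-n polynomial of a variable expands into one column per degree,
# named <var>_poly_<degree> instead of a raw expression.
def make_clean_names(var: str, degree: int) -> list[str]:
    return [f"{var}_poly_{d}" for d in range(1, degree + 1)]

print(make_clean_names("x1", 2))  # ['x1_poly_1', 'x1_poly_2']
```

Names like these are valid identifiers in most languages and dataframe libraries, which keeps downstream column handling simple.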
y ~ 1 -> y ~ 1 (null model with intercept)
y ~ 0 -> y ~ 0 (null model without intercept)
bind(y1, y2) ~ x -> bind(y1, y2) ~ x (multivariate response model)
y ~ x1*x2 + s(z) + (1+x1|1) + (1|g2) - 1 -> y ~ x1 * x2 + s(z) + (1 + x1 | 1) + (1 | g2) - 1
y ~ x1*x2 + s(z) + (1+x1|1) + (1|g2), sigma ~ x1 + (1|g2) -> y ~ x1 * x2 + s(z) + (1 + x1 | 1) + (1 | g2) and sigma ~ x1 + (1 | g2)
bf(y ~ a1 - a2^x, a1 + a2 ~ 1, nl = TRUE)
y ~ a1 - a2^x
a1 ~ 1
a2 ~ 1
bf(y ~ a1 - a2^x, a1 ~ 1, a2 ~ x + (x|g), nl = TRUE)
y ~ a1 - a2^x
a1 ~ 1
a2 ~ x + (x | g)
bf(y ~ a1 - a2^x, a1 ~ 1 + (1 |2| g), a2 ~ x + (x |2| g), nl = TRUE)
y ~ a1 - a2^x
a1 ~ 1 + (1 | 2 | g)
a2 ~ x + (x | 2 | g)
bf(y ~ a1 - a2^x, a1 ~ 1 + (1 | gr(g, id = 2)), a2 ~ x + (x | gr(g, id = 2)), nl = TRUE)
y ~ a1 - a2^x
a1 ~ 1 + (1 | gr(g, id = 2))
a2 ~ x + (x | gr(g, id = 2))
mvbind(y1, y2) ~ x * z + (1|g)
y1 ~ x * z + (1 | g)
y2 ~ x * z + (1 | g)
bf(y ~ x * z + (1+x|ID1|g), zi ~ x + (1|ID1|g))
y ~ x * z + (1 + x | ID1 | g)
zi ~ x + (1 | ID1 | g)
bf(y ~ mo(x) + more_predictors)
y ~ mo(x) + more_predictors
bf(y ~ cs(x) + more_predictors)
y ~ cs(x) + more_predictors
bf(y ~ cs(x) + (cs(1)|g))
y ~ cs(x) + (cs(1) | g)
bf(y ~ person + item, disc ~ item)
y ~ person + item
disc ~ item
bf(y ~ me(x, sdx))
y ~ me(x, sdx)
Specify predictors on all parameters of the Wiener diffusion model; the main formula models the drift rate 'delta':
bf(rt | dec(decision) ~ x, bs ~ x, ndt ~ x, bias ~ x)
rt | dec(decision) ~ x
bs ~ x
ndt ~ x
bias ~ x
bf(rt | dec(decision) ~ x, bias = 0.5)
rt | dec(decision) ~ x
bias = 0.5
mix <- mixture(gaussian, gaussian)
bf(y ~ 1, mu1 ~ x, mu2 ~ z, family = mix)
y ~ 1
mu1 ~ x
mu2 ~ z
bf(y ~ x, sigma2 = "sigma1", family = mix)
y ~ x
sigma2 = sigma1
bf(y ~ 1) + nlf(sigma ~ a * exp(b * x), a ~ x) + lf(b ~ z + (1|g), dpar = "sigma") + gaussian()
y ~ 1
sigma ~ a * exp(b * x)
a ~ x
b ~ z + (1 | g)
bf(y1 ~ x + (1|g)) + gaussian() + cor_ar(~1|g) + bf(y2 ~ z) + poisson()
y1 ~ x + (1 | g)
autocor ~ arma(time = NA, gr = g, p = 1, q = 0, cov = FALSE)
y2 ~ z
bf(y1 ~ 1 + x + (1|c|obs), sigma = 1) + gaussian()
bf(y2 ~ 1 + x + (1|c|obs)) + poisson()
bf(bmi ~ age * mi(chl)) + bf(chl | mi() ~ age) + set_rescor(FALSE)
bmi ~ age * mi(chl)
chl | mi() ~ age
bf(y ~ eta, nl = TRUE) + lf(eta ~ 1 + x) + nlf(sigma ~ tau * sqrt(eta)) + lf(tau ~ 1)
y ~ eta
eta ~ 1 + x
sigma ~ tau * sqrt(eta)
tau ~ 1
bf(y1 ~ x + (1|g)) + bf(y2 ~ s(z))
y1 ~ x + (1 | g)
y2 ~ s(z)
y ~ x + (1 | g), fill = "mean"
For detailed documentation, see gr() Function Documentation.