Controlling every detail of LLM training by building everything from the ground up.
- Mixture-of-experts architecture, defined in `llm/moe.py`. No optimizations yet.
- Loss functions (cross-entropy, DPO, GRPO/GSPO). TODO: double-check.
- Optimizer (non-distributed) and learning-rate scheduler (warmup, cosine annealing, post-annealing).
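The scheduler phases listed above can be sketched as a single step-to-rate function. This is a minimal, dependency-free illustration; the parameter names (`max_lr`, `min_lr`, `warmup_steps`, `anneal_steps`) are hypothetical and the repo's actual scheduler may differ in shape and defaults.

```python
import math

def lr_at(step, *, max_lr=3e-4, min_lr=3e-5,
          warmup_steps=100, anneal_steps=1000):
    """Warmup -> cosine annealing -> constant post-annealing floor.

    Illustrative sketch only; parameter names are assumptions,
    not the repo's actual API.
    """
    if step < warmup_steps:
        # Linear warmup from ~0 up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step < warmup_steps + anneal_steps:
        # Cosine decay from max_lr down to min_lr.
        t = (step - warmup_steps) / anneal_steps
        return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))
    # Post-annealing: hold the floor learning rate.
    return min_lr
```

Keeping the schedule as a pure function of the step makes it trivial to plot and unit-test before wiring it into the optimizer loop.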
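Of the losses listed, DPO is the least standard, so a scalar sketch may help. This follows the published DPO objective, -log σ(β·margin), for a single preference pair; the function and argument names are illustrative, and the repo's batched tensor implementation will look different.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (illustrative names, not the repo's API).

    Inputs are summed log-probs of the chosen/rejected completions under
    the trained policy and the frozen reference model.
    """
    # How much more the policy prefers chosen over rejected,
    # relative to the reference model.
    margin = (policy_logp_chosen - ref_logp_chosen) \
           - (policy_logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)), via log1p for numerical stability.
    return math.log1p(math.exp(-beta * margin))
```

At zero margin the loss is log 2; it falls toward 0 as the policy's preference for the chosen completion grows beyond the reference's.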
```shell
uv pip install nvidia-cutlass-dsl triton
uv run maturin develop --release --manifest-path tokenizer/Cargo.toml
```