Pull request overview
This PR adds manual induction head support to the Infinite Attention mechanism, enabling models to use specialized attention heads where queries and keys share the same projection (tied-Wk). The implementation includes configuration updates, new projection layers for induction heads, and integration into the attention computation pipeline.
Changes:
- Added `n_induction_head` and `n_ind_head_dim` parameters to configuration classes and CLI, allowing specification of manual induction head count and dimensions
- Modified InfiniteHeadAttention to initialize separate projection layers (`c_attn_k_ind`, `c_attn_v_ind`, `c_proj_ind`) and compute attention for manual induction heads alongside regular heads
- Created a YAML configuration sweep to explore various ratios of manual induction heads to regular heads and different head dimensions
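The tied-Wk idea can be sketched in isolation. In this minimal illustration, reusing `c_attn_k_ind` for queries is inferred from the summary above (the PR adds no separate `c_attn_q_ind`), and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

# Hedged sketch of the tied-Wk idea: queries and keys for the manual
# induction heads come from one shared projection (reusing c_attn_k_ind for
# queries is an inference from the PR summary; dimensions are illustrative).
n_embd, n_induction_head, n_ind_head_dim = 64, 2, 16
B, T = 2, 8
c_attn_k_ind = nn.Linear(n_embd, n_induction_head * n_ind_head_dim, bias=False)

x = torch.randn(B, T, n_embd)
k_ind = c_attn_k_ind(x)  # keys
q_ind = c_attn_k_ind(x)  # queries reuse the same tied Wk
q_ind = q_ind.view(B, T, n_induction_head, n_ind_head_dim).transpose(1, 2)
k_ind = k_ind.view(B, T, n_induction_head, n_ind_head_dim).transpose(1, 2)
```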
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| gpt_conf.py | Added configuration fields for manual induction head parameters |
| train_args.py | Added CLI arguments for specifying induction head count and dimensions |
| variations/attention_variations.py | Implemented manual induction head support with initialization, forward pass computation, and output integration |
| explorations/infinite_manual_induction_dim_sweep.yaml | Defined sweep experiments testing different head ratios and dimensionalities |
Comments suppressed due to low confidence (2)
variations/attention_variations.py:1325
- The post-activation L2 normalization (line 1322) and cproj_scale division (line 1325) are applied to the regular attention output y but not to the manual induction output y_ind before they are combined. This could lead to scale mismatches when combining the two outputs (line 1366), especially when post_act_l2_norm is enabled or cproj_scale is not 1.0. Consider applying these transformations to y_ind as well before combining.
```python
if self.post_act_l2_norm:
    y = y / y.norm(dim=-1, keepdim=True).clamp_min(1e-6)
if self.cproj_scale is not None and self.cproj_scale != 1.0:
    y = y / self.cproj_scale
```
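A minimal sketch of the suggested fix, applying the same post-activation transforms to the induction output so the two branches stay on comparable scales before they are combined (`normalize_output` is a hypothetical helper; the flag names mirror those in the comment above):

```python
import torch

# Hedged sketch (helper name is hypothetical): apply the same post-activation
# transforms to the induction output y_ind that the regular path applies to y,
# so the two branches stay on comparable scales before they are combined.
def normalize_output(y, post_act_l2_norm=True, cproj_scale=2.0):
    if post_act_l2_norm:
        y = y / y.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    if cproj_scale is not None and cproj_scale != 1.0:
        y = y / cproj_scale
    return y

y = torch.randn(2, 8, 64)      # regular attention output, (B, T, n_embd)
y_ind = torch.randn(2, 8, 64)  # manual induction output
y, y_ind = normalize_output(y), normalize_output(y_ind)
combined = y + y_ind           # per-token norms now match across branches
```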
variations/attention_variations.py:1231
- Manual induction heads (q_ind, k_ind) are not receiving rotary position encodings or QK normalization, while the regular attention heads (q, k) are. This creates an inconsistency in positional information and normalization between the two types of heads. Consider whether manual induction heads should also receive these transformations, especially rotary embeddings which encode positional information that may be important for induction behavior.
```python
# Apply Rotary Position Encodings
if (self.rotary_emb_q is not None) and (self.rotary_emb_k is not None):
    q = self.rotary_emb_q(q)
    k = self.rotary_emb_k(k)
# Apply QK Norm
if self.use_qk_norm:
    q = q / (q.norm(dim=-1, keepdim=True) + 1e-6)
    k = k / (k.norm(dim=-1, keepdim=True) + 1e-6)
if self.use_v_norm:
    v = v / (v.norm(dim=-1, keepdim=True) + 1e-6)
```
```python
self.c_attn_q = self.linear_variant_q(self.n_embd, self.n_head * self.n_qk_head_dim, config, bias=config.bias)
self.c_attn_k = self.linear_variant_k(self.n_embd, self.n_kv_group * self.n_qk_head_dim, config, bias=config.bias)
self.c_attn_v = self.linear_variant_v(self.n_embd, self.n_kv_group * self.n_v_head_dim, config, bias=config.bias)

if self.use_manual_induction:
    # "Manual induction" heads tie Q and K to a shared Wk projection.
    ind_total_dim = self.n_induction_head * self.n_ind_head_dim
    self.c_attn_k_ind = self.linear_variant_k(self.n_embd, ind_total_dim, config, bias=config.bias)
    self.c_attn_v_ind = self.linear_variant_v(self.n_embd, ind_total_dim, config, bias=config.bias)
    self.c_proj_ind = self.linear_variant_attn_proj(ind_total_dim, self.n_embd, config, bias=config.bias)
```
When n_head is 0 (as configured in the YAML sweep at line 59), the code will fail when computing c_attn_q projection dimensions. Line 1074 calculates self.n_head * self.n_qk_head_dim which would be 0 when n_head is 0, creating an invalid linear layer with 0 output dimensions. The code should either validate that n_head is positive when use_manual_induction is false, or handle the case where only manual induction heads are used without regular attention heads.
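One way the suggested guard could look at init time (the helper name and return convention are hypothetical; it either rejects the invalid configuration or tells the caller to skip building the regular q/k/v projections):

```python
# Hedged sketch of an __init__-time guard (helper name and return convention
# are hypothetical): require regular heads unless an induction-only
# configuration is explicitly requested.
def validate_head_config(n_head, use_manual_induction, n_induction_head):
    if n_head == 0 and not (use_manual_induction and n_induction_head > 0):
        raise ValueError(
            "n_head must be positive unless manual induction heads are enabled"
        )
    # Tells the caller whether to build the regular q/k/v projections at all.
    return n_head > 0

build_regular_heads = validate_head_config(8, False, 0)
```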
```diff
@@ -1239,6 +1272,15 @@
        is_causal=True,
    )
```
When n_head is 0, the forward pass will fail at multiple points. Lines 1211-1213 will attempt to view/reshape tensors with 0 heads, and the _expand_kv function (line 1237-1238) and attention computation (lines 1266-1273 or 1290-1311) will all process empty tensors. The code should have logic to skip regular attention computation entirely when n_head is 0 and only manual induction heads are present.
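A sketch of the induction-only fallback at the point where the two outputs are combined (`combine_outputs` is a hypothetical helper standing in for the in-line logic):

```python
import torch

# Hedged sketch (combine_outputs is a hypothetical stand-in for the in-line
# logic): when n_head == 0 the regular branch is skipped entirely and the
# induction output is used directly instead of being added to y.
def combine_outputs(y_regular, y_ind):
    if y_regular is None:  # n_head == 0: induction-only configuration
        return y_ind
    return y_regular + y_ind

y_ind = torch.randn(2, 8, 64)
out = combine_outputs(None, y_ind)  # induction-only path
```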
```python
# Concat Heads or Inf Concat Heads
if self.use_concat_heads:
    # (B, nh, T, v_dim) → (B, T, nh*v_dim); avoid extra .contiguous()
    # flatten heads → (B, T, n_head * n_v_head_dim)
    y = y.transpose(1, 2).contiguous().view(B, T, self.n_head * self.n_v_head_dim)
    if self.l2_norm_attn_cproj:
        cproj_weight = F.normalize(self.c_proj.weight, p=2, dim=self.cproj_norm_dim)
        y = F.linear(y, cproj_weight, self.c_proj.bias)
    else:
        y = self.c_proj(y)
elif self.n_cproj == 1:
    # Sum heads first: (B, nh, T, v_dim) → (B, T, v_dim)
    y = y.sum(dim=1)
    if self.l2_norm_attn_cproj:
        cproj_weight = F.normalize(self.c_proj.weight, p=2, dim=self.cproj_norm_dim)
        y = F.linear(y, cproj_weight, self.c_proj.bias)
    else:
        y = self.c_proj(y)
else:
    # Sum heads first: (B, nh, T, v_dim) → (B, T, v_dim)
    y_sum = y.sum(dim=1)

    # Parallel small projections then fuse; avoids Python-level loop
    if self.l2_norm_attn_cproj:
        proj_outputs = [
            F.linear(y_sum, F.normalize(proj.weight, p=2, dim=self.cproj_norm_dim), proj.bias)
            for proj in self.c_proj_list
        ]
    else:
        proj_outputs = [proj(y_sum) for proj in self.c_proj_list]
    y = torch.stack(proj_outputs, dim=0).sum(dim=0)
```
When n_head is 0, lines 1327-1357 will fail because they assume y exists and has valid dimensions for projection. The code needs to handle the case where only manual induction heads are used (n_head=0) by skipping the regular attention output processing and only computing y_ind, then using it directly instead of adding it to y.
```python
if self.use_manual_induction:
    y_ind = torch.nn.functional.scaled_dot_product_attention(
        q_ind,
        k_ind,
        v_ind,
        dropout_p=self.dropout if self.training else 0,
        is_causal=True,
    )
```
In the flash attention path, when use_qk_norm_scale is enabled, the regular attention q is scaled by qk_scaling_factor (lines 1244-1245), but q_ind for manual induction heads is not scaled before being passed to scaled_dot_product_attention (lines 1276-1282). This inconsistency means manual induction heads won't benefit from the learned QK norm scaling when it's enabled, potentially leading to different attention behaviors between the two head types.
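A sketch of the suggested fix: scale `q_ind` the same way before the SDPA call (here `qk_scaling_factor` is a stand-in for the learned factor referenced above, and the shapes are illustrative):

```python
import torch
import torch.nn.functional as F

# Hedged sketch: scale q_ind by the same factor the flash path applies to the
# regular q (qk_scaling_factor here is a stand-in for the learned parameter).
qk_scaling_factor = torch.tensor(1.5)
q_ind = torch.randn(2, 4, 8, 16)  # (B, n_induction_head, T, n_ind_head_dim)
k_ind = torch.randn(2, 4, 8, 16)
v_ind = torch.randn(2, 4, 8, 16)
q_ind = q_ind * qk_scaling_factor
y_ind = F.scaled_dot_product_attention(q_ind, k_ind, v_ind, is_causal=True)
```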
```python
att_ind = (q_ind @ k_ind.transpose(-2, -1))
att_ind = att_ind / math.sqrt(self.n_ind_head_dim)
att_ind = att_ind.masked_fill(self.bias[:, :, :T, :T].to(x.device) == 0, float('-inf'))
att_ind = F.softmax(att_ind, dim=-1)
```
In the manual attention implementation path, the manual induction heads use standard softmax (line 1317) while the regular attention heads can use a configurable softmax_variant_attn (lines 1304-1307). This inconsistency means manual induction heads won't benefit from alternative softmax variants that may be configured for better attention stability or performance.
Suggested change:

```diff
-att_ind = F.softmax(att_ind, dim=-1)
+if self.softmax_variant_attn != 'softmax':
+    att_ind = self.softmax_layer_attn(att_ind)
+else:
+    att_ind = F.softmax(att_ind, dim=-1)
```
This pull request introduces support for manual induction heads in Infinite Attention, allowing for more flexible configuration of attention head types and dimensions. The main changes involve updating configuration files and classes to accept new parameters, modifying the attention module to implement manual induction heads, and updating the forward pass logic to integrate these heads into the attention computation.
Manual Induction Head Support for Infinite Attention
Configuration and Argument Updates:
- Added `n_induction_head` and `n_ind_head_dim` parameters to `GPTConfig` and CLI argument parsing, enabling specification of the number and dimensionality of manual induction heads. [1] [2]
- Added `explorations/infinite_manual_induction_dim_sweep.yaml` to define sweep experiments for manual induction head dimensionality and head ratios.

Attention Module Implementation:

- Updated `variations/attention_variations.py` to initialize additional projection layers (`c_attn_k_ind`, `c_attn_v_ind`, `c_proj_ind`) for manual induction heads when enabled, and validated required parameters. [1] [2]