SCOUT (Segment Compression for Optimized Utility in Transformers) is a hybrid long-sequence model that combines local mixing (via Mamba or sliding-window attention) with sparse attention over compressed checkpoints. Instead of attending to every past token, SCOUT compresses fixed-size segments into summary representations and only attends to these checkpoints. This design preserves much of the expressivity of full attention while scaling sub-quadratically in compute and memory.
Transformers have achieved state-of-the-art performance across domains but remain bottlenecked by the quadratic cost of self-attention. Existing efficient alternatives each involve trade-offs:
- Linear state-space models compress history into a recurrent state, but suffer from fading memory over long sequences.
- Hybrid models mix fast local layers with occasional full attention, but still retain quadratic bottlenecks.
- Sparse attention methods reduce costs via structured sparsity, but rely on fixed, input-agnostic patterns.
SCOUT addresses this challenge by combining fast linear token mixers with sparse attention over compressed checkpoints, enabling sub-quadratic complexity while preserving global context.
Figure 1: The SCOUT architecture with either type of token mixer (Mamba or SWA).
SCOUT achieves sub-quadratic attention complexity by combining linear token mixing with sparse attention over compressed memory.
Each layer consists of:
- Token Mixer (Mamba or SWA): Encodes tokens with local context in linear time.
- Checkpoint Compression: Periodically extracts compressed memory slots that summarize past segments, enabling sparse attention to recover long-range dependencies.
- Feedforward Networks (MLPs): Standard transformations before and after checkpoint attention.
This design preserves efficiency while maintaining both local and global context, eliminating the need for full attention layers.
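To make this concrete, below is a minimal PyTorch sketch of one SCOUT-style layer. It is an illustration under stated assumptions rather than the code in model/: the local mixer is a naive banded-mask stand-in for sliding-window attention, checkpoints are produced by mean-pooling fixed-size segments, attention is single-headed, and only one feedforward block is shown even though the layer described above places MLPs before and after the checkpoint attention.

```python
# Minimal sketch of a SCOUT-style layer (not the repository's exact implementation).
# Assumptions: mean-pooling as the checkpoint compressor, a naive banded mask as the
# sliding-window mixer, single-head attention, and a single feedforward block.
import torch
import torch.nn as nn


class ScoutLayerSketch(nn.Module):
    def __init__(self, dim: int, window: int = 128, segment_len: int = 64):
        super().__init__()
        self.window = window
        self.segment_len = segment_len
        self.qkv_local = nn.Linear(dim, 3 * dim)   # sliding-window mixer (SWA stand-in)
        self.q_ckpt = nn.Linear(dim, dim)          # token queries over checkpoints
        self.kv_ckpt = nn.Linear(dim, 2 * dim)     # keys/values from checkpoint summaries
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        scale = D ** -0.5
        idx = torch.arange(T, device=x.device)

        # 1) Local mixing: causal attention restricted to a sliding window (naive masked form).
        q, k, v = self.qkv_local(self.norm1(x)).chunk(3, dim=-1)
        local_mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < self.window)
        scores = (q @ k.transpose(-1, -2)) * scale
        x = x + torch.softmax(scores.masked_fill(~local_mask, float("-inf")), dim=-1) @ v

        # 2) Checkpoint compression: mean-pool each completed fixed-size segment into one slot.
        L = self.segment_len
        n_ckpt = T // L
        if n_ckpt > 0:
            ckpts = x[:, : n_ckpt * L].reshape(B, n_ckpt, L, D).mean(dim=2)

            # 3) Sparse attention: each token attends only to checkpoints of fully past segments.
            q = self.q_ckpt(self.norm2(x))
            k, v = self.kv_ckpt(ckpts).chunk(2, dim=-1)
            allowed = torch.arange(n_ckpt, device=x.device)[None, :] < (idx[:, None] // L)
            scores = (q @ k.transpose(-1, -2)) * scale
            attn = torch.softmax(scores.masked_fill(~allowed, float("-inf")), dim=-1)
            x = x + torch.nan_to_num(attn) @ v     # tokens with no past checkpoint get a zero update

        # 4) Feedforward.
        return x + self.mlp(self.norm3(x))


# Quick shape check: batch of 2 sequences of length 256, model width 512.
layer = ScoutLayerSketch(dim=512)
print(layer(torch.randn(2, 256, 512)).shape)  # torch.Size([2, 256, 512])
```

In this sketch each token mixes within its local window and then attends to at most T / segment_len checkpoint summaries instead of all T past tokens, which is where the savings over full attention come from.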
This repository provides code and configuration for pretraining SCOUT variants using the FineWeb-Edu dataset.
Our training infrastructure is primarily adapted from the Samba and TinyLlama codebases. The modeling code is based on Qwen2 and resides in the model/ folder. The training script (train.py) and related utilities are under train/. Data processing code lives under data/. We also include lm-evaluation-harness as a submodule for evaluation.
Download the FineWeb-Edu dataset to your chosen directory using the script below:
```bash
python data/load_dataset.py --source_path /path/to/Fineweb-edu
```

This script saves the dataset as .jsonl shards in the specified path. Then use the provided tokenizer and packing script to convert the data into a training-ready format:
```bash
python data/prepare_dataset.py --source_path /path/to/fine --tokenizer_path data/llama --destination_path data/slim --split validation --percentage 1.0
python data/prepare_dataset.py --source_path /path/to/fine --tokenizer_path data/llama --destination_path data/slim --split train --percentage 1.0
```

We provide bash launch scripts in train/scripts/. Here is an example script that trains SCOUT-SWA-1.57B on 100B tokens across 8 GPUs:

```bash
#!/bin/bash
# Get the project root directory (assuming this script is in train/scripts/)
PROJECT_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32
NUM_GPUS=8
# Training config
NAME="SCOUT_SWA_1.59B"
MODEL="${PROJECT_ROOT}/configs/scout_swa_1.57b.json"
CONFIG="1024x2k_100B" # For 1B scale
MICRO_BATCH_SIZE=8
EVAL_ITERS=15
LR=3e-4
# Paths
OUTPUT_ROOT="${PROJECT_ROOT}/train"
TRAIN_DATA="${PROJECT_ROOT}/datasets/fineweb-edu/100B/fla_tokenized"
VALIDATION_DATA=None
SAVE_DIR="${PROJECT_ROOT}/save/"
# Run training
torchrun --nproc_per_node=${NUM_GPUS} --master_port=29500 ${OUTPUT_ROOT}/pretrain.py \
--train_data_dir ${TRAIN_DATA} --val_data_dir ${VALIDATION_DATA} --output_root ${SAVE_DIR} \
--exp_name ${NAME} --model_name ${MODEL} --eval_iters ${EVAL_ITERS} \
--learning_rate ${LR} --micro_batch_size ${MICRO_BATCH_SIZE} --train_config ${CONFIG}
```

You can modify:
- `MODEL` to switch architectures (e.g., `configs/scout_swa_470m.json`)
- `CONFIG` to control the token budget and sequence length (e.g., `"128x4k_15B"` for 15B tokens, global batch size 128, sequence length 4k); the format is illustrated in the sketch after this list
- `NAME` to name each experiment run
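For reference, here is a hypothetical helper that decodes a config string such as `"1024x2k_100B"` according to the convention described above (global batch size x sequence length _ token budget). The function name and the unit handling (k read as 1024, B as 10^9) are our assumptions, not the parsing used in train/.

```python
import re

def parse_train_config(config: str) -> tuple[int, int, int]:
    """Decode e.g. "128x4k_15B" into (global batch size, sequence length, token budget).

    Hypothetical helper following the convention described above; the actual
    parsing in train/ may differ (here "k" is read as 1024 and "B" as 1e9).
    """
    match = re.fullmatch(r"(\d+)x(\d+)k_(\d+)B", config)
    if match is None:
        raise ValueError(f"Unrecognized train config: {config!r}")
    batch, seq_k, tokens_b = map(int, match.groups())
    return batch, seq_k * 1024, tokens_b * 10**9

print(parse_train_config("128x4k_15B"))    # (128, 4096, 15000000000)
print(parse_train_config("1024x2k_100B"))  # (1024, 2048, 100000000000)
```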
Launch training using:
```bash
sh train/scripts/example_script.sh
```

We use lm-evaluation-harness to evaluate trained models on zero-shot benchmarks. Install it first:

```bash
cd lm-evaluation-harness
pip install .
```

Run the general zero-shot benchmarks:

```bash
CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes 1 lm_eval --model hf \
--model_args pretrained=model_path,tokenizer=fla-hub/transformer-1.3B-100B,dtype=bfloat16 \
--tasks wikitext,lambada_openai,piqa,hellaswag,arc_easy,arc_challenge,mmlu,commonsense_qa \
--batch_size 16 \
--num_fewshot 0 \
--output_path ./results/general/
```

For long-context evaluation on LongBench:

```bash
CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes 1 lm_eval --model hf \
--model_args pretrained=model_path,tokenizer=fla-hub/transformer-1.3B-100B,dtype=bfloat16 \
--tasks longbench_e \
--batch_size 1 \
--output_path ./results/longbench/ \
--show_config \
--trust_remote_code \
--gen_kwargs max_new_tokens=512,do_sample=False
```
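After pretraining, a checkpoint can in principle be loaded for a quick generation smoke test through the same Hugging Face interface the evaluation commands rely on. The snippet below is a minimal, hedged example: `model_path` is a placeholder for your saved checkpoint, and whether `trust_remote_code` is needed depends on how the modeling code is packaged with the checkpoint.

```python
# Minimal sketch of loading a trained checkpoint through the same Hugging Face
# interface used by the evaluation commands above. "model_path" is a placeholder;
# the fla-hub tokenizer matches the one passed to lm_eval.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("fla-hub/transformer-1.3B-100B")
model = AutoModelForCausalLM.from_pretrained(
    "model_path",              # path to the trained SCOUT checkpoint
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,    # mirrors --trust_remote_code in the commands above
).eval()

inputs = tokenizer("Long-context modeling matters because", return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```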
This project builds upon the foundational work of several exceptional open-source initiatives, including Samba, TinyLlama, Qwen2, and lm-evaluation-harness. We gratefully acknowledge their contributions.

If you use SCOUT in your research, please cite:
```bibtex
@article{jafari2025scout,
title={SCOUT: Toward Sub-Quadratic Attention via Segment Compression for Optimized Utility in Transformers},
author={Jafari, Aref and Fan, Yuhe and Jamialahmadi, Benyamin and Farinneya, Parsa and Chen, Boxing and S. Tahaei, Marzieh},
journal={arXiv preprint arXiv:2509.00935},
year={2025}
}
```
