Extract neuron activations to identify which neurons encode a given state (here, uncertainty vs. certainty).
```bash
uv pip install torch transformers numpy matplotlib tqdm
uv run src/find_activations.py
```

The script performs the following steps (sketched end-to-end below):
- Load the model
- Register a forward hook on a model layer
- Extract activations for selected prompts (e.g., 8 uncertain + 8 certain)
- Identify selective neurons using statistical analysis
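A minimal sketch of those steps, assuming GPT-2 Small loaded via `transformers`; the layer index and prompts are illustrative placeholders, not the script's actual values:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

captured = {}

def hook_fn(module, inputs, output):
    # output[0] holds the layer's hidden states: (batch, seq, hidden_dim)
    captured["acts"] = output[0].mean(dim=1)  # mean-pool over the sequence (pad positions included)

layer_idx = 6  # placeholder; GPT-2 Small has layers 0-11
handle = model.transformer.h[layer_idx].register_forward_hook(hook_fn)

prompts = ["I'm not sure, but...", "It is certain that..."]  # placeholders
batch = tokenizer(prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    model(**batch)
handle.remove()

print(captured["acts"].shape)  # torch.Size([2, 768])
```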
The following plots are generated for visual inspection of the results:
- State-specific activations across all neurons
- Top selective neurons
- Effect size distribution (Cohen's d; see the sketch after this list)
- Highest activation distribution
- Activation pattern heatmap
- Correlation map (top selective neurons)
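The effect sizes behind the Cohen's d plot can be computed per neuron. A hedged sketch, assuming a standard pooled-variance Cohen's d over the two prompt groups (the script's exact statistic may differ); `uncertain_acts` and `certain_acts` are illustrative names:

```python
import numpy as np

def cohens_d(group_a: np.ndarray, group_b: np.ndarray) -> np.ndarray:
    # group_a, group_b: (n_prompts, 768) mean-pooled activations per group
    n_a, n_b = len(group_a), len(group_b)
    var_a = group_a.var(axis=0, ddof=1)
    var_b = group_b.var(axis=0, ddof=1)
    pooled = np.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (group_a.mean(axis=0) - group_b.mean(axis=0)) / pooled

# d = cohens_d(uncertain_acts, certain_acts)  # (768,) effect size per neuron
# top = np.argsort(-np.abs(d))[:10]           # indices of most selective neurons
```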
```python
class ActivationExtractor:
    def __init__(self):
        self.activations = None  # populated by the hook

    def hook_fn(self, module, input, output):
        # Capture hidden states during the forward pass
        hidden_states = output[0]  # (batch, seq, hidden_dim)
        self.activations = hidden_states.mean(dim=1)  # average over sequence
```

This hooks into a transformer layer and collects that layer's activations during inference.
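Usage looks like the following, reusing the `model`, `tokenizer`, and `layer_idx` from the sketch above (the prompt is a placeholder):

```python
extractor = ActivationExtractor()
handle = model.transformer.h[layer_idx].register_forward_hook(extractor.hook_fn)

with torch.no_grad():
    model(**tokenizer("I might be wrong, but...", return_tensors="pt"))
handle.remove()  # always detach hooks when done

print(extractor.activations.shape)  # torch.Size([1, 768])
```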
Model: GPT-2 Small
- 12 transformer layers
- 768 hidden dimensions (neurons per layer)
- 124M total parameters
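These figures can be read straight off the loaded model; a quick verification sketch (the count reflects `model.parameters()`, which counts tied weights once):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
print(model.config.n_layer)  # 12 transformer layers
print(model.config.n_embd)   # 768 hidden dimensions
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")  # ~124.4M
```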
Activation Source:
- Hook registered on `model.transformer.h[layer_idx]`
- Extracts hidden states: `(batch_size, sequence_length, 768)`
- Averages over sequence: `(batch_size, 768)`
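A toy-tensor sketch of that shape flow, independent of the model:

```python
import torch

hidden_states = torch.randn(16, 12, 768)  # (batch_size, sequence_length, 768)
pooled = hidden_states.mean(dim=1)        # collapse the sequence dimension
print(pooled.shape)                       # torch.Size([16, 768])
```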