โœ๏ธ Enter Your Sentence

Type or paste any sentence to visualize how the transformer model processes it with attention

Attention heatmap for the example sentence "The cat sat on the mat" (row token attending to column token, values in %):

|       | [CLS] | The   | cat   | sat   | on    | the   | mat   | [SEP] |
|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| [CLS] | 100.0 | 50.4  | 8.3   | 5.5   | 5.9   | 7.1   | 7.7   | 9.3   |
| The   | 51.5  | 100.0 | 57.7  | 4.0   | 7.2   | 6.3   | 2.5   | 1.5   |
| cat   | 0.6   | 53.6  | 100.0 | 54.4  | 2.1   | 0.8   | 5.0   | 3.6   |
| sat   | 0.6   | 7.4   | 54.8  | 100.0 | 58.4  | 7.1   | 4.9   | 4.1   |
| on    | 3.8   | 7.9   | 4.1   | 56.7  | 100.0 | 54.9  | 4.1   | 9.7   |
| the   | 6.9   | 3.4   | 10.0  | 3.2   | 58.6  | 100.0 | 57.8  | 5.8   |
| mat   | 0.8   | 9.6   | 2.7   | 6.0   | 10.0  | 52.9  | 100.0 | 58.4  |
| [SEP] | 0.9   | 4.0   | 9.7   | 8.4   | 7.8   | 1.3   | 53.9  | 100.0 |

Head 1: Syntactic relationships (subject-verb) · Layer 1/6 · Head 1/4 · 8 tokens · Threshold 0.10

🧠 The Most Complete Free Transformer Explainer Tool

๐Ÿ‘๏ธ

5 Visualization Modes

Heatmap, Attention Arcs, Bipartite Graph, Radial Layout, and Information Flow. Each reveals different aspects of how the model processes your text.

๐Ÿ—๏ธ

Architecture Lab

Interactive layer-by-layer explorer with animated data flow. Click any layer (Embedding, Attention, FFN, Normalization) to see exactly what computation happens.

🔢

Math Playground

Hands-on softmax, Q/K/V attention, and positional encoding calculators. Drag sliders and instantly see how formulas behave. The best way to truly understand transformers.

⚖️

Model Comparator

Side-by-side comparison of BERT, GPT-2, GPT-3, T5, and Claude across parameters, layers, attention heads, context length, and training data.

Transformer Explainer 2026: Interactive Attention Mechanism Visualizer & Architecture Lab

Free professional Transformer Explainer tool for 2026. Type any sentence and see a real-time interactive visualization of how transformer-based AI models process language using self-attention. Visualize multi-head attention weights as heatmaps, arc diagrams, bipartite graphs, or radial layouts. Explore the complete transformer architecture layer by layer, from token embedding and positional encoding through multi-head attention, feed-forward networks, and layer normalization. Perfect for AI researchers, NLP developers, students, and anyone learning how BERT, GPT, T5, or Claude-style models work under the hood.

How the Attention Mechanism Works: Visualized Interactively

The attention mechanism is the core innovation of transformer models. When the model processes the word "bank" in the sentence "The bank by the river bank was flooded", it needs to understand which other words give context for what "bank" means. Self-attention computes a score for every pair of tokens by comparing the Query vector of one token against the Key vector of every other token. These scores are scaled by the square root of the key dimension and passed through a softmax to create attention weights: probabilities that sum to 1. The final representation of each token is a weighted average of all Value vectors, weighted by these attention probabilities. This Transformer Explainer tool makes this process visually interactive: type your sentence, select any token, and see exactly how much attention it pays to every other token across multiple layers and attention heads.
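The Query/Key/Value pipeline described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration with random toy weights and shapes, not the actual model behind this tool:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X:          (seq_len, d_model) token embeddings
    Wq, Wk, Wv: (d_model, d_k) projection matrices
    Returns the new token representations and the attention weights.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise Query-Key similarity
    weights = softmax(scores, axis=-1)  # each row is a probability distribution
    return weights @ V, weights         # weighted average of Value vectors

# Toy example: 4 tokens, d_model = 8, d_k = 4
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(weights.sum(axis=-1))  # every row of attention weights sums to 1
```

Each row of `weights` is one token's attention distribution over the sentence; that distribution is what the heatmap view colors in.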

Multi-Head Attention and What Different Heads Learn

Modern transformers use multiple attention heads running in parallel (typically 8 to 96 heads). Research into BERT's attention heads revealed that different heads specialize: some track subject-verb syntactic dependencies, others resolve pronoun co-reference, some focus on adjacent tokens for local context, and others capture long-range semantic relationships. This specialization emerges purely from training, with no explicit supervision: the model learns which aspects of language each head should track. Our Architecture Lab shows all attention heads for your chosen layer, and the Model Comparator shows how head count varies from BERT Base (12 heads) to GPT-3 (96 heads). Use the interactive head selector to isolate any individual head and see its unique attention pattern on your sentence.
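How a model's width is divided among parallel heads can be illustrated with a single reshape. The shapes below follow the standard convention (BERT Base's 768 dimensions split across 12 heads gives 64 dimensions per head); real implementations fuse this step with the Q/K/V projections:

```python
import numpy as np

def split_heads(X, n_heads):
    """Reshape (seq_len, d_model) -> (n_heads, seq_len, d_head):
    each head attends over its own d_head-sized slice of every token."""
    seq_len, d_model = X.shape
    assert d_model % n_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // n_heads
    return X.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

# BERT Base widths: 8 tokens ([CLS] The cat sat on the mat [SEP]), d_model = 768
X = np.zeros((8, 768))
print(split_heads(X, n_heads=12).shape)  # (12, 8, 64)
```

After attention runs independently inside each head, the per-head outputs are concatenated back to the full `d_model` width, which is why head count times head size equals the model dimension.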

Embedding Space and Positional Encoding Visualization

Before attention is computed, each token is converted to a dense vector representation (embedding) and combined with a positional encoding. Our Embedding Space tab visualizes these 64- to 768-dimensional vectors as color-coded heatmaps: rows are tokens, columns are embedding dimensions. The Math Playground's Positional Encoding calculator lets you drag a position slider (0–511) and watch the sinusoidal values change across dimensions. The formulas PE(pos, 2i) = sin(pos/10000^(2i/d)) and PE(pos, 2i+1) = cos(pos/10000^(2i/d)) ensure that relative positions can be easily computed by the attention mechanism, and that the model can generalize to sequences longer than those seen during training.
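The sinusoidal formula above can be computed directly. A short NumPy sketch, with a 512-position table to match the slider's 0–511 range and an arbitrarily chosen embedding width of 64:

```python
import numpy as np

def positional_encoding(max_pos, d_model):
    """Sinusoidal positional encodings:
    PE(pos, 2i)   = sin(pos / 10000**(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000**(2i/d_model))
    """
    pos = np.arange(max_pos)[:, None]       # (max_pos, 1) column of positions
    two_i = np.arange(0, d_model, 2)[None, :]  # even dimension indices = 2i
    angles = pos / 10000 ** (two_i / d_model)  # broadcast to (max_pos, d_model/2)
    pe = np.zeros((max_pos, d_model))
    pe[:, 0::2] = np.sin(angles)            # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)            # odd dimensions get cosine
    return pe

pe = positional_encoding(max_pos=512, d_model=64)
print(pe.shape)     # (512, 64)
print(pe[0, 0::2])  # position 0: sin(0) = 0 in every even dimension
```

Low dimensions oscillate quickly and high dimensions slowly, so each position gets a unique fingerprint, and a fixed offset between two positions corresponds to a fixed rotation of each sine/cosine pair.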