The notebook in this repository implements attention head weight inspection as described in the paper Why Can’t Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls.
The notebook builds a 2‑layer / 4‑head self‑attention stack, runs it on a short “digit” sequence shaped like the paper’s setup (digits of (a), a “(*)”, digits of (b), then output slots (c_0,c_1,c_2)), and prints per‑head attention weights for the query positions (c_1) and (c_2).
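The sequence layout can be sketched as follows; the digit counts and token names here are illustrative (chosen to match the digit indices mentioned below), not necessarily the notebook's exact identifiers:

```python
# Illustrative token sequence: digits of a, a "*" separator, digits of b,
# then the output slots c_0..c_2. The queries of interest are c_1 and c_2.
a_digits = ["a_0", "a_1", "a_2"]
b_digits = ["b_0", "b_1"]
out_slots = ["c_0", "c_1", "c_2"]
sequence = a_digits + ["*"] + b_digits + out_slots
print(sequence)
```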
To make the heads’ behavior obvious without training, the token embeddings are hand‑crafted so that:
- at (c_1): Head 0 focuses on ((a_1, b_0)); Head 1 focuses on ((a_0, b_1));
- at (c_2): Head 2 focuses on ((a_2, b_0)); Head 3 focuses on ((a_1, b_1)).
This mirrors the “cache & retrieve” intuition—different heads pull different digit pairs—seen in the 2‑layer/4‑head ICoT model analyzed in the paper (they note this is the smallest architecture where ICoT solves (4\times4) multiplication). (Background on ICoT itself: stepwise internalization and KD variants.)
Because each head is given a disjoint 2‑D subspace and the query/key projections are set to identity, the dot products (and hence the softmax attention weights) line up so that:
- At (c_1):
  - Head 0’s top keys include (a_1) and (b_0).
  - Head 1’s top keys include (a_0) and (b_1).
  - Heads 2–3 have near‑uniform weights here (they’re “idle” on (c_1)).
- At (c_2):
  - Head 2’s top keys include (a_2) and (b_0).
  - Head 3’s top keys include (a_1) and (b_1).
  - Heads 0–1 are near‑uniform here.
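A minimal sketch of the hand‑crafting trick for the (c_1) query: each head h owns dims [2h, 2h+1] of the embedding, and a large shared value is written into head h's subspace for exactly the tokens that head should link. The `scale` value, `mark` helper, and embedding values are assumptions for illustration, not the notebook's exact numbers:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

n_heads, d_head = 4, 2
tokens = ["a_0", "a_1", "a_2", "*", "b_0", "b_1", "c_0", "c_1", "c_2"]
E = np.zeros((len(tokens), n_heads * d_head))

scale = 5.0  # large enough that the softmax is sharply peaked

def mark(tok, head):
    # Write into the first coordinate of `head`'s 2-D subspace (illustrative).
    E[tokens.index(tok), 2 * head] = scale

# Head 0 at c_1 should retrieve (a_1, b_0); head 1 should retrieve (a_0, b_1).
for tok in ("a_1", "b_0", "c_1"):
    mark(tok, 0)
for tok in ("a_0", "b_1", "c_1"):
    mark(tok, 1)

# With identity Q/K projections, head-h scores are dot products in its subspace.
q = tokens.index("c_1")
weights = {}
for h in range(n_heads):
    sub = slice(2 * h, 2 * h + 2)
    scores = E[:q + 1, sub] @ E[q, sub]  # causal: only keys up to c_1
    weights[h] = softmax(scores)
    print(f"head {h} at c_1:", np.round(weights[h], 3))
```

Heads 0–1 concentrate their mass on the marked digit pair (plus (c_1) itself, since the query token shares the subspace), while heads 2–3 see all-zero scores and fall back to uniform weights, matching the “idle” behavior described above.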
That illustrates head specialisation on different digit pairs—akin to the attention “tree” (cache and retrieve) the paper reverse‑engineers in the successful 2‑layer/4‑head ICoT model.
- This is didactic code: no training is implemented. For a trainable ICoT implementation, see https://github.com/da03/Internalize_CoT_Step_by_Step by the same authors.
- In the paper’s trained model, layer‑1 heads cache pairwise products at earlier tokens, and layer‑2 heads retrieve the needed pairs for the current output digit, forming a sparse, binary‑tree‑like pattern.
- The choice of 2 layers / 4 heads matches the smallest configuration the authors report where ICoT solves (4\times4) multiplication; ICoT itself is introduced and studied in the cited works.
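Putting the pieces together, a per‑head attention‑weight inspector in the spirit of the notebook might look like the sketch below. It assumes identity Q/K projections and disjoint per‑head subspaces, as in the simplified setup above (a real transformer would use learned projections and a value/output path as well):

```python
import numpy as np

def per_head_attention(X, n_heads):
    """Causal softmax attention weights per head, with identity Q/K
    projections and disjoint per-head subspaces (a sketch of the
    notebook's simplified setup)."""
    T, d_model = X.shape
    d_head = d_model // n_heads
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # keys after the query
    W = np.empty((n_heads, T, T))
    for h in range(n_heads):
        sub = X[:, h * d_head:(h + 1) * d_head]  # this head's subspace
        scores = sub @ sub.T                     # identity Q/K: raw dot products
        scores[future] = -np.inf                 # causal mask
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        W[h] = e / e.sum(axis=-1, keepdims=True)
    return W  # shape (n_heads, query, key)

# Usage: inspect the rows for the c_1 and c_2 query positions
# (positions 7 and 8 in the 9-token layout used above).
X = np.random.default_rng(0).normal(size=(9, 8))
W = per_head_attention(X, n_heads=4)
for name, pos in (("c_1", 7), ("c_2", 8)):
    for h in range(4):
        print(f"{name} head {h}:", np.round(W[h, pos], 2))
```

Run on the hand‑crafted embeddings instead of random ones, the (c_1) rows peak for heads 0–1 and the (c_2) rows for heads 2–3, reproducing the pattern listed earlier.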