The notebook in this repository implements attention head weight inspection as described in the paper Why Can’t Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls.
The notebook builds a 2‑layer / 4‑head self‑attention stack, runs it on a short “digit” sequence shaped like the paper’s setup (digits of (a), a “(*)”, digits of (b), then output slots (c_0,c_1,c_2)), and prints per‑head attention weights for the query positions (c_1) and (c_2).
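The sequence layout can be sketched as follows; the digit counts and token names here are illustrative (chosen to match the digit indices mentioned below), not necessarily the notebook's exact identifiers:

```python
# Illustrative token sequence: digits of a, a "*" separator, digits of b,
# then the output slots c_0..c_2. The queries of interest are c_1 and c_2.
a_digits = ["a_0", "a_1", "a_2"]
b_digits = ["b_0", "b_1"]
out_slots = ["c_0", "c_1", "c_2"]
sequence = a_digits + ["*"] + b_digits + out_slots
print(sequence)
```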
To make the heads’ behavior obvious without training, the token embeddings are hand‑crafted so that:
- at (c_1): Head 0 focuses on ((a_1, b_0)); Head 1 focuses on ((a_0, b_1));
- at (c_2): Head 2 focuses on ((a_2, b_0)); Head 3 focuses on ((a_1, b_1)).
This mirrors the “cache & retrieve” intuition—different heads pull different digit pairs—seen in the 2‑layer/4‑head ICoT model analyzed in the paper (they note this is the smallest architecture where ICoT solves (4\times4) multiplication). (Background on ICoT itself: stepwise internalization and KD variants.)
Because each head is given a disjoint 2‑D subspace and the query/key projections are set to identity, the dot products (and hence the softmax attention weights) line up so that:
- At (c_1):
  - Head 0’s top keys include (a_1) and (b_0).
  - Head 1’s top keys include (a_0) and (b_1).
  - Heads 2–3 have near‑uniform weights here (they’re “idle” on (c_1)).
- At (c_2):
  - Head 2’s top keys include (a_2) and (b_0).
  - Head 3’s top keys include (a_1) and (b_1).
  - Heads 0–1 are near‑uniform here.
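A minimal sketch of the hand‑crafting trick for the (c_1) query: each head h owns dims [2h, 2h+1] of the embedding, and a large shared value is written into head h's subspace for exactly the tokens that head should link. The `scale` value, `mark` helper, and embedding values are assumptions for illustration, not the notebook's exact numbers:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

n_heads, d_head = 4, 2
tokens = ["a_0", "a_1", "a_2", "*", "b_0", "b_1", "c_0", "c_1", "c_2"]
E = np.zeros((len(tokens), n_heads * d_head))

scale = 5.0  # large enough that the softmax is sharply peaked

def mark(tok, head):
    # Write into the first coordinate of `head`'s 2-D subspace (illustrative).
    E[tokens.index(tok), 2 * head] = scale

# Head 0 at c_1 should retrieve (a_1, b_0); head 1 should retrieve (a_0, b_1).
for tok in ("a_1", "b_0", "c_1"):
    mark(tok, 0)
for tok in ("a_0", "b_1", "c_1"):
    mark(tok, 1)

# With identity Q/K projections, head-h scores are dot products in its subspace.
q = tokens.index("c_1")
weights = {}
for h in range(n_heads):
    sub = slice(2 * h, 2 * h + 2)
    scores = E[:q + 1, sub] @ E[q, sub]  # causal: only keys up to c_1
    weights[h] = softmax(scores)
    print(f"head {h} at c_1:", np.round(weights[h], 3))
```

Heads 0–1 concentrate their mass on the marked digit pair (plus (c_1) itself, since the query token shares the subspace), while heads 2–3 see all-zero scores and fall back to uniform weights, matching the “idle” behavior described above.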
That illustrates head specialisation on different digit pairs—akin to the attention “tree” (cache and retrieve) the paper reverse‑engineers in the successful 2‑layer/4‑head ICoT model.
- This is didactic code: no training is implemented. For a trainable ICoT implementation, see https://github.com/da03/Internalize_CoT_Step_by_Step by the same authors.
- In the paper’s trained model, layer‑1 heads cache pairwise products at earlier tokens, and layer‑2 heads retrieve the needed pairs for the current output digit, forming a sparse, binary‑tree‑like pattern.
- The choice of 2 layers / 4 heads matches the smallest configuration the authors report where ICoT solves (4\times4) multiplication; ICoT itself is introduced and studied in the cited works.
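Putting the pieces together, a per‑head attention‑weight inspector in the spirit of the notebook might look like the sketch below. It assumes identity Q/K projections and disjoint per‑head subspaces, as in the simplified setup above (a real transformer would use learned projections and a value/output path as well):

```python
import numpy as np

def per_head_attention(X, n_heads):
    """Causal softmax attention weights per head, with identity Q/K
    projections and disjoint per-head subspaces (a sketch of the
    notebook's simplified setup)."""
    T, d_model = X.shape
    d_head = d_model // n_heads
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # keys after the query
    W = np.empty((n_heads, T, T))
    for h in range(n_heads):
        sub = X[:, h * d_head:(h + 1) * d_head]  # this head's subspace
        scores = sub @ sub.T                     # identity Q/K: raw dot products
        scores[future] = -np.inf                 # causal mask
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        W[h] = e / e.sum(axis=-1, keepdims=True)
    return W  # shape (n_heads, query, key)

# Usage: inspect the rows for the c_1 and c_2 query positions
# (positions 7 and 8 in the 9-token layout used above).
X = np.random.default_rng(0).normal(size=(9, 8))
W = per_head_attention(X, n_heads=4)
for name, pos in (("c_1", 7), ("c_2", 8)):
    for h in range(4):
        print(f"{name} head {h}:", np.round(W[h, pos], 2))
```

Run on the hand‑crafted embeddings instead of random ones, the (c_1) rows peak for heads 0–1 and the (c_2) rows for heads 2–3, reproducing the pattern listed earlier.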