Your Linear Layers are Secretly Fast Weights

A unifying view of fast weights and gradient descent

Rant on Hybrid Anything

Heavy aircraft carrying cruiser Kiev, USSR, 1985.

Before we begin, let's take a little detour into post-World War II warship design. There was once a fierce debate: "Aircraft carriers or battleships?" We all know how it ended. The last battleship, USS Missouri (BB-63), ended up as a floating museum. Yet before that, there was a glimmer of hope. "Why not a hybrid ship?" asked the young naval officer.

Thus was born the aircraft cruiser, a strange fusion of heavy guns and a flight deck. In theory, it could dominate both sea and sky; in practice, its guns shook the deck so badly that aircraft could barely take off. The concept soon proved unworkable, and these hybrids faded into history — bold, ambitious, and ultimately doomed experiments of naval design.

So Why Hybrid Models?

Modern LLMs are largely based on self-attention Transformers. I say largely because some recent models incorporate both attention layers and linear-attention (fast weight) variants. For example, Qwen3-Next mixes attention layers with Gated DeltaNet layers in a 1:3 ratio. These hybrid designs aim to buy efficiency without sacrificing too much modeling capability along the sequence dimension. Ultimately, like many hybrid systems, it comes down to trade-offs: compromises are made here and there, and it is hard to make everyone happy.

Setting efficiency aside, another purpose of fast weights is to enable models to learn continuously. Imagine that the model simply generates and reads in new tokens while its parameters are updated by those same tokens. In principle, this is somewhat achievable with current sequence-level fast weight designs. However, it still feels limited, unnatural, and, more importantly, inelegant.

Then, "What are we going to do?" asked the young padawan. It turns out the answer has been hidden in the plain sight. We just don't know how to tap into its power, yet.

Your Linear Layers are Secretly Fast Weights

Imagine a linear layer as follows:

$$\mathcal{F}(x) = (W_0 + \Delta W) x$$

Running the good old backpropagation:

$$\Delta W = \sum_i e_i \otimes x_i, \quad \text{where } e_i \text{ is the gradient of the loss w.r.t. the layer's output for example } x_i \text{ (sign and learning rate absorbed)}$$

Doesn't this look familiar? Precisely: with a little algebra, gradient descent on a linear layer can be seen as running fast weight updates over the whole training dataset:

$$\mathcal{F}(x) = W_0x + \sum_i e_i (x_i^T x) = W_0x + \text{LinearAttn}(E, X, x)$$
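This identity is easy to check numerically. Below is a minimal sketch in PyTorch; the toy sizes, the squared-error loss, and folding the sign and learning rate into e_i are all illustrative assumptions. One explicit gradient step applied to the weights produces the same output as keeping W_0 frozen and adding a linear-attention readout over the stored training pairs (e_i, x_i).

```python
import torch

torch.manual_seed(0)
d_in, d_out, n, lr = 4, 3, 8, 0.1

W0 = torch.randn(d_out, d_in)             # slow weights
X = torch.randn(n, d_in)                  # training inputs x_i
Y = torch.randn(n, d_out)                 # training targets

# e_i = -lr * dL/dy_i for the squared error L = 0.5 * sum_i ||W0 x_i - y_i||^2
E = -lr * (X @ W0.T - Y)                  # (n, d_out)

W_updated = W0 + E.T @ X                  # W0 + sum_i e_i x_i^T

x = torch.randn(d_in)                     # an arbitrary test input (the "query")
explicit = W_updated @ x                  # (W0 + dW) x
dual = W0 @ x + E.T @ (X @ x)             # W0 x + sum_i e_i (x_i^T x)

print(torch.allclose(explicit, dual))     # True
```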

We have had fast weight memory built into our models all this time, and we know the mechanism of its update. So why don't we utilize it? Why do we train our models and then freeze the parameters?
One obvious reason is that backpropagation is expensive. Modeling and approximating e is expensive: in the context of transformers, e depends not only on the current token but also on all future tokens and, potentially, on the final reward (in RL). This incurs exploding complexity, which makes e difficult to approximate or learn. To make matters worse, those future tokens are generally not available at inference time unless the sequence is fully unrolled. As a compromise, we instead predict e by projecting the input x with a linear layer; in attention terms this projection plays the role of the value, with x_i acting as the key and x as the query. That gives the basic formulation of most forward fast weights:

$$\mathcal{F}(x) = W_0x + \sum_i W_ex_i (x_i^T x)$$
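To make this concrete, here is a minimal PyTorch sketch of such a forward fast-weight layer. The class name FastWeightLinear, the e_predictor argument, and the explicit per-token loop are illustrative choices rather than any particular paper's implementation; by default e is predicted with a single linear projection, playing the role of W_e in the formula above.

```python
import torch
import torch.nn as nn

class FastWeightLinear(nn.Module):
    """Slow weights W0 plus a fast weight memory accumulated as sum_i e_i x_i^T."""

    def __init__(self, d_in, d_out, e_predictor=None):
        super().__init__()
        self.W0 = nn.Linear(d_in, d_out, bias=False)                # slow weights
        # By default, approximate e with a single linear projection (W_e above).
        self.e_predictor = e_predictor or nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):                  # x: (seq_len, d_in), processed causally
        dW = x.new_zeros(x.size(-1), self.W0.out_features)          # fast weights
        outs = []
        for x_t in x:
            e_t = self.e_predictor(x_t)
            dW = dW + torch.outer(x_t, e_t)          # write: add e_t x_t^T
            outs.append(self.W0(x_t) + x_t @ dW)     # read: W0 x + sum_i e_i (x_i^T x)
        return torch.stack(outs)

layer = FastWeightLinear(d_in=16, d_out=16)
y = layer(torch.randn(10, 16))             # (seq_len=10, d_out=16)
```

The sequential loop is written for clarity: this path is essentially unnormalized causal linear attention added to a residual slow-weight branch, and in practice it would be chunked or parallelized.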

This is obviously very problematic, since we are trying to approximate a rich and complex function with a simple linear layer. Fundamentally, it just does not make sense. One intuitive fix is to use an MLP, which is, in theory, a universal function approximator. However, to match the complexity of the backward function, the hidden layer would have to be so wide that we might be better off just running backpropagation. Remember, there is no such thing as a free lunch.

$$\mathcal{F}(x) = W_0x + \sum_i \mathcal{F'}(x_i) (x_i^T x)$$
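Reusing the FastWeightLinear sketch above, the MLP variant in this last formula only swaps out the e-predictor; the hidden width of 64 and the GELU activation are arbitrary illustrative choices.

```python
# F' as a small MLP instead of a single linear projection
# (assumes torch, nn, and FastWeightLinear from the previous sketch).
mlp_e = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
layer = FastWeightLinear(d_in=16, d_out=16, e_predictor=mlp_e)
y = layer(torch.randn(10, 16))             # (seq_len=10, d_out=16)
```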

Miscellaneous and Caveats

Of course, the above is an overly simplified view of training-time fast weights. During training there are a few more factors to consider, for instance momentum, learning rate schedules, and weight decay, all of which modify how the fast weight memory is updated. In addition, different optimizers use different update rules, which also dictate how the fast weight memory gets updated. Explicitly writing them out would be an interesting exercise, and might provide valuable insights into both optimizer and memory design; one small example is sketched below.
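As a small illustration (plain SGD with momentum β and learning rate η, written here with the learning rate kept explicit rather than absorbed into e; an assumed setup, not a full treatment), unrolling the momentum buffer shows that each past pair (x_i, e_i) is written into the fast weight memory with a coefficient that grows toward η/(1−β):

$$m_t = \beta\, m_{t-1} + e_t \otimes x_t, \qquad \Delta W_T = \eta \sum_{t=1}^{T} m_t = \eta \sum_{i=1}^{T} \frac{1 - \beta^{\,T-i+1}}{1-\beta}\, e_i \otimes x_i$$

In fast weight terms, momentum acts as a smoothed and amplified write rule, while weight decay would roughly correspond to a multiplicative forget gate on the accumulated memory.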

References

  1. Schlag, I., Irie, K., & Schmidhuber, J. (2021, July). Linear transformers are secretly fast weight programmers. In International Conference on Machine Learning (pp. 9355-9366). PMLR.
  2. Schmidhuber, J. (1992). Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1), 131–139.
  3. Yang, S., Kautz, J., & Hatamizadeh, A. (2024). Gated delta networks: Improving Mamba2 with delta rule. arXiv preprint arXiv:2412.06464.
  4. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., ... & Qiu, Z. (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388.
  5. Irie, K., Csordás, R., & Schmidhuber, J. (2022, June). The dual form of neural networks revisited: Connecting test time predictions to training patterns via spotlights of attention. In International Conference on Machine Learning (pp. 9639-9659). PMLR.
  6. Dai, D., Sun, Y., Dong, L., Hao, Y., Ma, S., Sui, Z., & Wei, F. (2022). Why can GPT learn in-context? Language models implicitly perform gradient descent as meta-optimizers. arXiv preprint arXiv:2212.10559.
  7. Komatsuzaki, A. [@arankomatsuzaki]. (2023, February 6). Actually, gradient descent can be seen as attention that applies beyond the model's context length! Let me explain why 🧵 👇 (1/N) [Post]. X. https://x.com/arankomatsuzaki/status/1622666312219598864. Refs: https://arxiv.org/abs/2202.05798, https://arxiv.org/abs/2212.10559
  8. Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67-82.

Citation

If you would like to cite this blog post, you can use the following BibTeX entry:


@misc{liang2025fastweights,
  title = {Your Linear Layers are Secretly Fast Weights},
  author = {Liang, Kaizhao},
  year = {2025},
  url = {https://kyleliang919.github.io/The_Ultimate_Fast_Weights}
}