Multiple recent experiments show that extremely small transformers (hundreds of parameters) can learn long addition and generalize to fresh test data, with information‑theoretic checks ruling out memorization. That suggests the architecture can discover compact algorithmic solutions, not just statistical associations.
— If transformers can internalize algorithms at tiny scale, capability forecasts, interpretability research, safety timelines, and the economics of on‑device AI all need revising.
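One crude way to see why memorization is implausible at this scale is a capacity comparison. The sketch below is only an illustrative assumption, not the analysis from the cited papers: it bounds the information a model could store by parameter count times bit width, then compares that with the bits needed to memorize answers to fresh 10‑digit addition problems.

```python
import math

# Illustrative back-of-envelope check (assumed setup, not the papers' exact method):
# compare a crude upper bound on storable information with the information
# needed to memorize answers to random 10-digit addition problems.

def model_capacity_bits(n_params: int, bits_per_param: float = 16.0) -> float:
    """Crude upper bound: each parameter stores at most its bit width."""
    return n_params * bits_per_param

def bits_to_memorize(n_examples: int, n_digits: int = 10) -> float:
    """Bits needed to store answers to n random n_digits-digit additions.
    Each sum has up to n_digits + 1 decimal digits, ~log2(10) bits per digit."""
    return n_examples * (n_digits + 1) * math.log2(10)

for n_params in (456, 777):
    cap = model_capacity_bits(n_params)
    # How many answers a pure lookup table of this size could hold at most.
    max_memorized = int(cap // ((10 + 1) * math.log2(10)))
    print(f"{n_params} params: <= {cap:.0f} bits of capacity, "
          f"enough to memorize at most ~{max_memorized} answers; "
          f"a fresh test set of 10,000 problems alone would need "
          f"~{bits_to_memorize(10_000):,.0f} bits.")
```

Under these assumptions, a few-hundred-parameter model can hold at most a few hundred memorized answers, so near-perfect accuracy on a large fresh test set points to a learned algorithm rather than a lookup table.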
Alexander Kruel
2026.02.25
100% relevant
Papers hosted on GitHub reporting a ~777‑parameter and a 456‑parameter transformer solving 10‑digit addition, plus an information‑theoretic analysis (cited in the article).