Resources for LLM training.
https://www.awesomepython.org/?q=training
Physics of Language Models is a research framework and emerging field of study, pioneered by Zeyuan Allen-Zhu and collaborators, that seeks a principled, scientific understanding of how and why large language models (LLMs) behave the way they do. The analogy is to physics, which explains the natural world with universal laws rather than surface-level observations.
https://physics.allen-zhu.com/home
https://www.youtube.com/playlist?list=PLIZhMKKbVX6JmdngPRKvAS4u4L97odbGp
https://arxiv.org/search/?searchtype=all&query=Allen-Zhu+Physics+of+Language+Models&abstracts=show&size=50&order=-submitted_date
Summary: Shows that transformer-based models can learn hierarchical context-free grammars and that their hidden states and attention patterns reflect dynamic programming-like structure understanding. https://arxiv.org/abs/2305.13673
Summary: Investigates how language models solve grade-school math, identifying hidden model processes and distinguishing genuine reasoning from memorization. https://arxiv.org/abs/2407.20311
Summary: Studies how incorporating error-correction data into pretraining helps language models improve reasoning accuracy directly, without multi-round prompting. https://arxiv.org/abs/2408.16293
Summary: Shows that reliable knowledge extraction by LLMs depends on diversity in pretraining data and that knowledge must be sufficiently augmented during training to be practically retrievable. https://arxiv.org/abs/2309.14316
Summary: Demonstrates that while LLMs can retrieve stored knowledge well, they struggle with basic manipulation tasks like classification or comparison unless augmented with chain-of-thought mechanisms. https://arxiv.org/abs/2309.14402
Summary: Establishes scaling laws for LLM knowledge storage, finding a capacity of roughly 2 bits of factual knowledge per parameter and examining how architecture, data, and training affect it. https://arxiv.org/abs/2404.05405
Summary: Introduces Canon layers (lightweight components that improve information flow in sequence models) and shows, through controlled synthetic pretraining tasks, how they enhance reasoning and knowledge manipulation across architectures. https://arxiv.org/abs/2512.17351
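
The roughly 2 bits per parameter figure from the scaling-laws summary above can be turned into a back-of-envelope estimate. The sketch below is only arithmetic on that one number: the helper names (knowledge_capacity_bits, params_needed) are hypothetical, and it assumes the figure applies uniformly, ignoring the architecture, data, and training effects the paper studies.

```python
# Back-of-envelope arithmetic on the ~2 bits/parameter figure quoted above.
# Assumption: the figure applies uniformly; real capacity depends on
# architecture, data quality, and training, as the paper itself notes.

BITS_PER_PARAM = 2.0  # approximate value from the summary of arXiv:2404.05405


def knowledge_capacity_bits(num_params: float,
                            bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimated factual-knowledge capacity, in bits, for a model size."""
    return num_params * bits_per_param


def params_needed(knowledge_bits: float,
                  bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimated parameter count needed to store the given number of bits."""
    return knowledge_bits / bits_per_param


if __name__ == "__main__":
    # Hypothetical model sizes, chosen only for illustration.
    for n in (1e8, 1e9, 7e9):
        bits = knowledge_capacity_bits(n)
        gigabytes = bits / 8 / 1e9
        print(f"{n:.0e} params -> ~{bits:.1e} bits (~{gigabytes:.2f} GB)")
```

Read this as an upper-level sanity check rather than a prediction: it says nothing about whether the stored knowledge is actually extractable, which is the subject of the knowledge-extraction summaries above.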