Archives

2025
06 Jul  Understanding multi GPU Parallelism paradigms
14 Jun  Attention and Transformer Imagined
22 Jan  Transformer showdown MHA vs MLA vs nGPT vs Differential Transformer

2024
07 Jun  Rethink LoRA initialisations for faster convergence