
Understanding multi-GPU parallelism paradigms
We’ve been talking about Transformers all this while. But how do we get the most out of our hardware? There are two different paradigms that we can talk about here. One case where your model happil...
An intuitive build-up to Attention and the Transformer
Comparing various transformer architectures like MHA, GQA, Multi-head Latent Attention, nGPT, and the Differential Transformer.
A better initialisation for LoRA to make convergence faster