
The MathemaTricks behind FlashAttention
Fast and memory-efficient exact attention
I write visual, math-backed explanations of attention, fine-tuning, GPU kernels, and the engineering details behind modern deep learning systems.

LoRA imagined from the ground up

An intuitive build-up to Mixture of Experts

A practical guide to data, tensor, and pipeline parallelism for LLM inference and training

An intuitive build-up to Attention and the Transformer

Comparing transformer architectures such as MHA, GQA, Multi-head Latent Attention, nGPT, and Differential Transformer.

A better initialization for LoRA to make convergence faster