Notes on LLM systems

First-principles explanations of attention, fine-tuning, GPU kernels, and the engineering details behind modern deep learning systems.

Systems for LLM RL

A foray into the systems challenges and approaches for LLM RL

A brief introduction to reinforcement learning for LLMs

Fast and memory-efficient exact attention

LoRA imagined from the ground up

An intuitive build-up to Mixture of Experts

A practical guide to data, tensor, and pipeline parallelism for LLM inference and training

An intuitive build-up to attention and the Transformer

Comparing transformer architectures such as MHA, GQA, Multi-head Latent Attention, nGPT, and Differential Transformer

A better initialization for LoRA to make convergence faster