First-principles notes on LLMs, GPU kernels, and model training

Tags

Mixture of Experts (1)

Trending Tags

LLM Math Training Transformer Attention GPU Inference Systems Architecture FFN

© 2026 Datta Nimmaturi. Some rights reserved.
