Tags: activations (1), Attention (2), Data Parallelism (1), Differential Transformer (1), FFNN (3), Fine-tuning (1), GPU (1), GQA (1), KV cache (1), LLM (1), LoRA (1), Math (2), memory (1), MHA (1), Mixture of Experts (1), MLA (1), Multi Latent Attention (1), nanoformer (1), nGPT (1), Parallelism (1), Pipeline Parallelism (1), Tensor Parallelism (1), training (1), Transformer (3), vLLM (1)