Tags

activations (1), Attention (2), Data Parallelism (1), Differential Transformer (1), FFNN (2), Fine tuning (1), GPU (1), GQA (1), kv cache (1), LLM (1), LoRA (1), Math (1), memory (1), MHA (1), MLA (1), Multi Latent Attention (1), nanoformer (1), nGPT (1), Parallelism (1), Pipeline Parallelism (1), Tensor Parallelism (1), training (1), Transformer (2), vLLM (1)