Tags: activations (1), Attention (2), Data Parallelism (1), Differential Transformer (1), FFNN (3), Fine-tuning (2), GPU (1), GQA (1), KV cache (1), LLM (1), LoRA (2), Math (3), memory (1), MHA (1), Mixture of Experts (1), MLA (1), Multi Latent Attention (1), nanoformer (1), nGPT (1), Parallelism (1), Pipeline Parallelism (1), Tensor Parallelism (1), Training (2), Transformer (4), vLLM (1)