
Exploring the Mixture of Experts
An intuitive build-up to Mixture of Experts

We’ve been talking about Transformers all this while. But how do we get the most out of our hardware? There are two paradigms to consider here. One case where your model happil...
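The teaser above hints at why Mixture of Experts helps hardware efficiency: each token activates only a small subset of experts instead of one large dense layer. A minimal NumPy sketch of top-k routing, assuming single-matrix experts and a linear router (all names, sizes, and details here are illustrative, not taken from the post itself):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MoELayer:
    """Illustrative top-k Mixture of Experts layer (not the post's code)."""
    def __init__(self, d_model, n_experts, top_k, seed=0):
        rng = np.random.default_rng(seed)
        # Router scores each token against every expert.
        self.router = rng.standard_normal((d_model, n_experts)) * 0.02
        # Each expert is a single linear map, for simplicity.
        self.experts = [rng.standard_normal((d_model, d_model)) * 0.02
                        for _ in range(n_experts)]
        self.top_k = top_k

    def __call__(self, x):
        # x: (tokens, d_model)
        logits = x @ self.router                             # (tokens, n_experts)
        chosen = np.argsort(logits, axis=-1)[:, -self.top_k:]  # top-k experts per token
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            sel = chosen[t]
            weights = softmax(logits[t, sel])  # renormalise over the selected experts
            for w, e in zip(weights, sel):
                out[t] += w * (x[t] @ self.experts[e])
        return out, chosen

moe = MoELayer(d_model=8, n_experts=4, top_k=2)
x = np.random.default_rng(1).standard_normal((3, 8))
y, chosen = moe(x)
print(y.shape, chosen.shape)  # each token runs only 2 of the 4 experts
```

Because only `top_k` of the `n_experts` matrices are ever multiplied per token, parameter count grows with the number of experts while per-token compute stays roughly constant.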

An intuitive build-up to Attention and the Transformer

A comparison of transformer architecture variants: MHA, GQA, Multi-head Latent Attention (MLA), nGPT, and the Differential Transformer.

A better initialisation for LoRA to make convergence faster