Mixture of Experts (MoE) models are here. Two hours ago Mistral released their 8x 7b MoE. GPT-4 is widely rumored to be MoE. Likely, this is the future of LLMs. Below is a short summary & my reading list for the topic.
The core advantage of MoE is inference efficiency. In a normal LLM, every weight is used to calculate the next token; with MoE, only a subset of the weights is used. For example, Mistral's 8x 7b has 56b weights (8 experts x 7b weights each), but only 14b weights (i.e. two experts) are used for each token. That's roughly 4x less compute per token than using all the weights (and about 5x less than a dense 70b model).
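Quick back-of-envelope on those ratios, using the 56b / 14b figures above (illustrative arithmetic only):

```python
# Back-of-envelope compute ratios from the figures above (illustrative only).
total_weights  = 8 * 7e9   # 8 experts x 7b weights each = 56b
active_weights = 2 * 7e9   # top-2 experts per token     = 14b

print(total_weights / active_weights)  # 4.0 -> ~4x less compute per token than the full 56b
print(70e9 / active_weights)           # 5.0 -> ~5x less than a dense 70b model
```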
The experts are selected anew for each token. This is different from query-level routing, where you take an entire query and pick which model (from a set of models) to send it to.
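To make the per-token selection concrete, here is a minimal sketch of a top-2 gated MoE feed-forward layer in PyTorch. This is not Mistral's implementation (no inference code is out yet); the dimensions, the linear gating layer, and the softmax-over-top-2 weighting are my own assumptions for illustration.

```python
# Minimal sketch of a top-2 mixture-of-experts feed-forward layer.
# Dimensions and gating details are illustrative assumptions, not Mistral's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # One small feed-forward network per expert.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        # The router/gate scores every expert for every token.
        self.gate = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        scores = self.gate(x)                    # (batch, seq_len, n_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)  # normalize over the chosen experts only
        out = torch.zeros_like(x)
        # Naive loops for clarity: real implementations batch tokens per expert.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (top_idx[..., k] == e)    # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Only 2 of the 8 expert FFNs run per token, so per-token FFN compute
# is roughly 2/8 of running every expert.
x = torch.randn(1, 4, 512)
y = MoEFeedForward()(x)
print(y.shape)  # torch.Size([1, 4, 512])
```

The per-expert loop is just for readability; production implementations group the tokens assigned to each expert and dispatch them in batches.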
Reading List:
1. Noam Shazeer & gang at Google on MoE. Short and pretty readable. Uses it with LSTMs (not Transformers) but has the key ideas. https://lnkd.in/dHiSn5cZ
2. Stanford CS25 on MoE. Good intro lecture on MoE.
https://lnkd.in/dcZvisSF
3. Implementation/GitHub. I haven't run it myself, but this is pretty readable and gives you an idea.
https://lnkd.in/dT22DDxx
The torrent tracker for Mistral's 8x 7b MoE model from today is on their Twitter feed. There is no inference code available yet that I know of.