Mixture of Experts (MoE) models are here. Two hours ago Mistral released their 8x 7b MoE. GPT-4 is widely rumored to be MoE. Likely, this is the future of LLMs. Below is a short summary & my reading list for the topic.
The core advantage of MoE is inference efficiency. In a normal LLM, every weight is used to calculate the next token; with MoE, only a subset of the weights is used. For example, Mistral's 8x 7b has 56b weights (8 experts x 7b weights each), but only 14b weights (i.e. two experts) are used for each token. That's roughly 4x less compute per token than using all the weights (and about 5x less than a dense 70b model).
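Quick back-of-envelope on those ratios, using the 56b / 14b figures above (illustrative arithmetic only):

```python
# Back-of-envelope compute ratios from the figures above (illustrative only).
total_weights  = 8 * 7e9   # 8 experts x 7b weights each = 56b
active_weights = 2 * 7e9   # top-2 experts per token     = 14b

print(total_weights / active_weights)  # 4.0 -> ~4x less compute per token than the full 56b
print(70e9 / active_weights)           # 5.0 -> ~5x less than a dense 70b model
```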
The experts are selected anew for each token. This is different from query-level routing, where you take an entire query and pick which model (from a set of models) to send it to.
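To make the per-token selection concrete, here is a minimal sketch of a top-2 gated MoE feed-forward layer in PyTorch. This is not Mistral's implementation (no inference code is out yet); the dimensions, the linear gating layer, and the softmax-over-top-2 weighting are my own assumptions for illustration.

```python
# Minimal sketch of a top-2 mixture-of-experts feed-forward layer.
# Dimensions and gating details are illustrative assumptions, not Mistral's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # One small feed-forward network per expert.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        # The router/gate scores every expert for every token.
        self.gate = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        scores = self.gate(x)                    # (batch, seq_len, n_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)  # normalize over the chosen experts only
        out = torch.zeros_like(x)
        # Naive loops for clarity: real implementations batch tokens per expert.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (top_idx[..., k] == e)    # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Only 2 of the 8 expert FFNs run per token, so per-token FFN compute
# is roughly 2/8 of running every expert.
x = torch.randn(1, 4, 512)
y = MoEFeedForward()(x)
print(y.shape)  # torch.Size([1, 4, 512])
```

The per-expert loop is just for readability; production implementations group the tokens assigned to each expert and dispatch them in batches.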
Reading List:
1. Noam Shazeer & gang at Google on MoE. Short and pretty readable. Uses it with LSTMs (not Transformers) but has the key ideas. https://lnkd.in/dHiSn5cZ
2. Stanford CS25 on MoE. Good intro lecture on MoE.
https://lnkd.in/dcZvisSF
3. Implementation/GitHub. I haven't run it myself, but this is pretty readable and gives you an idea.
https://lnkd.in/dT22DDxx
The torrent tracker for Mistral's 8x 7b MoE model from today is on their Twitter feed. There is no inference code available yet that I know of.