3 Comments
Amit

"Taalas hard-wired models provide ~10x faster inference at ~10x less power. This seems tricky to adapt to a mixture of experts architecture, like that used in current frontier models, although maybe one could hard-wire every model in the ensemble somehow."

This shows a misunderstanding of what the mixture of experts architecture is; it is not an ensemble of models.

Nuño Sempere

I see, thanks! What would be the short and sweet but correct paragraph? And what's the conclusion for this architecture? Is there some way you can hard-wire parts of a bigger model using this technology and have it work out?

Tim Dingman

A MoE model is a single model in which one component, the feed-forward network, is made up of smaller networks ("experts"), with a router selecting a few of them for each token. Each expert is not a standalone model.
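To make that concrete, here is a minimal sketch of the routing idea in NumPy. All the names and sizes (`router`, `experts`, `top_k`, the dimensions) are illustrative assumptions, not any particular model's internals: a router scores the experts for a token, and only the top-k experts actually run.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 8, 16, 4, 2  # toy sizes, not real model dims

# Each "expert" is just a small feed-forward network: two linear maps with a ReLU.
experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.1,
     rng.standard_normal((d_ff, d_model)) * 0.1)
    for _ in range(n_experts)
]
router = rng.standard_normal((d_model, n_experts)) * 0.1  # gating weights

def moe_ffn(x):
    """Route token vector x to its top-k experts and mix their outputs."""
    logits = x @ router
    chosen = np.argsort(logits)[-top_k:]           # indices of the top-k experts
    gates = np.exp(logits[chosen])
    gates /= gates.sum()                           # softmax over chosen experts only
    out = np.zeros_like(x)
    for g, i in zip(gates, chosen):
        w1, w2 = experts[i]
        out += g * (np.maximum(x @ w1, 0.0) @ w2)  # only k of n experts execute
    return out

y = moe_ffn(rng.standard_normal(d_model))
```

The point of the sketch is the last loop: per token, only `top_k` of the `n_experts` feed-forward blocks do any work, which is where the MoE compute savings come from.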

Haven't thought deeply about it, but there's no reason in principle you couldn't etch a MoE model onto a chip the same way they etched a dense model. Practically speaking, though, the advantage of a MoE is that only a small fraction of the experts runs per token, so you can swap experts in and out of memory rather than paying for all of them at once. If you etch the whole model into the chip, the inactive experts become idle silicon on every token, so you lose the main benefit of MoE.