"Taalas hard-wired models provide ~10x faster inference at ~10x less power. This seems tricky to adapt to a mixture of experts architecture, like that used in current frontier models, although maybe one could hard-wire every model in the ensemble somehow."
This shows a misunderstanding of what the mixture of experts architecture is: it is not an ensemble of models.
I see, thanks! What would be the short and sweet but correct paragraph? And what's the conclusion vs this architecture? Is there some way you can hardwire parts of a bigger model using this technology and have it work out?
A MoE model is a single model in which one component, the feed-forward network, is split into smaller sub-networks ("experts"). Each expert is not a standalone model.
I haven't thought deeply about it, but there's no reason in principle you couldn't etch a MoE model onto a chip the same way they etched a dense model. Practically speaking, though, the advantage of a MoE is that you constantly swap experts in and out of memory, only using a small fraction of them per token. You can't do that if you etch the model into the chip, so you lose the main benefit of MoE.
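To make the sparsity argument concrete, here is a minimal sketch of top-k MoE routing (hypothetical shapes and sizes, numpy only; real MoE layers use learned routers and two-layer experts): per token, only `top_k` of the `n_experts` weight matrices are ever touched, which is exactly the memory/compute saving that hard-wiring every expert would give up.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# Each "expert" stands in for a small feed-forward network
# (reduced here to a single weight matrix for brevity).
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))  # gating network

def moe_forward(x):
    """Route one token vector to its top-k experts; only those run."""
    logits = x @ router
    chosen = np.argsort(logits)[-top_k:]      # indices of the selected experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                  # softmax over the selected experts
    # Only top_k of the n_experts matrices are used for this token --
    # the rest could sit in slow memory (or never be fabricated at all).
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

y = moe_forward(rng.standard_normal(d_model))
print(y.shape)
```

The point of the sketch: the unselected experts contribute nothing per token, so a GPU can page them out; a chip with all experts permanently etched in pays their silicon cost whether they fire or not.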