This page is a paper highlight, something I want to do more of: a page dedicated to a very cool paper I found.
I was thinking about the low MFU (model FLOPs utilization) of MoE training and how to fix it. I find it very unsatisfying that this is still an open problem, with so much compute going to waste compared to a dense model.
## The problem
The issue is that you have two options: fetch all the experts' weights to the die and select between them for each token, or route each token into a per-expert buffer and run one dense matmul per expert. To my (very limited) knowledge, the second is the way to go; however, it takes time to route the tokens to where they need to be, especially in expert-parallel scenarios.
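To make the second option concrete, here's a minimal sketch of top-k routing into per-expert buffers. This is plain NumPy and all the names are mine (not from any real MoE library); a production implementation would do this with fused scatter kernels, not Python loops:

```python
import numpy as np

def route_tokens(x, router_w, k=2):
    """Route each token to its top-k experts and build per-expert buffers.

    x:        (n_tokens, d_model) token activations
    router_w: (d_model, n_experts) router weights
    Returns {expert_id: (token buffer, gate weights, source token ids)}.
    """
    logits = x @ router_w                      # (n_tokens, n_experts) router matmul
    topk = np.argsort(-logits, axis=1)[:, :k]  # top-k expert ids per token
    # softmax over just the selected experts' logits to get gate weights
    sel = np.take_along_axis(logits, topk, axis=1)
    gates = np.exp(sel - sel.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)

    buffers = {}
    for tok in range(x.shape[0]):
        for slot in range(k):
            e = int(topk[tok, slot])
            buf = buffers.setdefault(e, ([], [], []))
            buf[0].append(x[tok])
            buf[1].append(gates[tok, slot])
            buf[2].append(tok)
    # each expert can now run ONE dense matmul over its whole buffer
    return {e: (np.stack(t), np.array(g), ids)
            for e, (t, g, ids) in buffers.items()}
```

The point of the buffers is that every expert sees a single contiguous batch, so the per-expert FFN is an ordinary dense matmul; the cost you pay is the gather/scatter (and, under expert parallelism, an all-to-all) to build them.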
![[MoE.excalidraw.png]]
Only once the router matmul, top-k selection, etc. are done can the tokens *begin* moving to their destinations. Until then the GPU sits basically idle. And if you're doing expert parallelism, that means waiting on tokens from *other* GPUs, which takes even longer.
## The solution
![[scmoe.excalidraw.png]]
My initial thought was: why can't we just move the router to before the self-attention and fuse the kernels for the data-movement part? Fetch the vectors to the SM, do the RMSNorm or matmul or whatever, and send them straight to the expert buffers. Sure, it means the router can't condition on the self-attention output, which is annoying, but that's probably fine for anything but the first layer, and frankly probably fine there too.
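Here's a toy sketch of the overlap this buys, with Python threads standing in for an async all-to-all and sleeps standing in for real kernel/network time. Every name here is mine, and this is a cartoon of the idea, not a real training loop:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def router(xs):
    # stand-in for the router matmul + top-k: even positions -> expert 0, odd -> 1
    return [i % 2 for i in range(len(xs))]

def dispatch(xs, assignment):
    # stand-in for the all-to-all; the sleep mimics network latency
    time.sleep(0.02)
    buffers = {0: [], 1: []}
    for v, e in zip(xs, assignment):
        buffers[e].append(v)
    return buffers

def self_attn(xs):
    # stand-in for the attention compute that the dispatch hides behind
    time.sleep(0.02)
    return [v + 1 for v in xs]

def shortcut_forward(xs, pool):
    assignment = router(xs)                      # router sees the PRE-attention input
    fut = pool.submit(dispatch, xs, assignment)  # all-to-all starts immediately...
    attn_out = self_attn(xs)                     # ...and overlaps with attention
    return attn_out, fut.result()                # expert buffers ready right after
```

With the router placed after attention, `dispatch` couldn't start until `self_attn` returned; here the two costs run concurrently, which is the whole trick.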
![[scmoe.png]]
Well it turns out someone tried that, and it actually works! Introducing "[Shortcut-connected Expert Parallelism for Accelerating Mixture of Experts](https://arxiv.org/abs/2404.05019)". And not only that, they did it better! They alternate between dense and sparse MoE sub-blocks, so the router can still sit after a self-attention mechanism and thus be conditioned on its input.
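My rough structural reading of the block, as a toy sketch. I've written it sequentially for clarity; in the real system the sparse branch's dispatch would be an async all-to-all overlapping the dense path, and all names and toy "experts" here are mine, not the paper's:

```python
def toy_attn(xs):
    return [v + 1 for v in xs]          # stand-in self-attention

def toy_ffn(xs):
    return [2 * v for v in xs]          # stand-in dense FFN

def toy_router(xs):
    return [0 if v % 2 == 0 else 1 for v in xs]

def toy_experts(xs, gate):
    # expert 0 adds 10, expert 1 adds 100 (stand-ins for the sparse expert FFNs)
    return [v + (10 if e == 0 else 100) for v, e in zip(xs, gate)]

def scmoe_block(xs):
    gate = toy_router(xs)               # routing decided on the block INPUT,
    routed = toy_experts(xs, gate)      # so its dispatch can start right away
    dense = toy_ffn(toy_attn(xs))       # dense path runs while experts are in flight
    return [d + r for d, r in zip(dense, routed)]
```

The key dependency structure: the sparse branch only needs the block input, so its communication window spans the entire dense branch instead of starting after it.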
![[Pasted image 20251030030752.png]]
And there you have it, perfect compute-communication overlap!
## Conclusion
What I wonder, then, is whether DeepSeek did this with their one shared expert. Does anyone who actually read everything they published know? They could simply be overlapping the computation of the one shared expert with the communication/routing for the other experts, which would also work, though it's mildly less powerful than doing it sequentially like in ScMoE.
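That speculated shared-expert overlap would look something like this toy sketch. To be clear, this is my guess, not anything DeepSeek has confirmed; threads stand in for async communication and all names are mine:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def routed_experts(xs):
    # stand-in for dispatch -> routed-expert FFNs -> combine;
    # the sleep mimics the all-to-all latency
    time.sleep(0.02)
    return [v * 2 for v in xs]

def shared_expert(xs):
    # the one always-on expert; its compute hides the communication above
    time.sleep(0.02)
    return [v + 1 for v in xs]

def moe_forward(xs, pool):
    fut = pool.submit(routed_experts, xs)  # routed path (with comms) in flight
    shared = shared_expert(xs)             # shared expert computed meanwhile
    routed = fut.result()
    return [s + r for s, r in zip(shared, routed)]
```

The shared expert needs no routing, so it's "free" overlap material; the limitation versus ScMoE is that the window is only as long as one expert's compute, not a whole dense sub-block.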
**Update**: This paper was actually used in a production model! [LongCat-Flash](https://huggingface.co/meituan-longcat/LongCat-Flash-Chat)
*"featuring an innovative Mixture-of-Experts (MoE) architecture. ... To achieve advanced training and inference efficiency, we employ a **shortcut-connected architecture that expands computation**-communication overlap window, achieving over 100 tokens per second (TPS) for inference cost-effectively"*
*"As communication overhead becomes a bottleneck during MoE model scaling, we incorporate the Shortcut-connected MoE (ScMoE) design to expand the computation-communication overlap window"*