oh this took me too long to figure out — the "zero computation experts": they have a (mostly) regular MoE router, but some of the experts are actually nothing at all. So the MoE router sometimes entirely skips expert computation
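For anyone else who got stuck on the same thing, here's a toy sketch of the idea (my own illustration, not the paper's code): the expert list just contains a few identity "experts" alongside the real FFNs, so a token routed to one of them passes through with no computation at all.

```python
import torch
import torch.nn as nn

class ZeroComputeMoE(nn.Module):
    """Toy MoE layer where the last n_zero "experts" are identity functions,
    i.e. zero-computation experts. Illustration only, not the paper's code."""
    def __init__(self, dim, n_real=6, n_zero=2, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(n_real)]   # real FFN experts
            + [nn.Identity() for _ in range(n_zero)]       # "do nothing" experts
        )
        self.router = nn.Linear(dim, n_real + n_zero)
        self.top_k = top_k

    def forward(self, x):                                  # x: [tokens, dim]
        probs = self.router(x).softmax(dim=-1)
        top_w, top_i = probs.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                         # naive loop, for clarity
            for w, i in zip(top_w[t], top_i[t]):
                out[t] += w * self.experts[i](x[t])        # identity expert: just x[t] back
        return out

moe = ZeroComputeMoE(dim=16)
print(moe(torch.randn(4, 16)).shape)                       # torch.Size([4, 16])
```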
the "shortcut-connected MoE" part is solving a more complex problem than it seems on the surface the problem is the hand-off between attention & MoE causes communication overhead (e.g. expert is located on a different GPU) ScMoE re-orders the pipeline, better utilizing compute
Yeah this part was really interesting, in part because it explains why models like V3 and K2 use a shared expert. They're doing it for the same reason, Meituan just added a second shared expert per layer in series.
My understanding is that there's one shared expert per layer. And for every second layer there's an MoE with routed experts (some of them zero-computation). And the MoE results are integrated with a one-layer delay, so the next attention layer can't yet utilize them.
So it's like augmenting a dense transformer with routed MoE parts that are processed in parallel and are one step behind the faster dense processing. So you get the additional smarts from those experts, but the very next layer misses their output, which is the price of hiding their (communication) delays.
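A rough sketch of that ordering as I understand it (module names, residual placement and the single-GPU routed_experts stand-in are my guesses, not the paper's implementation): the routed output is computed from the first attention block's result but only folded back in after the second dense sub-block, so the very next attention never sees it.

```python
import torch
import torch.nn as nn

class ScMoEBlockSketch(nn.Module):
    """Sketch of the shortcut-connected ordering described above.
    Names and residual placement are assumptions; routed_experts is a
    placeholder Linear standing in for the routed MoE and its comm."""
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.attn_1 = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.dense_ffn_1 = nn.Linear(dim, dim)     # first shared/dense expert
        self.attn_2 = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.dense_ffn_2 = nn.Linear(dim, dim)     # second shared/dense expert
        self.routed_experts = nn.Linear(dim, dim)  # stand-in for routed MoE + dispatch

    def forward(self, x):                          # x: [batch, seq, dim]
        h, _ = self.attn_1(x, x, x)
        # The routed experts start from here; on a real system this is where the
        # all-to-all dispatch would be kicked off, overlapping with the work below.
        routed_out = self.routed_experts(h)
        h = h + self.dense_ffn_1(h)                # fast dense path keeps going
        h2, _ = self.attn_2(h, h, h)               # this attention never sees routed_out
        h2 = h2 + self.dense_ffn_2(h2)
        return h2 + routed_out                     # routed result integrated one step late

block = ScMoEBlockSketch(dim=16)
print(block(torch.randn(2, 5, 16)).shape)          # torch.Size([2, 5, 16])
```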
I think it makes sense to think about this the other way around. The MoE is the main thing, and the dense layers are what you do while you're twiddling your thumbs because the data takes time to move.
But without the dense layers you would have two attention blocks in sequence. So those dense parts seem more fundamental: they are the ones that are so important they are always processed, and they are the only ones that can provide additional info to the very next attention block.
So I would rather think of it as a way of splitting the processing of shared vs. routed experts. You have both, as with V3, but you hide the communication delays by integrating the slower routed experts with a one-layer delay.
The amount of computation stays the same, but you just avoid waiting. As far as I understand it, V3 avoids waiting by using smart pipelining. So it has a system-level solution for the same problem that keeps GPUs busy by processing parts of other queries while moving data between steps.
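To make the "avoid waiting" idea concrete, here's a toy scheduling sketch (nothing to do with V3's actual implementation, just the general overlap pattern): while one micro-batch's tokens are in flight to the experts, you compute on another micro-batch, so the hardware never sits idle.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def dispatch(mb):               # stand-in for an all-to-all expert dispatch
    time.sleep(0.01)            # pretend the network transfer takes a while
    return f"micro-batch {mb} dispatched"

def compute(mb):                # stand-in for attention / dense FFN kernels
    time.sleep(0.01)
    return f"micro-batch {mb} computed"

with ThreadPoolExecutor(max_workers=1) as comm_pool:
    in_flight = comm_pool.submit(dispatch, 0)   # comm for micro-batch 0 starts...
    print(compute(1))                           # ...while micro-batch 1 computes
    print(in_flight.result())                   # comm has finished (or we wait here)
```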
And with that it avoids the possible negative effects of not having data from the routed experts in the very next attention layer. It's more efficient under heavy load: good total throughput, even if processing of a single query is slower than with this solution.
You could just not have the second dense-only attention block in there. This part explains how the whole point of shared experts is to make use of the time your MoE is moving data around, and the change here from the shared-expert design V3 uses is that they add a mini attn step & another FFN one
The starting point (Standard MoE) in the ScMoE paper is already an alternating pattern of dense and MoE blocks. They basically keep that alternating pattern but add a delay for integrating the routed-expert part of the MoE to hide its overhead.
So the starting point seems to be an already-optimized architecture that doesn't put the costlier MoE in every layer. V3 also has a small optimization along those lines, as it doesn't have MoE for the first 3 layers. And similarly, attention blocks may have alternating patterns of simpler implementations.
What counts as a layer here becomes ambiguous; you could describe it either way. But the paper itself calls a layer the big block that includes two dense FFN steps.
Might be easier to think of it that way, because then you have a stack of the same kind of block, instead of alternating kinds of blocks.
Also one thing I haven't seen mentioned is that this model is a remarkably sparse MoE.
V3: 256 routed experts, 8 active
K2: 384 routed experts, 8 active
Longcat-Flash: 512 experts, 8 active on average, up to 12
fwiw openai oss was like 1/32nd active
Yeah, 128/4 for the same ratio as V3 but in a significantly smaller model.
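Putting those ratios side by side (just back-of-the-envelope arithmetic on the counts quoted above):

```python
# Active fraction of routed experts per MoE layer, from the numbers above.
models = {
    "DeepSeek-V3":   (8, 256),   # 1/32 active
    "Kimi K2":       (8, 384),   # 1/48 active
    "Longcat-Flash": (8, 512),   # ~1/64 on average (up to 12 active ≈ 1/43)
    "gpt-oss":       (4, 128),   # 1/32, same ratio as V3
}
for name, (active, total) in models.items():
    print(f"{name:14s} {active}/{total} = 1/{total / active:g}")
```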
yeah ngl i had not understood the reason for the shared experts
i feel like this is the sort of shit you see when the US Government locks down compute bandwidth but not compute itself. We saw something similar with DeepSeek slinging their own PTX instead of CUDA to get around the nerfed comms
Localized constraints often lead to innovation unrelated to the constraint. For example, JIT production took off in Japan due to its high cost of warehousing, but it led to a faster rate of development iteration (less inventory to burn through) & fewer components affected by a given defect.
China: hungry teams + permissive (relative) govt treatment + foreign-imposed highly specific constraints that translate to efficiency gains
US: 4 near-infinite companies with mngmnt/policy that shifts every 3 minutes that lock out competition while academia is being rug-pulled by govt...
EU: 🤡...