oh this took me too long to figure out — the "zero computation experts": they have a (mostly) regular MoE router, but some of the experts are actually nothing at all. So the MoE router sometimes entirely skips expert computation
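For anyone else who got stuck on the same thing, here's a toy sketch of the idea (my own illustration, not the paper's code): the expert list just contains a few identity "experts" alongside the real FFNs, so a token routed to one of them passes through with no computation at all.

```python
import torch
import torch.nn as nn

class ZeroComputeMoE(nn.Module):
    """Toy MoE layer where the last n_zero "experts" are identity functions,
    i.e. zero-computation experts. Illustration only, not the paper's code."""
    def __init__(self, dim, n_real=6, n_zero=2, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(n_real)]   # real FFN experts
            + [nn.Identity() for _ in range(n_zero)]       # "do nothing" experts
        )
        self.router = nn.Linear(dim, n_real + n_zero)
        self.top_k = top_k

    def forward(self, x):                                  # x: [tokens, dim]
        probs = self.router(x).softmax(dim=-1)
        top_w, top_i = probs.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                         # naive loop, for clarity
            for w, i in zip(top_w[t], top_i[t]):
                out[t] += w * self.experts[i](x[t])        # identity expert: just x[t] back
        return out

moe = ZeroComputeMoE(dim=16)
print(moe(torch.randn(4, 16)).shape)                       # torch.Size([4, 16])
```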
the "shortcut-connected MoE" part is solving a more complex problem than it seems on the surface the problem is the hand-off between attention & MoE causes communication overhead (e.g. expert is located on a different GPU) ScMoE re-orders the pipeline, better utilizing compute
Yeah this part was really interesting, in part because it explains why models like V3 and K2 use a shared expert. They're doing it for the same reason, Meituan just added a second shared expert per layer in series.
My understanding is that there's one shared expert per layer. And for every second layer there's an MoE with routed experts (some of them zero-computation). And the MoE results are integrated with a one-layer delay, so the next attention layer can't yet utilize them.
So it's like augmenting a dense transformer with routed MoE parts that are processed in parallel and are one step behind the faster dense processing. So you get the additional smarts from those experts, but the very next layer misses their output, which is the price of hiding their (communication) delays.
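A rough sketch of that ordering as I understand it (module names, residual placement and the single-GPU routed_experts stand-in are my guesses, not the paper's implementation): the routed output is computed from the first attention block's result but only folded back in after the second dense sub-block, so the very next attention never sees it.

```python
import torch
import torch.nn as nn

class ScMoEBlockSketch(nn.Module):
    """Sketch of the shortcut-connected ordering described above.
    Names and residual placement are assumptions; routed_experts is a
    placeholder Linear standing in for the routed MoE and its comm."""
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.attn_1 = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.dense_ffn_1 = nn.Linear(dim, dim)     # first shared/dense expert
        self.attn_2 = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.dense_ffn_2 = nn.Linear(dim, dim)     # second shared/dense expert
        self.routed_experts = nn.Linear(dim, dim)  # stand-in for routed MoE + dispatch

    def forward(self, x):                          # x: [batch, seq, dim]
        h, _ = self.attn_1(x, x, x)
        # The routed experts start from here; on a real system this is where the
        # all-to-all dispatch would be kicked off, overlapping with the work below.
        routed_out = self.routed_experts(h)
        h = h + self.dense_ffn_1(h)                # fast dense path keeps going
        h2, _ = self.attn_2(h, h, h)               # this attention never sees routed_out
        h2 = h2 + self.dense_ffn_2(h2)
        return h2 + routed_out                     # routed result integrated one step late

block = ScMoEBlockSketch(dim=16)
print(block(torch.randn(2, 5, 16)).shape)          # torch.Size([2, 5, 16])
```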
I think it makes sense to think about this the other way around. The MoE is the main thing, and the dense layers are what you do while you're twiddling your thumbs because the data takes time to move.
But without the dense layers you would have two attention blocks in sequence. So those dense parts seem more fundamental: they are the ones that are so important they are always processed, and they are the only ones that can provide additional info to the very next attention block.
So I would rather think of it as a way of splitting the processing of shared vs. routed experts. You have both, as with V3, but you hide the communication delays by integrating the slower routed experts with a one-layer delay.
The amount of computation stays the same, but you just avoid waiting. As far as I understand it, V3 avoids waiting by using smart pipelining. So it has a system-level solution for the same problem that keeps GPUs busy by processing parts of other queries while moving data between steps.
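To make the "avoid waiting" idea concrete, here's a toy scheduling sketch (nothing to do with V3's actual implementation, just the general overlap pattern): while one micro-batch's tokens are in flight to the experts, you compute on another micro-batch, so the hardware never sits idle.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def dispatch(mb):               # stand-in for an all-to-all expert dispatch
    time.sleep(0.01)            # pretend the network transfer takes a while
    return f"micro-batch {mb} dispatched"

def compute(mb):                # stand-in for attention / dense FFN kernels
    time.sleep(0.01)
    return f"micro-batch {mb} computed"

with ThreadPoolExecutor(max_workers=1) as comm_pool:
    in_flight = comm_pool.submit(dispatch, 0)   # comm for micro-batch 0 starts...
    print(compute(1))                           # ...while micro-batch 1 computes
    print(in_flight.result())                   # comm has finished (or we wait here)
```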
And with that it avoids the possible negative effects of not having data from the routed experts in the very next attention layer. It's more efficient under heavy load: good total throughput, even if processing of a single query is slower than with this solution.
You could just not have the second dense-only attention block in there. This part explains how the whole point of shared experts is to make use of the time your MoE is moving data around, and the change here from the shared-expert design V3 uses is that they add a mini attn step & another FFN one
The starting point (Standard MoE) in the ScMoE paper is already an alternating pattern of dense and MoE blocks. They basically keep that alternating pattern but add a delay for integrating the routed-expert part of the MoE to hide its overhead.
So the starting point seems to be an already-optimized architecture that doesn't put the costlier MoE in every layer. V3 also has a small optimization along those lines, as it doesn't have MoE for the first 3 layers. And similarly, attention blocks may have alternating patterns of simpler implementations.
What counts as a layer here becomes ambiguous; you could describe it either way. But the paper itself calls a layer the big block that includes two dense FFN steps.
Might be easier to think of it that way, because then you have a stack of the same kind of block, instead of alternating kinds of blocks.
Also one thing I haven't seen mentioned is that this model is a remarkably sparse MoE.
V3: 256 routed experts, 8 active
K2: 384 routed experts, 8 active
Longcat-Flash: 512 experts, 8 active on average, up to 12
fwiw openai oss was like 1/32nd active
Yeah, 128/4 for the same ratio as V3 but in a significantly smaller model.
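Putting those ratios side by side (just back-of-the-envelope arithmetic on the counts quoted above):

```python
# Active fraction of routed experts per MoE layer, from the numbers above.
models = {
    "DeepSeek-V3":   (8, 256),   # 1/32 active
    "Kimi K2":       (8, 384),   # 1/48 active
    "Longcat-Flash": (8, 512),   # ~1/64 on average (up to 12 active ≈ 1/43)
    "gpt-oss":       (4, 128),   # 1/32, same ratio as V3
}
for name, (active, total) in models.items():
    print(f"{name:14s} {active}/{total} = 1/{total / active:g}")
```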
yeah ngl i had not understood the reason for the shared experts
i feel like this is the sort of shit you see when the US Government locks down compute bandwidth but not compute itself. We saw something similar with DeepSeek slinging their own PTX instead of CUDA to get around the nerfed comms
Localized constraints often lead to innovation unrelated to the constraint. For example, JIT production took off in Japan due to its high cost of warehousing, but it led to a faster rate of development iteration (less inventory to burn through) & fewer components affected by a given defect.
China: hungry teams + permissive (relative) govt treatment + foreign-imposed highly specific constraints that translate to efficiency gains
US: 4 near-infinite companies with mngmnt/policy that shifts every 3 minutes that lock out competition while academia is being rug-pulled by govt...
EU: 🤡...