agreed, i think they’re doing something funny with definitions. unfortunately i haven’t gotten the tech report open on my phone yet, so it’s not entirely clear whether the dynamic MoE router and the “PID controller” are the same thing
I went to the code first, but I've started digging through the report
PID = trainer-only side-tune: it collects the “real” active-param margins mid-training, spits out a new bias, broadcasts it, and never runs again. The model keeps vanilla sparse gating; the static bias just keeps the average at 27 B, while per-token gate entropy decides whether 18–31 B actually get touched.
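roughly how i'm reading it, as a sketch only: the function names, the threshold-style gate, the expert sizes, and the gain are all mine, not from their code or report.

```python
import numpy as np

# -- model side: vanilla sparse gating plus one static, broadcast bias --------
def sparse_gate(logits, gate_bias, threshold=0.0):
    # an expert fires if (logit + static bias) clears the threshold;
    # per-token logit entropy decides how many experts fire, so per-token
    # active params vary (the 18-31 B range), while the bias only shifts the average
    return (logits + gate_bias) > threshold

# -- trainer side: the "PID", i.e. a one-shot bias re-fit ---------------------
def refit_gate_bias(active_param_samples, old_bias, target_active=27e9, gain=1e-10):
    # measure the real active-param average mid-training, emit one new static
    # bias, broadcast it, then never run again (gain/target are placeholders)
    measured = float(np.mean(active_param_samples))
    error = target_active - measured
    return old_bias + gain * error

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(4, 64))             # 4 tokens, 64 experts (made-up shapes)
    mask = sparse_gate(logits, gate_bias=0.3)
    active_params = mask.sum(axis=1) * 0.5e9      # placeholder: 0.5 B params per expert
    print(refit_gate_bias(active_params, old_bias=0.3))
```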
It’s just a moving-average bias term — they call it “PID” so the stats folk nod approvingly, but it’s glorified layer-norm-for-MoE-gates.
i’m not sure it isn’t a PID controller either. there’s definitely a loop, and they’re controlling a single bias term.
Of course. PID is a deceptively simple concept
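to make the argument concrete, here's a minimal sketch of the two readings side by side: a plain moving-average (EMA) update on the single bias term vs an actual PID update on the same term. the gains, decay, and the 27 B target are placeholders, not from the report.

```python
class EMABias:
    # the "it's just a moving-average bias term" reading: the bias tracks an
    # exponential moving average of the error and nothing else
    def __init__(self, bias=0.0, gain=1e-10, decay=0.99):
        self.bias, self.gain, self.decay = bias, gain, decay
        self.ema_error = 0.0

    def update(self, measured, target=27e9):
        self.ema_error = self.decay * self.ema_error + (1 - self.decay) * (target - measured)
        self.bias += self.gain * self.ema_error
        return self.bias


class PIDBias:
    # the "it really is a PID" reading: proportional + accumulated (integral)
    # + rate-of-change (derivative) of the same error, driving the same term
    def __init__(self, bias=0.0, kp=1e-10, ki=1e-12, kd=1e-11):
        self.bias, self.kp, self.ki, self.kd = bias, kp, ki, kd
        self.integral, self.prev_error = 0.0, 0.0

    def update(self, measured, target=27e9):
        error = target - measured
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        self.bias += self.kp * error + self.ki * self.integral + self.kd * derivative
        return self.bias
```

with ki and kd effectively zero the second collapses into something very close to the first, which is probably why both descriptions can look right depending on which part of the code you read.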