Tim Kellogg @timkellogg.me

agreed, i think they’re doing something funny with definitions. unfortunately i haven’t been able to open the tech report on my phone yet, so it’s not entirely clear whether the dynamic MoE router and the “PID controller” are the same thing

aug 31, 2025, 1:00 pm • 2 0

Replies

Thomas Wood @advanced-eschatonics.com

I went to the code first, but I've started digging through the report

aug 31, 2025, 1:22 pm • 0 0
Thomas Wood @advanced-eschatonics.com

PID = trainer-only side-tune: it collects “real” active-param margins mid-training, spits out a new bias, broadcasts it, and never runs again. The model keeps vanilla sparse gating; the static bias just keeps the average at 27 B active params, while per-token gate entropy decides whether 18–31 B get touched.

aug 31, 2025, 1:22 pm • 1 0
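A minimal sketch of the mechanism described in the post above, assuming a plain-Python trainer-side hook: collect realized active-param measurements over a calibration window, emit one corrected bias, broadcast it, and freeze it. The 27 B target, the per-expert size, the gain, and all names (GateBiasController, compute_bias) are illustrative assumptions, not taken from the tech report.

```python
# Hypothetical sketch of a trainer-only bias controller for a sparse MoE gate.
# Nothing here is from the tech report: the names, the 27e9 target, and the
# update rule are assumptions used to illustrate the described mechanism.

from dataclasses import dataclass


@dataclass
class GateBiasController:
    """Runs on the trainer side during a calibration window, then never again."""
    target_active_params: float = 27e9   # desired average active params per token
    params_per_expert: float = 1.5e9     # assumed size of one routed expert
    gain: float = 1e-3                   # corrective gain on the measured margin

    def compute_bias(self, measured_active_params: list[float], old_bias: float) -> float:
        # Margin between what the gate actually activated and the target.
        avg = sum(measured_active_params) / len(measured_active_params)
        margin = (avg - self.target_active_params) / self.params_per_expert
        # One corrective step: a positive margin lowers the bias so fewer
        # experts clear the gating threshold, and vice versa.
        return old_bias - self.gain * margin


# Trainer-side usage: collect margins mid-training, emit one new bias,
# broadcast it to all ranks, and freeze it. The model keeps its vanilla
# sparse gating; only this static bias shifts the threshold.
controller = GateBiasController()
window = [26.1e9, 28.4e9, 27.9e9]        # fake per-step measurements
new_bias = controller.compute_bias(window, old_bias=0.0)
print(f"broadcast static gate bias = {new_bias:.6f}")
```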
Thomas Wood @advanced-eschatonics.com

It’s just a moving-average bias term — they call it “PID” so the stats folk nod approvingly, but it’s glorified layer-norm-for-MoE-gates.

aug 31, 2025, 1:22 pm • 2 0
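For what a “moving-average bias term” could look like in practice, here is a hedged sketch: an exponential moving average of the observed gate-load error folded into a single bias. The decay and step size are made-up constants, not values from the report.

```python
# Hypothetical EMA-style update, i.e. the "glorified layer-norm-for-MoE-gates"
# reading: smooth the observed load error, then nudge one bias term against it.
# decay and step_size are illustrative assumptions.

def ema_bias_update(bias: float, load_error: float, ema: float,
                    decay: float = 0.99, step_size: float = 1e-3) -> tuple[float, float]:
    """One update step; load_error = measured active params minus the target."""
    ema = decay * ema + (1.0 - decay) * load_error   # low-pass filter the error
    bias = bias - step_size * ema                    # push the gate bias against it
    return bias, ema
```

Nothing in this sketch looks at error derivatives; the smoothing plus the accumulated bias is what makes it resemble an integral term, which is where the naming quibble comes from.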
Tim Kellogg @timkellogg.me

i’m not sure it’s not a PID controller either. there’s definitely a loop, and they’re controlling a single bias term.

aug 31, 2025, 1:24 pm • 1 0
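For comparison, a textbook discrete PID step on that single bias term. With kp and kd set to zero it collapses to an accumulated-error (integral-only) update, which is roughly what the EMA sketch above does with extra smoothing, so from outside the report the two readings are hard to tell apart. The gains are placeholders, not claims about what the authors implemented.

```python
# Textbook discrete PID update on one gate bias. Gains are placeholder
# assumptions; error = measured average active params minus the target.

def pid_bias_update(bias: float, error: float, integral: float, prev_error: float,
                    kp: float = 1e-3, ki: float = 1e-4, kd: float = 0.0,
                    dt: float = 1.0) -> tuple[float, float]:
    """One PID step; the caller carries `error` forward as `prev_error` next time."""
    integral += error * dt                                 # I: accumulate the error
    derivative = (error - prev_error) / dt                 # D: rate of change of the error
    bias -= kp * error + ki * integral + kd * derivative   # P + I + D correction
    return bias, integral
```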
Thomas Wood @advanced-eschatonics.com

Of course. PID is a deceptively simple concept

aug 31, 2025, 1:26 pm • 1 0