we do a lot of things to try and fix this -- layer norms, batch norms (maybe; they work but we don't know why), skip connections -- but once the model starts developing polysemanticity, everything becomes entangled in a way which is very difficult to unstick.