Maybe I read too fast, but I saw that part about choosing embedding dimensions based on the number of attention heads. Pretty sure that’s only relevant for text generation. Trouble is, “embeddings” can be either input or output, and the post seems to use it both ways without clarifying. A bit confusing.
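
For reference, here's a rough sketch of the constraint I think that part is getting at. In standard multi-head attention the embedding width is split evenly across the heads, so it has to be divisible by the head count. The function name `head_dim` and the numbers are just my own illustration, not anything from the post.

```python
def head_dim(embed_dim: int, num_heads: int) -> int:
    """Return the per-head width when embed_dim is split evenly across heads.

    Hypothetical helper for illustration only; assumes the usual
    multi-head attention layout where each head gets an equal slice.
    """
    if embed_dim % num_heads != 0:
        raise ValueError(
            f"embed_dim={embed_dim} is not divisible by num_heads={num_heads}"
        )
    return embed_dim // num_heads


# Example sizes chosen for illustration: 768 wide, 12 heads.
print(head_dim(768, 12))  # 64
```

If that's the rule the post means, it's about the model's internal width, which still doesn't tell me whether "embeddings" there means the input side or the output side.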