And isn’t the summarized conversation always fed back into the model on every turn, regardless of which model is used? No LLM maintains state between requests, as far as I’m aware.
With KV caching, you don't need to recompute the entire conversation history on each turn. The serving stack caches the computed key-value pairs (attention keys and values) for all previous tokens in the conversation, so only the new tokens need a fresh forward pass. If the provider doesn't keep the cache around, then yeah, everything has to be recomputed, but most will store it for some time.
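Roughly, the serving side looks something like this. This is a toy sketch, not a real inference API: `PrefixKVCache`, `model.prefill`, and `model.extend` are made-up stand-ins for whatever the actual engine does internally.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple

@dataclass
class PrefixKVCache:
    # maps a token prefix -> the K/V state the model produced for that prefix
    store: Dict[Tuple[int, ...], Any] = field(default_factory=dict)

def run_turn(model, cache: PrefixKVCache, tokens: list):
    """Process one conversation turn, reusing cached K/V where possible."""
    # find the longest cached prefix of the current token sequence
    best: Tuple[int, ...] = ()
    for prefix in cache.store:
        if len(prefix) > len(best) and tokens[:len(prefix)] == list(prefix):
            best = prefix

    if best:
        # cache hit: only the new suffix goes through a forward pass
        state = model.extend(cache.store[best], tokens[len(best):])
    else:
        # cache miss: the whole history has to be recomputed (prefill)
        state = model.prefill(tokens)

    cache.store[tuple(tokens)] = state
    return state
```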
And so the argument is that by routing queries to different models within a single conversation/session, the KV cache has to be rebuilt more often, if not on every turn?
yeah, exactly
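Continuing the sketch above: the cached K/V state is tied to a specific model (its weights, tokenizer, and layer shapes), so a router that switches models mid-conversation can't reuse the state built by the previous model. The `router.pick` call here is hypothetical, just to show where the cache miss comes from.

```python
# one cache per model; a switch means the new model's cache has no matching prefix
caches = {"model-a": PrefixKVCache(), "model-b": PrefixKVCache()}

def routed_turn(router, models, tokens):
    name = router.pick(tokens)  # hypothetical routing decision
    # if `name` differs from the previous turn's model, run_turn falls through
    # to the cache-miss branch and the full history is recomputed from scratch
    return run_turn(models[name], caches[name], tokens)
```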