Instead of faithfully noting the byte stream frequencies, it would ask "could I shave off some storage by getting it close?". Then the reconstructed output would diverge from the input, and the model would shrink sharply. LLMs are large Markov chains that lossily compress the internet.
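A minimal sketch of the idea: build an exact bigram frequency table (the faithful model), then quantize the counts to coarse buckets and drop whatever rounds to zero (the lossy one). The bucket size and the toy text are my own choices for illustration; the point is just that the approximate table stores fewer entries while its frequencies diverge from the input.

```python
from collections import Counter

def bigram_counts(text):
    # Faithful model: exact bigram frequencies of the byte stream.
    return Counter(zip(text, text[1:]))

def lossy(counts, bucket=4):
    # "Get it close": snap each count down to a coarse bucket and
    # drop bigrams that round to zero. Storage shrinks, but any
    # reconstruction from this table diverges from the input.
    quantized = {k: (v // bucket) * bucket for k, v in counts.items()}
    return {k: v for k, v in quantized.items() if v > 0}

text = "the cat sat on the mat, the cat sat on the hat"
exact = bigram_counts(text)
approx = lossy(exact)
print(len(exact), len(approx))  # the lossy table keeps far fewer entries
```

Scaling the same trade-off up by many orders of magnitude, with learned weights standing in for the quantized table, is the Markov-chain reading of an LLM.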