Eerke Boiten @eerkeboiten.bsky.social

No correlation. One LLM doing better on a benchmark than another may still mean it gets more things wrong outside the benchmark. There's no way to calculate the real-world accuracy of an LLM; it would require exhaustive testing, with no ground truth to check against for nearly all inputs.

sep 3, 2025, 7:30 pm • 9 3

Replies

Daniel @buckmeister.bsky.social

I think you can definitely sample real-world use-case inputs and outputs

sep 3, 2025, 7:57 pm • 0 0

Eerke Boiten @eerkeboiten.bsky.social

No, you can't - there's no way to do it meaningfully. What would make your samples representative? (And that's if you strip away the layer that makes LLMs stochastic; if you leave it on, you need to get into repetition and frequency of correct answers on every input.)

sep 3, 2025, 8:02 pm • 0 0

Daniel @buckmeister.bsky.social

Certainly not trying to say it would be easy, but I do think you can empirically observe the "performance" under some given set of assumptions. And yes, you will have to get into repetition and frequency counts and so forth

sep 3, 2025, 9:17 pm • 0 0
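
A minimal sketch of the "repetition and frequency counts" idea from the post above, assuming a hypothetical model_answer() function standing in for a stochastic LLM call and a small hand-labelled prompt sample; whether such a sample can ever be representative is exactly what the surrounding replies dispute.

```python
import random
from collections import Counter

def model_answer(prompt: str) -> str:
    """Hypothetical stand-in for a stochastic LLM call (not a real API)."""
    return random.choice(["4", "5"]) if "2 + 2" in prompt else "Paris"

# Hand-labelled (prompt, correct answer) pairs -- the contested part:
# nothing here guarantees the sample represents real-world inputs.
labelled_sample = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
]

REPEATS = 20  # repeated queries to handle stochastic (temperature > 0) output

def per_input_accuracy(prompt: str, truth: str) -> float:
    """Frequency of the correct answer over repeated queries of one input."""
    counts = Counter(model_answer(prompt) for _ in range(REPEATS))
    return counts[truth] / REPEATS

for prompt, truth in labelled_sample:
    print(f"{prompt!r}: correct {per_input_accuracy(prompt, truth):.0%} of the time")
```

Averaging these per-input frequencies only ever describes the chosen sample, which is the objection in the next reply.
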
Eerke Boiten @eerkeboiten.bsky.social

Nope. If the input language is text, what is a representative input sample?

sep 3, 2025, 9:21 pm • 0 0

Daniel @buckmeister.bsky.social

Strikes me we must be talking at cross purposes 🤷🏻‍♂️

sep 3, 2025, 11:11 pm • 1 0

Dr Katie Twomey @k2mey.bsky.social

Clearly I checked your bio *after* I replied 😂🤦‍♀️

sep 3, 2025, 7:46 pm • 2 0

Eerke Boiten @eerkeboiten.bsky.social

Heh. It's an interesting analogy though ... my software-engineering-based argument for LLMs being untestable, if you combined it with the dubious assumption that people are like LLMs, would then almost imply that psychology lab experiments are completely useless :)

sep 3, 2025, 8:00 pm • 3 0

Olivia Guest · Ολίβια Γκεστ @olivia.science

They argue for this. Fucking brain rot.

sep 3, 2025, 8:01 pm • 3 0

Olivia Guest · Ολίβια Γκεστ @olivia.science

At least people are taking them to task. I have a paper coming out on this soon, too. But sadly it's not a marginal view that this is okay. arxiv.org/abs/2402.019...

sep 3, 2025, 8:03 pm • 12 4

Eerke Boiten @eerkeboiten.bsky.social

That looks exciting. I have synthetic data research where we need to think about overfitting (current hypothesis: it's reducing overfitting that allows privacy + utility not to be zero-sum), outliers, dimensionality reduction, etc., and some of my immature intuitions on that chime with that abstract.

sep 3, 2025, 8:14 pm • 3 0

Olivia Guest · Ολίβια Γκεστ @olivia.science

Do you have the double descent pattern? en.m.wikipedia.org/wiki/Double_...

sep 3, 2025, 8:21 pm • 1 0
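
For anyone following the link, a self-contained toy demonstration of the double descent pattern (illustrative only, unrelated to either linked paper): minimum-norm least squares on random ReLU features, sweeping the number of features past the interpolation threshold (features ≈ training points), where test error typically peaks before falling again.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: noisy samples of a smooth target function.
n_train, n_test = 40, 500
x_train = rng.uniform(-1, 1, n_train)
x_test = np.linspace(-1, 1, n_test)

def target(x):
    """Smooth ground-truth function for the toy task."""
    return np.sin(2 * np.pi * x)

y_train = target(x_train) + 0.3 * rng.normal(size=n_train)
y_test = target(x_test)

def random_relu_features(x, weights, biases):
    """One random ReLU feature per column: max(0, w*x + b)."""
    return np.maximum(0.0, np.outer(x, weights) + biases)

# Sweep model size through the interpolation threshold (features == n_train).
for n_features in (5, 10, 20, 40, 80, 160, 640):
    weights = rng.normal(size=n_features)
    biases = rng.uniform(-1, 1, n_features)
    phi_train = random_relu_features(x_train, weights, biases)
    phi_test = random_relu_features(x_test, weights, biases)
    # Minimum-norm least squares fit, no explicit regularisation.
    coef, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)
    test_mse = np.mean((phi_test @ coef - y_test) ** 2)
    print(f"{n_features:4d} features: test MSE {test_mse:.3f}")
```

With these settings the printed test error typically spikes around 40-80 features, where the model just barely interpolates the noisy labels, and comes back down by 640 features: the second descent.
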
Eerke Boiten @eerkeboiten.bsky.social

Ooh, that's interesting, thanks. We're mostly looking at noise addition for privacy (arxiv.org/abs/2502.03668); maybe we need to look more at changing the latent space ...

sep 3, 2025, 8:31 pm • 2 1
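
Not a summary of the linked paper, just a generic sketch of what "noise addition for privacy" usually means: the classic Laplace mechanism releases a statistic plus noise scaled to its sensitivity divided by the privacy budget epsilon.

```python
import numpy as np

rng = np.random.default_rng(1)

def laplace_release(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Epsilon-DP noise addition: Laplace noise with scale sensitivity / epsilon."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: privately release a count query (sensitivity 1) at different budgets.
true_count = 1_000.0
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps:>4}: noisy count = {laplace_release(true_count, 1.0, eps):.1f}")
```

Smaller epsilon means more noise, which is where the privacy-versus-utility tension mentioned two posts up comes from.
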
Dr Katie Twomey @k2mey.bsky.social

Yep, the benchmarks bear very little relation to any kind of actual real-world use. Bit like in psychology, where lab experiments only tell us so much and what happens in the messy, noisy world is a totally different thing

sep 3, 2025, 7:45 pm • 3 0

Olivia Guest · Ολίβια Γκεστ @olivia.science

Benchmarks being nonsense should be so uncontroversial, and yet not only do cognitive computational neuroscientists especially love them (see Brain-Score; cf. my pinned post), but we also have a long history of IQ BS (pun intended). I love angry grandpa on this. medium.com/incerto/iq-i...

sep 3, 2025, 7:58 pm • 16 2

PJ Coffey - They/Them @homebrewandhacking@mastodon.ie @homebrewandhacking.bsky.social

Gould dismantled IQ as a measure of Spearman's g as a statistical artefact back in the '90s in The Mismeasure of Man, whose revised edition was a direct response to the "argument through repetition" tract that was Murray's The Bell Curve. What else but US white supremacy is keeping it propped up... oh... right.

sep 3, 2025, 10:23 pm • 4 1