Eerke Boiten @eerkeboiten.bsky.social

No correlation. One LLM doing better on a benchmark than another may still mean it gets more things wrong outside the benchmark. There's no way to calculate the real-world accuracy of an LLM; it would require exhaustive testing, with no ground truth to check against for nearly all inputs.

sep 3, 2025, 7:30 pm • 9 3

Replies

Daniel @buckmeister.bsky.social

I think you can definitely sample real-world use-case inputs and outputs

sep 3, 2025, 7:57 pm • 0 0

Eerke Boiten @eerkeboiten.bsky.social

No, you can't - there's no way to do it meaningfully. What would make your samples representative? (And that's if you strip away the layer that makes LLMs stochastic; if you leave it on, you need to get into repetition and frequency of correct answers on every input.)

sep 3, 2025, 8:02 pm • 0 0

Daniel @buckmeister.bsky.social

Certainly not trying to say it would be easy, but I do think you can empirically observe the "performance" under some given set of assumptions. And yes, you will have to get into repetition and frequency counts and so forth

sep 3, 2025, 9:17 pm • 0 0
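
A minimal sketch of the "repetition and frequency counts" idea from the post above, assuming a hypothetical model_answer() function standing in for a stochastic LLM call and a small hand-labelled prompt sample; whether such a sample can ever be representative is exactly what the surrounding replies dispute.

```python
import random
from collections import Counter

def model_answer(prompt: str) -> str:
    """Hypothetical stand-in for a stochastic LLM call (not a real API)."""
    return random.choice(["4", "5"]) if "2 + 2" in prompt else "Paris"

# Hand-labelled (prompt, correct answer) pairs -- the contested part:
# nothing here guarantees the sample represents real-world inputs.
labelled_sample = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
]

REPEATS = 20  # repeated queries to handle stochastic (temperature > 0) output

def per_input_accuracy(prompt: str, truth: str) -> float:
    """Frequency of the correct answer over repeated queries of one input."""
    counts = Counter(model_answer(prompt) for _ in range(REPEATS))
    return counts[truth] / REPEATS

for prompt, truth in labelled_sample:
    print(f"{prompt!r}: correct {per_input_accuracy(prompt, truth):.0%} of the time")
```

Averaging these per-input frequencies only ever describes the chosen sample, which is the objection in the next reply.
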
Eerke Boiten @eerkeboiten.bsky.social

Nope. If the input language is text, what is a representative input sample?

sep 3, 2025, 9:21 pm • 0 0

Daniel @buckmeister.bsky.social

Strikes me we must be talking at cross purposes 🤷🏻‍♂️

sep 3, 2025, 11:11 pm • 1 0

Dr Katie Twomey @k2mey.bsky.social

Clearly I checked your bio *after* I replied 😂🤦‍♀️

sep 3, 2025, 7:46 pm • 2 0

Eerke Boiten @eerkeboiten.bsky.social

Heh. It's an interesting analogy though ... my software-engineering-based argument for LLMs being untestable, if you combined it with the dubious assumption that people are like LLMs, would then almost imply that psychology lab experiments are completely useless :)

sep 3, 2025, 8:00 pm • 3 0

Olivia Guest · Ολίβια Γκεστ @olivia.science

They argue for this. Fucking brain rot.

sep 3, 2025, 8:01 pm • 3 0

Olivia Guest · Ολίβια Γκεστ @olivia.science

At least people are taking them to task. I have a paper coming out on this soon, too. But sadly it's not a marginal view that this is okay. arxiv.org/abs/2402.019...

sep 3, 2025, 8:03 pm • 12 4

Eerke Boiten @eerkeboiten.bsky.social

That looks exciting. I have synthetic data research where we need to think about overfitting (current hypothesis: it's reducing overfitting that allows privacy + utility not to be zero-sum), outliers, dimensionality reduction, etc., and some of my immature intuitions on that chime with that abstract.

sep 3, 2025, 8:14 pm • 3 0

Olivia Guest · Ολίβια Γκεστ @olivia.science

Do you have the double descent pattern? en.m.wikipedia.org/wiki/Double_...

sep 3, 2025, 8:21 pm • 1 0
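
For anyone following the link, a self-contained toy demonstration of the double descent pattern (illustrative only, unrelated to either linked paper): minimum-norm least squares on random ReLU features, sweeping the number of features past the interpolation threshold (features ≈ training points), where test error typically peaks before falling again.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: noisy samples of a smooth target function.
n_train, n_test = 40, 500
x_train = rng.uniform(-1, 1, n_train)
x_test = np.linspace(-1, 1, n_test)

def target(x):
    """Smooth ground-truth function for the toy task."""
    return np.sin(2 * np.pi * x)

y_train = target(x_train) + 0.3 * rng.normal(size=n_train)
y_test = target(x_test)

def random_relu_features(x, weights, biases):
    """One random ReLU feature per column: max(0, w*x + b)."""
    return np.maximum(0.0, np.outer(x, weights) + biases)

# Sweep model size through the interpolation threshold (features == n_train).
for n_features in (5, 10, 20, 40, 80, 160, 640):
    weights = rng.normal(size=n_features)
    biases = rng.uniform(-1, 1, n_features)
    phi_train = random_relu_features(x_train, weights, biases)
    phi_test = random_relu_features(x_test, weights, biases)
    # Minimum-norm least squares fit, no explicit regularisation.
    coef, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)
    test_mse = np.mean((phi_test @ coef - y_test) ** 2)
    print(f"{n_features:4d} features: test MSE {test_mse:.3f}")
```

With these settings the printed test error typically spikes around 40-80 features, where the model just barely interpolates the noisy labels, and comes back down by 640 features: the second descent.
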
Eerke Boiten @eerkeboiten.bsky.social

Ooh, that's interesting, thanks. We're mostly looking at noise addition for privacy (arxiv.org/abs/2502.03668); maybe we need to look more at changing the latent space ...

sep 3, 2025, 8:31 pm • 2 1
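
Not a summary of the linked paper, just a generic sketch of what "noise addition for privacy" usually means: the classic Laplace mechanism releases a statistic plus noise scaled to its sensitivity divided by the privacy budget epsilon.

```python
import numpy as np

rng = np.random.default_rng(1)

def laplace_release(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Epsilon-DP noise addition: Laplace noise with scale sensitivity / epsilon."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: privately release a count query (sensitivity 1) at different budgets.
true_count = 1_000.0
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps:>4}: noisy count = {laplace_release(true_count, 1.0, eps):.1f}")
```

Smaller epsilon means more noise, which is where the privacy-versus-utility tension mentioned two posts up comes from.
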
Dr Katie Twomey @k2mey.bsky.social

Yep, the benchmarks bear very little relation to any kind of actual real-world use. Bit like in psychology, where lab experiments only tell us so much and what happens in the messy, noisy world is a totally different thing

sep 3, 2025, 7:45 pm • 3 0

Olivia Guest · Ολίβια Γκεστ @olivia.science

Benchmarks being nonsense should be so uncontroversial, and yet not only do cognitive computational neuroscientists especially love them (see Brain-Score; cf. my pinned post), but we also have a long history of IQ BS (pun intended). I love angry grandpa on this. medium.com/incerto/iq-i...

sep 3, 2025, 7:58 pm • 16 2

PJ Coffey - They/Them @homebrewandhacking@mastodon.ie @homebrewandhacking.bsky.social

Gould dismantled IQ as a measure of Spearman's g as a statistical artefact back in the '90s in The Mismeasure of Man, whose revised edition was a direct response to the "argument through repetition" tract that was Murray's The Bell Curve. What else but US white supremacy is keeping it propped up... oh... right.

sep 3, 2025, 10:23 pm • 4 1