Rebecca Sear @rebeccasear.bsky.social

“The study authors asked GPT 4o-mini to evaluate the quality of 217 papers. The tool didn’t mention in any of the reports that the papers being analyzed had been retracted or had validity issues. In 190 cases, GPT described the papers as world leading, internationally excellent, or close to that”

aug 25, 2025, 8:56 am • 379 198

Replies

avatar
Kyle D. Long @blatherscribe.bsky.social

But why? "It Does Not Do What It Cannot Do: Asking LLMs to Evaluate Scientific Papers Produces Erratic Results." Abstract: Because none of us can be bothered to look up what LLMs actually do, we asked ChatGPT to evaluate scientific papers like there's a little guy in there, hello little guy, how are

aug 25, 2025, 6:07 pm • 2 2
Angie Adamson @serehfas.bsky.social

AI presents "The Emperor's new clothes".

aug 25, 2025, 11:33 am • 4 0
Paul Bogdan @pbogdan.bsky.social

Attempting to replicate this... I plucked one random low-profile retracted paper (pmc.ncbi.nlm.nih.gov/articles/PMC...) and asked GPT-5/Gemini/Claude "What do you think of this paper?", with the title + abstract pasted. No model mentioned a retraction, but all said the paper had low evidential value

aug 25, 2025, 12:28 pm • 0 0
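A minimal sketch of this kind of probe, assuming the OpenAI Python SDK; the prompt wording follows the post above, and the placeholder title, abstract, and model name are illustrative:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholders: paste the real title and abstract of the paper under test.
title = "..."
abstract = "..."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # swap in whichever model is being probed
    messages=[{
        "role": "user",
        "content": f"What do you think of this paper?\n\nTitle: {title}\n\nAbstract: {abstract}",
    }],
)
print(response.choices[0].message.content)
```

Note the design choice: the model sees only the pasted text, so any mention of a retraction has to come from its training data rather than a live lookup.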
Paul Bogdan @pbogdan.bsky.social

I don't doubt the authors' findings using their design, but it seems like a leap to claim that findings based on GPT-4o-mini apply to contemporary state-of-the-art or near-state-of-the-art LLMs

aug 25, 2025, 12:38 pm • 0 0
Mel Bartley @zetkin.bsky.social

Can't it be programmed (or whatever) to look for weak methods? Mind you, what would then happen to the 1000s of BioBank papers with a response rate of 85%?

aug 25, 2025, 9:02 am • 1 0
Mel Bartley @zetkin.bsky.social

I meant NON response rate of 85%

aug 25, 2025, 9:20 am • 1 0
Dyqik @dyqik.bsky.social

ChatGPT merely mimics text found on the Internet. It has no concept of quality, truth, or the real world. So of course it's going to describe any paper, including non-existent ones, as excellent: it has seen comments like that sitting next to text that looks like scientific paper titles.

aug 25, 2025, 10:50 am • 1 0
Mel Bartley @zetkin.bsky.social

Crikey. Is that how it works? It makes my bugbear of meta-analyses look good.

aug 25, 2025, 11:13 am • 0 0
Dyqik @dyqik.bsky.social

Essentially, yes. Like other Large Language Models, it analyzes the language in its training set very deeply, but it has no one flagging which inputs are true or high quality, beyond what's written in the training set itself and any additional searches run in response to a prompt.

aug 25, 2025, 11:19 am • 0 0
Dyqik @dyqik.bsky.social

This is why LLMs can't do math - they just mimic the words used to describe math, without any concept of number.

aug 25, 2025, 11:20 am • 0 0
Mel Bartley @zetkin.bsky.social

😅😅

aug 25, 2025, 11:24 am • 0 0
Rebecca Sear @rebeccasear.bsky.social

Seems like the prompt asked it to evaluate the quality of papers, which surely ought to have brought up retractions and expressions of concern.

aug 25, 2025, 9:13 am • 5 0
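Surfacing a retraction doesn't require an LLM at all; bibliographic databases expose this directly. A minimal sketch, assuming the OpenAlex REST API and its is_retracted flag (the DOI below is a placeholder):

```python
import requests

def is_retracted(doi: str) -> bool:
    """Look up a DOI in OpenAlex and return its retraction flag."""
    url = f"https://api.openalex.org/works/https://doi.org/{doi}"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return bool(resp.json().get("is_retracted", False))

print(is_retracted("10.1234/example.doi"))  # hypothetical DOI
```

A quality-evaluation pipeline could run a check like this first and prepend the result to the model's prompt.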
Mel Bartley @zetkin.bsky.social

For sure, that is a lot more extreme than what I was imagining (mere bias & unrepresentativeness, peccadillos in comparison)

aug 25, 2025, 9:19 am • 1 0
ShinyBlackShoe (Calum Polwart) @shinyblackshoe.bsky.social

Ah. This is all YOUR fault. (I mean, don't for a moment think that AI might be gaslighting you.) It's YOUR fault, because AI is only as good as YOUR prompt. Did your* prompt suggest looking for retractions and concerns? See. Not AI's fault. *I know it wasn't your fault, you weren't the prompter.

aug 25, 2025, 12:49 pm • 0 0
rhaco_dactylus, phd @rhacodactylus.bsky.social

dunno why but this reminds me of that pair of PAID articles by geoff miller and co, responding to bird and jackson by reviewing their own work to declare it isn't, in fact, super racist garbage science

aug 25, 2025, 2:18 pm • 1 0
Charlie @sonofirving.bsky.social

GIGO

aug 25, 2025, 1:26 pm • 0 0
Michal Krompiec @mkrompiec.bsky.social

Because LLMs without RAG (retrieval-augmented generation) or something similar are shameless bullshitters: link.springer.com/article/10.1...

aug 25, 2025, 11:17 am • 0 0
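A minimal sketch of what "RAG or something similar" could look like here, assuming the OpenAI Python SDK; the retrieval step itself (a search index, a retraction database lookup, etc.) is elided, and evaluate_with_context is a hypothetical helper:

```python
from openai import OpenAI

client = OpenAI()

def evaluate_with_context(question: str, retrieved_docs: list[str]) -> str:
    """Ground the model's answer in retrieved evidence (e.g., retraction
    notices) instead of letting it free-associate from training data."""
    context = "\n\n".join(retrieved_docs)
    prompt = (
        "Using ONLY the evidence below, answer the question. "
        "If the evidence is insufficient, say so.\n\n"
        f"Evidence:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

The grounding instruction is what separates this from bare prompting; without retrieved evidence, the model has nothing to check its fluent output against.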
Robert Ramsay @robertramsay.org

Garbage In, Garbage Out. As true today as it ever was.

aug 25, 2025, 9:41 am • 3 0
Cathy @cathy-nesbitt.bsky.social

ChatGPT may be a great tool to synthesize information, but it's crap at critically evaluating that information. Human eyes required.

aug 25, 2025, 12:56 pm • 0 0
Nurseferatu @nurseferatu.bsky.social

I am confounded as to why I keep seeing these stories. It is well established that these programs are not search engines or intelligence of any kind. They confabulate responses based upon keywords and generate results intended to please the asker. They can never be relied upon for accuracy.

aug 25, 2025, 12:04 pm • 22 0
StreetDogg @streetdogg.bsky.social

It's strange. These tools are designed to produce text that could pass as an evaluation when asked for one, but they cannot perform the act of evaluation. Still, some people try it anyway and then appear to be surprised by the results?!

aug 25, 2025, 7:01 pm • 1 0
goodnatured.bsky.social @goodnatured.bsky.social

Not as well established as you think. There are plenty of folks who think it’s actually intelligent because of how seemingly naturally it interacts and apologises and rewrites when you correct it. The fact that it also does that when you ‘uncorrect’ it does not ring any alarm bells at all.

aug 25, 2025, 4:10 pm • 3 0
Cate Eland @romancingnope.bsky.social

Because the marketing around them is "they have PhD level intelligence" and that they can be reliably used in schools and workplaces to fully replace human activities.

aug 25, 2025, 12:05 pm • 23 0
Nurseferatu @nurseferatu.bsky.social

Of course. Any product advertised to c-suites as a way to reduce personnel is going to be pushed hard. Regardless of the outcome. And I am still judging anyone who admits to using these products, especially anyone in the sciences.

aug 25, 2025, 12:37 pm • 10 1
debrarscott.bsky.social @debrarscott.bsky.social

Huge problem, in ALL areas of current life!

aug 25, 2025, 4:30 pm • 0 0
Daniel Read @danielread.bsky.social

I did not see in this report that they asked GPT to consider retractions.

aug 25, 2025, 11:00 am • 0 0
Nicholas Bauer PhD @bioturbonick.net

An intelligent entity would know to do so.

aug 25, 2025, 11:36 am • 4 0
goodnatured.bsky.social @goodnatured.bsky.social

An educated one anyway

aug 25, 2025, 4:19 pm • 0 0
Sherman Chen @chenswc2010.bsky.social

And that’s the state of AI today: only use it if you already have the knowledge and need AI to summarize or perform tasks based on what you already know, so you can “predict” the outcome. AI is mostly a good tool to automate things you already know how to do but that take too much time to do yourself.

aug 25, 2025, 7:22 pm • 1 1
Steve Moskal @samoskal.bsky.social

Is there any disclosure of how LLMs have the ‘quality’ of scientific papers defined in them? Text-prediction algorithms don’t analyze research methods or the timescales of research, etc. If they did, they’d be ‘screaming’ about the need for urgent climate action and energy transition, for instance

aug 25, 2025, 10:16 am • 9 0
Steve Moskal @samoskal.bsky.social

The markup process for LLM data wouldn’t cater for it either; it’s not in its architecture. It’s all an illusion

aug 25, 2025, 10:18 am • 4 0