Dr Kat Day (she/her) @chronicleflask.katday.com

Trigger warning for maths and science teachers: the bar charts shown here are extremely upsetting.

Screenshot from article: Most obviously, the claimed SWE-bench performance of GPT-5 versus older models shown on launch day was badly botched. The chart showed accuracy figures of 74.9% for ChatGPT 5, 69.1% for OpenAI o3 and 30.8% for GPT-4o. [pink and white bar charts: heights bear no relation to the y axis, or each other] Problem is, the bar graph heights were exactly the same for the latter two, giving the at-a-glance impression of total dominance for GPT-5 when in fact it is only marginally superior to OpenAI o3.
aug 26, 2025, 7:26 am • 8 1

Replies

Lorraine Wilson @rainewilson.bsky.social

My eyes! 😱

aug 26, 2025, 8:31 am • 2 0

Dr Kat Day (she/her) @chronicleflask.katday.com

Link to article: www.pcgamer.com/software/ai/...

aug 26, 2025, 7:26 am • 0 0

Gnarlygeek (he/him) aka Tom @gnarlygeek.bsky.social

Maybe they used the GPT-5 "without thinking" part to make the charts? I've always found just a bit of thinking usually helps. 🤣

aug 26, 2025, 12:00 pm • 0 0

Mark Llety'r Deryn @markrees.bsky.social

Is there a screaming emoji out there?!?!

aug 26, 2025, 7:30 am • 2 0