Kevin Roose @kevinroose.com

I wrote about a new AI evaluation called "Humanity's Last Exam," a collection of 3,000 questions submitted by leading academics to try to stump leading AI models, which mostly find today's college-level tests too easy. www.nytimes.com/2025/01/23/t...

jan 23, 2025, 4:41 pm • 208 46

Replies

definitelynwyt.bsky.social @definitelynwyt.bsky.social

I was about to cry, "What have you done, man!" Thankfully, I see you're an intellectual.

image
jan 23, 2025, 5:57 pm • 6 0
Gammelsmxlf @gammelsmxlf.bsky.social

I keep hearing about new tests that some model can pass, but the models I have access to seem dumb as rocks most of the time. Sometimes I wonder if the way those tests are put together is the reason for the discrepancy or that the models are trained on those exact tests.

jan 23, 2025, 9:31 pm • 1 0
Drag King Flare @kingflare.bsky.social

I’ve corrected AI a few times. It just regurgitates info quickly and is programmed to answer, so it will make stuff up. Worried about its shutdown being connected to changing it to provide more fascist info

jan 24, 2025, 1:50 pm • 0 0
Zeb @zebathome.bsky.social

AI says water won't freeze at 20 degrees F, because the temperature that water freezes is 32 degrees F.

jan 24, 2025, 12:42 pm • 0 0
Ascended Bagels🏳️‍⚧️ @he5ti4.bsky.social

I’ve also found college tests too easy from time to time. Difference is, I don’t get to use the internet when I take tests

jan 23, 2025, 6:46 pm • 4 0
okok23ok.bsky.social @okok23ok.bsky.social

You get to use the internet before the test, just like the AI models

jan 23, 2025, 7:35 pm • 0 0
Ascended Bagels🏳️‍⚧️ @he5ti4.bsky.social

Sorry, I didn’t mean to make fun of your people. I’m sure it was hurtful

jan 23, 2025, 7:51 pm • 8 0
Fuck Nazis! @keblaster.bsky.social

Have you seen how people vote in America? They don't take college level exams...ever. AI wins!!!

jan 23, 2025, 4:44 pm • 0 0
Peter Ellis @pjie2.bsky.social

1) How many "a"s are there in "baanana"? 2) Try again. How many "a"s are there in "baanana"? I guarantee most will get the first question wrong. If they get it right, then the challenge in the second question will generally cause them to change their answer.
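For what it's worth, the literal count the thread is arguing over is easy to check; a minimal Python sketch (the string comes from the post above, everything else is illustrative):

```python
# Count occurrences of "a" in the deliberately misspelled string "baanana".
# An LLM pattern-matching against the common "letters in banana" question
# may answer 3; a literal character count gives 4.
word = "baanana"
count = word.count("a")
print(count)  # prints 4
```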

jan 23, 2025, 6:49 pm • 6 0
Peter Ellis @pjie2.bsky.social

image
jan 23, 2025, 6:51 pm • 3 0
fsgq.bsky.social @fsgq.bsky.social

Putting forward wrongly structured or ambiguous questions will not help us…

jan 24, 2025, 6:43 am • 1 0
meph66.bsky.social @meph66.bsky.social

This question is actually genius because it shows the difference from how a human would answer.

jan 24, 2025, 11:55 am • 0 0
fsgq.bsky.social @fsgq.bsky.social

How? In two different ways depending on how they understand the question?

jan 24, 2025, 12:04 pm • 0 0
meph66.bsky.social @meph66.bsky.social

Yeah, pretty much. It's more than that, but the AI that is actually most convincing is the one that makes human mistakes. On the second q, a human would either say "that one is spelled wrong, so 4" or "that one should be spelled banana, so 3", whereas the AI is not actually doing that logic to answer the first q.

jan 24, 2025, 12:13 pm • 0 0
Peter Ellis @pjie2.bsky.social

.... what are the two different ways in which you understand this very simple question about counting letters? Have you tried running it past a nearby six year old for help?

jan 24, 2025, 12:14 pm • 1 0
Peter Ellis @almostconverge.kozterulethasznalatienge.day

it's basically the asch conformity experiment

jan 24, 2025, 12:19 pm • 1 0
fsgq.bsky.social @fsgq.bsky.social

Agreed! I was misunderstanding the challenge you were presenting (got confused with another post). In fact the models answer based on their "cleaning" of baanana, adjusting it to banana. And get it wrong. Apologies!

jan 24, 2025, 1:06 pm • 0 0
Peter Ellis @pjie2.bsky.social

There is nothing wrongly structured or ambiguous about those questions.

jan 24, 2025, 7:42 am • 3 0
JPH @mrjph.bsky.social

But its first answer was not wrong; it accurately answered the question posed. The fact that it was the word banana with an extra ‘a’ is irrelevant… the question asked was the count of the letter ‘a’ in the string of letters within quotations, which it answered correctly

jan 24, 2025, 12:40 pm • 0 0
JPH @mrjph.bsky.social

2nd answer is the one that is somewhat puzzling, because it then assumed that the asker had misspelled the word banana, and answered for what it took to be human error, i.e. banana… but the fact that it did not note that assumption in its answer, correcting the asker, is puzzling

jan 24, 2025, 12:45 pm • 0 0
JPH @mrjph.bsky.social

I stand corrected, & just noted its answer, & it saying the “word” “baanana”, which is incorrect, as you did not ask how many times the letter ‘a’ appears in the “word” “baanana”, just how many times in “baanana”… so, it was wrong. It also incorrectly answered #2 for the same reason

jan 24, 2025, 1:24 pm • 0 0
Peter Ellis @pjie2.bsky.social

My point is rather that they don't "know" anything, they produce answers that are statistically likely. That means firstly that it's very easy to catch them out with "gotcha" questions like the first one, where you ask something that LOOKS like a common question, but is subtly different. (1/n)

jan 24, 2025, 2:29 pm • 0 0
Peter Ellis @pjie2.bsky.social

But also, the fact that they don't have actual knowledge means that it's generally easy to push them around - if you tell them they have something wrong they will generally change their answer to produce some kind of statistically probable apology. (2/n)

jan 24, 2025, 2:31 pm • 0 0
Peter Ellis @pjie2.bsky.social

They don't have any self-awareness of their limitations - they don't know what they know and what they don't know, because "knowledge" isn't a concept that applies to LLM-generated sentences. (3/n)

jan 24, 2025, 2:33 pm • 0 0
Peter Ellis @pjie2.bsky.social

It's Plato's shadows where it operates entirely in the world of indirect language, without any way of understanding that words are pointers that serve to reference something else that actually exists. (4/fin)

jan 24, 2025, 2:33 pm • 0 0
Matt Gallivan @theunderstanders.com

Maybe this already exists but it would be so interesting to see measures of creativity or expression emerge over time, too. So much of the huge value these tools do and will deliver is grounded in objectively right/wrong outputs and reasoning, but we’ll be using them for creative tasks, too.

jan 23, 2025, 4:53 pm • 1 0
Sphere Earther @globehead.bsky.social

Keep in mind that it's not actually AI, any output "creativity" is just randomness. The model is not thinking, it's making predictions on what a good output should look like.

jan 23, 2025, 6:41 pm • 2 0
Anjana Susarla @asusarla.bsky.social

Not sure what the latest benchmark is, but models such as ChatGPT have not been able to answer the Joint Entrance Exam, India's top engineering exam www.livemint.com/technology/t...

jan 23, 2025, 4:47 pm • 4 0
bumpkinskin.bsky.social @bumpkinskin.bsky.social

Why are these AI models that everyone keeps getting scared about not anywhere in my reality!? Gemini is absolute crap. ChatGPT is kind of fine. Perplexity doesn't give you straight answers or facts. Which AI model is so great that it can replace humans!?

jan 24, 2025, 7:03 am • 1 0
skateintraffic @skateintraffic.bsky.social

Still too dramatic of a name. "The best test we could come up with"

feb 20, 2025, 2:46 pm • 1 0
✨🖤Shea Valentine🏳️‍⚧️✨ @sheavalentine.bsky.social

ChatGPT can't solve basic homework on the Lambda Calculus or Combinatory Logic. I think humanity is going to be okay for a minute.
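For context, the kind of basic Combinatory Logic exercise mentioned here can be sketched in a few lines of Python (a hypothetical illustration, not taken from any actual homework set):

```python
# The K and S combinators as curried Python functions.
K = lambda x: lambda y: x               # K x y  -> x
S = lambda f: lambda g: lambda x: f(x)(g(x))  # S f g x -> f x (g x)

# Classic textbook derivation: S K K reduces to the identity I,
# since S K K x -> K x (K x) -> x.
I = S(K)(K)
print(I(42))  # prints 42
```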

jan 24, 2025, 3:18 am • 4 0
Dalton @gdaltonbentley.bsky.social

FrontierMath benchmark (see Glazer et al 2024) requires deep reasoning and creativity, neither of which are ever going to be attributes of so-called AI, aka stochastic parrots that rely on regurgitating pieces of training material basically (despite breathless claims of fools and hucksters).

jan 23, 2025, 7:42 pm • 4 0
fsgq.bsky.social @fsgq.bsky.social

Absolutely misguided efforts. Just making these tests harder will be useless; it will take just another iteration from the big AI companies. Either we agree on the fundamental differences between an LLM and intelligence or we should say we lost the race.

jan 24, 2025, 6:42 am • 4 0
bullxead @bullxead.bsky.social

AI evaluated humans very low. Limited. Failed in exams. And life.

jan 27, 2025, 2:38 pm • 0 0
Cecilia Gonzalez-Andrieu, PhD. @drcecilia.bsky.social

The role of ethicists in all that has to do with AI should be a prerequisite to any testing. Are authoritative ethicists involved in this?

jan 24, 2025, 2:02 am • 2 0