ArkaeinMatt @arkaeinmatt.bsky.social

Second, you say Grok is the leading AI model. Based on what? Based on the benchmarks I've seen, it is competitive but certainly not an outlying leader. As a test I fed your question and replies to ChatGPT o3-mini-high. It started at 0.2% and moved up to 0.3%-0.5%.

mar 17, 2025, 9:47 pm • 30 1

Replies

Geezer Soze @plinytheelder-t.bsky.social

Grok is a me-too LLM at best, with no meaningful research or innovation by that group. They're just copying the work of all the others. If an LLM could be a mid-life crisis, Grok would be Elmo's.

mar 18, 2025, 3:05 am • 0 0
Seth Abramson @sethabramson.bsky.social

This does not appear to be accurate. You did not compare apples to apples. This PROOF reader did:

image
mar 18, 2025, 1:03 am • 0 0
Seth Abramson @sethabramson.bsky.social

And another reader has reached the same results with yet another AI foray.

image
mar 18, 2025, 1:05 am • 0 0
ArkaeinMatt @arkaeinmatt.bsky.social

So I checked my test and did make some honest mistakes; it looks like I stopped after the first three prompts. However, adding the remaining prompts only raised the projection to 2%-3%. Here is my full chat if you'd like to examine it yourself: chatgpt.com/share/67d8c7...

mar 18, 2025, 1:21 am • 0 0
ArkaeinMatt @arkaeinmatt.bsky.social

I can't explain why others got different results, and I didn't test every model. I did tweak the initial prompt slightly to get it to calculate a percentage because my first attempt using ChatGPT o1 was reluctant to commit to any percentage.

mar 18, 2025, 1:22 am • 0 0
ArkaeinMatt @arkaeinmatt.bsky.social

I'm now trying with ChatGPT 4.5 research preview, and after the first prompt it's refusing to provide an initial estimate and asking more questions about how to proceed. I'm telling it to provide a probability estimate; I'll see how it goes.

mar 18, 2025, 1:30 am • 1 0
ArkaeinMatt @arkaeinmatt.bsky.social

So, ChatGPT 4.5 research preview gives radically different results from my earlier run, and they line up much more closely with what Grok produced in the PROOF article. I still don't like the method. It assumes the AI starts from a reasonable baseline, and that more prompting improves the results.

mar 18, 2025, 1:52 am • 1 0
ArkaeinMatt @arkaeinmatt.bsky.social

I suspect Grok is too high, and that ChatGPT is being overly conservative, not reacting strongly enough to recent events, and estimating too low. Beyond that, I think trying to calculate a reasonable probability is an extremely difficult problem for humans or AI. I don't really trust either estimate.

mar 17, 2025, 9:49 pm • 27 0
Greg Morris @27ragbag.bsky.social

I’ve done similar tests (different subject) between Grok and ChatGPT and got the same gap. To me it seems Grok is better at “thinking” what the reader wants to hear, which is far different from true probabilities in the AI-generated response.

mar 17, 2025, 11:20 pm • 8 0
justanother108.bsky.social @justanother108.bsky.social

It’s too easy to get sucked into a more sophisticated Eliza.

mar 18, 2025, 12:28 am • 1 0
AtomicPwrdRobot @atomicpwrdrobot.bsky.social

It's really the equivalent of asking a Ouija board a bunch of personal questions and getting freaked out when it starts answering in ways that confirm your own thoughts. You're the one pushing the pointer around, dude. Ghosts aren't real, and AIs are just good at stringing words together.

mar 18, 2025, 5:01 am • 10 1