Marica @mari-chan.bsky.social

Why are you more dishonest about this than Anthropic themselves lmao? Like, in the first paragraph they admit the models have problems with long-term planning regardless of any memory issue. And it makes perfect sense, as long-term planning evolved under selective pressure and is governed by

jul 3, 2025, 9:16 am • 1 0

Replies

Marica @mari-chan.bsky.social

complex systems and intermingles with emotion processing. The high variability and long-term instability of LLMs is a striking and fascinating aspect of how different they are from animals, to be honest. Also, hallucinating that you are a different person is in the realm of mental illnesses, which have

jul 3, 2025, 9:16 am • 1 0
Marica @mari-chan.bsky.social

a specific etiology. They are not flukes. The same human, cloned, would only start developing schizophrenia if they were predisposed. Meanwhile, all the model instances are identical, and variance in output is all accounted for by temperature (sketched below); therefore all LLMs are predisposed to schizophrenia

jul 3, 2025, 9:16 am • 1 0
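On the temperature claim above: a minimal sketch of temperature sampling over a softmax, assuming a toy three-token vocabulary; the names (sample_token, logits) and numbers are illustrative, not Anthropic's actual decoding stack. It shows how a single scalar moves output from deterministic (temperature 0) to increasingly variable.

    import numpy as np

    def sample_token(logits, temperature, rng):
        """Sample one token index from raw logits at a given temperature."""
        if temperature == 0:
            return int(np.argmax(logits))        # greedy: identical every time
        scaled = np.asarray(logits, dtype=float) / temperature
        probs = np.exp(scaled - scaled.max())    # numerically stable softmax
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))

    rng = np.random.default_rng(0)
    logits = [2.0, 1.0, 0.5]                     # made-up scores for three tokens
    print([sample_token(logits, 0.0, rng) for _ in range(5)])  # always token 0
    print([sample_token(logits, 1.5, rng) for _ in range(5)])  # spread across tokens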
Marica @mari-chan.bsky.social

or perhaps they lack a sense of self, long-term objectives and actualization, and internal, online models of the external environment. Also, in general I find it quite silly how humans have been benchmarked. For one, n=1, so there are no statistics to be done to see how different the

jul 3, 2025, 9:16 am • 1 0
Marica @mari-chan.bsky.social

performance between humans and LLMs is. This also matters for quantifying variability among humans, as not all humans have been exposed to the entire corpus of human knowledge, only to a specific subset. Finally, humans are very different from LLMs in that they weigh rewards and have energy constraints,

jul 3, 2025, 9:16 am • 0 0
Marica @mari-chan.bsky.social

while LLMs can produce output forever with no motivation. So it would be nice to measure multiple scenarios (no reward; fixed reward; reward tied to reaching the end; reward proportional to income) to see the full spectrum of human capabilities. At that point you could also see what

jul 3, 2025, 9:16 am • 0 0
Marica @mari-chan.bsky.social

differences this context causes in the LLM output. Alas, Anthropic is staffed not by good psychologists but by computer scientists who fail to touch grass, or perhaps it's a conflict of interest, so we do not get to see complete research.

jul 3, 2025, 9:16 am • 0 0
Marica @mari-chan.bsky.social

then I would

jul 3, 2025, 9:16 am • 0 0
Pekka Lund @pekka.bsky.social

Like I said, their latest model already beat the human baseline in the virtual version of that same task. Agentic processing is now a new, quickly advancing frontier. The core intelligence is already there, so we can expect many issues to be resolved quickly.

jul 3, 2025, 12:19 pm • 1 0
Marica @mari-chan.bsky.social

Did you read my reply, or are you trolling? For one, the model didn't beat the human's uptime, and it was a single human. There are no statistics to be done "against humans," as you cannot run a statistical test with n=1 (sketched below), especially against humans who have been given a reason to engage with a text prompt.

jul 3, 2025, 3:59 pm • 0 0
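On the n=1 point: a minimal sketch, with made-up scores, of why a single human baseline blocks a standard two-sample test. The unbiased sample variance divides by n-1, which is zero for one observation, so no t statistic can be formed; welch_t and the numbers are illustrative, not the benchmark's actual data.

    import numpy as np

    def welch_t(a, b):
        """Welch's t statistic; needs a variance estimate from each sample."""
        va, vb = np.var(a, ddof=1), np.var(b, ddof=1)   # ddof=1 divides by n - 1
        return (np.mean(a) - np.mean(b)) / np.sqrt(va / len(a) + vb / len(b))

    llm_runs = np.array([0.62, 0.71, 0.58, 0.66, 0.69])  # hypothetical benchmark runs
    human = np.array([0.74])                             # a single human baseline, n = 1

    print(np.var(human, ddof=1))     # nan: divides by n - 1 = 0, variance undefined
    print(welch_t(llm_runs, human))  # nan propagates: no test statistic exists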
Pekka Lund @pekka.bsky.social

What are you talking about? The benchmark measures long-term coherence, and the leaderboard is ranked by (minimum) net worth. They don't measure "uptime" (which humans would no doubt lose against 24/7 bots). The 99.5% "days until sales stop" likely just means the simulation ended before a sale on that day.

jul 4, 2025, 9:43 pm • 0 0