
Tim Kellogg

@timkellogg.me

AI Architect | North Carolina | AI/ML, IoT, science WARNING: I talk about kids sometimes

created August 13, 2024

6,550 followers 681 following 9,111 posts

view profile on Bluesky

Posts

Profile picture Tim Kellogg (@timkellogg.me) reply parent

SAME

1/9/2025, 11:59:05 PM | 1 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me)

we’re reaching a point where only HBO can help us understand what’s going on

1/9/2025, 9:06:32 PM | 10 0 | View on Bluesky | view

Profile picture Paul Sample (@paulsample.bsky.social) reposted reply parent

I think these guys are based in the US and offer GLM-4.5 synthetic.new/landing/home

1/9/2025, 7:55:55 PM | 2 2 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me)

i assume stuff like this is a joke, until i see the X logo

Adam Wathan (@adamwathan) on X.com: "Like it or not, typing speed is a reasonable proxy for IQ."
1/9/2025, 7:49:20 PM | 19 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

no doubt

1/9/2025, 7:28:53 PM | 0 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me)

it’s a crying shame that no US operators are running GLM-4.5 (that i know of), unseating Opus at **two orders of magnitude** lower cost

The image shows the Berkeley Function-Calling Leaderboard, ranking models by overall accuracy in function-calling tasks and comparing cost, latency, and accuracy across single-turn (non-live AST, live AST) and multi-turn scenarios.

| Rank | Model | Overall Acc | Cost ($) | Latency (s, mean) | Non-live AST | Live AST | Multi-turn |
|---|---|---|---|---|---|---|---|
| 1 | GLM-4.5 (FC) | 70.85 | 2.90 | 2.73 | 86.6 | 81.72 | 65.62 |
| 2 | Claude-Opus-4-1-20250805 (FC) | 70.36 | 207.12 | 4.33 | 88.38 | 81.5 | 57.88 |
| 3 | Claude-Sonnet-4-20250514 (FC) | 70.29 | 41.49 | 4.08 | 88.38 | 81.05 | 54.75 |
| 4 | GLM-4.5-Air (FC) | 67.87 | 4.22 | 3.89 | 87.15 | 79.42 | 62.5 |
| 5 | Grok-4-0709 (Prompt) | 61.6 | 333.24 | 19.23 | 81.27 | 69.73 | 43.25 |
| 6 | Grok-4-0709 (FC) | 61.01 | 329.44 | 10.78 | 85.21 | 74.39 | 36.12 |

Key takeaways: GLM-4.5 (FC) leads in overall accuracy (70.85) and is the cheapest ($2.90). Claude Opus & Sonnet are close behind in accuracy but much more expensive. Grok-4 models have significantly higher costs and latency, with weaker multi-turn performance.
1/9/2025, 7:28:20 PM | 16 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me)

"Golden" — K-Pop Demon Hunters .. but metal 🤘 open.spotify.com/track/5chf2l...

1/9/2025, 7:16:29 PM | 1 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

no, we’ll definitely invent a successor that’s better, it just won’t be used

1/9/2025, 7:13:44 PM | 4 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

idk i’ve been working with protocols for over a decade (originally IoT) and protocols aren’t easily replaced. the network effects are too strong

1/9/2025, 7:11:07 PM | 3 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

that’s what i mean. the main problem with auto-conversion is that the docs still suck. if you stick to tools only, the conversion is fairly straightforward (FastMCP will do it dynamically)

1/9/2025, 7:09:57 PM | 0 0 | View on Bluesky | view
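
To make the "conversion is fairly straightforward" point concrete, here is a minimal, hypothetical sketch of what OpenAPI-to-tool auto-conversion does conceptually. The helper name, example spec, and HTTP handling are made up for illustration and are not FastMCP's actual API.

```python
# Hypothetical sketch: turn one OpenAPI operation into a tool spec + a callable.
# Not FastMCP's real API; names here are illustrative only.
import httpx

def openapi_operation_to_tool(path: str, method: str, op: dict, base_url: str):
    tool_spec = {
        "name": op.get("operationId", f"{method}_{path.strip('/').replace('/', '_')}"),
        # The tool description is only as good as the OpenAPI summary/description.
        "description": op.get("summary", "") or op.get("description", ""),
        "inputSchema": {
            "type": "object",
            "properties": {
                p["name"]: p.get("schema", {"type": "string"})
                for p in op.get("parameters", [])
            },
            "required": [p["name"] for p in op.get("parameters", []) if p.get("required")],
        },
    }

    def call(**kwargs):
        # Naive split: path params get substituted, everything else goes in the query string.
        url = base_url + path.format(**kwargs)
        return httpx.request(method.upper(), url, params=kwargs).json()

    return tool_spec, call

# Illustrative OpenAPI fragment, entirely made up.
op = {
    "operationId": "getOrder",
    "summary": "Fetch a single order by id",
    "parameters": [{"name": "order_id", "in": "path", "required": True,
                    "schema": {"type": "string"}}],
}
spec, fn = openapi_operation_to_tool("/orders/{order_id}", "get", op, "https://api.example.com")
print(spec["name"], "-", spec["description"])
```

Note how the description comes straight from the OpenAPI summary, which is exactly why "the docs still suck" if the upstream API docs suck.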

Profile picture Tim Kellogg (@timkellogg.me) reply parent

imo if the auto generation includes a “throw all your docs at an LLM” step, then auto generation is probably the best possible way to go most of the time

1/9/2025, 7:05:08 PM | 0 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

if there’s a need for modular context enhancement, it will be MCP, forever. tbh the main place where i’ve found MCP useful is as a plugin system. if you use it in a place where a plugin system isn’t useful, it’s just painful

1/9/2025, 7:03:07 PM | 2 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

what will replace it?

1/9/2025, 7:00:33 PM | 0 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me)

i just (internally) phrased “write a blog post” as something i need to do. the thing is, the only reader would be AI. i think i’ve been using AI like this for a while. the idea is mostly solidified, so i write to figure the rest out. then have AI code the idea up to see if it matches reality

1/9/2025, 6:46:32 PM | 8 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

okay, i used to be the guy with an 8 inch stack of unread papers on the corner of my desk. now, if it feels overwhelming i’ll just read 10% and hash out the rest with AI. so i end up reading a lot more, lower barrier to entry

1/9/2025, 6:43:19 PM | 4 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

probably, lots of practice now

1/9/2025, 6:40:01 PM | 3 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

well crap! lol, i guess i was thinking you were deeper into IR for some reason. sorry about that

1/9/2025, 6:39:12 PM | 0 0 | View on Bluesky | view

Profile picture Shawn Manuel (@shwnmnl.bsky.social) reposted

some obvious shortcomings lead many to think that LLMs can’t be useful thought companions, but they are outweighed by the benefits of having an infinite sounding board. the outputs shouldn’t replace your own thoughts, but help to refine them open.substack.com/pub/shifting...

1/9/2025, 6:33:34 PM | 8 2 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

that last part is key and really hard to teach. it takes a lot of introspection and self-awareness

1/9/2025, 6:37:59 PM | 1 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me)

ironically, one of the impacts of AI on me has been increased attention span & willingness to read more. curious if other people noticed the same thing. i used to rarely actually read technical blog posts, but now i dive straight through, quickly and with high comprehension

1/9/2025, 6:35:35 PM | 12 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

maybe i read too fast, but i saw that part about choosing embedding dimensions based on number of attention heads. pretty sure that’s only relevant for text generation. trouble is, “embeddings” can be either input or output, and the post seems to use it both ways without clarifying. a bit confusing

1/9/2025, 6:33:12 PM | 0 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

btw i said “partial function application”, which might be familiar if you did Haskell, but it’s probably more familiar as “class constructors” in OOP. same thing: you need to pass parameters through two separate channels

1/9/2025, 6:23:05 PM | 3 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

it’s a problem bc a lot of times you don’t want to leave some details up to the LLM. like if a tool takes “customer ID” as a tool parameter, there’s gonna be a lot of cases where you’d like to completely eliminate the chance for error

1/9/2025, 6:21:21 PM | 5 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me)

most people’s criticisms of MCP are either bogus or misapplied, but one that i never hear but should is that it doesn’t standardize partial function application. if you visualize the server as a class, and a tool as a function, the “constructor args” aren’t standardized. they go in ENV, headers, etc.

1/9/2025, 6:21:21 PM | 8 0 | View on Bluesky | view
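
A minimal sketch of the "partial function application" gap described above, using Python's functools.partial as the class-constructor analogy. The customer_id example and the tool schema shape are illustrative; the whole point of the post is that MCP does not standardize this binding.

```python
# Sketch: "constructor args" are bound out-of-band; only the rest is exposed as a tool.
import os
from functools import partial

def lookup_orders(customer_id: str, status: str) -> list[str]:
    """The full function: two parameters with very different trust levels."""
    return [f"order for {customer_id} with status {status}"]

# "Constructor args": bound once, outside the tool schema (ENV here; headers for HTTP servers).
# The LLM never sees or controls customer_id, eliminating that class of error.
bound_lookup = partial(lookup_orders, customer_id=os.environ.get("CUSTOMER_ID", "cust_123"))

# What actually gets exposed as the MCP tool: only the LLM-controlled parameter.
tool_schema = {
    "name": "lookup_orders",
    "inputSchema": {
        "type": "object",
        "properties": {"status": {"type": "string"}},
        "required": ["status"],
    },
}

# The model supplies status; the "constructor" already supplied customer_id.
print(bound_lookup(status="shipped"))
```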

Profile picture Tim Kellogg (@timkellogg.me) reply parent

oh, i think so. you just need an HTTP server, which i think it provides

1/9/2025, 6:02:48 PM | 1 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

this is for local serving on Apple Silicon

1/9/2025, 5:53:39 PM | 1 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

gosh, i understand that regulation is hard to get right, but that seems like a strong fucking argument to wait a bit longer until it’s a bit more baked

1/9/2025, 5:44:10 PM | 1 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

and what if you just split the supply chain so that no single provider supplies more than 10^13 FLOPS? annoying, but pretty easy to circumvent. what if model trends end up with SOTA being 1b-3b with heavy RL + tool use and the regulations don’t even apply?

1/9/2025, 5:43:29 PM | 4 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

to be clear, the regs start kicking in at around 13b model size, so this impacts just about everything regarding AI. here it seems to apply only to the “fine tune” part, but how do you even separate the two?? it’s almost like the authors had no idea what they were doing

1/9/2025, 5:43:29 PM | 1 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me)

“the EU AI regulation is good” no. omg no, it is definitely not. take this: if you fine-tune a regulated model, is your product also regulated? no idea. they legit never got that far, but shipped the regs anyway

If someone fine-tunes or otherwise modifies a model, do they have to comply with the obligations for providers of general-purpose AI models? General-purpose AI models may be further modified or fine-tuned into new models (Recital 97 AI Act). Accordingly, downstream entities that fine-tune or otherwise modify an existing general-purpose AI model may become providers of new models. The specific circumstances in which a downstream entity becomes a provider of a new model is a difficult question with potentially large economic implications, since many organisations and individuals fine-tune or otherwise modify general-purpose AI models developed by another entity. In the case of a modification or fine-tuning of an existing general-purpose AI model, the obligations for providers of general-purpose AI models in Article 53 AI Act should be limited to the modification or fine-tuning, for example, by complementing the already existing technical documentation with information on the modifications (Recital 109). The obligations for providers of general-purpose AI models with systemic risk in Article 55 AI Act should only apply in clearly specified cases. The AI Office intends to provide further clarifications on this question. Regardless of whether a downstream entity that incorporates a general-purpose AI model into an AI system is deemed to be a provider of the general-purpose AI model, that entity must comply with the relevant AI Act requirements and obligations for AI systems.
1/9/2025, 5:34:14 PM | 6 1 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

yeah, there's a balance. GPT-5 has been the best so far, but it has this weird behavior where it goes overly terse and i can't figure out what tf it's talking about

1/9/2025, 5:24:48 PM | 2 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

sycophancy is critical though. some: good, it lets you explore an idea. a lot: catastrophic, it leads you into dead ends. there’s this social angle that if it wasn’t sycophantic at all, it would shut down an idea before it bloomed enough to become real

1/9/2025, 4:17:55 PM | 9 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me)

progressive thought refinement is a great way to use AI, i do this a lot
1. type long jumbled thoughts
2. AI searches unrealistically hard to make sense of anything, rubber ducks back to you
3. dive deeper into angles that make sense
end up with much more coherence

1/9/2025, 4:17:55 PM | 32 2 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

flat white = steamer? my kids love those

1/9/2025, 2:25:36 PM | 0 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

well look at you pinky out, motherfuckers

1/9/2025, 2:21:24 PM | 1 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

it’s actually not a bad trade-off for personal gear when you’re extremely memory bound. i just wish they offered *a little*, something to cache the attention sinks

1/9/2025, 2:20:36 PM | 1 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

is australia all instant? when i was a kid, instant was horrendous, but it’s gotten pretty good lately. i’m still too scarred to buy it though

1/9/2025, 2:16:58 PM | 0 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

yeah, they’re what you’d use in production. you only do it on a dev box if you’re either masochistic or a fan of Arch Linux

1/9/2025, 2:14:16 PM | 3 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

vLLM? sglang?

1/9/2025, 2:07:57 PM | 3 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

basically ollama/llama.cpp/gguf is the mysql/php of AI. it’s extremely easy to get started but sacrifices correctness and quality all over

1/9/2025, 1:45:05 PM | 3 1 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

ollama wraps llama.cpp, which wraps the gguf file format. gguf represents the entire model as a single execution graph. this causes mismatch with some newer model architectures. llama.cpp is designed for CPU & personal GPU workloads, and sacrifices a lot of normal features like a KV cache

1/9/2025, 1:45:05 PM | 2 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me)

underrated wisdom: if you ask for black coffee and the waitress gives you side eye, that is information that shouldn’t be ignored

1/9/2025, 1:40:50 PM | 17 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

i’ve been annoyed by ollama’s gguf dependency for a while, increasingly so with all the recent LLM architecture innovations. MLX is such a high quality library, whereas ollama/llama.cpp is such a.. barely passable tool

1/9/2025, 11:54:40 AM | 8 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me)

mlx-knife: an ollama-like CLI for Apple Silicon. alright, this is the end of the road for me & ollama github.com/mzau/mlx-knife

1/9/2025, 11:54:40 AM | 25 1 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me)

notable — they did fp32 for attention and bf16 for MLP. gpt-oss went to the extreme low end, this went high (bf16 vs fp4). regarding the MLP layers, there’s a well known trade-off, and i imagine longcat went bf16 for the training stability. they don’t have the deep expertise yet, so it sheds risk

1/9/2025, 10:47:40 AM | 3 0 | View on Bluesky | view
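
A toy sketch of the precision split being described (fp32 for attention, bf16 for the MLP). This is purely illustrative, not LongCat's actual implementation; sizes and layer choices are arbitrary.

```python
# Toy sketch of a per-module precision split: fp32 attention, bf16 MLP.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        # Attention kept in fp32: softmax and attention logits are numerically touchy.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True).float()
        # MLP in bf16: matmul-heavy and tolerates lower precision well.
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        ).to(torch.bfloat16)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h32 = x.float()                                # upcast activations for attention
        attn_out, _ = self.attn(h32, h32, h32)
        x = x + attn_out.to(x.dtype)
        mlp_out = self.mlp(x.to(torch.bfloat16))       # downcast for the MLP
        return x + mlp_out.to(x.dtype)

x = torch.randn(2, 16, 64, dtype=torch.bfloat16)
print(Block(64, 8)(x).dtype)  # torch.bfloat16
```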

Profile picture Tim Kellogg (@timkellogg.me) reply parent

gpt-5 in codex-cli is actually very nice ngl

1/9/2025, 10:20:28 AM | 1 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

my parents say it used to be on their report card, so yeah, this checks out

1/9/2025, 9:42:17 AM | 1 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

always important for.. ya know

1/9/2025, 9:37:27 AM | 2 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

i hadn’t really thought about it until recently, but smart people on the left merely appreciate positive qualities, while on the right smart people seem to winnow it down to a stack-rankable number

1/9/2025, 1:43:32 AM | 7 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me)

is obsessing about IQ a right-aligned behavior?

1/9/2025, 1:38:41 AM | 18 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

catboost is incredible. almost no point in even tuning it

1/9/2025, 12:33:45 AM | 1 0 | View on Bluesky | view
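
A minimal sketch of the "almost no point in even tuning it" workflow: CatBoost with all-default hyperparameters. The dataset here is synthetic, just to show the shape of it.

```python
# Minimal sketch: CatBoost with default hyperparameters, no tuning.
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Defaults only: no learning-rate / depth / iteration sweeps.
model = CatBoostClassifier(verbose=False, random_seed=0)
model.fit(X_tr, y_tr)
print("held-out accuracy:", model.score(X_te, y_te))
```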

Profile picture Tim Kellogg (@timkellogg.me) reply parent

SEARCH THE COTS!!

1/9/2025, 12:28:35 AM | 2 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

this is hyperparameters, not data (e.g. batch size, learning rate, ..). you actually can do reverse (little -> big) distillation pretty easily. e.g. rephrasing generally doesn’t have to be a big model

31/8/2025, 8:42:42 PM | 1 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

mine is latent. trust me.

31/8/2025, 8:06:34 PM | 5 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

what are you doing? (clearly something very interesting)

31/8/2025, 8:06:12 PM | 1 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

ya know when you capitalize it like that it makes it seem like we’re not very serious here 😅

31/8/2025, 7:11:43 PM | 0 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

ha, at work AI interoperability is my jam, so i’m reasonably on top of this stuff

31/8/2025, 6:08:17 PM | 1 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

openai took the microservices approach — a model router lets teams work in parallel on different models with little to no coordination. it’s more flexible bc you can tackle wildly different ideas. this is the monolith approach. it’s really only good for scaling down compute

31/8/2025, 6:07:26 PM | 5 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

that’s one of those things that became obvious with entropix. models get stuck in a run of tokens where each one obviously follows the next. you don’t need a ton of compute for that, but others are linchpins, you need extra

31/8/2025, 6:01:10 PM | 5 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me)

routers are the hot thing. it’s a problem that needs to be solved: how do you scale down compute for easier problems? this one takes a wildly different approach, scaling down compute on a per-token basis

31/8/2025, 6:01:10 PM | 14 1 | View on Bluesky | view
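
An illustrative sketch of per-token compute gating by predictive entropy, in the spirit of the entropix observation above: tokens whose next step is obvious get the cheap path, high-entropy "linchpin" tokens get extra compute. The thresholds and bucketing are invented for illustration, not any specific paper's router.

```python
# Sketch: route each token to a compute budget based on next-token entropy.
import torch

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the next-token distribution, per position."""
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

def compute_budget(logits: torch.Tensor, low: float = 1.0, high: float = 3.0) -> torch.Tensor:
    """0 = cheap path (obvious next token), 1 = normal, 2 = extra compute (linchpin token)."""
    ent = token_entropy(logits)
    return torch.bucketize(ent, torch.tensor([low, high]))

logits = torch.randn(4, 50257)   # batch of 4 next-token distributions over a GPT-2-sized vocab
print(compute_budget(logits))    # random logits are high-entropy, so these land in bucket 2
```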

Profile picture Tim Kellogg (@timkellogg.me) reply parent

what do you mean? was their architecture known a while ago? (tbqh i hadn’t heard of them until today)

31/8/2025, 5:55:49 PM | 1 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me)

9yo was shocked to discover a store that does full body piercings. don’t worry, i told her it’s just the skin

31/8/2025, 5:26:19 PM | 10 1 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me)

pretty sure this means longcat is going to be another gpt-oss in that most providers are going to fuck up the configuration and you’ll get worse performance depending on who you go with

31/8/2025, 3:38:22 PM | 15 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

no, there’s no inner vs outer loop. this is literally just a regular MoE where some experts are noop

31/8/2025, 3:33:38 PM | 1 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

fwiw openai oss was like 1/32nd active

31/8/2025, 3:32:20 PM | 2 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

yeah ngl i had not understood the reason for the shared experts

31/8/2025, 3:20:03 PM | 2 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

it feels like a logical outcome of entropix — they found out that the attention logits were important and useful for sampling. this instead uses them for increasing TTC (test-time compute) w/o reasoning

31/8/2025, 2:52:22 PM | 0 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me)

9yo pointing Seek at her younger sister: “oh, she is human”

31/8/2025, 2:46:02 PM | 3 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

i’m not sure it’s not a PID controller either. there’s definitely a loop, and they’re controlling a single bias term..

31/8/2025, 1:24:43 PM | 1 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

i feel like this is the sort of shit you see when the US Government locks down compute bandwidth but not compute itself. We saw something similar with DeepSeek slinging their own PTX instead of CUDA to get around the nerfed comms

31/8/2025, 1:22:09 PM | 13 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

the "shortcut-connected MoE" part is solving a more complex problem than it seems on the surface. the problem is the hand-off between attention & MoE causes communication overhead (e.g. expert is located on a different GPU). ScMoE re-orders the pipeline, better utilizing compute

31/8/2025, 1:20:32 PM | 10 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

oh this took me too long to figure out — the "zero computation experts". they have a (mostly) regular MoE router, but some of the experts are actually nothing at all. so the MoE router sometimes entirely skips experts

31/8/2025, 1:15:09 PM | 13 0 | View on Bluesky | view
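
A minimal sketch of the "zero computation experts" idea described above: a regular top-1 MoE router where some expert slots are the identity function, so routing a token there skips compute entirely. Sizes and the top-1 routing are arbitrary simplifications.

```python
# Sketch: an MoE layer where some experts are no-ops (identity), so routing there skips compute.
import torch
import torch.nn as nn

class ZeroComputeMoE(nn.Module):
    def __init__(self, d_model: int, n_real: int = 4, n_zero: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, n_real + n_zero)
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_real)]
            + [nn.Identity() for _ in range(n_zero)]   # the "zero computation" experts
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); pick one expert per token
        choice = self.router(x).argmax(dim=-1)
        out = torch.empty_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = expert(x[mask])  # identity experts just copy the token through
        return out

x = torch.randn(16, 32)
print(ZeroComputeMoE(32)(x).shape)  # torch.Size([16, 32])
```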

Profile picture Tim Kellogg (@timkellogg.me) reply parent

agreed, i think they’re doing something funny with definitions. unfortunately i haven’t gotten the tech report on my phone yet, it’s not entirely clear if the dynamic MoE router and the “PID controller” are the same thing

31/8/2025, 1:00:39 PM | 2 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me)

all these years and still, compilers are just syntax databases

31/8/2025, 12:37:36 PM | 5 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

that’s really cool, i’ll check it out

31/8/2025, 12:34:15 PM | 1 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

how are chinese labs cutting their dependence on NVIDIA? like this: run experiments on tiny models, transfer hyperparameters (result of experiments) to a far larger model for the yolo run bsky.app/profile/timk...

31/8/2025, 12:01:54 PM | 16 2 | View on Bluesky | view
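
A sketch of what "transfer hyperparameters to a far larger model" can look like, in the spirit of muP-style transfer. The 1/width learning-rate rule below is a common heuristic used purely for illustration; real recipes depend on the exact parametrization and which hyperparameters are being transferred.

```python
# Sketch: tune on a tiny proxy model, scale the learning rate to the yolo-run width.
# The 1/width rule is an assumption for illustration, not a universal recipe.

def transfer_lr(base_lr: float, base_width: int, target_width: int) -> float:
    """Scale a learning rate found on a small proxy model to a much wider model."""
    return base_lr * (base_width / target_width)

proxy_lr = 3e-3  # found via cheap sweeps on a width-256 proxy
print(transfer_lr(proxy_lr, base_width=256, target_width=8192))  # ~9.4e-05 for the big run
```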

Profile picture Tim Kellogg (@timkellogg.me) reply parent

yeah, rewriting it into multiple queries with parseable relationships between them

31/8/2025, 11:34:53 AM | 1 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

most interesting — dynamic computation. not only is it a fairly sparse MoE, each token can receive dynamically more compute via a PID controller for bias adjustment 🤯 so when it gets to a token that requires extra thought, it’ll just spin there, computing more

LongCat-Flash is designed and optimized under two key principles: efficient computation utilization, as well as efficient training and inference. Specifically, (1) As not all tokens are equal, we introduce the zero-computation experts mechanism in MoE blocks to allocate a dynamic computation budget to important tokens based on their significance, i.e., activating 18.6 to 31.3 billion parameters (out of 560 billion total) based on contextual demands. To ensure consistent computation load, we employ expert bias adjusted by a PID-controller, maintaining an average of ~27 billion activated parameters per token. (2) As communication overhead becomes a bottleneck during MoE model scaling, we incorporate the Shortcut-connected MoE (ScMoE) design to expand the computation-communication overlap window. Combined with customized infrastructure optimizations, this design enables training at a massive scale of over tens of thousands of accelerators and inference with high throughput and low latency.
31/8/2025, 11:20:08 AM | 14 0 | View on Bluesky | view
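
A toy sketch of the PID-controlled bias idea: a controller nudges the routing bias so the average activated compute per token stays near a target. The gains, the bias target, and the observed-load numbers are invented for illustration and are not the paper's actual controller.

```python
# Sketch: PID loop that adjusts a router bias toward a target fraction of "real" experts.
class PIDBiasController:
    def __init__(self, target: float, kp: float = 0.5, ki: float = 0.05, kd: float = 0.1):
        self.target = target            # desired fraction of non-zero experts per token
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, observed: float) -> float:
        """Return a bias delta to add to the zero-expert logits."""
        error = observed - self.target  # too many real experts activated -> positive error
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

ctrl = PIDBiasController(target=0.5)
bias = 0.0
for observed_fraction in [0.8, 0.7, 0.6, 0.55, 0.5]:  # pretend per-batch measurements
    bias += ctrl.update(observed_fraction)            # push routing toward zero-experts
    print(round(bias, 3))
```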

Profile picture Tim Kellogg (@timkellogg.me)

Longcat-Flash-Chat (560B). uh, holy shit, this one is intriguing. bare minimum, they compare themselves to all the (actual) top models and do okay. but inside.. damn, this one has some cool ideas huggingface.co/meituan-long...

The image is a multi-panel bar chart comparing performance of different large language models across several benchmarks, divided into four categories: General Domains, Agentic Tool Use, Code, and Instruction Following.

| Benchmark | LongCat-Flash | Kimi K2 | DeepSeek V3.1 | Claude Sonnet | GPT-4.1 | Qwen3.5 MoE-2507 | Gemini 2.5 Flash |
|---|---|---|---|---|---|---|---|
| ArenaHard-V2 (General) | 86.5 | 88.2 | 84.1 | 61.5 | 62.1 | 85.7 | 77.0 |
| MMLU-Pro (General) | 82.7 | 84.5 | 84.5 | 83.7 | 81.7 | 82.1 | 82.0 |
| t2-Bench avg (Agentic) | 67.7 | 64.2 | 49.8 | 62.1 | 55.1 | 43.0 | 40.9 |
| VitaBench (Agentic) | 24.3 | 18.2 | 20.3 | 23.0 | 19.0 | 8.5 | 8.0 |
| SWE-Bench-Verified (Code) | 60.4 | 64.6 | 66.0 | 68.0 | 48.6 | 42.0 | 40.6 |
| TerminalBench (Code) | 39.5 | 25.9 | 31.3 | 40.7 | 28.4 | 17.3 | 12.4 |
| COLLIE (Instruction Following) | 57.1 | 56.3 | 49.7 | 51.2 | 50.0 | 43.8 | 48.6 |
| Meeseeks ZH (Instruction Following) | 43.0 | 42.8 | 35.3 | 41.5 | 35.1 | 33.8 | 34.8 |
31/8/2025, 11:20:08 AM | 45 5 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

imo if search is done perfectly, you effectively drive your LLM context to infinity. but it’s very much not a solved problem. to illustrate how underdeveloped this space is — research from 5 years ago still seems like the best ideas (contrast that to LLMs)

31/8/2025, 11:06:59 AM | 9 2 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

alternatively, sparse approaches like SPLADE do this in latent space but use inverted indices (regular full text search, exact matches) arxiv.org/abs/2107.057...

31/8/2025, 11:06:59 AM | 5 2 | View on Bluesky | view
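
A minimal sketch of the SPLADE-style setup mentioned above: queries and documents become sparse term-to-weight maps (expanded into vocabulary space by a model), and scoring is an exact-match dot product that an ordinary inverted index can serve. The expansions and weights below are made up.

```python
# Sketch: SPLADE-style sparse scoring, servable by a classic inverted index.
def score(query_terms: dict[str, float], doc_terms: dict[str, float]) -> float:
    return sum(w * doc_terms.get(t, 0.0) for t, w in query_terms.items())

# Pretend expansions a SPLADE-like encoder might produce (weights invented).
query = {"blue": 1.2, "truck": 1.5, "ford": 1.4, "pickup": 0.6}
docs = {
    "d1": {"truck": 1.1, "ford": 1.3, "f150": 0.9},     # Ford truck doc
    "d2": {"truck": 1.0, "blue": 1.2, "paint": 0.4},    # blue truck doc
}

# The index only needs the nonzero terms, exactly like classic BM25 posting lists.
print(sorted(docs, key=lambda d: score(query, docs[d]), reverse=True))
```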

Profile picture Tim Kellogg (@timkellogg.me) reply parent

you really need to capture the query and decompose it into multiple sub-queries. e.g. maybe get a 1B-3B LLM to rewrite the query into a DSL (e.g. a JSON breakdown of the various components and concepts in the query) and then push that logic into the database engine itself

31/8/2025, 11:06:59 AM | 4 3 | View on Bluesky | view
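
A sketch of the decomposition idea above: a small LLM rewrites the query into a tiny JSON DSL of sub-predicates, and the engine combines per-predicate results with boolean logic instead of asking a single embedding to encode the whole conjunction. The DSL shape and the retrieve() stub are hypothetical.

```python
# Sketch: query -> JSON DSL of sub-predicates -> per-predicate retrieval -> boolean combine.
import json

# Pretend this came back from a 1B-3B rewriter model.
llm_output = json.loads("""
{
  "operator": "AND",
  "predicates": [
    {"field": "preference", "text": "likes blue trucks"},
    {"field": "preference", "text": "likes Ford trucks"}
  ]
}
""")

def retrieve(text: str) -> set[str]:
    """Stand-in for one embedding or keyword search per sub-predicate."""
    fake_index = {
        "likes blue trucks": {"dave", "alice"},
        "likes Ford trucks": {"dave", "bob"},
    }
    return fake_index.get(text, set())

results = [retrieve(p["text"]) for p in llm_output["predicates"]]
combined = set.intersection(*results) if llm_output["operator"] == "AND" else set.union(*results)
print(combined)  # {'dave'}: the boolean logic stays in the engine, not the LLM
```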

Profile picture Tim Kellogg (@timkellogg.me) reply parent

multi-vector (late interaction) search like ColBERT also works, because it handles the predicate logic in cheaper latent space, but storage costs are a lot higher because, well it’s multi-vector (fwiw Qdrant and a few other vector DBs support multi-vectors) huggingface.co/jinaai/jina-...

31/8/2025, 11:06:59 AM | 5 3 | View on Bluesky | view
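
A minimal sketch of late-interaction (ColBERT-style) MaxSim scoring: every query token keeps its own vector, which is why conjunctions survive better than with a single pooled vector, and also why storage grows with token count. Dimensions and vectors here are random placeholders.

```python
# Sketch: ColBERT-style MaxSim over per-token embeddings.
import numpy as np

def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """query_vecs: (q_tokens, d), doc_vecs: (d_tokens, d); both L2-normalized."""
    sims = query_vecs @ doc_vecs.T         # (q_tokens, d_tokens) cosine similarities
    return float(sims.max(axis=1).sum())   # best doc token per query token, then sum

def l2norm(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
q = l2norm(rng.normal(size=(4, 128)))      # e.g. tokens for "likes", "blue", "ford", "truck"
d = l2norm(rng.normal(size=(40, 128)))     # document token embeddings (40x the storage of 1 vector)
print(maxsim(q, d))
```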

Profile picture Tim Kellogg (@timkellogg.me) reply parent

btw even adding a reranker won’t help if you’ve already dropped the relevant results in the first-stage embedding retrieval. agentic search DOES work, but now you’re relying on an expensive LLM to resolve simple boolean logic

31/8/2025, 11:06:59 AM | 7 2 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me)

Limits of vector search: a new GDM paper shows that embeddings can’t represent combinations of concepts well, e.g. “Dave likes blue trucks AND Ford trucks”. even k=2 sub-predicates make SOTA embedding models fall apart www.alphaxiv.org/pdf/2508.21038

31/8/2025, 11:06:59 AM | 75 22 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

i’m not sure teachers and parents generally get this nuance, but most AI tools don’t answer only from their weights. e.g. chatgpt will search the internet — problem is, if the answer isn’t found it’ll eagerly produce *something* anyway. notebooklm is a bit stricter about that, so i guess better

31/8/2025, 9:34:31 AM | 1 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

this makes me happy

Using GEPA. The Easiest Path: DSPy Integration. The easiest and most powerful way to use GEPA for prompt optimization is within DSPy, where the GEPA algorithm is directly available through the dspy.GEPA API. Directly executable tutorial notebooks are at dspy.GEPA Tutorials.
31/8/2025, 2:05:13 AM | 2 1 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

i thought this was all a joke this morning, now i’m starting to wonder

31/8/2025, 12:21:07 AM | 6 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

wait what? is there any non-gossip about this?

30/8/2025, 5:38:22 PM | 2 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

which photos?

30/8/2025, 5:33:59 PM | 1 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

the best kind of yummy tbh

30/8/2025, 3:31:05 PM | 1 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

very much so. i think all of AI is going this direction swiftly. the big labs own the priors (models) and algorithms, but they’re missing the environments, bc those are ours, they’re in the real world. things are about to change

30/8/2025, 3:30:38 PM | 1 1 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

fyi these are poison

30/8/2025, 3:16:06 PM | 3 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me)

yummy?

The image shows a plant with clusters of small, round, glossy black berries. The plant has bright green leaves that are oval-shaped with pointed tips and noticeable veins. Many of the leaves have small holes and damage, likely from insects. The berries grow in groups, typically 4–6 per cluster, hanging along the stems. The background shows other green plants and ground vegetation. This plant closely resembles black nightshade (Solanum nigrum), which can be toxic if consumed raw, especially the unripe green berries and leaves. The black berries may sometimes be eaten when fully ripe in certain cultures, but identification should be made with extreme caution because of its similarity to poisonous plants like deadly nightshade (Atropa belladonna).
30/8/2025, 2:56:04 PM | 6 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

i mean there’s a wide range between small scale experiments and yolo runs, hard to say what happens in the middle

30/8/2025, 2:23:59 PM | 1 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me)

this true?

30/8/2025, 2:04:53 PM | 9 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me)

DeepSeek is reducing their dependence on NVIDIA. they do small scale training runs & experiments on Huawei Ascend, but yolo runs on NVIDIA

image
30/8/2025, 1:56:11 PM | 25 4 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me)

something to keep in mind, modern RL is embedded in the real world via @jxmnop.bsky.social

Jack Morris @jxmnop • 17h: for the first time i am aware of, there is an entirely private subfield of AI research. every company that actually trains models is doing RL with rubrics and LLM-judged rewards, but academic work is stuck on RL with automated rewards (math problems and code). much cleaner for benchmarking. much easier to write papers about
30/8/2025, 12:53:38 PM | 20 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

to be clear, your point is about the RL reward function, not the RL environment. benchmarks are still different, but that’s not even the point. it’s that it’s targeting the real world. why do benchmarks seem contrived when you look at their definition? because they’re benchmarks. RL is different

30/8/2025, 12:49:25 PM | 3 0 | View on Bluesky | view

Profile picture Tim Kellogg (@timkellogg.me) reply parent

guys, i had just woken up, give me a pass. it was pre-coffee

30/8/2025, 12:42:51 PM | 6 0 | View on Bluesky | view