Tim Kellogg (@timkellogg.me)
AI Architect | North Carolina | AI/ML, IoT, science
WARNING: I talk about kids sometimes
6,550 followers 681 following 9,111 posts
Tim Kellogg (@timkellogg.me)
we’re reaching a point where only HBO can help us understand what’s going on
Paul Sample (@paulsample.bsky.social) reposted reply parent
I think these guys are based in the US and offer GLM-4.5 synthetic.new/landing/home
Tim Kellogg (@timkellogg.me) reply parent
no doubt
Tim Kellogg (@timkellogg.me)
it’s a crying shame that no US operators are running GLM-4.5 (that i know of)
unseating Opus at **two orders of magnitude** lower cost
Tim Kellogg (@timkellogg.me)
"Golden" — K-Pop Demon Hunters .. but metal 🤘 open.spotify.com/track/5chf2l...
Tim Kellogg (@timkellogg.me) reply parent
no, we’ll definitely invent a successor that’s better, it just won’t be used
Tim Kellogg (@timkellogg.me) reply parent
idk i’ve been working with protocols for over a decade (originally IoT) and protocols aren’t easily replaced. the network effects are too strong
Tim Kellogg (@timkellogg.me) reply parent
that’s what i mean. the main problem with auto-conversion is that the docs still suck
if you stick to tools only, the conversion is fairly straightforward (FastMCP will do it dynamically)
Tim Kellogg (@timkellogg.me) reply parent
imo if the auto generation includes a “throw all your docs at an LLM” step, then auto generation is probably the best possible way to go most of the time
Tim Kellogg (@timkellogg.me) reply parent
if there’s a need for modular context enhancement, it will be MCP, forever
tbh the main place where i’ve found MCP useful is as a plugin system. if you use it in a place where a plugin system isn’t useful it’s just painful
Tim Kellogg (@timkellogg.me) reply parent
what will replace it?
Tim Kellogg (@timkellogg.me)
i just (internally) phrased “write a blog post” as something i need to do
the thing is, the only reader would be AI
i think i’ve been using AI like this for a while. the idea is mostly solidified, so i write to figure the rest out. then have AI code the idea up to see if it matches reality
Tim Kellogg (@timkellogg.me) reply parent
okay, i used to be the guy with an 8 inch stack of unread papers on the corner of my desk
now, if it feels overwhelming i’ll just read 10% and hash out the rest with AI. so i end up reading a lot more, lower barrier to entry
Tim Kellogg (@timkellogg.me) reply parent
probably, lots of practice now
Tim Kellogg (@timkellogg.me) reply parent
well crap! lol, i guess i was thinking you were deeper into IR for some reason. sorry about that
Shawn Manuel (@shwnmnl.bsky.social) reposted
some obvious shortcomings lead many to think that LLMs can’t be useful thought companions, but they are outweighed by the benefits of having an infinite sounding board
the outputs shouldn’t replace your own thoughts, but help to refine them
open.substack.com/pub/shifting...
Tim Kellogg (@timkellogg.me) reply parent
that last part is key and really hard to teach. it takes a lot of introspection and self-awareness
Tim Kellogg (@timkellogg.me)
ironically, one of the impacts of AI on me has been increased attention span & willingness to read more
curious if other people noticed the same thing. i used to rarely actually read technical blog posts, but now i dive straight through, quickly and with high comprehension
Tim Kellogg (@timkellogg.me) reply parent
maybe i read too fast, but i saw that part about choosing embedding dimensions based on number of attention heads
pretty sure that’s only relevant for text generation
trouble is “embeddings” can be either input or output, and the post seems to use it both ways without clarifying. a bit confusing
Tim Kellogg (@timkellogg.me) reply parent
btw i said “partial function application”, which might be familiar if you did Haskell, but it’s probably more familiar as “class constructors” in OOP
same thing: you need to pass parameters through two separate channels
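a minimal Python sketch of that analogy (the function and names are made up for illustration):

```python
# Illustrative only: fixing some parameters up front, passing the rest later.
from functools import partial

def send_invoice(customer_id: str, amount: float) -> str:
    return f"invoiced {customer_id} for ${amount:.2f}"

# partial function application: bind customer_id now, supply amount later
invoice_acme = partial(send_invoice, "acme-123")
print(invoice_acme(49.99))

# the OOP flavor of the same thing: the constructor carries customer_id,
# the method takes the remaining parameter
class InvoiceSender:
    def __init__(self, customer_id: str):
        self.customer_id = customer_id

    def send(self, amount: float) -> str:
        return send_invoice(self.customer_id, amount)

print(InvoiceSender("acme-123").send(49.99))
```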
Tim Kellogg (@timkellogg.me) reply parent
it’s a problem bc a lot of times you don’t want to leave some details up to the LLM. like if a tool takes “customer ID” as a tool parameter, there’s gonna be a lot of cases where you’d like to completely eliminate the chance for error
Tim Kellogg (@timkellogg.me)
most people’s criticisms of MCP are either bogus or misapplied, but one that i never hear but should is that it doesn’t standardize partial function application
if you visualize the server as a class, and tool as function, the “constructor args” aren’t standardized
they go in ENV, headers, etc.
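a rough sketch of that framing in plain Python (not real MCP code; the names and the ENV variable are invented):

```python
import os

def search_tickets(customer_id: str, query: str) -> list[str]:
    # stand-in for a real backend call
    return [f"{customer_id}: ticket matching '{query}'"]

class SupportServer:
    """The 'server as class' view."""

    def __init__(self) -> None:
        # the "constructor arg": pinned server-side so the LLM can never get it
        # wrong, but it arrives via ENV/headers/config, which the protocol
        # doesn't standardize
        self.customer_id = os.environ.get("CUSTOMER_ID", "acme-123")

    # the part that is standardized: the tool and its parameters (just `query`)
    def lookup_tickets(self, query: str) -> list[str]:
        return search_tickets(self.customer_id, query)

print(SupportServer().lookup_tickets("billing error"))
```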
Tim Kellogg (@timkellogg.me) reply parent
oh, i think so. you just need an HTTP server, which i think it provides
Tim Kellogg (@timkellogg.me) reply parent
this is for local serving on Apple Silicon
Tim Kellogg (@timkellogg.me) reply parent
gosh, i understand that regulation is hard to get right, but that seems like a strong fucking argument to wait a bit longer until it’s a bit more baked
Tim Kellogg (@timkellogg.me) reply parent
and what if you just split the supply chain so that no single provider supplies more than 10^13 FLOPS? annoying, but pretty easy to circumvent
what if model trends end up with SOTA being 1b-3b with heavy RL + tool use and the regulations don’t even apply?
Tim Kellogg (@timkellogg.me) reply parent
to be clear, the regs start kicking in at around 13b model size, so this impacts just about everything regarding AI
here it seems to apply to only the “fine tune” part, but how do you even separate the two?? it’s almost like the authors had no idea what they were doing
Tim Kellogg (@timkellogg.me)
“the EU AI regulation is good”
no omg no it is definitely not
take this: if you fine-tune a regulated model, is your product also regulated? no idea. they legit never got that far but shipped the regs anyway
Tim Kellogg (@timkellogg.me) reply parent
yeah, there's a balance. so far, GPT-5 has been the best, but it has this weird behavior where it goes overly terse and i can't figure out what tf it's talking about
Tim Kellogg (@timkellogg.me) reply parent
sycophancy is critical though
some: good, it lets you explore an idea
a lot: catastrophic, it leads you into dead ends
there’s this social angle that if it wasn’t sycophantic at all, it would shut down an idea before it bloomed enough to become real
Tim Kellogg (@timkellogg.me)
progressive thought refinement is a great way to use AI, i do this a lot
1. type long jumbled thoughts
2. AI searches unrealistically hard to make sense of anything, rubber duck’s back to you
3. dive deeper into angles that make sense
end up with much more coherence
Tim Kellogg (@timkellogg.me) reply parent
flat white = steamer? my kids love those
Tim Kellogg (@timkellogg.me) reply parent
well look at you pinky out, motherfuckers
Tim Kellogg (@timkellogg.me) reply parent
it’s actually not a bad trade-off for personal gear when you’re extremely memory bound. i just wish they offered *a little*, something to cache the attention sinks
Tim Kellogg (@timkellogg.me) reply parent
is australia all instant? when i was a kid, instant was horrendous, but it’s gotten pretty good lately. i’m still too scarred to buy it though
Tim Kellogg (@timkellogg.me) reply parent
yeah, they’re what you’d use in production. you only do it on a dev box if you’re either masochistic or a fan of Arch Linux
Tim Kellogg (@timkellogg.me) reply parent
vLLM? sglang?
Tim Kellogg (@timkellogg.me) reply parent
basically ollama/llama.cpp/gguf is the mysql/php of AI
it’s extremely easy to get started but sacrifices correctness and quality all over
Tim Kellogg (@timkellogg.me) reply parent
ollama wraps llama.cpp which wraps the gguf file format
gguf represents the entire model as a single execution graph. this causes mismatch with some newer model architectures
llama.cpp is designed for CPU & personal GPU workloads, and sacrifices a lot of normal features like a KV cache
Tim Kellogg (@timkellogg.me)
underrated wisdom: if you ask for black coffee and the waitress gives you side eye, that is information that shouldn’t be ignored
Tim Kellogg (@timkellogg.me) reply parent
i’ve been annoyed by ollama’s gguf dependency for a while, increasingly so with all the recent LLM architecture innovations
MLX is such a high quality library whereas ollama/llama.cpp is such a.. barely passable tool
Tim Kellogg (@timkellogg.me)
mlx-knife: an ollama-like CLI for Apple Silicon
alright, this is the end of the road for me & ollama
github.com/mzau/mlx-knife
Tim Kellogg (@timkellogg.me)
notable — they did fp32 for attention and bf16 for MLP
gpt-oss went to extreme low end, this went high
bf16 & fp4
regarding the MLP layers, there’s a well known trade-off, and i imagine longcat went bf16 for the training stability. they don’t have the deep expertise yet, so it sheds risk
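not their code, just a toy PyTorch sketch of what that precision split looks like in practice (layout and sizes assumed for illustration):

```python
# Toy sketch: attention runs in fp32 for numerical stability, the MLP stays in
# bf16 to save memory/compute. Residuals and norms omitted to keep it short.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True).to(torch.float32)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        ).to(torch.bfloat16)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x.to(torch.float32)
        h, _ = self.attn(h, h, h)            # attention math in full precision
        h = self.mlp(h.to(torch.bfloat16))   # MLP in bf16
        return h.to(x.dtype)

x = torch.randn(2, 8, 64, dtype=torch.bfloat16)
print(Block()(x).dtype)  # torch.bfloat16
```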
Tim Kellogg (@timkellogg.me) reply parent
gpt-5 in codex-cli is actually very nice ngl
Tim Kellogg (@timkellogg.me) reply parent
my parents say it used to be on their report card, so yeah, this checks out
Tim Kellogg (@timkellogg.me) reply parent
always important for.. ya know
Tim Kellogg (@timkellogg.me) reply parent
i hadn’t really thought about it until recently, but smart people on the left merely appreciate positive qualities, while on the right smart people seem to winnow it down to a stack-rankable number
Tim Kellogg (@timkellogg.me)
is obsessing about IQ a right-aligned behavior?
Tim Kellogg (@timkellogg.me) reply parent
catboost is incredible. almost no point in even tuning it
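a minimal usage sketch of that point, all defaults and no tuning (the dataset here is a synthetic stand-in):

```python
# CatBoost with pure defaults; no hyperparameter tuning at all.
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = CatBoostClassifier(verbose=0)  # defaults only
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```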
Tim Kellogg (@timkellogg.me) reply parent
SEARCH THE COTS!!
Tim Kellogg (@timkellogg.me) reply parent
this is hyperparameters not data (e.g. batch size, learning rate, ..)
you actually can do reverse (little -> big) distillation pretty easily. e.g. rephrasing generally doesn’t have to be a big model
Tim Kellogg (@timkellogg.me) reply parent
mine is latent. trust me.
Tim Kellogg (@timkellogg.me) reply parent
what are you doing? (clearly something very interesting)
Tim Kellogg (@timkellogg.me) reply parent
ya know when you capitalize it like that it makes it seem like we’re not very serious here 😅
Tim Kellogg (@timkellogg.me) reply parent
ha, at work AI interoperability is my jam, so i’m reasonably on top of this stuff
Tim Kellogg (@timkellogg.me) reply parent
openai took the microservices approach — a model router lets teams work in parallel on different models with little to no coordination. it’s more flexible bc you can tackle wildly different ideas
this is the monolith approach. it’s really only good for scaling down compute
Tim Kellogg (@timkellogg.me) reply parent
that’s one of those things that became obvious with entropix
models get stuck in a run of tokens where each one obviously follows the next. you don’t need a ton of compute for that
but others are linchpins, you need extra
Tim Kellogg (@timkellogg.me)
routers are the hot thing
it’s a problem that needs to be solved. how do you scale down compute for easier problems?
this one takes a wildly different approach, scaling down compute on a per-token basis
Tim Kellogg (@timkellogg.me) reply parent
what do you mean? was their architecture known a while ago? (tbqh i hadn’t heard of them until today)
Tim Kellogg (@timkellogg.me)
9yo was shocked to discover a store that does full body piercings
don’t worry i told her it’s just the skin
Tim Kellogg (@timkellogg.me)
pretty sure this means longcat is going to be another gpt-oss in that most providers are going to fuck up the configuration and you’ll get worse performance depending on who you go with
Tim Kellogg (@timkellogg.me) reply parent
no, there’s no inner vs outer loop. this is literally just a regular MoE where some experts are noop
Tim Kellogg (@timkellogg.me) reply parent
fwiw openai oss was like 1/32nd active
Tim Kellogg (@timkellogg.me) reply parent
yeah ngl i had not understood the reason for the shared experts
Tim Kellogg (@timkellogg.me) reply parent
it feels like a logical outcome of entropix — they found out that the attention logits were important and useful for sampling, this instead uses them for increasing TTC w/o reasoning
Tim Kellogg (@timkellogg.me)
9yo pointing Seek at her younger sister: “oh, she is human”
Tim Kellogg (@timkellogg.me) reply parent
i’m not sure it’s not a PID controller either. there’s definitely a loop, and they’re controlling a single bias term..
Tim Kellogg (@timkellogg.me) reply parent
i feel like this is the sort of shit you see when the US Government locks down compute bandwidth but not compute itself. We saw something similar with DeepSeek slinging their own PTX instead of CUDA to get around the nerfed comms
Tim Kellogg (@timkellogg.me) reply parent
the "shortcut-connected MoE" part is solving a more complex problem than it seems on the surface the problem is the hand-off between attention & MoE causes communication overhead (e.g. expert is located on a different GPU) ScMoE re-orders the pipeline, better utilizing compute
Tim Kellogg (@timkellogg.me) reply parent
oh this took me too long to figure out — the "zero computation experts"
they have a (mostly) regular MoE router, but some of the experts are actually nothing at all. So the MoE router sometimes entirely skips experts
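a toy sketch of that reading (not their code): a normal top-k router, except some "experts" are identity ops, so tokens routed there skip the FFN compute entirely:

```python
import torch
import torch.nn as nn

class ZeroComputeMoE(nn.Module):
    def __init__(self, dim: int = 32, n_real: int = 4, n_zero: int = 2, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, n_real + n_zero)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(n_real)]
            + [nn.Identity() for _ in range(n_zero)]  # the "zero computation" experts
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        weights, idx = self.router(x).softmax(dim=-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                 # naive per-token dispatch, toy only
            for w, e in zip(weights[t], idx[t]):
                out[t] += w * self.experts[int(e)](x[t])
        return out

print(ZeroComputeMoE()(torch.randn(5, 32)).shape)  # torch.Size([5, 32])
```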
Tim Kellogg (@timkellogg.me) reply parent
agreed, i think they’re doing something funny with definitions. unfortunately i haven’t gotten the tech report on my phone yet, it’s not entirely clear if the dynamic MoE router and the “PID controller” are the same thing
Tim Kellogg (@timkellogg.me)
all these years and still, compilers are just syntax databases
Tim Kellogg (@timkellogg.me) reply parent
that’s really cool, i’ll check it out
Tim Kellogg (@timkellogg.me) reply parent
how are chinese labs cutting their dependence on NVIDIA?
like this: run experiments on tiny models, transfer hyperparameters (result of experiments) to a far larger model for the yolo run
bsky.app/profile/timk...
Tim Kellogg (@timkellogg.me) reply parent
yeah, rewriting it into multiple queries with parseable relationships between them
Tim Kellogg (@timkellogg.me) reply parent
most interesting — dynamic computation
not only is it fairly sparse MoE, each token can receive dynamically more compute via a PID controller for bias adjustment 🤯
so when it gets to a token that requires extra thought, it’ll just spin there, computing more
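roughly how i picture the bias-adjustment loop, as a proportional-only toy (not their actual controller; every number here is invented): the bias nudges the router so the *average* number of real experts per token tracks a target, while individual tokens can still land above or below it.

```python
import torch

n_real, n_zero, k = 4, 2, 2          # 4 real experts, 2 "zero computation" experts
target_real = 1.5                    # desired average # of real experts per token
bias = torch.zeros(n_real + n_zero)  # bias added to router logits
kp = 0.1                             # proportional gain

for step in range(200):
    # fake router logits for a batch of tokens
    logits = torch.randn(512, n_real + n_zero) + bias
    chosen = logits.topk(k, dim=-1).indices
    real_per_token = (chosen < n_real).float().sum(dim=-1)

    # feedback: if tokens get too few real experts on average, push the bias
    # toward the real experts and away from the noop ones, and vice versa
    error = target_real - real_per_token.mean()
    bias[:n_real] += kp * error
    bias[n_real:] -= kp * error

print("avg real experts per token:", real_per_token.mean().item())
```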
Tim Kellogg (@timkellogg.me)
Longcat-Flash-Chat (560B)
uh, holy shit this one is intriguing. bare minimum they compare themselves to all the (actual) top models and do okay
but inside.. damn this one has some cool ideas
huggingface.co/meituan-long...
Tim Kellogg (@timkellogg.me) reply parent
imo if search is done perfectly, you effectively drive your LLM context to infinity
but it’s very much not a solved problem
to illustrate how underdeveloped this space is — research from 5 years ago still seems like the best ideas (contrast that to LLMs)
Tim Kellogg (@timkellogg.me) reply parent
alternatively, sparse approaches like SPLADE do this in latent space but use inverted indices (regular full text search, exact matches) arxiv.org/abs/2107.057...
Tim Kellogg (@timkellogg.me) reply parent
you really need to capture the query and decompose it into multiple sub-queries
e.g. maybe get a 1B-3B LLM to rewrite the query into a DSL (e.g. a JSON breakdown of the various components and concepts in the query) and then push that logic into the database engine itself
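a hand-written example of the kind of breakdown meant here, using the query from the paper thread; the JSON schema and the SQL push-down are invented for illustration:

```python
import json

# what a small LLM might be prompted to emit for:
#   "Dave likes blue trucks AND Ford trucks"
decomposed = {
    "op": "AND",
    "clauses": [
        {"field": "color", "value": "blue", "subject": "truck"},
        {"field": "make", "value": "Ford", "subject": "truck"},
    ],
}

def to_sql_where(node: dict) -> str:
    """Push the boolean logic down into the database engine instead of an embedding."""
    if "op" in node:
        return "(" + f" {node['op']} ".join(to_sql_where(c) for c in node["clauses"]) + ")"
    return f"{node['field']} = '{node['value']}'"

print(json.dumps(decomposed, indent=2))
print(to_sql_where(decomposed))   # (color = 'blue' AND make = 'Ford')
```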
Tim Kellogg (@timkellogg.me) reply parent
multi-vector (late interaction) search like ColBERT also works, because it handles the predicate logic in cheaper latent space, but storage costs are a lot higher because, well it’s multi-vector
(fwiw Qdrant and a few other vector DBs support multi-vectors)
huggingface.co/jinaai/jina-...
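for reference, the scoring that makes late interaction work, as a toy numpy sketch (shapes and data invented):

```python
import numpy as np

def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style MaxSim: for each query token vector take its best-matching
    doc token vector, then sum. Assumes both sides are L2-normalized."""
    sims = query_vecs @ doc_vecs.T            # (q_tokens, d_tokens)
    return float(sims.max(axis=1).sum())

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8));  q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.standard_normal((30, 8)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim(q, d))
```

the storage cost mentioned above is the flip side: you keep one vector per token instead of one per document.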
Tim Kellogg (@timkellogg.me) reply parent
btw even adding a reranker won’t help if you’ve already dropped the relevant results in the first stage embedding retrieval
agentic search DOES work, but now you’re relying on an expensive LLM to resolve simple boolean logic
Tim Kellogg (@timkellogg.me)
Limits of vector search
a new GDM paper shows that embeddings can’t represent combinations of concepts well
e.g. Dave likes blue trucks AND Ford trucks
even k=2 sub-predicates make SOTA embedding models fall apart
www.alphaxiv.org/pdf/2508.21038
Tim Kellogg (@timkellogg.me) reply parent
i’m not sure teachers and parents generally get this nuance, but most AI tools don’t answer only from their weights. e.g. chatgpt will search the internet — problem is if the answer isn’t found it’ll eagerly produce *something* anyway
notebooklm is a bit stricter about that, so i guess better
Tim Kellogg (@timkellogg.me) reply parent
i thought this was all a joke this morning, now i’m starting to wonder
Tim Kellogg (@timkellogg.me) reply parent
wait what? is there any non-gossip about this?
Tim Kellogg (@timkellogg.me) reply parent
which photos?
Tim Kellogg (@timkellogg.me) reply parent
the best kind of yummy tbh
Tim Kellogg (@timkellogg.me) reply parent
very much so. i think all of AI is going this direction swiftly
the big labs own the priors (models) and algorithms, but they’re missing the environments, bc those are ours, they’re in the real world
things are about to change
Tim Kellogg (@timkellogg.me) reply parent
fyi these are poison
Tim Kellogg (@timkellogg.me) reply parent
i mean there’s a wide range between small scale experiments and yolo runs, hard to say what happens in the middle
Tim Kellogg (@timkellogg.me)
this true?
Tim Kellogg (@timkellogg.me)
DeepSeek is reducing their dependence on NVIDIA
they do small scale training runs & experiments on Huawei Ascend, but yolo runs on NVIDIA
Tim Kellogg (@timkellogg.me)
something to keep in mind, modern RL is embedded in the real world
via @jxmnop.bsky.social
Tim Kellogg (@timkellogg.me) reply parent
to be clear, your point is about the RL reward function, not the RL environment. benchmarks are still different, but that’s not even the point. it’s that it’s targeting the real world
why do benchmarks seem contrived when you look at their definition? because they’re benchmarks. RL is different
Tim Kellogg (@timkellogg.me) reply parent
guys, i had just woken up, give me a pass. it was pre-coffee