i’ve been annoyed by ollama’s gguf dependency for a while, increasingly so with all the recent LLM architecture innovations. MLX is such a high quality library, whereas ollama/llama.cpp is such a... barely passable tool
LM Studio was my step up from Ollama: MLX and GGUF with a handy server, a great UI, and an MCP client. I wish the server exposed a reranking endpoint, but other than that it’s great.
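For reference, the server speaks the OpenAI-compatible API, so hitting it from Python is only a few lines. Rough sketch, assuming the default port 1234 and whatever model identifier you have loaded:

```
# minimal sketch: talking to LM Studio's local OpenAI-compatible server
# assumes the server is running on its default port (1234) and that
# "your-loaded-model" is whatever model name LM Studio shows you
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="your-loaded-model",
    messages=[{"role": "user", "content": "Summarize what MLX is in one sentence."}],
)
print(resp.choices[0].message.content)
```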
Can you go into more detail about what makes ollama/llama.cpp so hard to deal with? I haven't used either much so I'm curious. I had thought they were pretty well liked by people.
ollama wraps llama.cpp, which wraps the gguf file format. gguf represents the entire model as a single execution graph, which causes mismatches with some newer model architectures. llama.cpp is designed for CPU & personal GPU workloads, and sacrifices a lot of normal features like a KV cache
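to make the layering concrete, here’s roughly what the llama.cpp layer that ollama sits on top of looks like through the llama-cpp-python bindings (the model path and settings are made up for illustration):

```
# rough sketch of the layer ollama wraps: llama.cpp loading a GGUF file
# via the llama-cpp-python bindings; path and settings are illustrative only
from llama_cpp import Llama

llm = Llama(
    model_path="./models/some-model.Q4_K_M.gguf",  # the whole model lives in one GGUF file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload everything that fits onto the GPU
)

out = llm("Explain the GGUF format in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```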
basically ollama/llama.cpp/gguf is the mysql/php of AI: it’s extremely easy to get started, but sacrifices correctness and quality all over
What's the Postgres of AI? It's not Oracle, but it's the next step up from the thing everybody gets into first?
vLLM? sglang?
Never heard of either one! Neat! 😁
yeah, they’re what you’d use in production. you only do it on a dev box if you’re either masochistic or a fan of Arch Linux
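rough sketch of what the vLLM route looks like, just to show it isn’t scary (the model name is only an example, swap in whatever fits your hardware):

```
# minimal vLLM sketch: offline batched generation with a production-grade engine;
# the model name is just an example
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["What is paged attention?"], params)
print(outputs[0].outputs[0].text)
```

there’s also an OpenAI-compatible server (`vllm serve <model>`) if you want the ollama-style workflow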
But sir, you repeat yourself.
Thanks for elaborating. I didn't know they sacrificed the KV cache altogether!
it’s actually not a bad trade-off for personal gear when you’re extremely memory bound. i just wish they offered *a little*, at least something to cache the attention sinks
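for anyone wondering what i mean by attention sinks: the first few tokens soak up a disproportionate share of attention mass, so a sliding-window cache that pins them degrades far less than one that evicts them. toy sketch of the eviction policy (not llama.cpp’s code, just the idea):

```
# toy sketch of attention-sink-aware KV cache eviction (the StreamingLLM idea):
# always keep the first `n_sink` positions, slide a window over the rest.
# illustrative only -- not how llama.cpp or any real engine stores KV.
def evict(kv_entries, n_sink=4, window=1024):
    """kv_entries: list of per-token KV blobs, oldest first."""
    if len(kv_entries) <= n_sink + window:
        return kv_entries
    # pin the sink tokens, keep only the most recent `window` of the rest
    return kv_entries[:n_sink] + kv_entries[-window:]

cache = []
for tok_kv in range(5000):   # stand-in for per-token KV tensors
    cache.append(tok_kv)
    cache = evict(cache)

print(len(cache), cache[:4], cache[-3:])  # 1028 entries: sinks 0-3 plus the last 1024
```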