i’ve been annoyed by ollama’s gguf dependency for a while, increasingly so with all the recent LLM architecture innovations. MLX is such a high quality library, whereas ollama/llama.cpp is such a... barely passable tool
LM Studio was my step up from Ollama: MLX and GGUF with a handy server, a great UI, and an MCP client. I wish the server exposed a reranking endpoint, but other than that it’s great.
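For reference, the server speaks the OpenAI-compatible API, so hitting it from Python is only a few lines. Rough sketch, assuming the default port 1234 and whatever model identifier you have loaded:

```
# minimal sketch: talking to LM Studio's local OpenAI-compatible server
# assumes the server is running on its default port (1234) and that
# "your-loaded-model" is whatever model name LM Studio shows you
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="your-loaded-model",
    messages=[{"role": "user", "content": "Summarize what MLX is in one sentence."}],
)
print(resp.choices[0].message.content)
```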
Can you go into more detail about what makes ollama/llama.cpp so hard to deal with? I haven't used either much so I'm curious. I had thought they were pretty well liked by people.
ollama wraps llama.cpp, which wraps the gguf file format. gguf represents the entire model as a single execution graph, which causes mismatches with some newer model architectures. llama.cpp is designed for CPU & personal GPU workloads, and sacrifices a lot of normal features like a KV cache
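to make the layering concrete, here’s roughly what the llama.cpp layer that ollama sits on top of looks like through the llama-cpp-python bindings (the model path and settings are made up for illustration):

```
# rough sketch of the layer ollama wraps: llama.cpp loading a GGUF file
# via the llama-cpp-python bindings; path and settings are illustrative only
from llama_cpp import Llama

llm = Llama(
    model_path="./models/some-model.Q4_K_M.gguf",  # the whole model lives in one GGUF file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload everything that fits onto the GPU
)

out = llm("Explain the GGUF format in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```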
basically ollama/llama.cpp/gguf is the mysql/php of AI: it’s extremely easy to get started, but sacrifices correctness and quality all over
What's the Postgres of AI? It's not Oracle, but it's the next step up from the thing everybody gets into first?
vLLM? sglang?
Never heard of either one! Neat! 😁
yeah, they’re what you’d use in production. you only do it on a dev box if you’re either masochistic or a fan of Arch Linux
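rough sketch of what the vLLM route looks like, just to show it isn’t scary (the model name is only an example, swap in whatever fits your hardware):

```
# minimal vLLM sketch: offline batched generation with a production-grade engine;
# the model name is just an example
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["What is paged attention?"], params)
print(outputs[0].outputs[0].text)
```

there’s also an OpenAI-compatible server (`vllm serve <model>`) if you want the ollama-style workflow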
But sir, you repeat yourself.
Thanks for elaborating. I didn't know they sacrificed the KV cache altogether!
it’s actually not a bad trade-off for personal gear when you’re extremely memory bound. i just wish they offered *a little*, at least something to cache the attention sinks
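for anyone wondering what i mean by attention sinks: the first few tokens soak up a disproportionate share of attention mass, so a sliding-window cache that pins them degrades far less than one that evicts them. toy sketch of the eviction policy (not llama.cpp’s code, just the idea):

```
# toy sketch of attention-sink-aware KV cache eviction (the StreamingLLM idea):
# always keep the first `n_sink` positions, slide a window over the rest.
# illustrative only -- not how llama.cpp or any real engine stores KV.
def evict(kv_entries, n_sink=4, window=1024):
    """kv_entries: list of per-token KV blobs, oldest first."""
    if len(kv_entries) <= n_sink + window:
        return kv_entries
    # pin the sink tokens, keep only the most recent `window` of the rest
    return kv_entries[:n_sink] + kv_entries[-window:]

cache = []
for tok_kv in range(5000):   # stand-in for per-token KV tensors
    cache.append(tok_kv)
    cache = evict(cache)

print(len(cache), cache[:4], cache[-3:])  # 1028 entries: sinks 0-3 plus the last 1024
```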