Tim Kellogg @timkellogg.me

mlx-knife: an ollama-like CLI for Apple Silicon

alright, this is the end of the road for me & ollama
github.com/mzau/mlx-knife

sep 1, 2025, 11:54 am • 25 1

Replies

VagabondVisions @vagabondvisions.bsky.social

This is just for working directly with the models, right? It doesn't serve locally so I can use other things to talk to the model?

sep 1, 2025, 5:50 pm • 0 0
Tim Kellogg @timkellogg.me

this is for local serving on Apple Silicon

sep 1, 2025, 5:53 pm • 1 0
VagabondVisions @vagabondvisions.bsky.social

Right, yes (and I am a novice on this, so please bear with me), but it will serve to my network? I'm playing around running various models on my Macbook, serving to some apps running on RPis on my network. If anything does that better or more efficiently, I'm intrigued.

sep 1, 2025, 5:56 pm • 0 0
Tim Kellogg @timkellogg.me

oh, i think so. you just need an HTTP server, which i think it provides

sep 1, 2025, 6:02 pm • 1 0
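A note for anyone trying this over a LAN: assuming mlx-knife's server speaks the OpenAI-compatible /v1/chat/completions protocol (an assumption worth verifying against the repo's README), any other machine on the network, Raspberry Pi included, only needs plain HTTP. A minimal Python sketch, with the host, port, and model name all placeholders:

```python
# Minimal sketch: query a local model server from another machine on the LAN.
# Assumes the server (e.g. mlx-knife on a Mac at 192.168.1.20:8000) exposes an
# OpenAI-compatible /v1/chat/completions endpoint -- check the project's README.
import json
import urllib.request

def chat(prompt: str, host: str = "192.168.1.20", port: int = 8000) -> str:
    payload = {
        "model": "default",  # placeholder; a real server may list models via GET /v1/models
        "messages": [{"role": "user", "content": prompt}],
    }
    req = urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Say hello from the Raspberry Pi."))
```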
VagabondVisions @vagabondvisions.bsky.social

Excellent, thank you! I will give it a look! I'm working on something that will watch my photography ingest folder on my NAS and then query a vision model to get alt-text that is then written to the sidecar file on the photos.

sep 1, 2025, 6:05 pm • 1 0
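A rough sketch of that ingest pipeline, purely illustrative: the endpoint, paths, and model name are assumptions, the request body is the common OpenAI-style vision payload, and a real build would likely want XMP sidecars and a proper filesystem watcher instead of polling:

```python
# Sketch: poll a NAS ingest folder for new photos, ask a vision model
# (assumed OpenAI-compatible, reachable over HTTP) for alt-text, and write
# it to a sidecar file next to each image. All names are illustrative.
import base64
import json
import time
import urllib.request
from pathlib import Path

INGEST = Path("/mnt/nas/ingest")                          # placeholder path
ENDPOINT = "http://192.168.1.20:8000/v1/chat/completions"  # placeholder endpoint

def describe(image_path: Path) -> str:
    b64 = base64.b64encode(image_path.read_bytes()).decode("ascii")
    payload = {
        "model": "default",  # placeholder model name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Write concise alt-text for this photo."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }
    req = urllib.request.Request(ENDPOINT, data=json.dumps(payload).encode("utf-8"),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

while True:
    for img in INGEST.glob("*.jpg"):
        sidecar = img.with_suffix(".txt")  # simple .txt sidecar, not XMP
        if not sidecar.exists():
            sidecar.write_text(describe(img))
    time.sleep(30)  # crude polling; a watcher library would be nicer
```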
Keith Duke @kmduke.bsky.social

Nice, I’m working with mlx a lot, great to see more examples using the library. Thanks for sharing!

sep 1, 2025, 3:04 pm • 1 0
🦆 @pretzelkins.bsky.social

Someone should compile a list of these ollama-like adapters for everything

sep 1, 2025, 2:26 pm • 0 0
Tim Kellogg @timkellogg.me

i’ve been annoyed by ollama’s gguf dependency for a while, increasingly so with all the recent LLM architecture innovations. MLX is such a high quality library, whereas ollama/llama.cpp is such a… barely passable tool

sep 1, 2025, 11:54 am • 8 0
Craig Hughes @craig.rungie.com

LM Studio was my step up from Ollama: MLX and gguf support, a handy server, a great UI, and an MCP client. I wish the server exposed a reranking endpoint, but other than that it’s great.

sep 1, 2025, 5:32 pm • 0 0
Chortle Reborn @warrenchortle.bsky.social

Can you go into more detail about what makes ollama/llama.cpp so hard to deal with? I haven't used either much so I'm curious. I had thought they were pretty well liked by people.

sep 1, 2025, 1:37 pm • 1 0
Tim Kellogg @timkellogg.me

ollama wraps llama.cpp, which wraps the gguf file format. gguf represents the entire model as a single execution graph, which causes mismatches with some newer model architectures. llama.cpp is designed for CPU & personal GPU workloads, and sacrifices a lot of normal features, like a KV cache

sep 1, 2025, 1:45 pm • 2 0
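For readers unfamiliar with the term: a KV cache stores each past token's attention keys and values so that decoding a new token only projects that one token, rather than re-encoding the whole prefix every step. A toy single-head sketch in numpy (just the idea, not any library's actual code):

```python
# Rough illustration of what a KV cache buys you during decoding: each new
# token's query attends over keys/values computed once and appended, instead
# of re-running the whole prefix through the layer every step.
import numpy as np

d = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.02 for _ in range(3))
K_cache, V_cache = [], []

def decode_step(x):
    """x: (d,) hidden state of the newest token."""
    q = x @ Wq
    K_cache.append(x @ Wk)  # one new key/value per step: O(1) extra projection...
    V_cache.append(x @ Wv)
    K, V = np.stack(K_cache), np.stack(V_cache)
    w = np.exp(q @ K.T / np.sqrt(d))
    w /= w.sum()
    return w @ V            # ...instead of re-projecting the entire prefix

for _ in range(8):
    out = decode_step(rng.normal(size=d))
```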
Tim Kellogg @timkellogg.me

basically ollama/llama.cpp/gguf is the mysql/php of AI: it’s extremely easy to get started with, but it sacrifices correctness and quality all over

sep 1, 2025, 1:45 pm • 3 1
𝙄𝙣𝙛𝙞𝙣𝙞𝙩𝙚 𝙅𝙚𝙨𝙩 Audiobook Narrator @jefferyharrell.bsky.social

What's the Postgres of AI? It's not Oracle, but it's the next step up from the thing everybody gets into first?

sep 1, 2025, 1:57 pm • 2 0
Tim Kellogg @timkellogg.me

vLLM? sglang?

sep 1, 2025, 2:07 pm • 3 0
𝙄𝙣𝙛𝙞𝙣𝙞𝙩𝙚 𝙅𝙚𝙨𝙩 Audiobook Narrator @jefferyharrell.bsky.social

Never heard of either one! Neat! 😁

sep 1, 2025, 2:08 pm • 1 0
Tim Kellogg @timkellogg.me

yeah, they’re what you’d use in production. you only do it on a dev box if you’re either masochistic or a fan of Arch Linux

sep 1, 2025, 2:14 pm • 3 0
𝙄𝙣𝙛𝙞𝙣𝙞𝙩𝙚 𝙅𝙚𝙨𝙩 Audiobook Narrator @jefferyharrell.bsky.social

But sir, you repeat yourself.

sep 1, 2025, 2:15 pm • 2 0
Chortle Reborn @warrenchortle.bsky.social

Thanks for elaborating. I didn't know they sacrificed the KV cache altogether!

sep 1, 2025, 2:09 pm • 1 0
Tim Kellogg @timkellogg.me

it’s actually not a bad trade-off for personal gear when you’re extremely memory bound. i just wish they offered *a little*, something to cache the attention sinks

sep 1, 2025, 2:20 pm • 1 0
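"Caching the attention sinks" refers to the StreamingLLM-style trick: attention tends to pile weight onto the first few tokens, so a bounded cache that always keeps those entries plus a recent window degrades much more gracefully than naive truncation. The eviction rule, sketched (real implementations also re-index positions after evicting the middle):

```python
# Sketch of sink-aware KV-cache eviction (StreamingLLM-style), not any
# particular library's API. kv_entries: per-token (key, value) pairs, oldest first.
def evict(kv_entries, n_sink=4, window=1024):
    if len(kv_entries) <= n_sink + window:
        return kv_entries  # still under budget, keep everything
    # keep the first n_sink "attention sink" tokens plus the most recent window
    return kv_entries[:n_sink] + kv_entries[-window:]
```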
Richard Whaling @r.whal.ing

Oooooooh

sep 1, 2025, 11:59 am • 1 0