Discussion about this post

User's avatar
Alex Razvant's avatar

🔥

Pawel Jozefiak's avatar

The focused architecture beating larger models holds in my experience too. Running a 2B model routed appropriately beats a 35B doing everything, and you still get the 35B available for genuinely hard tasks.

The Ollama plus llama.cpp combo is what I landed on as well. Curious how GLM-OCR handles multi-column PDFs versus single-column. Everything I've tested degrades on dense tables regardless of model size, and I haven't found a good local solution for that specific layout yet. Does the Docker setup add meaningful latency versus direct llama.cpp execution?

No posts

Ready for more?