Run the World's Best OCR on Your Own Laptop
Finetuning Sessions · Lab 8 / 8
In the previous article, "A Practical Guide to LLM Inference at Scale", we explored the theory behind serving large language models efficiently — from quantization strategies to deployment architectures.
But theory without practice is just PowerPoint.
In this hands-on guide, we're getting our hands dirty with a model that perfectly illustrates those principles in action: GLM-OCR, a 0.9B parameter vision-language model that ranks #1 on OmniDocBench V1.5, beating models 10x its size.
We'll take it from a local Docker container running on your laptop all the way to a production-ready pipeline, covering hardware optimization, custom model configuration, and the official SDK for complex document parsing. If you read the theory, this is where it clicks.
Why OCR?
The Optical Character Recognition (OCR) landscape is vast, but GLM-OCR stands out as a multimodal model specifically built for complex document understanding.

Developed by Z.ai and based on the GLM-V encoder-decoder architecture, it introduces advanced training techniques like Multi-Token Prediction (MTP) loss and full-task reinforcement learning to drastically improve recognition accuracy. Instead of relying on massive, unwieldy models, GLM-OCR proves that highly focused architectures can dominate specific tasks.
Despite its incredibly small size of just 0.9 billion parameters, GLM-OCR achieves a score of 94.62 on OmniDocBench V1.5, ranking #1 overall. This small footprint means it can run fully locally on standard consumer devices, like MacBooks or edge devices, without sacrificing capability.
It successfully rivals and often outperforms much larger, closed-source models across benchmarks for formula recognition, table extraction, and information extraction.
The secret to this "small but mighty" performance is its two-stage pipeline, which pairs the language decoder with the PP-DocLayout-V3 layout detection model. By first analyzing the document layout and then performing parallel recognition, GLM-OCR maintains robust performance on highly complex real-world scenarios, including code-heavy documents, intricate tables, and documents with rotating or staggered layouts.
Scaling Inference with vLLM
Running models locally with Ollama is perfect for testing, personal use, and CPU-only environments.
However, as your document parsing needs grow, you must consider the transition from a local machine to a robust cloud infrastructure. Cloud deployments allow you to serve the pipeline at scale, making use of time-slicing Kubernetes (k8s) configurations and worker-server deployments for maximum efficiency.
When moving to a production environment, the official GLM-OCR documentation strongly recommends transitioning to engines like vLLM or SGLang. These frameworks are specifically designed for high-concurrency services and provide significantly better performance and stability when you have access to one or multiple GPUs.

Using vLLM allows you to serve the model via an OpenAI-compatible /v1/chat/completions API endpoint. When configured properly—such as setting the max_workers and connection_pool_size appropriately in your SDK configuration to avoid 503 errors—vLLM ensures your pipeline can handle massive parallel OCR requests without crashing under load.
Introducing Ollama and llama.cpp
Before deploying, it is crucial to understand the ecosystem that makes local inference so accessible. At the core of this democratization is llama.cpp, a high-performance C++ engine designed to run LLMs on standard hardware with maximum efficiency.
While
llama.cppis incredibly powerful, it can require manual compilation and complex command-line arguments to operate.
This is where Ollama steps in as the "user interface" and manager. Ollama acts as a user-friendly wrapper around the llama.cpp backbone, allowing developers to download models, manage memory, and serve a clean API with simple commands. It handles the underlying complexity, bringing powerful language models to developers who may not be machine learning engineers.
To achieve this efficiency on local hardware, the engine relies heavily on quantization and the GGUF file format. Quantization shrinks the size of the model weights—such as using 2-bit (Q2) or 4-bit (Q4) representations instead of standard 16-bit floats—so the model can run on cheaper hardware without losing significant performance.
🔍 Step 0: Discover Your Hardware Specs
To get the best speed, you must identify your Physical Cores. LLMs perform best when assigned to physical cores rather than "logical" ones (Hyperthreading/SMT).
Windows (PowerShell)
Get-WmiObject -Class Win32_Processor | Select-Object Name, NumberOfCores, NumberOfLogicalProcessorsmacOS (Terminal)
sysctl -n hw.physicalcpu hw.logicalcpuLinux (Terminal)
lscpu | grep -E '^CPU\(s\):|Core\(s\) per socket|Thread\(s\) per core'The Rule of Thumb: For your thread configuration later, always aim for the Number of Cores, not the logical processors. If you have a hybrid CPU (like newer Intel chips), aim for the number of Performance Cores.
🐳 Step 1: Launch the Ollama Container
We use a Docker volume to ensure that once you download a multi-gigabyte model, it stays on your disk even if the container is deleted.
docker run -d \
--name ollama-server \
-v ollama_storage:/root/.ollama \
-p 11434:11434 \
ollama/ollama📦 Step 2: Download the Model
Pull the GLM-OCR weights from the library. This usually requires about 2GB–4GB of space.
docker exec -it ollama-server ollama pull glm-ocr⚙️ Step 3: The Modelfile
Standard settings often fail for OCR because images require more "memory space"(context) than simple text. We will create a custom version of the model with optimized parameters.
Enter the container:
docker exec -it ollama-server bashCreate a Modelfile:
cat <<EOF > GLM-Config
FROM glm-ocr
# Hardware & Context Settings
PARAMETER num_ctx 16384
PARAMETER num_thread 6
# Your Specific Generation Parameters
PARAMETER num_predict 8192
PARAMETER temperature 0
PARAMETER top_p 0.00001
PARAMETER top_k 1
PARAMETER repeat_penalty 1.1
EOFHere it's important to point out that sampling parameters are typically specified at runtime (i.e., during model utilization), but given we're performing a very specific task, with a cutting-edge vision language model, sampling parameters are often hardcoded to obtain optimal results, and it’s not recommended to switch them.
Plus, maximum number of threads must follow the convention established in the previous section.
Deploy the updated model version:
ollama create glm-ocr-optimized -f GLM-Config
exit🚀 Step 4: Using the API
With the server running, you can now send images to the model. Images must be sent as Base64 encoded strings.
import requests
import base64
import ollama
import sys
import time
from io import BytesIO
from PIL import Image
# 1. Configuration
IMAGE_URL = "https://marketplace.canva.com/EAE92Pl9bfg/6/0/1131w/canva-black-and-gray-minimal-freelancer-invoice-wPpAXSlmfF4.jpg" # Replace with your invoice link
MODEL_NAME = "glm-ocr-optimized" # The model you created with specific parameters
MAX_DIMENSION = 1024 # Resize the longest edge to 1024px
def get_optimized_image_b64(url):
"""Downloads, resizes, and encodes the image."""
print(f"📥 Downloading image...")
response = requests.get(url)
img = Image.open(BytesIO(response.content))
# Calculate aspect ratio and resize
original_width, original_height = img.size
print(f"📐 Original Size: {original_width}x{original_height}")
# Only resize if the image is actually larger than our limit
if max(img.size) > MAX_DIMENSION:
img.thumbnail((MAX_DIMENSION, MAX_DIMENSION), Image.Resampling.LANCZOS)
print(f"🪄 Resized to: {img.width}x{img.height}")
else:
print("✅ Image is already small enough, skipping resize.")
# Convert to Base64
buffered = BytesIO()
img.save(buffered, format="JPEG", quality=85) # JPEG is lighter than PNG for OCR
return base64.b64encode(buffered.getvalue()).decode('utf-8')
def run_ocr():
try:
# Prepare the image
image_b64 = get_optimized_image_b64(IMAGE_URL)
print(f"🚀 Sending to Ollama (waiting for first token)...")
start_time = time.time()
first_token = True
# 2. Invoke Ollama with streaming
# The parameters (temp: 0, top_k: 1, etc.) are already in your Modelfile
stream = ollama.generate(
model=MODEL_NAME,
prompt="Text recognition:",
images=[image_b64],
stream=True
)
for chunk in stream:
if first_token:
print(f"⏱️ Time to first token: {time.time() - start_time:.2f}s\n")
first_token = False
print(chunk['response'], end='', flush=True)
print(f"\n\n✅ Total Processing Time: {time.time() - start_time:.2f}s")
except Exception as e:
print(f"\n❌ Error: {e}")
if __name__ == "__main__":
run_ocr()👀 Step 5: Inspecting results
When running via terminal the instruction docker stats while running the script, we will see something like:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
0ca7a28d6fe2 ollama-server 600.42% 4.359GiB / 7.607GiB 57.31% 79.4kB / 5.25kB 0B / 0B 41
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
0ca7a28d6fe2 ollama-server 598.91% 4.359GiB / 7.607GiB 57.31% 79.4kB / 5.25kB 0B / 0B 41
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
0ca7a28d6fe2 ollama-server 598.91% 4.359GiB / 7.607GiB 57.31% 79.4kB / 5.25kB 0B / 0B 41
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
0ca7a28d6fe2 ollama-server 601.90% 4.359GiB / 7.607GiB 57.31% 79.4kB / 5.25kB 0B / 0B 41Notice how the CPU usage is being maximized as the main task now that is generating the bottleneck is the vision encoder. It's not the decoding, which is not compute heavy, but the prefill stage. Indeed, by checking the logs:
📥 Downloading image...
📐 Original Size: 1131x1600
🪄 Resized to: 724x1024
🚀 Sending to Ollama (waiting for first token)...
⏱️ Time to first token: 174.96s
YOUR
LOGO
NO. 000001
INVOICE
Date: 02 June, 2030
Billed to:
Studio Shodwe
123 Anywhere St., Any City
hello@reallygreatsite.com
From:
Olivia Wilson
123 Anywhere St., Any City
hello@reallygreatsite.com
Item Quantity Price Amount
Logo 1 $500 $500
Banner (2x6m) 2 $45 $90
Poster (1x2m) 3 $55 $165
Total $755
Payment method: Cash
Note: Thank you for choosing us!
✅ Total Processing Time: 183.39sIt is noticeable that vision encoder and complete prefill took almost 3 minutes, and generation only 9 seconds. That is indeed where GPUs and specific hardware devices shine by applying vectorized operations, in the encoding; i.e., compute-heavy section of the pipeline.
So far results look pretty strong but this is just a toy example. By leveraging the layout detector, we will be able to adapt to a much broader range of situations, like the next one.
🧩 Step 6: GLM-OCR SDK
Together with the vision language model, Z.ai team provided a comprehensive client SDK that includes the safetensors version of the amazing PPDocLayoutV3 from PaddlePaddle, that in particular adapts to polygonal regions and edge cases.
We will slightly adapt our code to an image taken form the first two pages of Qwen3 Technical Report:
import requests
from PIL import Image
from glmocr import GlmOcr
# --- Configuration ---
LOCAL_FILENAME = "7cf7af6c-0581-4fdc-a20f-7123aab8c0a2_3308x2339.jpg"
def run_sdk_ocr(image_path):
# Optional: Resize to speed up CPU inference (1024px is the sweet spot)
with Image.open(image_path) as img:
if max(img.size) > 1024:
img.thumbnail((1024, 1024), Image.Resampling.LANCZOS)
img.save(image_path)
print(f"🪄 Resized {image_path} for faster CPU processing.")
print(f"🚀 Initializing GLM-OCR SDK...")
# Initialize the SDK in self-hosted mode
with GlmOcr(config_path='./config.yaml') as parser:
print("🔍 Analyzing document structure...")
result = parser.parse(image_path)
# Output the Markdown result
print("\n" + "="*20 + " OCR RESULT " + "="*20)
print(result.markdown_result)
print("="*52)
if __name__ == "__main__":
try:
run_sdk_ocr(LOCAL_FILENAME)
except Exception as e:
print(f"❌ Error: {e}")It's critical that you select the config.yaml file we provide alongside the article, as it will not be possible to run the example otherwise. It has been specifically tuned for it.
Let's check the results:
Starting Pipeline...
🪄 Resized 7cf7af6c-0581-4fdc-a20f-7123aab8c0a2_3308x2339.jpg for faster CPU processing.
🚀 Initializing GLM-OCR SDK...
Pipeline started!
GLM-OCR initialized in self-hosted mode
🔍 Analyzing document structure...
Stopping Pipeline...
Pipeline stopped!
==================== OCR RESULT ====================
Qwen3 Technical Report
Qwen Team
https://huggingface.co/qwen
https://modelscope.cn/organization/qwen
https://github.com/qwenLM/qwen3
Abstract
In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 253 billion. A key innovation in Qwen3 is the integration of thinking mode for complex, multi-step reasoning and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models, such as chat optimized models (e.g., GPT-3) and dedicated reasoning models (e.g., QX-32B)—and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their high competitive performance. Empirical evaluations demonstrate that Qwen3 achieves stated-theory results across benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive signal larger MoE models and proprietary models. Compared to its predecessor Qwen-2.5, Qwen3 expands multilingual support from 29 to 11 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0.
Table 7: Comparison among Qwen3-4B-Base and other strong open-source baselines. The highest and second-best scores are shown in bold and underlined, respectively.
| Architecture | Gemma-3-4B Base | Gemma-2.5-3B Base | Gemma-2.5-7B Base | Gemma-3-4B Base |
| :--- | :--- | :--- | :--- | :--- |
| # Total Params | 4B | 3B | 3B | 4B |
| # Activated Params | 4B | 3B | 3B | 4B |
General Tasks
| MMLU | 59.41 | 65.62 | 74.16 | 72.99 |
| :--- | :--- | :--- | :--- | :--- |
| MMLU-Redux | 56.91 | 63.68 | 71.08 | 72.79 |
| MMLU-Pro | 29.23 | 34.61 | 45.00 | 50.58 |
| SuperGQA | 17.98 | 20.31 | 26.33 | 28.43 |
| BBH | 17.87 | 16.24 | 16.30 | 17.29 |
GPQA | 24.24 | 26.26 | 36.36 | 36.87 |
| :--- | :--- | :--- | :--- | :--- |
| GSMK | 43.97 | 79.08 | 85.36 | 87.79 |
| MATH | 26.10 | 42.04 | 49.80 | 52.18 |
Coding Tasks
| EvalPlus | 43.23 | 46.28 | 62.18 | 63.53 |
| :--- | :--- | :--- | :--- | :--- |
| MultiPL-E | 28.06 | 39.65 | 50.72 | 53.13 |
| MMPT | 46.40 | 34.60 | 43.83 | 51.00 |
| CRUX-O | 34.00 | 36.50 | 48.50 | 55.00 |
Multilingual Tasks
| MGSM | 33.11 | 47.53 | 63.60 | 67.74 |
| :--- | :--- | :--- | :--- | :--- |
| MMLU-Pro | 59.62 | 65.55 | 73.31 | 71.42 |
| INCLUDE | 49.06 | 45.90 | 53.98 | 56.29 |
Table 8: Comparison among Qwen3-1.7B-Base, Qwen3-0.68-Base, and other strong open-source base-lines. The highest and second-best scores are shown in bold and underlined, respectively.
| Qwen2.5-0.58 Base | Qwen3-0.68 Base | Qwen2.5-1.58 Base | Qwen3-1.78 Base |
| :--- | :--- | :--- | :--- |
| Architecture | Gemma-3-4B Base | Gemma-2.5-3B Base | Gemma-2.5-7B Base | Gemma-3-4B Base |
| # Total Params | 0.58 | 0.68 | 1B | 1.78 |
| # Activated Params | 0.58 | 0.68 | 1B | 1.78 |
General Tasks
| MMLU | 52.11 | 26.26 | 60.90 | 62.63 |
| :--- | :--- | :--- | :--- | :--- |
| MMLU-Redux | 51.26 | 25.99 | 58.46 | 61.66 |
| MMLU-Pro | 24.74 | 9.72 | 28.53 | 36.76 |
| SuperGQA | 11.30 | 10.01 | 14.54 | 20.92 |
| BBH | 20.30 | 41.47 | 28.13 | 54.47 |
Coding Tasks
| EvalPlus | 32.77 | 24.75 | 24.24 | 28.28 |
| :--- | :--- | :--- | :--- | :--- |
| MultiPL-E | 24.75 | 24.75 | 62.54 | 72.44 |
| MMPT | 19.48 | 32.44 | 36.66 | 43.50 |
| CRUX-O | 12.10 | 27.00 | 3.80 | 36.40 |
Multilingual Tasks
| MGSM | 30.99 | 7.74 | 32.82 | 50.71 |
| :--- | :--- | :--- | :--- | :--- |
| MMLU-Pro | 31.53 | 60.16 | 60.27 | 63.27 |
| INCLUDE | 24.74 | 34.26 | 25.62 | 45.57 |
====================================================You can now observe how the model has correctly gathered all the elements in the image and conveniently parsed them in the final markdown result.
Next Steps
This Sunday we're hosting our 8th and final Office Hours!
We'll cover two main topics:
Last week's session → Multimodal Finetuning
This week's session → LLM Deployment
In this lab we focused on local deployments, but during the office hours we'll go a step further — we'll walk you through the key parameters you can configure on Hugging Face Endpoints (with vLLM) to make your inference truly efficient.
See you there!








🔥
The focused architecture beating larger models holds in my experience too. Running a 2B model routed appropriately beats a 35B doing everything, and you still get the 35B available for genuinely hard tasks.
The Ollama plus llama.cpp combo is what I landed on as well. Curious how GLM-OCR handles multi-column PDFs versus single-column. Everything I've tested degrades on dense tables regardless of model size, and I haven't found a good local solution for that specific layout yet. Does the Docker setup add meaningful latency versus direct llama.cpp execution?