SmolDocling — Hugging Face’s Tiny OCR & Document Understanding Model
In a world obsessed with scaling LLMs to 70 billion parameters, Hugging Face did something wild — they went small. Really small. Enter SmolDocling, a 256M parameter model that just might punch above its weight in the space of OCR and document understanding.
You’ve heard of AI models reading PDFs. But SmolDocling? It understands them. It doesn’t just regurgitate text: it identifies charts, code blocks, tables, even logos, and spits them back in DocTags format with position and layout preserved.
So yeah, it’s tiny. But it’s also brilliant. Let’s get into it.
📦 What Exactly Is SmolDocling?
Developed by Hugging Face in collaboration with IBM, SmolDocling is part of their “smol” family — a line of tiny vision-language models aimed at high-efficiency, real-world utility. It brings document parsing, layout analysis, and structured content extraction into a compact package.
- 🧠 Model Size: 256 million parameters
- 🔍 Vision Encoder: SigLIP (93M)
- ✍️ Language Model: 135M
- 🧬 Architecture: Two-tower structure + projection layers
- 🔄 Input: Document page images (render PDFs to images first)
- 💬 Output: DocTags markup: headings, paragraphs, tables, code, images, logos
Unlike generic OCR tools, SmolDocling doesn’t just return raw text — it returns context, layout, and structure.
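To make that concrete, here’s a schematic of what DocTags output looks like. The tags and coordinates below are illustrative, not verbatim model output, so check the DocTags spec for the exact vocabulary:

```xml
<doctag>
  <section_header_level_1><loc_42><loc_36><loc_410><loc_60>Q3 Results</section_header_level_1>
  <text><loc_42><loc_72><loc_458><loc_118>Revenue grew 12% year over year, driven by...</text>
  <picture><loc_250><loc_130><loc_458><loc_300></picture>
</doctag>
```

Each element carries location tokens for its bounding box, which is how position and layout survive the round trip.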
🔬 Architecture Breakdown (Nerd Mode ON)
SmolDocling isn’t just a wrapper around a language model. It’s a finely balanced, lightweight Vision-Language Model (VLM).
Components:
- SigLIP Vision Tower: Responsible for encoding document visuals, layout, and spatial relationships.
- Text Tower (LM): Processes and predicts the document’s semantic structure.
- Projection Layers: Merge visual and language tokens into a single sequence for autoregressive decoding.
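For intuition, here’s a minimal PyTorch-style sketch of that flow. Everything in it is an assumption for illustration — the module sizes, the names, and the plain encoder standing in for a causal LM — not the actual SmolDocling implementation:

```python
import torch
import torch.nn as nn

class TinyDocVLM(nn.Module):
    """Toy two-tower VLM: vision encoder -> projection -> LM. Sizes are made up."""

    def __init__(self, vision_dim=768, lm_dim=576, vocab_size=50_000):
        super().__init__()
        # Stand-in for the SigLIP vision tower (encodes image patches)
        self.vision_tower = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(vision_dim, nhead=12, batch_first=True),
            num_layers=2,
        )
        # Projection layer: maps visual features into the LM's token space
        self.projector = nn.Linear(vision_dim, lm_dim)
        # Stand-in for the 135M language model (the real one is causal/decoder-style)
        self.lm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(lm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.embed = nn.Embedding(vocab_size, lm_dim)
        self.head = nn.Linear(lm_dim, vocab_size)

    def forward(self, image_patches, input_ids):
        vis = self.projector(self.vision_tower(image_patches))  # (B, P, lm_dim)
        txt = self.embed(input_ids)                             # (B, T, lm_dim)
        fused = torch.cat([vis, txt], dim=1)                    # one merged sequence
        return self.head(self.lm(fused))                        # next-token logits

# Smoke test with dummy shapes
model = TinyDocVLM()
patches = torch.randn(1, 64, 768)         # 64 fake image patches
ids = torch.randint(0, 50_000, (1, 16))   # 16 fake text tokens
print(model(patches, ids).shape)          # torch.Size([1, 80, 50000])
```

The key move is that single `torch.cat`: once visual tokens are projected into the LM’s embedding space, the decoder treats them like any other tokens and generates DocTags autoregressively.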
Its training recipe involves documents annotated in DocTags, a format that reads like HTML/Markdown meets layout-aware tagging.
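Before we get to the stress tests, here’s a minimal inference sketch using 🤗 Transformers. The checkpoint ID and the “Convert this page to docling.” prompt follow the preview model card as I understand it, so verify both on the Hub before running:

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "ds4sd/SmolDocling-256M-preview"  # preview checkpoint on the Hub

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID)  # 256M runs fine on CPU

image = Image.open("page.png")  # any document page rendered as an image

# Chat-style prompt; the model card uses "Convert this page to docling."
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this page to docling."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=8192)
# Keep special tokens: the DocTags structure *is* the output
doctags = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=False
)[0]
print(doctags)
```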
🧪 Real-World Testing: What Can It Actually Do?
The creator threw some serious real-world chaos at SmolDocling. Here’s how it fared: