SmolDocling: Hugging Face’s Tiny OCR & Document Understanding Model

Toni Ramchandani
Published in Data And Beyond
7 min read · Mar 24, 2025

In a world obsessed with scaling LLMs to 70 billion parameters, Hugging Face did something wild — they went small. Really small. Enter SmolDocling, a 256M-parameter model that just might punch above its weight in OCR and document understanding.

You’ve heard of AI models reading PDFs. But SmolDocling? It understands them. It doesn’t just regurgitate text: it identifies charts, code blocks, tables, even logos, and spits everything back in DocTags markup with position and layout preserved.

So yeah, it’s tiny. But it’s also brilliant. Let’s get into it.

📦 What Exactly Is SmolDocling?

Developed by Hugging Face in collaboration with IBM, SmolDocling is part of their “smol” family, a line of tiny vision-language models aimed at high efficiency and real-world utility. It brings document parsing, layout analysis, and structured content extraction into a compact package.

  • 🧠 Model Size: 256 million parameters
  • 🔍 Vision Encoder: SigLIP (93M)
  • ✍️ Language Model: 135M
  • 🧬 Architecture: Two-tower structure + projection layers
  • 🔄 Input: Document pages as images (scans, rendered PDF pages, screenshots)
  • 💬 Output: Structured DocTags markup: headings, paragraphs, tables, code, images, logos

Unlike generic OCR tools, SmolDocling doesn’t just return raw text — it returns context, layout, and structure.
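Here’s roughly what that looks like in practice. This is a minimal sketch, assuming the publicly released ds4sd/SmolDocling-256M-preview checkpoint on the Hugging Face Hub, the transformers library, and the model card’s “Convert this page to docling.” prompt; treat the file name page.png and the generation settings as placeholders.

```python
# Minimal sketch: run SmolDocling on one page image and print the DocTags output.
# Assumes the ds4sd/SmolDocling-256M-preview checkpoint and the transformers library.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "ds4sd/SmolDocling-256M-preview"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID)

image = Image.open("page.png")  # placeholder: any document page rendered as an image

# Chat-style prompt: one image placeholder plus the conversion instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

# Autoregressively decode the structured DocTags markup.
generated = model.generate(**inputs, max_new_tokens=4096)
doctags = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1] :], skip_special_tokens=False
)[0]
print(doctags)  # tags for headings, paragraphs, tables, code blocks, pictures, ...
```

If you want Markdown or HTML instead of raw tags, the companion docling-core package can load the DocTags output into a DoclingDocument and export it from there.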

🔬 Architecture Breakdown (Nerd Mode ON)

SmolDocling isn’t just a wrapper around a language model. It’s a finely balanced, lightweight Vision-Language Model (VLM).

Components:

  • SigLIP Vision Tower: Responsible for encoding document visuals, layout, and spatial relationships.
  • Text Tower (LM): Processes and predicts the document’s semantic structure.
  • Projection Layers: Merge visual and language tokens into a single sequence for autoregressive decoding.

Its training recipe involves DocTags-annotated documents: think HTML/Markdown meets layout-aware tagging.
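To make that flow concrete, here is a toy sketch of the idea: encode image patches, project them into the language model’s embedding space, concatenate with the text tokens, and predict the next DocTags token. The dimensions and the plain transformer blocks are illustrative only, not SmolDocling’s actual implementation; causal masking and image preprocessing are omitted for brevity.

```python
# Toy sketch of the two-tower + projection idea (illustrative, not the real model).
import torch
import torch.nn as nn


class TinyDocVLM(nn.Module):
    """Toy VLM: vision tower + projection layer + language model head."""

    def __init__(self, vision_dim=768, lm_dim=576, vocab_size=50_000):
        super().__init__()
        # Stand-in for the SigLIP vision tower: encodes patch embeddings.
        self.vision_tower = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(vision_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Projection layer: maps visual tokens into the LM's embedding space.
        self.projection = nn.Linear(vision_dim, lm_dim)
        # Stand-in for the 135M text tower (causal masking omitted for brevity).
        self.text_embed = nn.Embedding(vocab_size, lm_dim)
        self.text_tower = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(lm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(lm_dim, vocab_size)

    def forward(self, patch_embeds, input_ids):
        vis = self.vision_tower(patch_embeds)      # (B, num_patches, vision_dim)
        vis = self.projection(vis)                 # (B, num_patches, lm_dim)
        txt = self.text_embed(input_ids)           # (B, num_text_tokens, lm_dim)
        seq = torch.cat([vis, txt], dim=1)         # single merged token sequence
        return self.lm_head(self.text_tower(seq))  # logits over the tag vocabulary


model = TinyDocVLM()
patches = torch.randn(1, 64, 768)           # 64 fake image patches
tokens = torch.randint(0, 50_000, (1, 16))  # 16 fake text tokens
print(model(patches, tokens).shape)         # torch.Size([1, 80, 50000])
```

In the real model the merged sequence is decoded autoregressively, so the layout information carried by the visual tokens conditions every generated DocTags token.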

🧪 Real-World Testing: What Can It Actually Do?

The creator threw some serious real-world chaos at SmolDocling. Here’s how it fared:
