i recently integrated 4 ocr models into fiftyone as remote zoo models. they handle text extraction and document parsing.
since they're all remote zoo sources, you can get started with a few lines of code.
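here's roughly what "a few lines" looks like. this is a minimal sketch, not copied from any of the repos: `apply_remote_ocr`, `source_url`, `model_name`, and the `ocr_text` field name are placeholders i made up for illustration. grab the real repo url and model name from each model's readme.

```python
def apply_remote_ocr(dataset, source_url, model_name, label_field="ocr_text"):
    """Register a remote zoo source, load one model from it, run it on a dataset.

    source_url and model_name are hypothetical placeholders: take the real
    values from the README of the model repo you want to use.
    """
    # imported lazily so this snippet can be read/imported without fiftyone installed
    import fiftyone.zoo as foz

    # point the model zoo at the remote (GitHub) source
    foz.register_zoo_model_source(source_url, overwrite=True)

    # download the model if needed and instantiate it
    model = foz.load_zoo_model(model_name)

    # run inference on every sample and store the output in label_field
    dataset.apply_model(model, label_field=label_field)
    return dataset


# usage (assumes a folder of page images):
#   import fiftyone as fo
#   dataset = fo.Dataset.from_images_dir("/path/to/page_images")
#   apply_remote_ocr(dataset, "<repo url from the readme>", "<model name from the readme>")
```

some of these models take extra options (mode, prompt, resolution), so check each readme before running.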
different approaches for different needs:
mineru-2.5
1.2b params, two-stage strategy: global layout on downsampled image, then fine-grained recognition on native-resolution crops.
handles headers, footers, lists, code blocks. strong on complex math formulas (mixed chinese-english) and tables (rotated, borderless, partial-border).
good for: documents with complex layouts and mathematical content
github.com/harpreetsahota204…
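to make the two-stage idea concrete: layout boxes are found on a small copy of the page, then mapped back to native pixels so the crops keep full detail. a rough sketch of just that coordinate bookkeeping, assuming a made-up `max_side` of 1000 (not mineru's actual value or code):

```python
def downsample_size(native_size, max_side=1000):
    """Size of the low-res copy used for the global layout pass (max_side is an assumption)."""
    w, h = native_size
    longest = max(w, h)
    if longest <= max_side:
        return (w, h)
    return (round(w * max_side / longest), round(h * max_side / longest))


def to_native_box(box, small_size, native_size):
    """Map an (x0, y0, x1, y1) box from the downsampled page back to native pixels."""
    sx = native_size[0] / small_size[0]
    sy = native_size[1] / small_size[1]
    x0, y0, x1, y1 = box
    return (round(x0 * sx), round(y0 * sy), round(x1 * sx), round(y1 * sy))


native = (3000, 4000)                 # full-resolution page
small = downsample_size(native)       # stage 1 runs layout detection on this
layout_box = (100, 50, 300, 150)      # a region found on the small image
crop_box = to_native_box(layout_box, small, native)  # stage 2 crops this from the original
```

the point of the design: layout analysis is cheap on the small image, and recognition only pays the native-resolution cost on the regions that matter.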
deepseek-ocr
dual-encoder (sam + clip) for "contextual optical compression."
outputs structured markdown with bounding boxes. has five resolution modes (tiny/small/base/large/gundam). gundam mode is the default - uses multi-view processing (1024×1024 global + 640×640 patches for details).
supports custom prompts for specific extraction tasks.
good for: complex pdfs and multi-column layouts where you need structured output
github.com/harpreetsahota204…
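a rough way to picture the multi-view setup in gundam mode: one resized global view plus a grid of 640×640 patches. the sketch below uses a naive fixed grid. that grid is my assumption for illustration only; the model's real tiling logic may pick patches differently.

```python
import math


def plan_views(width, height, tile=640, global_side=1024):
    """Naive multi-view plan: one global view plus a fixed grid of tiles.

    Illustration only: this is not DeepSeek-OCR's actual tiling algorithm.
    """
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    tiles = []
    for r in range(rows):
        for c in range(cols):
            x0, y0 = c * tile, r * tile
            # clamp the last row/column to the image edge
            tiles.append((x0, y0, min(x0 + tile, width), min(y0 + tile, height)))
    return {"global": (global_side, global_side), "tiles": tiles}


views = plan_views(1280, 1920)
```

for a 1280×1920 page this gives a 2×3 grid, so six detail patches alongside the single global view.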
olmocr-2
built on qwen2.5-vl, 7b params. outputs markdown with yaml front matter containing metadata (language, rotation, table/diagram detection).
converts equations to latex, tables to html. labels figures with markdown syntax. keeps text in natural reading order.
good for: academic papers and technical documents with equations and structured data
github.com/harpreetsahota204…
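if you want the metadata separately from the text, you can split the front matter off the markdown. a minimal sketch with no yaml dependency; the key names in the sample are my assumption of what the output looks like, so check a real output before relying on them:

```python
def split_front_matter(text):
    """Split '---' delimited front matter from the markdown body.

    Returns (metadata dict of strings, body). Only handles flat 'key: value' lines.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    # find the closing '---'
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        return {}, text
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


# hypothetical sample output; key names are assumed, not taken from the model docs
sample = """---
primary_language: en
is_rotation_valid: True
rotation_correction: 0
is_table: False
is_diagram: False
---
# Title

E = mc^2
"""

meta, body = split_front_matter(sample)
```

values come back as strings here. for anything beyond flat key/value pairs, a real yaml parser is the safer choice.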
kosmos-2.5
microsoft's 1.37b param multimodal model. two modes: ocr (text with bounding boxes) or markdown generation. automatically optimizes hardware usage (bfloat16 for ampere+, float16 for older gpus, float32 for cpu). handles diverse document types including handwritten text.
good for: general-purpose ocr when you need either coordinates or clean markdown
github.com/harpreetsahota204…
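the hardware selection boils down to a small decision rule. a sketch of that rule as described above (not the integration's actual code), using the fact that ampere gpus report cuda compute capability 8.x:

```python
def pick_dtype(has_cuda, compute_capability=None):
    """Pick a dtype name: float32 on cpu, bfloat16 on ampere+ gpus, float16 on older gpus."""
    if not has_cuda:
        return "float32"
    # ampere and newer report a major compute capability of 8 or higher
    major = compute_capability[0] if compute_capability else 0
    return "bfloat16" if major >= 8 else "float16"


# with pytorch you would feed it something like:
#   import torch
#   has_cuda = torch.cuda.is_available()
#   cap = torch.cuda.get_device_capability() if has_cuda else None
#   dtype = getattr(torch, pick_dtype(has_cuda, cap))
```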
two modes are typical across these models: detection (text with bounding boxes) and extraction (text output only)
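one practical note for detection mode: fiftyone stores boxes as relative `[x, y, width, height]` in `[0, 1]`, measured from the top-left corner. if a model hands you pixel corners, the conversion is a one-liner. a small sketch, assuming `(x0, y0, x1, y1)` pixel input (the integrations may already do this for you):

```python
def to_relative_xywh(box, image_size):
    """Convert pixel (x0, y0, x1, y1) to relative [x, y, width, height] in [0, 1]."""
    x0, y0, x1, y1 = box
    w, h = image_size
    return [x0 / w, y0 / h, (x1 - x0) / w, (y1 - y0) / h]


rel = to_relative_xywh((100, 200, 300, 400), (1000, 800))
```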
i also built/revamped the caption viewer plugin for better text visualization in the app:
github.com/harpreetsahota204…
i've also got two events poppin off for document visual ai:
- nov 6 (tomorrow) with a stellar lineup of speakers (@mervenoyann @barrowjoseph @dineshredy):
voxel51.com/events/visual-do…
- a deep dive into document visual ai with just me:
voxel51.com/events/document-…