🐳DeepSeek-OCR-2: How to Run & Fine-tune

Guide on how to run and fine-tune DeepSeek-OCR-2 locally.

DeepSeek-OCR 2 is the new 3B-parameter model for SOTA vision and document understanding released on Jan 27, 2026 by DeepSeek. The model focuses on image-to-text with stronger visual reasoning, not just text extraction.

DeepSeek-OCR 2 introduces DeepEncoder V2, which enables the model to 'see' an image in the same logical order as a human.

Unlike traditional vision LLMs that scan images in a fixed grid (top-left → bottom-right), DeepEncoder V2 builds a global understanding first, then learns a human-like reading order—what to attend to first, next, and so on. This boosts OCR on complex layouts by better following columns, linking labels to values, reading tables coherently, and handling mixed text + structure.

You can now fine-tune DeepSeek-OCR 2 in Unsloth via our free fine-tuning notebook. We demonstrated an 88.6% improvement for language understanding.


🖥️ Running DeepSeek-OCR 2

To run the model in vLLM, Transformers, or Unsloth, DeepSeek recommends these settings (sketched in code after the list):

  • Temperature = 0.0

  • max_tokens = 8192

  • ngram_size = 30

  • window_size = 90
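
As a minimal sketch, these map to a generation config like the one below. Temperature and max_tokens are standard, while ngram_size and window_size are DeepSeek-specific repetition controls, so the exact parameter names and how they are passed depend on the serving stack (vLLM, Transformers, or Unsloth) and are assumptions here:

```python
# Recommended decoding settings from DeepSeek (values from the list above).
generation_settings = {
    "temperature": 0.0,   # greedy decoding for deterministic OCR output
    "max_tokens": 8192,   # generous output budget for long documents
    # n-gram repetition controls specific to the DeepSeek-OCR family;
    # how these are wired in (logits processor, extra args, ...) depends on your stack.
    "ngram_size": 30,
    "window_size": 90,
}
```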

Supported modes (dynamic resolution; a token-count check follows the list):

  • Default: (0-6)×768×768 + 1×1024×1024 — (0-6)×144 + 256 visual tokens
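
As a quick sanity check on the numbers above, the visual-token count follows directly from the tile counts (each 768×768 tile costs 144 tokens and the single 1024×1024 view costs 256):

```python
# Visual-token budget for the default dynamic-resolution mode.
for n_tiles in range(0, 7):              # 0-6 tiles, chosen per input image
    visual_tokens = n_tiles * 144 + 256
    print(n_tiles, visual_tokens)        # 256 tokens with 0 tiles, up to 1,120 with 6
```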

Prompt examples:

Turns any document into markdown using Visual Causal Flow.
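
The exact prompt strings are published with the model. As a sketch, the first DeepSeek-OCR used prompts of the shape below, and DeepSeek-OCR 2's wording may differ, so treat these strings as assumptions and verify them against the model card:

```python
# Prompt shapes based on the original DeepSeek-OCR (assumed, not confirmed for OCR 2).
prompt_markdown = "<image>\n<|grounding|>Convert the document to markdown."
prompt_free_ocr = "<image>\nFree OCR."
```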

📖 vLLM: Run DeepSeek-OCR 2 Tutorial

  1. Obtain the latest vLLM via:
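
A standard upgrade works in a pip-based environment: `pip install --upgrade vllm` (DeepSeek may also point to a specific nightly build, so check their repository).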

  2. Then run the following code:
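
Below is a minimal offline-inference sketch using vLLM's Python API. The model ID, prompt string, and image handling are assumptions, and DeepSeek's own script may additionally wire up the ngram_size/window_size repetition controls through a dedicated logits processor, so treat this as a starting point rather than the official example:

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Assumed Hugging Face model ID; replace with the official DeepSeek-OCR 2 repo name.
llm = LLM(model="deepseek-ai/DeepSeek-OCR-2", trust_remote_code=True)

sampling_params = SamplingParams(
    temperature=0.0,   # deterministic decoding, as recommended above
    max_tokens=8192,   # long output budget for full documents
)

image = Image.open("document.png").convert("RGB")
prompt = "<image>\n<|grounding|>Convert the document to markdown."  # assumed prompt shape

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params,
)
print(outputs[0].outputs[0].text)
```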

🤗Transformers: Run DeepSeek-OCR 2 Tutorial

Inference uses Hugging Face Transformers on NVIDIA GPUs (requirements tested on Python 3.12.9 + CUDA 11.8).
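
A minimal sketch follows, assuming DeepSeek-OCR 2 keeps the trust_remote_code interface of the first DeepSeek-OCR (an AutoModel exposing a custom `infer` helper); the model ID, prompt, and `infer` arguments are assumptions, so check the model card for the exact API:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR-2"  # assumed Hugging Face model ID

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
model = model.eval().cuda()

# The first DeepSeek-OCR ships an `infer` helper via remote code; the arguments
# below mirror that interface and are assumptions for DeepSeek-OCR 2.
result = model.infer(
    tokenizer,
    prompt="<image>\n<|grounding|>Convert the document to markdown.",
    image_file="document.png",
    output_path="./output",
)
print(result)
```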

🦥 Unsloth: Run DeepSeek-OCR 2 Tutorial

  1. Obtain the latest Unsloth via `pip install --upgrade unsloth`. If you already have Unsloth, update it via `pip install --upgrade --force-reinstall --no-deps --no-cache-dir unsloth unsloth_zoo`

  2. Then use the code below to run DeepSeek-OCR 2:
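
A minimal sketch using Unsloth's FastVisionModel loader follows; the `unsloth/DeepSeek-OCR-2` model name and the inference call are assumptions here (our notebooks contain the exact, tested code):

```python
from unsloth import FastVisionModel

# Assumed Unsloth model name; see the notebook for the exact repo ID.
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/DeepSeek-OCR-2",
    load_in_4bit=False,   # OCR quality can be sensitive to quantization
)
FastVisionModel.for_inference(model)  # switch the model to inference mode

# The DeepSeek-OCR family exposes a custom `infer` helper via remote code;
# this call mirrors the first model's interface and is an assumption for OCR 2.
result = model.infer(
    tokenizer,
    prompt="<image>\n<|grounding|>Convert the document to markdown.",
    image_file="document.png",
    output_path="./output",
)
print(result)
```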

🦥 Fine-tuning DeepSeek-OCR 2

Unsloth now supports fine-tuning of DeepSeek-OCR 2. As with the first model, you'll need custom handling for it to work with Transformers. Also like the first model, Unsloth trains DeepSeek-OCR 2 1.4x faster with 40% less VRAM and supports 5x longer context lengths with no accuracy degradation. You can now fine-tune DeepSeek-OCR 2 via our free Colab notebook.
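
A minimal LoRA fine-tuning sketch with Unsloth is shown below; the model name, LoRA hyperparameters, and dataset wiring are placeholders, and the free Colab notebook contains the full, tested recipe:

```python
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/DeepSeek-OCR-2",   # assumed model name; see the notebook
    load_in_4bit=False,
)

# Attach LoRA adapters to both the vision and language stacks.
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    r=16,              # LoRA rank (placeholder value)
    lora_alpha=16,
    lora_dropout=0.0,
)

FastVisionModel.for_training(model)
# From here, pair the model with an OCR dataset (image + ground-truth text pairs)
# and a trainer such as TRL's SFTTrainer with Unsloth's vision data collator,
# as done in the Colab notebook.
```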

See below for CER (Character Error Rate) accuracy improvements on the Persian language:

Per-sample CER (10 samples; lower is better):

| idx | OCR1 before | OCR1 after | OCR2 before | OCR2 after |
| --- | --- | --- | --- | --- |
| 1520 | 1.0000 | 0.8000 | 10.4000 | 1.0000 |
| 1521 | 0.0000 | 0.0000 | 2.6809 | 0.0213 |
| 1522 | 2.0833 | 0.5833 | 4.4167 | 1.0000 |
| 1523 | 0.2258 | 0.0645 | 0.8710 | 0.0968 |
| 1524 | 0.0882 | 0.1176 | 2.7647 | 0.0882 |
| 1525 | 0.1111 | 0.1111 | 0.9444 | 0.2222 |
| 1526 | 2.8571 | 0.8571 | 4.2857 | 0.7143 |
| 1527 | 3.5000 | 1.5000 | 13.2500 | 1.0000 |
| 1528 | 2.7500 | 1.5000 | 1.0000 | 1.0000 |
| 1529 | 2.2500 | 0.8750 | 1.2500 | 0.8750 |

Average CER (10 samples)

  • OCR1: before 1.4866, after 0.6409 (-57%)

  • OCR2: before 4.1863, after 0.6018 (-86%)
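
For context, CER is the character-level edit distance between the prediction and the reference divided by the reference length, so values above 1.0 simply mean the model emitted far more wrong characters than the reference contains. A minimal sketch of the metric (standard Levenshtein distance, not the exact evaluation script from the notebook):

```python
def cer(prediction: str, reference: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(reference) + 1))
    for i, p in enumerate(prediction, start=1):
        curr = [i]
        for j, r in enumerate(reference, start=1):
            curr.append(min(
                prev[j] + 1,              # skip a prediction character
                curr[j - 1] + 1,          # skip a reference character
                prev[j - 1] + (p != r),   # match or substitute
            ))
        prev = curr
    return prev[-1] / max(len(reference), 1)
```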

📊 Benchmarks

Benchmarks for the DeepSeek-OCR 2 model are derived from the official research paper.

Table 1: Comprehensive evaluation of document reading on OmniDocBench v1.5. V-token_max represents the maximum number of visual tokens used per page in this benchmark. R-order denotes reading order. Except for DeepSeek-OCR and DeepSeek-OCR 2, all other model results in this table are sourced from the OmniDocBench repository.

Table 2: Edit distances for different categories of document elements in OmniDocBench v1.5. V-token_max denotes the lowest maximum number of visual tokens.

DeepSeek-OCR 2 outperforms Gemini-3 Pro on OmniDocBench.
