DeepSeek Unveils DeepSeek-OCR 2 with Groundbreaking Visual Causal Flow Architecture


Today, AI startup DeepSeek released its new paper titled “DeepSeek-OCR 2: Visual Causal Flow” and open-sourced the DeepSeek-OCR 2 model. The system introduces an innovative DeepEncoder V2 architecture that enables AI to dynamically reorder image segments based on semantic meaning—mimicking how humans naturally perceive and interpret visual information.

Unlike conventional vision-language models (VLMs), which process images using a fixed raster-scan order (left-to-right, top-to-bottom), DeepSeek-OCR 2 adopts a causally aware visual encoding strategy. Inspired by human eye-tracking behavior—where gaze patterns follow logical, content-driven “causal flows”—the model uses learnable causal flow queries to intelligently resequence visual tokens before they are passed to the language decoder. This creates a two-stage, cascaded 1D causal reasoning pipeline: first, the encoder reorganizes visual information according to semantic logic; then, the decoder performs autoregressive reasoning on this structured sequence.
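To make the idea concrete, the following is a minimal, hypothetical sketch of what “learnable causal flow queries” could look like: a set of learnable query vectors cross-attends to the raster-ordered patch tokens and emits a re-sequenced token stream for the decoder. The module name, dimensions, and query count below are illustrative assumptions, not DeepSeek’s actual implementation.

```python
# Hypothetical sketch: learnable "causal flow" queries re-sequence raster-ordered
# visual tokens before they reach the language decoder.
import torch
import torch.nn as nn


class CausalFlowReorder(nn.Module):
    def __init__(self, d_model: int = 512, n_queries: int = 256, n_heads: int = 8):
        super().__init__()
        # One learnable query per output position in the inferred reading order.
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, n_patches, d_model), in fixed raster-scan order.
        batch = patch_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # Each query gathers the content that belongs at its slot in the
        # semantically ordered sequence.
        reordered, _ = self.attn(q, patch_tokens, patch_tokens)
        return self.norm(reordered)  # (batch, n_queries, d_model) -> decoder


if __name__ == "__main__":
    encoder_out = torch.randn(2, 1120, 512)          # raster-ordered visual tokens
    flow = CausalFlowReorder(d_model=512, n_queries=256)
    print(flow(encoder_out).shape)                    # torch.Size([2, 256, 512])
```

In this reading, the first 1D causal stage is the reordering itself, and the second is the decoder’s ordinary autoregressive pass over the reordered tokens.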

This approach is especially effective for complex document understanding tasks involving non-linear layouts—such as tables, multi-column text, mathematical formulas, or spiral diagrams—where spatial proximity does not reflect reading order.

According to DeepSeek’s technical report, DeepSeek-OCR 2 achieves 91.09% accuracy on the OmniDocBench v1.5 benchmark, a 3.73 percentage point improvement over its predecessor. Notably, the model maintains high efficiency: it limits visual token counts to between 256 and 1,120, aligning with the computational constraints of leading models like Google’s Gemini 3 Pro. In real-world deployment, it reduced duplication rates by 2.08% in online user logs and 0.81% in PDF pretraining data, demonstrating strong production readiness.

Beyond OCR performance, DeepEncoder V2 represents a significant architectural exploration. By leveraging a language-model-like backbone for visual encoding, it inherits decades of infrastructure advances from the LLM ecosystem—including Mixture-of-Experts (MoE) scaling and optimized attention mechanisms. DeepSeek suggests this paves the way toward a unified multimodal encoder: a single model capable of processing images, audio, and text through modality-specific learnable queries within a shared parameter space.
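The “modality-specific learnable queries within a shared parameter space” idea can be sketched as one shared Transformer backbone with a separate query set per modality. Everything below is a speculative illustration under that assumption; the class and parameter names are invented for clarity.

```python
# Speculative sketch of a unified multimodal encoder: one shared backbone,
# per-modality learnable query sets selecting how each input is summarized.
import torch
import torch.nn as nn


class UnifiedEncoder(nn.Module):
    def __init__(self, d_model: int = 512, n_queries: int = 128):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)  # shared weights
        # Modality-specific learnable queries living in the shared parameter space.
        self.queries = nn.ParameterDict({
            m: nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
            for m in ("image", "audio", "text")
        })

    def forward(self, features: torch.Tensor, modality: str) -> torch.Tensor:
        # features: (batch, seq_len, d_model), pre-embedded for any modality.
        q = self.queries[modality].unsqueeze(0).expand(features.size(0), -1, -1)
        # Prepend the modality's queries; the shared backbone processes both jointly.
        out = self.backbone(torch.cat([q, features], dim=1))
        return out[:, : q.size(1)]  # the query slots serve as the modality summary


if __name__ == "__main__":
    enc = UnifiedEncoder()
    img_feats = torch.randn(2, 300, 512)
    print(enc(img_feats, "image").shape)  # torch.Size([2, 128, 512])
```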

The team posits that decomposing 2D visual understanding into two complementary 1D causal reasoning stages—“reading-order inference” followed by “task-specific reasoning”—may offer a scalable path toward true 2D reasoning in foundation models.

Moonshot AI Launches Open-Source Multimodal Model Kimi K2.5

In related news, Chinese AI company Moonshot AI (Yue Zhi An Mian) today launched Kimi K2.5, its latest open-source model built on a native multimodal architecture. Kimi K2.5 natively supports both visual and textual inputs and integrates capabilities across visual understanding, reasoning, coding, and autonomous agent functions into a single unified model.

Alibaba’s Qwen3-Max-Thinking Sets New Global Benchmark

Meanwhile, Alibaba Cloud announced yesterday evening (January 26) the release of Qwen3-Max-Thinking, its flagship reasoning model in the Qwen series. In multiple key benchmarks, Qwen3-Max-Thinking surpassed leading models including GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro, setting new global records and significantly advancing the frontier of AI reasoning capabilities.

Together, these releases underscore China’s accelerating innovation in foundational AI models—spanning optical character recognition, multimodal integration, and advanced reasoning—highlighting a growing shift toward architectures that better emulate human cognition.

