The Paper
"OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory" was published in April 2026 by Jinze Li, Yang Zhang, Xin Yang, Jiayi Qu, Jinfeng Xu, Shuo Yang, Junhua Ding, and Edith Cheuk-Han Ngai from the University of Hong Kong, University of North Texas, University of Tsukuba, and Yonsei University, accepted at ACL 2026 Main Conference. The paper argues that agent memory should shift from the text domain to the visual modality - rendering interaction trajectories as annotated images and retrieving evidence through a Locate-and-Transcribe mechanism that fetches verbatim text deterministically rather than generating it, achieving 100% retrieval faithfulness while cutting reasoning-context tokens by 6.7x compared to text-based RAG.
The Problem Before This Paper
Long-horizon agents generate extensive interaction histories - reasoning traces, tool invocations, environment feedback - that are critical for future reference but impossible to store verbatim under finite context windows. Existing approaches force a painful trade-off. Retrieval-based systems (MemGPT, MemoryBank, RAPTOR) store past interactions externally and fetch relevant fragments via semantic similarity, but similarity matching is brittle for tasks that depend on causality or long-range dependencies rather than topical overlap. Experience abstraction methods (AWM, ExpeL, DiLu) compress trajectories into reusable skills or procedural knowledge, but discard the low-level details - exact error messages, intermediate states, nuanced dialogue turns - that are essential for debugging, faithful retrospection, and grounded decision-making. Context compression approaches (ACON, LLMLingua, MemGen) reduce the text itself via latent representations or token pruning, but text-centric compression inevitably trades compression ratio against information fidelity, especially in multimodal settings where visual layouts and structural cues are lost under pure textual summarization.
What They Built
OCR-Memory stores agent trajectories as rendered images rather than raw text, leveraging the DeepSeek-OCR (3B) vision encoder to compress dense textual content into a small number of visual tokens - achieving over 10x compression while preserving full fidelity. Each trajectory chunk is rendered into a marked image with Set-of-Mark (SoM) visual anchors: red bounding boxes annotated with unique numerical IDs that highlight individual text segments. When a new query arrives, the retrieval module scans these visual representations and outputs a binary relevance vector - predicting which segment indices are relevant - rather than generating free-form text. The corresponding original text is then deterministically fetched from an external log, completely eliminating generation-based hallucination. To handle growing history, OCR-Memory implements a multi-resolution aging policy: the five most recent interaction steps are stored at 1024x1024 (256 visual tokens), while all older history is downsampled to 512x512 (64 visual tokens). When a low-resolution memory is retrieved as relevant, an Active Recall mechanism upscales it back to high fidelity on demand - mimicking the vivid-to-fuzzy decay of human memory while preserving the ability to recover full detail when needed.
// Visual Encoding (DeepSeek-OCR):
Z = f_enc(I) ∈ R^{n(r) × d_latent}
n(r) ∈ {64, 100, 256, 400} // compressed-token budgets
// Segment Relevance Probability:
p_{i,k}(q) = exp(z_{i,k}(1)) / (exp(z_{i,k}(1)) + exp(z_{i,k}(0)))
// Adaptive Resolution Aging:
l_i = ρ(Δt_i),  I_i = φ_{l_i}(I_i^hi)
// Active Recall Upscaling (when retrieved):
if ∃(i,k) ∈ Ŝ(q) s.t. l_i > l_min:  I_i ← I_i^hi
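To make the Locate-and-Transcribe step concrete, here is a minimal sketch in Python. It assumes a stand-in relevance_logits function in place of the fine-tuned DeepSeek-OCR retriever; every name below is hypothetical rather than the paper's actual API. The key property is that the model only predicts segment indices, and the returned evidence is copied verbatim from the stored log:

import math
from dataclasses import dataclass, field

@dataclass
class TrajectoryChunk:
    # One rendered memory frame: the image is what the vision retriever scans,
    # while the original text lives untouched in an external log.
    image_path: str                                          # SoM-annotated rendering (red boxes + IDs)
    segments: dict[int, str] = field(default_factory=dict)   # segment ID -> verbatim text

def relevance_logits(query: str, chunk: TrajectoryChunk) -> dict[int, tuple[float, float]]:
    # Stand-in for the fine-tuned vision retriever: for each SoM segment ID it
    # would return a pair of binary logits (irrelevant, relevant). Stubbed here.
    raise NotImplementedError("placeholder for the DeepSeek-OCR-based retriever")

def locate_and_transcribe(query: str, memory: list[TrajectoryChunk],
                          threshold: float = 0.5) -> list[str]:
    # Locate: predict which segment indices are relevant to the query.
    # Transcribe: deterministically fetch the exact original text for those indices.
    # Nothing is generated, so the recovered evidence cannot be hallucinated.
    evidence = []
    for chunk in memory:
        for seg_id, (z0, z1) in relevance_logits(query, chunk).items():
            # binary softmax over (irrelevant, relevant) logits, matching p_{i,k}(q) above
            p_rel = math.exp(z1) / (math.exp(z0) + math.exp(z1))
            if p_rel >= threshold:
                evidence.append(chunk.segments[seg_id])   # verbatim lookup, not generation
    return evidence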
Key Findings
What the experiments revealed
- Visual encoding outperforms all text-based memory paradigms. On Mind2Web, OCR-Memory achieves 53.8% Element Accuracy and 46.1% Step Success Rate, ahead of the abstraction-based AWM (49.1% and 42.6%) and the compression-based ACON (48.2% and 41.4%) under the same 4096-token context budget.
- Hard tasks show the largest gains. On AppWorld, OCR-Memory reaches 30.8% on Hard tasks - substantially above standard Retrieval (21.4%) and AWM (27.2%) - where extensive history backtracking is required.
- Locate-and-Transcribe eliminates retrieval hallucination. The free-form generative retrieval variant achieves only 84.3% faithfulness, while OCR-Memory hits 100.0% because it predicts segment indices and fetches verbatim text deterministically.
- SoM anchors are critical for both accuracy and speed. Removing Set-of-Mark prompting (the text-generation ablation) drops Element Accuracy from 53.8% to 46.5% and roughly triples inference latency from 1.7s to 5.3s per retrieval step.
- Multi-resolution aging matches high-res quality at low-res cost. The dynamic strategy achieves 46.1% Step SR using only 82 average visual tokens per frame - compared to 46.5% Step SR for static high-res at 256 tokens, a 3.1x reduction with minimal performance loss.
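As a back-of-envelope check on that 82-token figure, using only the numbers stated above (the five most recent frames at 256 visual tokens, everything older at 64), the average works out to roughly 82 tokens per frame for a trajectory of about 50 steps. The experiments' exact trajectory lengths are not stated, so this is only a consistency sketch:

def avg_visual_tokens(num_steps: int, recent_window: int = 5,
                      hi_tokens: int = 256, lo_tokens: int = 64) -> float:
    # Average visual tokens per frame under the multi-resolution aging policy:
    # the most recent `recent_window` steps stay high-res, the rest are downsampled.
    recent = min(num_steps, recent_window)
    old = num_steps - recent
    return (recent * hi_tokens + old * lo_tokens) / num_steps

print(avg_visual_tokens(53))   # ~82.1 tokens/frame, in line with the reported average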
Results
On Mind2Web, OCR-Memory scores 53.8% Element Accuracy, 59.2 Action F1, 46.1% Step Success Rate, and 4.8% Task Success Rate - outperforming ACON (48.2/54.1/41.4/4.1), AWM (49.1/55.7/42.6/4.3), and MemoryBank (43.8/49.5/39.2/3.3) across all metrics under the same context budget. On AppWorld, OCR-Memory reaches 58.1% average success rate (86.2% Easy, 57.4% Medium, 30.8% Hard), beating ACON's 56.2% and AWM's 55.0%. The retrieval-level evaluation on a dedicated Mind2Web subset shows 78.6% Recall@1 versus Dense Text-RAG's 52.7%, with 93.4% Recall@5 and MRR of 0.84 versus 0.61. On the NIAH benchmark adapted for agents, OCR-Memory maintains 98.5% retrieval accuracy at 4k context and sustains 94.1% at 32k, with a consistent 10x+ compression ratio across all lengths. The gains are backbone-agnostic: switching from GPT-4 to Qwen3-32B preserves the relative improvement over text-based retrieval (48.6% vs 35.2% Element Accuracy).
OCR-Memory: 596 text tokens per step, 100% retrieval faithfulness
Text-RAG: 3,980 text tokens per step, 84.3% retrieval faithfulness
Why This Matters for AI and Automation
- Reasoning tokens are the scarcest resource in agent systems. OCR-Memory's 6.7x reduction in text tokens injected into the reasoning LLM translates directly into lower API costs and faster inference for production agent deployments. The trade-off - higher disk usage (1.47 MB vs 18 KB per episode) and higher retrieval latency (1.7s vs 0.3s) - is a favorable one when reasoning tokens cost orders of magnitude more than storage.
- 100% retrieval faithfulness changes the trust equation. Generative retrieval that fabricates 15.7% of recovered evidence is a serious liability for agent systems that take real-world actions. The Locate-and-Transcribe paradigm makes retrieval deterministic and auditable - the agent either gets the exact original text or nothing.
- Multi-resolution aging is a practical pattern for production memory. The idea of progressively compressing old memories while preserving the ability to restore them on demand maps directly to how persistent agent deployments should manage growing interaction histories - keep recent context sharp, let older context fade efficiently, and restore on demand when relevance is detected (see the sketch after this list).
- Visual encoding as a compression layer is underexplored. The finding that rendering text as images and reading them with a vision encoder achieves 10x compression with near-lossless retrieval accuracy suggests a broader opportunity beyond agent memory - any system that needs to store and retrieve large text corpora under token constraints could benefit from this approach.
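To ground that aging-and-recall pattern, here is a minimal Python sketch. The helpers downsample and render_high_res are hypothetical placeholders, not the paper's implementation; the shape of the loop is what matters - recent frames stay sharp, old frames fade, and flagged frames are restored to full fidelity:

from dataclasses import dataclass

HI_RES, LO_RES = 1024, 512     # stored resolutions reported in the paper
RECENT_WINDOW = 5              # number of most recent steps kept at full resolution

@dataclass
class MemoryFrame:
    step: int
    resolution: int
    image: bytes               # rendered trajectory chunk at `resolution`

def downsample(image: bytes, resolution: int) -> bytes:
    # Hypothetical helper: resize / re-render the frame at a lower resolution.
    ...

def render_high_res(step: int) -> bytes:
    # Hypothetical helper: re-render this step's log at 1024x1024.
    ...

def age_memory(frames: list[MemoryFrame], current_step: int) -> None:
    # Aging pass: anything older than the recent window fades to low resolution,
    # shrinking its visual-token footprint from 256 to 64 tokens.
    for frame in frames:
        if current_step - frame.step >= RECENT_WINDOW and frame.resolution > LO_RES:
            frame.image = downsample(frame.image, LO_RES)
            frame.resolution = LO_RES

def active_recall(frames: list[MemoryFrame], relevant_steps: set[int]) -> None:
    # Active Recall: frames the retriever flags as relevant are upscaled back
    # to full fidelity before being read.
    for frame in frames:
        if frame.step in relevant_steps and frame.resolution < HI_RES:
            frame.image = render_high_res(frame.step)
            frame.resolution = HI_RES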
My Take
The core insight here is counterintuitive but well-supported: converting text to images and reading it back with a vision model is more token-efficient than storing the text directly. DeepSeek-OCR's optical compression achieves 10x+ compression ratios while maintaining 98.5% retrieval accuracy at 4k context, and the Locate-and-Transcribe mechanism solves the hallucination problem that plagues generative retrieval by making evidence recovery fully deterministic. The multi-resolution aging with Active Recall is the most production-relevant contribution - it is a clean answer to the "memory grows forever" problem that every persistent agent deployment faces. The main limitations are real: rendering text to images adds disk overhead (1.47 MB vs 18 KB per episode), retrieval latency increases from 0.3s to 1.7s, and the system requires fine-tuning a dedicated 3B-parameter retrieval model. The fine-tuning dependency on HotpotQA also raises questions about domain transfer - whether the learned grounding generalizes beyond web navigation and API interaction tasks. Still, in a landscape where every agent memory approach either loses information (summarization) or burns tokens (raw storage), OCR-Memory finds a genuinely novel third path by shifting the storage modality entirely.
Discussion question: OCR-Memory trades cheap storage for expensive reasoning tokens by encoding text as images - a favorable trade-off today when LLM inference is the bottleneck. But as inference costs drop and context windows expand, does the visual encoding approach become unnecessary overhead, or does the 100% faithfulness guarantee and multi-resolution aging keep it relevant regardless of token economics?