OCR for Historical Newsprint: Four Models Worth Running Locally in LM Studio

If you work with scanned, typeset documents from archives like the British Newspaper Archive, you will know the frustration of running standard OCR tools on material they were never really designed for: degraded print, Victorian column layouts, eccentric typography, and occasionally deliberate non-standard spelling.

You can leverage the power of local AI models, however, to automate this process, and with free inferencing software like LM Studio, the learning curve isn’t at all steep. Below, I take a look at four specialist OCR models you can run entirely locally using the package – and why you might prefer doing so over handing your documents to a web service.

Why Run OCR Locally?

There are some truly excellent web-based OCR services. There’s Transkribus, for instance, which is widely used in the academic community. Tool like this are powerful and convenient, but they come with some real trade-offs:

Privacy: Your document images leave your machine and are processed on someone else’s server. For sensitive archival material or unpublished research corpora, that matters.
Cost at scale: Processing hundreds or thousands of newspaper pages through a paid API adds up quickly.
No customisation: Cloud OCR engines don’t always offer many pipeline options. You cannot instruct them to preserve dialect spellings, flag ambiguous characters, or respect the orthographic conventions of a specific historical variety of English.
Reproducibility: Web services update their models silently. A corpus processed in 2024 may produce different output if you re-run it in 2026. A local model stays consistent – important for methodological reproducibility.

Running OCR-trained models in an inferencing software like LM Studio removes most of this friction. The program handles multiple model download and management through a clean interface, and also allows you to customise model settings, up to the inclusion of system prompts that persist across sessions. For historical document work, that means you can instruct the model once about the linguistic conventions of your material and have it apply those rules to every page you send it.

The Four Models

1. OLMOCR 2 (7B) — Best Overall for Documents

Developed by the Allen Institute for AI (Ai2), olmOCR 2 is built on Qwen2.5-VL-7B-Instruct and fine-tuned using reinforcement learning with unit-test rewards specifically targeting document OCR tasks. It is one of the few models designed from the ground up for this use case rather than adapted from a general vision assistant.

Size: 7 billion parameters. Available as a ~4.7 GB GGUF (Q4 quantisation) or ~8.85 GB at Q8. Needs around 5–10 GB RAM depending on quantisation.

Why it works for newspaper archives: Handles multi-column layouts, mixed content (tables, headings, body text), and degraded print reliably. Scores 82.4 on olmOCR-Bench. It responds well to system prompt instructions, making it a strong candidate for dialect-preservation workflows.

LM Studio: There’s a GGUF in the native catalogue – search and download directly in the app.
🔗 lmstudio.ai/models/allenai/olmocr-2-7b-1025

✅ Pros: Best-in-class document OCR accuracy; strong layout understanding; instruction-following is reliable; native LM Studio support.
❌ Cons: 7B means slower inference on modest hardware; not ideal for rapid bulk processing.

2. NANONETS-OCR-S — Clean Catalogue Option

Developed by Nanonets, a document AI company, this model is also based on the Qwen2.5-VL architecture but fine-tuned specifically on structured document extraction tasks including forms, invoices, and archival print.

Size: Approximately 7B parameters, similar footprint to olmOCR 2. Available directly via the LM Studio model catalogue as a GGUF.

Why it works for newspaper archives: Strong on structured layout extraction and clean Markdown output. Useful when you want transcription that preserves document structure (headings, columns, captions) as well as raw text.

LM Studio: Native catalogue – findable by searching “Nanonets” in the model browser.
🔗 lmstudio.ai/models (search: Nanonets-OCR-s)

✅ Pros: Easy one-click setup; good structural output; reliable on clean and moderately degraded scans.
❌ Cons: Less tested on heavily damaged historical material than olmOCR 2; similar hardware demands.

3. DOTS.OCR (1.7B) — Best for Complex Column Layouts

Released by Rednote (小红书) in late 2025, dots.ocr is a compact 1.7B vision-language model that combines layout detection and text recognition in a single pass. Unusually for its size, it explicitly predicts reading order — the sequence in which text blocks should be read — which is critical for Victorian newspaper pages where columns can be irregular and text wraps around illustrations.

Size: 1.7 billion parameters; approximately 2 GB as a GGUF. Runs comfortably on 3 GB VRAM.

Why it works for newspaper archives: Reading order prediction alone makes it worth considering for multi-column broadsheet layouts. Supports over 100 languages, outputs JSON, Markdown, or HTML, and benchmarks show Table TEDS accuracy of 88.6% — ahead of Gemini 2.5 Pro on that metric.

LM Studio: Load via HuggingFace GGUF import (paste the HuggingFace URL into LM Studio’s search bar).
🔗 huggingface.co/dotsdocx/dots.ocr-1.7B-GGUF

✅ Pros: Tiny footprint; reading order detection; fast; strong on multi-column layouts; multilingual.
❌ Cons: Smaller context window means system prompts may drift on very long sessions; can hallucinate on heavily degraded scans; not in the native LM Studio catalogue.

4. GLM-OCR (0.9B) — Best for Bulk Processing on Modest Hardware

Released by Z.ai (Zhipu AI) in early 2026, GLM-OCR is built on the GLM-V encoder–decoder architecture and fine-tuned exclusively for OCR. At under 1 billion parameters it is the smallest model here, yet it scores 94.0 on OCRBench and 93.96% Table TEDS accuracy – results that comfortably outperform much larger general-purpose models.

Size: 0.9 billion parameters; approximately 1 GB quantised (Q8). Needs under 1.5 GB VRAM – it will run on almost any laptop made in the last five years.

Why it works for newspaper archives: Speed and low resource use make it ideal for processing large batches of pages. It is not a chat model — it takes an image and outputs text, triggered by the phrase Text Recognition: — so it is best suited to pure transcription pipelines rather than interactive use.

LM Studio: Load via HuggingFace GGUF import using the ggml-org GGUF repository.
🔗 huggingface.co/ggml-org/GLM-OCR-GGUF

✅ Pros: Tiny; fast; runs on minimal hardware; excellent accuracy for its size; good for bulk workflows.
❌ Cons: Not a chat/instruction model — no system prompt support for dialect customisation; requires a separate layout detection step for complex multi-column pages; not in the native LM Studio catalogue.

Quick Comparison

Model	Size (GGUF)	VRAM	LM Studio Route	Best For
olmOCR 2 (7B)	~4.7 GB	5 GB+	Native catalogue	Best accuracy, complex layouts, dialect workflows
Nanonets-OCR-s	~4.7 GB	5 GB+	Native catalogue	Structured document extraction, clean output
dots.ocr (1.7B)	~2 GB	3 GB	HuggingFace GGUF import	Multi-column layouts, reading order, low VRAM
GLM-OCR (0.9B)	~1 GB	<1.5 GB	HuggingFace GGUF import	Bulk processing, minimal hardware

A Practical Workflow for Newspaper Archives

For a large corpus like material from the British Newspaper Archive, a two-tier approach works well. Use GLM-OCR for the bulk of clean, well-preserved pages – it is fast and accurate enough for standard 20th-century newsprint. Then escalate difficult pages (damaged, illegible columns, unusual typefaces, pre-1880 material) to olmOCR 2 for a more careful second pass. If column order is scrambling your output, switch to dots.ocr for those pages specifically.

For dialect writing research – where you need the transcription to preserve non-standard spellings rather than silently normalise them – load olmOCR 2 or Nanonets-OCR-s and write a system prompt that explicitly instructs the model to treat all orthographic choices as intentional. That single step does something no traditional OCR engine is capable of: it makes the tool linguistically aware of your material.

All four models run fully offline once downloaded. No subscription, no API key, no usage limits — just your hardware and your documents.

The GLM-OCR model running in LM Studio, transcribing a 19th-century newspaper article

Polyglossic

Love Learning Languages

OCR for Historical Newsprint: Four Models Worth Running Locally in LM Studio

Why Run OCR Locally?

The Four Models

1. OLMOCR 2 (7B) — Best Overall for Documents

2. NANONETS-OCR-S — Clean Catalogue Option

3. DOTS.OCR (1.7B) — Best for Complex Column Layouts

4. GLM-OCR (0.9B) — Best for Bulk Processing on Modest Hardware

Quick Comparison

A Practical Workflow for Newspaper Archives

Leave a Reply

Why Run OCR Locally?

The Four Models

1. OLMOCR 2 (7B) — Best Overall for Documents

2. NANONETS-OCR-S — Clean Catalogue Option

3. DOTS.OCR (1.7B) — Best for Complex Column Layouts

4. GLM-OCR (0.9B) — Best for Bulk Processing on Modest Hardware

Quick Comparison

A Practical Workflow for Newspaper Archives

Share this:

Leave a Reply