Name
cx:tesseract — Tesseract OCR.
Synopsis
This step runs the Tesseract OCR application.
| Input port | Primary | Sequence | Content types |
|---|---|---|---|
| source | ✔ | | |
| Output port | Primary | Sequence | Content types |
|---|---|---|---|
| result | ✔ | | |
| Option name | Type | Values | Default value | Required |
|---|---|---|---|---|
| language | xs:string | | | ✔ |
| data-path | xs:string? | | () | |
| debug-output | xs:string? | | () | |
| engine-mode | xs:string | ('tesseract-only', 'lstm-only', 'lstm-combined', 'default') | 'default' | |
| output-format | xs:string | ('text', 'hocr', 'tsv', 'alto', 'lstmbox', 'wordstrbox') | 'text' | |
| page-segmentation-mode | xs:string | ('osd-only', 'auto-osd', 'auto', 'single-column', 'single-block', 'single-line', 'sparse-text', 'raw-line') | 'auto' | |
| variables | map(xs:string,xs:string)? | | () | |
<p:import href="https://xmlcalabash.com/ext/library/tesseract.xpl"/>
Declaration
<p:declare-step xmlns:cx="http://xmlcalabash.com/ns/extensions"
                xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                type="cx:tesseract">
  <p:input port="source"/>
  <p:output port="result"/>
  <p:option name="language" as="xs:string" required="true"/>
  <p:option name="data-path" as="xs:string?"/>
  <p:option name="engine-mode"
            values="('tesseract-only', 'lstm-only', 'lstm-combined', 'default')"
            select="'default'"/>
  <!-- Values not in the Tesseract OCR documentation have been removed -->
  <p:option name="page-segmentation-mode"
            values="('osd-only', 'auto-osd', (: 'auto-only', :) 'auto', 'single-column', (: 'single-block-vert-text', :) 'single-block', 'single-line', (: 'single-word', :) (: 'circle-word', :) (: 'single-char', :) 'sparse-text', (: 'sparse-text-osd', :) 'raw-line')"
            select="'auto'"/>
  <p:option name="output-format"
            values="('text','hocr','tsv', 'alto', 'lstmbox', 'wordstrbox')"
            select="'text'"/>
  <p:option name="variables" as="map(xs:string,xs:string)?"/>
  <p:option name="debug-output" as="xs:string?"/>
</p:declare-step>
Description
This step performs OCR on the input (usually an image) using Tesseract OCR. In order to use this step, you must install the Tesseract OCR application.
This library uses JNA to communicate with the Tesseract application. It’s also possible to use the p:os-exec step to run Tesseract directly, but that requires the pipeline to read the OCR results from the filesystem.
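As a rough sketch of how the step fits into a pipeline, the following assumes an image named page.png next to the pipeline (a hypothetical file) and English training data available as “eng”. The p:load step and the default port connections are ordinary XProc 3.0; only the import and the language option come from this page.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="3.0">
  <p:import href="https://xmlcalabash.com/ext/library/tesseract.xpl"/>

  <p:output port="result"/>

  <!-- Hypothetical input image; replace with your own file. -->
  <p:load href="page.png"/>

  <!-- The loaded image flows into the primary source port by default.
       With no output-format given, the result is plain text. -->
  <cx:tesseract language="eng"/>
</p:declare-step>
```

The recognized text appears on the pipeline’s result port, so nothing has to be read back from the filesystem.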
Options
Language
The language option identifies the language to expect in
the image text, for example “eng”. The available languages and
their names depend on which sets of training data you have available.
Data path
The data-path option points to the directory containing the training
data. If the option isn’t given, the value of the TESSDATA_PREFIX
environment variable is used.
The location of the training data depends on where Tesseract OCR was installed and on the operating system.
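If setting TESSDATA_PREFIX is inconvenient, the directory can be passed explicitly. This is only a sketch: the directory shown is an assumption (one place a Linux package might put the data) and page.png is a hypothetical file, so substitute the values for your own installation.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="3.0">
  <p:import href="https://xmlcalabash.com/ext/library/tesseract.xpl"/>

  <p:output port="result"/>

  <!-- Hypothetical input image. -->
  <p:load href="page.png"/>

  <!-- Assumed training data directory; it must contain the data
       for the requested language (here, eng). -->
  <cx:tesseract language="eng"
                data-path="/usr/share/tesseract-ocr/5/tessdata"/>
</p:declare-step>
```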
Debug output
The debug-output option can be used to redirect debugging output
from the Tesseract API to a file. (In my experience, it only works sometimes.)
Engine Mode
The engine-mode option identifies the engine mode, which switches between pattern matching engines.
| Mode | Description |
|---|---|
| tesseract-only | Only the legacy engine (traditional computer vision) |
| lstm-only | Only the neural nets LSTM engine |
| lstm-combined | Legacy and LSTM engines combined |
| default | Default, based on what’s available in the model data |
Page Segmentation Mode
The page-segmentation-mode option identifies the page segmentation mode, which determines how Tesseract analyzes the layout of an image to find text blocks. The modes are:
| Mode | Description |
|---|---|
| osd-only | Orientation and script detection (OSD) only |
| auto-osd | Automatic page segmentation with OSD |
| auto | Fully automatic page segmentation, but no OSD |
| single-column | Assume a single column of text in varying sizes |
| single-block | Assume a single, uniform block of text |
| single-line | Assume a single line of text |
| sparse-text | Find as much text as possible, in no particular order |
| raw-line | Assume a single line of text, ignoring any Tesseract-specific hacks |
The Tesseract documentation recommends auto for
general documents, single-block for uniform text chunks,
and single-line for bar codes or labels.
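For instance, a cropped image that holds one line of text might be processed as follows. This is a sketch under assumptions: label.png is a hypothetical file, and whether single-line gives better results than auto depends on the image.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="3.0">
  <p:import href="https://xmlcalabash.com/ext/library/tesseract.xpl"/>

  <p:output port="result"/>

  <!-- Hypothetical image containing a single line of text. -->
  <p:load href="label.png"/>

  <!-- Skip layout analysis and treat the image as one text line. -->
  <cx:tesseract language="eng"
                page-segmentation-mode="single-line"/>
</p:declare-step>
```

The engine-mode option is set the same way, for example engine-mode="lstm-only".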
Output format
The output-format option determines what kind
of output is produced.
| Format | Description |
|---|---|
| text | Plain text output |
| hocr | hOCR output; an HTML result |
| tsv | Tab-separated-values output; a JSON result |
| alto | ALTO XML output; an XML result |
| lstmbox | The lstmbox format can be used to make training data |
| wordstrbox | Coordinates and text for whole lines |
The Tesseract application supports several other formats, but they aren’t
supported by the underlying Java API. The Java API only supports text output
formats (so PDF can’t work), and the “page-xml” format causes
a spectacular crash.
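Selecting a format is a matter of setting the option on the step. The sketch below requests hOCR and assumes a hypothetical input file named page.png.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="3.0">
  <p:import href="https://xmlcalabash.com/ext/library/tesseract.xpl"/>

  <p:output port="result"/>

  <!-- Hypothetical input image. -->
  <p:load href="page.png"/>

  <!-- Request hOCR; per the table above, the result is HTML
       rather than plain text. -->
  <cx:tesseract language="eng" output-format="hocr"/>
</p:declare-step>
```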
Variables
The variables option allows you to set any
Tesseract parameter. For example, an alternative way to enable hOCR output
is to specify map{'tessedit_create_hocr': '1'} as the value
of variables. (Don’t cross the streams this way: the results
of specifying variables that conflict with the requested output format are
undefined.)
A complete set of variables can be obtained by running the
tesseract application with the --print-parameters
option in a shell or command window.
The dpi option sets user_defined_dpi; the output-format option sets the corresponding tessedit_create_format variable (for example, tessedit_create_hocr); the debug-output option sets debug_output.
Not all variables appear to be supported by the underlying Java API; setting some causes the Java process to crash spectacularly. This is likely some consequence of the underlying JNA architecture and is completely out of XML Calabash’s control.
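A variables map can be passed with p:with-option, as in the sketch below. The parameter name preserve_interword_spaces is an assumption chosen for illustration: check that it appears in the parameter list printed by your Tesseract version, and note that it has not been verified here to be among the variables the Java API accepts. The file page.png is also hypothetical.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="3.0">
  <p:import href="https://xmlcalabash.com/ext/library/tesseract.xpl"/>

  <p:output port="result"/>

  <!-- Hypothetical input image. -->
  <p:load href="page.png"/>

  <cx:tesseract language="eng">
    <!-- Keys and values are both strings. The parameter name is an
         assumption; confirm it against your Tesseract installation. -->
    <p:with-option name="variables"
                   select="map{'preserve_interword_spaces': '1'}"/>
  </cx:tesseract>
</p:declare-step>
```

The map here deliberately avoids the variables that the step’s own options set, so it cannot conflict with output-format.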
Document properties
No document properties are preserved.
Additional dependencies
This step is included in the XML Calabash application. If you are getting XML Calabash from Maven, you will also need to include these additional dependencies:
net.sourceforge.tess4j:tess4j:5.19.0
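Assuming the build is a Maven POM, the coordinates above correspond to the following dependency element. Other build tools express the same coordinates in their own syntax.

```xml
<!-- Goes inside the <dependencies> element of pom.xml. -->
<dependency>
  <groupId>net.sourceforge.tess4j</groupId>
  <artifactId>tess4j</artifactId>
  <version>5.19.0</version>
</dependency>
```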
Tess4J on Ubuntu
It was a bit of a struggle getting the underlying JNA library to work on Ubuntu. There’s a discussion under tess4j issue #273.
I was able to get it working with the instructions in the comment from 19 February 2026. (Although clang-19 seems to be the clang compiler on Ubuntu now.)
Additional examples
The XML Calabash test suite contains examples of the cx:tesseract step.