Name

cx:tesseract — Tesseract OCR.

Synopsis

This step runs the Tesseract OCR application.

Input port | Primary | Sequence | Content types
source | ✔ | |

Output port | Primary | Sequence | Content types
result | ✔ | |

Option name | Type | Values | Default value | Required
language | xs:string | | | ✔
data-path | xs:string? | | () |
debug-output | xs:string? | | () |
engine-mode | xs:string | ('tesseract-only', 'lstm-only', 'lstm-combined', 'default') | 'default' |
output-format | xs:string | ('text', 'hocr', 'tsv', 'alto', 'lstmbox', 'wordstrbox') | 'text' |
page-segmentation-mode | xs:string | ('osd-only', 'auto-osd', 'auto', 'single-column', 'single-block', 'single-line', 'sparse-text', 'raw-line') | 'auto' |
variables | map(xs:string,xs:string)? | | () |

This is an extension step; to use it, your pipeline must include its declaration. For example, include the extension library with an import at the top of your pipeline:

<p:import href="https://xmlcalabash.com/ext/library/tesseract.xpl"/>

Declaration

<p:declare-step xmlns:cx="http://xmlcalabash.com/ns/extensions"
                xmlns:p="http://www.w3.org/ns/xproc"
                type="cx:tesseract">
   <p:input port="source"/>
   <p:output port="result"/>
   <p:option name="language" as="xs:string" required="true"/>
   <p:option name="data-path" as="xs:string?"/>
   <p:option name="engine-mode"
             values="('tesseract-only', 'lstm-only', 'lstm-combined', 'default')"
             select="'default'"/>
   <!-- Values not in the Tesseract OCR documentation have been removed -->
   <p:option name="page-segmentation-mode"
             values="('osd-only',
                      'auto-osd',
                      (: 'auto-only', :)
                      'auto',
                      'single-column',
                      (: 'single-block-vert-text', :)
                      'single-block',
                      'single-line',
                      (: 'single-word', :)
                      (: 'circle-word', :)
                      (: 'single-char', :)
                      'sparse-text',
                      (: 'sparse-text-osd', :)
                      'raw-line')"
             select="'auto'"/>
   <p:option name="output-format"
             values="('text','hocr','tsv', 'alto', 'lstmbox', 'wordstrbox')"
             select="'text'"/>
   <p:option name="variables" as="map(xs:string,xs:string)?"/>
   <p:option name="debug-output" as="xs:string?"/>
</p:declare-step>

Description

This step performs OCR on the input (usually an image) using Tesseract OCR. In order to use this step, you must install the Tesseract OCR application.

This library uses JNA to communicate with the Tesseract application. It’s also possible to use the p:os-exec step to run Tesseract directly, but that requires the pipeline to read the OCR results from the filesystem.
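
A minimal pipeline using the step might look like the following sketch. The file name page.png is a hypothetical image assumed to sit next to the pipeline, and the “eng” training data is assumed to be installed where Tesseract can find it.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="3.0">
  <!-- Pull in the declaration of cx:tesseract -->
  <p:import href="https://xmlcalabash.com/ext/library/tesseract.xpl"/>

  <p:output port="result"/>

  <!-- page.png is a hypothetical file name used only for illustration -->
  <p:load href="page.png"/>

  <!-- language is the only required option -->
  <cx:tesseract language="eng"/>
</p:declare-step>
```

Because only language is required, every other option falls back to the default shown in the declaration.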

Options

Language

The language option identifies the language to expect in the image text, for example “eng”. The available languages and their names depend on which sets of training data you have available.

Data path

The data-path option points to the directory containing training data. If the option isn’t given, the value of the TESSDATA_PREFIX environment variable is used.

The location of the training data will depend on where Tesseract OCR was installed, and on what kind of system.
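
As an illustration, the training data directory can be passed explicitly instead of relying on TESSDATA_PREFIX. The directory and the image name below are assumptions chosen for the example; substitute whatever location your installation actually uses.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="3.0">
  <p:import href="https://xmlcalabash.com/ext/library/tesseract.xpl"/>

  <p:output port="result"/>

  <!-- Hypothetical input image -->
  <p:load href="page.png"/>

  <!-- Hypothetical training data directory; adjust for your system -->
  <cx:tesseract language="eng"
                data-path="/usr/share/tesseract-ocr/5/tessdata"/>
</p:declare-step>
```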

Debug output

The debug-output option can be used to redirect debugging output from the Tesseract API to a file. (In my experience, it only works sometimes.)

Engine Mode

The engine-mode option selects the engine mode, which switches between pattern matching engines.

Mode | Description
tesseract-only | Only the legacy engine (traditional computer vision)
lstm-only | Only the neural nets LSTM engine
lstm-combined | Legacy and LSTM engines combined
default | Default, based on what’s available in the model data

Page Segmentation Mode

The page-segmentation-mode option selects the page segmentation mode, which determines how Tesseract analyzes the layout of an image to find text blocks. The modes are:

Mode | Description
osd-only | Orientation and script detection (OSD) only
auto-osd | Automatic page segmentation with OSD
auto | Fully automatic page segmentation, but no OSD
single-column | Assume a single column of text in varying sizes
single-block | Assume a single, uniform block of text
single-line | Assume a single line of text
sparse-text | Find as much text as possible, in no particular order
raw-line | Assume a single line of text, ignoring any Tesseract-specific hacks

The Tesseract documentation recommends auto for general documents, single-block for uniform text chunks, and single-line for bar codes or labels.
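
For instance, a pipeline that reads a single line from a label image could be sketched as follows. The image option and its default value label.png are hypothetical names introduced for this example.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                version="3.0">
  <p:import href="https://xmlcalabash.com/ext/library/tesseract.xpl"/>

  <p:output port="result"/>

  <!-- Hypothetical option: URI of an image holding one line of text -->
  <p:option name="image" as="xs:string" select="'label.png'"/>

  <p:load href="{$image}"/>

  <!-- Treat the whole image as a single line instead of analyzing layout -->
  <cx:tesseract language="eng" page-segmentation-mode="single-line"/>
</p:declare-step>
```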

Output format

The output-format option determines what kind of output is produced.

Format | Description
text | Plain text output
hocr | hOCR output; an HTML result
tsv | Tab-separated-values output; a JSON result
alto | Alto XML output; an XML result
lstmbox | The lstmbox format can be used to make training data
wordstrbox | Coordinates and text for whole lines

The Tesseract application supports several other formats, but they aren’t supported by the underlying Java API. The Java API only supports text output formats (so PDF can’t work), and the “page-xml” format causes a spectacular crash.
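
A sketch of requesting hOCR output is shown below. Here the image is assumed to arrive on the pipeline’s own source port rather than being loaded from a fixed file; how the caller binds that port is left open.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="3.0">
  <p:import href="https://xmlcalabash.com/ext/library/tesseract.xpl"/>

  <!-- The caller supplies the image on the source port -->
  <p:input port="source"/>
  <p:output port="result"/>

  <!-- Ask for hOCR instead of the default plain text -->
  <cx:tesseract language="eng" output-format="hocr"/>
</p:declare-step>
```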

Variables

The variables option allows you to set any Tesseract parameter. For example, an alternative way to enable hOCR output is to specify map{'tessedit_create_hocr': '1'} as the value of variables. (Don’t cross the streams this way: the results of specifying variables that conflict with the requested output format are undefined.)

A complete set of variables can be obtained by running the tesseract application with the --print-parameters option in a shell or command window.
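
One way to pass a map of variables is with p:with-option, which avoids having to escape the map’s curly braces inside an attribute value template. The sketch below sets preserve_interword_spaces, a parameter listed by Tesseract; whether the Java API honors it has not been verified here, and page.png is a hypothetical file name.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="3.0">
  <p:import href="https://xmlcalabash.com/ext/library/tesseract.xpl"/>

  <p:output port="result"/>

  <!-- Hypothetical input image -->
  <p:load href="page.png"/>

  <cx:tesseract language="eng">
    <!-- Keys and values are both strings, per the option's declared type -->
    <p:with-option name="variables"
                   select="map{'preserve_interword_spaces': '1'}"/>
  </cx:tesseract>
</p:declare-step>
```

Keeping the map free of parameters that the named options already set, such as the output format, stays clear of the undefined behavior mentioned above.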

The dpi option sets user_defined_dpi; the output-format option sets tessedit_create_format; the debug-output option sets debug_output.

Not all variables appear to be supported by the underlying Java API; setting some causes the Java process to crash spectacularly. This is likely some consequence of the underlying JNA architecture and is completely out of XML Calabash’s control.

Document properties

No document properties are preserved.

Additional dependencies

This step is included in the XML Calabash application. If you are getting XML Calabash from Maven, you will also need to include these additional dependencies:

  • net.sourceforge.tess4j:tess4j:5.19.0

Tess4J on Ubuntu

It was a bit of a struggle getting the underlying JNA library to work on Ubuntu. There’s a discussion under tess4j issue #273. I was able to get it working with the instructions in the comment from 19 February 2026. (Although clang-19 seems to be the clang compiler on Ubuntu now.)

Additional examples

The XML Calabash test suite contains examples of the cx:tesseract step.