cx:tesseract

Name

cx:tesseract — Tesseract OCR.

Synopsis

This step runs the Tesseract OCR application.

Input port	Primary	Sequence	Content types
source	✔

Output port	Primary	Sequence	Content types
result	✔

Option name	Type	Values	Default value	Required
language	xs:string			✔
data-path	xs:string?		()
debug-output	xs:string?		()
engine-mode	xs:string	('tesseract-only', 'lstm-only', 'lstm-combined', 'default')	'default'
output-format	xs:string	('text', 'hocr', 'tsv', 'alto', 'lstmbox', 'wordstrbox')	'text'
page-segmentation-mode	xs:string	('osd-only', 'auto-osd', 'auto', 'single-column', 'single-block', 'single-line', 'sparse-text', 'raw-line')	'auto'
variables	map(xs:string,xs:string)?		()

This is an extension step; to use it, your pipeline must include its declaration. For example, you can include the extension library with an import at the top of your pipeline:

<p:import href="https://xmlcalabash.com/ext/library/tesseract.xpl"/>

Declaration

<p:declare-step xmlns:cx="http://xmlcalabash.com/ns/extensions"
                xmlns:p="http://www.w3.org/ns/xproc"
                type="cx:tesseract">
   <p:input port="source"/>
   <p:output port="result"/>
   <p:option name="language" as="xs:string" required="true"/>
   <p:option name="data-path" as="xs:string?"/>
   <p:option name="engine-mode"
             values="('tesseract-only', 'lstm-only', 'lstm-combined', 'default')"
             select="'default'"/>
   <!-- Values not in the Tesseract OCR documentation have been removed -->
   <p:option name="page-segmentation-mode"
             values="('osd-only',
                      'auto-osd',
                      (: 'auto-only', :)
                      'auto',
                      'single-column',
                      (: 'single-block-vert-text', :)
                      'single-block',
                      'single-line',
                      (: 'single-word', :)
                      (: 'circle-word', :)
                      (: 'single-char', :)
                      'sparse-text',
                      (: 'sparse-text-osd', :)
                      'raw-line')"
             select="'auto'"/>
   <p:option name="output-format"
             values="('text', 'hocr', 'tsv', 'alto', 'lstmbox', 'wordstrbox')"
             select="'text'"/>
   <p:option name="variables" as="map(xs:string,xs:string)?"/>
   <p:option name="debug-output" as="xs:string?"/>
</p:declare-step>

Description

This step performs OCR on the input (usually an image) using Tesseract OCR. In order to use this step, you must install the Tesseract OCR application.

This library uses JNA to communicate with the Tesseract application. It’s also possible to use the p:os-exec step to run Tesseract directly, but that requires the pipeline to read the OCR results from the filesystem.
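As a rough illustration of the basic usage, a minimal pipeline might look like the following sketch. The file name page.png is a placeholder, and the example assumes that English training data (“eng”) is installed where Tesseract can find it.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="3.0">
  <!-- Import the extension library so that cx:tesseract is declared. -->
  <p:import href="https://xmlcalabash.com/ext/library/tesseract.xpl"/>

  <p:output port="result"/>

  <!-- Assumption: page.png is a placeholder for your own image. -->
  <p:load href="page.png"/>

  <!-- language is the only required option; the others use their defaults. -->
  <cx:tesseract language="eng"/>
</p:declare-step>
```

With the default output-format of 'text', the result port should carry the recognized text.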

Options

Language

The language option identifies the language to expect in the image text, for example “eng”. The available languages and their names depend on which sets of training data you have available.

Data path

The data-path option points to the directory containing training data. If the option isn’t given, the value of the TESSDATA_PREFIX environment variable is used.

The location of the training data will depend on where Tesseract OCR was installed, and on what kind of system.
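For example, a pipeline that names the training data directory explicitly, rather than relying on TESSDATA_PREFIX, could be sketched like this. The path /path/to/tessdata and the file name page.png are placeholders; substitute the directory where your installation keeps its training data.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="3.0">
  <p:import href="https://xmlcalabash.com/ext/library/tesseract.xpl"/>

  <p:output port="result"/>

  <!-- Assumption: page.png is a placeholder input image. -->
  <p:load href="page.png"/>

  <!-- Assumption: /path/to/tessdata is a placeholder for the directory
       that holds the training data for the requested language. -->
  <cx:tesseract language="eng"
                data-path="/path/to/tessdata"/>
</p:declare-step>
```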

Debug output

The debug-output option can be used to redirect debugging output from the Tesseract API to a file. (In my experience, it only works sometimes.)

Engine Mode

The engine-mode option selects which recognition engine Tesseract uses. The modes are:

Mode	Description
tesseract-only	Only the legacy engine (traditional computer vision)
lstm-only	Only the neural nets LSTM engine
lstm-combined	Legacy and LSTM engines combined
default	Default, based on what’s available in the model data

Page Segmentation Mode

The page-segmentation-mode option selects the page segmentation mode, which determines how Tesseract analyzes the layout of an image to find text blocks. The modes are:

Mode	Description
osd-only	Orientation and script detection (OSD) only
auto-osd	Automatic page segmentation with OSD
auto	Fully automatic page segmentation, but no OSD
single-column	Assume a single column of text in varying sizes
single-block	Assume a single, uniform block of text
single-line	Assume a single line of text
sparse-text	Find as much text as possible, in no particular order
raw-line	Assume a single line of text, ignoring any Tesseract-specific hacks

The Tesseract documentation recommends auto for general documents, single-block for uniform text chunks, and single-line for bar codes or labels.
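The engine mode and the page segmentation mode are both set with ordinary option attributes. The following sketch requests the LSTM engine and treats the image as a single uniform block of text; the file name paragraph.png is a placeholder, and whether 'lstm-only' works depends on the training data you have installed.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="3.0">
  <p:import href="https://xmlcalabash.com/ext/library/tesseract.xpl"/>

  <p:output port="result"/>

  <!-- Assumption: paragraph.png is a placeholder image containing
       one uniform chunk of text. -->
  <p:load href="paragraph.png"/>

  <!-- Both values come from the lists in the step declaration. -->
  <cx:tesseract language="eng"
                engine-mode="lstm-only"
                page-segmentation-mode="single-block"/>
</p:declare-step>
```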

Output format

The output-format option determines what kind of output is produced.

Format	Description
text	Plain text output
hocr	hOCR output; an HTML result
tsv	Tab-separated-values output; a JSON result
alto	Alto XML output; an XML result
lstmbox	Box output that can be used to make training data
wordstrbox	Coordinates and text for whole lines

The Tesseract application supports several other formats, but they aren’t supported by the underlying Java API. The Java API only supports text output formats (so PDF can’t work), and the “page-xml” format causes a spectacular crash.
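As a sketch of selecting a non-default format, the following pipeline requests hOCR output and saves it with p:store. The file names page.png and page.hocr.html are placeholders, and how the result is serialized will depend on the content type the step assigns to it.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="3.0">
  <p:import href="https://xmlcalabash.com/ext/library/tesseract.xpl"/>

  <!-- Assumption: page.png is a placeholder input image. -->
  <p:load href="page.png"/>

  <!-- Ask for hOCR instead of the default plain text. -->
  <cx:tesseract language="eng" output-format="hocr"/>

  <!-- Assumption: page.hocr.html is a placeholder output location. -->
  <p:store href="page.hocr.html"/>
</p:declare-step>
```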

Variables

The variables option allows you to set any Tesseract parameter. For example, an alternative way to enable hOCR output is to specify map{'tessedit_create_hocr': '1'} as the value of variables. (Don’t cross the streams this way; the results of specifying variables that conflict with the requested output format are undefined.)

A complete set of variables can be obtained by running the tesseract application with the --print-parameters option in a shell or command window.

The dpi option sets user_defined_dpi; the output-format option sets tessedit_create_format; the debug-output option sets debug_output.

Not all variables appear to be supported by the underlying Java API; setting some causes the Java process to crash spectacularly. This is likely some consequence of the underlying JNA architecture and is completely out of XML Calabash’s control.
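One way to pass the map is with p:with-option, which keeps the XPath map expression out of an attribute value template. In this sketch, preserve_interword_spaces is used purely as an illustration of a Tesseract parameter; given the caveat above, I can’t promise that it (or any other particular variable) behaves well through the Java API. The file name page.png is a placeholder.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="3.0">
  <p:import href="https://xmlcalabash.com/ext/library/tesseract.xpl"/>

  <p:output port="result"/>

  <!-- Assumption: page.png is a placeholder input image. -->
  <p:load href="page.png"/>

  <cx:tesseract language="eng">
    <!-- Keys and values are both strings, matching the declared
         type map(xs:string,xs:string)?. The variable chosen here is
         illustrative only. -->
    <p:with-option name="variables"
                   select="map{'preserve_interword_spaces': '1'}"/>
  </cx:tesseract>
</p:declare-step>
```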

Document properties

No document properties are preserved.

Additional dependencies

This step is included in the XML Calabash application. If you are getting XML Calabash from Maven, you will also need to include these additional dependencies:

net.sourceforge.tess4j:tess4j:5.19.0
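In a Maven POM, that coordinate corresponds to a dependency element along these lines (a fragment to place inside your project’s dependencies element):

```xml
<dependency>
  <groupId>net.sourceforge.tess4j</groupId>
  <artifactId>tess4j</artifactId>
  <version>5.19.0</version>
</dependency>
```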

Tess4J on Ubuntu

It was a bit of a struggle getting the underlying JNA library to work on Ubuntu. There’s a discussion under tess4j issue #273. I was able to get it working with the instructions in the comment from 19 February 2026. (Although clang-19 seems to be the clang compiler on Ubuntu now.)

Additional examples

The XML Calabash test suite contains examples of the cx:tesseract step.
