Name
cx:pdf-info — Get information about a PDF.
Synopsis
This step returns metadata, and optionally the content, of a PDF.
| Input port | Primary | Sequence | Content types |
|---|---|---|---|
| source | ✔ | | application/pdf |

| Output port | Primary | Sequence | Content types |
|---|---|---|---|
| result | ✔ | | xml |

| Option name | Type | Default value |
|---|---|---|
| form-details | xs:boolean? | () |
| page-details | xs:boolean? | () |
| page-text | xs:boolean? | () |
| password | xs:string? | () |
<p:import href="https://xmlcalabash.com/ext/library/pdf-steps.xpl"/>
Declaration
<p:declare-step xmlns:cx="http://xmlcalabash.com/ns/extensions"
                xmlns:p="http://www.w3.org/ns/xproc"
                type="cx:pdf-info">
  <p:input port="source" content-types="application/pdf"/>
  <p:output port="result" content-types="xml"/>
  <p:option name="password" as="xs:string?"/>
  <p:option name="page-details" as="xs:boolean?"/>
  <p:option name="page-text" as="xs:boolean?"/>
  <p:option name="form-details" as="xs:boolean?"/>
</p:declare-step>
Description
This step returns details about the content of a PDF: PDF version, number of pages, creator, title, etc.
The page-size returned reflects the maximum width and the
maximum height of all pages. In a document that contains different media sizes,
use the page-details option to get information about each page.
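As a minimal sketch of how the step might be called, the pipeline below loads a PDF and asks for per-page details. The file name `document.pdf` is a placeholder, not something this page defines, and the structure of the XML that comes back is not shown because it is not specified here.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="3.0">
  <!-- Make the cx:pdf-info declaration available to this pipeline. -->
  <p:import href="https://xmlcalabash.com/ext/library/pdf-steps.xpl"/>

  <p:output port="result"/>

  <!-- "document.pdf" is a hypothetical file name; substitute your own. -->
  <p:load href="document.pdf" content-type="application/pdf"/>

  <!-- Request per-page details in addition to the document-level summary. -->
  <cx:pdf-info page-details="true"/>
</p:declare-step>
```

The loaded PDF reaches the step through the default connection between the primary output of `p:load` and the `source` port.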
Options
page-details
If page-details is true, details about each page are included in the result.

page-text
If page-text is true, the text content of each page is included in the result. There is tremendous variation in the way text can be placed on a page, so this option returns more or less useful results depending on how the page was constructed. If the results are unusable, another approach that might work is to convert the pages to images and then OCR them with the cx:tesseract step.

form-details
If form-details is true and the PDF contains a fillable form, details about the form fields are included in the result. The fields are listed in “field tree” order from the PDF, which may not reflect the order of the fields on the page.
At present, only the fields directly in the field tree are included. It’s likely that more complex arrangements are possible (nested fields, etc.). If you find such a PDF, and you can provide it as an example, please open an issue for it.
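The options can be combined. The following sketch assumes the caller supplies the PDF on the pipeline's own `source` port and requests both page text and form field details. It only illustrates the wiring; it is not taken from the XML Calabash test suite.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="3.0">
  <p:import href="https://xmlcalabash.com/ext/library/pdf-steps.xpl"/>

  <!-- The caller provides the PDF document on this port. -->
  <p:input port="source" content-types="application/pdf"/>
  <p:output port="result" content-types="xml"/>

  <!-- Ask for the text of each page and for any fillable form fields. -->
  <cx:pdf-info page-text="true" form-details="true"/>
</p:declare-step>
```

Options left unset keep their default value, the empty sequence, as listed in the synopsis.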
Document properties
No document properties are preserved.
Additional examples
The XML Calabash test suite contains examples of the cx:pdf-info step.