Name

cx:fileset — Approximates the Ant notion of a fileset.

Synopsis

This step selects files using a vocabulary based on the FileSet vocabulary of Ant.

Input port    Primary   Sequence   Content types   Default binding
source        ✔         ✔          xml             p:empty

Output port   Primary   Sequence   Content types   Default binding
result        ✔         ✔          xml

Option name            Type         Default value   Required
path                   xs:string                    ✔
case-sensitive         xs:boolean   true()
default-excludes       xs:boolean   true()
detailed               xs:boolean   false()
error-on-missing-dir   xs:boolean   true()
excludes               xs:string?   ()
follow-symlinks        xs:boolean   true()
includes               xs:string?   ()

This is an extension step; to use it, your pipeline must include its declaration. For example, by including the extension library with an import at the top of your pipeline:
<p:import href="https://xmlcalabash.com/ext/library/fileset.xpl"/>
Declaration
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc">
   <p:input port="source" content-types="xml" sequence="true">
      <p:empty/>
   </p:input>
   <p:output port="result" content-types="xml" sequence="true"/>
   <p:option name="path" as="xs:string" required="true"/>
   <p:option name="default-excludes" as="xs:boolean" select="true()"/>
   <p:option name="case-sensitive" as="xs:boolean" select="true()"/>
   <p:option name="error-on-missing-dir" as="xs:boolean" select="true()"/>
   <p:option name="follow-symlinks" as="xs:boolean" select="true()"/>
   <p:option name="includes" as="xs:string?"/>
   <p:option name="excludes" as="xs:string?"/>
   <p:option name="detailed" as="xs:boolean" select="false()"/>
</p:declare-step>
Errors
Code           Description
cxerr:XC0043   It is a dynamic error (cxerr:XC0043) if the source document is not a fileset document.
cxerr:XC0044   It is a dynamic error (cxerr:XC0044) if default-excludes, case-sensitive, error-on-missing-dir, or follow-symlinks is specified both on the step and on the fileset element.
cxerr:XC0051   It is a dynamic error (cxerr:XC0051) if the path provided is not a file: URI.
cxerr:XC0052   It is a dynamic error (cxerr:XC0052) if the path provided does not exist and error-on-missing-dir is true.

Description

This step is based on the Ant notion of FileSets. Conceptually, all of the files below some starting point (the path) are filtered through a set of inclusions and exclusions. Any files that satisfy the filters are further subjected to a set of selectors (which may use mappers) to determine if they are in the final result or not.

The cx:fileset step differs from Ant in some ways:

  1. The files selected by cx:fileset are normalized with urify(). They exclusively use “/” as the path separator.

  2. The step is only concerned with selecting existing files.

    1. The “type” selector is not provided.

    2. Attributes related to filtering directories are not provided.

    3. It doesn’t support anything like Ant FilterSets which change the contents of files.

  3. The names of the elements and attributes have been changed to make them more consistent with XProc (“compound-name” instead of “compoundname”).

  4. The “scriptselector” is not supported.

  5. You can’t provide a custom filter or mapper class; the step isn’t extensible in that way.

  6. Ant has a number of features for defining filters, patterns, and selectors and then reusing them by reference. XProc has lots of features for constructing XML documents, so those mechanisms are not supported.

In order to avoid having all of Ant become a dependency, this is a reimplementation of the functionality. The Ant documentation isn’t especially precise. It’s possible that there are unintentional semantic differences beyond the differences outlined here. (Please report them.)

This step differs substantially from the p:directory-list step in that the include and exclude filters are “globs”, not regular expressions!

If the source port is not empty, it must contain a fileset document. It is a dynamic error (cxerr:XC0043) if the source document is not a fileset document.

Important

Where regular expressions are used, they use the regular expression syntax of the underlying platform. This will change in the future and the XPath regular expression syntax will be used instead. For most simple cases, there’s no difference.

Step options

The cx:fileset step has several options.

path

The path. The value provided will be normalized with the urify() function. The resulting URI must be a “file” URI. It is a dynamic error (cxerr:XC0051) if the path provided is not a file: URI.

case-sensitive

Several selectors support case-sensitive or case-insensitive comparisons. This option provides the default value for those selectors.

default-excludes

If this option is true, the default exclusions are automatically used.

detailed

If this option is true, detailed information will be provided about each file. See p:directory-list for more details.

error-on-missing-dir

It is a dynamic error (cxerr:XC0052) if the path provided does not exist and error-on-missing-dir is true.

excludes

A space- or comma-separated list of globs. Any file that matches will be excluded.

follow-symlinks

Several selectors have an option to follow symbolic links. This option provides the default value for those selectors.

includes

A space- or comma-separated list of globs. Any file that matches will be included, unless it also matches an exclusion. If no inclusions are provided, all of the descendants are included by default.
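
As a rough sketch of how these options fit together, the following invocation would select XML and XSLT files anywhere below a starting directory while skipping everything under any directory named build. The path and the patterns are hypothetical, and the cx prefix is assumed to be bound by the surrounding pipeline.

```xml
<!-- Hypothetical path and patterns, for illustration only. -->
<!-- "*.xml" is not anchored, so it matches at any depth below the path. -->
<cx:fileset path="/path/to/project"
            includes="*.xml *.xsl"
            excludes="build/**/*"/>
```

Because the source port defaults to empty, no fileset document is needed when the options alone express the selection.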

The fileset document

The fileset document contains zero or more include or exclude elements and zero or more selectors. Some selectors may contain mappers.

A selector tests each file that is filtered through the includes and excludes. If the file “passes” the selection test, it remains included. If it “fails”, it is removed.

Selectors that compare the file in question against other files on the filesystem may use mappers to transform the filename. For example:

<cx:fileset path="/path/to/input" includes="*.svg">
  <p:with-input>
    <fileset>
      <present target-dir="/path/to/output">
        <glob-mapper from="*.svg" to="*.png"/>
      </present>
    </fileset>
  </p:with-input>
</cx:fileset>

This fileset begins with all of the SVG files under /path/to/input. For each one, the present selector asks whether a file exists at the equivalent location under the /path/to/output directory. The glob-mapper changes the .svg extension to .png before that test is made.

Suppose /path/to/input contains a.svg, b.svg, and c.svg and /path/to/output contains a.svg, b.png, and d.png.

The only file returned will be b.svg. Neither a.svg nor c.svg will be returned because there’s no equivalent PNG file in the output directory. And d.svg won’t be returned because there’s no such file in the input directory.

If more than one selector is provided, they are combined as if by a single “and” selector containing all of them: a file must pass every selector to remain included.

The fileset element

The fileset element contains the configuration of the step.

<fs:fileset
  default-excludes? = boolean        Enable default exclusions
  case-sensitive? = boolean          Default case sensitivity
  error-on-missing-dir? = boolean    Raise an error if the path doesn't exist
  follow-symlinks? = boolean         Default value for following symlinks
  includes? = string                 One or more glob patterns to include
  excludes? = string                 One or more glob patterns to exclude
>
  (include |
   exclude |
   (contains |
    date |
    depend |
    depth |
    different |
    filename |
    present |
    contains-regexp |
    size |
    readable |
    writable |
    executable |
    symlink |
    owned-by |
    posix-group |
    posix-permissions |
    content-type))*
</fs:fileset>

The options for default exclusions, case sensitivity, whether it’s an error if the path is missing, and the default for following symbolic links may be specified on the step or on the fileset element, but not on both. It is a dynamic error (cxerr:XC0044) if they are specified in both places. If inclusions or exclusions are specified in both places, both sets of patterns (and the patterns from any nested include or exclude elements) are used.

The include element

The include element identifies a (single) glob pattern to include. If the if attribute is false or the unless attribute is true, the element is ignored.

<fs:include
  name = string       A single glob pattern
  if? = string        Use this inclusion if this is true
  unless? = string    Use this inclusion unless this is true
 />

The exclude element

The exclude element identifies a (single) glob pattern to exclude. If the if attribute is false or the unless attribute is true, the element is ignored.

<fs:exclude
  name = string       A single glob pattern
  if? = string        Use this exclusion if this is true
  unless? = string    Use this exclusion unless this is true
 />
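
The following sketch shows include and exclude elements combined with a pattern given on the step. The paths and patterns are hypothetical, and namespace declarations are omitted in the same way as in this document's other example.

```xml
<cx:fileset path="/path/to/project" includes="*.xml">
  <p:with-input>
    <fileset>
      <!-- Used in addition to the "*.xml" pattern given on the step. -->
      <include name="*.xsl"/>
      <!-- Drops everything below any directory named drafts. -->
      <exclude name="drafts/**/*"/>
    </fileset>
  </p:with-input>
</cx:fileset>
```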

The contains element

The contains element selects a file if it contains the specified text.

<fs:contains
  text = string                   The text that must appear in the document
  case-sensitive? = boolean       Should the search be case sensitive?
  ignore-whitespace? = boolean    Ignore whitespace?
  encoding? = string              The encoding to use when reading the document
 />

If ignore-whitespace is true, all whitespace is stripped from the search text and the file before making the comparison.

The content-type element

The content-type element selects a file if it has one of the specified content types.

<fs:content-type
  content-types = string    The list of content types (as per p:input)
 />

This selector is not present in Ant.

The date element

The date element selects a file if its last modified time matches the constraints specified.

<fs:date
  date-time = dateTime         The target date-time
  when = before|after|equal    The relationship to test
  granularity? = integer       Granularity of comparison
 />

If when is “before”, the last modified time must be before the specified date-time. If it’s “after”, it must be after. If it’s “equal” (the default), it must be equal.

For the purposes of comparison, the last modified time is equal to the specified date-time if it’s within granularity milliseconds of it. On Windows, the default granularity is 2 seconds; on other systems, it’s 0.
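
As an illustration, a fileset document along the following lines would keep only files modified after a given moment. The date-time value is an arbitrary example, and namespace declarations are omitted.

```xml
<fileset>
  <!-- Keep files whose last modified time is after the start of 2024 (UTC). -->
  <date date-time="2024-01-01T00:00:00Z" when="after"/>
</fileset>
```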

The depend element

The depend element selects a file if it exists under the target-dir and if the target file is newer.

<fs:depend
  target-dir? = anyURI      The target directory
  granularity? = integer    Granularity of comparison
>
  (identity-mapper |
   flatten-mapper |
   merge-mapper |
   glob-mapper |
   regexp-mapper |
   package-mapper |
   unpackage-mapper |
   composite-mapper |
   chained-mapper |
   first-match-mapper |
   cut-dirs-mapper)?
</fs:depend>

For the purposes of comparison, two files are considered to have the same last modified time if they are within granularity milliseconds of each other. On Windows, the default granularity is 2 seconds; on other systems, it’s 0.

If no mapper is specified, the identity-mapper is used.

The depth element

The depth element selects a file if it is at least min directory levels and at most max directory levels from the root.

<fs:depth
  min? = integer    The minimum depth
  max? = integer    The maximum depth
 />

If min is unspecified, it defaults to 0. If max is unspecified, it defaults to ∞.
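
A minimal sketch of the depth selector follows. It assumes that files directly in the starting directory are at depth 0, which this document does not state explicitly; namespace declarations are omitted.

```xml
<fileset>
  <!-- Under the assumption above: skip files directly in the starting
       directory and anything more than two directory levels down. -->
  <depth min="1" max="2"/>
</fileset>
```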

The different element

The different element selects a file if it exists under the target-dir and is different.

<fs:different
  target-dir? = anyURI            The target directory
  ignore-file-times? = boolean    Ignore last modified time?
  ignore-contents? = boolean      Ignore the file contents?
  granularity? = integer          Granularity of comparison
>
  (identity-mapper |
   flatten-mapper |
   merge-mapper |
   glob-mapper |
   regexp-mapper |
   package-mapper |
   unpackage-mapper |
   composite-mapper |
   chained-mapper |
   first-match-mapper |
   cut-dirs-mapper)?
</fs:different>

The default value for ignore-file-times is true; the default value for ignore-contents is false.

For the purposes of comparison, two files are considered to have the same last modified time if they are within granularity milliseconds of each other. On Windows, the default granularity is 2 seconds; on other systems, it’s 0.

If no mapper is specified, the identity-mapper is used.

The filename element

The filename element matches a file if it matches the glob or regular expression provided.

<fs:filename
  name? = string                A single glob pattern
  regex? = string               A regular expression
  case-sensitive? = boolean     Should the comparison be case sensitive?
  negate? = boolean             Reverse the effect of the selection
 />

Exactly one of name or regex must be provided.

This element is like the include element, but it can be combined with other selectors. If negate is true, this element is like the exclude element.
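
The following sketch combines two filename selectors. The patterns are hypothetical, the regular expression is written so that it behaves the same whether the implementation matches the whole name or searches within it, and namespace declarations are omitted.

```xml
<fileset>
  <!-- Keep names ending in .xml, ignoring case ... -->
  <filename name="*.xml" case-sensitive="false"/>
  <!-- ... but drop those that match the regular expression. -->
  <filename regex=".*-draft\.xml" negate="true"/>
</fileset>
```

Because multiple selectors behave like a single “and”, a file has to pass both tests to be selected.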

The present element

The present element selects a file by comparing it against an equivalent file under the target-dir.

<fs:present
  target-dir = anyURI        The target directory
  present? = srconly|both    Only in the source, or in both?
>
  (identity-mapper |
   flatten-mapper |
   merge-mapper |
   glob-mapper |
   regexp-mapper |
   package-mapper |
   unpackage-mapper |
   composite-mapper |
   chained-mapper |
   first-match-mapper |
   cut-dirs-mapper)?
</fs:present>

If present is “srconly”, the file is selected if it only exists in the source (if it is not under the target-dir). If present is “both”, the file is selected if it exists in both places. (A “target only” value is incoherent because the comparison is always against a file that does exist.)

If no mapper is specified, the identity-mapper is used.
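
As a sketch of the “srconly” case, the following variation on the earlier SVG example would select the SVG files that have no PNG counterpart yet. The paths are the same hypothetical ones used in that example.

```xml
<cx:fileset path="/path/to/input" includes="*.svg">
  <p:with-input>
    <fileset>
      <!-- Select a file only if its mapped name is missing from the target. -->
      <present target-dir="/path/to/output" present="srconly">
        <glob-mapper from="*.svg" to="*.png"/>
      </present>
    </fileset>
  </p:with-input>
</cx:fileset>
```

With a.svg, b.svg, and c.svg in the input directory and a.svg, b.png, and d.png in the output directory, this would be expected to return a.svg and c.svg, because only b.svg has an equivalent PNG file.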

The contains-regexp element

The contains-regexp element selects a file if it contains the specified regular expression.

<fs:contains-regexp
  expression = string          The expression to match
  case-sensitive? = boolean    Should the search be case sensitive?
  encoding? = string           The encoding to use when reading the document
  multi-line? = boolean        Use multi-line searches
  single-line? = boolean       Use single-line searches
 />

If multi-line is true, the match may extend across line breaks.

If single-line is true, “.” may match newlines. (This is the “dotall” flag in Java regular expressions.)

The size element

The size element selects a file based on its size.

<fs:size
  value = integer                             The value
  units? = da|h|k|M|G|T|P|Ki|Mi|Gi|Ti|Pi      The units
  when? = less|more|equal                     The relationship to test
 />

If no units are specified, the value is an exact number of bytes. If units are specified, they have the following effect: da multiplies the value by 10; h, by 100; k, by 1,000; M, by 1,000,000; G, by 10⁹; T, by 10¹²; and P, by 10¹⁵.

The remaining units multiply by powers of two: Ki, 1024; Mi, 1024²; Gi, 1024³; Ti, 1024⁴; and Pi, 1024⁵.

The when attribute determines how the actual file size must compare to the specified value.
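
To make the unit arithmetic concrete, the sketch below selects files in a size range. The values are arbitrary examples, and namespace declarations are omitted.

```xml
<fileset>
  <!-- More than 4 Ki = 4 × 1024 = 4,096 bytes ... -->
  <size value="4" units="Ki" when="more"/>
  <!-- ... and less than 10 M = 10 × 1,000,000 = 10,000,000 bytes. -->
  <size value="10" units="M" when="less"/>
</fileset>
```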

The readable element

The readable element selects a file if it is readable by the user.

<fs:readable/>

The writable element

The writable element selects a file if it is writable by the user.

<fs:writable/>

The executable element

The executable element selects a file if it is executable by the user.

<fs:executable/>

The owned-by element

The owned-by element selects files that are owned by owner.

<fs:owned-by
  owner = string                The owner name
  follow-symlinks? = boolean    Follow symlinks?
 />

If the file being tested is a symbolic link and follow-symlinks is true, the ownership of the file linked to is tested. Otherwise the owner of the link is tested.

The posix-group element

The posix-group element selects files that are members of the specified group.

<fs:posix-group
  group = string                The group name
  follow-symlinks? = boolean    Follow symlinks?
 />

If the file being tested is a symbolic link and follow-symlinks is true, the group of the file linked to is tested. Otherwise the group of the link is tested.

This selector requires a POSIX compatible filesystem. It will always return false in other cases, for example, on Windows.

The posix-permissions element

The posix-permissions element selects files that have specific permissions.

<fs:posix-permissions
  permissions = string          The permissions
  follow-symlinks? = boolean    Follow symlinks?
 />

Permissions can be expressed as an octal number (“755”) or using the r/w/- notation (“rwxr-xr-x”). In either case, the match is exact. The selector doesn’t support any kind of wildcard matching.

If the file being tested is a symbolic link and follow-symlinks is true, the permissions of the file linked to are tested. Otherwise the permissions of the link are tested.

This selector requires a POSIX compatible filesystem. It will always return false in other cases, for example, on Windows.

The chained-mapper element

The chained-mapper returns the result of applying each mapper in turn. The initial file is the input to the first mapper, the output from the first mapper is the input to the second mapper, and so on.

<fs:chained-mapper>
  (identity-mapper |
   flatten-mapper |
   merge-mapper |
   glob-mapper |
   regexp-mapper |
   package-mapper |
   unpackage-mapper |
   composite-mapper |
   chained-mapper |
   first-match-mapper |
   cut-dirs-mapper)+
</fs:chained-mapper>
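
As a sketch of chaining, the fragment below flattens a path and then changes its extension before the depend selector looks for the target file. The target directory and file names are hypothetical, and namespace declarations are omitted.

```xml
<depend target-dir="/path/to/output">
  <chained-mapper>
    <!-- First step: img/icons/logo.svg becomes logo.svg ... -->
    <flatten-mapper/>
    <!-- Second step: ... which then becomes logo.png. -->
    <glob-mapper from="*.svg" to="*.png"/>
  </chained-mapper>
</depend>
```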

The composite-mapper element

The composite-mapper returns the result of applying the (same) initial file to each mapper. This is the union of all the mapper outputs.

<fs:composite-mapper>
  (identity-mapper |
   flatten-mapper |
   merge-mapper |
   glob-mapper |
   regexp-mapper |
   package-mapper |
   unpackage-mapper |
   composite-mapper |
   chained-mapper |
   first-match-mapper |
   cut-dirs-mapper)+
</fs:composite-mapper>

The cut-dirs-mapper element

The cut-dirs-mapper removes dirs path segments from the front of the file. If the file has fewer path segments, nothing is returned.

<fs:cut-dirs-mapper
  dirs = integer    The number of directories to remove
 />

The first-match-mapper element

The first-match-mapper applies the original file to each mapper in turn, returning the results of the first mapper that succeeds.

<fs:first-match-mapper>
  (identity-mapper |
   flatten-mapper |
   merge-mapper |
   glob-mapper |
   regexp-mapper |
   package-mapper |
   unpackage-mapper |
   composite-mapper |
   chained-mapper |
   first-match-mapper |
   cut-dirs-mapper)+
</fs:first-match-mapper>

The flatten-mapper element

The flatten-mapper returns the filename of the original file with all leading path segments removed.

<fs:flatten-mapper/>

The glob-mapper element

The glob-mapper matches the original file against the from glob. If it doesn’t match, nothing is returned. If it does match, the to glob is used to change the file.

<fs:glob-mapper
  from = string                The match glob
  to = string                  The target glob
  case-sensitive? = boolean    Default case sensitivity
 />

The from and to globs must each contain exactly one *. The text matched by the * in the from glob is used to replace the * in the to glob.

The identity-mapper element

The identity-mapper returns the original file unchanged.

<fs:identity-mapper/>

The merge-mapper element

The merge-mapper returns the to value irrespective of the original file.

<fs:merge-mapper
  to = string    The target value
 />

The package-mapper element

The package-mapper applies the same processing as the glob-mapper, then replaces all of the “/” characters with “.”.

<fs:package-mapper
  from = string    The match glob
  to = string      The target glob
 />

This turns, for example, org/example/package/Class.java into org.example.package.Class.html assuming the from value is “*.java” and the to value is “*.html”.

The unpackage-mapper element

The unpackage-mapper applies the same processing as the glob-mapper, then replaces all but the last “.” character with “/”.

<fs:unpackage-mapper
  from = string    The match glob
  to = string      The target glob
 />

This is the reverse of the package-mapper.

The regexp-mapper element

The regexp-mapper matches the original file against the from regular expression. If it doesn’t match, nothing is returned. If it does match, the to expression is used to change the file.

<fs:regexp-mapper
  from = string                The match regular expression
  to = string                  The replacement expression
  case-sensitive? = boolean    Default case sensitivity
 />

Each occurrence of “\0” to “\9” is replaced with the corresponding match group from the from expression (where “\0” is the whole string, “\1” is the first match group, etc.).
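
For example, a mapper along these lines would be expected to turn org/example/Main.java into org/example/Main.class. The file names are hypothetical, and the expression sticks to syntax that is common to Java and XPath regular expressions.

```xml
<!-- \1 is replaced by the text captured by the first group, (.*). -->
<regexp-mapper from="^(.*)\.java$" to="\1.class"/>
```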

Default exclusions

The default exclusions are:

Any file or directory named .DS_Store, .bzr, .bzrignore, .cvsignore, .git, .gitattributes, .gitignore, .gitmodules, .hg, .hgignore, .hgsub, .hgsubstate, .hgtags, .svn, CVS, SCCS, vssver.scc. If a directory name is excluded, so are all of its descendants.

Any file with a name that ends with ~.

Any file with a name that starts with .# or ._.

Any file with a name that begins and ends with # or begins and ends with %.

Globs

A glob is a file or path that may contain the wildcards “*” or “**/”.

The “*” wildcard matches any number of characters except “/”.

The “**/” wildcard matches any number of path segments (including none).

If a glob begins with “/”, it is anchored at the root. (Otherwise, it is logically preceded by “**/”.)
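
The rules above can be illustrated with a few include patterns. This is a sketch only: the names are hypothetical, “the root” is assumed to mean the directory given by the path option, and namespace declarations are omitted.

```xml
<fileset>
  <!-- Not anchored, so equivalent to "**/*.xml": any .xml file at any depth. -->
  <include name="*.xml"/>
  <!-- Anchored: only .xml files directly in the starting directory. -->
  <include name="/*.xml"/>
  <!-- Any index.html in or below a directory named docs, wherever docs is;
       "**/" may match no path segments, so docs/index.html also matches. -->
  <include name="docs/**/index.html"/>
</fileset>
```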