Name

cx:rdfa — Extracts RDFa from documents.

Synopsis

This step extracts triples from documents marked up with RDFa.

Input port    Primary   Sequence   Content types
source        ✔                    xml html

Output port   Primary   Sequence   Content types
result        ✔                    application/rdf+thrift

This is an extension step; to use it, your pipeline must include its declaration. One way to do that is to import the extension library at the top of your pipeline:
<p:import href="https://xmlcalabash.com/ext/library/rdf.xpl"/>
Declaration
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc">
   <p:input port="source" content-types="xml html"/>
   <p:output port="result" content-types="application/rdf+thrift"/>
</p:declare-step>

Description

The cx:rdfa step uses the Semargl libraries to extract triples from documents marked up with RDFa.

The output from the cx:rdfa step is always application/rdf+thrift, a binary format suitable for further RDF processing. To view or save the results, you will usually need to cast them to another RDF content type, such as text/turtle.
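
As a minimal sketch of that cast-then-save pattern, the pipeline below extracts the triples, casts them to Turtle, and writes them to disk with the standard p:store step. The file name triples.ttl is a hypothetical choice for illustration, and the sketch assumes the source document already has a suitable base URI.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="3.0">
  <p:import href="https://xmlcalabash.com/ext/library/rdf.xpl"/>
  <p:input port="source" content-types="xml html"/>
  <p:output port="result"/>

  <!-- Extract the triples; the result is binary application/rdf+thrift -->
  <cx:rdfa/>

  <!-- Cast to a readable serialization before viewing or saving -->
  <p:cast-content-type content-type="text/turtle"/>

  <!-- Hypothetical output location, resolved against the pipeline's base URI.
       p:store also passes the stored document through on its result port. -->
  <p:store href="triples.ttl"/>
</p:declare-step>
```

Casting before p:store means the saved file is human-readable Turtle rather than the binary Thrift encoding.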

Limitations

The Semargl library does not provide any access to the in-scope namespace bindings. Consequently, the step cannot resolve the prefix in a datatype attribute to a namespace, so it cannot tell which datatype is actually meant. The cx:rdfa step assumes that all datatypes are XML Schema data types, irrespective of their prefix.
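
To make the limitation concrete, consider the following input sketch. The my prefix and its namespace URI are hypothetical and are bound to something other than XML Schema. Going by the behavior described above, the literal would still be expected to come out typed as the XML Schema date type rather than as http://example.org/datatypes#date. This has not been verified against a running processor.

```xml
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:my="http://example.org/datatypes#">
  <head>
    <title>Datatype prefix example</title>
  </head>
  <body>
    <!-- "my" is a hypothetical prefix that does not point at XML Schema;
         per the limitation above, the prefix is not taken into account -->
    <p>Published:
      <em property="dc:created" content="2009-05-09"
          datatype="my:date">May 14th, 2009</em></p>
  </body>
</html>
```

If your documents rely on custom datatypes, the datatype IRIs in the extracted triples may need checking afterwards.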

Examples

Given a web page like this one:

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xs="http://www.w3.org/2001/XMLSchema"
      xmlns:dc="http://purl.org/dc/terms/"
      class="test">
<!-- Example from https://alistapart.com/article/introduction-to-rdfa/ -->
<head>
  <title>RDFa: Now everyone can have an API</title>
  <link rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/" />
</head>
<body>
  <h1>RDFa: Now everyone can have an API</h1>

  <p>Author: <em property="dc:creator">Mark Birbeck</em>
  Published: <em property="dc:created" content="2009-05-09" datatype="xs:date">May 14th, 2009</em></p>

  <img src="image1.png"
       rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/" />
  <img src="image2.png"
       rel="license" href="http://creativecommons.org/licenses/by-nc-nd/3.0/" />

  <p>Previous version: <a rel="dc:replaces" href="rdfa.0.8.html">version 0.8</a></p>

  <p><a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">CC Attribution-ShareAlike</a>
  </p>
</body>
</html>

This pipeline:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="3.0">
<p:import href="https://xmlcalabash.com/ext/library/rdf.xpl"/>
<p:input port="source"/>
<p:output port="result"/>

<!-- The base URI from the documentation build system is distracting -->
<p:set-properties properties="map{'base-uri': 'http://example.org/rdfa.html'}"/>

<cx:rdfa/>

<p:cast-content-type content-type="text/turtle"/>

</p:declare-step>

Produces Turtle output like this:

<http://example.org/rdfa.html>
        <http://purl.org/dc/terms/created>
                "2009-05-09"^^<http://www.w3.org/2001/XMLSchema#date>;
        <http://purl.org/dc/terms/creator>
                "Mark Birbeck";
        <http://purl.org/dc/terms/replaces>
                <http://example.org/rdfa.0.8.html>;
        <http://www.w3.org/1999/xhtml/vocab#license>
                <http://creativecommons.org/licenses/by-sa/3.0/> , <http://creativecommons.org/licenses/by-nc-nd/3.0/> .