Name
cx:rdfa — Extracts RDFa from documents.
Synopsis
This step extracts triples from documents marked up with RDFa.
Input port | Primary | Sequence | Content types |
---|---|---|---|
source | ✔ | | xml html |
Output port | Primary | Sequence | Content types |
---|---|---|---|
result | ✔ | | application/rdf+thrift |
This is an extension step; to use it, your pipeline must include its declaration,
for example by importing the extension library at the top of your
pipeline:
<p:import href="https://xmlcalabash.com/ext/library/rdf.xpl"/>
Declaration
|<p:declare-step xmlns:p="http://www.w3.org/ns/xproc">
| <p:input port="source" content-types="xml html"/>
| <p:output port="result" content-types="application/rdf+thrift"/>
|</p:declare-step>
Description
The cx:rdfa step uses the Semargl libraries to extract triples from documents
marked up with RDFa.
The output from the step is always application/rdf+thrift, a binary format
suitable for further RDF processing. To view or save the results, it is often
necessary to cast them to another RDF content type, such as text/turtle.
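A minimal sketch of such a cast, assuming text/turtle as the target type. The file name triples.ttl is hypothetical and chosen only for illustration; p:store is the standard XProc 3.0 step for writing a document to disk:

|<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
|                xmlns:cx="http://xmlcalabash.com/ns/extensions"
|                version="3.0">
|  <p:import href="https://xmlcalabash.com/ext/library/rdf.xpl"/>
|  <p:input port="source" content-types="xml html"/>
|  <p:output port="result"/>
|
|  <!-- Extract the triples; the result is binary application/rdf+thrift -->
|  <cx:rdfa/>
|
|  <!-- Cast to a textual serialization so the triples can be read or saved -->
|  <p:cast-content-type content-type="text/turtle"/>
|
|  <!-- triples.ttl is a hypothetical file name -->
|  <p:store href="triples.ttl"/>
|</p:declare-step>

Because p:store passes its input through on its result port, the Turtle document is also available on the pipeline's result port.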
Limitations
The Semargl library does not provide any access to the in-scope namespace
bindings. Consequently, the step cannot resolve the prefix of a datatype
to its namespace, so it is impossible to tell which datatypes are being used.
The cx:rdfa step assumes that they are XML Schema datatypes,
irrespective of their prefix.
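To illustrate, consider a hypothetical fragment in which the datatype prefix is bound to a namespace other than XML Schema. The ex prefix and its namespace URI are assumptions made up for this sketch:

|<p xmlns:dc="http://purl.org/dc/terms/"
|   xmlns:ex="http://example.org/types#">
|  <!-- ex: is not bound to the XML Schema namespace -->
|  Published: <em property="dc:created" content="2009-05-09" datatype="ex:date">May 9th, 2009</em>
|</p>

Given the limitation described above, the literal would be expected to be typed as http://www.w3.org/2001/XMLSchema#date rather than http://example.org/types#date.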
Examples
Given a web page like this one:
|<html xmlns="http://www.w3.org/1999/xhtml"
| xmlns:xs="http://www.w3.org/2001/XMLSchema"
| xmlns:dc="http://purl.org/dc/terms/"
| class="test">
|<!-- Example from https://alistapart.com/article/introduction-to-rdfa/ -->
|<head>
| <title>RDFa: Now everyone can have an API</title>
| <link rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/" />
|</head>
|<body>
| <h1>RDFa: Now everyone can have an API</h1>
|
| <p>Author: <em property="dc:creator">Mark Birbeck</em>
| Published: <em property="dc:created" content="2009-05-09" datatype="xs:date">May 14th, 2009</em></p>
|
| <img src="image1.png"
| rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/" />
| <img src="image2.png"
| rel="license" href="http://creativecommons.org/licenses/by-nc-nd/3.0/" />
|
| <p>Previous version: <a rel="dc:replaces" href="rdfa.0.8.html">version 0.8</a></p>
|
| <p><a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">CC Attribution-ShareAlike</a>
| </p>
|</body>
|</html>
This pipeline:
|<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
| xmlns:cx="http://xmlcalabash.com/ns/extensions"
| version="3.0">
|<p:import href="https://xmlcalabash.com/ext/library/rdf.xpl"/>
|<p:input port="source"/>
|<p:output port="result"/>
|
|<!-- The base URI from the documentation build system is distracting -->
|<p:set-properties properties="map{'base-uri': 'http://example.org/rdfa.html'}"/>
|
|<cx:rdfa/>
|
|<p:cast-content-type content-type="text/turtle"/>
|
|</p:declare-step>
Produces Turtle output like this:
|<http://example.org/rdfa.html>
|<http://purl.org/dc/terms/created>
|"2009-05-09"^^<http://www.w3.org/2001/XMLSchema#date>;
|<http://purl.org/dc/terms/creator>
|"Mark Birbeck";
|<http://purl.org/dc/terms/replaces>
|<http://example.org/rdfa.0.8.html>;
|<http://www.w3.org/1999/xhtml/vocab#license>
|<http://creativecommons.org/licenses/by-sa/3.0/> , <http://creativecommons.org/licenses/by-nc-nd/3.0/> .