Experiences with Pandoc

Richards, Jan jrichards at ocadu.ca
Fri Aug 22 20:50:44 UTC 2014

Since Pandoc's raison d'être is to "convert files from one markup format into another" (http://johnmacfarlane.net/pandoc/), it's a major problem that it strips accessibility-related content! I assume this means that the internal intermediate format it uses (since I can't imagine there are 1-to-1 conversion paths for every format combination) isn't rich enough to carry accessibility information.

As an editor on ATAG 2.0, I note that this kind of loss of accessibility information is something that ATAG 2.0 tries to prevent (http://www.w3.org/TR/ATAG20/#gl_b12).



T 416 977 6000 x3957
F 416 977 9844
E jrichards at ocadu.ca

From: fluid-work-bounces at fluidproject.org [fluid-work-bounces at fluidproject.org] on behalf of Cheetham, Anastasia [acheetham at ocadu.ca]
Sent: August-22-14 2:19 PM
To: Fluid Work
Subject: Experiences with Pandoc

Jon and I have been working with Pandoc (http://johnmacfarlane.net/Pandoc/) to produce our EPUB 3 exemplar resource.

Pandoc is a great conversion tool, and it has been helpful in the EPUB work, since it
1) creates the navigation XML file,
2) creates the manifest XML file,
3) creates the metadata, and
4) packages it all up.

But Pandoc has some limitations, and I’m beginning to wonder if we’ve run up against those limitations hard enough to want to stop using it.

Two of the EPUB 3 features we wish to use are a) media overlays and b) enhanced TTS support. Both of these require specific markup to be present in the HTML. Unfortunately, Pandoc strips much of this markup (and moves some of it!) as part of its conversion process. (Pandoc also strips ARIA attributes off block-level elements.)

(We raised these issues with the Pandoc developers: https://github.com/jgm/pandoc/issues/1555. They did investigate and ponder, but their basic conclusion was "If you already have your input in HTML, pandoc may not be the best approach.")
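
One way to see exactly what a conversion drops is to count the relevant attributes before and after. The following is only a rough sketch: the function names and the default prefix list (aria-, ssml:, epub:) are my own assumptions about what is worth checking, not anything Pandoc provides.

```python
from collections import Counter
from html.parser import HTMLParser


class AttributeCounter(HTMLParser):
    """Counts attributes whose names start with one of the given prefixes."""

    def __init__(self, prefixes):
        super().__init__()
        self.prefixes = tuple(prefixes)
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names before passing them in.
        for name, _value in attrs:
            if name.startswith(self.prefixes):
                self.counts[name] += 1


def count_attributes(markup, prefixes=("aria-", "ssml:", "epub:")):
    parser = AttributeCounter(prefixes)
    parser.feed(markup)
    parser.close()
    return parser.counts


def lost_attributes(original, converted, prefixes=("aria-", "ssml:", "epub:")):
    """Attributes that occur more often in the original than in the converted markup."""
    return count_attributes(original, prefixes) - count_attributes(converted, prefixes)
```

Feeding it the source HTML and the corresponding file pulled out of the generated archive gives a per-attribute tally of what went missing. It does not detect markup that was moved rather than removed.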

In addition to modifying the HTML in ways that cause problems, Pandoc cannot include in its output the extra SMIL and audio files required for media overlays.

To work around these problems, we’ve tried a process that is roughly as follows:
- use Pandoc to create the archive using the HTML;
- unzip the archive to get access to the content;
- restore the HTML to its original, unmodified form;
- add the SMIL files, audio files and lexicons to the archive;
- edit the manifest to add references to the SMIL files, audio files and lexicons;
- re-insert the manifest and the HTML into the archive.
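
The steps above could be scripted roughly as follows. This is a sketch under assumptions, not our actual tooling: the function names, the parameters and every path passed in are hypothetical, and it writes a new archive rather than editing the Pandoc one in place. It also only adds plain manifest entries; it does not set the media-overlay attribute on the content documents or add duration metadata, which media overlays also need.

```python
import xml.etree.ElementTree as ET
import zipfile

OPF_NS = "http://www.idpf.org/2007/opf"
DC_NS = "http://purl.org/dc/elements/1.1/"


def add_manifest_items(opf_bytes, items):
    """Append <item> entries to the package manifest.

    items is a list of (id, href, media_type) tuples.
    """
    # Keep the usual prefixes when the document is serialized again.
    ET.register_namespace("", OPF_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.fromstring(opf_bytes)
    manifest = root.find(f"{{{OPF_NS}}}manifest")
    for item_id, href, media_type in items:
        ET.SubElement(
            manifest,
            f"{{{OPF_NS}}}item",
            {"id": item_id, "href": href, "media-type": media_type},
        )
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def repack(src_epub, dst_epub, opf_path, replacements, extras, manifest_items):
    """Copy src_epub to dst_epub, restoring HTML and adding extra files.

    replacements: {path in archive: original HTML bytes}
    extras:       {path in archive: bytes} for SMIL, audio and lexicon files
    """
    with zipfile.ZipFile(src_epub) as src, zipfile.ZipFile(dst_epub, "w") as dst:
        # The mimetype entry has to come first and be stored uncompressed.
        dst.writestr("mimetype", src.read("mimetype"),
                     compress_type=zipfile.ZIP_STORED)
        for name in src.namelist():
            if name == "mimetype":
                continue
            if name in replacements:
                data = replacements[name]
            elif name == opf_path:
                data = add_manifest_items(src.read(name), manifest_items)
            else:
                data = src.read(name)
            dst.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
        for name, data in extras.items():
            dst.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
```

The result should still be run through a validator, since nothing here checks that the added hrefs actually resolve.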

Further complication is added by the fact that Pandoc creates its own file/folder hierarchy within the archive and changes file names, so
- authoring the SMIL file, which will be inserted into the archive, requires knowledge of the final file locations in the archive (which currently differ from the layout in our project), and
- restoring the “original” HTML can also require adjusting paths in the markup.

These latter problems would be alleviated if we modified our GitHub repository structure to better mirror the hierarchy that Pandoc produces.
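
To illustrate the path problem: a SMIL file refers to its content document by a path relative to the SMIL file's own location inside the archive, so the reference changes whenever either file moves. A small sketch, where the helper name and the example locations are made up for illustration and are not Pandoc's actual layout:

```python
import posixpath


def href_between(from_file, to_file):
    """Relative reference from one file in the archive to another.

    Both arguments are archive paths using forward slashes.
    """
    start = posixpath.dirname(from_file) or "."
    return posixpath.relpath(to_file, start=start)


# Hypothetical locations:
print(href_between("EPUB/smil/ch1.smil", "EPUB/text/ch1.xhtml"))
# prints ../text/ch1.xhtml
```

If the repository mirrored the archive layout, the same references would work in both places and no rewriting would be needed.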

I’m beginning to think our workflow would be smoother and easier if we stopped using Pandoc and simply constructed the archive ourselves. This would require manual editing of the navigation, manifest and metadata, but we could start with what Pandoc has already generated – and we’re already modifying the manifest, anyway. And there is plenty of documentation on how to structure an EPUB 3 archive, as well as freely available validators.
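
For what it's worth, the packaging step itself is small. A minimal sketch, assuming the source folder already holds a hand-maintained META-INF/container.xml, package document, navigation document and content (the function name and folder layout are hypothetical):

```python
import os
import zipfile


def build_epub(source_dir, epub_path):
    """Zip a prepared folder into an EPUB container."""
    with zipfile.ZipFile(epub_path, "w") as epub:
        # The mimetype entry must be first and stored without compression.
        epub.writestr("mimetype", "application/epub+zip",
                      compress_type=zipfile.ZIP_STORED)
        for folder, _dirs, files in os.walk(source_dir):
            for filename in sorted(files):
                full_path = os.path.join(folder, filename)
                # Archive paths always use forward slashes.
                arcname = os.path.relpath(full_path, source_dir).replace(os.sep, "/")
                if arcname == "mimetype":
                    continue  # already written above
                epub.write(full_path, arcname,
                           compress_type=zipfile.ZIP_DEFLATED)
```

The work that Pandoc currently saves us is in keeping the navigation, manifest and metadata in sync with the content, not in the zipping.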

Does anyone have any thoughts on this?

Anastasia Cheetham     Inclusive Design Research Centre
acheetham at ocadu.ca           Inclusive Design Institute
                                        OCAD University
