Experiences with Pandoc

Cheetham, Anastasia acheetham at ocadu.ca
Fri Aug 22 18:19:34 UTC 2014

Jon and I have been working with Pandoc (http://johnmacfarlane.net/Pandoc/) to produce our EPUB 3 exemplar resource.

Pandoc is a great conversion tool, and it has been helpful in the EPUB work, since it
1) creates the navigation XML file,
2) creates the manifest XML file,
3) creates the metadata, and
4) packages it all up.

But Pandoc has some limitations, and I’m beginning to wonder if we’ve run up against those limitations hard enough to want to stop using it.

Two of the EPUB 3 features we wish to use are a) media overlays and b) enhanced TTS support. Both of these require specific markup to be present in the HTML. Unfortunately, Pandoc strips much of this markup (and moves some of it!) as part of its conversion process. (Pandoc also strips ARIA attributes off block-level elements.)

(We raised these issues with the Pandoc developers: https://github.com/jgm/pandoc/issues/1555. They did investigate and ponder, but their basic conclusion was "If you already have your input in HTML, pandoc may not be the best approach.")

In addition to modifying the HTML in ways that cause problems, Pandoc cannot include in its output the extra SMIL and audio files required for media overlays.

To work around these problems, we’ve tried a process that is roughly as follows:
- use Pandoc to create the archive using the HTML;
- unzip the archive to get access to the content;
- restore the HTML to its original, unmodified form;
- add the SMIL files, audio files and lexicons to the archive;
- edit the manifest to add references to the SMIL files, audio files and lexicons;
- re-insert the manifest and the HTML back into the archive.
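
For illustration, the steps above could be scripted roughly as follows. This is only a sketch using Python's standard library, not what we actually run; the function names (add_manifest_items, repack_epub) and the archive paths in the usage note are made up for the example, and it rewrites the whole archive rather than editing it in place.

```python
import xml.etree.ElementTree as ET
import zipfile

OPF_NS = "http://www.idpf.org/2007/opf"


def add_manifest_items(opf_bytes, items):
    """Return the package document with extra <item> entries in its manifest.

    items: iterable of (id, href, media_type) tuples (hypothetical values).
    """
    # Keep the OPF namespace as the default so the output stays readable.
    ET.register_namespace("", OPF_NS)
    ET.register_namespace("dc", "http://purl.org/dc/elements/1.1/")
    root = ET.fromstring(opf_bytes)
    manifest = root.find(f"{{{OPF_NS}}}manifest")
    for item_id, href, media_type in items:
        ET.SubElement(
            manifest,
            f"{{{OPF_NS}}}item",
            {"id": item_id, "href": href, "media-type": media_type},
        )
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def repack_epub(src_epub, dst_epub, replacements, additions):
    """Copy src_epub to dst_epub, swapping in restored files and adding new ones.

    replacements: archive path -> bytes (restored HTML, edited manifest)
    additions:    archive path -> bytes (SMIL files, audio files, lexicons)
    """
    with zipfile.ZipFile(src_epub) as src, zipfile.ZipFile(dst_epub, "w") as dst:
        # The EPUB container expects "mimetype" first and uncompressed.
        dst.writestr("mimetype", src.read("mimetype"),
                     compress_type=zipfile.ZIP_STORED)
        for info in src.infolist():
            if info.filename == "mimetype" or info.is_dir():
                continue
            data = replacements.get(info.filename, src.read(info.filename))
            dst.writestr(info.filename, data, compress_type=zipfile.ZIP_DEFLATED)
        for name, data in additions.items():
            dst.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
```

Usage would be along the lines of reading the package document out of the Pandoc-generated archive, passing it through add_manifest_items with entries such as ("ch1-smil", "ch1.smil", "application/smil+xml"), and then calling repack_epub with the edited manifest and restored HTML as replacements. Note that the sketch does not add the media-overlay attribute that each content document's manifest item needs in order to point at its SMIL file.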

A further complication is that Pandoc creates its own file/folder hierarchy within the archive and changes file names, so
- authoring the SMIL file, which will be inserted into the archive, requires knowledge of the final file locations in the archive (which currently differ from the layout of our project), and
- restoring the “original” HTML can also require adjusting paths in the markup.
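
One way to find out where Pandoc has put things is to read the manifest back out of the generated archive. The following is a rough sketch, again standard-library Python; list_manifest is a name invented for this example, and it assumes the hrefs in the manifest are plain relative paths with no percent-encoding.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"
OPF_NS = "http://www.idpf.org/2007/opf"


def list_manifest(epub):
    """Return (id, archive_path, media_type) for every manifest item."""
    with zipfile.ZipFile(epub) as z:
        # META-INF/container.xml points at the package document (the manifest).
        container = ET.fromstring(z.read("META-INF/container.xml"))
        rootfile = container.find(f".//{{{CONTAINER_NS}}}rootfile")
        opf_path = rootfile.get("full-path")
        package = ET.fromstring(z.read(opf_path))
    # Manifest hrefs are relative to the package document's folder.
    opf_dir = posixpath.dirname(opf_path)
    items = package.findall(f"{{{OPF_NS}}}manifest/{{{OPF_NS}}}item")
    return [
        (i.get("id"), posixpath.join(opf_dir, i.get("href")), i.get("media-type"))
        for i in items
    ]
```

The archive paths this returns are the ones the SMIL files and any restored HTML would need to agree with.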

These latter problems would be alleviated if we modified our GitHub repository structure to better mirror the hierarchy that Pandoc produces.

I’m beginning to think our workflow would be smoother and easier if we stopped using Pandoc and simply constructed the archive ourselves. This would require manual editing of the navigation, manifest and metadata, but we could start with what Pandoc has already generated – and we’re already modifying the manifest, anyway. And there is plenty of documentation on how to structure an EPUB 3 archive, as well as freely available validators.
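
As a rough idea of what constructing the archive ourselves might involve, here is a minimal sketch in standard-library Python. It assumes a source folder that is already laid out the way the archive should be (META-INF/container.xml, the package document, content, SMIL and audio); build_epub and the folder layout are assumptions for the example, and the output file should live outside the source folder.

```python
import os
import zipfile


def build_epub(source_dir, epub_path):
    """Zip source_dir into epub_path with "mimetype" first and uncompressed."""
    with zipfile.ZipFile(epub_path, "w") as z:
        # The mimetype entry must come first and must not be compressed.
        z.writestr("mimetype", "application/epub+zip",
                   compress_type=zipfile.ZIP_STORED)
        for folder, subdirs, files in os.walk(source_dir):
            subdirs.sort()  # deterministic traversal order
            for name in sorted(files):
                full = os.path.join(folder, name)
                arcname = os.path.relpath(full, source_dir).replace(os.sep, "/")
                # Skip any mimetype file on disk; it was written above.
                if arcname != "mimetype":
                    z.write(full, arcname, compress_type=zipfile.ZIP_DEFLATED)
```

The result would still need to go through one of the validators to confirm that the manifest and navigation we edit by hand are consistent with the files in the archive.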

Does anyone have any thoughts on this?

Anastasia Cheetham     Inclusive Design Research Centre
acheetham at ocadu.ca           Inclusive Design Institute
                                        OCAD University
