DAISY2.02 audiobooks with TOC PageList conversion to Readium WebPub Manifest

Daniel Weck edited this page Nov 25, 2021 · 12 revisions

Prerequisites:

  1. node --version => v16.13.0 (or greater)
  2. npm --version => 8.1.4 (or greater)

https://nodejs.org

Installation:

  1. npm install json-diff --global
  2. npm install r2-shared-js --global
  3. npm install r2-streamer-js --global

If the installation fails with filesystem permission errors, retry with sudo on Linux and Mac, or on Windows open the shell with "Run as administrator" (adding --unsafe-perm=true to the npm install command sometimes helps too).

Verify installed "binaries" (i.e. globally-available NodeJS scripts):

  1. which r2-shared-js-cli => /usr/local/bin/r2-shared-js-cli (for example, on Mac)
  2. which r2-streamer-js-server => /usr/local/bin/r2-streamer-js-server (for example, on Mac)

Note that a future revision of the CLI utilities will include a UNIX "shebang" at the top of the JS file in order to automatically invoke the Node executable. Until then, the scripts must be started explicitly with node; see below for examples.

Test the "r2-streamer-js" server

Assuming some EPUB files are present inside a folder path (replace PATH_TO_EPUB_FOLDER with your own filesystem location, which can be absolute or relative to the current pwd folder):

  • DEBUG=r2:* node /usr/local/bin/r2-streamer-js-server PATH_TO_EPUB_FOLDER (note that DEBUG=r2:* is optional, but useful to display runtime information in the console ... for even more verbosity, use DEBUG=*)
  • Open a web browser with URL http://127.0.0.1:3000 (as indicated in the console)
  • Click on any blue link at the top of the page (each link corresponds to an EPUB file discovered inside the folder, but note that subfolders are not scanned by this simple server demo / test CLI)
  • Click on the ./manifest.json/show/all link to display the Readium WebPub Manifest with clickable links to resources (images, CSS, HTML, etc.)
  • Note that the http://127.0.0.1:3000/pub/_ID_/manifest.json URL endpoint (without /show/all) serves the raw JSON resource, which is probably what a real-world deployment would use. The /show/all URL is there to facilitate debugging / exploration of the Readium WebPub Manifest JSON.
  • In the above URL, the _ID_ token represents the "unique identifier" of the publication served by the streamer software component. This is not the dc:identifier / ISBN / UUID, etc.; it is in fact the base64 encoding of the publication's filepath.
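
To illustrate how the _ID_ token relates to a filepath, here is a minimal sketch in plain Node.js. The function names (publicationId, pathFromPublicationId, manifestUrl) are hypothetical helpers, not part of r2-streamer-js, and the extra percent-encoding step is an assumption made so that base64 characters such as / and + remain safe inside a URL path. Check the streamer source for the exact encoding it applies.

```javascript
// Hypothetical helpers (not part of r2-streamer-js).

// Base64-encode a publication filepath, then percent-encode the result
// (assumption) so that "/", "+" and "=" do not break the URL path.
function publicationId(filePath) {
  const base64 = Buffer.from(filePath, "utf8").toString("base64");
  return encodeURIComponent(base64);
}

// Reverse operation: recover the filepath from an _ID_ token.
function pathFromPublicationId(id) {
  return Buffer.from(decodeURIComponent(id), "base64").toString("utf8");
}

// Build the manifest endpoint described above. With showAll = true the
// debugging-friendly /show/all variant is returned instead of the raw JSON URL.
function manifestUrl(serverRoot, filePath, showAll = false) {
  const url = `${serverRoot}/pub/${publicationId(filePath)}/manifest.json`;
  return showAll ? `${url}/show/all` : url;
}
```

For example, manifestUrl("http://127.0.0.1:3000", PATH_TO_EPUB_FILE) would yield the raw JSON endpoint for that file, provided the server encodes identifiers the same way.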

A production deployment of r2-streamer-js would typically not use the built-in CLI as-is (i.e. https://github.com/readium/r2-streamer-js/blob/develop/src/http/server-cli.ts); instead, a smarter CLI should be implemented to meet real-world needs. The core server runtime can be created with the following lines of code:

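        // Assumed import path (one of the package's compiled "dist" targets); adjust to your setup.
        import { Server } from "r2-streamer-js/dist/es8-es2017/src/http/server";

        const server = new Server({
            // options
        });
        server.preventRobots(); // for example
        // "files" is an array of publication filesystem paths.
        // This can be called any time after the server starts
        // (incremental add/remove of publications, cache management).
        server.addPublications(files);
        const url = await server.start(0, false); // "await" requires an async context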

See: https://github.com/readium/r2-streamer-js/blob/a2faa6140074418fc354bca792023b387cb837a3/src/http/server-cli.ts#L113-L118

Try the "r2-shared-js" CLI

Assuming some DAISY2.02 audio-only publications are present inside a folder path (replace PATH_TO_DAISY_FOLDER with your own filesystem location, which can be absolute or relative to the current pwd folder):

  • DEBUG=r2:* node /usr/local/bin/r2-shared-js-cli PATH_TO_DAISY_FOLDER/book.zip PATH_TO_DAISY_FOLDER generate-daisy-audio-manifest-only (note that DEBUG=r2:* is optional, but useful to display runtime information in the console ... for even more verbosity, use DEBUG=*)

In the above example, PATH_TO_DAISY_FOLDER/book.zip refers to a zipped DAISY fileset, but the command works with exploded / unzipped contents too:

  • DEBUG=r2:* node /usr/local/bin/r2-shared-js-cli PATH_TO_DAISY_FOLDER/book/ PATH_TO_DAISY_FOLDER generate-daisy-audio-manifest-only

When the DEBUG flag is used, the console displays the following in case of success: DAISY audio only book => manifest-audio.json and DAISY-EPUB-RWPM done.
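
When many DAISY filesets need converting, the command lines above can be driven from a small Node.js script. The following is a sketch under assumptions: the CLI path is the example Mac location shown earlier, the function names (buildCliArgs, convertDaisy) are hypothetical, and the script simply returns the process exit status without assuming what the CLI reports on failure.

```javascript
const { spawnSync } = require("child_process");

// Example location from "which r2-shared-js-cli" on Mac; adjust for your system.
const R2_SHARED_CLI = "/usr/local/bin/r2-shared-js-cli";

// Build the argument list passed to the node executable.
// audioManifestOnly = true appends the generate-daisy-audio-manifest-only
// parameter (audio-only DAISY books); false omits it.
function buildCliArgs(daisyPath, outputFolder, audioManifestOnly = true) {
  const args = [R2_SHARED_CLI, daisyPath, outputFolder];
  if (audioManifestOnly) {
    args.push("generate-daisy-audio-manifest-only");
  }
  return args;
}

// Run one conversion synchronously, with DEBUG=r2:* so that progress
// messages are printed to the console. Returns the process exit status.
function convertDaisy(daisyPath, outputFolder, audioManifestOnly = true) {
  const result = spawnSync(
    process.execPath, // the currently running node executable
    buildCliArgs(daisyPath, outputFolder, audioManifestOnly),
    { env: { ...process.env, DEBUG: "r2:*" }, stdio: "inherit" }
  );
  return result.status;
}
```

Calling convertDaisy("PATH_TO_DAISY_FOLDER/book.zip", "PATH_TO_DAISY_FOLDER") would then be equivalent to the first command line shown above.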

The Readium WebPub Manifest JSON files are named after the original DAISY file or folder, for example: book.zip_manifest.json for a zipped fileset, or book_manifest.json for an unzipped folder. This file naming convention is critical: the DAISY and JSON file names must be kept in sync.
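
The naming convention can be expressed as a pair of small functions, which may help when renaming or moving files in bulk. This is only a sketch based on the two examples given above; the function names are hypothetical and the "_manifest.json" suffix is inferred from those examples.

```javascript
const path = require("path");

// Suffix inferred from the examples book.zip_manifest.json / book_manifest.json.
const MANIFEST_SUFFIX = "_manifest.json";

// "PATH/book.zip" -> "book.zip_manifest.json"
// "PATH/book/"    -> "book_manifest.json" (trailing separator is ignored)
function manifestNameForDaisy(daisyPath) {
  return path.basename(daisyPath) + MANIFEST_SUFFIX;
}

// "PATH/book.zip_manifest.json" -> "book.zip"
// Returns null when the file does not follow the naming convention.
function daisyNameForManifest(manifestPath) {
  const name = path.basename(manifestPath);
  if (!name.endsWith(MANIFEST_SUFFIX)) {
    return null;
  }
  return name.slice(0, -MANIFEST_SUFFIX.length);
}
```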

Note that the generate-daisy-audio-manifest-only command line parameter can only be used with audio-only DAISY books, not with full-text full-audio publications. When this CLI parameter is omitted, a full .webpub zipped publication is generated in the destination folder instead of just the JSON manifest. The full conversion process involves renaming files from the original DAISY fileset (notably, XML vs. HTML vs. XHTML file extensions), and other files are created too (notably, DAISY3 DTBOOK to XHTML, or SMIL to XHTML). With audio-only books, the generated .webpub archive can be unzipped to reveal both manifest.json (i.e. the default one which relies on EPUB3 Media Overlays SMIL in order to preserve the phrase-level DAISY navigation) and manifest-audio.json (i.e. the simplified audiobook with reading order, TOC, pagelist, but no phrase-level navigation). A .webpub file can directly be ingested by the streamer via server.addPublications().

Wrapping up

Now, simply start the "r2-streamer-js" test server inside the folder that contains the generated JSON files and original DAISY filesets, in order to demonstrate them working together: DEBUG=r2:* node /usr/local/bin/r2-streamer-js-server PATH_TO_DAISY_FOLDER. The CLI offers an easy way to test the server, but in a real-world scenario the server.addPublications(files) JavaScript function would be called after the server is started to enable the on-demand streaming of the Readium WebPub Manifest JSON. For example, call server.addPublications([PATH_TO_JSON_FILE]) and the streamer will automatically find the corresponding original DAISY book based on the common root filename.
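
Because the streamer relies on the JSON file and the DAISY fileset sitting side by side, it can be useful to check a folder before handing paths to server.addPublications(). The sketch below lists only the manifest files whose DAISY counterpart is present in the same folder. The function name is hypothetical, and like the test CLI it does not scan subfolders.

```javascript
const fs = require("fs");
const path = require("path");

// Returns the sorted paths of all "*_manifest.json" files in the folder
// that have a sibling entry (zip file or unzipped folder) with the same
// root name, e.g. book.zip_manifest.json next to book.zip.
function findPublishableManifests(folder) {
  const suffix = "_manifest.json";
  const names = new Set(fs.readdirSync(folder));
  return [...names]
    .filter((name) => name.endsWith(suffix))
    .filter((name) => names.has(name.slice(0, -suffix.length)))
    .sort()
    .map((name) => path.join(folder, name));
}

// Possible usage, once the server has been started:
//   server.addPublications(findPublishableManifests(PATH_TO_DAISY_FOLDER));
```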

Note that the current r2-streamer-js implementation does not provide an out-of-the-box caching / memory management solution. It is therefore recommended to write additional logic based on server.removePublications(files) or server.uncachePublication(file) in order to ensure that the streamer runtime does not allocate unnecessary memory, and does not keep filesystem handles open during access to zipped publications or unzipped folders. See: https://github.com/readium/r2-streamer-js/blob/a2faa6140074418fc354bca792023b387cb837a3/src/http/server.ts#L304-L332 and: https://github.com/readium/r2-streamer-js/blob/a2faa6140074418fc354bca792023b387cb837a3/src/http/server.ts#L338-L411
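
As one possible shape for such additional logic, here is a minimal least-recently-used tracker. It is a sketch, not part of r2-streamer-js: the class name and its methods are hypothetical, and the eviction callback is where a call such as server.uncachePublication(file) would go. Deciding when a publication counts as "used" (for example, on each manifest request) is left to the hosting application.

```javascript
// Hypothetical helper: remembers which publications were used most
// recently and reports the ones that fall out of the allowed window.
class PublicationLru {
  constructor(maxEntries, onEvict) {
    this.maxEntries = maxEntries;
    this.onEvict = onEvict; // called with the evicted file path
    this.entries = new Map(); // Map keeps insertion order: oldest first
  }

  // Mark a publication as just used. Re-inserting moves it to the
  // "most recent" end; anything beyond maxEntries is evicted.
  touch(filePath) {
    this.entries.delete(filePath);
    this.entries.set(filePath, Date.now());
    while (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
      this.onEvict(oldest);
    }
  }
}

// Possible usage with the streamer (server created as shown earlier):
//   const lru = new PublicationLru(50, (file) => server.uncachePublication(file));
//   lru.touch(filePath); // whenever a publication is requested
```

A time-window strategy could reuse the stored timestamps, evicting entries older than a chosen age on a timer instead of on a size limit.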

Point of interest: Thorium (the desktop app) is currently the most active user of the streamer software component, which powers the application's publication backend service. The server is started and killed automatically based on whether or not publications are opened. Publications are removed from the streamer's internal cache as soon as all opened windows are closed by the user. Naturally, this memory management strategy isn't applicable to a real network client/server context, but it shows versatility. A future revision of r2-streamer-js might include an off-the-shelf cache invalidation strategy, such as least-recently-used / time window. Suggestions welcome! :)