Using collection Elements as Embedded Package Documents

Informational Document 27 November 2014

Copyright © 2014 International Digital Publishing Forum™

All rights reserved. This work is protected under Title 17 of the United States Code. Reproduction and dissemination of this work with changes is prohibited except with the written permission of the International Digital Publishing Forum (IDPF).

EPUB is a registered trademark of the International Digital Publishing Forum.

Table of Contents

What is a Collection

Why Mimic Package Documents

Collection as Package Document

Metadata

Manifest

Spine

Defining Collections

NOTE

When discussing the general concept/purpose/function of collections in this document, standard font face is applied ("collection"). References to the collection element ‒ its attributes, child elements, etc. ‒ appear in monospace font ("collection").

What is a Collection

The collection element was introduced during the EPUB® 3.0.1 revision to provide an extensible means of defining groups of related resources in the package document. It enables new abstract components to be defined, whether aggregated from resources of the current publication or drawn from outside the container.

Rather than have IDPF working groups ‒ or any party looking to extend the capabilities of the format ‒ wait for a revision to request new elements, effectively stalling their work, collections were created to easily enhance renditions with new functionality and information as needed.

The collection element itself is a very simple concept: it provides a way to group together sets of resources, and allows rich metadata to be attached to define those resources and/or their purpose.

To identify the general nature of the collection, a role attribute is required on the element:

<collection role="scriptable-component">

   …

</collection>

These roles are how machines (and humans) distinguish one collection from another, and indicate how the content of the element is to be processed. The roles are typically defined in EPUB specifications, but it is possible for anyone to create a custom collection. The IDPF maintains a registry of recognized roles for reference.

There is no limit on the number of collections you can include in the package document; each is defined as a separate element after the spine:

<package >

   <metadata></metadata>

   <manifest></manifest>

   <spine></spine>

   <collection role="distributable-object"></collection>

   <collection role="preview"></collection>

   <collection role="scriptable-component"></collection>

   

</package>

Multiple collections can also share the same role ‒ a publication might use multiple scripted components, or define several chapters as available for resale. The contents of each collection are typically unique, but multiple collections may reference many of the same resources.
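As a rough illustration of how a processing agent might tell collections apart by role, the following Python sketch groups the top-level collection elements of a package document. The sample package is a trimmed, hypothetical document, and the function name collections_by_role is an assumption made for this example, not part of any EPUB specification.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

OPF_NS = "http://www.idpf.org/2007/opf"  # EPUB package document namespace

# Hypothetical, heavily trimmed package document used only for illustration.
PACKAGE_XML = f"""
<package xmlns="{OPF_NS}" version="3.0">
   <metadata/>
   <manifest/>
   <spine/>
   <collection role="distributable-object"/>
   <collection role="preview"/>
   <collection role="scriptable-component"/>
   <collection role="scriptable-component"/>
</package>
"""

def collections_by_role(package_root):
    """Group the top-level collection elements of a package by their role."""
    groups = defaultdict(list)
    # Only direct children of the package are inspected here;
    # nested collections are deliberately left alone.
    for coll in package_root.findall(f"{{{OPF_NS}}}collection"):
        groups[coll.get("role")].append(coll)
    return dict(groups)

root = ET.fromstring(PACKAGE_XML)
for role, items in collections_by_role(root).items():
    print(role, len(items))
```

Because several collections may share a role, each role maps to a list rather than to a single element.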

The collection element has a very basic structure: metadata and links.

<collection role="…">

   <metadata>

      …

   </metadata>

   <link href="…"/>

</collection>

The metadata section allows you to provide metadata as rich as is possible in the package document, as the collection borrows the same element. The link elements enable you to specify what resources belong to the collection, whether resources for rendering (inside or outside the container) or simply sets of information (e.g., related web pages).

Additional flexibility is built into collections by allowing the elements to be nested inside each other:

<collection role="…">

   <metadata>…</metadata>

   <collection role="…">

      …

   </collection>

   <link href="…"/>

   …

</collection>

These three basic structures provide great flexibility, and allow you to represent many types of features, including package documents.
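The three structures (metadata, links and nested collections) lend themselves to a simple recursive walk. The Python sketch below is one possible way to summarize a collection; the roles "outer" and "inner", the file names and the function name describe are placeholders invented for this example.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"  # EPUB package document namespace

# Hypothetical collection; the roles and hrefs are placeholders.
SAMPLE = f"""
<collection xmlns="{OPF_NS}" role="outer">
   <metadata/>
   <collection role="inner">
      <link href="a.xhtml"/>
   </collection>
   <link href="b.xhtml"/>
</collection>
"""

def describe(collection):
    """Return a nested dict summarizing a collection and its descendants."""
    return {
        "role": collection.get("role"),
        "has_metadata": collection.find(f"{{{OPF_NS}}}metadata") is not None,
        # Direct child links only; links of nested collections are
        # reported under "children".
        "links": [link.get("href")
                  for link in collection.findall(f"{{{OPF_NS}}}link")],
        "children": [describe(child)
                     for child in collection.findall(f"{{{OPF_NS}}}collection")],
    }

summary = describe(ET.fromstring(SAMPLE))
print(summary["role"], summary["links"])
print(summary["children"][0]["role"], summary["children"][0]["links"])
```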

Why Mimic Package Documents

A key design goal when creating collections was to allow the identification of extractable components of a larger publication. Chapters and articles, for example, are often sold separately from the publication they belong to. There is also an increasing need to create and share scripted components now that EPUB 3 supports JavaScript.

While it is possible to create these kinds of components as their own EPUB publications and distribute them separately from their parent, it is not a very efficient model. Distributing a single file that the vendor can parcel up to create the components requires much less effort and coordination, for example. Likewise, a reading system might provide access to a preview without unlocking all the content, saving repeated downloads.

This idea that content within a publication may itself need to be presented as a publication ‒ whether virtually within the reading system or physically as a new container file ‒ was a key influence on the development of the collection element. The working group recognized that although the nature of collections might change from one to the next (e.g., an article vs. a chapter), the principal structure needed was going to be the same if both are designed to be extracted and presented to users.

Whether the collection is content for a user or simply a way of sharing components, the best way to represent such smaller components of a publication is as publications, since all that is needed is to leverage the existing EPUB framework.

Collection as Package Document

A collection that acts as a package document consists of three core sections: the metadata, the manifest and the spine.

These sections represent the minimum information necessary to generate a standalone publication. The following side-by-side comparison shows how they are expressed in the package and collection elements:

Package document                        Collection

<package …>                             <collection role="…">
   <metadata>                              <metadata>
      <!-- standard EPUB metadata -->
   </metadata>                             </metadata>

   <manifest>                              <collection role="manifest">
      <item … />                              <link … />
   </manifest>                             </collection>

   <spine>                                 <!-- spine element is implied -->
      <itemref … />                        <link … />
   </spine>
</package>                              </collection>

There is no single role attribute identifier that denotes that a collection can be processed as a package document, and no single set of production rules for creating these collections. In some cases, some of the above sections (the metadata, for example) could be omitted entirely.

Why this is the case gets into the needs of the various specifications that use collections to identify embedded components. Requiring every type to follow a strict set of rules wasn't seen as practical. Implementations are instead free to make alterations specific to their needs (e.g., handling fragments of documents, auto-generating repetitive metadata, generating navigation documents, etc.).
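To make the mapping between the two structures concrete, the following Python sketch pulls the three sections out of a collection: the metadata element, the nested collection with the manifest role, and the implied spine formed by the direct child links. It is a minimal sketch under stated assumptions: the sample collection, its role and its file names are placeholders, and split_sections is a name invented here. A real specification may add its own rules on top of this outline.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"     # EPUB package document namespace
DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core namespace

# Hypothetical collection used only for illustration.
SAMPLE = f"""
<collection xmlns="{OPF_NS}" xmlns:dc="{DC_NS}" role="distributable-object">
   <metadata>
      <dc:title>Chapter 1</dc:title>
   </metadata>
   <collection role="manifest">
      <link href="xhtml/chapter01.xhtml"/>
      <link href="css/epub.css"/>
      <link href="img/c01img01.jpg"/>
   </collection>
   <link href="xhtml/chapter01.xhtml"/>
</collection>
"""

def q(name):
    """Qualify a local element name with the package namespace."""
    return f"{{{OPF_NS}}}{name}"

def split_sections(collection):
    """Return (metadata element or None, manifest hrefs, spine hrefs)."""
    metadata = collection.find(q("metadata"))
    manifest = []
    for nested in collection.findall(q("collection")):
        if nested.get("role") == "manifest":
            manifest = [link.get("href") for link in nested.findall(q("link"))]
            break
    # The spine is implied: it is the ordered list of direct child links.
    spine = [link.get("href") for link in collection.findall(q("link"))]
    return metadata, manifest, spine

metadata, manifest, spine = split_sections(ET.fromstring(SAMPLE))
print(metadata.findtext(f"{{{DC_NS}}}title"))
print(manifest)
print(spine)
```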

These production issues will be looked at in more detail in the following sections.

Metadata

Metadata specific to the collection is contained in a child metadata element.

By default, no metadata is required for a collection, as even a collection that functions as a package document doesn't technically require any metadata depending on its use ‒ title(s), identifier(s), language(s) and last modified date can all be auto-generated, particularly if the object is not destined immediately for end user consumption.

More realistically, however, a collection will need to carry the core metadata that EPUB 3 requires of all publications.
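Where some of those values are absent, the auto-generation mentioned above might look something like the following Python sketch. Everything here is an assumption made for illustration: the function name complete_metadata, the placeholder title "Untitled", the use of the "und" (undetermined) language tag and the choice of a random UUID as identifier. A real implementation might instead inherit values from the parent publication.

```python
import uuid
from datetime import datetime, timezone

def complete_metadata(supplied):
    """Fill in missing core metadata values without overwriting supplied ones."""
    generated = {
        # Random UUID expressed as a URN; an assumption, any unique ID would do.
        "identifier": f"urn:uuid:{uuid.uuid4()}",
        # Placeholder title; hypothetical default for this sketch.
        "title": "Untitled",
        # "und" is the language tag for an undetermined language.
        "language": "und",
        # Last modified date in UTC, written as CCYY-MM-DDThh:mm:ssZ.
        "modified": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    # Values supplied by the author take precedence over generated ones.
    return {**generated, **supplied}

print(complete_metadata({"title": "Chapter 1", "language": "en"}))
```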

Manifest

The manifest is created using a nested collection. The IDPF has defined the role and requirements for manifests in the EPUB Manifest Role specification.

<collection role="manifest">

   <link href="xhtml/chapter01.xhtml"/>

   <link href="css/epub.css"/>

   <link href="img/c01img01.jpg"/>

</collection>

The link elements in the manifest accept only a limited set of attributes, but that does not limit the information available to a person or program looking to turn the collection into a standalone publication (or a virtual in-memory publication).

For example, the manifest links omit necessary attributes that are specified on the real manifest item elements (properties, fallback), but this information can be easily looked up using the link href attribute value, which can either point to the ID of a manifest item or be used to do a lookup on the item's href.

The following diagram shows how a lookup can be done by ID reference or resource name to find additional meta attributes.
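The same lookup can be sketched in Python. The sample manifest is hypothetical, find_manifest_item is a name invented here, and treating an href that begins with "#" as an ID reference is an assumption of this sketch rather than something this document specifies. No path normalization is attempted.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"  # EPUB package document namespace

# Hypothetical package manifest used only for illustration.
PACKAGE_XML = f"""
<package xmlns="{OPF_NS}" version="3.0">
   <manifest>
      <item id="c01" href="xhtml/chapter01.xhtml"
            media-type="application/xhtml+xml" properties="scripted"/>
      <item id="css" href="css/epub.css" media-type="text/css"/>
      <item id="img01" href="img/c01img01.jpg" media-type="image/jpeg"/>
   </manifest>
</package>
"""

def find_manifest_item(package_root, link_href):
    """Find the package manifest item that a collection link refers to."""
    items = package_root.findall(f"{{{OPF_NS}}}manifest/{{{OPF_NS}}}item")
    if link_href.startswith("#"):
        # Assumption: "#c01" is read as a reference to the item with id="c01".
        wanted = link_href[1:]
        return next((i for i in items if i.get("id") == wanted), None)
    # Otherwise match on the resource name (the item's href).
    return next((i for i in items if i.get("href") == link_href), None)

root = ET.fromstring(PACKAGE_XML)
item = find_manifest_item(root, "xhtml/chapter01.xhtml")
print(item.get("id"), item.get("media-type"), item.get("properties"))
```

Once the item is found, attributes such as properties and fallback can be read from it directly.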

The manifest is just as necessary for collections as it is for the package document, as it allows a reading system, or any processing agent, to quickly and easily identify all the other resources needed for rendering. The alternative, searching through all items in the spine, is complex to implement, time-consuming to do on the fly, and prone to error.

Spine

When you reviewed the side-by-side comparison of the package and collection elements above, you might have wondered why the manifest is defined in a nested collection but the spine is implied as the set of child link elements.

Although it is possible for a collection to contain only nested collections (i.e., a "spine" collection could be defined like the manifest), having an implied spine was chosen for simplicity. One of the original design goals for the collection element was to be able to provide a lightweight means of collecting resources together and providing reading order for the linked resources, regardless of whether they represented a publication in miniature or not.

The spine is the one feature that cannot be omitted from a collection acting as a package document: every publication must have at least one document in its logical reading order.

The processing of a collection's spine may be more complex than a package document's in some cases. The EPUB Distributable Objects specification, for example, uses the spine to identify fragments of content documents. To create an actual EPUB publication, a processing agent has to both filter the spine list to remove duplicate content documents, and also filter the content documents to remove all content except what is listed in the spine.
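The first of those two filtering steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the hrefs are invented, plan_extraction is a hypothetical name, and a link without a fragment identifier is taken to mean that the whole document is kept. The second step, removing unlisted content from each document, is not shown.

```python
from urllib.parse import urldefrag

def plan_extraction(spine_hrefs):
    """Map each content document to the fragment IDs to keep.

    Documents appear once, in reading order. A value of None means the
    whole document is kept (an assumption of this sketch).
    """
    plan = {}  # dicts preserve insertion order, so reading order survives
    for href in spine_hrefs:
        doc, fragment = urldefrag(href)
        if not fragment:
            plan[doc] = None            # whole document requested
        elif doc not in plan:
            plan[doc] = [fragment]      # first fragment of a new document
        elif plan[doc] is not None and fragment not in plan[doc]:
            plan[doc].append(fragment)  # further fragment of a known document
    return plan

# Hypothetical spine links, as they might appear in a collection.
links = [
    "xhtml/chapter01.xhtml#sec1",
    "xhtml/chapter01.xhtml#sec3",
    "xhtml/chapter02.xhtml",
]
print(plan_extraction(links))
```

The keys of the result give the de-duplicated spine, and the values indicate which parts of each content document would have to be retained.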

Defining Collections

Every specification that uses the collection element to represent a package document has to follow the general outline detailed in the last section, stating its own requirements for the metadata, the manifest and the spine.

Authors and developers need only be aware that these requirements may change from specification to specification, and not assume that one model will exactly match another they've seen before.

In practice, the variations will typically be minor, and both production and processing will entail much the same work, as the aim of having this model for mimicking package documents is to reduce duplication.