PDF/UA Technical Implementation Guide: Understanding ISO 32000-1 (PDF 1.7)

Statement of Purpose

ISO 32000-1 is the specification for the PDF format as a whole, in which context Tagged PDF appears as a single feature. This Guide is intended to provide information and examples to illuminate and clarify aspects of Section 14 of ISO 32000-1 for implementers of ISO 14289-1 (PDF/UA). Although informative rather than normative, this document reflects the understandings and intentions of the US Committee for PDF/UA, lead authors of ISO 14289-1.

Expectations

It's important to understand what this Guide does not do:

  • This Guide does not substitute for ISO 14289-1; it simply provides software developers with additional information beyond the text of ISO 32000 to assist in developing ISO 14289-1 conforming implementations.
  • This Guide is not a "how to" for tagging PDF documents.

Document Version

This version of the PDF/UA-1 Technical Implementation Guide to ISO 32000-1 is designated 1.0 as approved by a ballot of the US Committee for PDF/UA as reported by the Secretariat on July 17, 2014.

Target Audience

  • Assistive Technology (AT) developers and vendors.
  • Software developers interested in writing and processing Tagged PDF files.

How to Use this Guide

Each section of this Guide should be read side-by-side with ISO 32000-1, section 14. Where the language of the Standard itself is suitably clear, the Guide states that "No additional information for this subclause is provided."

NOTE: The subsections of ISO 32000-1 section 14 referenced directly and indirectly by PDF/UA are:

    • 14.3 Metadata
    • 14.6 Marked Content
    • 14.7 Logical Structure
    • 14.8 Tagged PDF
    • 14.9 Accessibility Support

Introduction to Concepts

In PDF, page content is represented by a sequence of graphics objects encoded in content streams of type text, path, image or smooth shade (see ISO 32000-1, 8.2 Graphics Objects for a detailed explanation) to be drawn one after another on a virtual canvas the size of the page. In many cases the order of the graphics objects on a given page does not reflect semantic aspects of the page content such as intended reading order. In order to be able to indicate the logical order of page content, further data structures are required. There are three such structures which together provide this mechanism: Marked Content, Logical Structure and Tagged PDF:

  • The Marked Content mechanism provides a means of identifying sequences of graphics objects within a content stream.
  • The Logical Structure mechanism enables a document to contain a tree, describing the logical hierarchy for content within the document. Logical Structure uses the Marked Content mechanism to identify the content belonging to a given node or leaf in the tree (e.g. Heading or Paragraph).
  • Tagged PDF makes it possible to apply semantic typing to content items identified by Logical Structure.

ISO 32000-1 clause 14.8 defines Tagged PDF and establishes a number of mandatory rules and optional recommendations for how to use these semantic types for content items.

Relating ISO 32000-1 to WCAG 2.0

The WAI's Web Content Accessibility Guidelines 2.0 (WCAG 2.0) provide normative text suitable to guiding development of accessible content in the web context. The scope of WCAG 2.0 is broad, encompassing files of many types, specifications, structural requirements and writing requirements. Various qualities are assigned a distinct "Level" for assessing conformance.

The WCAG 2.0 model is broadly applicable to PDF though it lacks the critical technical detail provided by PDF/UA. For those wishing to assess the conformance of a PDF/UA document vis-a-vis WCAG 2.0 success criteria, the ISO TC 171 SC 2 WG 9-sanctioned document "Achieving WCAG 2.0 with PDF/UA" provides the authoritative mapping between PDF/UA provisions and WCAG 2.0 Success Criteria.

Examples

Implementers seeking reference-quality examples of ISO 14289-1-conforming PDF files should be aware that the PDF Association produces reference-quality example files, including the PDF/UA Competence Center's Matterhorn Protocol.

Guide to Section 14

To ease reading this content alongside ISO 32000-1, the following headings refer directly to the subclauses of Chapter 14 in that International Standard.

14.1

No additional information for this subclause is provided in this guide.

14.2

No additional information for this subclause is provided in this guide.

14.3

The document's title (metadata) is typically what users with disabilities encounter first. It's important to ensure this information is descriptive and concise.

The document's title is dc:title (XMP....). The dc:title property uses a special XMP data type: a kind of array in which each entry either gives the default value or reflects a specific language version of the title (see example below).

EXAMPLE (from XMP spec) a title in English, French, Italian:

<xmp:Title>
  <rdf:Alt>
    <rdf:li xml:lang="x-default">XMP - Extensible Metadata Platform</rdf:li>
    <rdf:li xml:lang="en-us">XMP - Extensible Metadata Platform</rdf:li>
    <rdf:li xml:lang="fr-fr">XMP - Une Platforme Extensible pour les Métadonnées</rdf:li>
    <rdf:li xml:lang="it-it">XMP - Piattaforma Estendible di Metadata</rdf:li>
  </rdf:Alt>
</xmp:Title>

If only one value is provided and no language is indicated, ensure that the Lang entry in the document catalog is appropriate for the language of the default entry.

If the title does not match the document language it must be provided in a language-specific form. XMP metadata allows specification of more than one language as necessary; however, within a given entry there is no mechanism to switch languages.

A common mistake is to include the title inside the XMP as a plain string. In reality, each dc:title entry needs a language qualifier (xml:lang).

Use XMP rather than the document information dictionary.
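
The language-alternative form of dc:title described above can be read with a few lines of code. The following is a minimal sketch, assuming the XMP packet has already been extracted from the PDF as a string; the sample packet, the function names title_entries and pick_title, and the fallback policy are illustrative assumptions rather than anything mandated by ISO 32000-1 or the XMP specification.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample packet body (not taken from any real file).
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="">
    <dc:title>
      <rdf:Alt>
        <rdf:li xml:lang="x-default">Annual Report</rdf:li>
        <rdf:li xml:lang="de-de">Jahresbericht</rdf:li>
      </rdf:Alt>
    </dc:title>
  </rdf:Description>
</rdf:RDF>"""


def title_entries(xmp_xml):
    """Map each xml:lang value (lower-cased) to its dc:title text."""
    root = ET.fromstring(xmp_xml)
    entries = {}
    for li in root.findall(".//dc:title/rdf:Alt/rdf:li", NS):
        lang = li.get(XML_LANG)
        if lang is not None:  # entries without a language qualifier are skipped
            entries[lang.lower()] = (li.text or "").strip()
    return entries


def pick_title(entries, preferred):
    """Return the title for the preferred language, else the x-default entry."""
    return entries.get(preferred.lower(), entries.get("x-default"))


entries = title_entries(SAMPLE)
print(pick_title(entries, "de-DE"))
print(pick_title(entries, "fr-FR"))
```

A title written as a plain string (the mistake noted above) yields an empty mapping in this sketch, which a checker could report as a missing language alternative.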

14.4

No additional information for this subclause is provided in this guide.

14.5

No additional information for this subclause is provided in this guide.

14.6 Marked Content

14.6.1 General

It is important to understand that logical structure (ISO 32000-1:2008, section 14.7) is built on top of marked content (ISO 32000-1:2008, section 14.6) and that, in turn, Tagged PDF (ISO 32000-1:2008, 14.8) is built on top of both logical structure and marked content. Tagged PDF is the mechanism by which we ensure that content is accessible, but it is important to first understand the history behind marked content and logical structure.

Marked Content

Marked content was introduced in PDF 1.2 as a general mechanism for identifying sequences of content within a content stream. A producer of PDF can identify sequences of content (ISO 32000-1:2008, 14.6.1) and provide them with a name (referred to as a tag, but not to be confused with Tags in Tagged PDF). Note well that marked content, in itself, has nothing to do with logical structure or tagged PDF.

Uses of Marked Content alone

Marked content may be used to aggregate sections of content for manipulation or annotative purposes. For example, a producer that places several pieces of content onto a new page can use marked content to record what it has done.

Logical Structure

In PDF 1.3, the logical structure mechanism was introduced, which provided PDF with the means of describing a structure tree for the document. The logical structure is described in a separate area of the document from the page content and makes use of the marked content mechanism to identify the page content for nodes in the structure tree.

Uses of Logical Structure alone

MCIDs play a key role in logical structure by connecting page content to the logical structure. MCIDs are used to distinguish logical sections of content, but without a declared interoperability mechanism.

Tagged PDF

Tagged PDF, introduced in PDF 1.4, provides a stylized usage of logical structure via a set of standard structure types (ISO 32000-1:2008, section 14.8.4). This resolves a problem with logical structure used alone: because logical structure provides no predefined semantics, a producer can embed an arbitrarily complex logical structure into a document, but such structures can only be consumed by another application which understands that specific structure. Tagged PDF goes beyond just this semantic mechanism and also provides mechanisms that ensure that content within a document is accessible by programmatic means. Accessibility mechanisms (ISO 32000-1:2008, section 14.9) make use of Tagged PDF to provide AT users with access to the semantics, structure and content within a document.

Uses of Tagged PDF

Tagged PDF may be used to repurpose PDF content for accessibility, text-extraction and other purposes.

With the above in mind, the marked content section of ISO 32000-1:2008 makes little reference to its use within logical structure, because it is designed as a generalized mechanism. Similarly, the section describing logical structure does not attempt to describe how it is used by Tagged PDF. All content within the page content streams inside a PDF document must be enclosed using the mechanisms described in this section to make them available to the Tagged PDF structures.

Table 320

Marked Content defines a number of operators which can be used to demarcate content: MP and DP mark points, while BMC and BDC begin sequences that are terminated by EMC. Only the BDC operator identifies content referenced from the logical structure tree (note that BMC can be used for the case of Artifacts). Usage of the marked content operators for purposes other than logical structure can safely intermix with those being used for logical structure, and such non-structural uses can be safely disregarded by implementers of ISO 14289.

The tag of the BDC operator can be any name. Do not confuse the tag operand to the BDC operator with the term "Tag" from Tagged PDF (14.8). Although the tag's name does not indicate the semantic role of the structure element, confusion is reduced if these names are aligned with the structure tags containing the content.

EXAMPLE: /P <</MCID 123>> BDC

The /P tag in this example does not imply that the structure element referencing MCID 123 has a role of "P". It could be H1, TD or any other structure type.

The MP and DP operators do not play a role in the logical structure context.

You shall not start a marked content sequence in a page's content stream and end it in a referenced Form XObject, or on the next page. (See ISO 32000-2 14.6.1, Notes 3 and 4).
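
Because a marked content sequence may not cross a content stream boundary, the BMC/BDC and EMC operators in any one stream should balance. The sketch below illustrates that check and collects the MCIDs found on BDC operators. It is a simplified illustration, not a content stream parser: it assumes the stream is already decoded to text, that property lists are written inline without nested dictionaries, and that operator names do not appear inside string operands. The function name scan_marked_content is hypothetical.

```python
import re

# Either an inline property list followed by BDC, or a bare BMC/BDC/EMC operator.
_OPERATOR = re.compile(
    r"<<(?P<props>[^<>]*)>>\s*BDC\b"
    r"|(?<![\w/])(?P<op>BMC|BDC|EMC)(?!\w)"
)
_MCID = re.compile(r"/MCID\s+(\d+)")


def scan_marked_content(content):
    """Return (mcids, balanced) for one decoded content stream."""
    depth = 0
    mcids = []
    for match in _OPERATOR.finditer(content):
        props = match.group("props")
        if props is not None:
            # BDC with an inline property list: may carry an MCID.
            depth += 1
            found = _MCID.search(props)
            if found:
                mcids.append(int(found.group(1)))
        elif match.group("op") == "EMC":
            depth -= 1
            if depth < 0:  # EMC without a matching BMC/BDC in this stream
                return mcids, False
        else:
            # BMC, or BDC whose property list is a named resource.
            depth += 1
    return mcids, depth == 0


STREAM = """/P <</MCID 0>> BDC
BT /F1 12 Tf (Hello) Tj ET
EMC
/Artifact BMC
0 0 10 10 re f
EMC
"""
print(scan_marked_content(STREAM))
```

A stream that opens a sequence and never closes it is reported as unbalanced, which is the situation that would arise if a producer tried to end the sequence in a Form XObject or on the next page.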

14.6.2 Property Lists

The following are reserved names within a Property List for the BDC operator: MCID, ActualText, Alt, Lang and E. 

14.6.3 Marked Content and Clipping

No additional information for this subclause is provided in this guide.

14.7 Logical Structure

14.7.1 General

The "Markings" reference in 14.7.1, 4th paragraph should read "MarkInfo."

Suspects entries with a value of true are not allowed in documents conforming to ISO 14289-1.
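
As a small illustration of the MarkInfo rules, the following sketch checks a MarkInfo dictionary that has already been read into a Python dict. The dict model and the function name mark_info_problems are assumptions made for this example; the two conditions checked are the Marked entry expected of Tagged PDF and the Suspects restriction noted above.

```python
def mark_info_problems(mark_info):
    """List problems found in a MarkInfo dictionary modeled as a dict (or None)."""
    problems = []
    if mark_info is None or mark_info.get("Marked") is not True:
        problems.append("MarkInfo /Marked is not true")
    if mark_info and mark_info.get("Suspects") is True:
        problems.append("MarkInfo /Suspects is true")
    return problems


print(mark_info_problems({"Marked": True, "Suspects": True}))
```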

14.7.2 Structure Hierarchy

Table 323

The T key is irrelevant to accessibility but may be of use in implementations concerned with tag management.

The value of the E key is expressed in the DOM inline with the text, replacing the tagged content in the DOM's stream.

ActualText when used as an attribute on a structure element is a complete representation of the content inside the structure element rather than a complete replacement of the structure element as such.

For implementers to whom it is important to accommodate legacy software, place ActualText in a marked content sequence directly inside the structure element instead of on the structure element itself.

14.7.3 Structure Types

Note 3 discusses role mapping for standard structure types. Prior to PDF 1.5, role mapping of standard structure types was disallowed, but in PDF 1.5 this was changed such that a custom element that just happened to share the same name as a standard structure type could in fact be role mapped to its real type. To facilitate this, all elements that are used within a given document must have a corresponding entry in the rolemap, even if that entry simply maps a standard type to itself. ISO 14289 explicitly disallows remapping standard types, so a PDF/UA document's rolemap shall always map a standard type to the same standard type (e.g. "P"->"P" and NOT "P"->"BlockQuote").
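
The remapping restriction can be checked mechanically. The sketch below assumes the RoleMap has been read into a plain Python dict of names; the function names are hypothetical, and the set of standard structure types is transcribed from ISO 32000-1, 14.8.4 and should be verified against the Standard before being relied upon.

```python
# Standard structure types as listed in ISO 32000-1, 14.8.4 (transcribed by hand).
STANDARD_TYPES = frozenset("""
Document Part Art Sect Div BlockQuote Caption TOC TOCI Index NonStruct Private
P H H1 H2 H3 H4 H5 H6 L LI Lbl LBody Table TR TH TD THead TBody TFoot
Span Quote Note Reference BibEntry Code Link Annot Ruby RB RT RP Warichu WT WP
Figure Formula Form
""".split())


def remapped_standard_types(role_map):
    """Return (type, target) pairs where a standard type maps to something else."""
    return sorted(
        (src, dst)
        for src, dst in role_map.items()
        if src in STANDARD_TYPES and src != dst
    )


def resolve_role(role_map, name):
    """Follow the role map from a custom type to a standard type, or None."""
    seen = set()
    while name not in STANDARD_TYPES and name in role_map and name not in seen:
        seen.add(name)  # guards against circular mappings
        name = role_map[name]
    return name if name in STANDARD_TYPES else None


ROLE_MAP = {"P": "P", "Heading1": "H1", "Body": "Para", "Para": "P", "H2": "P"}
print(remapped_standard_types(ROLE_MAP))
print(resolve_role(ROLE_MAP, "Body"))
```

In this sketch resolve_role stops as soon as it reaches a standard type, which mirrors the PDF/UA position that standard types are never remapped.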

14.7.4 Structure Content

14.7.4.1 General

No additional information for this subclause is provided in this guide.

14.7.4.2 Marked Content Sequences as Content Items

The tag applied to a marked content sequence predates logical structure, as described previously. These "tags" are thus not directly related to logical structure, and serve no purpose other than internal labeling. They are, however, required. The recommendation in ISO 32000-1 (bullet 1 in this section) is to have the marked content sequence tag match the Tag in the logical structure; this provision may be safely ignored for PDF/UA conformance purposes.

Table 324

There is a typographical error in this table in the StmOwn key. When a marked content reference has a Stm entry present (i.e. identifies the contents within a stream as belonging to the current structure element), an optional StmOwn entry may also be present, which identifies the owner of that stream (e.g. an annotation dictionary). In ISO 32000-1:2008, the provided reference is "Stems" when it should be "Stm".

14.7.4.3 PDF Objects as Content Items

No additional information for this subclause is provided in this guide.

14.7.4.4 Finding Structure Elements from Content Items

Many tagged PDF files lack a ParentTree. Such back-pointers are required in order for a consumer to find the structure element (tag) to which a given content item belongs. A ParentTree is clearly "Required" - see Table 322!
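
To make the role of the ParentTree concrete, the following sketch models the lookup described in ISO 32000-1, 14.7.4.4. It assumes the number tree has been flattened into a Python dict keyed by integer, that a page's StructParents value leads to an array indexed by MCID, and that an object's StructParent value leads directly to a structure element. The strings standing in for structure elements and both function names are hypothetical.

```python
def element_for_marked_content(parent_tree, struct_parents, mcid):
    """Find the structure element for a marked content sequence on a page.

    struct_parents is the page's StructParents value; the matching
    ParentTree entry is an array indexed by MCID.
    """
    entry = parent_tree.get(struct_parents)
    if not isinstance(entry, list) or not 0 <= mcid < len(entry):
        return None
    return entry[mcid]


def element_for_object(parent_tree, struct_parent):
    """Find the structure element for a whole object such as an annotation."""
    entry = parent_tree.get(struct_parent)
    return None if isinstance(entry, list) else entry


# Hypothetical flattened ParentTree: key 0 belongs to a page, key 1 to an annotation.
PARENT_TREE = {0: ["H1 element", "P element"], 1: "Link element"}
print(element_for_marked_content(PARENT_TREE, 0, 1))
print(element_for_object(PARENT_TREE, 1))
```

Without a ParentTree, neither lookup is possible and a consumer would have to search the whole structure tree for each content item.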

14.7.5 Structure Attributes

14.7.5.1 General

It's permitted to have multiple dictionaries representing attributes, even when the attribute's Owner (see Table 341) is the same. However, repeated entries of standard attribute types cause an ambiguity for consuming software, and should be avoided.

In the case of repeated entries, consuming software is in a bind; these items have no specific ordering, and thus, there is no correct way to process such files.
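
A producer or checker can detect this ambiguity before a file is written. The sketch below assumes each attribute dictionary has been read into a Python dict whose O key holds the owner name; the function name repeated_attributes and the sample values are illustrative only.

```python
from collections import defaultdict


def repeated_attributes(attribute_dicts):
    """Return {(owner, key): [values...]} for keys set more than once per owner."""
    seen = defaultdict(list)
    for attrs in attribute_dicts:
        owner = attrs.get("O")
        for key, value in attrs.items():
            if key != "O":  # O names the owner and is not itself an attribute
                seen[(owner, key)].append(value)
    return {k: v for k, v in seen.items() if len(v) > 1}


ATTRS = [
    {"O": "Layout", "TextAlign": "Start"},
    {"O": "Layout", "TextAlign": "Center", "SpaceBefore": 6},
    {"O": "Table", "Scope": "Row"},
]
print(repeated_attributes(ATTRS))
```

The sketch only reports the conflict; it deliberately does not pick a winner, since no ordering is defined.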

14.7.5.2 Attribute Classes

No additional information for this subclause is provided in this guide.

14.7.5.3 Attribute Revision Numbers

No additional information for this subclause is provided in this guide.

14.7.5.4 User Properties

This section has no value from an accessibility standpoint.

14.7.6 Example of Logical Structure

This section is highly recommended for early reading by developers investigating structured PDF, and is a good introduction to the concepts of Logical Structure.

14.8 Tagged PDF

14.8.1 General

Note that Tagged PDF is not restricted (as the paragraph implies) to page content, but includes annotations as well.

A conforming PDF/UA writer is required to produce Tagged PDFs.

Word breaks (something many PDF writers neglect) are described in more detail in section 14.8.2.5.

14.8.2 Tagged PDF and Page Content

14.8.2.1 General

No additional information for this subclause is provided in this guide.

14.8.2.2 Real Content and Artifacts

14.8.2.2.1 General

Note that in PDF, "graphics" includes all classes of page content including text, vector graphics and bitmap images. See chapters 8 and 9 of ISO 32000-1.

Hidden annotations may be present in the structure tree. A conforming processor will not present such annotations to AT.

14.8.2.2.2 Specification of Artifacts

Failing to properly mark objects that are semantically artifacts may reduce predictability and perceived quality when reused (for example, in a reflow scenario) by tags-processing software. Consuming implementations may treat content that is neither tagged nor marked as an artifact as unrepresentable in general (i.e., as an artifact in practice).

14.8.2.2.3 Incidental Artifacts

Bullet 2, Text discontinuities - it's important to note that ISO 32000-1 accepts logical structure as the canonical ordering of text and other content for logical content order purposes.

14.8.2.3 Page Content Order

This section is not very useful from the accessibility perspective and should be ignored. To determine the logical order of a document's contents, page content order should be completely ignored.

14.8.2.4 Extraction of Character Properties

14.8.2.4.1 General

No additional information for this subclause is provided in this guide.

14.8.2.4.2 Unicode Mapping in Tagged PDF

Although Notes 2 and 3 use normative text ("may") they are indeed informative. "May" in these cases should be read as "can".

14.8.2.4.3 Font Characteristics

Font characteristics (italics, bold, color, condensed, etc.) or text characteristics not provided by fonts (e.g. underlines) are not semantic information per se - but the appearance of these may indicate semantic functions. 

There is no official or formal way to identify underlining or boldface text. Such visual effects may be caused in a wide variety of ways, and are generally used to draw special attention to content.

In general there is no way to express such semantics on the inline level in PDF 1.7. Some PDF readers or AT may use font characteristics as a workaround.

Looking forward: ISO 32000-2 will include new tags to assist in correct tagging of semantically meaningful font changes. In the meantime <Span> tags may be used to contain text runs that include semantically significant inline styling such as the use of underline to indicate a heading. See TextDecorationType in Table 345.

14.8.2.5 Identifying Word Breaks

Word breaks are a commonly misunderstood part of Tagged PDF. The incorrect belief is that word breaks must be determined by heuristic examination of a rendering, akin to OCR(!). This is far from the case; PDF provides for explicit word breaks and Tagged PDF requires that they be used. This section requires that word separators shall be explicitly included in content streams as appropriate for a given language.

Taking English as an example, whitespace characters must be included between every word and even after a period at the end of a sentence. Such inclusion unambiguously splits the words. However, using Japanese as another example, whitespace is not required because whitespace is not used in Japanese to identify word breaks.

14.8.3  Basic Layout Model

Although this section discusses page-layout in a manner that appears to confuse layout with semantics, the intent of the section is to discuss semantic constructs first and foremost.

Location on a page is not necessarily related to semantic significance. Consider the example of a watermark - it need not be tagged within the tag belonging to the text co-located on the page with the watermark.

14.8.4  Standard Structure Types

14.8.4.1 General

No additional information for this subclause is provided in this guide.

14.8.4.2 Grouping Elements

Table 333

Although the Part, Art, Sect and Div structure types are not presently utilized for AT purposes, these types offer significant benefits for tags-tree navigation purposes. Support for these types is encouraged, but implementers should note that ISO 32000-2 makes significant changes in this area.

Use of the Caption Tag

For captions to tables and lists, place the caption tag as the first or last item within the table or list. See ISO 32000-1 List Elements (14.8.4.3.3) or Table elements (14.8.4.3.4).

For captions to figures, place the Caption tag inside a grouping tag immediately before or after the Figure tag.

Use of Div Tag

Div is useful for grouping semantically related elements. 

EXAMPLE: Figures often include copyright information. In such cases, provide a grouping tag to enclose both the Figure tag and the copyright information.

14.8.4.3 Block Level Structure Elements

14.8.4.3.1 General

No additional information for this subclause is provided in this guide.

14.8.4.3.2 Paragraphlike Elements
Table 335

Note that H6 is not the lowest-level possible Heading type. Implementers may proceed to Hn but should be aware that many AT implementations may be unaware of headings beyond H6 (e.g., H7, H8, etc.).

14.8.4.3.3 List Elements
Table 336

If list labels have no semantic significance, such labels may be marked as artifact, and thus, not be tagged with <Lbl> tags.

When individual list labels provide semantic value (e.g. use of symbols instead of generic bullets), it's necessary to tag such labels with Lbl.

14.8.4.3.4 Table Elements

There is a typo in NOTE 2 in this subclause: the table referenced is actually Table 349, not Table 347.

The Table elements represent table data structures by semantic type. It is through the use of attributes that the relationship between these structure elements is expressed. Accordingly, it's critical to read Table 349 when reading Table 337. 

Table 337

TBody is only required if implementers include THead / TFoot structure elements in their tables.

The ISO Committee has defined an algorithm (reproduced below) to help software find non-explicit TH cells. From ISO 32000-2 (PDF 2.0) 14.8.4.7.3 Table Structure types:

If the Headers attribute (see 14.8.4, "Standard structure attributes") is not specified, the following algorithm determines which headers shall be associated with any given cell by finding an ordered list of row and column headers:

To find headers for any data or header cell, search left/up from the cell’s position to find row/column header cells. The search in a given direction stops when any of these conditions is reached:

    • the edge of the table is reached
    • a data cell is found after a header cell
    • a header cell has the Headers attribute set - the headers that are specified are appended to the row/column list that is being built

When a header cell is found in the search and the (implicit or explicit) Scope attribute of the header cell is either Both or Row/Column, the header cell is appended to the end of the list of row/column headers, resulting in a list of headers ordered from most specific to most general.

NOTE: This algorithm works for languages with different intrinsic directionality of the script (such as right-to-left) because the structure always reflects the reading order of the table.
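
The search described above can be sketched in code. This is one reading of the quoted text, not a reference implementation, and it rests on several assumptions: the table is a rectangular grid with no row or column spans, every TH cell carries an explicit Scope (resolution of implicit Scope is not attempted), Headers entries are opaque identifiers, and a header cell with Headers set is itself appended when its Scope matches before its Headers are added and the search stops. The Cell class and find_headers function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    id: str
    kind: str                 # "TH" or "TD"
    scope: str = ""           # "Row", "Column" or "Both" (header cells only)
    headers: list = field(default_factory=list)  # explicit Headers attribute


def find_headers(grid, row, col):
    """Return (row_headers, column_headers), most specific first."""

    def search(d_row, d_col, wanted_scope):
        found = []
        seen_header = False
        r, c = row + d_row, col + d_col
        while r >= 0 and c >= 0:          # stop at the edge of the table
            cell = grid[r][c]
            if cell.kind == "TH":
                seen_header = True
                if cell.scope in (wanted_scope, "Both"):
                    found.append(cell.id)
                if cell.headers:          # explicit Headers end the search
                    found.extend(cell.headers)
                    break
            elif seen_header:             # data cell found after a header cell
                break
            r, c = r + d_row, c + d_col
        return found

    # Search left for row headers, then up for column headers.
    return search(0, -1, "Row"), search(-1, 0, "Column")


GRID = [
    [Cell("corner", "TH", "Both"), Cell("colA", "TH", "Column"), Cell("colB", "TH", "Column")],
    [Cell("row1", "TH", "Row"), Cell("a1", "TD"), Cell("b1", "TD")],
    [Cell("row2", "TH", "Row"), Cell("a2", "TD"), Cell("b2", "TD")],
]
print(find_headers(GRID, 2, 2))
```

For the data cell in the last row and column of this hypothetical table, the sketch reports row2 as its row header and colB as its column header.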

14.8.4.3.5 Usage Guidelines for Block Level Structure

Although vague in ISO 32000-1, this subject is addressed in detail, with explicit nesting relationships for structure elements, in ISO 32000-2.

14.8.4.4 Inline-Level Structure Elements

14.8.4.4.1 General
Table 338

A few clarifications are in order on certain Structure Types in this table:

  • Quote - A Quote may contain the author's own words as well as those of other voices.
  • Note - Note appears to be general, but is actually very specific to content referenced by other content in the same document, the classic example being footnotes and endnotes. Since Notes may be very long, it is advisable to consider them as both inline and block-level elements.
  • Reference - It is ideal (from a logical structure point of view) to nest each Note tag within its referencing Reference tag. Full support for these elements provides consuming software with the ability to present or skip the note, as the reader prefers.

EXAMPLE:

<Reference> (e.g.: 3)

      <Note> (e.g.: "3 - note text...")

14.8.4.4.2 Link Elements

What links in ISO 32000-1 don't do is provide a means of identifying (structurally) the content they are intended to link to. This is fixed in PDF 2.0 by virtue of the new Structured Destination mechanism. To associate disparate content in PDF 1.7, consider nesting the target content's tag within a Reference tag.

14.8.4.4.3 Annotation Elements

No additional information for this subclause is provided in this guide.

14.8.4.4.4 Ruby and Warichu Elements

No additional information for this subclause is provided in this guide.

14.8.4.5 Illustration Elements

Please see the PDF/UA-1 Technical Implementation Guide for guidance on Figure, Formula and Form tags.

14.8.5 Standard Structure Attributes

14.8.5.1 General

Although many attributes are stylistic, some have structural and/or semantic significance.

14.8.5.2 Standard Attribute Owners

Apart from those specific attributes called out in PDF/UA, PDF/UA conforming processors need not process any standard structure attribute.

14.8.5.3 Attribute Values and Inheritance

No additional information for this subclause is provided in this guide.

14.8.5.4 Layout Attributes

No additional information for this subclause is provided in this guide.

14.8.5.4.1 General

No additional information for this subclause is provided in this guide.

14.8.5.4.2 General Layout Attributes

The Placement and WritingMode attributes are intended for use when reflowing PDF content. As such they are not required for PDF/UA. As of PDF 2.0, an explicit reflow model is no longer specified, limiting the utility of these attributes.

The BackgroundColor, BorderColor, BorderStyle, BorderThickness, Padding and Color attributes might be useful for processors enacting the reflow model from PDF 1.7, but are unlikely to provide utility in other contexts.

14.8.5.4.3 Layout Attributes for BLSEs

The SpaceBefore, SpaceAfter, StartIndent, EndIndent, TextIndent, TextAlign, BlockAlign, InlineAlign, TBorderStyle, TPadding attributes might be useful for processors enacting the reflow model from PDF 1.7, but are unlikely to provide utility in other contexts.

If a computed bounding-box is not likely to reflect the author's intended bounding-box (for example, due to a background artifact), BBox, Height and Width attributes are strongly recommended.

14.8.5.4.4 Layout Attributes for ILSEs

These attributes might be useful for processors enacting the reflow model from PDF 1.7.

14.8.5.4.5 Content and Allocation Rectangles

This section is useful for processors enacting the reflow model from PDF 1.7.

14.8.5.4.6 Illustration Attributes

No additional information for this subclause is provided in this guide.

14.8.5.4.7 Column Attributes

This section is useful for processors enacting the reflow model from PDF 1.7.

In ISO 32000 a normative requirement was accidentally added requiring column attributes. These are not in fact required, and do not pertain to accessibility.

14.8.5.5 List Attribute

In PDF/UA the ListNumbering attribute is required for all L tags.

14.8.5.6 PrintField Attributes

ISO 32000-2 will clarify the definition of this attribute.

14.8.5.7 Table Attributes

In ISO 14289-1:2012, the Scope attribute is required (7.5 paragraph 2). The forthcoming 2nd edition of ISO 14289-1 corrects this to clarify when the Scope attribute is required.

ISO 32000-2 will introduce a default algorithm for determining scope, which is described in the PDF/UA-1 Technical Implementation Guide.

14.9 Accessibility Support

14.9.1 General

No additional information for this subclause is provided in this guide.

14.9.2 Natural Language Specification

14.9.2.1 General

Marked Content Considerations

The section of ISO 32000-1 that defines Marked Content specifies that a specific Span container type (not to be confused with the standard structure type <Span>) can be used to identify a specific set of attributes for that content sequence. These attributes only have a defined meaning in the context of this Span container. However, a common mistake is to use attributes of Span containers on arbitrary marked content sequences that are directly referenced from the logical structure tree.

  • Some consuming processors accommodate this mistake by honoring the intent of attributes of Span containers even though such usage is undefined.
  • PDF producer software should avoid writing these attributes of Span containers on anything other than Span containers.

Accordingly, although the content within each structure element is required to have a defined Lang value, that Lang may be specified in a structure element OR in marked content Span containers. If specified in both places the Marked Content container's Lang value takes precedence.

To simplify: content-specific language takes precedence over structure-specific language which takes precedence over document-specific language.
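
The precedence rule can be expressed as a short resolution function. The sketch below assumes the caller has already gathered the Lang value of the enclosing marked content Span container (if any), the Lang values of the enclosing structure elements from innermost to outermost, and the document catalog's Lang. The function name effective_lang and its parameters are hypothetical.

```python
def effective_lang(span_lang, struct_langs, doc_lang):
    """Resolve the language that applies to a piece of content.

    span_lang:    Lang on the marked content Span container, or None
    struct_langs: Lang values of enclosing structure elements,
                  innermost first (None where an element has no Lang)
    doc_lang:     Lang from the document catalog, or None
    """
    if span_lang:
        return span_lang          # content-specific language wins
    for lang in struct_langs:
        if lang:
            return lang           # nearest structure element with a Lang
    return doc_lang               # fall back to the document language


print(effective_lang(None, [None, "fr-FR"], "en-US"))
print(effective_lang("de-DE", [None, "fr-FR"], "en-US"))
```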

ISO 32000-2 and ISO 14289-2 will be clarified on this point.

14.9.2.2 Language Identifiers

ISO 32000-1 allows "unknown" as a legitimate value of Lang, and PDF/UA does not expressly prohibit the use of this value. However, accurate representation of the languages present is vital to comprehension via some AT. Accordingly, the "unknown" value must only be used in cases where the language cannot be determined by a reasonable best effort.

14.9.2.3 Language Specification Hierarchy

No additional information for this subclause is provided in this guide.

14.9.2.4 Multi-language text arrays

No additional information for this subclause is provided in this guide.

14.9.3 Alternate Descriptions

If alternate description text contains more than one language the Unicode escape sequence (U+E0000), although deprecated in Unicode as of 5.1 (2008), may nonetheless be used to indicate language changes in ISO 32000-1 conforming files.

EXAMPLE: <U+E0000, U+E0001, U+E006A, U+E0061>
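
For illustration, the tag characters in such a sequence can be mapped back to ASCII by subtracting 0xE0000 from each code point in the range U+E0020 to U+E007E. The following minimal sketch does only that and ignores every other code point in the sequence; the function name decode_language_tag is hypothetical.

```python
def decode_language_tag(code_points):
    """Map Unicode tag characters (U+E0020..U+E007E) back to ASCII."""
    return "".join(
        chr(cp - 0xE0000)
        for cp in code_points
        if 0xE0020 <= cp <= 0xE007E   # other code points are ignored
    )


print(decode_language_tag([0xE0000, 0xE0001, 0xE006A, 0xE0061]))  # prints "ja"
```

Applied to the example above, the two tag characters decode to the language code for Japanese.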

14.9.4 Replacement Text

ActualText is the entirety of the representation for reuse purposes, including accessibility.

14.9.5 Expansion of Abbreviations and Acronyms

No additional information for this subclause is provided in this guide.