# OpenType shaping errata #
This document details errata that shaping engines may encounter, such
as ambiguities or omissions in the existing OpenType or Unicode
specification documents.
**Contents**
- [Unicode](#unicode)
- [ZWJ and ZWNJ](#zwj-and-zwnj)
- [Scope of ZWJ and ZWNJ](#scope-of-zwj-and-zwnj)
- [ZWJ in redundant ligature lookups](#zwj-in-redundant-ligature-lookups)
- [Emoji](#emoji)
- [Skin-tone permutations](#skin-tone-permutations)
- [Gender permutations](#gender-permutations)
- [OpenType](#opentype)
- [Null offsets in GSUB and GPOS](#null-offsets-in-gsub-and-gpos)
- [Sorting of GSUB and GPOS lookups](#sorting-of-gsub-and-gpos-lookups)
- [Per-script applicability of feature tags](#per-script-applicability-of-feature-tags)
- [Ordering of post-base and below-base consonants in Indic2 base-consonant determination](#ordering-of-post-base-and-below-base-consonants-in-indic2-base-consonant-determination)
- [Lookup behavior](#lookup-behavior)
- [Using MultipleSub for glyph deletion](#using-multiplesub-for-glyph-deletion)
- [Processing nested contextual lookups](#processing-nested-contextual-lookups)
- [Adjacent-mark reordering ambiguities](#adjacent-mark-reordering-ambiguities)
- [Merging of glyph properties](#merging-of-glyph-properties)
- [See also](#see-also)
## Unicode ##
This section lists errata pertaining to the Unicode Standard.
### ZWJ and ZWNJ ###
#### Scope of ZWJ and ZWNJ ####
Unicode provides the Zero Width Joiner (ZWJ) and Zero Width Non-Joiner
(ZWNJ) control characters so that a text sequence can "request a
rendering system to have more or less of a connection between
characters than they would otherwise have."
The generic examples used in the standard show how ZWJ and ZWNJ
characters can affect the cursive-joining behavior between two
characters or the ligature-forming behavior between two
characters. However, the standard does not explicitly say whether or
not the presence of a ZWJ or ZWNJ should influence the shaping
behavior of characters for characters not adjacent to the ZWJ or ZWNJ.
For example, in the sequence "a,b,ZWNJ,c,d" the ZWNJ should prevent
the application of a ligature between "b" and "c" (if such a ligature
lookup exists in the active font).
However, if the active font contains a contextual ligature lookup for
"c,d" when preceded by "b", it is not clear whether or not the ZWNJ
in the same "a,b,ZWNJ,c,d" sequence should inhibit the application of
the ligature between "c" and "d".
#### ZWJ in redundant ligature lookups ####
An "Implementation Notes" section in chapter 23.2 of the Unicode
Standard says that font vendors should add ZWJ sequences to ligature
lookups. For example, if the sequence "f,i" triggers the "fi"
ligature, then the font should also include a lookup that triggers the
"fi" ligature for "f,ZWJ,i".
However, the text of chapter 23.2 prior to the "Implementation Notes"
says that ZWJ and ZWNJ "are not to be used in all cases where
ligatures or cursive connections are desired; instead, they are meant
only for over-riding the normal behavior of the text." That logic
makes the suggested "f,ZWJ,i" ligature lookup superfluous, because it
duplicates the effects of the existing "f,i" ligature lookup.
Using ZWJ within lookup patterns in the manner suggested by the
"Implementation Notes" is not common practice.
### Emoji ###
#### Skin-tone permutations ####
It is unclear whether ZWJ multi-person group emoji sequences are
expected to include combinations where some emoji in the sequence are
followed by a Fitzpatrick skin-tone modifier but other emoji in the
sequence are not followed by a Fitzpatrick skin-tone modifier.
For example, it is unclear whether the sequence
"Man,ZWJ,Handshake,Man,SkinTone-2" constitues a valid
ZWJ "Couple holding hands" sequence.
#### Gender permutations ####
It is unclear whether ZWJ multi-person group emoji sequences are
expected to include combinations where some emoji in the sequence are
are an explicit gender but other emoji in the sequence are not
explicit gender.
For example, it is unclear whether the sequence
"Man,ZWJ,Handshake,Person" constitues a valid
ZWJ "Couple holding hands" sequence.
It is also unclear whether the ZWJ multi-person family sequence must
have explicit gender-ordering for the adult humans depicted.
For example, it is unclear whether the sequence
"Man,ZWJ,Woman,ZWJ,Girl" should be rendered identically to the
sequence "Woman,ZWJ,Man,ZWJ,Girl".
## OpenType ##
This section lists errata pertaining to the OpenType specification.
### Null offsets in GSUB and GPOS ###
The headers of the GSUB and GPOS tables include fields that contain
the offsets at which other structures within the font binary are
found. For example, the value of the `featureVariationsOffset` field
indicates the byte value at which the featureVariations structure is
located.
The OpenType specification notes that `featureVariationsOffset` can be
`NULL`, but the specification does not indicate whether or any other
offset values can also be `NULL` (nor, conversely, does it indicate
whether `NULL` should be considered invalid).
In practice, other fields -- such as `scriptListOffset`,
`featureListOffset`, and `lookupListOffset` -- may have `NULL` values.
In such situations, `NULL` is usually intrepreted as meaning that the
structure nominally pointed to by the offset is empty.
Furthermore, font-validation functions may overwrite a `NULL` into an
offset field if the original value encountered was invalid.
### Sorting of GSUB and GPOS lookups ###
The OpenType specification requires that lookups in the GSUB table
must be sorted into numeric order before they are applied.
Lookups in the GPOS table, however, are not expected to be sorted
first, because GPOS lookups are applied in a specified order.
### Per-script applicability of feature tags ###
Some OpenType feature tags are defined only to apply to text runs in
specific scripts. Other feature tags are defined to apply to text in
any script.
However, the definitions of some feature tags list a limited number of
example scripts to which the feature should apply, but do not specify
every supported script.
For example, the `pstf` (post-base forms) tag is
[described](https://docs.microsoft.com/en-us/typography/opentype/spec/features_pt#tag-pstf)
as required for "scripts of south and southeast Asia that have
post-base forms for consonants eg: Gurmukhi, Malayalam, Khmer."
### Ordering of post-base and below-base consonants in Indic2 base-consonant determination ###
The Microsoft script-development specification for all Indic2-model
scripts
[states](https://docs.microsoft.com/en-us/typography/script-development/bengali#reorder-characters)
parenthetically that "post-base forms have to follow below-base forms".
If this statement is taken to be a rule, it would affect the
base-consonant search algorithm.
For example, in the Bengali sequence "Ka,Halant,Ba,Halant,Ya"
(`U+0995`,`U+09CD`,`U+09AC`,`U+09CD`,`U+09AF`), "Ka" would be
identified as the syllable base, with "Ba" designated a below-base
form and "Ya" designated a post-base form. However, in the similar
sequence "Ka,Halant,Ya,Halant,Ba"
(`U+0995`,`U+09CD`,`U+09AF`,`U+09CD`,`U+09AC`), "Ya" would be
identified as the base consonant.
Real-world Bengali texts provide counterexamples that contradict the
assumption that "post-base forms follow below-base forms" is a
requirement.
In other scripts, such as Telugu, the "post-base forms have to follow
below-base forms" statement is, perhaps, statistically likely, but is
certainly not an orthographic rule.
Consequently, it is unclear if the statement should be enforced as a
rule or if it should be regarded as a suggestion, and it is unclear to
what degree that answer varies between the Indic2-model scripts.
### Lookup behavior ###
#### Using MultipleSub for glyph deletion ####
The GSUB specification says that a `MultipleSubst` substitution cannot
be used to delete a glyph: it always substitutes at least one
replacement glyph. However, some implementations allow the
replacement-glyph array to be zero-length.
#### Processing nested contextual lookups ####
The GSUB specification allows contextual substitutions to invoke other
contextual substitutions. It is unclear how implementations ought to
handle certain cases of these nested lookups.
For example:
```
context: 'a'
subst index 0:
context: 'ab'
subst index 1: 'b' → 'ab'
```
This nested set of substitutions could cause an infinite loop on
certain input strings, if it is interpreted in a naive manner:
```
'[]ab' // begin at start of glyph sequence
'[a]b' // context matches
'[ab]' // nested context matches at index 0
'[aab]' // subst applies at index 1
'[a]ab' // return to parent context, uh oh!
'a[]ab' // move on to next glyph
'a[a]b' // context matches, infinite loop!
```
In short, if a nested contextual substitution can insert glyphs ahead
of its parent contextual substitution's context, then it creates a
"stack" that allows Turing-complete computation.
### Adjacent-mark reordering ambiguities ###
The Microsoft script-development specifications
[say](https://docs.microsoft.com/en-us/typography/script-development/devanagari#reorder-characters)
that marks should be reordered "to canonical order" (step 3 in the
linked Devanagari document) in the reordering phase. However, the same
step also describes this step as "Adjacent nukta and halant or nukta
and Vedic sign are always repositioned if necessary, so that the nukta
is first."
Together, it is somewhat ambiguous as to whether only "Halant,Nukta"
and "_Vedic_sign_,Nukta" sequences should be reordered by moving the
"Nukta" to the beginning, or all sequences of marks require reordering
into Unicode canonical combining class order, with "Nukta" moving to
the initial position as a special case.
### Merging of glyph properties ###
When the application of a shaping operation merges two or more
adjacent glyphs (for example, when two adjacent glyphs are substituted
with a single ligature glyph), the OpenType specification does not
dictate how shaping engines should combine (for example, merge,
replace, or drop) the properties of the input glyphs to determine the
properties of the output glyph.
This may result in ambiguities when a sequence of glyphs has several
substitutions applied in series.
For example, when shaping Indic scripts, glyphs may be tagged for the
possible application of multiple features, such as `half` and `rkrf`,
which are applied serially.
HarfBuzz and Uniscribe both take the approach of retaining the
properties of the first input glyph in a sequence, propagating those
properties to the merged output glyph.
## See also ##
Shaping engines may also want to offer explicit compatibility with
Microsoft Uniscribe, for the purpose of ensuring that users' existing
documents do not break. Therefore, implementors may wish to consult
the [Uniscribe compatibility notes](notes/uniscribe-bug-compatibility.md).
These compatibilty notes record test-driven observations about
Uniscribe's behavior, and they include any behavior that is a known
bug or a known deviation from specifications. Consequently, the issues
raised by offering Uniscribe compatiblity cannot be considered errata
in the sense that it is described above.