OpenType GSUB Substitution Engine

Glyph substitution capability surface (v2.119.43 - v2.119.50)

 


The OpenType GSUB (Glyph SUBstitution) engine inside HotPDF lets callers query, drive, and embed every kind of glyph substitution an OpenType font declares - ligatures, stylistic alternates, contextual variants, Arabic / Indic shaping forms, CJK alternate forms, and so on. Every OpenType GSUB LookupType 1 through 8 is implemented and exposed as a capability-only query surface; the caller drives text emission and decides which substitute glyph to write to the page content stream.

 

Public API

type

  TGSUBStringArray = array of AnsiString;

 

// LookupType 1 - Single Substitution (one glyph -> one glyph)

function GetSingleSubstituteGlyph(InputGID: Word; const FeatureTag: AnsiString): Word;

 

// LookupType 2 - Multiple Substitution (one glyph -> sequence of glyphs)

function GetMultipleSubstituteGlyphs(InputGID: Word; const FeatureTag: AnsiString;

  var OutGIDs: array of Word): Boolean;

 

// LookupType 3 - Alternate Substitution (one glyph -> one of N alternates)

function GetAlternateGlyphCount(InputGID: Word; const FeatureTag: AnsiString): Integer;

function GetAlternateGlyph(InputGID: Word; const FeatureTag: AnsiString;

  AlternateIndex: Integer): Word;

 

// LookupType 4 - Ligature Substitution (N glyphs -> one ligature)

function ApplyLigatureSubstitution(const InputGIDs: array of Word;

  StartIndex: Integer; const FeatureTag: AnsiString;

  out OutGID: Word; out ConsumedCount: Integer): Boolean;

 

// LookupType 5 + 6 - Contextual / Chained Contextual Substitution

function ApplyContextualSubst(const InputGIDs: array of Word;

  StartIndex: Integer; const FeatureTag: AnsiString;

  var OutGIDs: array of Word;

  out ConsumedLen: Integer): Boolean;

 

// LookupType 8 - Reverse Chained Contextual Single Substitution

function ApplyReverseChainedContextualSubst(const InputGIDs: array of Word;

  StartIndex: Integer; const FeatureTag: AnsiString;

  out OutGID: Word): Boolean;

 

// Script / LangSys selection (Phase 7)

procedure SetGSUBScript(const ScriptTag: AnsiString);

procedure SetGSUBLanguage(const LangTag: AnsiString);

function GetGSUBScripts: TGSUBStringArray;

function GetGSUBLanguages(const ScriptTag: AnsiString): TGSUBStringArray;

function GetGSUBFeatures(const ScriptTag, LangTag: AnsiString): TGSUBStringArray;

 

// TTF subsetter closure (Phase 9)

procedure MarkUnicodeGlyphUsed(GID: Word);

 

Description

The engine activates after RegisterUnicodeTTF has parsed a font and cached its GSUB / GDEF / cmap tables. Each substitution query walks the font's ScriptList / LangSys / FeatureList / LookupList chain and dispatches to the appropriate LookupType handler. The 13 methods above are the complete public surface; everything else (cmap walk, ScriptList parsing, Coverage table lookup, ClassDef resolution, LookupFlag honour, Extension wrapper unwrapping, SequenceLookupRecord nested dispatch) lives behind that surface.

 

Defensive contract throughout: fonts without a GSUB table, non-4-byte feature tags, features the selected script / language does not advertise, GIDs no subtable covers, and LookupFlag-ignored input glyphs all return a safe no-op (False / OutGID = InputGID / empty OutGIDs / ConsumedCount = 1) so callers never see exceptions for routine "no substitution applies" cases.

 

LookupType matrix

LookupType 1 (Single Substitution) - one glyph maps to one substitute. Canonical features: salt, ss01-ss20, smcp, onum, liga in fonts that wire it through LookupType 1, plus the init / medi / fina / isol Arabic positional forms in fonts that drive them through GSUB. Use GetSingleSubstituteGlyph.

LookupType 2 (Multiple Substitution) - one glyph splits into a sequence of substitute glyphs. Canonical user: ccmp Glyph Composition / Decomposition (precomposed accented Latin letters split into base + combining marks for downstream mark positioning). Use GetMultipleSubstituteGlyphs.

LookupType 3 (Alternate Substitution) - one glyph maps to one of N alternates. Canonical features: aalt (Access All Alternates), salt when wired as Type 3, titl (Titling Alternates), ss01-ss20 stylistic sets when the font designer offers more than one alternate per slot. Use GetAlternateGlyphCount + GetAlternateGlyph.

LookupType 4 (Ligature Substitution) - N input glyphs fold into one ligature. Canonical features: liga (Standard Ligatures: fi / fl / ffi / ffl), clig (Contextual Ligatures), dlig (Discretionary Ligatures), hlig (Historical Ligatures), rlig (Required Ligatures - Arabic LAM-ALEF and similar), Indic script ligatures (akhn, pres, blws, psts). Use ApplyLigatureSubstitution.

LookupType 5 (Contextual Substitution) + LookupType 6 (Chained Contextual Substitution) - matches an input glyph sequence and dispatches nested lookups at specific positions inside the match. All three Format variants (1 literal sequence, 2 ClassDef sequence, 3 Coverage sequence) are implemented; the SequenceLookupRecord dispatcher re-enters the LookupList and handles Single / Multiple / Alternate (first) / Ligature nested lookups with live MatchPositions tracking. Canonical features: rclt (Required Contextual Alternates - Arabic init/medi/fina/isol when GSUB-driven), clig, calt, Indic shaping pres / blws / psts / half / pstf / cjct. Use ApplyContextualSubst (one entry point covers both LookupType 5 and 6).

LookupType 7 (Extension Substitution) - pure indirection layer the OpenType spec defines for fonts whose substitution subtable lives beyond the 16-bit reach of the LookupList. Every public API transparently follows the Offset32 indirection to the real LookupType 1 / 2 / 3 / 4 / 5 / 6 / 8 subtable. This unblocks heavy CJK / Indic fonts (Noto Sans CJK, Noto Sans Devanagari) whose GSUB exceeds 64 KB. There is no separate API - the unwrap is automatic.

LookupType 8 (Reverse Chained Contextual Single Substitution) - context-aware 1:1 substitution whose distinguishing feature is that callers must apply it in REVERSE scan order over a multi-glyph run (end -> start) because each substitute may depend on FUTURE lookahead context that must not have been substituted yet. Canonical use: Arabic / Syriac / N'Ko / Indic contextual alternates whose final form depends on the following glyph. Use ApplyReverseChainedContextualSubst; the caller drives the reverse scan loop.
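
The caller-driven reverse scan described for LookupType 8 could look roughly like the sketch below. It assumes the methods listed under Public API are members of THotPDF and that the unit declaring THotPDF is in scope; TGIDRun and ReverseSubstituteRun are hypothetical names introduced only for this illustration. Following the description above, every query sees the unmodified input run and results are written into a separate copy.

```pascal
type
  TGIDRun = array of Word;  // hypothetical helper type, not part of HotPDF

// Applies one LookupType 8 feature across a whole run, scanning end -> start.
// Queries always read the unmodified Run; substitutes go into a separate copy.
function ReverseSubstituteRun(PDF: THotPDF; const Run: TGIDRun;
  const FeatureTag: AnsiString): TGIDRun;
var
  I: Integer;
  OutGID: Word;
begin
  Result := Copy(Run);  // start from an unsubstituted copy of the input
  for I := High(Run) downto 0 do
    if PDF.ApplyReverseChainedContextualSubst(Run, I, FeatureTag, OutGID) then
    begin
      Result[I] := OutGID;
      // Assumes the caller emits Result afterwards, so the substitute
      // must be pulled into the embedded subset.
      PDF.MarkUnicodeGlyphUsed(OutGID);
    end;
end;
```

An empty run is handled naturally: High(Run) is -1, so the loop body never executes and an empty array is returned.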

 

Script / LangSys selection

By default the engine prefers the DFLT script (or the first script the font declares) and the default LangSys. To pin queries to a specific script / language pair, call SetGSUBScript with a script tag ('latn', 'arab', 'cyrl', 'hani', 'kana', 'deva', 'beng', 'taml', etc.) and SetGSUBLanguage with a language tag ('ENG ', 'TRK ', 'AZE ', 'JAN ', 'KOR ', 'ARA ', etc., trailing-space padded to 4 bytes). An empty string restores the default-path baseline. Selections persist across queries and are cleared on RegisterUnicodeTTF('', nil).
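
Because language tags must be padded with trailing spaces to 4 bytes, a small helper can save callers from hand-padding literals. PadOpenTypeTag below is a hypothetical name and not part of HotPDF; it is a minimal sketch that leaves an empty string untouched (so it can still mean "restore the default") and does not truncate over-long input.

```pascal
// Pads a tag with trailing spaces up to the 4 bytes OpenType tags occupy.
// Empty input stays empty; tags of 4 or more bytes are returned unchanged.
function PadOpenTypeTag(const Tag: AnsiString): AnsiString;
begin
  Result := Tag;
  if Result = '' then
    Exit;
  while Length(Result) < 4 do
    Result := Result + ' ';
end;
```

A call such as PDF.SetGSUBLanguage(PadOpenTypeTag('ENG')) would then pass 'ENG ' to the engine.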

 

Strict-vs-fallback semantics: an unknown ScriptTag makes subsequent queries return empty no-op results (so callers can detect their chosen script is unavailable); an unknown LangTag falls back to the script's default LangSys per OpenType convention. GetGSUBScripts / GetGSUBLanguages / GetGSUBFeatures enumerate what the loaded font actually advertises.
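
Since an unknown ScriptTag turns later queries into no-ops, callers may prefer to check availability before pinning a script. The sketch below assumes the methods listed under Public API are members of THotPDF and that GetGSUBScripts returns tags exactly as stored in the font (4 bytes, case-sensitive); FontHasGSUBScript and SelectScriptOrDefault are hypothetical helper names.

```pascal
// Returns True when the loaded font advertises ScriptTag in its GSUB table.
function FontHasGSUBScript(PDF: THotPDF; const ScriptTag: AnsiString): Boolean;
var
  Scripts: TGSUBStringArray;
  I: Integer;
begin
  Result := False;
  Scripts := PDF.GetGSUBScripts;
  for I := 0 to High(Scripts) do
    if Scripts[I] = ScriptTag then
    begin
      Result := True;
      Exit;
    end;
end;

// Pins the requested script when available, else restores the default path.
procedure SelectScriptOrDefault(PDF: THotPDF; const ScriptTag: AnsiString);
begin
  if FontHasGSUBScript(PDF, ScriptTag) then
    PDF.SetGSUBScript(ScriptTag)
  else
    PDF.SetGSUBScript('');  // empty string = default-path baseline
end;
```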

 

LookupFlag honour and GDEF

Every query reads each Lookup table's LookupFlag (and the optional trailing markFilteringSet uint16 when useMarkFilteringSet is set) and skips input glyphs flagged for ignore. The spec-defined bits are honoured: ignoreBaseGlyphs (0x0002, skip GDEF class 1), ignoreLigatures (0x0004, skip class 2), ignoreMarks (0x0008, skip class 3), useMarkFilteringSet (0x0010), and the high byte markAttachmentType. ClassDef Format 1 + 2 are both parsed; GDEF v1.0 / v1.1 / v1.2 headers are all accepted. Fonts without a GDEF table fall back to "no glyph is ignored" so output stays byte-identical for callers using GDEF-less fonts.
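
The skip rule implied by the bits listed above can be written out as a small standalone function. This is a sketch of the spec-defined rule, not HotPDF's internal code: the constant names and IsGlyphIgnored are hypothetical, the caller is assumed to have already resolved the glyph's GDEF glyph class and mark attachment class, and mark filtering sets (useMarkFilteringSet) are deliberately left out.

```pascal
const
  // LookupFlag bits, as listed in the paragraph above
  LF_IGNORE_BASE_GLYPHS    = $0002;
  LF_IGNORE_LIGATURES      = $0004;
  LF_IGNORE_MARKS          = $0008;
  LF_MARK_ATTACH_TYPE_MASK = $FF00;

  // GDEF glyph classes
  GDEF_CLASS_BASE     = 1;
  GDEF_CLASS_LIGATURE = 2;
  GDEF_CLASS_MARK     = 3;

// GlyphClass: GDEF glyph class of the glyph (0 when unclassified or no GDEF).
// MarkAttachClass: GDEF mark attachment class of the glyph (0 when none).
function IsGlyphIgnored(LookupFlag: Word;
  GlyphClass, MarkAttachClass: Byte): Boolean;
var
  RequiredAttachClass: Byte;
begin
  Result := False;
  case GlyphClass of
    GDEF_CLASS_BASE:
      Result := (LookupFlag and LF_IGNORE_BASE_GLYPHS) <> 0;
    GDEF_CLASS_LIGATURE:
      Result := (LookupFlag and LF_IGNORE_LIGATURES) <> 0;
    GDEF_CLASS_MARK:
      begin
        RequiredAttachClass := (LookupFlag and LF_MARK_ATTACH_TYPE_MASK) shr 8;
        if (LookupFlag and LF_IGNORE_MARKS) <> 0 then
          Result := True
        else if RequiredAttachClass <> 0 then
          // A non-zero attachment type skips marks of every other class.
          Result := MarkAttachClass <> RequiredAttachClass;
      end;
  end;
end;
```

Passing GlyphClass = 0 for every glyph reproduces the "no glyph is ignored" behaviour described for fonts without a GDEF table.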

 

TTF subsetter closure (MarkUnicodeGlyphUsed)

HotPDF's v2.84.0 TTF subsetter derives its used-glyph set from FUnicodeUsedCps through the cmap. GSUB substitute glyphs (stylistic alternates, ligatures, contextual variants - everything the query APIs above return) typically have no codepoint that reaches them via the cmap, so they were previously invisible to the subsetter, and the consuming PDF reader rendered .notdef in their place.

 

After emitting any GID returned by GetSingleSubstituteGlyph / GetMultipleSubstituteGlyphs / GetAlternateGlyph / ApplyLigatureSubstitution / ApplyContextualSubst / ApplyReverseChainedContextualSubst into a PDF text stream, call MarkUnicodeGlyphUsed(GID) once per emitted GID to pull it into the embedded subset. The helper is idempotent, defensive (out-of-range GIDs are silently dropped), and integrates with the v2.84.0 composite-glyph closure pass: callers only need to mark the top-level substitute GID - composite components are auto-pulled.
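
When a query yields several GIDs at once, the once-per-GID rule can be wrapped in a small loop. MarkGlyphsUsed is a hypothetical helper, and the sketch assumes MarkUnicodeGlyphUsed is a member of THotPDF. Because the call is described as idempotent, passing duplicate GIDs is harmless.

```pascal
// Marks every GID in the array as used so the subsetter keeps it.
// The caller passes only the GIDs it actually emitted.
procedure MarkGlyphsUsed(PDF: THotPDF; const GIDs: array of Word);
var
  I: Integer;
begin
  for I := Low(GIDs) to High(GIDs) do
    PDF.MarkUnicodeGlyphUsed(GIDs[I]);
end;
```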

 

Typical workflow (Latin small caps)

 

PDF.RegisterUnicodeTTF('myFont', 'C:\Windows\Fonts\arial.ttf');

PDF.SetGSUBScript('latn');

PDF.SetGSUBLanguage('');  // default LangSys

SmallCapGID := PDF.GetSingleSubstituteGlyph(InputGID, 'smcp');

if SmallCapGID <> InputGID then

begin

  // emit SmallCapGID into the page content stream...

  PDF.MarkUnicodeGlyphUsed(SmallCapGID);  // pull into subset

end;

 

Typical workflow (Arabic LAM-ALEF ligature)

 

PDF.SetGSUBScript('arab');

Run := [LamGID, FathaGID, AlefGID];  // post-cmap GIDs

if PDF.ApplyLigatureSubstitution(Run, 0, 'rlig', LigGID, ConsumedCount) then

begin

  // emit LigGID + advance by ConsumedCount

  PDF.MarkUnicodeGlyphUsed(LigGID);

end;
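
The single query above generalises to a left-to-right scan over a whole run. The sketch below assumes the methods listed under Public API are members of THotPDF; TGIDRun and FoldLigatures are hypothetical names. How glyphs skipped through LookupFlag inside a match (such as the FATHA above) are reported is not stated here, so this sketch does not re-emit them.

```pascal
type
  TGIDRun = array of Word;  // hypothetical helper type, not part of HotPDF

// Scans Run left to right and folds ligatures for FeatureTag where they apply.
function FoldLigatures(PDF: THotPDF; const Run: TGIDRun;
  const FeatureTag: AnsiString): TGIDRun;
var
  I, OutLen, Consumed: Integer;
  LigGID: Word;
begin
  SetLength(Result, Length(Run));  // output is never longer than the input
  OutLen := 0;
  I := 0;
  while I < Length(Run) do
  begin
    if PDF.ApplyLigatureSubstitution(Run, I, FeatureTag, LigGID, Consumed)
      and (Consumed >= 1) then
    begin
      Result[OutLen] := LigGID;
      PDF.MarkUnicodeGlyphUsed(LigGID);  // pull the ligature into the subset
      Inc(I, Consumed);                  // skip the glyphs the ligature replaced
    end
    else
    begin
      Result[OutLen] := Run[I];  // no ligature starts here: keep the input glyph
      Inc(I);
    end;
    Inc(OutLen);
  end;
  SetLength(Result, OutLen);  // trim to the number of glyphs actually written
end;
```

The extra Consumed >= 1 check is only a guard so the loop always advances, even if a query were to report an unexpected count.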

 

Scope and limitations

The engine is a capability-only query surface: it answers "what would GSUB do here", but it does not run an automatic shaping pipeline (HarfBuzz-class layout, cluster-aware reordering for Indic, BiDi resolution, GPOS positioning, mark attachment). Callers are responsible for driving the scan loop, choosing which substitute / alternate to emit, calling MarkUnicodeGlyphUsed for each emitted substitute GID, and applying any GPOS / mark positioning a font requires.

 

Producer-side Arabic / Persian / Urdu shaping (LAM-ALEF mandatory ligature + Arabic Presentation Forms-A) is implemented as a separate built-in pipeline that runs automatically during text emission - see Arabic / Persian / Urdu Shaping.

 

Version trace

v2.119.43 - Single Substitution (Phase 1).

v2.119.44 - Multiple + Alternate Substitution (Phase 2).

v2.119.45 - Ligature Substitution (Phase 3).

v2.119.46 - Extension + GDEF + LookupFlag honour (Phase 4).

v2.119.47 - Contextual + Chained Contextual + SequenceLookupRecord dispatcher (Phase 5).

v2.119.48 - Reverse Chained Contextual; closes the LookupType 1-8 matrix (Phase 6).

v2.119.49 - Script / LangSys selection API (Phase 7).

v2.119.50 - TTF subsetter closure via MarkUnicodeGlyphUsed (Phase 9). Phase 8 was a producer-side shaping integration spike and was split into 8a-8f for future revisions.

 

See also: Arabic / Persian / Urdu Shaping, CFF / OpenType Font Subsetting Functions, THotPDF.EnableFontSubsetting