Automatic Shaping Pipeline (Phase 8)

Opt-in producer-side GSUB pipeline (v2.119.59 - v2.120.10)

 


The automatic shaping pipeline elevates the OpenType GSUB engine from a capability-only query surface into a producer-side feature that is applied automatically as text is emitted into PDF content streams. Callers enable specific GSUB features through a typed set (ShapingFeatures: THPDFShapingFeatures) and HotPDF takes care of running the right substitutions, marking substitute glyphs into the embedded font subset, and emitting the ToUnicode CMap reverse-mapping entries needed for accessibility.

 

Opt-in framework (v2.119.59 / Phase 8a)

A new enum and property control which automatic substitutions run during text emission:

 

type

  THPDFShapingFeature = (

    sfArabicGSUB,           // font-defined 'rlig' (Required Ligatures) for Arabic

    sfStandardLigatures,    // Latin 'liga' (ff / fi / fl / ffi / ffl / long-s t / st)

    sfContextualLigatures,  // Latin 'clig' (contextual ligatures)

    sfContextualAlternates, // 'rclt' (Required Contextual Alternates)

    sfIndicShaping);        // Devanagari Repha + pre-base I-matra reorder

  THPDFShapingFeatures = set of THPDFShapingFeature;

 

property ShapingFeatures: THPDFShapingFeatures read ... write ...;

 

Default is [] (empty set), which preserves byte-identical output for callers who depend on the v2.119.32-58 static post-pass shaper. Setting one or more flags elevates the engine into automatic mode for the corresponding features.
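
As a rough illustration of working with the set type, the sketch below repeats the enum from the declaration above so it compiles on its own. UsesAutomaticShaping and WithFeature are hypothetical helpers written for this example; they are not part of HotPDF.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
type
  // Copied from the declaration above so the sketch compiles on its own.
  THPDFShapingFeature = (sfArabicGSUB, sfStandardLigatures,
    sfContextualLigatures, sfContextualAlternates, sfIndicShaping);
  THPDFShapingFeatures = set of THPDFShapingFeature;

// Hypothetical helper: True when at least one automatic pass is requested.
function UsesAutomaticShaping(const Features: THPDFShapingFeatures): Boolean;
begin
  Result := Features <> [];
end;

// Hypothetical helper: returns Current with Feature switched on or off.
function WithFeature(const Current: THPDFShapingFeatures;
  Feature: THPDFShapingFeature; Enabled: Boolean): THPDFShapingFeatures;
begin
  if Enabled then
    Result := Current + [Feature]
  else
    Result := Current - [Feature];
end;
```

With a helper of this shape, a caller could toggle one flag without disturbing the others, for example PDF.ShapingFeatures := WithFeature(PDF.ShapingFeatures, sfIndicShaping, True).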

 

sfArabicGSUB - Phase 8c.2 (v2.119.63)

When sfArabicGSUB is set, font-defined rlig (Required Ligatures) substitutions are applied to Arabic text runs automatically. ApplyArabicGSUBRefinement walks the cmap to build a GID array, calls ApplyLigatureSubstitution with the rlig feature tag, maps substitute GIDs back through the reverse cmap (covering FB50-FDFF + FE70-FEFF Arabic Presentation Forms) to a Unicode codepoint, and calls MarkUnicodeGlyphUsed so the substitute glyph is kept in the embedded font subset. This covers font-specific ligatures beyond the four hard-coded Arabic ligature families (LAM-ALEF v2.119.32, YEH-HAMZA v2.119.58, Allah v2.119.60, Bismillah v2.119.62).
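
The reverse-cmap step can be pictured with a small standalone sketch. It assumes the font's cmap is available as a flat list of (codepoint, GID) pairs and that only entries inside the two Arabic Presentation Forms ranges are kept. TCmapEntry, BuildReverseCmap and ReverseLookup are hypothetical names for illustration and do not describe HotPDF's internal data structures.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
type
  // Hypothetical cmap entry: Unicode codepoint -> glyph ID.
  TCmapEntry = record
    CP: Cardinal;
    GID: Word;
  end;
  TCmapEntries = array of TCmapEntry;

// True for the two ranges named above: FB50-FDFF and FE70-FEFF.
function IsArabicPresentationForm(CP: Cardinal): Boolean;
begin
  Result := ((CP >= $FB50) and (CP <= $FDFF)) or
            ((CP >= $FE70) and (CP <= $FEFF));
end;

// Keeps only the cmap entries whose codepoint lies in the covered ranges.
function BuildReverseCmap(const Cmap: array of TCmapEntry): TCmapEntries;
var
  I, N: Integer;
begin
  SetLength(Result, Length(Cmap));
  N := 0;
  for I := 0 to High(Cmap) do
    if IsArabicPresentationForm(Cmap[I].CP) then
    begin
      Result[N] := Cmap[I];
      Inc(N);
    end;
  SetLength(Result, N);
end;

// Maps a substitute GID back to a codepoint; returns 0 when none is found.
function ReverseLookup(const Reverse: TCmapEntries; GID: Word): Cardinal;
var
  I: Integer;
begin
  Result := 0;
  for I := 0 to High(Reverse) do
    if Reverse[I].GID = GID then
    begin
      Result := Reverse[I].CP;
      Exit;
    end;
end;
```

A zero result would mean the substitute glyph has no presentation-form codepoint, in which case a caller of this sketch would leave the original glyphs untouched.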

 

Setting sfArabicGSUB implicitly bypasses the v2.85.0 static 4-position shaper for Arabic - callers who need the static shaper to keep handling codepoints outside what the font's GSUB declares should leave sfArabicGSUB off.

 

sfStandardLigatures / sfContextualLigatures - Phase 8b (v2.119.65)

When sfStandardLigatures is set, Latin Standard Ligatures are folded automatically using the font's liga feature. ApplyLatinLigatureRefinement targets the Alphabetic Presentation Forms block (U+FB00-FB4F) - typically FB00 ff, FB01 fi, FB02 fl, FB03 ffi, FB04 ffl, FB05 long-s + t, FB06 st. sfContextualLigatures adds a second pass for the font's clig feature. Both passes use the same reverse-cmap mechanism as sfArabicGSUB and emit 7 new ToUnicode CMap reverse-mapping entries (FB00-FB06) so consumer-reader copy / paste resolves the ligature back to the source letters.
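
To make the ToUnicode side concrete, the sketch below maps each of the seven ligature codepoints to its source letters and formats a bfchar line in the usual ToUnicode CMap syntax (source code in angle brackets, then the UTF-16BE text). LigatureSourceText and BfCharLine are hypothetical helpers; how HotPDF chooses the source code for each glyph is not shown here.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
uses
  SysUtils;

// Source letters for the ligature codepoints U+FB00..U+FB06.
// Returns an empty string for any other codepoint.
function LigatureSourceText(CP: Cardinal): UnicodeString;
begin
  case CP of
    $FB00: Result := 'ff';
    $FB01: Result := 'fi';
    $FB02: Result := 'fl';
    $FB03: Result := 'ffi';
    $FB04: Result := 'ffl';
    $FB05:
      begin
        Result := WideChar($017F);  // LATIN SMALL LETTER LONG S
        Result := Result + 't';
      end;
    $FB06: Result := 'st';
  else
    Result := '';
  end;
end;

// Formats one bfchar line: <code> <UTF-16BE text>. Assumes BMP text.
function BfCharLine(Code: Word; const Text: UnicodeString): string;
var
  I: Integer;
begin
  Result := '<' + IntToHex(Integer(Code), 4) + '> <';
  for I := 1 to Length(Text) do
    Result := Result + IntToHex(Integer(Ord(Text[I])), 4);
  Result := Result + '>';
end;
```

For a glyph emitted with code 0123 that stands for the fi ligature, BfCharLine($0123, LigatureSourceText($FB01)) yields the line <0123> <00660069>, which lets a reader copy the glyph back out as the two letters f and i.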

 

sfContextualAlternates - GSUB 'rclt' (v2.119.66)

When sfContextualAlternates is set, the font's rclt (Required Contextual Alternates) feature is applied. ApplyArabicGSUBContextualRefinement uses the v2.119.47 ApplyContextualSubst entry point and handles variable-length N-to-M output: a substitution is committed only when every replacement GID is reachable through the reverse cmap. The reverse cmap range is extended to FB00-FDFF + FE70-FEFF to cover the Latin, Hebrew, and Arabic Presentation Forms.
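
The all-or-nothing commit rule can be sketched as follows. The lookup callback is a hypothetical stand-in for a reverse-cmap query that returns 0 for an unmapped GID; TReverseCmapFunc, InReverseCmapRange and TryCommit are illustrative names, not HotPDF API.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
type
  TCodepointArray = array of Cardinal;
  // Hypothetical reverse-cmap lookup; returns 0 when the GID has no codepoint.
  TReverseCmapFunc = function(GID: Word): Cardinal;

// Extended range named above: FB00-FDFF plus FE70-FEFF.
function InReverseCmapRange(CP: Cardinal): Boolean;
begin
  Result := ((CP >= $FB00) and (CP <= $FDFF)) or
            ((CP >= $FE70) and (CP <= $FEFF));
end;

// Fills CPs and returns True only when every replacement GID resolves to a
// codepoint inside the range; otherwise CPs is emptied and False returned.
function TryCommit(const Replacement: array of Word;
  Lookup: TReverseCmapFunc; out CPs: TCodepointArray): Boolean;
var
  I: Integer;
  CP: Cardinal;
begin
  Result := False;
  SetLength(CPs, Length(Replacement));
  for I := 0 to High(Replacement) do
  begin
    CP := Lookup(Replacement[I]);
    if not InReverseCmapRange(CP) then
    begin
      SetLength(CPs, 0);  // discard partial output
      Exit;
    end;
    CPs[I] := CP;
  end;
  Result := True;
end;
```

Rejecting the whole substitution when a single replacement glyph is unreachable avoids emitting a half-substituted sequence whose glyphs could not all be mapped back to Unicode.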

 

Canonical users of rclt: Arabic init / medi / fina / isol when the font drives positional shaping through GSUB instead of through Unicode Presentation Forms codepoints; certain Latin sequence disambiguation rules; Indic shaping pres / blws / psts / half / pstf / cjct features when registered as rclt by the font designer.

 

sfIndicShaping - Phase 8e (v2.119.67)

When sfIndicShaping is set, the v2.119.55 Devanagari capability ApplyDevanagariReorder is promoted from a manual method to an automatic pre-pass applied inside the three BuildUnicode*FieldContent helpers. Devanagari runs get Repha (Ra + Halant at cluster start) moved to the post-base position, and pre-base I-matra (U+093F) moved before the cluster base consonant, so the consumer reader's GSUB engine picks up the syllable in the correct rendering order. Other Indic reorders (above-base / below-base matra, conjunct formation) remain in the font's GSUB.
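
The two reorders can be illustrated on a single cluster. This is a simplified sketch, not the ApplyDevanagariReorder implementation: it assumes the caller has already split the text into clusters, and it places the Repha at the very end of the cluster as an approximation of the post-base position. FromCodepoints and ReorderCluster are hypothetical helpers.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
const
  DEVA_RA      = $0930;
  DEVA_I_MATRA = $093F;
  DEVA_HALANT  = $094D;

// Builds a UnicodeString from BMP codepoints (helper for the sketch).
function FromCodepoints(const CPs: array of Word): UnicodeString;
var
  I: Integer;
begin
  Result := '';
  for I := Low(CPs) to High(CPs) do
    Result := Result + WideChar(CPs[I]);
end;

// Reorders ONE cluster:
//  - a leading Ra + Halant (Repha) moves to the end of the cluster;
//  - an I-matra moves to the front, before the base consonant.
function ReorderCluster(const Cluster: UnicodeString): UnicodeString;
var
  Body, Repha, Matra: UnicodeString;
  I: Integer;
begin
  Body := Cluster;
  Repha := '';
  Matra := '';
  // Repha needs a following base, so require at least three code units.
  if (Length(Body) >= 3) and (Ord(Body[1]) = DEVA_RA) and
     (Ord(Body[2]) = DEVA_HALANT) then
  begin
    Repha := Copy(Body, 1, 2);
    Body := Copy(Body, 3, Length(Body) - 2);
  end;
  Result := '';
  for I := 1 to Length(Body) do
    if Ord(Body[I]) = DEVA_I_MATRA then
      Matra := Matra + Body[I]
    else
      Result := Result + Body[I];
  Result := Matra + Result + Repha;
end;
```

Under these assumptions the cluster U+0930 U+094D U+0915 U+093F comes out as U+093F U+0915 U+0930 U+094D: the I-matra leads, the base consonant follows, and the Repha trails.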

 

Advance-query support (v2.119.64 / Phase 8c.5)

A companion API exposes the cached /W em-fraction so callers can compute word-wrap correctly when emitting GSUB-substituted glyphs:

 

function GetCodepointAdvance(CP: Cardinal): Single;

 

Returns the hmtx-derived advance width as a fraction of em for the cmap-resolved glyph at CP. The same release also fixed CodeUnitAdvance to classify Arabic Presentation Forms (U+FB50-FDFF + U+FE70-FEFF) as NARROW instead of WIDE; the heuristic fallback misclassified them before v2.119.64.
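
Because the advance is an em fraction, a width in points is the advance multiplied by the font size. The sketch below shows how a caller might measure a run and find a wrap position. TAdvanceFunc is a hypothetical plain-function stand-in with the same parameter and result types as GetCodepointAdvance; RunWidth and FitCount are illustrative helpers, and BMP-only text is assumed.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
type
  // Hypothetical stand-in shaped like GetCodepointAdvance.
  TAdvanceFunc = function(CP: Cardinal): Single;

// Width of a run in points: sum of em-fraction advances times font size.
function RunWidth(const S: UnicodeString; FontSize: Single;
  Advance: TAdvanceFunc): Single;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    Result := Result + Advance(Ord(S[I])) * FontSize;
end;

// Number of leading code units of S that fit within MaxWidth points.
function FitCount(const S: UnicodeString; FontSize, MaxWidth: Single;
  Advance: TAdvanceFunc): Integer;
var
  W: Single;
begin
  Result := 0;
  W := 0;
  while Result < Length(S) do
  begin
    W := W + Advance(Ord(S[Result + 1])) * FontSize;
    if W > MaxWidth then
      Break;
    Inc(Result);
  end;
end;
```

For example, with every glyph advancing half an em at 14 pt, four characters measure 28 pt and only two fit into a 20 pt column. A real word-wrap routine would additionally back up to the nearest break opportunity rather than cutting mid-word.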

 

Typical workflow (full Arabic auto-shaping)

 

PDF.RegisterUnicodeTTF('NotoArab', 'NotoSansArabic-Regular.ttf');

PDF.ShapingFeatures :=

  [sfArabicGSUB,  // font-defined 'rlig'

   sfContextualAlternates];  // 'rclt' (positional shaping in GSUB-driven fonts)

PDF.SetGSUBScript('arab');

PDF.BeginDoc;

PDF.CurrentPage.SetFont('NotoArab', [], 14);

PDF.CurrentPage.RtLTextOut(100, 700, 0,

  UnicodeString(#$0628#$0633#$0645#$0020#$0627#$0644#$0644#$0647));

PDF.EndDoc;

 

Typical workflow (Latin standard ligatures + Devanagari reorder)

 

PDF.RegisterUnicodeTTF('NotoSans', 'NotoSans-Regular.ttf');

PDF.RegisterUnicodeTTF('NotoDeva', 'NotoSansDevanagari-Regular.ttf');

PDF.ShapingFeatures :=

  [sfStandardLigatures, // FB00-FB06 Latin liga

   sfContextualLigatures, // + clig

   sfIndicShaping]; // Devanagari Repha + I-matra reorder

 

Phase 8 roadmap closure

Phase 8a (v2.119.59): opt-in framework + Arabic capability

Phase 8b (v2.119.65): Latin standard ligatures

Phase 8c.1 (v2.119.60): Allah

Phase 8c.2 (v2.119.63): GID-level GSUB rlig

Phase 8c.3 (v2.119.61): ToUnicode reverse mapping

Phase 8c.4 (v2.119.62): Bismillah

Phase 8c.5 (v2.119.64): advance query + heuristic fix

Phase 8c.6 (v2.119.68): PUA synthetic codepoint emit

Phase 8d: rolled into the 8c sub-phases

Phase 8e (v2.119.67): Devanagari auto-reorder

With v2.119.68 the Phase 8 capability matrix is closed; further refinements (additional Indic scripts, OpenType GPOS positioning, BiDi resolution) are tracked under separate roadmap items.

 

Scope and limitations

The opt-in pipeline is intentionally additive over the static post-pass shaper: existing callers see no behavior change unless they opt in. The default [] set is the safe choice when byte-stable output matters for regression testing. Fonts without the requested GSUB feature tables produce a safe no-op: no substitution is applied and no exception is raised.

 

The pipeline is substitution-only: no OpenType GPOS positioning is applied. The Unicode Bidirectional (BiDi) algorithm is not run either, so callers still supply mixed-direction runs in visual order or use a separate BiDi library. Automatic Indic shaping does not yet cover scripts beyond Devanagari; future revisions are tracked in dev-notes/GSUB-Engine-Roadmap.md.

 

See also: OpenType GSUB Substitution Engine, Arabic / Persian / Urdu Shaping Support, Syriac / Mongolian / Devanagari Shaping, THotPDF.AssignSyntheticCodepointForGID, CFF / OpenType Font Subsetting