Arabic / Persian / Urdu Shaping Support

Producer-side shaping pipeline (v2.85.0 - v2.119.68)

 

HotPDF runs a producer-side shaping pipeline that folds Unicode-input Arabic, Persian, and Urdu runs into their Arabic Presentation Forms during PDF text emission, so consumer readers receive ready-to-render positional / ligature glyphs without needing a HarfBuzz-class shaper of their own.

 

What the pipeline does

Positional shaping (v2.85.0): every Arabic base letter has up to four positional forms - isolated (isol), initial (init), medial (medi), final (fina). HotPDF inspects each Arabic character's join class and the join classes of its immediate neighbours, then maps the input codepoint to the appropriate Arabic Presentation Forms-B (U+FE70 - U+FEFC) glyph before emission. Non-joining letters, joiners, transparent characters (combining marks), and the Tatweel kashida are all handled per the Unicode Arabic Shaping algorithm.
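
The neighbour inspection described above can be sketched as follows. This is a minimal illustration, not HotPDF's implementation: JoinClassOf, FormsBBase, NeighbourClass and ShapeAt are hypothetical names, and the two tables cover only the handful of letters needed for the demo. It relies on the Forms-B layout, where each letter's forms are stored in the order isolated, final, initial, medial.

```pascal
program PositionalFormSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
uses
  SysUtils;

type
  TJoinClass = (jcNone, jcRight, jcDual, jcTransparent);

// Hypothetical, deliberately tiny join-class table (not HotPDF's real one).
function JoinClassOf(CP: Word): TJoinClass;
begin
  case CP of
    $0627, $062F, $0631, $0648: Result := jcRight;  // ALEF, DAL, REH, WAW
    $0628, $062D, $0644, $0645: Result := jcDual;   // BEH, HAH, LAM, MEEM
    $064B..$0652: Result := jcTransparent;          // combining harakat
  else
    Result := jcNone;
  end;
end;

// Forms-B codepoint of the isolated form; 0 when the letter is not in this table.
function FormsBBase(CP: Word): Word;
begin
  case CP of
    $0627: Result := $FE8D;
    $0628: Result := $FE8F;
    $062D: Result := $FEA1;
    $062F: Result := $FEA9;
    $0631: Result := $FEAD;
    $0644: Result := $FEDD;
    $0645: Result := $FEE1;
    $0648: Result := $FEED;
  else
    Result := 0;
  end;
end;

// Join class of the nearest non-transparent neighbour in direction Step (-1 or +1).
function NeighbourClass(const Run: array of Word; I, Step: Integer): TJoinClass;
begin
  I := I + Step;
  while (I >= 0) and (I <= High(Run)) do
  begin
    Result := JoinClassOf(Run[I]);
    if Result <> jcTransparent then Exit;
    I := I + Step;
  end;
  Result := jcNone;
end;

// Maps Run[I] to its positional Forms-B codepoint, or returns it unchanged.
function ShapeAt(const Run: array of Word; I: Integer): Word;
var
  Cls: TJoinClass;
  Base: Word;
  JoinsPrev, JoinsNext: Boolean;
begin
  Result := Run[I];
  Cls := JoinClassOf(Run[I]);
  Base := FormsBBase(Run[I]);
  if (Base = 0) or not (Cls in [jcRight, jcDual]) then Exit;
  // Only a dual-joining predecessor connects forward to this letter.
  JoinsPrev := NeighbourClass(Run, I, -1) = jcDual;
  // Only a dual-joining letter can connect forward to its successor.
  JoinsNext := (Cls = jcDual) and (NeighbourClass(Run, I, +1) in [jcRight, jcDual]);
  if JoinsPrev and JoinsNext then Result := Base + 3   // medi
  else if JoinsNext then Result := Base + 2            // init
  else if JoinsPrev then Result := Base + 1            // fina
  else Result := Base;                                 // isol
end;

const
  Marhaba: array[0..4] of Word = ($0645, $0631, $062D, $0628, $0627);
var
  I: Integer;
begin
  for I := 0 to High(Marhaba) do
    Write(IntToHex(Integer(ShapeAt(Marhaba, I)), 4), ' ');
  WriteLn;  // prints: FEE3 FEAE FEA3 FE92 FE8E
end.
```

REH is right-joining, so it takes its final form and forces the following HAH into an initial form even though HAH sits in the middle of the word.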

 

LAM-ALEF mandatory ligature (v2.119.32): the Unicode Arabic shaping rules require any LAM (U+0644) immediately followed by an ALEF (U+0627 plain, U+0622 with madda above, U+0623 with hamza above, U+0625 with hamza below) to fold into a single ligature glyph (U+FEF5 - U+FEFC: one isolated / final pair for each of the madda, hamza-above, hamza-below, and plain ALEF variants). This is one of the very few non-optional ligatures in Arabic typography; rendering LAM and ALEF as separate glyphs produces text that native readers immediately recognise as malformed. HotPDF performs the fold during emission so callers do not need a HarfBuzz-class shaper in their own code path.
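
A codepoint-level sketch of such a fold is shown below. The names LamAlefLigature, IsDualJoining and FoldLamAlef are hypothetical, the dual-joining test is a tiny stand-in for a real joining-class table, and transparent marks between LAM and ALEF are ignored for brevity. The final form is chosen when the letter before LAM connects forward to it.

```pascal
program LamAlefSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
uses
  SysUtils;

type
  TCodepoints = array of Word;

// Ligature for LAM + the given ALEF variant; 0 when Alef is not an ALEF variant.
function LamAlefLigature(Alef: Word; JoinsPrev: Boolean): Word;
begin
  case Alef of
    $0622: Result := $FEF5;  // ALEF WITH MADDA ABOVE
    $0623: Result := $FEF7;  // ALEF WITH HAMZA ABOVE
    $0625: Result := $FEF9;  // ALEF WITH HAMZA BELOW
    $0627: Result := $FEFB;  // plain ALEF
  else
    Result := 0;
  end;
  // The final form sits directly after the isolated form.
  if (Result <> 0) and JoinsPrev then Inc(Result);
end;

// Hypothetical stand-in for a joining-class lookup (BEH, SEEN, LAM, MEEM only).
function IsDualJoining(CP: Word): Boolean;
begin
  Result := (CP = $0628) or (CP = $0633) or (CP = $0644) or (CP = $0645);
end;

// Replaces every LAM + ALEF pair with the matching single ligature codepoint.
function FoldLamAlef(const Src: TCodepoints): TCodepoints;
var
  I, N: Integer;
  Lig: Word;
begin
  SetLength(Result, Length(Src));
  I := 0;
  N := 0;
  while I < Length(Src) do
  begin
    Lig := 0;
    if (Src[I] = $0644) and (I + 1 < Length(Src)) then
      Lig := LamAlefLigature(Src[I + 1], (I > 0) and IsDualJoining(Src[I - 1]));
    if Lig <> 0 then
    begin
      Result[N] := Lig;
      Inc(I, 2);  // consume both LAM and ALEF
    end
    else
    begin
      Result[N] := Src[I];
      Inc(I);
    end;
    Inc(N);
  end;
  SetLength(Result, N);
end;

var
  Salam, Folded: TCodepoints;
  I: Integer;
begin
  SetLength(Salam, 4);  // SEEN LAM ALEF MEEM
  Salam[0] := $0633; Salam[1] := $0644; Salam[2] := $0627; Salam[3] := $0645;
  Folded := FoldLamAlef(Salam);
  for I := 0 to High(Folded) do
    Write(IntToHex(Integer(Folded[I]), 4), ' ');
  WriteLn;  // prints: 0633 FEFC 0645
end.
```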

 

Persian / Urdu core 9 letters (v2.119.35): Persian and Urdu extend Arabic with letters whose presentation forms are not in the Arabic Presentation Forms-B block (U+FE70 - U+FEFC). The 9 most-used such letters - PEH (U+067E), TCHEH (U+0686), JEH (U+0698), KEHEH (U+06A9), GAF (U+06AF), NOON GHUNNA (U+06BA), HEH DOACHASHMEE (U+06BE), YEH BARREE (U+06D2), and HEH GOAL (U+06C1) - have presentation forms in Arabic Presentation Forms-A (U+FB50 - U+FDFF). HotPDF now maps these letters to the appropriate Forms-A glyph during positional shaping, so Persian and Urdu text renders with the correct init / medi / fina / isol forms in any consumer reader.

 

Arabic Extended-A + Supplement (v2.119.52, v2.119.56): the joining-class table now covers the remaining extended letters of the main Arabic block (ALEF WASLA / NOON GHUNNA / HEH variants), Arabic Supplement U+0750-U+077F, and the Arabic Extended-A range U+08A0-U+08FF. Characters with a static Presentation Forms-A encoding are mapped through the existing 4-position shaper; characters without one (most of Extended-A) get joining-class classification only, so their neighbours shape correctly even when the character itself is passed through unchanged. v2.119.56 corrected two wrong Forms-A mappings introduced by v2.119.52 (U+06C2 / U+06C3).

 

Persian / Urdu Form-B full coverage (v2.119.57): the joining-class table was extended to span the full U+0672-U+06D5 range (about 80 characters covering REH / DAL / SEEN / SAD / TAH / AIN / FEH / QAF / KAF / GAF / LAM / NOON / HEH / WAW / YEH variants), and 26 new Presentation Forms-A mappings were added (15 D-class 4-form + 11 R-class 2-form). Notably, Urdu's standard 'h' HEH DOACHASHMEE (U+06BE → FBAA-FBAD) and Urdu word-final yeh YEH BARREE (U+06D2 → FBAE-FBAF) are now correctly mapped to the Forms-A slots that were misused by the v2.119.52 ALEF WASLA / NOON GHUNNA temporary fix. Static Forms-A coverage now spans 40+ source characters.

 

YEH-HAMZA + vowel ligature post-pass (v2.119.58): a post-pass added to _ApplyArabicShaping after positional shaping covers the 8 ligature pairs in Forms-A block U+FBEA-U+FBFB - YEH-HAMZA + ALEF / AE / WAW / U / OE / YU / E / ALEF MAKSURA. Each pair emits either the isolated form (U+FBEA / FBEC / FBEE / FBF0 / FBF2 / FBF4 / FBF6 / FBF9) or the final form at the next codepoint (isolated + 1). The initial-form slots U+FBF8 / U+FBFB, which exist only for the E and ALEF MAKSURA pairs, are left to the GSUB engine via the sfArabicGSUB opt-in. Built on the same skeleton as v2.119.32 LAM-ALEF.
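
The pair table lends itself to a small table-driven lookup. The sketch below is illustrative only: YehHamzaLigature and the two constant arrays are hypothetical names, and the second-letter codepoints are taken from the Unicode decompositions of U+FBEA-U+FBFB.

```pascal
program YehHamzaSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
uses
  SysUtils;

const
  YEH_HAMZA = $0626;  // ARABIC LETTER YEH WITH HAMZA ABOVE
  // Second letter of each pair: ALEF, AE, WAW, U, OE, YU, E, ALEF MAKSURA.
  PairSecond: array[0..7] of Word =
    ($0627, $06D5, $0648, $06C7, $06C6, $06C8, $06D0, $0649);
  // Isolated-form ligature for the pair at the same index.
  PairIsolated: array[0..7] of Word =
    ($FBEA, $FBEC, $FBEE, $FBF0, $FBF2, $FBF4, $FBF6, $FBF9);

// Returns the ligature for First + Second, or 0 when the pair is not in the table.
function YehHamzaLigature(First, Second: Word; JoinsPrev: Boolean): Word;
var
  K: Integer;
begin
  Result := 0;
  if First <> YEH_HAMZA then Exit;
  for K := Low(PairSecond) to High(PairSecond) do
    if PairSecond[K] = Second then
    begin
      Result := PairIsolated[K];
      if JoinsPrev then Inc(Result);  // final form = isolated + 1
      Exit;
    end;
end;

begin
  WriteLn(IntToHex(Integer(YehHamzaLigature($0626, $0627, False)), 4));  // FBEA
  WriteLn(IntToHex(Integer(YehHamzaLigature($0626, $0649, True)), 4));   // FBFA
end.
```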

 

Allah ligature (v2.119.60): the four-character sequence ALEF + LAM + LAM + HEH folds into U+FDF2 ARABIC LIGATURE ALLAH ISOLATED FORM. Implemented as a codepoint-level static post-pass after v2.119.32 LAM-ALEF and v2.119.58 YEH-HAMZA in the _ApplyArabicShaping chain.
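
A many-to-one fold of this kind can be sketched with a generic sequence replacer. MatchesAt and FoldSequence are hypothetical helper names, not HotPDF routines, and the sketch works on raw input codepoints without considering what positional shaping may already have done to them.

```pascal
program SequenceFoldSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
uses
  SysUtils;

type
  TCodepoints = array of Word;

// True when Pattern (non-empty) occurs in Src starting at index Start.
function MatchesAt(const Src: TCodepoints; Start: Integer;
  const Pattern: array of Word): Boolean;
var
  K: Integer;
begin
  Result := (Length(Pattern) > 0) and (Start + Length(Pattern) <= Length(Src));
  if not Result then Exit;
  for K := 0 to High(Pattern) do
    if Src[Start + K] <> Pattern[K] then
    begin
      Result := False;
      Exit;
    end;
end;

// Replaces each occurrence of Pattern with the single codepoint Replacement.
function FoldSequence(const Src: TCodepoints; const Pattern: array of Word;
  Replacement: Word): TCodepoints;
var
  I, N: Integer;
begin
  SetLength(Result, Length(Src));
  I := 0;
  N := 0;
  while I < Length(Src) do
  begin
    if MatchesAt(Src, I, Pattern) then
    begin
      Result[N] := Replacement;
      Inc(I, Length(Pattern));
    end
    else
    begin
      Result[N] := Src[I];
      Inc(I);
    end;
    Inc(N);
  end;
  SetLength(Result, N);
end;

const
  AllahSeq: array[0..3] of Word = ($0627, $0644, $0644, $0647);  // ALEF LAM LAM HEH
var
  Src, Folded: TCodepoints;
  I: Integer;
begin
  SetLength(Src, 5);
  Src[0] := $0628;  // BEH, not part of the pattern
  for I := 0 to 3 do
    Src[I + 1] := AllahSeq[I];
  Folded := FoldSequence(Src, AllahSeq, $FDF2);
  for I := 0 to High(Folded) do
    Write(IntToHex(Integer(Folded[I]), 4), ' ');
  WriteLn;  // prints: 0628 FDF2
end.
```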

 

Bismillah phrase ligature (v2.119.62): the standard 22-codepoint Bismillah phrase "بسم الله الرحمن الرحيم" folds into a single glyph U+FDFD ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM. Runs as the very first pre-pass in _ApplyArabicShaping (before LAM-ALEF), so the folded Bismillah glyph reaches the rest of the pipeline as one codepoint and is not retouched by downstream substitutions.

 

How callers invoke it

The shaping pipeline runs automatically inside the Unicode text emission methods - no explicit "shape this" call is required. Pass Unicode-input Arabic / Persian / Urdu strings to THPDFPage.UnicodeTextOut or THPDFPage.RtLTextOut and HotPDF maps each input codepoint to its presentation form before writing the PDF text-showing operator. The original Unicode codepoints are also captured into FUnicodeUsedCps for ToUnicode CMap generation, so reader-side copy / paste and screen readers still see the original Unicode payload.

 

Font requirements

The font registered via RegisterUnicodeTTF must contain glyphs for both the Arabic Presentation Forms-B block (U+FE70 - U+FEFC) and (for Persian / Urdu) the relevant Arabic Presentation Forms-A range (U+FB50 - U+FDFF). Recommended fonts that ship the full set:

 

Noto Sans Arabic (Latin + Arabic; Forms-A + Forms-B + Arabic Extended-A).

Noto Naskh Arabic, Noto Naskh Arabic UI.

Amiri (traditional Naskh, full Forms-A + Forms-B coverage).

Scheherazade New (SIL, designed for languages of the Muslim world).

Microsoft Arabic Typesetting / Tahoma / Times New Roman (Windows-bundled, full Forms-B; some have Forms-A coverage).

 

When the registered font lacks a glyph for a derived presentation form, HotPDF falls back to the base Unicode codepoint and emits it unchanged; consumer readers may then attempt their own shaping (Acrobat, Foxit) or render .notdef.
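
The fallback rule amounts to a coverage check before emission. The sketch below models the font's cmap as a plain array of covered codepoints; FontCovers and ChooseEmitted are hypothetical names, and how HotPDF actually queries the font is not shown here.

```pascal
program GlyphFallbackSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
uses
  SysUtils;

// True when CP appears in the (hypothetical) list of codepoints the font covers.
function FontCovers(const Coverage: array of Word; CP: Word): Boolean;
var
  K: Integer;
begin
  Result := False;
  for K := 0 to High(Coverage) do
    if Coverage[K] = CP then
    begin
      Result := True;
      Exit;
    end;
end;

// Emits the shaped presentation form when the font has it, else the base letter.
function ChooseEmitted(const Coverage: array of Word; BaseCP, ShapedCP: Word): Word;
begin
  if FontCovers(Coverage, ShapedCP) then
    Result := ShapedCP
  else
    Result := BaseCP;
end;

const
  // Demo font: has BEH isolated (Forms-B) but no PEH isolated (Forms-A U+FB56).
  DemoCoverage: array[0..2] of Word = ($0628, $067E, $FE8F);
begin
  WriteLn(IntToHex(Integer(ChooseEmitted(DemoCoverage, $0628, $FE8F)), 4));  // FE8F
  WriteLn(IntToHex(Integer(ChooseEmitted(DemoCoverage, $067E, $FB56)), 4));  // 067E
end.
```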

 

Typical workflow

 

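// Register a font that carries the Arabic presentation-form glyphs.
PDF.RegisterUnicodeTTF('NotoArab', 'NotoSansArabic-Regular.ttf');
PDF.BeginDoc;
PDF.CurrentPage.SetFont('NotoArab', [], 14);
// Pass the unshaped Unicode string; the shaping pipeline runs during emission.
PDF.CurrentPage.RtLTextOut(100, 700, 0,
  UnicodeString(#$0645#$0631#$062D#$0628#$0627));  // "marhaba" (hello)
PDF.EndDoc;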

 

Automatic Phase 8 pipeline integration (v2.119.59 - v2.119.68)

v2.119.59 introduced the opt-in ShapingFeatures: THPDFShapingFeatures property and the THPDFShapingFeature enum, which elevates the shaping pipeline beyond static post-pass folding into automatic GSUB-driven font-specific substitution. Set PDF.ShapingFeatures := [sfArabicGSUB] to apply font-defined rlig substitutions automatically as text is emitted (v2.119.63 - ApplyArabicGSUBRefinement); add sfStandardLigatures for Latin Standard Ligatures (FB00-FB06 ff / fi / fl / ffi / ffl / ſt / st via v2.119.65 ApplyLatinLigatureRefinement); add sfContextualAlternates for rclt contextual alternates (v2.119.66 ApplyArabicGSUBContextualRefinement); add sfIndicShaping to enable the Devanagari pre-pass reorder (v2.119.67). Default [] preserves byte-identical output for callers who depend on the v2.119.32-58 static pipeline.

 

When sfArabicGSUB is set, the v2.85.0 static shaper is bypassed in favour of the font's own GSUB rules - callers who want the static shaper to keep running for codepoint coverage outside what the font's GSUB declares should leave sfArabicGSUB off and rely on the static post-pass chain.
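
The opt-in behaviour can be pictured as a plain Pascal set test. The sketch below declares a local stand-in for the enum using the flag names from this topic; the real THPDFShapingFeature declaration lives in HotPDF, and ArabicPathFor is a hypothetical helper that only reports which path would run.

```pascal
program ShapingFeatureGateSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

type
  // Local stand-in for THPDFShapingFeature / THPDFShapingFeatures (assumed shape).
  TFeature = (sfArabicGSUB, sfStandardLigatures, sfContextualAlternates, sfIndicShaping);
  TFeatures = set of TFeature;

// Reports which Arabic shaping path the given feature set selects.
function ArabicPathFor(const Features: TFeatures): string;
begin
  if sfArabicGSUB in Features then
    Result := 'font GSUB'
  else
    Result := 'static shaper';
end;

begin
  WriteLn(ArabicPathFor([]));                                   // static shaper
  WriteLn(ArabicPathFor([sfArabicGSUB, sfStandardLigatures]));  // font GSUB
end.
```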

 

ToUnicode CMap ligature reverse mapping (v2.119.61, v2.119.62, v2.119.65)

The Adobe-Identity-UCS ToUnicode CMap emitted by RegisterUnicodeTTF ships bfchar reverse-mapping entries for every ligature codepoint the post-pass chain can produce. v2.119.61 added 27 entries (8 LAM-ALEF + 18 YEH-HAMZA family + 1 Allah); v2.119.62 added Bismillah (U+FDFD); v2.119.65 added 7 Latin Standard Ligature entries (FB00-FB06). Consumer-reader copy / paste resolves any ligature glyph back to the source codepoint sequence, so the rendered PDF stays accessibility-friendly.
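
For illustration, a bfchar line pairs a source code with the UTF-16BE hex of the codepoint sequence it stands for. The sketch below formats such lines; BfCharEntry and Hex4 are hypothetical names, and it assumes the character code written to the PDF equals the ligature codepoint, which may not match how HotPDF keys its CMap.

```pascal
program BfCharSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
uses
  SysUtils;

// Four-digit uppercase hex for a BMP codepoint.
function Hex4(CP: Word): string;
begin
  Result := IntToHex(Integer(CP), 4);
end;

// Formats one bfchar line: <code> <source codepoints as UTF-16BE hex>.
function BfCharEntry(Code: Word; const Source: array of Word): string;
var
  K: Integer;
begin
  Result := '<' + Hex4(Code) + '> <';
  for K := 0 to High(Source) do
    Result := Result + Hex4(Source[K]);
  Result := Result + '>';
end;

begin
  // LAM-ALEF isolated ligature maps back to LAM + ALEF.
  WriteLn(BfCharEntry($FEFB, [$0644, $0627]));                // <FEFB> <06440627>
  // Allah ligature maps back to ALEF + LAM + LAM + HEH.
  WriteLn(BfCharEntry($FDF2, [$0627, $0644, $0644, $0647]));  // <FDF2> <0627064406440647>
end.
```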

 

PUA synthetic codepoint emit (v2.119.68)

AssignSyntheticCodepointForGID(GID; out CP): Boolean and GetSyntheticCodepointForGID(GID): Word let producer code emit GSUB substitute GIDs that have no Unicode codepoint reachable through the font's cmap. The allocator hands out codepoints in the Private Use Area (U+E000 - U+F8FF, 6400 slots) and mirrors each assignment into three places: FUnicodeCpToGid (so /CIDToGIDMap resolves the synthetic CP back to the target GID at the consumer reader), FAcroFormUnicodeAdvances (so v2.65 word-wrap finds the correct em-fraction), and a per-GID reverse-lookup table (so repeat assignment requests are idempotent). Use this for Devanagari cluster shapes, stylistic alternates, and CJK ideographic variation sequences that GSUB introduces but cmap does not reach.
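
The idempotent allocation can be sketched with a per-GID reverse-lookup array. This is an assumption-laden illustration, not HotPDF's code: AssignSynthetic and the two globals are hypothetical, and the mirroring into FUnicodeCpToGid and FAcroFormUnicodeAdvances is omitted.

```pascal
program PuaAllocatorSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
uses
  SysUtils;

const
  PUA_FIRST = $E000;
  PUA_LAST  = $F8FF;  // 6400 slots in total

var
  NextFree: Cardinal = PUA_FIRST;  // next unassigned PUA codepoint
  GidToCp: array[Word] of Word;    // zero-initialised; 0 = no synthetic codepoint yet

// Returns the synthetic codepoint for GID, allocating one on first request.
// Yields False only when the PUA range is exhausted.
function AssignSynthetic(GID: Word; out CP: Word): Boolean;
begin
  CP := GidToCp[GID];
  if CP <> 0 then
  begin
    Result := True;  // repeat request: hand back the earlier assignment
    Exit;
  end;
  Result := NextFree <= PUA_LAST;
  if not Result then Exit;
  CP := Word(NextFree);
  GidToCp[GID] := CP;
  Inc(NextFree);
end;

var
  CP: Word;
begin
  if AssignSynthetic(1200, CP) then WriteLn(IntToHex(Integer(CP), 4));  // E000
  if AssignSynthetic(1201, CP) then WriteLn(IntToHex(Integer(CP), 4));  // E001
  if AssignSynthetic(1200, CP) then WriteLn(IntToHex(Integer(CP), 4));  // E000 again
end.
```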

 

Relationship to the OpenType GSUB engine

The static post-pass shaper described above (v2.85.0 + v2.119.32 / 58 / 60 / 62) is independent of the OpenType GSUB engine by default. When the automatic Phase 8 pipeline is enabled through ShapingFeatures, the GSUB engine becomes part of the producer-side emission path: each BuildUnicode*FieldContent helper consults the cmap to build a GID array, runs the appropriate GSUB feature query (rlig / liga / clig / rclt), maps the substitute GID back through the reverse cmap to a Presentation Form codepoint, and calls MarkUnicodeGlyphUsed to keep the substitute glyph inside the embedded font subset.

 

For substitute glyphs without a Unicode codepoint, callers combine ShapingFeatures with the v2.119.68 PUA synthetic-codepoint allocator so the substitute GID is still reachable through the standard hex pipeline. The static post-pass chain remains the default fallback for codepoints that the font's GSUB declarations do not cover.

 

Scope and limitations

No BiDi (Bidirectional) algorithm: callers must order the input string in visual order if mixing LTR + RTL runs, or use a separate BiDi library.

No GPOS mark positioning: combining diacritics ride at their default offsets; fonts that need GPOS-driven adjustments may show slightly imprecise mark placement.

No Indic / Khmer / Tibetan shaping by default: those scripts need cluster-aware reordering that the built-in static pipeline does not perform. The opt-in sfIndicShaping flag enables only the Devanagari pre-pass reorder.

No Hebrew shaping: Hebrew does not have positional variants in the same way Arabic does, so it does not need this pipeline; pass Unicode Hebrew to UnicodeTextOut directly.

 

See also: OpenType GSUB Substitution Engine, Automatic Shaping Pipeline (Phase 8), Syriac / Mongolian / Devanagari Shaping, THotPDF.AssignSyntheticCodepointForGID, THPDFPage.RtLTextOut, THPDFPage.UnicodeTextOut, CFF / OpenType Font Subsetting