Arabic / Persian / Urdu Shaping Support
Producer-side shaping pipeline (v2.85.0 - v2.119.68)
HotPDF runs a producer-side shaping pipeline that folds Unicode-input Arabic, Persian, and Urdu runs into their Arabic Presentation Forms during PDF text emission, so consumer readers receive ready-to-render positional / ligature glyphs without needing a HarfBuzz-class shaper of their own.
What the pipeline does
Positional shaping (v2.85.0): every Arabic base letter has up to four positional forms - isolated (
LAM-ALEF mandatory ligature (v2.119.32): the Arabic Unicode Shaping spec mandates that any LAM (U+0644) immediately followed by an ALEF (U+0627 plain, U+0622 with madda above, U+0623 with hamza above, U+0625 with hamza below) folds into a single ligature glyph (U+FEFB / U+FEFC isolated / final forms for plain ALEF, with the matching madda / hamza variants at U+FEF5 - U+FEFA). This is one of the very few non-optional ligatures in Arabic typography and is required for correctness; rendering the pair as separate glyphs produces text that native readers immediately recognize as malformed. HotPDF performs the fold during emission so callers do not need a HarfBuzz-class shaper in their own code path.
Persian / Urdu core 9 letters (v2.119.35): Persian and Urdu extend Arabic with letters that have no slot in the Arabic Presentation Forms-B block (U+FE70 - U+FEFC). The 9 most-used such letters - PEH (U+067E), TCHEH (U+0686), JEH (U+0698), KEHEH (U+06A9), GAF (U+06AF), NOON GHUNNA (U+06BA), HEH DOACHASHMEE (U+06BE), YEH BARREE (U+06D2), and HEH GOAL (U+06C1) - have presentation forms in Arabic Presentation Forms-A (U+FB50 - U+FDFF). HotPDF now maps these letters to the appropriate Forms-A glyph during positional shaping, so Persian and Urdu text renders with the correct init / medi / fina / isol forms in any consumer reader.
Arabic Extended-A + Supplement (v2.119.52, v2.119.56): the joining-class table now covers the remaining Arabic Extended-A characters (ALEF WASLA / NOON GHUNNA / HEH variants), Arabic Supplement U+0750-U+077F, and the higher Arabic Extended-A U+08A0-U+08FF range. Characters with a static Presentation Forms-A encoding are mapped through the existing 4-position shaper; characters without (most of Extended-A) get joining-class classification only so neighbors shape correctly even when the character itself is passed through unchanged. v2.119.56 corrected two wrong Forms-A mappings introduced by v2.119.52 (U+06C2 / U+06C3).
Persian / Urdu Form-B full coverage (v2.119.57): the joining-class table was extended to span the full U+0672-U+06D5 range (about 80 characters covering REH / DAL / SEEN / SAD / TAH / AIN / FEH / QAF / KAF / GAF / LAM / NOON / HEH / WAW / YEH variants), and 26 new Presentation Forms-A mappings were added (15 D-class 4-form + 11 R-class 2-form). Notably, Urdu's standard 'h' HEH DOACHASHMEE (U+06BE → FBAA-FBAD) and Urdu word-final yeh YEH BARREE (U+06D2 → FBAE-FBAF) are now correctly mapped to the Forms-A slots that were misused by the v2.119.52 ALEF WASLA / NOON GHUNNA temporary fix. Static Forms-A coverage now spans 40+ source characters.
YEH-HAMZA + vowel ligature post-pass (v2.119.58): a post-pass added to
Allah ligature (v2.119.60): the four-character sequence ALEF + LAM + LAM + HEH folds into U+FDF2 ARABIC LIGATURE ALLAH ISOLATED FORM. Implemented as a codepoint-level static post-pass after v2.119.32 LAM-ALEF and v2.119.58 YEH-HAMZA in the
Bismillah phrase ligature (v2.119.62): the standard 22-codepoint Bismillah phrase "بسم الله الرحمن الرحيم" folds into a single glyph U+FDFD ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM. Runs as the very first pre-pass in
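The LAM-ALEF fold listed above can be sketched in a few lines of Object Pascal. This is a minimal illustration under stated assumptions, not HotPDF's implementation: FoldLamAlef, LamAlefIsolated, and JoinsToNext are hypothetical names, the joining test only covers the basic Arabic block (U+0621 - U+064A), combining marks between letters are not handled, and the other letters are left unshaped.

```pascal
program LamAlefFoldSketch;

{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

type
  TCodepoints = array of Cardinal;

const
  LAM = $0644;

// Isolated-form LAM-ALEF ligature for an ALEF variant, or 0 if Alef is
// not one of the four ALEF variants that take part in the ligature.
function LamAlefIsolated(Alef: Cardinal): Cardinal;
begin
  case Alef of
    $0622: Result := $FEF5; // LAM + ALEF WITH MADDA ABOVE
    $0623: Result := $FEF7; // LAM + ALEF WITH HAMZA ABOVE
    $0625: Result := $FEF9; // LAM + ALEF WITH HAMZA BELOW
    $0627: Result := $FEFB; // LAM + plain ALEF
  else
    Result := 0;
  end;
end;

// Assumption: simplified joining test for U+0621 - U+064A only.
// True when the letter connects to the letter that follows it.
function JoinsToNext(Cp: Cardinal): Boolean;
begin
  case Cp of
    $0626, $0628, $062A..$062E, $0633..$063F,
    $0640..$0647, $0649, $064A:
      Result := True;
  else
    Result := False;
  end;
end;

// Replaces every LAM + ALEF pair with one ligature codepoint.
function FoldLamAlef(const Src: TCodepoints): TCodepoints;
var
  I, N: Integer;
  Lig: Cardinal;
  PrevJoins: Boolean;
begin
  SetLength(Result, Length(Src));
  N := 0;
  I := 0;
  PrevJoins := False;
  while I < Length(Src) do
  begin
    Lig := 0;
    if (Src[I] = LAM) and (I + 1 < Length(Src)) then
      Lig := LamAlefIsolated(Src[I + 1]);
    if Lig <> 0 then
    begin
      if PrevJoins then
        Inc(Lig);           // final form is the codepoint after the isolated one
      Result[N] := Lig;
      PrevJoins := False;   // ALEF never connects to the following letter
      Inc(I, 2);
    end
    else
    begin
      Result[N] := Src[I];  // passed through unshaped in this sketch
      PrevJoins := JoinsToNext(Src[I]);
      Inc(I);
    end;
    Inc(N);
  end;
  SetLength(Result, N);
end;

var
  Salaam, Shaped: TCodepoints;
  I: Integer;
begin
  // SEEN + LAM + ALEF + MEEM ("salaam")
  SetLength(Salaam, 4);
  Salaam[0] := $0633;
  Salaam[1] := $0644;
  Salaam[2] := $0627;
  Salaam[3] := $0645;
  Shaped := FoldLamAlef(Salaam);
  for I := 0 to High(Shaped) do
    WriteLn(IntToHex(Int64(Shaped[I]), 4)); // prints 0633, FEFC, 0645
end.
```

Because SEEN connects to the following letter, the pair folds to the final form U+FEFC rather than the isolated U+FEFB. In a full shaper the positional pass would also replace SEEN and MEEM with their own presentation forms.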
How callers invoke it
The shaping pipeline runs automatically inside the Unicode text emission methods - no explicit "shape this" call is required. Pass Unicode-input Arabic / Persian / Urdu strings to THPDFPage.UnicodeTextOut or THPDFPage.RtLTextOut and HotPDF maps each input codepoint to its presentation form before writing the PDF text-showing operator. The original Unicode bytes are also captured into
Font requirements
The font registered via
Noto Sans Arabic (Latin + Arabic; Forms-A + Forms-B + Arabic Extended-A).
Noto Naskh Arabic, Noto Naskh Arabic UI.
Amiri (traditional Naskh, full Forms-A + Forms-B coverage).
Scheherazade New (SIL, designed for languages of the Muslim world).
Microsoft Arabic Typesetting / Tahoma / Times New Roman (Windows-bundled, full Forms-B; some have Forms-A coverage).
When the registered font lacks a glyph for a derived presentation form, HotPDF falls back to the base Unicode codepoint and emits it unchanged; consumer readers may then attempt their own shaping (Acrobat, Foxit) or render
Typical workflow
PDF.RegisterUnicodeTTF('NotoArab', 'NotoSansArabic-Regular.ttf');
PDF.BeginDoc;
PDF.CurrentPage.SetFont('NotoArab', [], 14);
// "marhaba" (hello)
PDF.CurrentPage.RtLTextOut(100, 700, 0, UnicodeString(#$0645#$0631#$062D#$0628#$0627));
PDF.EndDoc;
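To make the positional step concrete, the following stand-alone sketch shapes the same five "marhaba" codepoints by hand, using joining classes and the Forms-B / Forms-A layout (isolated, final, initial, medial in consecutive slots). It is an illustration under stated assumptions, not HotPDF's code: DemoTable, FindEntry, ClassOf, and ShapeAt are hypothetical names, the table holds only seven letters, and combining marks, tatweel, and ligature passes are ignored.

```pascal
program PositionalShapingSketch;

{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

type
  // jcRight = joins only to the preceding letter (2 forms),
  // jcDual  = joins on both sides (4 forms).
  TJoinClass = (jcNone, jcRight, jcDual);

  TShapeEntry = record
    Source: Cardinal;        // input codepoint
    JoinClass: TJoinClass;
    IsolatedForm: Cardinal;  // first slot of the presentation-form run
  end;

const
  // Hypothetical demo table; a real table covers far more letters.
  DemoTable: array[0..6] of TShapeEntry = (
    (Source: $0627; JoinClass: jcRight; IsolatedForm: $FE8D), // ALEF
    (Source: $0628; JoinClass: jcDual;  IsolatedForm: $FE8F), // BEH
    (Source: $062D; JoinClass: jcDual;  IsolatedForm: $FEA1), // HAH
    (Source: $0631; JoinClass: jcRight; IsolatedForm: $FEAD), // REH
    (Source: $0645; JoinClass: jcDual;  IsolatedForm: $FEE1), // MEEM
    (Source: $067E; JoinClass: jcDual;  IsolatedForm: $FB56), // PEH (Forms-A)
    (Source: $0698; JoinClass: jcRight; IsolatedForm: $FB8A)  // JEH (Forms-A)
  );

function FindEntry(Cp: Cardinal): Integer;
var
  I: Integer;
begin
  Result := -1;
  for I := Low(DemoTable) to High(DemoTable) do
    if DemoTable[I].Source = Cp then
    begin
      Result := I;
      Exit;
    end;
end;

function ClassOf(Cp: Cardinal): TJoinClass;
var
  Idx: Integer;
begin
  Idx := FindEntry(Cp);
  if Idx < 0 then
    Result := jcNone
  else
    Result := DemoTable[Idx].JoinClass;
end;

// Returns the presentation form for Src[Index], or the codepoint
// itself when the letter is not in the demo table.
function ShapeAt(const Src: array of Cardinal; Index: Integer): Cardinal;
var
  Idx: Integer;
  JoinPrev, JoinNext: Boolean;
begin
  Idx := FindEntry(Src[Index]);
  if Idx < 0 then
  begin
    Result := Src[Index];
    Exit;
  end;
  // Connects backwards only if the previous letter is dual-joining.
  JoinPrev := (Index > 0) and (ClassOf(Src[Index - 1]) = jcDual);
  // Connects forwards only if this letter is dual-joining and the
  // next letter accepts a connection (right- or dual-joining).
  JoinNext := (DemoTable[Idx].JoinClass = jcDual) and (Index < High(Src))
    and (ClassOf(Src[Index + 1]) <> jcNone);
  Result := DemoTable[Idx].IsolatedForm;
  if JoinPrev and JoinNext then
    Inc(Result, 3)  // medial
  else if JoinNext then
    Inc(Result, 2)  // initial
  else if JoinPrev then
    Inc(Result, 1); // final
end;

const
  // MEEM, REH, HAH, BEH, ALEF
  Marhaba: array[0..4] of Cardinal = ($0645, $0631, $062D, $0628, $0627);
var
  I: Integer;
begin
  for I := Low(Marhaba) to High(Marhaba) do
    WriteLn(IntToHex(Int64(ShapeAt(Marhaba, I)), 4));
  // prints FEE3, FEAE, FEA3, FE92, FE8E
end.
```

REH is right-joining, so it takes its final form and breaks the connection: HAH after it starts a new joined group in initial form. A right-joining letter never reaches the initial or medial branches, which is why it needs only two slots.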
Automatic Phase 8 pipeline integration (v2.119.59 - v2.119.68)
v2.119.59 introduced the opt-in
When
ToUnicode CMap ligature reverse mapping (v2.119.61, v2.119.62, v2.119.65)
The Adobe-Identity-UCS ToUnicode CMap emitted by
PUA synthetic codepoint emit (v2.119.68)
Relationship to the OpenType GSUB engine
The static post-pass shaper described above (v2.85.0 + v2.119.32 / 58 / 60 / 62) is independent of the OpenType GSUB engine by default. When the automatic Phase 8 pipeline is enabled through
For substitute glyphs without a Unicode codepoint, callers combine
Scope and limitations
No BiDi (Bidirectional) algorithm: callers must order the input string in visual order if mixing LTR + RTL runs, or use a separate BiDi library.
No GPOS mark positioning: combining diacritics ride at their default offsets; fonts that need GPOS-driven adjustments may show slightly imprecise mark placement.
No Indic / Khmer / Tibetan shaping: those scripts need cluster-aware reordering that the built-in pipeline does not perform.
No Hebrew shaping: Hebrew does not have positional variants in the same way Arabic does, so it does not need this pipeline; pass Unicode Hebrew to
See also: OpenType GSUB Substitution Engine, Automatic Shaping Pipeline (Phase 8), Syriac / Mongolian / Devanagari Shaping, THotPDF.AssignSyntheticCodepointForGID, THPDFPage.RtLTextOut, THPDFPage.UnicodeTextOut, CFF / OpenType Font Subsetting