ApplyIndicReorder

Signature

function ApplyIndicReorder(const Wide: UnicodeString): UnicodeString;

Purpose

Total entry point for the Indic-script reorder pre-pass. Walks Wide left to right, dispatches each codepoint to the matching registered TIndicScriptInfo entry by Unicode-block range, and applies that script's syllable and reorder callbacks. Non-Indic codepoints pass through byte-identical. Script-boundary transitions inside Wide segment the text automatically.
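
The block-range dispatch and automatic segmentation can be pictured with a small sketch. This is an illustration under stated assumptions, not the library's implementation: TScriptRange, ScriptTagOf and DescribeRuns are hypothetical names, the real TIndicScriptInfo record and its callbacks are not shown in this reference, and only three of the registered ranges are included.

```pascal
program BlockDispatchSketch;
{$IFDEF FPC}{$MODE DELPHI}{$H+}{$ENDIF}

type
  // Hypothetical stand-in for a registered script entry (assumption:
  // the real TIndicScriptInfo also carries syllable/reorder callbacks).
  TScriptRange = record
    Tag: string;
    RangeStart, RangeEnd: Word;
  end;

const
  // Three of the block ranges listed under "Registered scripts".
  Ranges: array[0..2] of TScriptRange = (
    (Tag: 'deva'; RangeStart: $0900; RangeEnd: $097F),
    (Tag: 'beng'; RangeStart: $0980; RangeEnd: $09FF),
    (Tag: 'khmr'; RangeStart: $1780; RangeEnd: $17FF));

// Returns the tag of the block containing C, or '' for non-Indic input.
function ScriptTagOf(C: WideChar): string;
var
  I: Integer;
begin
  Result := '';
  for I := Low(Ranges) to High(Ranges) do
    if (Ord(C) >= Ranges[I].RangeStart) and (Ord(C) <= Ranges[I].RangeEnd) then
    begin
      Result := Ranges[I].Tag;
      Exit;
    end;
end;

// Lists the tag of each maximal same-script run, separated by '|'.
// Pass-through (non-Indic) runs are shown as '-'.
function DescribeRuns(const Wide: UnicodeString): string;
var
  I: Integer;
  Tag, Prev: string;
begin
  Result := '';
  Prev := '';
  for I := 1 to Length(Wide) do
  begin
    Tag := ScriptTagOf(Wide[I]);
    if Tag = '' then
      Tag := '-';
    if Tag <> Prev then
    begin
      // A change of tag is a script boundary: start a new segment.
      if Result <> '' then
        Result := Result + '|';
      Result := Result + Tag;
      Prev := Tag;
    end;
  end;
end;

var
  Sample: UnicodeString;
begin
  // 'Hi ' + KA + I-matra (Devanagari) + Bengali KA + '!'
  Sample := UnicodeString('Hi ') + WideChar($0915) + WideChar($093F)
    + WideChar($0995) + UnicodeString('!');
  WriteLn(DescribeRuns(Sample));
end.
```

The program prints -|deva|beng|-, showing one pass-through run, one Devanagari run, one Bengali run, and a final pass-through run. All ranges used here lie in the Basic Multilingual Plane, so scanning UTF-16 code units one at a time is sufficient for this sketch.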

Registered scripts (v2.120.10 Phase 8f.10 — batch complete)

  • Devanagari ('deva', U+0900–U+097F) — complete shaper (R1-R5)
  • Bengali ('beng', U+0980–U+09FF) — complete shaper (R1, R2, R4, R5 + split-matra decomposition)
  • Gujarati ('gujr', U+0A80–U+0AFF) — complete shaper (R1-R5; no split matras)
  • Tamil ('taml', U+0B80–U+0BFF) — complete shaper (R2-R5 + 3 split-matra decompositions; NO Repha — Tamil-specific)
  • Telugu ('telu', U+0C00–U+0C7F) — complete shaper (R1+R3+R4 + 1 split-matra decomposition; no pre-base matras)
  • Kannada ('knda', U+0C80–U+0CFF) — complete shaper (R1+R3+R4+R5 + 5 split-matra decompositions including 1 three-part split for U+0CCB OO; no pre-base matras)
  • Malayalam ('mlym', U+0D00–U+0D7F) — complete shaper (R1+R2+R4+R5 + 3 split-matra decompositions). I-matra (U+0D3F) is post-base (Tamil-like, unique vs Devanagari/Bengali/Gujarati). Chillu letters (U+0D54–U+0D56, U+0D7A–U+0D7F) and DOT REPH (U+0D4E) classified as consonants.
  • Sinhala ('sinh', U+0D80–U+0DFF) — complete shaper (R1+R2+R3+R4+R5 + 3 split-matra decompositions). Three pre-base matras (E=U+0DD9, EE=U+0DDA, AI=U+0DDB) — the highest pre-base-matra count among Phase 8f scripts. U+0DDD OO is a three-part split (pre + post + post). Completes the Brahmic SIA (South Indic Aryan) family.
  • Khmer ('khmr', U+1780–U+17FF) — first South-East Asian script; independent syllable FSM (not Brahmic R1-R5). NO Repha. COENG (U+17D2) + Consonant pairs form stacked subscripts and stay in-cluster (GSUB handles subscript positioning). Six pre-base vowels (E/AE/AI/OE/OO/AU) move to syllable start; register shifters (U+17C9 MUUSIKATOAN, U+17CA TRIISAP) and other signs route to above-base; Bindu (NIKAHIT U+17C6) → above; Visarga (REAHMUK / YUUKALEAPINTU U+17C7/U+17C8) → post.
  • Myanmar ('mymr', U+1000–U+109F + U+AA60–U+AA7F Extended-A) — most complex syllable structure in the batch. NO Repha. Kinzi 3-CP prefix (U+1004 + U+103A + U+1039) detected at syllable start, held aside, and emitted at output start per R8. Pre-base vowel E (U+1031) moves to syllable start per R10. Four medial consonants (U+103B YA, U+103C RA, U+103D WA, U+103E HA) emitted in fixed Y → R → W → H order regardless of source order per R9. ASAT (U+103A) and VIRAMA (U+1039) both treated as virama. Reorder algorithm uses 8 buffer slots: Kinzi + PreVowel + Base + (MedialY/R/W/H) + Above + Below + Post. Two IndicScripts entries (main block + Extended-A) share the same Myanmar reorder functions.
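
The fixed Y → R → W → H medial order described for Myanmar can be illustrated with a short sketch. SortMedials is a hypothetical helper, not a library function; it assumes the medial consonants of one syllable have already been collected into a string, and it covers only the medial slot, not Kinzi, pre-base E, or the other buffer slots.

```pascal
program MedialOrderSketch;
{$IFDEF FPC}{$MODE DELPHI}{$H+}{$ENDIF}

// Emits the medials found in Medials in fixed YA, RA, WA, HA order,
// regardless of the order in which they appear in the source.
// Code units that are not one of the four medials are ignored.
function SortMedials(const Medials: UnicodeString): UnicodeString;
const
  Order: array[0..3] of Word = ($103B, $103C, $103D, $103E);
var
  I, J: Integer;
begin
  Result := '';
  for I := Low(Order) to High(Order) do
    for J := 1 to Length(Medials) do
      if Ord(Medials[J]) = Order[I] then
        Result := Result + Medials[J];
end;

var
  Input, Expected: UnicodeString;
begin
  // Source order: HA, WA, YA.
  Input := '';
  Input := Input + WideChar($103E) + WideChar($103D) + WideChar($103B);
  // Canonical order: YA, WA, HA.
  Expected := '';
  Expected := Expected + WideChar($103B) + WideChar($103D) + WideChar($103E);
  WriteLn(SortMedials(Input) = Expected);
end.
```

The program prints TRUE. Scanning once per medial keeps the sketch independent of source order without a general-purpose sort, which is adequate for a set of four fixed values.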

11-Phase non-Devanagari Indic shaping batch complete after this phase: Phases 8f.0 (infrastructure) and 8f.1–8f.10 (10 registered scripts: Brahmic SIA family + 2 South-East Asian scripts). Future shaping work may add Myanmar Extended-B / Extended-C, Tibetan, Lao, Thai, or other Unicode §12-§16 ranges in a later Phase 8g+.

Producer-side automatic application

When sfIndicShaping is included in FShapingFeatures, ApplyIndicReorder is invoked automatically inside the three BuildUnicode*FieldContent helpers used for AcroForm appearance stream generation. Callers that bypass BuildUnicode* can call ApplyIndicReorder directly before feeding text into the cmap + GSUB pipeline (via SetGSUBScript('deva'), etc.).

Example

var
  Wide: UnicodeString;
begin
  Wide := Doc.ApplyIndicReorder('Hello ' + #$0915#$093F + ' world.');
  // Result: 'Hello ' + I-matra + KA + ' world.'
  // (Latin segments unchanged; Devanagari segment reordered.)
end;
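
The effect shown in the example (KA + I-matra becoming I-matra + KA while the Latin text is untouched) can be reproduced with a deliberately simplified, self-contained sketch. MoveIMatraSketch and IsSimpleDevaConsonant are hypothetical names; the sketch handles only a single consonant immediately followed by U+093F and ignores conjuncts, Repha, and every other rule the Devanagari shaper applies.

```pascal
program PreBaseMatraSketch;
{$IFDEF FPC}{$MODE DELPHI}{$H+}{$ENDIF}

const
  DevaIMatra = $093F;  // DEVANAGARI VOWEL SIGN I (pre-base matra)
  DevaKa     = $0915;  // DEVANAGARI LETTER KA
  DevaHa     = $0939;  // DEVANAGARI LETTER HA

// True for the contiguous consonant run KA..HA (assumption: nukta
// consonants and other additions outside this run are not handled).
function IsSimpleDevaConsonant(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= DevaKa) and (Ord(C) <= DevaHa);
end;

// Swaps each "consonant + I-matra" pair so the matra comes first.
// All other code units are copied through unchanged.
function MoveIMatraSketch(const Wide: UnicodeString): UnicodeString;
var
  I: Integer;
  Tmp: WideChar;
begin
  Result := Wide;
  for I := 2 to Length(Result) do
    if (Ord(Result[I]) = DevaIMatra) and IsSimpleDevaConsonant(Result[I - 1]) then
    begin
      Tmp := Result[I - 1];
      Result[I - 1] := Result[I];
      Result[I] := Tmp;
    end;
end;

var
  Input, Expected: UnicodeString;
begin
  Input := UnicodeString('Hello ') + WideChar($0915) + WideChar($093F)
    + UnicodeString(' world.');
  Expected := UnicodeString('Hello ') + WideChar($093F) + WideChar($0915)
    + UnicodeString(' world.');
  WriteLn(MoveIMatraSketch(Input) = Expected);
end.
```

The program prints TRUE. It is meant only to make the expected result of the example concrete; real input should go through ApplyIndicReorder.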

See also

Standards

  • Unicode 16.0 chapters 12 (South Asian) and 16 (Southeast Asian)
  • ISO 32000-1 §9.10 (extraction of text content)
  • OpenType per-script shaping specs (Devanagari and siblings)

Version history

  • v2.119.69 — Introduced in Phase 8f.0. Ships with Devanagari registered (R1 + R2 only, inherited from Phase 8e).
  • v2.119.70 — Devanagari upgraded to complete shaper (R1-R5 + conjunct preservation) in Phase 8f.1.
  • v2.119.71 — Bengali registered as second Indic script (Phase 8f.2).
  • v2.119.72 — Gujarati registered as third Indic script (Phase 8f.3).
  • v2.119.73 — Tamil registered as fourth Indic script (Phase 8f.4).
  • v2.119.74 — Telugu registered as fifth Indic script (Phase 8f.5).
  • v2.119.75 — Kannada registered as sixth Indic script (Phase 8f.6). First script to demonstrate a three-part split-matra decomposition.
  • v2.119.76 — Malayalam registered as seventh Indic script (Phase 8f.7). Adds chillu consonants (U+0D54–U+0D56, U+0D7A–U+0D7F) and DOT REPH (U+0D4E) as Malayalam-specific consonant categories; I-matra post-base shared with Tamil.
  • v2.119.77 — Sinhala registered as eighth Indic script (Phase 8f.8). Three pre-base matras (E/EE/AI) — most among Phase 8f scripts. U+0DDD OO is a three-part split (pre + post + post). Completes the Brahmic SIA (South Indic Aryan) family.
  • v2.120.9 — Khmer registered as ninth Indic script (Phase 8f.9). First South-East Asian script; independent syllable FSM with COENG (U+17D2) subscript handling that stays in-cluster (no Repha, no Brahmic R1-R5). Six pre-base vowels rotate to syllable start; register shifters and other signs route to above-base; Bindu and Visarga get dedicated above / post routing. Per Unicode 16.0 §16.4 and OpenType Khmer shaping spec.
  • v2.120.10 — Myanmar registered as tenth and final Indic script (Phase 8f.10). Two IndicScripts entries cover Myanmar core block (U+1000–U+109F) and Extended-A (U+AA60–U+AA7F). Most complex syllable structure in the batch: Kinzi 3-CP prefix detection (R8), pre-base E vowel rotation (R10), fixed Y → R → W → H medial sorting (R9), ASAT / VIRAMA stacked-consonant handling, DOT BELOW / tone mark routing. 8-slot buffer model. Completes the 11-Phase non-Devanagari Indic shaping batch (Phases 8f.0–8f.10): infrastructure + 10 registered Indic scripts covering the Brahmic SIA family plus 2 South-East Asian scripts. Per Unicode 16.0 §16.3 and OpenType Myanmar shaping spec.