ApplySinhalaReorder

Signature

function ApplySinhalaReorder(const Wide: UnicodeString): UnicodeString;

Purpose

Applies the Sinhala reorder pre-pass to Wide and returns the reordered UnicodeString ready for cmap + GSUB consumption. Non-Sinhala content passes through byte-identical.

Sinhala specifics

  • R1 Repha enabled — Ra (U+0DBB) + Halant (U+0DCA AL-LAKUNA) at syllable start is detected, the pair is stripped from the cluster and re-emitted after the reordered output so the font's 'rphf' GSUB feature can substitute the Repha glyph.
  • Three pre-base matras (MatraPos = 1): E (U+0DD9), EE (U+0DDA), AI (U+0DDB). Sinhala is unique among the Phase 8f Brahmic scripts in having three logically pre-base matras. In IndicPositionalCategory.txt, E and AI are Left and EE is Top_And_Left; all three are classified as pre-base here, and the visual top component of EE is rendered by font GSUB.
  • Three split matras with Unicode 16.0 canonical decompositions:
    • U+0DDC O → U+0DD9 (pre) + U+0DCF (post).
    • U+0DDD OO → U+0DD9 (pre) + U+0DCF (post) + U+0DCA (post) — three-part split; the trailing AL-LAKUNA is part of the canonical decomposition, not a syllable-level virama.
    • U+0DDE AU → U+0DD9 (pre) + U+0DDF (post).
  • Above-base matras (MatraPos = 3): I (U+0DD2), II (U+0DD3).
  • Below-base matras (MatraPos = 4): U (U+0DD4), UU (U+0DD6).
  • Post-base matras (MatraPos = 2): AA (U+0DCF), AE / AAE (U+0DD0, U+0DD1), Vocalic R matra (U+0DD8), L matra (U+0DDF), LL / LLL matras (U+0DF2, U+0DF3).
  • Halant in Sinhala is called AL-LAKUNA (U+0DCA); RA (Repha trigger) is U+0DBB.
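
The classification and split-matra tables above can be sketched as two small lookup functions. This is an illustrative sketch only, not the library's implementation: SketchMatraPos and SketchDecompose are hypothetical names, and the return value 0 for "not a listed matra" is an assumption made for the example.

```pascal
program SinhalaMatraSketch;
{$IFDEF FPC}{$mode delphi}{$ENDIF}

// Hypothetical helper: returns the MatraPos value listed above
// (1 = pre-base, 2 = post-base, 3 = above-base, 4 = below-base).
// Returns 0 (an assumption of this sketch) for anything else,
// including the split matras O / OO / AU.
function SketchMatraPos(Ch: WideChar): Integer;
begin
  case Ord(Ch) of
    $0DD9..$0DDB: Result := 1;            // E, EE, AI
    $0DCF..$0DD1, $0DD8, $0DDF,
    $0DF2, $0DF3: Result := 2;            // post-base matras
    $0DD2, $0DD3: Result := 3;            // I, II
    $0DD4, $0DD6: Result := 4;            // U, UU
  else
    Result := 0;
  end;
end;

// Hypothetical helper: expands a split matra into the parts listed above.
// Any other character is returned unchanged.
function SketchDecompose(Ch: WideChar): UnicodeString;
begin
  case Ord(Ch) of
    $0DDC: Result := #$0DD9#$0DCF;        // O  -> E + AA
    $0DDD: Result := #$0DD9#$0DCF#$0DCA;  // OO -> E + AA + AL-LAKUNA
    $0DDE: Result := #$0DD9#$0DDF;        // AU -> E + U+0DDF
  else
    Result := Ch;
  end;
end;

begin
  WriteLn(SketchMatraPos(#$0DDA));          // prints 1
  WriteLn(Length(SketchDecompose(#$0DDD))); // prints 3
end.
```

Note that the AL-LAKUNA produced by decomposing OO has to be tracked as a post-base part rather than re-classified as a syllable-level halant, which is why the sketch returns the parts as a unit instead of feeding them back through a classifier.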

Reorder rules applied

  • R1 Repha: Ra + AL-LAKUNA at syllable start re-emitted at the end of the syllable.
  • R2 Pre-base matras: E / EE / AI emit before the base block.
  • R3 Above-base matras: I / II emit after the base block.
  • R4 Below-base matras: U / UU emit after the above-base block.
  • R5 Post-base matras: AA / AE / AAE / Vocalic R / L / LL / LLL matras emit after the below-base block.
  • Split matra decomposition: O / OO / AU expanded per Unicode 16.0 canonical decomposition; OO is the first three-part split this shaper family produces from a non-Kannada source codepoint.

Output layout per syllable: [pre-matras] + [base + halant + bindu/visarga] + [above-matras] + [below-matras] + [post-matras] + [Repha: Ra AL-LAKUNA]?. Conjuncts (C + AL-LAKUNA + C) are preserved in the base block. The reorder runs in a single pass and is idempotent on 2-part splits.
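
As a rough illustration of the output layout above, the following sketch reorders one syllable by sorting its characters into buckets and concatenating them in layout order. It is not the library's code: SketchReorderSyllable is a hypothetical name, the input is assumed to be a single pre-segmented syllable, and the Repha test (leading Ra + AL-LAKUNA followed by at least one more character) is a simplification chosen for the example.

```pascal
program SinhalaSyllableSketch;
{$IFDEF FPC}{$mode delphi}{$ENDIF}

const
  RA        = #$0DBB;
  AL_LAKUNA = #$0DCA;

// Hypothetical sketch: reorders ONE syllable into
// [pre] + [base block] + [above] + [below] + [post] + [Repha]?.
function SketchReorderSyllable(const S: UnicodeString): UnicodeString;
var
  Pre, Base, Above, Below, Post, Repha: UnicodeString;
  I: Integer;
  Ch: WideChar;
begin
  Pre := ''; Base := ''; Above := ''; Below := ''; Post := ''; Repha := '';
  I := 1;
  // R1: hold back a leading Ra + AL-LAKUNA so it can be emitted last.
  if (Length(S) > 2) and (S[1] = RA) and (S[2] = AL_LAKUNA) then
  begin
    Repha := Copy(S, 1, 2);
    I := 3;
  end;
  while I <= Length(S) do
  begin
    Ch := S[I];
    case Ord(Ch) of
      $0DD9..$0DDB: Pre := Pre + Ch;                      // R2: E, EE, AI
      $0DD2, $0DD3: Above := Above + Ch;                  // R3: I, II
      $0DD4, $0DD6: Below := Below + Ch;                  // R4: U, UU
      $0DCF..$0DD1, $0DD8, $0DDF,
      $0DF2, $0DF3: Post := Post + Ch;                    // R5
      // Split matras: pre part goes to Pre, remaining parts to Post.
      $0DDC: begin Pre := Pre + #$0DD9; Post := Post + #$0DCF; end;
      $0DDD: begin Pre := Pre + #$0DD9; Post := Post + #$0DCF#$0DCA; end;
      $0DDE: begin Pre := Pre + #$0DD9; Post := Post + #$0DDF; end;
    else
      // Consonants, conjunct AL-LAKUNA, bindu/visarga stay in the base block.
      Base := Base + Ch;
    end;
    Inc(I);
  end;
  Result := Pre + Base + Above + Below + Post + Repha;
end;

var
  Res: UnicodeString;
begin
  // KA + O-matra: expected E + KA + AA.
  Res := SketchReorderSyllable(#$0D9A#$0DDC);
  WriteLn(Res = #$0DD9#$0D9A#$0DCF);          // prints TRUE
  // Ra + AL-LAKUNA + KA + I-matra: Repha pair moves to the end.
  Res := SketchReorderSyllable(#$0DBB#$0DCA#$0D9A#$0DD2);
  WriteLn(Res = #$0D9A#$0DD2#$0DBB#$0DCA);    // prints TRUE
end.
```

The bucket approach keeps the relative order of characters within each class, so a conjunct such as C + AL-LAKUNA + C passes through the base block untouched. Syllable segmentation and non-Sinhala pass-through are outside the scope of this sketch.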

Example

var
  Wide: UnicodeString;
begin
  // Input: KA (U+0D9A) + O-matra (U+0DDC, 2-part split: pre + post)
  Wide := Doc.ApplySinhalaReorder(#$0D9A#$0DDC);
  // Wide is now: E (U+0DD9, pre) + KA + AA (U+0DCF, post)
end;

See also

Standards

  • Unicode 16.0 §13.2 (Sinhala)
  • Unicode 16.0 IndicSyllabicCategory.txt, IndicPositionalCategory.txt, and UnicodeData.txt (canonical decomposition source)
  • ISO 32000-1 §9.10 (extraction of text content)
  • OpenType Sinhala shaping spec (script tag 'sinh')

Version history

  • v2.119.77 — Introduced in Phase 8f.8. Complete shaper (R1 + R2 + R3 + R4 + R5 + three split-matra decompositions). Sinhala becomes the eighth registered Indic script and completes the Brahmic SIA (South Indic Aryan) family. Notable for having three pre-base matras (more than any other Phase 8f shaper) and the three-part canonical decomposition of U+0DDD OO.