ApplyKhmerReorder

Signature

function ApplyKhmerReorder(const Wide: UnicodeString): UnicodeString;

Purpose

Applies the Khmer reorder pre-pass to Wide and returns the reordered UnicodeString, ready for cmap lookup and GSUB processing. Non-Khmer content passes through byte-identical. Khmer is the first South-East Asian script registered in HotPDF; its syllable structure is independent of the Brahmic R1–R5 family handled by Phases 8f.1–8f.8.

Khmer specifics

  • NO Repha — Khmer does not form a Repha visual. Ra + COENG + Consonant stays in original order rather than rotating to the end of the cluster.
  • COENG (U+17D2) is a subscript joiner: each COENG + Consonant pair forms a stacked-consonant cluster. The pair stays in the base buffer in original order and the font's GSUB 'pres' / 'blws' features handle subscript positioning. Nested coeng (C + COENG + C + COENG + C) is supported by the FSM.
  • VIRIAM (U+17D1) is a separate virama-like sign distinct from COENG: it does not stack a following consonant. The syllable FSM tracks the previous codepoint (not just the previous category) so it can distinguish COENG from VIRIAM when deciding whether a trailing consonant continues this syllable.
  • Pre-base vowels (MatraPos = 1): E (U+17C1), AE (U+17C2), AI (U+17C3), OE (U+17BE), OO (U+17C4), AU (U+17C5). OE / OO / AU have visual top or right components that are GSUB-rendered; the reorder pre-pass moves only the logical pre-base component to syllable start.
  • Register shifters: MUUSIKATOAN (U+17C9, 1st series) and TRIISAP (U+17CA, 2nd series). These plus other signs (U+17CB–U+17D0, U+17D3, U+17DD) are categorised as 12 and route to the above-base buffer.
  • Bindu (NIKAHIT U+17C6) is categorised as 6 and routes to above-base.
  • Visarga signs (REAHMUK U+17C7, YUUKALEAPINTU U+17C8) are categorised as 7 and route to post-base.
  • Above-base vowels (MatraPos = 3): I (U+17B7), II (U+17B8), Y (U+17B9), YY (U+17BA).
  • Below-base vowels (MatraPos = 4): U (U+17BB), UU (U+17BC), UA (U+17BD).
  • Post-base vowels (MatraPos = 2): AA (U+17B6), YA (U+17BF), IE (U+17C0).
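
The codepoint lists above can be written down as a pair of lookup functions. The following is a minimal sketch only: KhmerMatraPosSketch and KhmerSignCategorySketch are hypothetical names introduced for illustration and are not part of the HotPDF API. The numeric values (MatraPos 1 to 4, categories 6, 7 and 12) are taken directly from the list above.

```pascal
program KhmerClassifySketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper (not part of HotPDF): returns the MatraPos value
// listed above for a Khmer dependent vowel, or 0 for anything else.
function KhmerMatraPosSketch(Cp: Cardinal): Integer;
begin
  case Cp of
    $17BE, $17C1..$17C5: Result := 1; // pre-base: OE, E, AE, AI, OO, AU
    $17B6, $17BF, $17C0: Result := 2; // post-base: AA, YA, IE
    $17B7..$17BA:        Result := 3; // above-base: I, II, Y, YY
    $17BB..$17BD:        Result := 4; // below-base: U, UU, UA
  else
    Result := 0;
  end;
end;

// Hypothetical helper: returns the sign category (6, 7 or 12) listed
// above, or 0 when the codepoint is not one of those signs.
function KhmerSignCategorySketch(Cp: Cardinal): Integer;
begin
  case Cp of
    $17C6:                      Result := 6;  // Bindu (NIKAHIT)
    $17C7, $17C8:               Result := 7;  // Visarga (REAHMUK, YUUKALEAPINTU)
    $17C9..$17D0, $17D3, $17DD: Result := 12; // register shifters and other signs
  else
    Result := 0; // includes VIRIAM (U+17D1) and COENG (U+17D2)
  end;
end;

begin
  Writeln(KhmerMatraPosSketch($17C1));     // E (pre-base): prints 1
  Writeln(KhmerMatraPosSketch($17BD));     // UA (below-base): prints 4
  Writeln(KhmerSignCategorySketch($17CA)); // TRIISAP: prints 12
  Writeln(KhmerSignCategorySketch($17D2)); // COENG: prints 0
end.
```

The sketch takes the codepoint as a Cardinal rather than a WideChar so the case labels read exactly like the U+ values in the list.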

Reorder behavior

  • Pre-base vowels emit before the base block (analogous to Brahmic R2).
  • Above-base vowels emit after the base block.
  • Below-base vowels emit after the above-base block.
  • Post-base vowels and Visarga emit after the below-base block.
  • Register shifters, Bindu, and other above signs all route to the above-base block so they render after the consonant stack.
  • Consonant cluster (consonants + COENG pairs) stays in the base block in original logical order — GSUB 'pres' / 'blws' handle the visual subscript stacking.
  • NO Repha extraction: Khmer syllables containing Ra + COENG + Consonant stay verbatim in the base block.
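
Keeping the cluster in the base block depends on knowing when a consonant still belongs to the current syllable. As described under Khmer specifics, the deciding input is the previous codepoint: COENG stacks the following consonant, VIRIAM does not. A minimal sketch of that test follows. The function names are hypothetical, the real FSM is not shown in this document, and the consonant range U+1780..U+17A2 is an assumption taken from the Unicode Khmer block rather than from HotPDF.

```pascal
program KhmerCoengSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  KHMER_VIRIAM = $17D1; // virama-like sign, does not stack
  KHMER_COENG  = $17D2; // subscript joiner

// Assumption: Khmer consonants occupy U+1780..U+17A2 (KA..QA).
function IsKhmerConsonantSketch(Cp: Cardinal): Boolean;
begin
  Result := (Cp >= $1780) and (Cp <= $17A2);
end;

// Hypothetical check: a consonant continues the current syllable only
// when the codepoint immediately before it is COENG.
function ContinuesSyllableSketch(PrevCp, Cp: Cardinal): Boolean;
begin
  Result := IsKhmerConsonantSketch(Cp) and (PrevCp = KHMER_COENG);
end;

begin
  Writeln(ContinuesSyllableSketch(KHMER_COENG, $1781));  // COENG + KHA: prints TRUE
  Writeln(ContinuesSyllableSketch(KHMER_VIRIAM, $1781)); // VIRIAM + KHA: prints FALSE
  Writeln(ContinuesSyllableSketch($1780, $1781));        // KA + KHA: prints FALSE
end.
```

Because the check is applied pairwise, nested coeng (C + COENG + C + COENG + C) needs no special case: each COENG + Consonant pair extends the same syllable.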

Output layout per syllable: [pre-vowels] + [consonants + COENG pairs] + [above-vowels / register-shifters / Bindu / signs] + [below-vowels] + [post-vowels / Visarga]. Single-pass; idempotent on simple inputs.
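
The layout above can be illustrated with a small routing sketch. It is not the HotPDF implementation: ReorderSyllableSketch is a hypothetical function that assumes its input is exactly one already-segmented Khmer syllable and simply distributes each UTF-16 code unit into five buffers before concatenating them. Anything not listed as a vowel or sign (consonants, COENG and, as an assumption of this sketch, VIRIAM) stays in the base buffer in its original order.

```pascal
program KhmerRouteSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: reorders ONE pre-segmented Khmer syllable into
// [pre] + [base] + [above] + [below] + [post].
function ReorderSyllableSketch(const Syl: UnicodeString): UnicodeString;
var
  PreBuf, BaseBuf, AboveBuf, BelowBuf, PostBuf: UnicodeString;
  I: Integer;
begin
  PreBuf := ''; BaseBuf := ''; AboveBuf := ''; BelowBuf := ''; PostBuf := '';
  for I := 1 to Length(Syl) do
    case Ord(Syl[I]) of
      // pre-base vowels: OE, E, AE, AI, OO, AU
      $17BE, $17C1..$17C5:
        PreBuf := PreBuf + Syl[I];
      // above-base vowels, Bindu, register shifters and other signs
      $17B7..$17BA, $17C6, $17C9..$17D0, $17D3, $17DD:
        AboveBuf := AboveBuf + Syl[I];
      // below-base vowels: U, UU, UA
      $17BB..$17BD:
        BelowBuf := BelowBuf + Syl[I];
      // post-base vowels (AA, YA, IE) and Visarga
      $17B6, $17BF, $17C0, $17C7, $17C8:
        PostBuf := PostBuf + Syl[I];
    else
      // consonants and COENG pairs keep their logical order
      BaseBuf := BaseBuf + Syl[I];
    end;
  Result := PreBuf + BaseBuf + AboveBuf + BelowBuf + PostBuf;
end;

var
  Input: UnicodeString;
begin
  // KA + E becomes E + KA (pre-base vowel moves to syllable start).
  Input := #$1780#$17C1;
  Writeln(ReorderSyllableSketch(Input) = #$17C1#$1780); // prints TRUE

  // KA + COENG + KHA + AA is already in layout order and is unchanged.
  Input := #$1780#$17D2#$1781#$17B6;
  Writeln(ReorderSyllableSketch(Input) = Input); // prints TRUE
end.
```

The sketch omits syllable segmentation and non-Khmer pass-through, both of which the real pre-pass performs before any routing takes place.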

Example

var
  Wide: UnicodeString;
begin
  // Doc is the document object that exposes ApplyKhmerReorder.
  // Input: KA (U+1780) + E-vowel (U+17C1, pre-base)
  Wide := Doc.ApplyKhmerReorder(#$1780#$17C1);
  // Wide is now: E (U+17C1) + KA (U+1780)

  // Input: KA + COENG + KHA + AA-vowel (stacked consonant + post vowel)
  Wide := Doc.ApplyKhmerReorder(#$1780#$17D2#$1781#$17B6);
  // Wide unchanged: KA + COENG + KHA + AA
  //   (COENG cluster stays in BaseBuf in original order; AA in PostBuf)
end;

Standards

  • Unicode 16.0 §16.4 (Khmer)
  • Unicode 16.0 IndicSyllabicCategory.txt, IndicPositionalCategory.txt
  • ISO 32000-1 §9.10 (extraction of text content)
  • OpenType Khmer shaping spec (script tag 'khmr')

Version history

  • v2.120.9 — Introduced in Phase 8f.9. Complete shaper with COENG subscript handling, register shifters, and pre-base vowel rotation. Khmer becomes the ninth registered Indic script and the first South-East Asian script in the registry.