THotPDF.AssignSyntheticCodepointForGID / GetSyntheticCodepointForGID

THotPDF PUA synthetic codepoint allocator (v2.119.68)

 


Allocates and queries Private Use Area (U+E000 - U+F8FF) synthetic codepoints for OpenType GSUB substitute GIDs that have no natural Unicode codepoint reachable through the font's cmap. Closes the producer-side GID-level emission gap left by the v2.119.43-66 GSUB query and refinement APIs.

 

Delphi syntax:

function AssignSyntheticCodepointForGID(GID: Word; out SyntheticCP: Word): Boolean;

function GetSyntheticCodepointForGID(GID: Word): Word;

 

Why the API exists

The v2.119.32-67 producer-side automatic shaping pipeline (Arabic / Latin / Devanagari) requires every substitute GID returned by the GSUB engine to be reachable through a Unicode codepoint: the existing hex-encoded text pipeline emits codepoints, not GIDs, and the consumer reader resolves each codepoint back to a GID through the embedded /CIDToGIDMap. For substitute GIDs that have a natural Unicode codepoint in the font's cmap (Arabic Presentation Forms, Latin Standard Ligatures U+FB00 - U+FB06), the existing pipeline already works.

 

Font-specific substitutes that land on font-internal GIDs, however, have no codepoint at all in the font's cmap. Examples are most Devanagari cluster shapes, stylistic alternates the font designer ships only as numbered GIDs, CJK ideographic variation sequences (IVS), and discretionary ligatures with no corresponding Presentation Form. Before v2.119.68 these GIDs were unreachable through the producer-side hex pipeline; v2.119.68 closes that gap by letting callers allocate a synthetic codepoint in the Private Use Area for any GID.

 

AssignSyntheticCodepointForGID semantics

Allocates the next available PUA codepoint (starting at U+E000) for the supplied GID and mirrors the assignment into every cache that the existing producer-side hex pipeline and the consumer-reader resolution chain depend on:

 

1. FUnicodeCpToGid[SyntheticCP] := GID - so the producer-side hex pipeline emits SyntheticCP into the text-showing operator and the consumer reader resolves SyntheticCP back to GID through /CIDToGIDMap at render time.

2. FAcroFormUnicodeAdvances[SyntheticCP] := em-fraction - so the v2.65 word-wrap calculator finds the correct hmtx advance for the synthetic codepoint when it appears in AcroForm text-field content.

3. FUnicodeSyntheticCpForGID[GID] := SyntheticCP - the per-GID reverse-lookup table used by GetSyntheticCodepointForGID to make repeat AssignSyntheticCodepointForGID calls idempotent (the second call with the same GID returns the already-allocated SyntheticCP).

 

Returns True on success, with SyntheticCP set to the allocated codepoint. Returns False (and leaves SyntheticCP at 0) under any of these defensive conditions:

1. No font is registered (RegisterUnicodeTTF was never called, or was called with empty arguments to reset state).

2. The GID is invalid (zero or beyond the cmap's glyph count).

3. The PUA range is exhausted (all 6400 slots U+E000 - U+F8FF are allocated).

4. The cache is uninitialised on entry.
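
The allocation rules above can be modelled in a few lines. The following is a minimal, self-contained sketch and not HotPDF's implementation: the class TSyntheticCpModel and its methods AssignCp, LookupCp and ResolveGid are hypothetical names, the glyph count is passed in by hand instead of being read from a font, the cursor is assumed to hold the next free codepoint, and the advance-width mirror (step 2) is left out.

```pascal
program SyntheticCpSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

const
  PUA_FIRST = $E000;
  PUA_LAST  = $F8FF;

type
  // Hypothetical stand-in for the allocator state; not the THotPDF class.
  TSyntheticCpModel = class
  private
    FGlyphCount: Integer;
    FCpToGid: array[Word] of Word;   // plays the role of FUnicodeCpToGid
    FCpForGid: array[Word] of Word;  // plays the role of FUnicodeSyntheticCpForGID
    FNextCp: Integer;                // plays the role of FUnicodeNextSyntheticCp
  public
    constructor Create(AGlyphCount: Integer);
    function AssignCp(GID: Word; out SyntheticCP: Word): Boolean;
    function LookupCp(GID: Word): Word;
    function ResolveGid(CP: Word): Word;
  end;

constructor TSyntheticCpModel.Create(AGlyphCount: Integer);
begin
  inherited Create;           // class fields start zero-filled
  FGlyphCount := AGlyphCount;
end;

function TSyntheticCpModel.AssignCp(GID: Word; out SyntheticCP: Word): Boolean;
begin
  Result := False;
  SyntheticCP := 0;
  // Invalid GID: zero, or beyond the glyph count
  if (GID = 0) or (GID >= FGlyphCount) then
    Exit;
  // Idempotent path: the GID already owns a synthetic codepoint
  if FCpForGid[GID] <> 0 then
  begin
    SyntheticCP := FCpForGid[GID];
    Result := True;
    Exit;
  end;
  if FNextCp = 0 then
    FNextCp := PUA_FIRST;     // first allocation initialises the cursor
  if FNextCp > PUA_LAST then
    Exit;                     // all 6400 slots are taken
  SyntheticCP := Word(FNextCp);
  FCpToGid[SyntheticCP] := GID;   // forward table: codepoint -> GID
  FCpForGid[GID] := SyntheticCP;  // reverse table: GID -> codepoint
  Inc(FNextCp);
  Result := True;
end;

function TSyntheticCpModel.LookupCp(GID: Word): Word;
begin
  Result := FCpForGid[GID];   // 0 means "no assignment"
end;

function TSyntheticCpModel.ResolveGid(CP: Word): Word;
begin
  Result := FCpToGid[CP];
end;

var
  M: TSyntheticCpModel;
  CP1, CP2, CP3: Word;
begin
  M := TSyntheticCpModel.Create(1000);  // pretend the font has 1000 glyphs
  try
    M.AssignCp(150, CP1);
    M.AssignCp(151, CP2);
    M.AssignCp(150, CP3);               // repeat call for the same GID
    WriteLn(IntToHex(Integer(CP1), 4), ' ', IntToHex(Integer(CP2), 4), ' ',
      IntToHex(Integer(CP3), 4));       // E000 E001 E000
    WriteLn(M.ResolveGid(CP2));         // 151
    WriteLn(M.LookupCp(999));           // 0
    WriteLn(M.AssignCp(0, CP1));        // FALSE: GID 0 is rejected
  finally
    M.Free;
  end;
end.
```

Keeping both a forward and a reverse table is what makes the repeat call cheap: the reverse table answers "does this GID already have a codepoint?" without scanning the PUA range.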

 

GetSyntheticCodepointForGID semantics

Read-only query of any existing assignment. Returns the synthetic codepoint allocated for GID if AssignSyntheticCodepointForGID(GID, ...) has been called previously; otherwise returns 0, which is not a valid PUA codepoint and therefore doubles as a "no assignment" sentinel. Does not allocate and does not change allocator state. Safe to call before any AssignSyntheticCodepointForGID call has run.

 

Allocator state lifecycle

FUnicodeSyntheticCpForGID and the next-available-PUA cursor (FUnicodeNextSyntheticCp) are lazily allocated on the first AssignSyntheticCodepointForGID call. The cursor starts at 0 (uninitialised) and bumps to $E000 on first allocation; subsequent allocations move it through $E001, $E002, ..., $F8FF. Both fields are reset to empty / 0 on every RegisterUnicodeTTF('', nil) call, together with the rest of the per-font subset state, so callers that re-use a THotPDF instance across multiple documents start each document with a fresh PUA cursor.
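
To make the lifecycle concrete, here is a small sketch of lazy allocation, exhaustion and reset. It is an illustration under stated assumptions, not the library's code: CpForGid, NextCp, Allocate and ResetAllocator are hypothetical names, and ResetAllocator merely stands in for the state reset that the RegisterUnicodeTTF('', nil) call is described as performing.

```pascal
program PuaCursorSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

const
  PUA_FIRST = $E000;
  PUA_LAST  = $F8FF;   // PUA_LAST - PUA_FIRST + 1 = 6400 slots

var
  CpForGid: array of Word;   // hypothetical reverse table, empty until first use
  NextCp: Integer = 0;       // 0 = uninitialised cursor

function Allocate(GID: Word): Word;
begin
  Result := 0;
  if Length(CpForGid) = 0 then
  begin
    SetLength(CpForGid, 65536);   // lazy allocation; new elements are zero
    NextCp := PUA_FIRST;
  end;
  if CpForGid[GID] <> 0 then
  begin
    Result := CpForGid[GID];      // already assigned
    Exit;
  end;
  if NextCp > PUA_LAST then
    Exit;                         // range exhausted, report 0
  Result := Word(NextCp);
  CpForGid[GID] := Result;
  Inc(NextCp);
end;

procedure ResetAllocator;
begin
  SetLength(CpForGid, 0);         // drop the table
  NextCp := 0;                    // cursor back to "uninitialised"
end;

var
  G: Integer;
  Last: Word;
begin
  Last := 0;
  for G := 1 to 6400 do
    Last := Allocate(Word(G));
  WriteLn(IntToHex(Integer(Last), 4));          // F8FF: the 6400th slot
  WriteLn(Allocate(6401));                      // 0: no slot left
  ResetAllocator;
  WriteLn(IntToHex(Integer(Allocate(42)), 4));  // E000: fresh cursor
end.
```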

 

Typical workflow (Devanagari cluster shape)

 

PDF.RegisterUnicodeTTF('NotoDeva', 'NotoSansDevanagari-Regular.ttf');
PDF.ShapingFeatures := [sfIndicShaping];
PDF.SetGSUBScript('deva');

// SyntheticCP is a Word variable (it matches the out parameter);
// BaseGID, X and Y are supplied by the caller.
// Get a font-internal cluster GID through the GSUB engine
ClusterGID := PDF.GetSingleSubstituteGlyph(BaseGID, 'nukt');
if ClusterGID <> BaseGID then
begin
  // The cmap usually does not reach the substitute for Indic
  // clusters, since cluster GIDs are font-internal, so allocate
  // a synthetic codepoint that the producer-side hex pipeline
  // can emit
  if PDF.AssignSyntheticCodepointForGID(ClusterGID, SyntheticCP) then
  begin
    // SyntheticCP is now in the U+E000-F8FF range; emit it
    // through UnicodeTextOut just like a normal codepoint
    PDF.CurrentPage.UnicodeTextOut(X, Y, 0, UnicodeChar(SyntheticCP));
    PDF.MarkUnicodeGlyphUsed(ClusterGID);
  end;
end;

 

Idempotency example

 

// Assumes a freshly registered font with no earlier assignments,
// and that GIDs 150 and 151 are valid for that font
PDF.AssignSyntheticCodepointForGID(150, CP1);  // CP1 = $E000
PDF.AssignSyntheticCodepointForGID(151, CP2);  // CP2 = $E001
PDF.AssignSyntheticCodepointForGID(150, CP3);  // CP3 = $E000 (idempotent)
CP4 := PDF.GetSyntheticCodepointForGID(150);  // CP4 = $E000
CP5 := PDF.GetSyntheticCodepointForGID(999);  // CP5 = 0 (no assignment)

 

Consumer-reader behaviour

The consumer reader sees the PUA codepoint in the text-showing operator and resolves it through the document-embedded /CIDToGIDMap to the target GID, then renders that GID using the embedded font program. From the reader's perspective there is no difference between a "natural" Unicode codepoint that the cmap routes to GID and a PUA synthetic codepoint that /CIDToGIDMap routes to GID - both produce the same rendered glyph.
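
The reader-side lookup can be sketched as follows. In the PDF format a /CIDToGIDMap stream holds two bytes per CID, high-order byte first, giving the GID for that CID. The sketch assumes the CID equals the emitted codepoint, as the description above implies; SetMapEntry and LookupGid are hypothetical helper names and the code is not taken from HotPDF or from any reader.

```pascal
program CidToGidSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

// Write one big-endian GID entry at the slot that belongs to CID.
procedure SetMapEntry(var Map: TBytes; CID, GID: Word);
begin
  Map[Integer(CID) * 2]     := Byte(GID shr 8);
  Map[Integer(CID) * 2 + 1] := Byte(GID and $FF);
end;

// Read the entry back the way a reader would: two bytes per CID.
function LookupGid(const Map: TBytes; CID: Word): Word;
begin
  Result := (Word(Map[Integer(CID) * 2]) shl 8) or Map[Integer(CID) * 2 + 1];
end;

var
  Map: TBytes;
begin
  SetLength(Map, 65536 * 2);       // one entry per 16-bit CID, zero-filled
  SetMapEntry(Map, $E000, 150);    // synthetic codepoint U+E000 -> GID 150
  WriteLn(LookupGid(Map, $E000));  // 150
  WriteLn(LookupGid(Map, $E001));  // 0: unmapped CIDs fall back to GID 0
end.
```

Because the lookup is a plain table index, the reader has no way to tell whether a slot was filled from the font's cmap or from a synthetic assignment.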

 

Copy / paste behaviour: PUA codepoints round-trip as themselves through copy / paste when the ToUnicode CMap declares them as identity mappings. Callers that want the source Unicode characters (the input run that produced the substitute) to round-trip instead have two options: register a reverse mapping with RegisterToUnicodeReverseMapping, or author ActualText marked-content sequence properties through BeginTaggedContent and emit the synthetic codepoints inside the bracketed content. HotPDF uses the same internal-CID pattern automatically for RegisterUnicodeTTF-backed AcroForm appearance streams that contain supplementary-plane Unicode characters.
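
The difference between the two round-trip outcomes shows up in the ToUnicode CMap entries. A bfchar entry pairs a source code in hex with a destination string in UTF-16BE hex. The sketch below only formats such entries for illustration: BfCharLine is a hypothetical helper, the input run U+0915 U+093C (KA + NUKTA) is an example chosen here, and nothing in it states what RegisterToUnicodeReverseMapping actually writes.

```pascal
program ToUnicodeEntrySketch;

{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

// Format one bfchar entry: <source code> <destination UTF-16BE units>.
// Only BMP code units are handled, which is enough for this example.
function BfCharLine(CID: Word; const Dest: array of Word): string;
var
  I: Integer;
begin
  Result := '<' + IntToHex(Integer(CID), 4) + '> <';
  for I := Low(Dest) to High(Dest) do
    Result := Result + IntToHex(Integer(Dest[I]), 4);
  Result := Result + '>';
end;

begin
  // Identity mapping: copy / paste yields the PUA codepoint itself
  WriteLn(BfCharLine($E000, [$E000]));          // <E000> <E000>
  // Reverse mapping: copy / paste yields the source run instead
  WriteLn(BfCharLine($E000, [$0915, $093C]));   // <E000> <0915093C>
end.
```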

 

Phase 8 roadmap closure

v2.119.68 / Phase 8c.6 closes the Phase 8 GSUB engine roadmap. The following pieces now integrate into a single producer-side shaping surface that handles every kind of substitute glyph an OpenType font can produce:

1. Every LookupType 1-8 query API (Phase 1-6).

2. The Script / LangSys selection API (Phase 7).

3. The TTF subsetter closure entry point (Phase 9).

4. The static post-pass ligature folding (v2.119.32 / 58 / 60 / 62).

5. The opt-in automatic pipeline (v2.119.59).

6. Arabic rlig + Latin liga / clig + rclt automatic emission (Phase 8b / 8c.2 / 8b / GSUB 'rclt').

7. ToUnicode reverse-mapping (v2.119.61 / 62 / 65).

8. Advance query (v2.119.64).

9. Devanagari Indic reorder pre-pass (v2.119.67).

10. PUA synthetic codepoint GID-level emit (v2.119.68).

 

See also: OpenType GSUB Substitution Engine, Automatic Shaping Pipeline (Phase 8), Arabic / Persian / Urdu Shaping Support, Syriac / Mongolian / Devanagari Shaping, THPDFPage.BeginTaggedContent