PDFiumVCL Docs

CharacterIsHyphen property

Component: TPdf  ·  Unit: PDFium
Returns True when the character at the specified index is recognised as a soft hyphen used for end-of-line word breaking. Useful when reconstructing words across line wraps without retaining the visible hyphen mark.

Syntax

property CharacterIsHyphen[Index: Integer]: Boolean; // read only

Index: Zero-based character index on the current page, in the range 0 to CharacterCount - 1.

Description

CharacterIsHyphen returns True when PDFium classifies the character at Index as a soft (line-break) hyphen rather than a regular hyphen that belongs to a compound word. The classification is heuristic: it looks at the character's Unicode code point, the line-end position, and the proximity of the following text run to decide whether the hyphen should be kept when joining lines.

Soft hyphens are inserted by typesetting engines (LaTeX, Word, InDesign) to break long words across lines. When you copy text from a PDF viewer, the soft hyphen is typically dropped, so that "high-" at the end of one line followed by "way" on the next is pasted as "highway" rather than "high-way". This property lets your own extraction code reproduce that behaviour.
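The copy-and-paste behaviour described above can be illustrated on plain strings. The following is a minimal sketch that does not call PDFiumVCL at all: JoinWrapped is a hypothetical helper, and its DropHyphen parameter stands in for the value CharacterIsHyphen would report for the last character of the first line.

```pascal
program JoinWrappedDemo;
{$ifdef FPC}{$mode delphi}{$endif}

// Hypothetical helper, not part of PDFiumVCL.
// Joins two consecutive lines of text. DropHyphen plays the role of
// CharacterIsHyphen for the final character of First.
function JoinWrapped(const First, Second: string; DropHyphen: Boolean): string;
var
  EndsWithHyphen: Boolean;
begin
  EndsWithHyphen := (Length(First) > 0) and (First[Length(First)] = '-');
  if EndsWithHyphen and DropHyphen then
    // Line-break hyphen: remove it and glue the two halves together
    Result := Copy(First, 1, Length(First) - 1) + Second
  else if EndsWithHyphen then
    // Hyphen belongs to a compound word: keep it, add no space
    Result := First + Second
  else
    // Ordinary line wrap between two words: separate them with a space
    Result := First + ' ' + Second;
end;

begin
  WriteLn(JoinWrapped('high-', 'way', True));    // highway
  WriteLn(JoinWrapped('well-', 'known', False)); // well-known
  WriteLn(JoinWrapped('soft', 'hyphen', False)); // soft hyphen
end.
```

In real extraction code the decision would come from CharacterIsHyphen for each character rather than from a caller-supplied flag.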

The property does not distinguish the Unicode soft hyphen (U+00AD) from a regular hyphen-minus (U+002D) that happens to sit at a line end: both report True in those situations. If you only want the U+00AD case, compare Character[Index] against U+00AD directly.
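One way to separate the two cases is sketched below. ClassifyHyphen and THyphenKind are hypothetical names introduced for illustration, not part of PDFiumVCL. The function takes the character and the flag as plain parameters so it can run without a document; the sketch also assumes that Character[Index] yields a single WideChar.

```pascal
program HyphenKindDemo;
{$ifdef FPC}{$mode delphi}{$endif}

type
  // Hypothetical classification, not a PDFiumVCL type
  THyphenKind = (hkNone, hkUnicodeSoft, hkOtherLineBreak);

// Ch stands in for Pdf.Character[Index] (assumed to be a WideChar);
// Flagged stands in for Pdf.CharacterIsHyphen[Index].
function ClassifyHyphen(Ch: WideChar; Flagged: Boolean): THyphenKind;
begin
  if not Flagged then
    Result := hkNone                // not reported as a line-break hyphen
  else if Ch = WideChar($00AD) then
    Result := hkUnicodeSoft         // U+00AD SOFT HYPHEN
  else
    Result := hkOtherLineBreak;     // e.g. U+002D sitting at a line end
end;

begin
  WriteLn(ClassifyHyphen(WideChar($00AD), True) = hkUnicodeSoft);    // TRUE
  WriteLn(ClassifyHyphen(WideChar($002D), True) = hkOtherLineBreak); // TRUE
  WriteLn(ClassifyHyphen(WideChar($002D), False) = hkNone);          // TRUE
end.
```

With a loaded page, the call would look like ClassifyHyphen(Pdf.Character[I], Pdf.CharacterIsHyphen[I]), under the same assumption about the type of Character.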

Example

// Reconstruct words across line wraps, dropping soft hyphens
var
  I: Integer;
  S: WString;
begin
  S := '';
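  // Walk every character on the current page
  for I := 0 to Pdf.CharacterCount - 1 do
    // Skip characters flagged as line-break hyphens; keep everything else
    if not Pdf.CharacterIsHyphen[I] then
      S := S + Pdf.Character[I];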
  Memo1.Text := S;
end;

See Also

Character, CharacterGenerated, CharacterCount, Text