property CharacterIsHyphen[Index: Integer]: Boolean; // read only
| Parameter | Description |
|---|---|
| Index | Zero-based character index on the current page, in the range 0 to CharacterCount - 1. |
CharacterIsHyphen returns True when PDFium classifies the
character at Index as a soft (line-break) hyphen rather than a regular hyphen
that belongs to a compound word. The classification is heuristic: it looks at the
character's Unicode code point, the line-end position, and the proximity of the
following text run to decide whether the hyphen should be kept when joining lines.
Soft hyphens are inserted by typesetting engines (LaTeX, Word, InDesign) to break
long words across lines. When you copy text from a PDF viewer, the soft hyphen is
typically dropped, so that "high-" at the end of one line followed by "way" on the
next line is pasted as highway rather than as high-way. This property lets your own
extraction code reproduce that behavior.
The property does not distinguish the Unicode soft hyphen (U+00AD) from
a regular hyphen-minus (U+002D) that happens to sit at a line end — both
will report True in those situations. If you only want the
U+00AD case, compare Character[Index]
directly.
A regular hyphen in the middle of a line reports False; the same hyphen at line end reports True.
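To single out the U+00AD case, a small code-point test is enough. The sketch below is self-contained and does not touch the component; `IsUnicodeSoftHyphen` is a hypothetical helper name, and passing `Pdf.Character[I]` to it assumes that `Character` yields a single wide character.

```pascal
program SoftHyphenCheck;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

const
  // Explicit casts keep both constants wide characters on every compiler.
  UnicodeSoftHyphen = WideChar($00AD); // U+00AD SOFT HYPHEN
  HyphenMinus       = WideChar($002D); // U+002D HYPHEN-MINUS

// Hypothetical helper: True only for the Unicode soft hyphen code point.
function IsUnicodeSoftHyphen(Ch: WideChar): Boolean;
begin
  Result := Ch = UnicodeSoftHyphen;
end;

begin
  // With the component this would read (assumption, see above):
  //   if IsUnicodeSoftHyphen(Pdf.Character[I]) then ...
  WriteLn(IsUnicodeSoftHyphen(UnicodeSoftHyphen));
  WriteLn(IsUnicodeSoftHyphen(HyphenMinus));
end.
```

Because the test looks only at the code point, it ignores line position entirely, which is the opposite trade-off from CharacterIsHyphen.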
// Reconstruct words across line wraps, dropping soft hyphens
var
  I: Integer;
  S: WString;
begin
  S := '';
  for I := 0 to Pdf.CharacterCount - 1 do
    if not Pdf.CharacterIsHyphen[I] then
      S := S + Pdf.Character[I];
  Memo1.Text := S;
end;
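If you want to exercise the joining logic without loading a PDF, the same loop can be factored into a pure function. This is only a sketch: `DropFlaggedHyphens` and its `Flags` parameter are hypothetical names that stand in for the values `CharacterIsHyphen` would report for each character.

```pascal
program JoinWrappedText;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

// Hypothetical helper: copies Text, skipping every character whose flag is
// True. Flags is expected to hold one entry per character of Text;
// characters without a flag are kept.
function DropFlaggedHyphens(const Text: UnicodeString;
  const Flags: array of Boolean): UnicodeString;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(Text) do
    // Strings are 1-based, the open array is 0-based.
    if (I - 1 > High(Flags)) or not Flags[I - 1] then
      Result := Result + Text[I];
end;

begin
  // 'high-way' with only the hyphen (zero-based index 4) flagged.
  WriteLn(DropFlaggedHyphens('high-way',
    [False, False, False, False, True, False, False, False])); // prints highway
end.
```

Keeping the flags separate from the text makes it easy to feed the function hand-written test cases, such as a compound word whose hyphen is not flagged and must survive.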