Without a ToUnicode CMap, PDF viewers can't map glyphs to Unicode values -> rely on pdf.js?? #332

MaxBorn22 · 2024-02-29T11:21:49Z

pdf-testfile: Minimal set with UTF-8 characters encoded
ill01.pdf
ill00.pdf

To start the text extraction, run:
node pdf2json -cvf ill01.pdf

I expect correct character mappings when UTF-8 characters are encoded; see the end for details.
See the extracted text of ill01.pdf.
See the extracted text of ill00.pdf and search for terms that include "ff" or "ft" or "n's".

ill01.pdf
PDF file(s) that cause the issue. See top: ill01.pdf

Content of the PDF file (seen at its end):


/Encoding /Identity-H
/DescendantFonts [147 0 R]
/ToUnicode 148 0 R>>
endobj

What does the CMap/encoding "Identity-H" tell us?
The character codes (CIDs) are the same as the glyph indices (GIDs), so there's no need to remap them.

However, without a ToUnicode CMap, PDF viewers can't map glyphs to Unicode values.
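
To make the role of the ToUnicode CMap concrete, here is a minimal sketch of how an extractor could use its `bfchar` entries to turn Identity-H codes into text. The CMap fragment and both of its entries are made up for illustration (they are not taken from ill01.pdf), and the helper names `parseBfChars` and `decodeHexString` are hypothetical, not part of pdf2json or pdf.js. Real CMaps also use `bfrange`, which this sketch ignores.

```javascript
// Hypothetical ToUnicode CMap fragment; the two bfchar entries are invented.
const sampleCMap = `
begincmap
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<0024> <0041>
<01A5> <00660066>
endbfchar
endcmap
`;

// Collect every "<src> <dst>" pair found between beginbfchar and endbfchar.
function parseBfChars(cmapText) {
  const table = new Map();
  const sectionRe = /beginbfchar([\s\S]*?)endbfchar/g;
  const pairRe = /<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/g;
  for (const section of cmapText.matchAll(sectionRe)) {
    for (const [, src, dst] of section[1].matchAll(pairRe)) {
      // The destination is UTF-16BE: every 4 hex digits form one code unit,
      // so a ligature glyph can map to several characters ("ff").
      let text = "";
      for (let i = 0; i < dst.length; i += 4) {
        text += String.fromCharCode(parseInt(dst.slice(i, i + 4), 16));
      }
      table.set(parseInt(src, 16), text);
    }
  }
  return table;
}

// With Identity-H, a shown string is a sequence of 2-byte codes.
function decodeHexString(hex, table) {
  let out = "";
  for (let i = 0; i < hex.length; i += 4) {
    const cid = parseInt(hex.slice(i, i + 4), 16);
    // No entry means the code cannot be mapped: emit U+FFFD as a placeholder.
    out += table.has(cid) ? table.get(cid) : "\uFFFD";
  }
  return out;
}

const table = parseBfChars(sampleCMap);
console.log(decodeHexString("002401A5", table)); // "Aff"
console.log(decodeHexString("00240999", table)); // "A" followed by U+FFFD
```

The second call shows the failure mode: a code with no ToUnicode entry has no recoverable Unicode value, so the extractor has to substitute something.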

0000059775 00000 n 
0000000192 00000 n 
0000000392 00000 n 
0000000591 00000 n 
0000020809 00000 n 
0000006174 00000 n 
0000006481 00000 n 
0000006776 00000 n 
0000021054 00000 n 
0000012089 00000 n 
0000012374 00000 n 
0000012642 00000 n
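
For readers unfamiliar with these lines: they are cross-reference (xref) table entries, each giving a 10-digit byte offset, a 5-digit generation number, and `n` (in use) or `f` (free). A small sketch for reading one entry follows; the function name `parseXrefEntry` is hypothetical. The subsection header that gives the first object number is not part of the excerpt, so the entries cannot be tied to object numbers here.

```javascript
// Parse one classic xref entry such as "0000059775 00000 n ".
function parseXrefEntry(line) {
  const m = /^(\d{10}) (\d{5}) ([nf])\s*$/.exec(line);
  if (!m) return null; // not an xref entry
  return {
    offset: Number(m[1]),     // byte offset of the object in the file
    generation: Number(m[2]), // generation number
    inUse: m[3] === "n",      // "n" = in use, "f" = free
  };
}

console.log(parseXrefEntry("0000059775 00000 n "));
```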

[email protected] [https://github.com/modesty/pdf2json]
-------------
pdf2json log:
Warning: Output file will be replaced - ill01.json
Info: Transcoding File ill01.pdf to - ill01.json
Info: about to load PDF file ill01.pdf
Info: Load OK: ill01.pdf
Warning: Setting up fake worker.
Info: PDF loaded. pagesCount = 1
Info: start to parse page:1
Warning: TT: complementing a missing function tail
Info: Skipped: tiny fill: 0 x 0
Info: Success: Page 1
Info: complete parsing page:1
Info: PDF parsing completed.

Note that both viewers tested, Chromium and Edge, are able to map the UTF-8 characters as given;
pdf.js does not,
pypdf does not.


MaxBorn22 · 2024-02-29T11:22:30Z

The test file "ill.pdf" is small; the picture above shows the resulting JSON.
"ff" is a Latin ligature character (ﬀ, U+FB00, &#64256;), see https://www.compart.com/en/unicode/U+FB00
The "ft" in ". Left unchecked" is also missing from the JSON output of the test file "ill.pdf".
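
As a side note, if an extractor does emit the ligature code point U+FB00, the plain letters can be recovered with Unicode NFKC normalization. The sketch below only illustrates that step on a made-up string; it cannot restore a character that was already replaced by \u0000. As far as I know, Unicode has no dedicated code point for an "ft" ligature, so that glyph can only be resolved through a ToUnicode entry that maps it to "f" + "t".

```javascript
// U+FB00 is a single code point whose compatibility decomposition is "f" + "f".
const extracted = "tra\uFB00icking"; // made-up example of extractor output
const normalized = extracted.normalize("NFKC");

console.log(normalized);                          // "trafficking"
console.log(extracted.length, normalized.length); // 10 11
```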

MaxBorn22 · 2024-03-01T09:31:49Z

ill01.pdf

![image](https://github.com/modesty/pdf2json/assets/130582247/9dfe54ae-90ed-4fad-b486-38097f920d8c)
[email protected] [https://github.com/modesty/pdf2json]

MaxBorn22 changed the title ~~Who makes \u0000 out of each "ff" , e.g. "trafficking" goes "tra\00icking ?~~ Without a ToUnicode CMap, PDF viewers can't map glyphs to Unicode values -> rely on pdf.js?? Mar 1, 2024
