Without a ToUnicode CMap, PDF viewers can't map glyphs to Unicode values -> rely on pdf.js?? #332
The file "ill.pdf" to test is a small one; the picture above shows the result in JSON.
ill01.pdf
ill.pdf
![image](https://private-user-images.githubusercontent.com/130582247/308902923-b4a7ad54-798d-41d3-99e7-f5ac200df8c7.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTk3ODMyNzEsIm5iZiI6MTcxOTc4Mjk3MSwicGF0aCI6Ii8xMzA1ODIyNDcvMzA4OTAyOTIzLWI0YTdhZDU0LTc5OGQtNDFkMy05OWU3LWY1YWMyMDBkZjhjNy5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNjMwJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDYzMFQyMTI5MzFaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1mNmZhODk2NDkyMTVjNDcyMmY2NmQ4MDAwNTdiMDY0YjY3MmQxZTBmNzEzMDk0MWE1NTZjYmY5ODUzMzYyMGVhJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.2Yuisy5YxNtsdiREU4a1W3jzJJsllEegQG91w0HIBtY)
pdf-testfile: Minimal set with UTF-8 characters encoded
ill01.pdf
ill00.pdf
To start the text extraction, use:
node pdf2json -cvf ill01.pdf
I expect correct character mappings when UTF-8 characters are encoded; see the end for details.
See extracted text of ill01.pdf
See extracted text of ill00.pdf and search for terms that include 'ff', 'ft', or "n's"
ill01.pdf
PDF file(s) that cause the issue. See top: ill01.pdf
content of the pdf-file (seen at end):
What does the "CMap/encoding Identity-H" tell us?
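Identity-H means the 2-byte character codes in the content stream are used directly as CIDs (horizontal writing mode). When the font's CIDToGIDMap is also the identity, the CIDs are the same as the glyph indices (GIDs), so there's no need to remap them.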
However, without a ToUnicode CMap, PDF viewers can't map glyphs to Unicode values.
Note that both viewers tested, Chromium and Edge, are able to map the UTF-8 characters as given, whereas:
pdf.js does not
pypdf does not
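
To illustrate the point about Identity-H, here is a minimal sketch in plain JavaScript. It is not how pdf.js, pdf2json, or pypdf work internally. The helper names `decodeIdentityH` and `toText` are hypothetical, and the hex string and the ToUnicode table are made-up sample values, not taken from ill01.pdf. It only shows that the 2-byte codes give glyph numbers, and that turning them into text needs an extra code-to-Unicode table.

```javascript
// Hypothetical helper: split a hex string such as <002B00480138>
// into 2-byte codes. Under Identity-H each code is used as the CID as-is.
function decodeIdentityH(hex) {
  const clean = hex.replace(/[<>\s]/g, "");
  if (clean.length % 4 !== 0) {
    throw new Error("expected a whole number of 2-byte codes");
  }
  const codes = [];
  for (let i = 0; i < clean.length; i += 4) {
    codes.push(parseInt(clean.slice(i, i + 4), 16));
  }
  return codes;
}

// Hypothetical helper: map codes to text through a ToUnicode-like table.
// Codes with no entry (or no table at all) become U+FFFD.
function toText(codes, toUnicode) {
  return codes
    .map((code) =>
      toUnicode && toUnicode.has(code) ? toUnicode.get(code) : "\uFFFD"
    )
    .join("");
}

// Made-up sample data: three glyph codes, the last one standing for an "ff" ligature.
const sampleCodes = decodeIdentityH("<002B00480138>");
const sampleToUnicode = new Map([
  [0x002b, "H"],
  [0x0048, "e"],
  [0x0138, "ff"],
]);

console.log(sampleCodes);                          // glyph numbers only
console.log(toText(sampleCodes, sampleToUnicode)); // text, because a table exists
console.log(toText(sampleCodes, null));            // no table: only replacement characters
```

Without the table the codes still select the right glyphs for rendering, which is why the page displays correctly; a viewer that still produces text in that case must be getting the Unicode values from somewhere else, for example from the embedded font itself.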