Missing spaces in extract_text() method #1328

Open
Sunguru opened this issue Sep 6, 2022 · 1 comment
Labels
is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
whitespace: While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.
workflow-text-extraction: From a users perspective, text extraction is the affected feature/workflow

Comments

Sunguru commented Sep 6, 2022

Missing spaces in extract_text() method.
See attached PDFs.
The text itself is extracted correctly, but the spaces are missing in almost all fields.

Environment

$ python -c "import pypdf;print(pypdf.__version__)"
pypdf==3.14.0

Code + PDF

PDF: 0004.pdf

from pypdf import PdfReader, __version__

print(f"pypdf=={__version__}")

reader = PdfReader("0004.pdf")

page = reader.pages[0]
extracted = page.extract_text().split("Description:")[1].split("8/11/22")[0]
print(extracted)

gives:

 Reportingcrudeoilleak.
Leakwasisolatedtowell
pad.Segmentoflinewas
immediatelyisolated,now
estimatedat5barrelsofoil
spilt.Rootcausestill
unknownatthistime.

expected (copy-pasted from Google Chrome):

Reporting crude oil leak.
Leak was isolated to well
pad. Segment of line was
immediately isolated, now
estimated at 5 barrels of oil
spilt. Root cause still
unknown at this time.
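
To check that whitespace is the only difference between the two outputs above, here is a minimal sketch that compares them with all whitespace removed. The two strings are copied from this report; the helper name `strip_whitespace` is only illustrative and is not part of pypdf.

```python
import re

# Output of page.extract_text() as reported above.
extracted = """ Reportingcrudeoilleak.
Leakwasisolatedtowell
pad.Segmentoflinewas
immediatelyisolated,now
estimatedat5barrelsofoil
spilt.Rootcausestill
unknownatthistime.
"""

# Text copy-pasted from Google Chrome, as reported above.
expected = """Reporting crude oil leak.
Leak was isolated to well
pad. Segment of line was
immediately isolated, now
estimated at 5 barrels of oil
spilt. Root cause still
unknown at this time.
"""


def strip_whitespace(text: str) -> str:
    """Remove every whitespace character (hypothetical helper, not a pypdf API)."""
    return re.sub(r"\s+", "", text)


# True when both strings contain the same non-whitespace characters in the same order.
same_characters = strip_whitespace(extracted) == strip_whitespace(expected)
print(same_characters)
```

If this prints `True`, the characters and their order match and only the spaces between words were lost.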

0000.pdf

Yes, you may add to the tests. It is public data from here: https://northdakota.hazconnect.com/ListIncidentPublic.aspx

P.S. Thank you for the great package!

MartinThoma added the workflow-text-extraction label Sep 6, 2022
MartinThoma added the is-bug label Sep 24, 2022
MartinThoma added the whitespace label Jan 14, 2023
tpcgold commented Sep 11, 2023

Is there any workaround for this so far?
I ran into the exact same issue with pypdf.
