Missing spaces in extract_text() method #1328

Sunguru · 2022-09-06T15:51:16Z

Missing spaces in extract_text() method.
See attached PDFs.
The text itself is extracted fine, but it comes out with no spaces in almost all fields.

Environment

$ python -c "import pypdf;print(pypdf.__version__)"
pypdf==3.14.0

Code + PDF

PDF: 0004.pdf

from pypdf import PdfReader, __version__

print(f"pypdf=={__version__}")

reader = PdfReader("0004.pdf")

page = reader.pages[0]
extracted = page.extract_text().split("Description:")[1].split("8/11/22")[0]
print(extracted)

gives:

 Reportingcrudeoilleak.
Leakwasisolatedtowell
pad.Segmentoflinewas
immediatelyisolated,now
estimatedat5barrelsofoil
spilt.Rootcausestill
unknownatthistime.

expected (copy-pasted from Google Chrome):

Reporting crude oil leak.
Leak was isolated to well
pad. Segment of line was
immediately isolated, now
estimated at 5 barrels of oil
spilt. Root cause still
unknown at this time.
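For anyone turning this into a regression test: the two outputs above appear to differ only in whitespace. The sketch below is not pypdf code; it uses just the standard library, with both strings copied from this issue, and the helper name `strip_whitespace` is made up for illustration. It checks that the non-whitespace characters match and that the actual output has lost every word-separating space.

```python
# Strings copied from this issue: what pypdf returned vs. the Chrome copy-paste.
actual = (
    " Reportingcrudeoilleak.\n"
    "Leakwasisolatedtowell\n"
    "pad.Segmentoflinewas\n"
    "immediatelyisolated,now\n"
    "estimatedat5barrelsofoil\n"
    "spilt.Rootcausestill\n"
    "unknownatthistime.\n"
)
expected = (
    "Reporting crude oil leak.\n"
    "Leak was isolated to well\n"
    "pad. Segment of line was\n"
    "immediately isolated, now\n"
    "estimated at 5 barrels of oil\n"
    "spilt. Root cause still\n"
    "unknown at this time.\n"
)


def strip_whitespace(text: str) -> str:
    """Hypothetical helper: drop all whitespace (spaces and newlines)."""
    return "".join(text.split())


# True when the extraction got every visible character right.
same_characters = strip_whitespace(actual) == strip_whitespace(expected)

# Count the spaces between words, ignoring leading/trailing whitespace.
actual_spaces = actual.strip().count(" ")
expected_spaces = expected.strip().count(" ")

print(f"same characters: {same_characters}")
print(f"spaces in actual: {actual_spaces}, spaces in expected: {expected_spaces}")
```

If `same_characters` holds while `actual_spaces` is zero, the problem is limited to space detection rather than character decoding, which matches the whitespace label on this issue.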

0000.pdf

Yes, you may add to the tests. It is public data from here: https://northdakota.hazconnect.com/ListIncidentPublic.aspx

P.S. Thank you for the great package!


tpcgold · 2023-09-11T12:41:47Z

Is there any workaround for this so far?
I ran into the exact same issue with pypdf.

MartinThoma added the workflow-text-extraction label (From a users perspective, text extraction is the affected feature/workflow) on Sep 6, 2022

MartinThoma added the is-bug label (From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF) on Sep 24, 2022

MartinThoma added the whitespace label (While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.) on Jan 14, 2023
