v2.1 extract_text() misses newline characters #957

Closed
S1SYPHOS opened this issue Jun 7, 2022 · 14 comments
Labels
is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
whitespace: While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow

Comments

@S1SYPHOS

S1SYPHOS commented Jun 7, 2022

Hey there,
when updating from v2.0 to v2.1, extracted words that were previously separated by whitespace are now glued together (see the example below).

Environment

Machine: Linux-5.17.5-76051705-generic-x86_64-with-glibc2.34
PyPDF: 2.1.0

Code

This is a minimal, complete example that shows the issue:

import PyPDF2

# First page
page = PyPDF2.PdfReader('tests/fixtures/test.pdf').pages[0]

print(page.extract_text())

Now, output with v2.0 was like this:

Staatsanwaltschaft Freiburg
Berliner Allee 1, 79114 Freiburg im Breisgau
Tel.:07612050
FreiburgimBreisgau,19.11.2021
Sitzungsplan der Staatsanwaltschaft
Zeitraum
29.11.2021
-
03.12.2021
Eildienst
26.11.2021
-
29.11.2021
Plattner, Adalbert , EOAA

Using v2.1, I get this:

Staatsanwaltschaft FreiburgBerliner Allee 1, 79114 Freiburg im BreisgauTel.: 0761 20 50Freiburg im Breisgau, 19.11.2021

Sitzungsplan der Staatsanwaltschaft



Zeitraum



29.11.2021-03.12.2021Eildienst26.11.2021-29.11.2021Plattner, Adalbert , EOAA

PDF

The PDF file from the example can be found here. The names were redacted, so it contains no personal information despite the looks of it.

@S1SYPHOS added the is-bug label Jun 7, 2022
@MartinThoma
Member

Thank you for sharing and for putting the time into writing an awesome bug report!
We will investigate it.

In the meantime, you can use _extract_text_old to get the pre-2.1 behavior.
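
A minimal sketch of that workaround, assuming `_extract_text_old` is a private method on the page object (as the comment above implies) that may be missing in other releases. `extract_text_compat` is a hypothetical helper name, not part of PyPDF2:

```python
def extract_text_compat(page):
    """Use the pre-2.1 extractor when the page object still exposes it."""
    # `_extract_text_old` is private API, so look it up defensively:
    # it may not exist in releases other than the one discussed here.
    old_extractor = getattr(page, "_extract_text_old", None)
    if callable(old_extractor):
        return old_extractor()
    # Otherwise fall back to the public method.
    return page.extract_text()
```

With the example from the report, this would be called as `extract_text_compat(PyPDF2.PdfReader('tests/fixtures/test.pdf').pages[0])`. Because the method is private, pinning the version is the more robust option for anything long-lived.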

@S1SYPHOS
Author

S1SYPHOS commented Jun 7, 2022

Yeah, for the time being I constrained its version like 'PyPDF2==2.0.0', but you are right!

@MartinThoma
Member

  • 2.0 Had issues with spaces between words, e.g. FreiburgimBreisgau
  • 2.1 Has issues with newlines, e.g. Staatsanwaltschaft FreiburgBerliner Allee 1
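
The two failure modes can be told apart mechanically, which may be useful as a regression check. This is a rough sketch using short excerpts copied from the outputs in the report; `missing_line_breaks` and `glued_phrases` are hypothetical helper names:

```python
def missing_line_breaks(text, expected_line_starts):
    """Return the expected line starts that do not begin any line of `text`."""
    lines = [line.strip() for line in text.splitlines()]
    return [start for start in expected_line_starts
            if not any(line.startswith(start) for line in lines)]


def glued_phrases(text, phrases):
    """Return phrases that occur in `text` only with their spaces removed."""
    return [phrase for phrase in phrases
            if phrase not in text and phrase.replace(" ", "") in text]


# Excerpts of the reported outputs (not the full page text).
V20_EXCERPT = "Tel.:07612050\nFreiburgimBreisgau,19.11.2021\n"
V21_EXCERPT = (
    "Staatsanwaltschaft FreiburgBerliner Allee 1, 79114 Freiburg im Breisgau"
    "Tel.: 0761 20 50Freiburg im Breisgau, 19.11.2021\n"
)

# 2.0 symptom: words glued together.
print(glued_phrases(V20_EXCERPT, ["Freiburg im Breisgau"]))
# 2.1 symptom: lines glued together.
print(missing_line_breaks(V21_EXCERPT, ["Berliner Allee 1", "Tel.:"]))
```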

@MartinThoma
Member

Might be related to #591

@pubpub-zz
Collaborator

I've started looking at the file, and the PDF shows cases I would never have guessed. The Tm matrix shows an inversion, which means that the document is filled upside down.
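
For context: a PDF text matrix `Tm` is written as six numbers `[a b c d e f]` and maps a text-space point (x, y) to (a·x + c·y + e, b·x + d·y + f). With a negative `d` the vertical axis is flipped, so the y values an extractor sees may increase from one line to the next instead of decreasing, which can defeat newline detection that expects y to go down. A minimal sketch of that arithmetic follows. The helper names and the sample matrix are made up for illustration, this is not pypdf's implementation, and the current transformation matrix (ignored here) also affects the final position:

```python
def apply_text_matrix(tm, x, y):
    """Map a text-space point through a matrix given as (a, b, c, d, e, f)."""
    a, b, c, d, e, f = tm
    return (a * x + c * y + e, b * x + d * y + f)


def is_vertically_flipped(tm):
    """True when the matrix mirrors the y axis.

    Only meaningful when b and c are zero (no rotation or skew).
    """
    return tm[3] < 0


# Illustrative matrix: y is mirrored and shifted by 800 units.
flipped = (1, 0, 0, -1, 0, 800)

print(is_vertically_flipped(flipped))        # True
print(apply_text_matrix(flipped, 0, 10))     # (0, 790)
print(apply_text_matrix(flipped, 0, 20))     # (0, 780)
```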

Correction is under analysis...

@S1SYPHOS
Author

S1SYPHOS commented Jun 7, 2022

Glad it's an edge case, never would've guessed ;)

@MartinThoma added the workflow-text-extraction label Jun 7, 2022
@MartinThoma changed the title from "v2.1 output of extract_text() glued together" to "v2.1 extract_text() misses newline characters" Jun 11, 2022
@MartinThoma added the soon label (PRs that are almost ready to be merged, issues that get solved pretty soon) Jun 19, 2022
@MartinThoma
Member

I just confirmed that this is still an issue with the current master (soon PyPDF2==2.3.0) 😢

@MartinThoma removed the soon label Jun 19, 2022
@pubpub-zz
Collaborator

The fix has not been released yet; still working on it...

@pubpub-zz
Collaborator

Improved by PR #1084

MartinThoma pushed a commit that referenced this issue Jul 13, 2022
* ENH : extract width from CIDFontType0/2
* ENH  : improve cr/lf and space extraction
* BUG : fix error in decoding #1075
* FIX: in ToUnicode  ignore comments (starting with %)
* FIX: extend utf16 for min of 4 characters

Improves #234
Improves #957
Closes #1003
Closes #1019

Used https://tug.ctan.org/info/symbols/comprehensive/symbols-a4.pdf for testing
mtd91429 pushed a commit to mtd91429/PyPDF2 that referenced this issue Jul 15, 2022
@MartinThoma added the whitespace label Jan 14, 2023
@creepiepanda

Not sure if this is related. I was using 2.8.1 and everything worked perfectly but any version above (2.9.0 and higher) had the same issue for me. With 2.9.0 and higher the output for reader.pages[0].extract_text() came entirely without newlines.

@pubpub-zz
Collaborator

@creepiepanda Can you confirm this was detected with the same PDF? If so, can you provide it?

@creepiepanda

Yes, it's the same PDF every time. I switched pypdf versions multiple times while trying with the same file.
crazy_bean.pdf

@pubpub-zz
Collaborator

extract_text() now has a layout extraction_mode.
This resolves this very old issue.

@MartinThoma
Member

Just for reference

import pypdf
from pypdf import PdfReader

print(f"pypdf=={pypdf.__version__}")
print(PdfReader("test.pdf").pages[0].extract_text())

gives:

pypdf==4.1.0
Staatsanwaltschaft Freiburg
Berliner Allee 1, 79114 Freiburg im Breisgau
Tel.: 0761 20 50 Freiburg im Breisgau, 19.11.2021
Sitzungsplan der Staatsanwaltschaft
Zeitraum 29.11.2021 -03.12.2021
Eildienst 26.11.2021 -29.11.2021 Plattner, Adalbert , EOAA
Eildienst 29.11.2021 -03.12.2021 Häberling, Max , StA
Eildienst 03.12.2021 -06.12.2021 Dr. Eisner, Julien , StA
Tag
Gericht-SpruchkörperSaal,
GebäudeZeitF+Aktenzeichen Sitzungsvertreter
Anfahrt
Montag 29.11.2021
LG Freiburg im Breisgau , -
Strafkammer XVI-IV, 2. OG 09:00F+668 Js 13432/21 Lederer, Chloe , StA'in
Böhm , Lara, StA'in
AG Emmendingen , - Schöffengericht
-09:00 429 Js 16713/21 Pistorius, Adalbert , OAA
AG Freiburg im Breisgau , - Abt. 24 - EG 09:00 214 Js 38759/19 Herrmann(640), Lia, Ref'in
AG Freiburg im Breisgau , - Abt. 26 - IX 09:15 580 Js 39067/19 Mayer (650), Leo, Ref
AG Freiburg im Breisgau , - Abt. 32 - EG 09:00F216 Js 29564/19 Dr. Eisner, Julien , StA
AG Freiburg im Breisgau , - Abt. 34 - XI, 1. OG 13:00+277 Js 26886/19 Braunke, Kurt , StA
AG Freiburg im Breisgau , - Abt. 17 - 09:00 168 Js 24544/22 Bader, Alina-Karla , StA'in
10:30 537 Js 24193/22 Bader, Alina-Karla , StA'in
VIII, Holzmarkt 6 11:00 678 Js 23165/22 Bader, Alina-Karla , StA'in
11:45 116 Js 45639/19 Bader, Alina-Karla , StA'in
AG Freiburg im Breisgau , - Abt. 18 - VII 09:00 171 Js 33617/22
10:45 132 Js 38282/21
13:00 141 Js 45191/20
13:30 169 Js 50915/22
14:15 583 Js 18397/22
AG Freiburg im Breisgau , - Abt. 20 - III, EG 09:00 617 Js 44625/19 Raineke, Patrick , EStA
16:30+635 Js 46988/19 Hofmann, Krause, OStA
Raineke, Patrick , EStA
AG Freiburg im Breisgau , - Abt. 21 - IV, 1. OG 09:00+248 Js 53757/21 Häberling, Max , StA
AG Kenzingen , Strafabteilung 4, EG 09:00 637 Js 20701/21 Sägezahn, Ida , StA'in
10:00 168 Js 52660/20 Sägezahn, Ida , StA'in
11:30 187 Js 19607/20 Sägezahn, Ida , StA'in
AG Müllheim , - Strafabteilung - 09:00 345 Js 50760/22 Bauer (210), Joel, Ref
AG Staufen im Breisgau , -
Strafabteilung -09:00+474 Js 50679/19 Freygang, Ole , EStA
Dienstag 30.11.2021
LG Freiburg im Breisgau , -
Strafkammer II -IV 09:00F512 Js 40456/20 Brodesser, Boris , EStA
LG Freiburg im Breisgau , -
Strafkammer V -09:00 340 Js 22587/19 Luhmann, Jasmin, StA'in
LG Freiburg im Breisgau , -
Strafkammer XIV -09:00 289 Js 21296/22 Knorzig , Kathleen , StA'in
09:00 273 Js 55642/21 Knorzig , Kathleen , StA'in
Seite 1

and print(PdfReader("test.pdf").pages[0].extract_text(extraction_mode="layout"))

gives:

Staatsanwaltschaft Freiburg
Berliner Allee 1, 79114 Freiburg im Breisgau
Tel.: 0761 20 50                                                                                                                         Freiburg im Breisgau, 19.11.2021

Sitzungsplan der Staatsanwaltschaft
Zeitraum                    29.11.2021     - 03.12.2021


Eildienst                   26.11.2021     - 29.11.2021           Plattner, Adalbert , EOAA
Eildienst                   29.11.2021     - 03.12.2021           Häberling, Max , StA
Eildienst                   03.12.2021     - 06.12.2021           Dr. Eisner, Julien , StA


Tag                                          Saal,                     Zeit     F+  Aktenzeichen               Sitzungsvertreter
Gericht-Spruchkörper                         Gebäude                                                           Anfahrt

Montag                 29.11.2021
LG Freiburg im Breisgau , -                  IV, 2. OG                 09:00    F+  668 Js 13432/21            Lederer, Chloe , StA'in
Strafkammer XVI-                                                                                               Böhm , Lara, StA'in
AG Emmendingen , - Schöffengericht                                     09:00        429 Js 16713/21            Pistorius, Adalbert , OAA
-
AG Freiburg im Breisgau , - Abt. 24 -EG                                09:00        214 Js 38759/19            Herrmann(640), Lia, Ref'in
AG Freiburg im Breisgau , - Abt. 26 -IX                                09:15        580 Js 39067/19            Mayer (650), Leo, Ref
AG Freiburg im Breisgau , - Abt. 32 -EG                                09:00    F   216 Js 29564/19            Dr. Eisner, Julien , StA
AG Freiburg im Breisgau , - Abt. 34 -XI, 1. OG                         13:00     +  277 Js 26886/19            Braunke, Kurt , StA
AG Freiburg im Breisgau , - Abt. 17 -                                  09:00        168 Js 24544/22            Bader, Alina-Karla , StA'in
                                                                       10:30        537 Js 24193/22            Bader, Alina-Karla , StA'in
                                             VIII, Holzmarkt 6         11:00        678 Js 23165/22            Bader, Alina-Karla , StA'in
                                                                       11:45        116 Js 45639/19            Bader, Alina-Karla , StA'in
AG Freiburg im Breisgau , - Abt. 18 -VII                               09:00        171 Js 33617/22
                                                                       10:45        132 Js 38282/21
                                                                       13:00        141 Js 45191/20
                                                                       13:30        169 Js 50915/22
                                                                       14:15        583 Js 18397/22
AG Freiburg im Breisgau , - Abt. 20 -III, EG                           09:00        617 Js 44625/19            Raineke, Patrick , EStA
                                                                       16:30     +  635 Js 46988/19            Hofmann, Krause, OStA
                                                                                                               Raineke, Patrick , EStA
AG Freiburg im Breisgau , - Abt. 21 -IV, 1. OG                         09:00     +  248 Js 53757/21            Häberling, Max , StA
AG Kenzingen , Strafabteilung                4, EG                     09:00        637 Js 20701/21            Sägezahn, Ida , StA'in
                                                                       10:00        168 Js 52660/20            Sägezahn, Ida , StA'in
                                                                       11:30        187 Js 19607/20            Sägezahn, Ida , StA'in
AG Müllheim , - Strafabteilung -                                       09:00        345 Js 50760/22            Bauer (210), Joel, Ref
AG Staufen im Breisgau , -                                             09:00     +  474 Js 50679/19            Freygang, Ole , EStA
Strafabteilung -


Dienstag               30.11.2021
LG Freiburg im Breisgau , -                  IV                        09:00    F   512 Js 40456/20            Brodesser, Boris , EStA
Strafkammer II -
LG Freiburg im Breisgau , -                                            09:00        340 Js 22587/19            Luhmann, Jasmin, StA'in
Strafkammer V -
LG Freiburg im Breisgau , -                                            09:00        289 Js 21296/22            Knorzig , Kathleen , StA'in
Strafkammer XIV -
                                                                       09:00        273 Js 55642/21            Knorzig , Kathleen , StA'in




                                                                                                                                              Seite 1
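
For code that has to run against both older and newer releases, one option is to try the `extraction_mode` keyword shown above and fall back to a plain call when the installed version does not accept it. This is only a sketch: `extract_layout_or_plain` is a hypothetical helper, and catching `TypeError` assumes that an older `extract_text()` rejects the unknown keyword that way.

```python
def extract_layout_or_plain(page):
    """Prefer layout-mode extraction, falling back to the default mode."""
    try:
        return page.extract_text(extraction_mode="layout")
    except TypeError:
        # Assumption: releases without layout mode raise TypeError for the
        # unknown keyword. Note that this also hides a TypeError raised
        # inside a layout-capable extract_text().
        return page.extract_text()
```

It would be used as `extract_layout_or_plain(PdfReader("test.pdf").pages[0])`.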
