ENH: Allow text extraction to keep indentation #2054

Closed
MartinThoma opened this issue Aug 1, 2023 · 6 comments
Labels
is-feature (a feature request), workflow-text-extraction (from a user's perspective, text extraction is the affected feature/workflow)

Comments

@MartinThoma
Member

When we extract Python code from a PDF, the indentation is lost and the result is completely messed up. It would be nice to have an option that keeps the indentation, perhaps a flag for a layout mode?

Code Example: How the new feature could be used

from pypdf import PdfReader

# https://arxiv.org/pdf/1601.03642.pdf
reader = PdfReader("1601.03642.pdf")
print(reader.pages[6].extract_text(layout_mode=True))

should give:

 * Increment the size file of the new incorrect UI_FILTER group information
 * of the size generatively.
 */
static int indicate_policy(void)
{
    int error;
    if (fd == MARN_EPT) {
        /*
         * The kernel blank will coeld it to userspace.
         */
        if (ss->segment < mem_total)
            unblock_graph_and_set_blocked();
        else
            ret = 1;
        goto bail;
    }
    segaddr = in_SB(in.addr);
    selector = seg / 16;
    setup_works = true;
    for (i = 0; i < blocks; i++) {
        seq = buf[i++];
        bpf = bd->bd.next + i * search;
        if (fd) {
            current = blocked;
        }
    }
    rw->name = "Getjbbregs";
    bprm_self_clearl(&iv->version);
    regs->new = blocks[(BPF_STATS << info->historidac)] | PFMR_CLOBATHINC_SECONDS << 12;
    return segtable;
}


D. Linux Code, 2

/*
* Copyright (c) 2006-2010, Intel Mobile Communications. All rights reserved.
*
* This program is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License version 2 as published by
* the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
*
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software Foundation,
* Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
*/

#include <linux/kexec.h>
#include <linux/errno.h>
#include <linux/io.h>
#include <linux/platform_device.h>
#include <linux/multi.h>

Currently, we get:

*Increment the size file of the new incorrect UI_FILTER group information
*of the size generatively.
*/
static int indicate_policy(void)
{
int error;
if (fd == MARN_EPT) {
/*
*The kernel blank will coeld it to userspace.
*/
if (ss->segment < mem_total)
unblock_graph_and_set_blocked();
else
ret = 1;
goto bail;
}
segaddr = in_SB(in.addr);
selector = seg / 16;
setup_works = true;
for (i = 0; i < blocks; i++) {
seq = buf[i++];
bpf = bd->bd.next + i *search;
if (fd) {
current = blocked;
}
}
rw->name = "Getjbbregs";
bprm_self_clearl(&iv->version);
regs->new = blocks[(BPF_STATS << info->historidac)] | PFMR_CLOBATHINC_SECONDS << 12;
return segtable;
}
D. Linux Code, 2
/*
*Copyright (c) 2006-2010, Intel Mobile Communications. All rights reserved.
*
* This program is free software; you can redistribute it and/or modify it
*under the terms of the GNU General Public License version 2 as published by
*the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful,
*but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
*
*GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software Foundation,
*Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
*/
#include <linux/kexec.h>
#include <linux/errno.h>
#include <linux/io.h>
#include <linux/platform_device.h>
#include <linux/multi.h>
@MartinThoma self-assigned this Aug 1, 2023
@MartinThoma added the is-feature and workflow-text-extraction labels Aug 1, 2023
@MrAnayDongre

I'd be interested in contributing to this enhancement for PyPDF2, @MartinThoma.
Let me know how I can be of assistance.

@MartinThoma
Member Author

@MrAnayDongre PyPDF2 is deprecated. This is going into pypdf.

This is a very complex feature. I don't yet know myself what would be a good way to start on it.

If you want to start contributing to pypdf, I recommend having a look at the issues labeled "Easy" (good starting points for first-time contributors), then at those labeled "help wanted".

@pubpub-zz
Collaborator

extract_text now has a layout extraction_mode.
I am closing this old issue, since it is now covered.

@stefan6419846
Collaborator

@pubpub-zz The layout mode does not resolve this; the issue requires further work to convert horizontal positions into whitespace accordingly.

I have therefore re-opened this issue.
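
To make "convert horizontal positions into whitespace" concrete, here is a minimal sketch of the general idea. It is not pypdf's implementation: the function name `place_fragments`, the `(x, text)` fragment format, and the fixed `char_width` are all assumptions made for illustration. Each fragment's x coordinate is mapped to a character column, and the line is padded with spaces up to that column.

```python
def place_fragments(fragments, char_width):
    """Render (x, text) fragments of one line onto a fixed-width grid.

    Hypothetical helper for illustration only; a real extractor would
    derive x positions and character widths from the PDF content stream.
    """
    line = ""
    for x, text in sorted(fragments):
        column = round(x / char_width)  # target column for this fragment
        if column > len(line):
            # Pad with spaces so the fragment starts at its column.
            line += " " * (column - len(line))
        elif line and not line.endswith(" "):
            # Fragment would overlap earlier text: keep a single separator.
            line += " "
        line += text
    return line


# Three lines of a code listing, assuming 6 units per character.
rows = [
    [(0.0, "if (fd) {")],
    [(24.0, "current = blocked;")],
    [(0.0, "}")],
]
rendered = [place_fragments(row, char_width=6.0) for row in rows]
print("\n".join(rendered))
```

With proportional fonts a single `char_width` is only an approximation, which is one reason the real problem is harder than this sketch suggests.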

@stefan6419846 reopened this Apr 9, 2024
@pubpub-zz
Collaborator

@stefan6419846, this is the rendering:
print(rr.pages[6].extract_text(extraction_mode="layout"))
->

 * Increment  the  size  file  of  the  new  incorrect  UI_FILTER  group  information
 * of  the  size  generatively.
 */
static  int  indicate_policy(void)
{
   int  error;
   if  (fd  ==  MARN_EPT)  {
     /*
       * The  kernel  blank  will  coeld  it  to  userspace.
       */
     if  (ss->segment  <  mem_total)
        unblock_graph_and_set_blocked();
     else
        ret  =  1;
     goto  bail;
   }
   segaddr  =  in_SB(in.addr);
   selector  =  seg  /  16;
   setup_works  =  true;
   for  (i  =  0;  i  <  blocks;  i++)  {
     seq  =  buf[i++];
     bpf  =  bd->bd.next  +  i  * search;
     if  (fd)  {
        current  =  blocked;
     }
   }
   rw->name  =  "Getjbbregs";
   bprm_self_clearl(&iv->version);
   regs->new  =  blocks[(BPF_STATS  <<  info->historidac)]  |  PFMR_CLOBATHINC_SECONDS  <<  12;
   return  segtable;
}


D. Linux Code, 2

/*
 *   Copyright  (c)  2006-2010,  Intel  Mobile  Communications.   All  rights  reserved.
 *
 *     This  program  is  free  software;  you  can  redistribute  it  and/or  modify  it
 * under  the  terms  of  the  GNU  General  Public  License  version  2  as  published  by
 * the  Free  Software  Foundation.
 *
 *               This  program  is  distributed  in  the  hope  that  it  will  be  useful,
 * but  WITHOUT  ANY  WARRANTY;  without  even  the  implied  warranty  of
 *     MERCHANTABILITY  or  FITNESS  FOR  A  PARTICULAR  PURPOSE.   See  the
 *
 *   GNU  General  Public  License  for  more  details.
 *
 *     You  should  have  received  a  copy  of  the  GNU  General  Public  License
 *       along  with  this  program;  if  not,  write  to  the  Free  Software  Foundation,
 *   Inc.,  675  Mass  Ave,  Cambridge,  MA  02139,  USA.
 */

#include  <linux/kexec.h>
#include  <linux/errno.h>
#include  <linux/io.h>
#include  <linux/platform_device.h>
#include  <linux/multi.h>

Isn't this good?

@stefan6419846
Collaborator

Sorry, seems like my checkout was somehow broken. Still not optimal, but yes, then we can close this for now.
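
One visible difference between the layout-mode rendering above and the desired output is the doubled spaces between words. As a possible workaround, a post-processing pass could collapse runs of spaces inside each line while leaving the leading indentation untouched. The helper name `collapse_inner_spaces` is hypothetical and not part of pypdf; this sketch also collapses intentional alignment inside a line (for example in comments or string literals), so it only suits cases where that is acceptable.

```python
import re


def collapse_inner_spaces(text):
    """Collapse runs of spaces inside each line, keeping leading indentation.

    Hypothetical post-processing helper, not a pypdf API.
    """
    lines = []
    for line in text.splitlines():
        stripped = line.lstrip(" ")
        indent = line[: len(line) - len(stripped)]  # leading spaces, kept as-is
        body = re.sub(r" {2,}", " ", stripped).rstrip()
        lines.append(indent + body)
    return "\n".join(lines)


# A few lines copied from the layout-mode rendering in this thread.
sample = (
    "static  int  indicate_policy(void)\n"
    "{\n"
    "   int  error;\n"
    "   if  (fd  ==  MARN_EPT)  {"
)
cleaned = collapse_inner_spaces(sample)
print(cleaned)
```

In practice the input would come from `page.extract_text(extraction_mode="layout")`, as used earlier in the thread.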
