Add line_overlap and boxes_flow to LAParams
Arnie97 committed Dec 17, 2020
1 parent 7709e58 commit 0dee385
Showing 2 changed files with 9 additions and 3 deletions.
10 changes: 8 additions & 2 deletions camelot/utils.py
@@ -838,23 +838,27 @@ def compute_whitespace(d):

 def get_page_layout(
     filename,
+    line_overlap=0.5,
     char_margin=1.0,
     line_margin=0.5,
     word_margin=0.1,
+    boxes_flow=0.5,
     detect_vertical=True,
     all_texts=True,
 ):
     """Returns a PDFMiner LTPage object and page dimension of a single
-    page pdf. See https://euske.github.io/pdfminer/ to get definitions
-    of kwargs.
+    page pdf. To get the definitions of kwargs, see
+    https://pdfminersix.rtfd.io/en/latest/reference/composable.html.

     Parameters
     ----------
     filename : string
         Path to pdf file.
+    line_overlap : float
     char_margin : float
     line_margin : float
     word_margin : float
+    boxes_flow : float
     detect_vertical : bool
     all_texts : bool
@@ -872,9 +876,11 @@ def get_page_layout(
         if not document.is_extractable:
             raise PDFTextExtractionNotAllowed(f"Text extraction is not allowed: {filename}")
         laparams = LAParams(
+            line_overlap=line_overlap,
             char_margin=char_margin,
             line_margin=line_margin,
             word_margin=word_margin,
+            boxes_flow=boxes_flow,
             detect_vertical=detect_vertical,
             all_texts=all_texts,
         )
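
For readers who want to see how the two new arguments sit alongside the existing ones, here is a minimal sketch. The default values are copied from the `get_page_layout` signature in this diff. The `merge_layout_kwargs` helper is hypothetical and not part of Camelot; it only illustrates overlaying user-supplied values on those defaults and rejecting names the signature does not accept.

```python
# Defaults copied from the get_page_layout signature in this commit.
LAYOUT_DEFAULTS = {
    "line_overlap": 0.5,
    "char_margin": 1.0,
    "line_margin": 0.5,
    "word_margin": 0.1,
    "boxes_flow": 0.5,
    "detect_vertical": True,
    "all_texts": True,
}


def merge_layout_kwargs(layout_kwargs=None):
    """Hypothetical helper (not part of Camelot): overlay user-supplied
    layout kwargs on the defaults and reject unknown names."""
    layout_kwargs = dict(layout_kwargs or {})
    unknown = sorted(set(layout_kwargs) - set(LAYOUT_DEFAULTS))
    if unknown:
        raise TypeError(f"unexpected layout kwargs: {', '.join(unknown)}")
    # Later keys win, so user values override the defaults.
    return {**LAYOUT_DEFAULTS, **layout_kwargs}


# Override only the two parameters added by this commit.
merged = merge_layout_kwargs({"line_overlap": 0.3, "boxes_flow": 0.8})
print(merged)
```

Keeping the defaults in one mapping makes it easy to check that every name accepted by the signature is also forwarded to `LAParams`, which is the gap this commit closes for `line_overlap` and `boxes_flow`.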
2 changes: 1 addition & 1 deletion docs/user/advanced.rst
@@ -618,7 +618,7 @@ Tweak layout generation

 Camelot is built on top of PDFMiner's functionality of grouping characters on a page into words and sentences. In some cases (such as `#170 <https://github.com/camelot-dev/camelot/issues/170>`_ and `#215 <https://github.com/camelot-dev/camelot/issues/215>`_), PDFMiner can group characters that should belong to the same sentence into separate sentences.

-To deal with such cases, you can tweak PDFMiner's `LAParams kwargs <https://github.com/euske/pdfminer/blob/master/pdfminer/layout.py#L33>`_ to improve layout generation, by passing the keyword arguments as a dict using ``layout_kwargs`` in :meth:`read_pdf() <camelot.read_pdf>`. To know more about the parameters you can tweak, you can check out `PDFMiner docs <https://euske.github.io/pdfminer/>`_.
+To deal with such cases, you can tweak PDFMiner's `LAParams kwargs <https://github.com/euske/pdfminer/blob/master/pdfminer/layout.py#L33>`_ to improve layout generation, by passing the keyword arguments as a dict using ``layout_kwargs`` in :meth:`read_pdf() <camelot.read_pdf>`. To know more about the parameters you can tweak, you can check out `PDFMiner docs <https://pdfminersix.rtfd.io/en/latest/reference/composable.html>`_.

 ::

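The documentation paragraph in this hunk describes passing a dict through `layout_kwargs` in `read_pdf()`. A minimal usage sketch follows, assuming Camelot is installed at a version that includes this commit. The `read_tables` wrapper, the file name `example.pdf`, and the chosen values are placeholders for illustration, not recommendations.

```python
# Illustrative values only; tune them against your own PDF.
LAYOUT_KWARGS = {"line_overlap": 0.4, "boxes_flow": 0.7}


def read_tables(path, layout_kwargs=None):
    """Hypothetical wrapper: read tables from `path`, forwarding the
    layout kwargs to PDFMiner through Camelot."""
    # Imported lazily so this sketch can be loaded without Camelot installed.
    import camelot

    return camelot.read_pdf(path, layout_kwargs=layout_kwargs or LAYOUT_KWARGS)


# Usage (requires Camelot and a real PDF; "example.pdf" is a placeholder):
# tables = read_tables("example.pdf")
# print(tables[0].df)
```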
