Keep in-document hyperlinks after merged #22

Open
no1xsyzy opened this issue May 10, 2018 · 10 comments

Comments

@no1xsyzy

Background: I used pandoc + TeX Live for my thesis and pdfmerge for another submission (I study at a cooperative college).
After merging, the TOC links and reference links no longer work. They are still links, but clicking one does not navigate to its target.
I think this is a bug, because the links should have been preserved. They are hyperlinks, and their targets are well defined.

@metaist
Owner

metaist commented May 10, 2018

Hi @no1xsyzy! Thanks for using pdfmerge. I ran a few tests to try and figure out what's going on.

Inputs

  • cover.pdf (a 1-page pdf with the "cover")
  • body.pdf (a 3-page pdf with a TOC, and two pages that the TOC links to).

Test 1: pdfmerge cover.pdf body.pdf -o test-1.pdf

  • Works as expected; all the links still work.

Test 2: pdfmerge cover.pdf "body.pdf[1]" "body.pdf[3]" "body.pdf[2]" -o test-2.pdf

  • Links no longer work.

Test 3: pdfmerge cover.pdf "body.pdf[1]" "body.pdf[2]" "body.pdf[3]" -o test-3.pdf

  • Links still don't work.

Test 4: pdfmerge cover.pdf "body.pdf[1..2]" "body.pdf[3]" -o test-4.pdf

  • First link works, second one doesn't.

I'm not exactly sure what is happening, but it seems that if the page with a link and its target aren't written to the output stream at the same time, the link gets broken.

pdfmerge is built on PyPDF2, so I'm going to see if there's any information about how this works and if there's anything I can do to prevent that from happening.
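
One way to narrow this down is to list what each link on a page actually points at. In a PDF, a link is an annotation with `/Subtype /Link`; its target sits either in `/Dest` or in a `/GoTo` action under `/A`, and is either an explicit destination (an array that starts with a page reference) or a named destination (a string that is looked up elsewhere in the document). The sketch below is only an illustration, not part of pdfmerge: `resolve` and `list_link_targets` are hypothetical helper names, and it assumes the page objects behave like dictionaries, which is how pypdf/PyPDF2 page objects are expected to behave.

```python
def resolve(obj):
    """Follow an indirect reference when the object offers get_object()."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def list_link_targets(pages):
    """Return (page_number, kind, target) for every link annotation.

    `pages` is any sequence of dict-like page objects (assumption: pypdf
    page objects qualify). `kind` is "named", "explicit", or "other".
    """
    found = []
    for number, page in enumerate(pages, start=1):
        annots = resolve(page.get("/Annots")) or []
        for ref in annots:
            annot = resolve(ref)
            if annot.get("/Subtype") != "/Link":
                continue  # not a link annotation (e.g. a text note)
            dest = resolve(annot.get("/Dest"))
            action = resolve(annot.get("/A"))
            # A link may carry its target inside a GoTo action instead.
            if dest is None and action is not None and action.get("/S") == "/GoTo":
                dest = resolve(action.get("/D"))
            if dest is None:
                kind = "other"      # e.g. a URI action, not an internal link
            elif isinstance(dest, (str, bytes)):
                kind = "named"      # looked up by name elsewhere in the document
            else:
                kind = "explicit"   # array beginning with a page reference
            found.append((number, kind, dest))
    return found
```

Under that assumption, calling `list_link_targets(PdfReader("body.pdf").pages)` on an input and then on the merged output would show whether the targets are named or explicit, and whether they survive the merge.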

Is there any other information about what you were trying to do that I should know in diagnosing this error?

@no1xsyzy
Author

Actually what I did:
pdfmerge cover.pdf body.pdf[2..-1] -o test.pdf

Example files: (I tried to make TOC but failed to put that to the second page, so here's a citation hyperlink)
body.pdf
cover.pdf

@metaist
Owner

metaist commented May 11, 2018

This is very interesting because it disproved my hypothesis. I need to learn more about how links get put into the output stream, but for the record, this is where I'm adding pages to the output stream, which just calls addPage using PyPDF2.

Not sure at which point the links are getting dropped.

@exptom

exptom commented May 17, 2018

I would also be very interested in a fix for this. My use case is merging multiple complete pdf documents (not picking specific pages from any).
My first pdf always has a TOC on the 2nd page but I then merge any additional number of pdfs on the end (these are appendices) and the TOC from the first pdf is broken.

I get this warning when I do the merge (I'm not sure if it's relevant or not):

PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]

@metaist
Owner

metaist commented May 17, 2018

Hi @exptom! Thanks for reaching out about your situation and the warning you're getting. This might all be related, so we'll keep it in this issue for now.

I just did a test with adding a pdf to the end of a pdf that starts with a TOC and the links still work. What are you using to generate the separate pdfs?

@exptom

exptom commented May 17, 2018

@metaist thanks for getting back to me. The initial pdf that includes the TOC is created using wkhtmltopdf (https://github.com/wkhtmltopdf/wkhtmltopdf) and the additional pdfs that are merged as appendices can come from anywhere (users upload them).

@metaist
Owner

metaist commented May 17, 2018

Oh, so they literally start out as HTML links and are converted to PDF links. Interesting. Will begin my deep dive into how PDF links actually work and are encoded. This may require an upstream patch to PyPDF2 once I figure out how their stuff works.

I'm also looking at other places where people have issues with PDF links (e.g., combine_pdf) to see if I can learn anything from their general experience.

Unfortunately, I do not have an easy short-term fix, but will keep this issue open and post here as I learn new things.

@exptom

exptom commented May 17, 2018

They aren't actually HTML links. What happens is that wkhtmltopdf converts the HTML page to a PDF document and scans the HTML, pulling out all the heading tags (<h1>, <h2>, etc.) and using them to generate a TOC.
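
To make the idea concrete, here is a rough sketch of collecting headings from HTML into TOC entries. It is not how wkhtmltopdf is implemented; it only illustrates the "scan for heading tags" step using Python's standard `html.parser`. `HeadingCollector` and `build_toc` are made-up names for this example.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for <h1>..<h6> elements."""

    def __init__(self):
        super().__init__()
        self.headings = []   # finished (level, text) entries
        self._level = None   # level of the heading currently open, if any
        self._text = []      # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._text).strip()))
            self._level = None


def build_toc(html):
    """Return the headings of an HTML string in document order."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.headings
```

A real converter would additionally record where each heading lands in the PDF so that the TOC entry can link to it; that part is left out here.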

@metaist
Owner

metaist commented Jun 5, 2023

I just released pdfmerge 1.0.0, which uses the newer version of pypdf, and I went back to check if this issue still exists. Unfortunately, it does. Anybody have any ideas on how links in PDFs work?

@metaist
Owner

metaist commented Jul 16, 2024

It seems like pdftk can correctly merge documents. Perhaps I should make pdfmerge a wrapper around pdftk instead of pypdf.
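
As a rough idea of what such a wrapper could look like for the whole-document case, the sketch below builds a `pdftk in1.pdf in2.pdf cat output out.pdf` command and runs it with `subprocess`. The function names `build_pdftk_command` and `merge_with_pdftk` are hypothetical, and it assumes `pdftk` is installed and on the PATH. Page selections such as `body.pdf[2..-1]` are not handled here; they would need to be translated into pdftk's own handle and range syntax.

```python
import subprocess


def build_pdftk_command(inputs, output):
    """Build the argument list for concatenating whole PDFs with pdftk."""
    if not inputs:
        raise ValueError("at least one input PDF is required")
    # Passing a list (not a shell string) avoids quoting problems in file names.
    return ["pdftk", *inputs, "cat", "output", output]


def merge_with_pdftk(inputs, output):
    """Run pdftk; raises CalledProcessError if pdftk exits with an error."""
    subprocess.run(build_pdftk_command(inputs, output), check=True)
```

For example, `merge_with_pdftk(["cover.pdf", "body.pdf"], "test.pdf")` would correspond to Test 1 above.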
