Keep in-document hyperlinks after merged #22

no1xsyzy · 2018-05-10T08:13:48Z

Background: I used pandoc + TeX Live for my thesis and pdfmerge to prepare another submission (I study at a cooperative college).
After merging, the TOC links and reference links no longer work. They are still links, but clicking them does not navigate to the link target.
I think this is a bug because the links should have been preserved: they are hyperlinks, and their targets are well defined.

metaist · 2018-05-10T13:29:24Z

Hi @no1xsyzy! Thanks for using pdfmerge. I ran a few tests to try and figure out what's going on.

Inputs

cover.pdf (a 1-page pdf with the "cover")
body.pdf (a 3-page pdf with a TOC, and two pages that the TOC links to).

Test 1: pdfmerge cover.pdf body.pdf -o test-1.pdf

Works as expected; all the links still work.

Test 2: pdfmerge cover.pdf "body.pdf[1]" "body.pdf[3]" "body.pdf[2]" -o test-2.pdf

Links no longer work.

Test 3: pdfmerge cover.pdf "body.pdf[1]" "body.pdf[2]" "body.pdf[3]" -o test-3.pdf

Links still don't work.

Test 4: pdfmerge cover.pdf "body.pdf[1..2]" "body.pdf[3]" -o test-4.pdf

First link works, second one doesn't.

I'm not exactly sure what is happening, but it seems that if the page with a link and its target aren't written to the output stream at the same time, the link gets broken.

pdfmerge is built on PyPDF2, so I'm going to see if there's any information about how this works and if there's anything I can do to prevent that from happening.
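To make that hypothesis concrete, here is a toy model of it. It is only a sketch of the guess above, not of what PyPDF2 actually does internally: the function name `surviving_links`, the `TOC_LINKS` list, and the idea of a "batch" (one pdfmerge argument) are assumptions made up for illustration.

```python
# Toy model of the hypothesis: a link survives only if its source page and
# its target page are written to the output in the same batch.
# This does NOT call PyPDF2; all names here are hypothetical.

# (source_page, target_page) pairs: the TOC on page 1 of body.pdf links
# to pages 2 and 3.
TOC_LINKS = [(1, 2), (1, 3)]


def surviving_links(batches, links=TOC_LINKS):
    """Return the links whose source and target fall in the same batch."""
    kept = []
    for source, target in links:
        for batch in batches:
            if source in batch and target in batch:
                kept.append((source, target))
                break
    return kept


print(surviving_links([[1, 2, 3]]))      # Test 1: [(1, 2), (1, 3)]
print(surviving_links([[1], [3], [2]]))  # Test 2: []
print(surviving_links([[1], [2], [3]]))  # Test 3: []
print(surviving_links([[1, 2], [3]]))    # Test 4: [(1, 2)]
```

The model reproduces all four test results, which is why the hypothesis looked plausible at this point.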

Is there any other information about what you were trying to do that I should know in diagnosing this error?

no1xsyzy · 2018-05-10T14:37:15Z

Actually what I did:
pdfmerge cover.pdf body.pdf[2..-1] -o test.pdf

Example files: (I tried to make TOC but failed to put that to the second page, so here's a citation hyperlink)
body.pdf
cover.pdf

metaist · 2018-05-11T01:15:18Z

This is very interesting because it disproves my hypothesis. I need to learn more about how links get put into the output stream. For the record, this is where I'm adding pages to the output stream; it just calls addPage from PyPDF2.

Not sure at which point the links are getting dropped.

exptom · 2018-05-17T11:21:53Z

I would also be very interested in a fix for this. My use case is merging multiple complete PDF documents (not picking specific pages from any).
My first PDF always has a TOC on the 2nd page. I then merge any number of additional PDFs onto the end (these are appendices), and the TOC from the first PDF is broken.

I get this warning when I do the merge (I'm not sure if it's relevant or not):

PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]

metaist · 2018-05-17T12:31:16Z

Hi @exptom! Thanks for reaching out about your situation and the warning you're getting. This might all be related, so we'll keep it in this issue for now.

I just did a test with adding a pdf to the end of a pdf that starts with a TOC and the links still work. What are you using to generate the separate pdfs?

exptom · 2018-05-17T12:34:00Z

@metaist thanks for getting back to me. The initial PDF that includes the TOC is created using wkhtmltopdf (https://github.com/wkhtmltopdf/wkhtmltopdf), and the additional PDFs that are merged as appendices can come from anywhere (users upload them).

metaist · 2018-05-17T12:42:45Z

Oh, so they literally start out as HTML links and are converted to PDF links. Interesting. I will begin my deep dive into how PDF links actually work and are encoded. This may require an upstream patch to PyPDF2 once I figure out how their stuff works.

I'm also looking at other places where people have issues with PDF links (e.g., combine_pdf) to see if I can learn anything from their general experience.

Unfortunately, I do not have an easy short-term fix, but will keep this issue open and post here as I learn new things.

exptom · 2018-05-17T12:45:00Z

They aren't actually HTML links. What happens is that wkhtmltopdf converts the HTML page to a PDF document, scans the HTML for all the heading tags (<h1>, <h2>, etc.), and uses them to generate a TOC.

metaist · 2023-06-05T12:33:58Z

I just released pdfmerge 1.0.0 which uses the newer version of pypdf and I went back to check if this issue still exists. Unfortunately, it does. Anybody have any ideas on how links in PDF work?
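For anyone picking this up, here is a rough sketch of how in-document links are stored, as far as I understand the PDF format: a page's /Annots array holds annotations with /Subtype /Link, and an internal link carries a destination such as `[page_ref /XYZ left top zoom]`, where `page_ref` is an indirect reference to the target page object (links can also use a /A GoTo action or a named destination, which this sketch ignores). When pages are copied into a new file they get new object numbers, so the destination has to be rewritten. The code below models that with plain dicts and integers standing in for object references; `remap_links` and the sample data are hypothetical and are not pypdf's API.

```python
from copy import deepcopy

# Toy model only: integers stand in for indirect references to page objects.


def remap_links(annotations, old_to_new):
    """Rewrite /Dest page references using the old -> new object mapping.

    Returns (remapped, broken): annotations that were rewritten or needed no
    change, and link annotations whose target page is not in the output.
    """
    remapped, broken = [], []
    for annot in annotations:
        if annot.get("/Subtype") != "/Link" or "/Dest" not in annot:
            remapped.append(deepcopy(annot))  # not an explicit internal link
            continue
        old_page_ref, *view = annot["/Dest"]
        if old_page_ref in old_to_new:
            fixed = deepcopy(annot)
            fixed["/Dest"] = [old_to_new[old_page_ref], *view]
            remapped.append(fixed)
        else:
            broken.append(deepcopy(annot))  # target page was not copied
    return remapped, broken


toc_annotations = [
    {"/Subtype": "/Link", "/Rect": [72, 700, 300, 712], "/Dest": [12, "/XYZ", 72, 720, None]},
    {"/Subtype": "/Link", "/Rect": [72, 680, 300, 692], "/Dest": [15, "/Fit"]},
]

# Suppose only the page that used to be object 12 was copied, as object 5.
remapped, broken = remap_links(toc_annotations, {12: 5})
print(remapped[0]["/Dest"])
print(len(broken))
```

If a merge copies the annotation but leaves the old page reference in place, the link would still be clickable but point nowhere, which matches the symptom reported here. Whether that is what actually happens in pypdf is something I have not confirmed.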

metaist · 2024-07-16T15:03:26Z

It seems like pdftk can correctly merge documents. Perhaps I should make pdfmerge a wrapper around pdftk instead of pypdf.
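A minimal sketch of what such a wrapper might look like, assuming pdftk is installed and on the PATH. It relies on pdftk's `cat` operation with input handles (`A=file.pdf`) and page ranges such as `2-end`. The function names and the `(path, pages)` input shape are hypothetical, and translating pdfmerge's own range syntax (e.g. `[2..-1]`) into pdftk ranges is left out.

```python
import shutil
import subprocess
from string import ascii_uppercase


def build_pdftk_command(inputs, output):
    """Build a pdftk `cat` command.

    inputs: list of (path, pages) tuples, where pages is a pdftk page range
    such as "2-end", or None to take the whole file.
    """
    if len(inputs) > len(ascii_uppercase):
        raise ValueError("this sketch supports at most 26 inputs")
    handles, selections = [], []
    for handle, (path, pages) in zip(ascii_uppercase, inputs):
        handles.append(f"{handle}={path}")
        selections.append(f"{handle}{pages or ''}")
    return ["pdftk", *handles, "cat", *selections, "output", output]


def merge_with_pdftk(inputs, output):
    """Run pdftk; raises if pdftk is missing or exits with an error."""
    if shutil.which("pdftk") is None:
        raise RuntimeError("pdftk was not found on PATH")
    subprocess.run(build_pdftk_command(inputs, output), check=True)


# The merge reported earlier in this thread: cover.pdf plus body.pdf from page 2 on.
print(build_pdftk_command([("cover.pdf", None), ("body.pdf", "2-end")], "test.pdf"))
```

Building the argument list separately from running it keeps the command easy to inspect and test without pdftk installed. Whether links survive when only a page range is selected would still need to be checked against the example files above.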
