outline (other metadata?) not preserved using cat #31
Comments
Thank you for both of your very thorough issues! I wonder if we can somehow feed the outline back into the FileWriter... The workaround you mentioned with the bookmarks sounds a little tedious and not very satisfying to use.
@hellerbarde Thanks for taking a look, and happy to submit for a piece of software I think is just so great to have around (even more so when I can show Windows users at work what's possible :) ). One other issue with trying to pull together the indices: will you, the programmer, ever know which derivation of the original indices I want if I'm catting multiple files? Or say there was a bookmark on page 3 and someone cats 4-n; should you ditch the original bookmark since they extracted after it, or include it since page 4 is still part of that section? No easy answers, and thanks for taking a look!
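To make the "bookmark on page 3, cat 4-n" question concrete, here is a minimal sketch of the two possible policies. It deliberately avoids any PDF library: bookmarks are modelled as plain `(title, page)` tuples, and `remap_bookmarks` and its `keep_enclosing` flag are hypothetical names made up for this illustration, not anything stapler or PyPDF2 provides.

```python
from typing import List, Tuple

# Hypothetical model: (title, 0-based page index in the source file).
Bookmark = Tuple[str, int]


def remap_bookmarks(bookmarks: List[Bookmark], first: int, last: int,
                    keep_enclosing: bool) -> List[Bookmark]:
    """Remap bookmarks for an extraction of source pages first..last (inclusive)."""
    kept: List[Bookmark] = []
    enclosing = None  # title of the last bookmark that starts before the range
    for title, page in sorted(bookmarks, key=lambda b: b[1]):
        if page < first:
            enclosing = title
        elif page <= last:
            # Shift the target so it is relative to the extracted pages.
            kept.append((title, page - first))
    # Policy choice: optionally re-anchor the section that was already "open"
    # at the cut to the first extracted page, unless a bookmark sits there.
    if keep_enclosing and enclosing is not None and not any(p == 0 for _, p in kept):
        kept.insert(0, (enclosing, 0))
    return kept


# "Methods" starts on page 3 (index 2); we extract pages 4-8 (indices 3-7).
source = [("Intro", 0), ("Methods", 2), ("Results", 5)]
dropped = remap_bookmarks(source, 3, 7, keep_enclosing=False)
kept = remap_bookmarks(source, 3, 7, keep_enclosing=True)
```

With `keep_enclosing=False` only "Results" survives; with `keep_enclosing=True` "Methods" is moved to the first extracted page. Neither is obviously right, which is the point of the question above.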
I have looked into this now and it seems the underlying library doesn't offer much assistance here unfortunately. I'll have to see if I can do anything about it.
No worries at all, and you can close if you want. I'm guessing this is pretty fringe, as I would only have expected it to work on full docs. As mentioned in the above comment, if someone is merging a bunch of extractions from different files, I don't see a good way to guess what original bookmarks/sections should be included. Thanks for looking, and this is still an amazing tool :)
OK. I'll leave this open as a reminder to look into it again for merging complete documents. But I don't think I'll dive into the nuts and bolts of figuring out how PDF does TOCs... :) Thanks for the kind words and stuff. PS: I'm working on a GUI for concatenating files. Psst, mum's the word!
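For the "merging complete documents" case, the bookkeeping is simpler: each file's bookmarks only need to be shifted by the number of pages that come before it. A rough sketch under the same simplified model as a plain `(title, page)` tuple; `merge_outlines` is a hypothetical helper for illustration and does not touch any real PDF structures.

```python
from typing import List, Tuple

# Hypothetical model: (title, 0-based page index within its own file).
Bookmark = Tuple[str, int]


def merge_outlines(docs: List[Tuple[int, List[Bookmark]]]) -> List[Bookmark]:
    """Combine bookmarks of whole documents given as (page_count, bookmarks)."""
    merged: List[Bookmark] = []
    offset = 0  # pages contributed by the documents already processed
    for page_count, bookmarks in docs:
        for title, page in bookmarks:
            if not 0 <= page < page_count:
                raise ValueError(f"bookmark {title!r} points outside its document")
            merged.append((title, page + offset))
        offset += page_count
    return merged


# A 10-page file followed by a 3-page file.
combined = merge_outlines([
    (10, [("A", 0), ("B", 4)]),
    (3, [("C", 1)]),
])
```

Here "C" ends up on index 11, since ten pages precede its file. Nested outlines and partial page ranges are left out on purpose.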
Apologies for the off topic remark in advance but I figured this could be useful to some. For GUIs there's also a useful little program called pdfshuffler, but I'm not sure how it handles TOCs. I usually turn to pdftk cat for that kind of thing because it deals with it fairly well.
tl;dr this got long as I tried to investigate. The short summary is that the `cat` function appears to strip off the handy outline/bookmark index of the document. `PyPDF2` appears to support this, so this can serve as both 1) a notice that this is happening, if you weren't aware, and 2) a feature request.

In my opinion, if running `cat` just to spit out the file, `stapler` should maintain whatever features already existed. I could also see it being a handy option in general, but merging the TOC/index locations of multiple files might get messy?

I just submitted #30 and during my testing I used `stapler cat` in full, and noticed that the file size was different. Opening them both up, I noticed that in `evince`, at least, there was no outline/table of contents in the `stapler` generated version, but there was in the original. Here's both open (evince, arch linux) with the original on the left (Outline view in side pane) and the `stapler` generated on the right, showing that this view is not available. Also of note: the "meta title" is removed from the `stapler` version.

I thought perhaps `PyPDF` doesn't provide this, but in looking around it appears it might, or at least some ability:

- PdfFileReader
- PdfFileWriter
This seemed straightforward: just translate `getOutlines()` to `addBookmark`? Not as easy... I couldn't seem to find a way to get the page number from the result (though I'm an absolute `python` novice, so no surprise there). After some fiddling, I was able to use some example code to manually add a bookmark, and found that at least two answers tried to tackle the issue of converting the `getOutlines()` return location ID thingy into a page number. [1] [2]

Find attached:
- test-original.pdf: file generated using Org-mode/LaTeX that I knew would feature a TOC/outline
- test-stapler.pdf: file produced with `stapler cat test.pdf test-cat.pdf`
- test-pypdf2.pdf: file produced from the following code
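The script used for test-pypdf2.pdf is not reproduced in this thread, so the following is a separate, hedged sketch of the `getOutlines()` to `addBookmark` idea rather than that code. It assumes the outline is a nested list in which a sub-list holds the children of the entry just before it, and it takes the page lookup and the bookmark call as parameters so it can run without a PDF library. `copy_outline`, `title_of`, `page_of` and `add_bookmark` are hypothetical names for this illustration.

```python
from typing import Any, Callable, List, Optional


def copy_outline(entries: List[Any],
                 title_of: Callable[[Any], str],
                 page_of: Callable[[Any], int],
                 add_bookmark: Callable[[str, int, Optional[Any]], Any],
                 parent: Optional[Any] = None) -> None:
    """Walk a nested outline and re-create each entry via add_bookmark."""
    last = None  # handle returned for the most recent entry at this level
    for entry in entries:
        if isinstance(entry, list):
            # Assumption: a nested list holds the children of the previous entry.
            copy_outline(entry, title_of, page_of, add_bookmark,
                         parent=last if last is not None else parent)
        else:
            last = add_bookmark(title_of(entry), page_of(entry), parent)


# Stand-in data and recorder so the sketch runs without any PDF library.
recorded = []


def record(title: str, page: int, parent: Optional[Any]) -> str:
    recorded.append((title, page, parent))
    return title  # stand-in handle: reuse the title as the parent reference


outline = [
    {"title": "1 Intro", "page": 0},
    [{"title": "1.1 Scope", "page": 1}],
    {"title": "2 Usage", "page": 3},
]
copy_outline(outline, lambda e: e["title"], lambda e: e["page"], record)

# Untested assumption: with PyPDF2 1.x one would presumably pass
# reader.getOutlines() as entries, reader.getDestinationPageNumber as page_of,
# and a wrapper around writer.addBookmark(title, pagenum, parent) as add_bookmark.
```

Passing the lookups in as callables keeps the tree walk separate from the "location ID thingy into a page number" problem, which is the part the linked answers were trying to solve.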