Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse embedded metadata in PDF files #3108

Open
wants to merge 5 commits into
base: develop
Choose a base branch
from

Conversation

microtherion
Copy link

@microtherion microtherion commented Aug 12, 2024

Added

  • Added: Added the ability to read embedded metadata from PDFs that conform to Calibre's embedding.

Implements PDF metadata processing as discussed in #3103. Some of this metadata is generic, and available in many PDFs. Other fields (series and index) are not as standardized, and for these we're using the calibre version of the fields.

This is not only my first Kavita contribution, but also my first time programming C#, so I hope I managed to write reasonable code, and would appreciate feedback if I did not.

Copy link
Member

@majora2007 majora2007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preliminary look. I'm mainly concerned with the implementation to fetch the data from the pdf. I would like to investigate grabbing the string via docnet, which we already use for other things in Kavita.

API/Services/BookService.cs Outdated Show resolved Hide resolved
API/Services/BookService.cs Outdated Show resolved Hide resolved
API/Services/BookService.cs Outdated Show resolved Hide resolved
API/Services/BookService.cs Outdated Show resolved Hide resolved
API/Services/BookService.cs Outdated Show resolved Hide resolved
API/Services/BookService.cs Outdated Show resolved Hide resolved
@microtherion
Copy link
Author

Implemented review feedback. Was not sure about the placement of PdfMetadataExtractor. Is API.Helpers OK?

Copy link
Member

@majora2007 majora2007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall code looks good, minus a few style points. Once the code is in shape, I can spare some time to pull it down and ensure it doesn't break anything.

API.Tests/Services/BookServiceTests.cs Show resolved Hide resolved
API/Helpers/PdfMetadataExtractor.cs Outdated Show resolved Hide resolved
API/Helpers/PdfMetadataExtractor.cs Outdated Show resolved Hide resolved
API/Helpers/PdfMetadataExtractor.cs Show resolved Hide resolved
API/Helpers/PdfMetadataExtractor.cs Show resolved Hide resolved
API/Helpers/PdfMetadataExtractor.cs Outdated Show resolved Hide resolved
API/Helpers/PdfMetadataExtractor.cs Outdated Show resolved Hide resolved
API/Helpers/PdfMetadataExtractor.cs Outdated Show resolved Hide resolved
API/Helpers/PdfMetadataExtractor.cs Outdated Show resolved Hide resolved
API/Helpers/PdfMetadataExtractor.cs Outdated Show resolved Hide resolved
@majora2007
Copy link
Member

Another thing that would be stellar is if you could update the wiki along with this to write the mapping of fields like we do with the epub reader and perhaps any settings needed on our calibre guide for ensuring metadata is written in the pdf.

@microtherion
Copy link
Author

Addressed second round of comments, and added documentation in Kareadita/Wiki-Nextra#13

@majora2007 majora2007 added the enhancement New feature or request label Aug 14, 2024
@DieselTech
Copy link
Collaborator

Did some initial testing of the PR and came across 2 aspects that need to be looked into:

1 - File locking - Once the scanner hits the files it doesn't seem to let go of the file lock. I couldn't move files around or modify them on the host file system after kavita scanned them.

https://cdn.discordapp.com/attachments/1273694551739072532/1273694566226333736/image.png?ex=66bf8c00&is=66be3a80&hm=3ef14bddff709d1a7e9f748db3066d070e509ee8419c48240b2ffe25cc9e9a4a&

2 - Long scan times per file. Even on my fairly powerful desktop, scan times were taking 5-6 seconds per file scanned in. It took almost an hour to bring in the 749 files I have in a test library. This is only going to get worse on lower powered hardware. Especially if it has to load the entire file to search for metadata. People TTRPG collections are going to have 300MB+ files and hundreds of them, if not thousands. That can easily start to add up to weeks of scanning time.

@microtherion
Copy link
Author

@DieselTech do you see the same behavior re: (1) for epub and pdf?

For (2) were those 749 files pdf? If so, do you have comparison numbers for epub, so I know what I ought to aim for? It might inherently be a difficult problem, as there may not be a way around reading the whole file, but there may be an exploitable file structure in PDF that could be used.

@DieselTech
Copy link
Collaborator

2 was all PDF files. I didn't test any epub with this branch but I can just to get some comparison

@microtherion
Copy link
Author

It turns out that PDF files indeed have an index that allows much faster access to the metadata than the implementation here, though accessing it is a bit trickier.

I'm still working on the new approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Development

Successfully merging this pull request may close these issues.

3 participants