Description
Hello!
Thank you for providing this library; it has been incredibly helpful!
Problem
I need to merge many PDFs (potentially 100 or more).
Then, I need to reduce the file size by removing duplicate assets like fonts and images.
It also needs to handle optimising content inside embedded forms.
Ideally I would be able to generate a single PDF containing all of the documents at the beginning, but this is not possible in my scenario.
Solution
I have written some code (below) which achieves this.
It's not as efficient as generating a single PDF, and I was wondering if you have any recommendations on how I could improve it.
I have found many historical forum posts and Stack Overflow questions asking how to do the same, but I was unable to find any solutions or example code.
So this seems like something that would be useful for others.
Summary of how it works:
- Check each image/font on each page & embedded form
- Create a hash of the image/font data stream
- If we've already seen the hash before, re-use the existing PdfReference object for that image/font
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Advanced;
using PdfSharp.Pdf.IO;

public class PDFSharpOptimiseMergedPDFsTest
{
    public byte[] CombinePdfs(List<byte[]> pdfs)
    {
        // https://www.pdfsharp.net/wiki-1.5/ConcatenateDocuments-sample.ashx#Source_Code_2
        using var resultPDF = MergePDFsSimple(pdfs);

        // Remove duplicated images and fonts
        var pageNumber = 0;
        foreach (PdfPage page in resultPDF.Pages)
        {
            pageNumber++;
            Console.WriteLine($"Scanning page {pageNumber}");
            OptimiseResources(page);
            Console.WriteLine();
        }

        using var ms = new MemoryStream();
        resultPDF.Save(ms);
        return ms.ToArray();
    }

    private PdfDocument MergePDFsSimple(List<byte[]> pdfs)
    {
        var resultPDF = new PdfDocument();
        foreach (var pdf in pdfs)
        {
            using var src = new MemoryStream(pdf);
            using var srcPDF = PdfReader.Open(src, PdfDocumentOpenMode.Import);
            for (var i = 0; i < srcPDF.PageCount; i++)
            {
                resultPDF.AddPage(srcPDF.Pages[i]);
            }
        }
        return resultPDF;
    }

    private readonly Dictionary<string, PdfReference> ImagesCache = [];
    private readonly Dictionary<string, PdfReference> FontsCache = [];

    private void OptimiseResources(PdfDictionary pdfItem)
    {
        var resources = pdfItem.Elements.GetDictionary("/Resources");
        if (resources == null) return;
        var fonts = resources.Elements.GetDictionary("/Font") ?? new PdfDictionary();
        var xobjects = resources.Elements.GetDictionary("/XObject") ?? new PdfDictionary();

        // Fonts
        // Snapshot the keys so values can be replaced safely while iterating
        foreach (var itemKey in fonts.Elements.Keys.ToList())
        {
            if (fonts.Elements[itemKey] is not PdfReference reference) continue;
            if (reference.Value is not PdfDictionary dictionary) continue;
            var fontName = dictionary.Elements.GetName("/BaseFont");
            if (fontName == null) continue;
            var fontFile2 = dictionary.Elements
                .GetArray("/DescendantFonts")?.Elements
                .GetDictionary(0)?.Elements
                .GetDictionary("/FontDescriptor")?.Elements
                .GetDictionary("/FontFile2");
            // TODO: There is also a FontFile3 and other types
            // https://github.com/empira/PDFsharp/blob/master/src/foundation/src/PDFsharp/src/PdfSharp/Pdf.Advanced/PdfFontDescriptor.cs#L328
            if (fontFile2?.Stream == null) continue;

            // Hash the stream of the font data, to ensure the font is actually the same.
            // For some reason, fonts like ArialUnicodeMS are different on every page.
            // Maybe they are font subsets?
            var hash = ByteArrayToString(SHA512.HashData(fontFile2.Stream.Value));
            var useCache = FontsCache.TryGetValue(hash, out var cachedValue);
            Console.WriteLine($"Font {hash[..8]} - {fontName} {(useCache ? "(cached)" : "")}");
            if (!useCache)
            {
                // Add to cache
                FontsCache[hash] = reference;
                continue;
            }
            // Use cached value
            fonts.Elements[itemKey] = cachedValue;
        }
        Console.WriteLine();

        foreach (var itemKey in xobjects.Elements.Keys.ToList())
        {
            if (xobjects.Elements[itemKey] is not PdfReference reference) continue;
            if (reference.Value is not PdfDictionary dictionary) continue;

            // Images
            if (dictionary.Elements.GetString("/Subtype") == "/Image")
            {
                var imageLength = dictionary.Elements.GetInteger("/Length");
                if (imageLength == 0) continue;
                var hash = ByteArrayToString(SHA512.HashData(dictionary.Stream.Value));
                var useCache = ImagesCache.TryGetValue(hash, out var cachedValue);
                Console.WriteLine($"Image {hash[..8]} {(useCache ? "(cached)" : "")}");
                if (!useCache)
                {
                    // Add to cache
                    ImagesCache[hash] = reference;
                    continue;
                }
                // Use cached value
                xobjects.Elements[itemKey] = cachedValue;
            }
            // Embedded forms/labels
            else if (dictionary.Elements.GetString("/Subtype") == "/Form")
            {
                Console.WriteLine("Scanning embedded Form");
                OptimiseResources(dictionary);
            }
        }
    }

    // Consider using a custom IEqualityComparer<byte[]> instead of hashing every resource.
    // Equals must be 100% accurate, but GetHashCode may sometimes generate collisions for different items.
    private static string ByteArrayToString(byte[] array)
    {
        var output = new StringBuilder();
        for (int i = 0; i < array.Length; i++)
        {
            output.Append($"{array[i]:X2}");
            if ((i % 4) == 3) output.Append(' ');
        }
        return output.ToString();
    }
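
    // A minimal sketch of the IEqualityComparer<byte[]> idea mentioned above
    // (an assumption about how it could look, not PDFsharp API): the caches
    // could then be Dictionary<byte[], PdfReference> keyed by the raw stream
    // bytes, avoiding the SHA-512 + hex-string step entirely.
    private sealed class ByteArrayComparer : IEqualityComparer<byte[]>
    {
        // Equals must be exact, so compare the full contents
        public bool Equals(byte[]? x, byte[]? y) =>
            ReferenceEquals(x, y) || (x != null && y != null && x.AsSpan().SequenceEqual(y));

        // GetHashCode may collide for different arrays; mixing the length
        // with the first few bytes keeps it cheap
        public int GetHashCode(byte[] obj)
        {
            var hash = new HashCode();
            hash.Add(obj.Length);
            for (int i = 0; i < Math.Min(obj.Length, 16); i++)
                hash.Add(obj[i]);
            return hash.ToHashCode();
        }
    }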
}
Tests
In my test scenario, I have 3 PDFs, and some of the images and fonts appear in each PDF.
- File size if all 3 documents are generated as a single PDF
  - 81,711 bytes
- Total size of the individual files
  - 221,695 bytes (2.71x)
- Size when merged using PDFsharp
  - 221,225 bytes (2.71x)
- Size when merged using PDFsharp with the optimisation above
  - 123,345 bytes (1.51x)
Further problems
- Some fonts (e.g. ArialUnicodeMS) seem to be different in each PDF
  - This might be because of subsets?
  - Is there a way to combine these?
- Is there a simpler way to compare embedded fonts?
  - I can see there are multiple properties that can contain the font stream: FontFile, FontFile2, FontFile3
  - Should I just check all 3? (A sketch of this is below.)
- Are there other resources I should optimise?
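Regarding the FontFile question: the PDF specification defines /FontFile for Type 1 fonts, /FontFile2 for TrueType, and /FontFile3 for CFF/OpenType and other formats, so checking all three is probably necessary. Here is a minimal sketch of a helper (hypothetical, reusing the same PDFsharp calls as the code above) that returns whichever font program stream is present:

private static PdfDictionary? GetFontFile(PdfDictionary fontDescriptor)
{
    // /FontFile = Type 1, /FontFile2 = TrueType, /FontFile3 = CFF/OpenType and others
    foreach (var key in new[] { "/FontFile", "/FontFile2", "/FontFile3" })
    {
        var fontFile = fontDescriptor.Elements.GetDictionary(key);
        if (fontFile?.Stream != null) return fontFile;
    }
    return null;
}

The hashing code above could then call this on the /FontDescriptor dictionary instead of reading /FontFile2 directly.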