Proposal: Optimizing Filter List Processing via Deduplication and Persistent Binary Cache #7787
DavidOsipov started this conversation in Ideas
Hi Everyone,
Back in August 2020, issue #2041 ("Adguard Home has high cpu usage, high memory usage...") highlighted significant performance challenges when processing large numbers of filter lists, or very large individual lists, particularly on resource-constrained devices like Raspberry Pis. The key issues were high CPU and memory usage during list parsing and updates.
While AdGuard Home has likely seen many improvements since v0.103.3 (discussed in #2041), the fundamental approach of parsing numerous, potentially large, and often redundant text-based filter lists on each startup or update remains a potential bottleneck, especially as list sizes grow.
Problem Recap:
The current reliance on re-parsing plain-text lists on every startup or update leads to repeated CPU spikes, redundant processing of rules duplicated across lists, and unnecessary memory pressure on modest hardware.
Proposal:
I'd like to propose exploring an optimization strategy focused on deduplication and persistent binary caching of the processed filter rules, implemented entirely in pure Go (respecting the preference mentioned in #2041 to avoid CGO).
The core idea involves a workflow like this:

1. Download and parse all configured filter lists once.
2. Deduplicate rules in memory (e.g., using a `map[string]struct{}` as a set, or similar techniques).
3. Serialize the deduplicated rule set to a persistent binary cache on disk.
4. On subsequent startups, load the binary cache directly instead of re-parsing the text lists, rebuilding it only when the source lists change.

Potential Pure Go Implementation Options for Binary Cache:
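The deduplication idea (a `map[string]struct{}` used as a set) could be sketched roughly like this. This is a minimal illustration, not AdGuard Home's actual code: the function name, the comment-skipping logic, and the sample list contents are all my own assumptions.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// dedupeRules collects unique, non-empty rule lines from several
// filter lists, using a map[string]struct{} as a set to drop
// duplicates while preserving first-seen order.
func dedupeRules(lists ...string) []string {
	seen := make(map[string]struct{})
	var rules []string
	for _, list := range lists {
		sc := bufio.NewScanner(strings.NewReader(list))
		for sc.Scan() {
			line := strings.TrimSpace(sc.Text())
			if line == "" || strings.HasPrefix(line, "!") {
				continue // skip blank lines and Adblock-style comments
			}
			if _, ok := seen[line]; ok {
				continue // duplicate rule shared across lists
			}
			seen[line] = struct{}{}
			rules = append(rules, line)
		}
	}
	return rules
}

func main() {
	a := "! list A\n||ads.example^\n||tracker.example^\n"
	b := "! list B\n||ads.example^\n||other.example^\n"
	// Prints the three unique rules; ||ads.example^ appears once.
	fmt.Println(dedupeRules(a, b))
}
```

Because `struct{}` occupies zero bytes, the set costs only the map overhead plus the keys themselves, which matters when holding hundreds of thousands of rules on a Raspberry Pi.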
- `encoding/gob`: Go's native binary serialization. Simple and efficient for Go data structures.

Expected Benefits:

- Much faster startup and list updates, since the expensive text parsing happens only when lists actually change.
- Lower peak CPU and memory usage, which matters most on resource-constrained devices like Raspberry Pis.
- A smaller in-memory footprint, thanks to deduplication of rules shared across lists.
Discussion Points:

- Which pure Go serialization or storage option is the best fit (`gob`, Protobuf, BoltDB/BadgerDB, etc.)?

I believe implementing such a mechanism could significantly improve the user experience, especially for users managing extensive blocklists or running AGH on modest hardware.
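One open detail worth discussing (my own assumption, not spelled out above) is cache invalidation: deciding when the binary cache is stale. A simple approach is to store a fingerprint of the raw source lists next to the cache and compare it on startup:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// listsFingerprint hashes the raw contents of all source lists in
// order. If the fingerprint stored alongside the cache matches the
// current one, the cache can be reused; otherwise a full re-parse
// and cache rebuild is required.
func listsFingerprint(lists [][]byte) string {
	h := sha256.New()
	for _, l := range lists {
		h.Write(l)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	before := [][]byte{[]byte("||ads.example^\n")}
	after := [][]byte{[]byte("||ads.example^\n||new.example^\n")}
	// A single added rule changes the fingerprint, forcing a rebuild.
	fmt.Println(listsFingerprint(before) == listsFingerprint(after)) // prints "false"
}
```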
Looking forward to hearing your thoughts and feedback!