Biased comparison of Ruby HTML sanitization libraries

This is a feature comparison of three widely used Ruby HTML sanitization libraries. It's heavily biased because I'm the author of Sanitize, and I've chosen to list the features that are important to me.

If these features are also important to you, then you may find this comparison useful. If other features are important to you, or if you'd prefer to get this information from someone who isn't the author of one of the libraries in the comparison, then I recommend you do your own research.

Spot a mistake or some outdated info? Please file an issue and let me know!

Feature comparison

| Feature | Sanitize 4.0.0 | Loofah 2.0.1 | HTMLFilter 1.3.0 |
|---|:---:|:---:|:---:|
| Actually parses HTML (not with regexes) | ✓ | ✓ | |
| HTML5-compliant parser | ✓ | | |
| Fixes up badly broken/malicious markup | ✓ | ✓ | |
| Fully configurable whitelists | ✓ | | ✓ |
| Global attribute whitelist | ✓ | ✓ (hard-coded) | |
| Element-specific attribute whitelist | ✓ | | ✓ |
| Attribute-specific protocol whitelist | ✓ | | ✓ |
| Supports HTML5 data- attributes | ✓ | ✓ (hard-coded) | |
| Optionally escapes unsafe HTML instead of removing it | | ✓ | |
| Allows custom HTML manipulation (transformers) | ✓ | ✓ | |
| Built-in MathML support | | always enabled | |
| Built-in SVG support | | always enabled | |
| Basic CSS sanitization | ✓ | regex-based | regex-based |
| Advanced whitelist-based CSS sanitization | ✓ | | |

Notes

  • Sanitize and Loofah both use Nokogiri to manipulate parsed HTML, but Loofah uses Nokogiri's non-HTML5-compliant libxml2 parser. Sanitize uses the HTML5-compliant Google Gumbo parser, which uses the exact same parsing algorithm as modern browsers. HTMLFilter performs all its manipulation using regexes and does not actually parse HTML.

  • Both Sanitize and Loofah fix up badly broken, misnested, or maliciously crafted HTML and output syntactically valid markup. Sanitize's HTML5 parser handles a wider variety of edge cases than Loofah's libxml2 parser. HTMLFilter does basic tag balancing but not much more, and garbage in generally results in garbage out.

  • Loofah's whitelist configuration is hard-coded and can only be customized by either editing its source or monkeypatching. Sanitize and HTMLFilter both have easily customizable whitelist configurations.

  • Loofah has a single global whitelist for attributes, which it uses for all elements. HTMLFilter has per-element attribute whitelists, but provides no way to whitelist global attributes (i.e., attributes that should be allowed on any element, such as class). Sanitize supports both global and element-specific attribute whitelists.

  • Sanitize and Loofah both support HTML5 data attributes. In Sanitize, data attributes can be enabled or disabled in either the global or element-specific whitelists. Loofah always allows data attributes on all elements, and this is not configurable. HTMLFilter does not support data attributes.

  • Both Sanitize and Loofah allow you to write blocks or methods that can perform custom manipulation on HTML nodes as they're traversed. Sanitize calls them "transformers", whereas Loofah calls them "scrubbers". They're more or less equivalent in terms of functionality.

  • Loofah has hard-coded whitelists for sanitizing MathML and SVG, which cannot be disabled via configuration. Sanitize does not provide built-in configs for sanitizing MathML or SVG, but it would be fairly trivial to add MathML and SVG elements and attributes to a custom whitelist config.

  • Sanitize performs advanced whitelist-based CSS sanitization using Crass, a full-fledged CSS parser compliant with the CSS Syntax Module Level 3 parsing spec. Loofah and HTMLFilter both perform rudimentary regex-based CSS sanitization, but I wouldn't trust either of them to actually sanitize maliciously crafted CSS.
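To make the difference between global and element-specific attribute whitelists concrete, here is a minimal sketch in plain Ruby. The hash is shaped like the `:attributes` section of a Sanitize config (`:all` for global attributes, `:data` for data attributes), but the `attribute_allowed?` helper and the `ATTRIBUTES` constant are hypothetical names made up for illustration. This is not how any of the three libraries implements the check.

```ruby
# Hypothetical whitelist for illustration only.
#   :all    => attributes allowed on every element (the "global" whitelist)
#   'a' etc => attributes allowed only on that element
#   :data   => stands for "any data-* attribute"
ATTRIBUTES = {
  :all  => ['class', :data],
  'a'   => ['href', 'title'],
  'img' => ['src', 'alt']
}.freeze

# Returns true if attribute `name` is allowed on `element`, consulting the
# element-specific list and the global list together.
def attribute_allowed?(whitelist, element, name)
  allowed = Array(whitelist[element]) + Array(whitelist[:all])
  return true if allowed.include?(name)

  # A bare "data-" with nothing after the prefix is not a data attribute.
  allowed.include?(:data) && name.start_with?('data-') && name.length > 5
end

puts attribute_allowed?(ATTRIBUTES, 'a', 'href')     # element-specific
puts attribute_allowed?(ATTRIBUTES, 'p', 'href')     # not allowed on <p>
puts attribute_allowed?(ATTRIBUTES, 'p', 'class')    # global
puts attribute_allowed?(ATTRIBUTES, 'p', 'data-id')  # data attribute
puts attribute_allowed?(ATTRIBUTES, 'p', 'onclick')  # never listed
```

In terms of the notes above, a library with only a global whitelist has nothing but the `:all` entry, and a library with only per-element whitelists has no `:all` entry at all.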

Performance comparison

Based on the synthetic benchmark in benchmark/benchmark.rb. Smaller numbers are better (they indicate faster completion).

| Benchmark | Sanitize 4.0.0 | Loofah 2.0.1 | HTMLFilter 1.3.0 |
|---|---:|---:|---:|
| Small HTML fragment (757 bytes) x 1000 | 0.461s | 0.595s | 0.413s |
| Large HTML fragment (33,531 bytes) x 100 | 2.966s | 4.618s | 3.054s |
| Small HTML document (25,286 bytes) x 100 | 1.629s | 2.099s | ERROR |
| Medium HTML document (86,685 bytes) x 100 | 5.938s | 8.531s | ERROR |
| Huge HTML document (7,172,510 bytes) x 5 | 28.682s | 48.375s | 31.930s |

To run this benchmark yourself:

git clone https://github.com/rgrove/sanitize.git
cd sanitize
bundle install
gem install loofah hitimes htmlfilter
ruby benchmark/benchmark.rb

Notes

  • During the small and medium document benchmarks, HTMLFilter raised an Encoding::CompatibilityError exception. Both documents are UTF-8 encoded, but HTMLFilter's source files don't declare their own encoding, which is a bug.
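For readers who haven't run into this exception before, the sketch below shows one common way Ruby raises `Encoding::CompatibilityError`: combining two strings that both contain non-ASCII bytes but carry different encodings. It is only a general illustration of the error class, not a reproduction of where HTMLFilter trips. The `concat_ok?` helper and the sample strings are made up for this example.

```ruby
# A UTF-8 string with a non-ASCII character, like text from a UTF-8 document.
document = "<p>caf\u00e9</p>"

# A binary (ASCII-8BIT) string holding a high byte, standing in for text that
# was produced without the document's encoding.
fragment = "\xE9".b

# Returns true if the two strings can be concatenated, false if Ruby refuses
# because their encodings are incompatible.
def concat_ok?(a, b)
  a + b
  true
rescue Encoding::CompatibilityError
  false
end

puts concat_ok?(document, fragment)  # both sides non-ASCII, encodings differ
puts concat_ok?(document, "<br>".b)  # ASCII-only binary string mixes fine
```

The second call succeeds because Ruby treats ASCII-only strings as compatible with any ASCII-compatible encoding, which may be why only some input documents trigger an error like this.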

Raw benchmark results

Ruby version      : 2.2.2
Sanitize version  : 4.0.0
Loofah version    : 2.0.1
HTMLFilter version: 1.3.0

Nokogiri version: {"warnings"=>[], "nokogiri"=>"1.6.6.2", "ruby"=>{"version"=>"2.2.2", "platform"=>"x86_64-darwin14", "description"=>"ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-darwin14]", "engine"=>"ruby"}, "libxml"=>{"binding"=>"extension", "source"=>"packaged", "libxml2_path"=>"/Users/rgrove/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.6.2/ports/x86_64-apple-darwin14.3.0/libxml2/2.9.2", "libxslt_path"=>"/Users/rgrove/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.6.2/ports/x86_64-apple-darwin14.3.0/libxslt/1.1.28", "libxml2_patches"=>["0001-Revert-Missing-initialization-for-the-catalog-module.patch", "0002-Fix-missing-entities-after-CVE-2014-3660-fix.patch"], "libxslt_patches"=>["0001-Adding-doc-update-related-to-1.1.28.patch", "0002-Fix-a-couple-of-places-where-f-printf-parameters-wer.patch", "0003-Initialize-pseudo-random-number-generator-with-curre.patch", "0004-EXSLT-function-str-replace-is-broken-as-is.patch", "0006-Fix-str-padding-to-work-with-UTF-8-strings.patch", "0007-Separate-function-for-predicate-matching-in-patterns.patch", "0008-Fix-direct-pattern-matching.patch", "0009-Fix-certain-patterns-with-predicates.patch", "0010-Fix-handling-of-UTF-8-strings-in-EXSLT-crypto-module.patch", "0013-Memory-leak-in-xsltCompileIdKeyPattern-error-path.patch", "0014-Fix-for-bug-436589.patch", "0015-Fix-mkdir-for-mingw.patch"], "compiled"=>"2.9.2", "loaded"=>"2.9.2"}}

These values are time measurements. Lower is faster!

-- Benchmark --
  Small HTML fragment (757 bytes) x 1000
                                   total    single    rel
               Sanitize#fragment   0.461 (0.000461)     -
               Sanitize.fragment   1.221 (0.001221)  2.65x
   Loofah.scrub_fragment (strip)   0.595 (0.000595)  1.29x
               HTMLFilter#filter   0.413 (0.000413)  0.90x

  Large HTML fragment (33531 bytes) x 100
                                   total    single    rel
               Sanitize#fragment   2.966 (0.029664)     -
               Sanitize.fragment   3.006 (0.030057)  1.01x
   Loofah.scrub_fragment (strip)   4.618 (0.046183)  1.56x
               HTMLFilter#filter   3.054 (0.030538)  1.03x

  Small HTML document (25286 bytes) x 100
                                   total    single    rel
               Sanitize#document   1.629 (0.016286)     -
               Sanitize.document   1.683 (0.016829)  1.03x
   Loofah.scrub_document (strip)   2.099 (0.020991)  1.29x
        HTMLFilter#filter ERROR!

  Medium HTML document (86685 bytes) x 100
                                   total    single    rel
               Sanitize#document   5.938 (0.059375)     -
               Sanitize.document   6.010 (0.060097)  1.01x
   Loofah.scrub_document (strip)   8.531 (0.085315)  1.44x
        HTMLFilter#filter ERROR!

  Huge HTML document (7172510 bytes) x 5
                                   total    single    rel
               Sanitize#document  28.682 (5.736398)     -
               Sanitize.document  29.550 (5.910002)  1.03x
   Loofah.scrub_document (strip)  48.375 (9.674925)  1.69x
               HTMLFilter#filter  31.930 (6.386075)  1.11x
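The `total`, `single`, and `rel` columns above are the total time for all iterations, the time per iteration, and the total relative to the first row. Here is a rough sketch of how such a table could be computed with Ruby's standard `Benchmark` module. The real benchmark script uses the hitimes gem and its own harness, so the `compare` method and the stand-in workloads below are assumptions for illustration, not the script's code.

```ruby
require 'benchmark'

# Times `iterations` calls of each labeled callable. The first entry is the
# baseline; every other entry's `rel` is its total divided by the baseline's.
def compare(iterations, candidates)
  baseline_total = nil

  candidates.each_with_index.map do |(label, callable), index|
    total = Benchmark.realtime { iterations.times { callable.call } }
    baseline_total ||= total

    {
      label:  label,
      total:  total,
      single: total / iterations,
      # The baseline row has no ratio, shown as "-" in the tables above.
      rel:    index.zero? ? nil : total / baseline_total
    }
  end
end

# Stand-in workloads. A real run would call the sanitizers here instead.
html = '<b>hello</b> <i>world</i> ' * 200

results = compare(100, {
  'gsub with regex'  => -> { html.gsub(/<[^>]*>/, '') },
  'scan then join'   => -> { html.scan(/[^<>]+(?=<|\z)/).join }
})

results.each do |row|
  rel = row[:rel] ? format('%.2fx', row[:rel]) : '-'
  puts format('%32s  %7.3f (%.6f)  %6s', row[:label], row[:total], row[:single], rel)
end
```

A `rel` value above 1 means the entry was slower than the baseline and a value below 1 means it was faster, which is how the `0.90x` for HTMLFilter in the small fragment run should be read.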