Description
Right now, if you enter a facet suggest/search/filter query in a different Unicode normalization form than the facet value is stored in the index, you may get a false-negative 0 results, even though your entry was visually indistinguishable from a query that would have gotten results.
This is, ironically, easy to see in the Blacklight test demo app, which winds up adding MARC records to the index with at least one Unicode term in NFD (Normalization Form Decomposed), while in most software actual keyboard entry is in NFC (Normalization Form Composed). For the "era" term "koryŏ", you can demonstrate this in a console (copy/paste the term) by comparing `"koryŏ".unicode_normalize(:nfc)` vs `"koryŏ".unicode_normalize(:nfd)`.
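To make the mismatch concrete, here is a small console demonstration (a sketch; the codepoint counts follow from how the breve is encoded in each form):

```ruby
# The two normalization forms render identically but are different byte sequences.
nfc = "koryŏ".unicode_normalize(:nfc) # "o with breve" as a single codepoint, U+014F
nfd = "koryŏ".unicode_normalize(:nfd) # "o" followed by combining breve, U+0306

nfc == nfd  # => false: byte-for-byte comparison fails
nfc.length  # => 5 codepoints
nfd.length  # => 6 codepoints

# Normalizing both sides to the same form makes them compare equal again:
nfc.unicode_normalize(:nfd) == nfd  # => true
```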
some background
- Solr facets are normally (and in our examples) stored in Solr `string` types, which receive no analysis, and that includes no Unicode normalization. Whatever bytes the indexing process supplies are what goes into the index, so any Unicode normalization on indexing would have to be provided by the client, not by Solr config.
- The Solr `facet.contains` query we are using does byte-by-byte matching (modulo the `ignoreCase` param, which oddly does not help here). So the `facet.contains` query must match the Unicode normalization form in the Solr index. So I guess ideally your indexer would be supplying some normalization (many may not be by default), and then we'd normalize the query to match.
- The `nfc` and `nfd` normalized forms are visually identical (if the rendering software has no bugs!) but have different bytes, making this especially confusing. The "compatibility" normalizations `nfkc` and `nfkd` introduce lossy visual differences, for instance "dumbing down" `①` to `1`, and then apply the composed/decomposed normalization.
- I think `nfc` is often considered the most common default form (certain XML specs recommend but do not require NFC), but our fixture data used in CI happens to be in NFD for some reason (because that's often how it is in MARC for legacy reasons, maybe?).
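The background above can be simulated in plain Ruby. This is a hypothetical illustration of what `facet.contains` effectively does (a byte-level substring check, case-folded but not normalization-folded), not Solr's actual code:

```ruby
# Facet value as an NFD-producing indexer would store it:
indexed_value = "Koryŏ period".unicode_normalize(:nfd)
# Query as typically produced by keyboard entry:
user_query = "koryŏ".unicode_normalize(:nfc)

# Case-insensitive matching alone (analogous to ignoreCase) does not help,
# because the underlying bytes still differ:
indexed_value.downcase.include?(user_query.downcase)  # => false

# Normalizing the query to the index's form makes the match succeed:
indexed_value.downcase.include?(user_query.unicode_normalize(:nfd).downcase)  # => true
```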
questions
- Should we do anything, or just leave it alone for application devs to deal with in custom SearchBuilder logic if desired?
- If we are to do something, should we add configuration for facet suggest normalization, or just hard-code it into SearchBuilder logic, which application devs can override in the normal ways of overriding SearchBuilder logic?
- Either way, what should the default be?
  - I'd suggest definitely NFC or NFD, and not the lossy "downsampling" of one of the compatibility encodings, which I think will be unexpected. But @thatbudakguy in Slack preferred NFKC, as it's what some Solr analysis does (not used for `string` fields, but by analogy).
  - I'd normally suggest NFC because it seems the most common/suggested -- except that our fixture data uses NFD, what? We could change our fixture data... but I don't know if that means many app devs have data in NFD already too! (Which means we probably do need normalization, as NFD data is particularly likely to conflict with keyboard-entered input!)
  - I guess this discord and confusion does mean we should maybe make a config to easily change it, rather than requiring SearchBuilder customization?
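To illustrate why the compatibility forms are "lossy" while NFC/NFD are not, a quick console check (examples chosen for illustration):

```ruby
# NFKC/NFKD collapse visually distinct characters into plain equivalents:
"①".unicode_normalize(:nfkc)   # => "1" (circled digit flattened)
"ﬁle".unicode_normalize(:nfkc) # => "file" (fi ligature expanded)

# NFC leaves those characters alone; it only re-composes combining sequences:
"①".unicode_normalize(:nfc)    # => "①" (unchanged)
```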
what i guess i'd do
This is pretty easy to implement, and it seems clear to me that the current behavior is going to lead to things considered bugs... But confusion/discord over what it should do almost makes me throw my hands up and just implement a local fix. The desire to help upstream has led to this ticket instead. :)
I guess if I were just going to do something, I would make it configurable with a `blacklight_config.index.facet_suggest_query_normalize` config, which is perhaps a proc allowing ultimate flexibility in normalization. The default would be either `nil` (same as at present; I do think it's a gotcha, but at least an easily fixable one) or `{ |q| q.unicode_normalize(:nfc) }`, and we'd also normalize the fixture data so tests still pass.
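A minimal sketch of how that proposal could work. Only the config name `facet_suggest_query_normalize` comes from this ticket; the method name and hook point below are hypothetical, standing in for wherever the SearchBuilder assembles `facet.contains`:

```ruby
# Hypothetical helper: apply the configured normalizer (if any) to the
# facet suggest query before it is sent to Solr as facet.contains.
def normalize_facet_suggest(query, normalizer)
  # A nil normalizer preserves current behavior: pass the query through untouched.
  normalizer ? normalizer.call(query) : query
end

# e.g. configured as: config.index.facet_suggest_query_normalize = ->(q) { q.unicode_normalize(:nfc) }
nfc_normalizer = ->(q) { q.unicode_normalize(:nfc) }

normalize_facet_suggest("koryŏ", nfc_normalizer) # query now matches NFC-indexed facet values
normalize_facet_suggest("koryŏ", nil)            # unchanged, as today
```

A proc (rather than a symbol naming a form) keeps the door open for apps that need something beyond `unicode_normalize`, e.g. stripping diacritics entirely.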
Feedback welcome; if there's zero or no consensus I guess I'll just fix it locally and leave this here. Please and thank you!