Add unicode normalization to facet filter/suggest/search

Right now, if you enter a facet suggest/search/filter query in a different unicode normalization form than the facet is stored in the index, you may get a false negative 0 results, even though your entry was visually indistinguishable from a query that would have gotten results. 

This is ironically easily seen in the blacklight test demo app, as it winds up adding MARC records to index with at least one unicode term in unicode `nfd` (normalization form decomposed), but in most software actual keyboard entry is in `nfc` (normalization form composed), so for "era" term "koryŏ", you can demonstrate with (do in console and then copy paste) `"koryŏ".unicode_normalize(:nfc)` vs `"koryŏ".unicode_normalize(:nfd)`

### some background

* solr facets are normally (and in our examples) stored in solr `string` types which receive _no_ analysis, and that includes no Unicode normalization. What bytes the indexing process supplies are what goes into index, so any unicode normalization on indexing would have to be provided by client. (not in solr config). 

* the solr `facet.contains` query we are using does byte by byte matching (modulo the ignoreCase param, which does not help here, oddly). So `facet.contains` query must match unicode normalization form in solr index.  so I guess ideally your indexer would be supplying some normalization (many may not be by default), and then we'd normalize query to match. 
 
* The `nkc` and `nfd` normalized forms are *visually identical* (if renderering software has no bugs!), with different bytes, making this especially confusing.  The "compatibility" normalizations at `nfkc` and `nfkd` introduce lossy visual differences, for instance "dumbing down" `①` to `1`, and _then_ applying the decomposed/composed normalization.  

* I think `nfc` is often considered the most common default encoding (certain [XML specs recommend but do not require NFC](https://www.w3.org/TR/charmod-norm/)), but [our fixture data used in CI happens to be in nfd](https://github.com/projectblacklight/blacklight/blob/main/spec/fixtures/sample_solr_documents.yml) for some reason (becuase that's often how it is in MARC for legacy reasons maybe?)


### questions

* Should we do anything, or just leave it alone for application devs to deal with in custom SearchBuilder logic if desired?

* If we are to do something, should we add configuration for facet suggest normalization, or just hard-code it into SearchBuilder logic, which application devs can override using the normal ways of overriding SearchBuilder logic

* Either way, what should the default be?
  * i'd suggest definitely nfc or nfd and not the lossy "downsampling" of one of the compatibiltiy encodings, which I think will be unexpected. But @thatbudakguy in slack preferred NFKC as it's what some Solr analysis does (not used for `string` fields, but by analogy)
  * i'd normally suggest `nfc` because it seems the most common/suggested -- except that our fixture data uses nfd, what?  We could change our fixture data... but I don't know if that means many app devs have data in nfd already too! (which means we prob do need normalization, as nfd data is particularly likely to cause a conflict with keyboard-entered input!)
  * I guess this discord and confusion does mean we should maybe make a config to easily change it, not require searchbuilder customization?

### what i guess i'd do

This is pretty easy to implement. And it seems clear to me that the current behavior is going ot lead to things considered bugs... But confusion/discord over what it should do almost make me throw my hands up and just implement a local fix... but desire to help upstream has led ot this ticket instead. :)

I guess if I was just going to do something, I _would_ make it configurable with a `blacklight_config.index.facet_suggest_query_normalize` config, which is perhaps a *proc* allowing for ultimate flexibilty in normalization. And the default would either be `nil` (same as present, I do think it's a gotcha, but at least an easily fixable one) or default `{ |q| q.unicode_normalize(:nfc) }`, and also normalize fixture data so tests still pass? 

Feedback welcome, if zero or no consensus I guess I'll just fix it locally and leave this here, please and thank you! 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add unicode normalization to facet filter/suggest/search #3770

some background

questions

what i guess i'd do

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Add unicode normalization to facet filter/suggest/search #3770

Description

some background

questions

what i guess i'd do

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions