Skip to content

Add unicode normalization to facet filter/suggest/search #3770

@jrochkind

Description

@jrochkind

Right now, if you enter a facet suggest/search/filter query in a different unicode normalization form than the facet is stored in the index, you may get a false negative 0 results, even though your entry was visually indistinguishable from a query that would have gotten results.

This is ironically easily seen in the blacklight test demo app, as it winds up adding MARC records to index with at least one unicode term in unicode nfd (normalization form decomposed), but in most software actual keyboard entry is in nfc (normalization form composed), so for "era" term "koryŏ", you can demonstrate with (do in console and then copy paste) "koryŏ".unicode_normalize(:nfc) vs "koryŏ".unicode_normalize(:nfd)

some background

  • solr facets are normally (and in our examples) stored in solr string types which receive no analysis, and that includes no Unicode normalization. What bytes the indexing process supplies are what goes into index, so any unicode normalization on indexing would have to be provided by client. (not in solr config).

  • the solr facet.contains query we are using does byte by byte matching (modulo the ignoreCase param, which does not help here, oddly). So facet.contains query must match unicode normalization form in solr index. so I guess ideally your indexer would be supplying some normalization (many may not be by default), and then we'd normalize query to match.

  • The nkc and nfd normalized forms are visually identical (if renderering software has no bugs!), with different bytes, making this especially confusing. The "compatibility" normalizations at nfkc and nfkd introduce lossy visual differences, for instance "dumbing down" to 1, and then applying the decomposed/composed normalization.

  • I think nfc is often considered the most common default encoding (certain XML specs recommend but do not require NFC), but our fixture data used in CI happens to be in nfd for some reason (becuase that's often how it is in MARC for legacy reasons maybe?)

questions

  • Should we do anything, or just leave it alone for application devs to deal with in custom SearchBuilder logic if desired?

  • If we are to do something, should we add configuration for facet suggest normalization, or just hard-code it into SearchBuilder logic, which application devs can override using the normal ways of overriding SearchBuilder logic

  • Either way, what should the default be?

    • i'd suggest definitely nfc or nfd and not the lossy "downsampling" of one of the compatibiltiy encodings, which I think will be unexpected. But @thatbudakguy in slack preferred NFKC as it's what some Solr analysis does (not used for string fields, but by analogy)
    • i'd normally suggest nfc because it seems the most common/suggested -- except that our fixture data uses nfd, what? We could change our fixture data... but I don't know if that means many app devs have data in nfd already too! (which means we prob do need normalization, as nfd data is particularly likely to cause a conflict with keyboard-entered input!)
    • I guess this discord and confusion does mean we should maybe make a config to easily change it, not require searchbuilder customization?

what i guess i'd do

This is pretty easy to implement. And it seems clear to me that the current behavior is going ot lead to things considered bugs... But confusion/discord over what it should do almost make me throw my hands up and just implement a local fix... but desire to help upstream has led ot this ticket instead. :)

I guess if I was just going to do something, I would make it configurable with a blacklight_config.index.facet_suggest_query_normalize config, which is perhaps a proc allowing for ultimate flexibilty in normalization. And the default would either be nil (same as present, I do think it's a gotcha, but at least an easily fixable one) or default { |q| q.unicode_normalize(:nfc) }, and also normalize fixture data so tests still pass?

Feedback welcome, if zero or no consensus I guess I'll just fix it locally and leave this here, please and thank you!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions