-
In my document, I wrap some technical terms in a command. Right now this command just typesets them as-is, but later I may decide to do something more interesting with them, maybe:
The correct approach would be to stem these words before adding them to a database; after all, plural and singular versions can be considered identical for my purposes. A manual solution would be to add a […]
-
Glossaries, or rather indexes, have been discussed in #1339. That discussion covers part of what you mention (distinguishing the sorting/collecting key from the actual input).
Automated language-specific stemming is inherently hard: some languages don't have just singular and plural but also, e.g., dual and paucal, and don't forget languages with genders or similar distinctions. I am not even sure that this is a reasonable task for SILE (core).
But for my own long-overdue indexing needs, yes, something could be improved, whether in SILE core or as a 3rd-party package. The current indexer is very, very light in what it does (EDIT: not to say pretty broken ;)) and ought to be replaced. One problem I have considered (but not yet addressed further than a mere thought) is that a command option (your "key" here) is likely not sufficient, as you might want some formatting to occur non-systematically, and we cannot have this via command parameters (which are on par with XML attributes). E.g. I could want to index "John Doe", "M. Doe", etc. in running text under "Doe, John" in the index, but with the family name in small caps in the index ;-). As noted in #1339, (La)TeX does allow this kind of thing, but with complex strategies (multiple arguments, specific separators) that seem contrived to me, as they don't work well with the SIL language, and even less with XML input files.
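To make the key-versus-input distinction concrete, here is a minimal sketch in Python (not SILE code, and not how the current indexer works). The `build_index` function and the `(key, surface form, page)` tuples are hypothetical names invented for this illustration. It only shows the collecting side; the formatting problem (small caps in the index entry) is exactly what it does not solve.

```python
from collections import defaultdict

def build_index(occurrences):
    """Group (key, surface form, page) occurrences under their collecting key.

    Hypothetical helper for illustration only; not part of SILE.
    """
    index = defaultdict(lambda: {"forms": set(), "pages": set()})
    for key, surface, page in occurrences:
        entry = index[key]
        entry["forms"].add(surface)   # what was actually typeset in the text
        entry["pages"].add(page)      # where it occurred
    # Sort entries by key (case-insensitively) and pages numerically.
    return [
        (key, sorted(entry["forms"]), sorted(entry["pages"]))
        for key, entry in sorted(index.items(), key=lambda kv: kv[0].casefold())
    ]

# Different surface forms in the running text share one collecting key.
occurrences = [
    ("Doe, John", "John Doe", 3),
    ("Doe, John", "M. Doe", 7),
    ("Smith, Anna", "Anna Smith", 5),
    ("Doe, John", "John Doe", 12),
]
index = build_index(occurrences)
for key, forms, pages in index:
    print(f"{key}: {', '.join(map(str, pages))}")
```

A plain string key like this is enough for collecting and sorting, but it cannot carry markup, which is why a command option alone falls short for formatted entries.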
-
You don't necessarily have to add a key like that; you can just define […]
-
No one answered that part yet: (EDIT, oh, I had actually, somewhat. I forgot it!)
In brief, no, SILE doesn't have any support for automated stemming. But you may understand that stemming is not that obvious a topic; it is far more complex than hyphenation patterns. There are several general algorithm- or grammar-based stemmers available, with support for a varying number of languages (e.g. Snowball supports English and some Romance and Germanic languages), but with varying results: they tend either to under-stem (and fail to recognize two inflections of the same word) or to over-stem (and conflate different words as one). One might invoke such stemmers in one's workflow, either before invoking SILE or from within a dedicated package. (Some of these stemmers have C APIs, so it would even be possible to work on a Lua wrapper. Later, maybe Rust will be an option too.) Then there are fuller dictionary-based stemmers, usually part of big NLP (natural language processing) software, often paired with a PoS tagger (identifying parts of speech) and often coming with huge databases... And then there is some even more advanced AI-based NLP software, which can't run on everyone's laptop :) I'm not sure what you expected from this question and how it could be marked as Answered?
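To illustrate under- and over-stemming, here is a toy sketch in Python of the kind of preprocessing one could run before invoking SILE. It is a deliberately naive suffix stripper written for this example, not Snowball or any real stemmer; `naive_stem` and `collect_terms` are hypothetical names.

```python
from collections import defaultdict

def naive_stem(word):
    """Strip a few English plural suffixes. Toy rules, for illustration only."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"            # glossaries -> glossary
    if w.endswith("es") and len(w) > 3 and w[-3] in "sxz":
        return w[:-2]                  # boxes -> box, classes -> class
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]                  # glyphs -> glyph
    return w

def collect_terms(terms):
    """Group the terms found in a document under their (naive) stem."""
    groups = defaultdict(list)
    for term in terms:
        groups[naive_stem(term)].append(term)
    return dict(groups)

terms = ["glyph", "glyphs", "index", "indices", "new", "news"]
groups = collect_terms(terms)
for stem, members in groups.items():
    print(stem, members)
```

With these rules, "glyph" and "glyphs" are merged as intended, but "index" and "indices" end up under different stems (under-stemming), while "new" and "news" are conflated (over-stemming). Real stemmers have far better rules, yet they face the same trade-off.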
-
@raphCode Is there anything else you'd expect before this question can be considered answered?