README-Natural-TODO.md


Natural Language TODO List

This is a short list of projects that are "ready to go" but have not been started yet.

Tokenization

The best results are obtained with "freedom models" (models of the freedom at character transitions), as described in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2655800/
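
To make the idea of freedom at character transitions concrete, here is a minimal sketch. It assumes that "freedom" means the number of distinct characters observed after a given character, and that a token boundary is placed wherever that number reaches a threshold. The referenced article may define the measure differently; the function names `transition_freedoms` and `tokenize` and the `threshold` parameter are hypothetical.

```python
from collections import defaultdict

def transition_freedoms(corpus):
    """Map each character to the number of distinct characters seen right after it."""
    followers = defaultdict(set)
    for text in corpus:
        for current, following in zip(text, text[1:]):
            followers[current].add(following)
    return {ch: len(nxt) for ch, nxt in followers.items()}

def tokenize(text, freedoms, threshold):
    """Cut after every character whose freedom is at least `threshold` (an assumed rule)."""
    tokens, start = [], 0
    for i, ch in enumerate(text[:-1]):
        if freedoms.get(ch, 0) >= threshold:
            tokens.append(text[start:i + 1])
            start = i + 1
    tokens.append(text[start:])  # whatever remains after the last cut
    return tokens

corpus = ["the cat sat on the mat"]
freedoms = transition_freedoms(corpus)
tokens = tokenize("the cat", freedoms, threshold=3)
```

In this toy corpus the space character is followed by many different letters, so it has high freedom and the cuts land at word boundaries. A practical version would probably condition on longer contexts than a single character.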

Morphology

Most European languages have conjugated verbs: there is a verb stem and a varying suffix that indicates tense and number. Effectively all of the syntactic structure is carried by the suffix, whereas the fundamental semantic content is in the stem.

To deal with morphology, words need to
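
The stem-plus-suffix view described above can be illustrated with a minimal sketch. It assumes that a good stem is one that combines with many distinct suffixes in the vocabulary. This is an illustration only, not the method planned for this project; `learn_stems`, `split_word`, and `max_suffix` are hypothetical names.

```python
from collections import defaultdict

def learn_stems(words, max_suffix=3):
    """Collect, for every candidate stem, the set of suffixes seen with it."""
    suffixes_by_stem = defaultdict(set)
    for word in set(words):
        for k in range(0, max_suffix + 1):
            if len(word) - k >= 2:  # require a stem of at least two characters
                cut = len(word) - k
                suffixes_by_stem[word[:cut]].add(word[cut:])
    return suffixes_by_stem

def split_word(word, suffixes_by_stem, max_suffix=3):
    """Split at the stem seen with the most distinct suffixes.

    Ties go to the longer stem; a word with no shared stem is returned whole.
    """
    best, best_score = (word, ""), 1
    for k in range(1, max_suffix + 1):
        if len(word) - k < 2:
            break
        stem, suffix = word[:-k], word[-k:]
        score = len(suffixes_by_stem.get(stem, ()))
        if score > best_score:
            best, best_score = (stem, suffix), score
    return best

vocabulary = ["walks", "walked", "walking", "talks", "talked", "talking"]
stems = learn_stems(vocabulary)
split = split_word("walked", stems)
```

With this vocabulary, "walk" is seen with three different suffixes, so "walked" splits into "walk" and "ed". The stem could then feed semantic processing and the suffix syntactic processing, in line with the division described above.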

Chinese

Chinese word segmentation can be learned by treating words as "set phrases", that is, as character sequences that recur together.

Translation/parallel tests

Learning from translations and parallel texts should also work.

Infrastructure development is needed to handle parallel texts.