foldfilter breaks translation from language without spaces to language with spaces #21
Comments
Oh crap… that is not something I had thought about. By default it just uses the delimiter it chopped out as glue. If there is no delimiter, there is no glue. So no space. It makes sense in the I'd suggest adding a space between words when the separator was empty (i.e. at a mid-word break instead of some soft line wrapping) except when the This should be a simple fix, just an extra if-statement inside https://github.com/kpu/preprocess/blob/master/preprocess/foldfilter_main.cc#L132
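To make the "no delimiter, no glue" behaviour concrete, here is a rough Python sketch of the idea. It is not the code in foldfilter_main.cc: `fold`, `unfold` and the `space_on_empty` switch are hypothetical names made up for illustration, and the folding rule (break after the last delimiter that fits, otherwise break mid-word) is an assumption about how such a tool might work.

```python
def fold(line, width, delimiters=" "):
    """Split `line` into chunks of at most `width` characters.

    Returns (chunks, glue), where glue[i] is the delimiter removed between
    chunks[i] and chunks[i + 1], or "" when the break fell mid-word.
    """
    chunks, glue = [], []
    rest = line
    while len(rest) > width:
        # Last delimiter that still leaves a chunk of at most `width` chars.
        cut = max((rest.rfind(d, 0, width + 1) for d in delimiters), default=-1)
        if cut > 0:
            chunks.append(rest[:cut])
            glue.append(rest[cut])
            rest = rest[cut + 1:]
        else:
            # No usable delimiter: hard break, so nothing to glue back with.
            chunks.append(rest[:width])
            glue.append("")
            rest = rest[width:]
    chunks.append(rest)
    return chunks, glue


def unfold(chunks, glue, space_on_empty=False):
    """Join processed chunks with the glue that was removed while folding.

    `space_on_empty` is a hypothetical switch: when set, an empty glue
    (mid-word break) is replaced by a single space.
    """
    out = [chunks[0]]
    for sep, chunk in zip(glue, chunks[1:]):
        if sep == "" and space_on_empty:
            sep = " "
        out.append(sep)
        out.append(chunk)
    return "".join(out)


# Input without spaces: every break is a hard break, so the glue is empty.
chunks, glue = fold("abcdefgh", 4)
# Stand-in for a translation step that outputs a language with spaces.
translated = ["first part", "second part"]
print(unfold(translated, glue))                       # words run together
print(unfold(translated, glue, space_on_empty=True))  # space restored
```

With spaced input the removed delimiter is put back, so the round trip is lossless. Only the hard-break case loses the word boundary, which is where the extra if-statement would go.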
It's worse than first thought. Even languages with spaces are losing them. Let's make a
build/bin/foldfilter -w 4 ./remove_space.py <<<"hello hi how are you"
hellohihowareyou
I misinterpreted my own documentation… Passing spaces to the wrapped command is the default, but paracrawl is using it with the With
So the damage is not that bad (I would have noticed it earlier!) but the original case (input without delimiters) is still a problem. Also, maybe the defaults are bad.
Yeah ok we should have used
@jelmervdl Problem with foldfilter: if we translate from e.g. ko (without spaces) to en then the output concatenates the last English word of a preceding sentence with the first English word of the following sentence, without an intervening space.
This impacts quality of zh and ko and probably other paracrawls.