Copy cs to sk for prototyping #132
base: main
src/rules/sk.toml
Outdated
@@ -0,0 +1,17 @@
allowed_symbols_regex="[A-Za-zěščřžýáíéóďťňúůĚŠČŘŽÝÁÍÉÓĎŤŇäöüÚ‚–\\. \"„“]"
allowed_symbols_regex="[A-Za-zěščŕřžýáíéóôďťňúůĺľÁÄĚŠČŔŘŽÝÁÍÉÓÔĎŤŇĹĽäöüÚ‚–\. "„“]"
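A quick way to sanity-check a character class like the one suggested above is to run a few sample sentences through it and list whatever falls outside. The sketch below is illustrative Python, not the extractor's own code: `disallowed_symbols`, `CS_ORIGINAL` and `SK_SUGGESTED` are hypothetical names, and both patterns are written with the TOML escaping removed.

```python
import re

# Character classes copied from the diff above, TOML escaping removed.
# The constant names are hypothetical and exist only for this sketch.
CS_ORIGINAL = re.compile(
    r'[A-Za-zěščřžýáíéóďťňúůĚŠČŘŽÝÁÍÉÓĎŤŇäöüÚ‚–\. "„“]'
)
SK_SUGGESTED = re.compile(
    r'[A-Za-zěščŕřžýáíéóôďťňúůĺľÁÄĚŠČŔŘŽÝÁÍÉÓÔĎŤŇĹĽäöüÚ‚–\. "„“]'
)


def disallowed_symbols(sentence, allowed):
    """Return every character of `sentence` that `allowed` does not cover."""
    return [ch for ch in sentence if not allowed.fullmatch(ch)]


if __name__ == "__main__":
    for sample in ["Ľúbim ťa.", "Je 5 hodín?"]:
        print(sample, disallowed_symbols(sample, CS_ORIGINAL),
              disallowed_symbols(sample, SK_SUGGESTED))
```

Running it on a handful of real Slovak sentences shows which letters a copied Czech class still misses, which is easier than comparing the two classes by eye.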
src/rules/sk.toml
Outdated
needs_uppercase_start = true
even_symbols = ["\""]
broken_whitespace = [" ", " ,", " .", " ?", " !", " ;"]
abbreviation_patterns = ["[A-ZĚŠČŘŽÝÁÍÉĎŤŇÓÚ]+\\.*[a-z]*[A-ZĚŠČŘŽÝÁÍÉĎŤŇÓÚ]+", "atd\\.", "\\baj\\.", "tj\\.", "\\brec\\.", "[nN]apř\\.",
abbreviation_patterns = ["[A-ZĹĽĚŠČŔŘŽÝÁÍÉĎŤŇÓÔÚ]+\.[a-z][A-ZĹĽĚŠČŔŘŽÝÁÍÉĎŤŇÓÔÚ]+", "a i\.", "a pod\.", "atď\.", "\baj\.", "tj\.", "\brec\.", "[nN]apr\.",
""."", "\s[^aikosuvzáó]\s", "zkr\.", "[Tt]zv\.", "[dD]r\.", "\b[aAeE]d\.", "\b[sS]?[tT]r\.", "[aA]rch\.", "Inc\.", "Ltd\.", "[pP]opr\.",
"\b[fF]r\.", "\b[A-Z]+DR\b", "[pP]ozn\.", "[sS]rov\.", "\b[eE][a-z]\.", "[zZ]ejm\.", "[JS]r\.", "\b[lL][lL]",
"Mgr\.", "[mM]j\.", "\b[sS]tol\.", "\b[pP]ol\.", "Ing\.", "[cCkK]pt\.", "\b[lL]t\.", "Mr?s?\.", "\s[^\\s]{1,2}\.", "\bviz\.", "\b[sS]at\."]
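To see which of these patterns actually fire on Slovak text, a small check along the following lines may help. It is a sketch in Python, not the extractor's implementation: it assumes a sentence is flagged when any pattern is found anywhere in it, it uses only a few patterns copied from the suggestion above, and `SAMPLE_PATTERNS` and `matching_abbreviations` are hypothetical names.

```python
import re

# A handful of patterns copied from the suggestion above (TOML escaping
# removed). This is a subset chosen for illustration, not the full list.
SAMPLE_PATTERNS = [
    r"atď\.",
    r"\baj\.",
    r"[nN]apr\.",
    r"[Tt]zv\.",
    r"Ing\.",
]


def matching_abbreviations(sentence, patterns=SAMPLE_PATTERNS):
    """Return the patterns found anywhere in `sentence`.

    Assumption: a non-empty result means the sentence would be rejected.
    """
    return [p for p in patterns if re.search(p, sentence)]


if __name__ == "__main__":
    print(matching_abbreviations("Kúpil ovocie, napr. jablká."))
    print(matching_abbreviations("Dnes je pekne."))
```

Listing the matching patterns, rather than returning a plain yes or no, makes it easier to spot Czech leftovers that never match Slovak sentences.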
Blocklist generated from words of frequency 60 and lower
Downloaded and sent for review to five native speakers.
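For readers unfamiliar with the blocklist step mentioned above, the idea can be sketched roughly as follows. This is a minimal Python illustration and not the tool that produced the actual blocklist: the tokenization (`\w+`, lowercased) is an assumption, and `build_blocklist` is a hypothetical name. Only the threshold of 60 comes from the description above.

```python
import re
from collections import Counter


def build_blocklist(sentences, max_frequency=60):
    """Return words that occur `max_frequency` times or fewer, sorted.

    Assumption: words are runs of word characters, compared in lowercase.
    """
    counts = Counter(
        word.lower()
        for sentence in sentences
        for word in re.findall(r"\w+", sentence)
    )
    return sorted(w for w, n in counts.items() if n <= max_frequency)


if __name__ == "__main__":
    corpus = ["Pes beží.", "Pes spí.", "Mačka spí."]
    # A tiny threshold is used here only because the sample corpus is tiny.
    print(build_blocklist(corpus, max_frequency=1))
```

Rare words are more likely to be typos, names or foreign terms, which is presumably why they are excluded before sentences go to reviewers.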
Sorry, I missed that comment.
No. We can't accept corrected sentences, because we need to run a new, fresh export once the rules are added. This is needed to make sure that we fulfil all legal requirements. As sentences are picked at random, any changes to them would be lost.
Mainly to get the extraction running and to get an idea of how much more work will need to be done.