Add support for English #41
Using regexes is probably OK (though you should use the JVM API here, not the Scala one), but it is better to skip roughly the first 50% of the input document for language detection, as I explained. The header section usually contains a lot of uninteresting ASCII-only content, so filtering out only scripts won't help much. There are also comments, inline stylesheets, and other things we can ignore if we start language detection from the first tag after ~50% of the text content.
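The skip-ahead idea in this comment could be sketched roughly as follows. This is only an illustration, not the code in the PR: the object and method names are hypothetical, and the midpoint is measured in raw characters of the buffer, which is a simplification of "~50% of the text content".

```scala
import java.nio.CharBuffer

// Hypothetical helper, not part of uzushio.
object SkipHalfSketch {
  /** Returns a view of `input` that starts at the first '<' at or after the
    * midpoint. Falls back to the midpoint itself if no tag follows it.
    * Indices are relative to the buffer's current position.
    */
  def secondHalfFromTag(input: CharBuffer): CharBuffer = {
    val len = input.remaining()
    val mid = len / 2
    var i = mid
    // Scan forward from the midpoint to the next tag opening.
    while (i < len && input.charAt(i) != '<') i += 1
    val start = if (i < len) i else mid
    // subSequence creates a view over the same data; nothing is copied.
    input.subSequence(start, len)
  }
}
```

The returned view can then be passed to whatever stripping and detection steps follow, so the header, inline styles, and scripts in the first half are never looked at.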
output.put(char)
private def copyMeaningfulContent(input: CharBuffer, output: CharBuffer): Unit = {
  // Convert the input to a string
  val content = input.toString
It is possible to avoid creating this string entirely; you do not need it.
Pattern.matcher (https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/regex/Pattern.html#matcher(java.lang.CharSequence)) accepts a CharBuffer directly as input, because CharBuffer implements the CharSequence interface.
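A minimal sketch of matching against the buffer without an intermediate string might look like this. The tag pattern and the object and method names are assumptions made for illustration; the actual patterns used in the PR are not shown in this thread.

```scala
import java.nio.CharBuffer
import java.util.regex.Pattern

// Hypothetical helper, not part of uzushio.
object CharBufferRegexSketch {
  // Assumed pattern: anything that looks like an HTML tag.
  private val TagPattern: Pattern = Pattern.compile("<[^>]*>")

  /** Copies the text outside of tags from `input` into `output`
    * without ever calling input.toString.
    */
  def copyOutsideTags(input: CharBuffer, output: CharBuffer): Unit = {
    // CharBuffer implements CharSequence, so it can be matched directly.
    val matcher = TagPattern.matcher(input)
    var last = 0
    while (matcher.find()) {
      // subSequence indices are relative to the buffer's current position.
      output.put(input.subSequence(last, matcher.start()))
      last = matcher.end()
    }
    // Copy whatever follows the last tag.
    output.put(input.subSequence(last, input.remaining()))
  }
}
```

Note that `output` must have enough remaining capacity, otherwise `put` throws `BufferOverflowException`; the sketch leaves that check to the caller.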
I changed the code according to your instructions, including:
1. Used Java regex patterns
2. Started estimation from ~50% of the document
3. Discarded CSS, etc.
However, it still cannot recognize English properly.
Is the regex-based method too rudimentary? Should I use Jsoup?
I also tried it on a relatively large English corpus, but in the result only English is missing.
That seemed strange, so I tried a single English HTML document from this corpus, and LangEstimator correctly estimated it as English. Is there anything I need to change in uzushio? Any insights would be appreciated.