Add support for English #41

Open
wants to merge 7 commits into
base: main



@ZeekYin ZeekYin commented Oct 11, 2024

No description provided.

@ZeekYin ZeekYin marked this pull request as ready for review October 11, 2024 07:43
Collaborator

@eiennohito eiennohito left a comment


Using regexes is probably OK (but you should use the JVM API here, not the Scala one), but it is better to skip ~50% of the input document for language detection, as I have explained. The header section usually contains a lot of uninteresting content written only in ASCII, and filtering out only scripts won't help that much. There are also comments, inline stylesheets, and other things we can ignore if we start language detection from the first tag after ~50% of the text content.
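The "start from the first tag after ~50%" idea can be sketched roughly as below. This is only an illustration under stated assumptions: the object name `DetectionStart`, the method `startOffset`, and the `fraction` parameter are hypothetical and not part of uzushio, and the midpoint is taken over raw characters rather than over text content, which is a simplification of what the comment describes.

```scala
import java.nio.CharBuffer

// Hypothetical helper, not uzushio code: picks where language detection could begin.
object DetectionStart {
  /** Returns the index of the first '<' at or after `fraction` of the input length.
    * Falls back to the cut-off index itself when no tag follows it.
    * Works on any CharSequence, so a CharBuffer can be passed without copying. */
  def startOffset(input: CharSequence, fraction: Double = 0.5): Int = {
    val len = input.length
    val cutoff = (len * fraction).toInt
    var i = cutoff
    // Scan forward to the next tag opening so detection does not start mid-tag.
    while (i < len && input.charAt(i) != '<') i += 1
    if (i < len) i else cutoff
  }
}
```

A caller could then pass `input.subSequence(start, input.length)` to the detector. For a `CharBuffer` that call returns another `CharBuffer` over the same underlying content, so no intermediate `String` is created.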

output.put(char)
private def copyMeaningfulContent(input: CharBuffer, output: CharBuffer): Unit = {
// Convert the input to a string
val content = input.toString
Collaborator


It is possible to avoid creating this string completely; you do not need it.
https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/regex/Pattern.html#matcher(java.lang.CharSequence) accepts CharBuffers directly as input, since they implement the CharSequence interface.
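A minimal sketch of that suggestion follows, assuming the method keeps the signature shown in the diff above. The object name `MeaningfulContent`, the `Skip` pattern, and the `appendRange` helper are hypothetical and are not the PR's actual code. The pattern is a deliberately rough approximation of "scripts, stylesheets, comments, and tags", not a real HTML parser.

```scala
import java.nio.CharBuffer
import java.util.regex.Pattern

// Hypothetical sketch, not the PR's implementation.
object MeaningfulContent {
  // Assumed pattern: whole script/style elements, comments, then any other tag.
  // (?i) = case-insensitive, (?s) = '.' also matches line breaks.
  private val Skip: Pattern = Pattern.compile(
    "(?is)<script\\b.*?</script>|<style\\b.*?</style>|<!--.*?-->|<[^>]*>"
  )

  def copyMeaningfulContent(input: CharBuffer, output: CharBuffer): Unit = {
    // CharBuffer implements CharSequence, so no input.toString is needed.
    val matcher = Skip.matcher(input)
    var last = 0
    while (matcher.find()) {
      appendRange(input, last, matcher.start(), output) // text before the match
      last = matcher.end()
    }
    appendRange(input, last, input.length(), output) // trailing text
  }

  // Copies input[from, until) into output, stopping early if output is full.
  // charAt reads relative to the buffer's position and does not advance it.
  private def appendRange(input: CharBuffer, from: Int, until: Int, output: CharBuffer): Unit = {
    var i = from
    while (i < until && output.hasRemaining) {
      output.put(input.charAt(i))
      i += 1
    }
  }
}
```

Because only `charAt` and `length` are used on the input, its position and limit stay untouched, which matters while the `Matcher` is still reading from the same buffer.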


ZeekYin commented Oct 20, 2024

I changed the code according to your instructions, including:

  1. Use the Java regex Pattern API
  2. Start estimation from ~50% of the document
  3. Discard CSS, etc.

However, it still cannot recognize English properly.


ZeekYin commented Oct 22, 2024

Is the regex-based method too rudimentary? Should I use Jsoup?


ZeekYin commented Oct 23, 2024

I also tried it on a relatively large English corpus, but I got this output:

'language=ar'   'language=fr'  'language=ko'  'language=pt'  'language=uk'
'language=ast'  'language=ga'  'language=lt'  'language=ru'  'language=ur'
'language=be'   'language=gl'  'language=lv'  'language=sk'  'language=vi'
'language=bg'   'language=hi'  'language=mk'  'language=sq'  'language=zh'
'language=bn'   'language=is'  'language=mr'  'language=sr'   _SUCCESS
'language=cs'   'language=ja'  'language=mt'  'language=sv'
'language=el'   'language=km'  'language=oc'  'language=th'
'language=fa'   'language=kn'  'language=pl'  'language=tr'

Only English is missing. I think that's strange, so I tried a single English HTML file from this corpus, and the LangEstimator correctly estimated it as English. Is there anything I need to change in uzushio? Any insights would be appreciated.
