Incorrect sentence parsing using ja_core_news_trf #12106

e-e · 2023-01-12T12:00:11Z

I'm not sure whether this is the right place to report this, and it's the first time I've seen something like it, but I'm opening an issue in case it is something that should be reported.

How to reproduce the behaviour

import spacy

nlp = spacy.load("ja_core_news_trf")
doc = nlp("息子のためとあれば、火の中でも飛び込みます。")

for sent in doc.sents:
    print(sent.text)

outputs:

息子のためとあれば、火の中でも飛び
込み
ます
。

Your Environment

spaCy version: 3.4.3
Platform: macOS-12.5.1-x86_64-i386-64bit
Python version: 3.10.8
Pipelines: ja_core_news_lg (3.4.0), ja_core_news_trf (3.4.0)

polm · 2023-01-13T04:37:36Z

In general, issues like this fall under #3052, which basically amounts to "the models make mistakes sometimes". If the mistake is common and follows a clear pattern, that might point to a fixable issue. In this case, there does seem to be something weird about how compound verbs are handled, so we'll take a closer look at that.

Note that if your goal is actually just sentence segmentation for Japanese, you should get high-quality results with a punctuation-based sentencizer instead of relying on the default sentence boundaries, which are based on the parse tree.
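To illustrate what "punctuation-based" means here, the following is a minimal pure-Python sketch rather than spaCy's own implementation. The function name `split_sentences`, the `SENTENCE_END` character set, and the handling of closing brackets are all assumptions made for this example; the second test sentence is invented for illustration.

```python
import re
from typing import List

# Assumed set of sentence-final punctuation for Japanese text; extend as needed.
SENTENCE_END = "。！？!?"

# A sentence is a run of non-terminators, then one or more terminators,
# then any closing quotes/brackets. Trailing text without a terminator
# is kept as a final sentence.
_SENTENCE_RE = re.compile(
    rf"[^{SENTENCE_END}]*[{SENTENCE_END}]+[」』）)]*|[^{SENTENCE_END}]+$"
)


def split_sentences(text: str) -> List[str]:
    """Split text on sentence-final punctuation only (hypothetical helper)."""
    return [m.group().strip() for m in _SENTENCE_RE.finditer(text) if m.group().strip()]


# The sentence from the report stays in one piece because it contains
# only one sentence-final mark.
for sent in split_sentences("息子のためとあれば、火の中でも飛び込みます。今日は晴れです。"):
    print(sent)
```

Within spaCy itself, the built-in rule-based component can be added with `nlp.add_pipe("sentencizer")`, for example on a blank pipeline created with `spacy.blank("ja")`, so that sentence boundaries do not depend on the dependency parse.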
