Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return source white space string #95

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

kino-ma
Copy link

@kino-ma kino-ma commented Oct 7, 2024

Currently, Node.white_space always returns half-width space " " (0x20).

However, it is not desirable behavior in my use case because I want to preserve source white space characters other than " " (e.g., "\t") also.

By this PR, the function will return the original white space string.

@polm
Copy link
Owner

polm commented Oct 10, 2024

Thank you, this looks like an excellent change. Thank you also for including tests.

It may take me a little bit to review it, but I'll check it over and get it included.

@kino-ma
Copy link
Author

kino-ma commented Oct 10, 2024

Thanks for your reply!
よろしくお願いいたします 🙏

Escape characters make the source easier to read.
@polm
Copy link
Owner

polm commented Oct 31, 2024

Sorry it took me a while to get to this. The basic code is good, but because of the way MeCab does memory management, it has the issue that the surface is invalidated by further calls to parseToNode. For example:

from fugashi import Tagger

tagger = Tagger()

x = tagger("a\nb")
print(repr(x[1].white_space))
# => '\n'
y = tagger("a\tb")
print(repr(x[1].white_space))
# => '\t' (this is wrong)

The solution is to cache the white space, as is done for the token surfaces. I can add this myself.

@polm
Copy link
Owner

polm commented Oct 31, 2024

OK, caching is implemented so this should actually work now. I will do a few more checks before merging. Might take a little while, though it shouldn't be another three weeks 😅

@kino-ma
Copy link
Author

kino-ma commented Nov 1, 2024

Thank you so much!! 😆

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants