Fix issue #30 Unicode decoding in conversion to CharSequence #32

Merged 1 commit on Jun 11, 2017

gsnewmark
Contributor

Issue #30 affects us too, so I've looked into it a bit. CharsetDecoder's JavaDoc is somewhat vague, but it states that:

In any case, if this method [decode] is to be reinvoked in the same decoding operation then care should be taken to preserve any bytes remaining in the input buffer so that they are available to the next invocation.

It looks like, when a decode operation underflows, CharsetDecoder leaves the bytes that do not form a complete character in the input buffer it was passed, and expects the next decode call to supply those bytes together with the additional ones that complete the character. So I've added merging of the remaining extra-bytes and the new in to the underflow branch of the decoding. It fixes the issue, but I'm not that experienced with byte fiddling, so maybe there is a more efficient way to do that.
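To illustrate the behaviour described above, here is a minimal Java sketch of carrying undecoded bytes over to the next `decode` call. It is not the code from this pull request (which is Clojure); the class `ChunkedUtf8Decoder` and its methods `feed`, `finish`, and `decodeChunks` are hypothetical names chosen for the example, and it assumes UTF-8 input arriving as separate byte arrays.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only: decodes UTF-8 that arrives in
// chunks, keeping the bytes of an incomplete character for the next chunk.
final class ChunkedUtf8Decoder {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    private final StringBuilder decoded = new StringBuilder();
    // Bytes the decoder left unconsumed on the previous call (possibly none).
    private byte[] leftover = new byte[0];

    void feed(byte[] chunk) {
        // Merge the leftover bytes with the new chunk, leftover first.
        ByteBuffer in = ByteBuffer.allocate(leftover.length + chunk.length);
        in.put(leftover);
        in.put(chunk);
        in.flip();

        CharBuffer out = CharBuffer.allocate(Math.max(16, in.remaining()));
        CoderResult result;
        do {
            // endOfInput = false: more bytes may follow in a later chunk.
            result = decoder.decode(in, out, false);
            if (result.isError()) {
                throw new IllegalArgumentException("Undecodable input: " + result);
            }
            out.flip();
            decoded.append(out);
            out.clear();
        } while (result.isOverflow());

        // On underflow, anything still in the input buffer is the start of an
        // incomplete character; keep it for the next call.
        leftover = new byte[in.remaining()];
        in.get(leftover);
    }

    // Simplification: only checks for a dangling partial character instead of
    // making a final decode call with endOfInput = true followed by flush().
    String finish() {
        if (leftover.length > 0) {
            throw new IllegalArgumentException("Input ended inside a character");
        }
        return decoded.toString();
    }

    static String decodeChunks(byte[]... chunks) {
        ChunkedUtf8Decoder d = new ChunkedUtf8Decoder();
        for (byte[] chunk : chunks) {
            d.feed(chunk);
        }
        return d.finish();
    }
}
```

For example, `ChunkedUtf8Decoder.decodeChunks(new byte[] {'h', (byte) 0xC3}, new byte[] {(byte) 0xA9})` splits the two-byte encoding of "é" across chunks and still yields "hé". Allocating a fresh merged buffer per chunk is the simplest approach, not necessarily the fastest.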

In case compatibility with Clojure 1.5 is needed, I can remove usage of some-> (the same goes for some? and Clojure < 1.5).

A test can be found in pull request #31.

@ztellman
Collaborator

Thank you. I've been traveling and haven't been able to look at this. I'll merge this and make any performance tweaks myself.

@ztellman ztellman merged commit 29f50f7 into clj-commons:master Jun 11, 2017
@gsnewmark
Contributor Author

Thanks!