
first try fixing utf8 issues #52


Closed
wants to merge 2 commits into from

Conversation

@ambs (Contributor) commented Aug 19, 2016

This is a first try at fixing a utf8 problem from #37.
This is not really a pull request, but a request for comments. It solves the issue, but will only work on Perl >= 5.8.1.

What happens is that when Perl reads Makefile.PL under use utf8, it stores the strings internally as character data. You will then not be able to write them out as bytes unless you perform the right conversion first.
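The mismatch described here can be sketched in a few lines (an illustration, not the PR's actual code):

```perl
use utf8;                       # Makefile.PL parsed as UTF-8: literals become characters
use strict;
use warnings;

my $author = 'Alberto Simões';  # a 14-character string, not 15 bytes
my $bytes  = $author;
utf8::encode($bytes);           # in place: characters -> UTF-8 octets (now 15 bytes)

# Printing $bytes to a raw filehandle is now safe; printing $author
# directly is what produced the broken output.
```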

btw, this is a Pull Request Challenge PR.

@karenetheridge (Member) commented Aug 19, 2016

use utf8 is not necessary to call functions in the utf8 namespace -- its only purpose is to signal that the file should be parsed as utf8-encoded bytes.

utf8::is_utf8 does not do what you think it does. It only checks the internal representation of the perl string; it will not inform you whether the string is unicode characters vs. utf8-encoded bytes.
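A small sketch of that distinction (illustrative, not from the PR): two strings can compare equal while utf8::is_utf8 disagrees, because the flag describes storage, not content:

```perl
use strict;
use warnings;

my $a = "caf\x{e9}";    # "café" as characters; fits latin-1, so stored as latin-1
my $b = $a;
utf8::upgrade($b);      # same characters, re-stored internally as UTF-8

# $a eq $b is true: eq compares characters.
# utf8::is_utf8($a) is false while utf8::is_utf8($b) is true:
# the flag reflects the internal representation, nothing more.
```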

This code makes no sense:

my $test_string = 'Alberto Simões <[email protected]>';
utf8::decode($test_string);

utf8::decode on a string that is already in unicode characters will result in mojibake -- if it doesn't error out on encountering bytes that cannot exist in utf8.
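The failure mode is easy to demonstrate (a sketch based on the documented utf8::decode behavior):

```perl
use strict;
use warnings;

my $s  = "Sim\x{f5}es";      # already character data: S i m õ e s
my $ok = utf8::decode($s);   # tries an in-place octets -> characters conversion

# 0xF5 is a UTF-8 start byte with no valid continuation byte after it,
# so decoding fails: $ok is false and $s is left unchanged.
```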

@ambs (Contributor, Author) commented Aug 20, 2016

On 19/08/16 21:59, Karen Etheridge wrote:

This code makes no sense:

    my $test_string = 'Alberto Simões <[email protected]>';
    utf8::decode($test_string);

utf8::decode on a string that is already in unicode characters will
result in mojibake -- if it doesn't error out on encountering bytes that
cannot exist in utf8.

But it is not in utf8.
Note that the test script doesn't use utf8.

And you can't make it use utf8, or the generated Module::Install file will be
written with bad utf8 encoding.

@ambs (Contributor, Author) commented Aug 20, 2016

Dear @karenetheridge, does it look more understandable this way?

@ambs (Contributor, Author) commented Aug 20, 2016

Also, I added some code to check the Perl version.
I think this could be a real solution now.

@Leont (Member) commented Aug 22, 2016

utf8::is_utf8 does not do what you think it does. It only checks the internal representation of the perl string; it will not inform you whether the string is unicode octets vs. utf8-encoded bytes.

Indeed, it is incorrect.

@miyagawa (Member) commented

The correct fix would be to just call utf8::encode without checking the utf8 flag. It would break existing Makefile.PLs that set utf-8 byte strings, but they can then fix that by adding use utf8; in there.

utf8 was shipped with 5.6.0 and there's no problem requiring that, given Module::Install's lowest supported perl is also 5.6.0.
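The breaking case mentioned here can be sketched like this (illustrative): a Makefile.PL that already holds UTF-8 bytes gets encoded a second time:

```perl
use strict;
use warnings;

# Without "use utf8", this literal is a string of UTF-8 *bytes*:
my $bytes = "Alberto Sim\xc3\xb5es";   # \xc3\xb5 is the encoding of "õ"

utf8::encode($bytes);                  # unconditional second encoding

# Now "õ" has become "Ãµ" (\xc3\x83\xc2\xb5) -- mojibake. The author-side
# fix is to add "use utf8;" so the literal is characters in the first place.
```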

@miyagawa (Member) commented

btw, #37 sounds like a different problem then - it should decode when reading the =head1 AUTHOR section with the author_from directive, when combined with the utf8::encode added like this.

@ambs (Contributor, Author) commented Aug 22, 2016

@miyagawa, removing the utf8 check makes t/20_authors_with_special_characters.t fail.
Will try to understand how that can be fixed. Probably, changing that test... to use utf8

@miyagawa (Member) commented

Probably, changing that test... to use utf8

Haven't looked at the test, but that sounds right. Essentially, this can't be done without a breaking change...

@Leont (Member) commented Aug 22, 2016

The only sane way to use unicode in perl is to handle the data consistently as either bytes or characters. Anything else will set everything on fire.

And while I generally consider treating strings as characters the more sensible option, the path of least resistance here is probably treating everything as bytes.

@ambs (Contributor, Author) commented Aug 22, 2016

How would this change affect M::I users? I am never sure how the encode/decode stuff will affect the string.

@ambs (Contributor, Author) commented Aug 22, 2016

@Leont you say that because your name doesn't have accented characters 😺

@karenetheridge (Member) commented Aug 22, 2016

We've discussed in CPAN-Meta tickets that it's more important to avoid breaking installations than to corrupt an author or contributor's name. However, different rules can apply in MI since the code is bundled with the distribution, so if breakage occurs, it's all on the author side, and they are in a position to fix their Makefile.PL before shipping.

Given this, I think it would be acceptable to die with "invalid character encoding" errors and force the author to fix their code (especially if it's easy to do so, and we have something in the documentation explaining how) when they upgrade to the latest MI.

...especially since we don't really want authors to continue to use MI anyway.. :)

@Leont (Member) commented Aug 22, 2016

To summarize the current situation:

  1. The file gets read as bytes.
  2. Module::Install/YAML::Tiny interprets those bytes as characters and returns characters.
  3. The file gets written to a binary handle, so it gets silently downgraded to latin1 if needed.

These mismatches cause mojibake.

The first suggested solution was essentially "sometimes we interpret the output as bytes, sometimes we don't", which is a path to madness.

The second suggested solution is to treat all files (on input and on output) as UTF-8. That will actually work as long as every file really is UTF-8, but I bet that often isn't the case.

The sensible solution is a combination of:

  1. read input correctly, e.g. decode input data from pod if =encoding <whatever> is set.
  2. write output correctly, e.g. encode the YAML string to bytes.
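The two steps above can be sketched with the Encode module (a hypothetical example, not Module::Install code):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $pod_bytes = "=encoding utf8\n\n=head1 AUTHOR\n\nAlberto Sim\xc3\xb5es\n";

# 1. Read input correctly: decode bytes to characters at the boundary
#    (here unconditionally; real code would honor the =encoding directive).
my $pod_chars = decode('UTF-8', $pod_bytes);

# ... all internal processing deals in characters ...

# 2. Write output correctly: encode characters back to bytes when writing.
my $out_bytes = encode('UTF-8', $pod_chars);
```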

@miyagawa (Member) commented

The sensible solution is a combination of:

Yes, that was my suggestion. I forgot that the author names are often read from a file with all_from; rather, I was thinking about literals in Makefile.PL first.

@ambs (Contributor, Author) commented Aug 22, 2016

If I read it correctly, for M::I the solution is to keep the code "quiet" as it was?

@Leont (Member) commented Aug 22, 2016

If I read it correctly, for M::I the solution is to keep the code "quiet" as it was?

  1. In Module::Install::Metadata::authors_from (and possibly other functions) the data should be appropriately decoded before being stored in the object.
  2. In Module::Install::Admin::Metadata::write_meta the data should be encoded before writing it.

This should fix #37 and probably other similar issues with characters outside the latin1 range, without risking any breakage elsewhere involving these read/write functions.
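A minimal sketch of those two points (hypothetical helper names; the real functions live in the modules named above):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode when reading author data (cf. Module::Install::Metadata::authors_from):
sub author_chars_from_bytes {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes);   # store characters in the object
}

# Encode when writing the metadata (cf. Module::Install::Admin::Metadata::write_meta):
sub meta_bytes_from_chars {
    my ($chars) = @_;
    return encode('UTF-8', $chars);   # hand bytes to the binary filehandle
}
```

Round-tripping author bytes through both helpers returns the original octets, which is exactly the property the broken flow summarized earlier lacked.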

@ambs (Contributor, Author) commented Aug 22, 2016

OK, will try to see if I understood it.

@ambs (Contributor, Author) commented Aug 22, 2016

Something like #55?
I can't understand why it works with two 'encode's, but doesn't with one encode and one decode.
Almost quitting this task 😞 I hate unicode!

@ambs ambs closed this Aug 23, 2016