
first try fixing utf8 issues #52


Closed
wants to merge 2 commits into from

Conversation

@ambs (Contributor) commented Aug 19, 2016

This is a first try at fixing a utf8 problem from #37.
This is not really a pull request, but a request for comments. It solves the issue, but will only work on Perl >= 5.8.1.

What happens is that when Perl reads Makefile.PL under use utf8, it stores the strings internally as character data. You will then not be able to write them out as bytes unless you perform the right conversion first.
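The mismatch described here can be sketched in a few lines (an illustration, not the PR's actual code):

```perl
use utf8;                       # Makefile.PL parsed as UTF-8: literals become characters
use strict;
use warnings;

my $author = 'Alberto Simões';  # a 14-character string, not 15 bytes
my $bytes  = $author;
utf8::encode($bytes);           # in place: characters -> UTF-8 octets (now 15 bytes)

# Printing $bytes to a raw filehandle is now safe; printing $author
# directly is what produced the broken output.
```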

btw, this is a Pull Request Challenge PR.

@karenetheridge (Member) commented Aug 19, 2016

use utf8 is not necessary to call functions in the utf8 namespace -- its only purpose is to signal that the file should be parsed as utf8-encoded bytes.

utf8::is_utf8 does not do what you think it does. It only checks the internal representation of the perl string; it will not inform you whether the string is unicode characters vs. utf8-encoded bytes.
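A small sketch of that distinction (illustrative, not from the PR): two strings can compare equal while utf8::is_utf8 disagrees, because the flag describes storage, not content:

```perl
use strict;
use warnings;

my $a = "caf\x{e9}";    # "café" as characters; fits latin-1, so stored as latin-1
my $b = $a;
utf8::upgrade($b);      # same characters, re-stored internally as UTF-8

# $a eq $b is true: eq compares characters.
# utf8::is_utf8($a) is false while utf8::is_utf8($b) is true:
# the flag reflects the internal representation, nothing more.
```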

This code makes no sense:

my $test_string = 'Alberto Simões <[email protected]>';
utf8::decode($test_string);

utf8::decode on a string that is already in unicode characters will result in mojibake -- if it doesn't error out on encountering bytes that cannot exist in utf8.
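The failure mode is easy to demonstrate (a sketch based on the documented utf8::decode behavior):

```perl
use strict;
use warnings;

my $s  = "Sim\x{f5}es";      # already character data: S i m õ e s
my $ok = utf8::decode($s);   # tries an in-place octets -> characters conversion

# 0xF5 is a UTF-8 start byte with no valid continuation byte after it,
# so decoding fails: $ok is false and $s is left unchanged.
```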

@ambs (Contributor, Author) commented Aug 20, 2016

On 19/08/16 21:59, Karen Etheridge wrote:

This code makes no sense:

    my $test_string = 'Alberto Simões <[email protected]>';
    utf8::decode($test_string);

utf8::decode on a string that is already in unicode characters will
result in mojibake -- if it doesn't error out on encountering bytes that
cannot exist in utf8.

But it is not in utf8.
Note that the test script doesn't use utf8.

And you can't make it use utf8, or the generated Module::Install file will be
written with bad utf8 encoding.

@ambs (Contributor, Author) commented Aug 20, 2016

Dear @karenetheridge, does it look more understandable this way?

@ambs (Contributor, Author) commented Aug 20, 2016

Also, I added some code to check the Perl version.
I think this could be a real solution now.

@Leont (Member) commented Aug 22, 2016

utf8::is_utf8 does not do what you think it does. It only checks the internal representation of the perl string; it will not inform you whether the string is unicode octets vs. utf8-encoded bytes.

Indeed, it is incorrect.

@miyagawa (Member) commented

The correct fix would be to just call utf8::encode without checking the utf8 flag. It would break existing Makefile.PLs that set utf-8 byte strings, but they can then fix that by adding use utf8; in there.

utf8 was shipped with 5.6.0 and there's no problem requiring that, given Module::Install's lowest supported perl is also 5.6.0.
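The breaking case mentioned here can be sketched like this (illustrative): a Makefile.PL that already holds UTF-8 bytes gets encoded a second time:

```perl
use strict;
use warnings;

# Without "use utf8", this literal is a string of UTF-8 *bytes*:
my $bytes = "Alberto Sim\xc3\xb5es";   # \xc3\xb5 is the encoding of "õ"

utf8::encode($bytes);                  # unconditional second encoding

# Now "õ" has become "Ãµ" (\xc3\x83\xc2\xb5) -- mojibake. The author-side
# fix is to add "use utf8;" so the literal is characters in the first place.
```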

@miyagawa (Member) commented

btw, #37 sounds like a different problem then - it should decode when reading the =head1 AUTHOR section with the author_from directive, when combined with the utf8::encode added like this.

@ambs (Contributor, Author) commented Aug 22, 2016

@miyagawa, removing the utf8 check makes t/20_authors_with_special_characters.t fail.
Will try to understand how that can be fixed. Probably, changing that test... to use utf8

@miyagawa (Member) commented

Probably, changing that test... to use utf8

Haven't looked at the test, but that sounds right. Essentially, this can't be done without a breaking change...

@Leont (Member) commented Aug 22, 2016

The only sane way to use unicode in perl is to handle the data consistently as either bytes or characters. Anything else will set everything on fire.

And while I generally consider treating strings as characters the more sensible option, the path of least resistance here is probably treating everything as bytes.

@ambs (Contributor, Author) commented Aug 22, 2016

How would this change affect M::I users? I am never sure how the encode/decode stuff will affect the string.

@ambs (Contributor, Author) commented Aug 22, 2016

@Leont you say that because your name doesn't have accented characters 😺

@karenetheridge (Member) commented Aug 22, 2016

We've discussed in CPAN-Meta tickets that it's more important to avoid breaking installations than to corrupt an author or contributor's name. However, different rules can apply in MI since the code is bundled with the distribution, so if breakage occurs, it's all on the author side, and they are in a position to fix their Makefile.PL before shipping.

Given this, I think it would be acceptable to die with "invalid character encoding" errors and force the author to fix their code (especially if it's easy to do so, and we have something in the documentation explaining how) when they upgrade to the latest MI.

...especially since we don't really want authors to continue to use MI anyway.. :)

@Leont (Member) commented Aug 22, 2016

To summarize the current situation:

  1. The file gets read as bytes.
  2. Module::Install/YAML::Tiny interprets those bytes as characters and returns characters.
  3. The file gets written to a binary handle, so it gets silently downgraded to latin1 if needed.

These mismatches cause mojibake.

The first suggested solution was essentially "sometimes we interpret the output as bytes, sometimes we don't", which is a path to madness.

The second suggested solution is to treat all files (on input and on output) as UTF-8. That will actually work as long as every file really is UTF-8, but I bet that often isn't the case.

The sensible solution is a combination of:

  1. read input correctly, e.g. decode input data from pod if =encoding <whatever> is set.
  2. write output correctly, e.g. encode the YAML string to bytes.
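The two steps above can be sketched with the Encode module (a hypothetical example, not Module::Install code):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $pod_bytes = "=encoding utf8\n\n=head1 AUTHOR\n\nAlberto Sim\xc3\xb5es\n";

# 1. Read input correctly: decode bytes to characters at the boundary
#    (here unconditionally; real code would honor the =encoding directive).
my $pod_chars = decode('UTF-8', $pod_bytes);

# ... all internal processing deals in characters ...

# 2. Write output correctly: encode characters back to bytes when writing.
my $out_bytes = encode('UTF-8', $pod_chars);
```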

@miyagawa (Member) commented

The sensible solution is a combination of:

Yes, that was my suggestion. I forgot that the author names are often read from a file with all_from; rather, I was thinking about literals in Makefile.PL first.

@ambs (Contributor, Author) commented Aug 22, 2016

If I read it correctly, for M::I the solution is to keep the code "quiet" as it was?

@Leont (Member) commented Aug 22, 2016

If I read it correctly, for M::I the solution is to keep the code "quiet" as it was?

  1. In Module::Install::Metadata::authors_from (and possibly other functions) the data should be appropriately decoded before being stored in the object.
  2. In Module::Install::Admin::Metadata::write_meta the data should be encoded before writing it.

This should fix #37 and probably other similar issues with characters outside the latin1 range, without risking any breakage elsewhere involving these read/write functions.
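A minimal sketch of those two points (hypothetical helper names; the real functions live in the modules named above):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode when reading author data (cf. Module::Install::Metadata::authors_from):
sub author_chars_from_bytes {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes);   # store characters in the object
}

# Encode when writing the metadata (cf. Module::Install::Admin::Metadata::write_meta):
sub meta_bytes_from_chars {
    my ($chars) = @_;
    return encode('UTF-8', $chars);   # hand bytes to the binary filehandle
}
```

Round-tripping author bytes through both helpers returns the original octets, which is exactly the property the broken flow summarized earlier lacked.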

@ambs (Contributor, Author) commented Aug 22, 2016

OK, will try to see if I understood it.

@ambs (Contributor, Author) commented Aug 22, 2016

Something like #55?
I can't understand why it works with two 'encode's, but doesn't with one encode and one decode.
Almost quitting this task 😞 I hate unicode!

@ambs ambs closed this Aug 23, 2016