-
Notifications
You must be signed in to change notification settings - Fork 3k
sanitize_file_name(): Normalize all space characters to a space. #9103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sanitize_file_name(): Normalize all space characters to a space. #9103
Conversation
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the Core Committers: Use this line as a base for the props when committing in SVN:
To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
…dPress#9103) Resolves Core-62995 The `sanitize_file_name()` function normalizes the no-break space to a normal space (U+0020) in order to prevent issues saving files with the no-break space in it. This patch expands the replacement to all space characters, since it’s known that macOS stores a NARROW NO-BREAK SPACE (U+202F) in screenshot filenames between the time and the am/pm indicator. There are deeper issues with the way this function works, but this patch resolves a known and common problem without raising any of the deeper refactoring questions.
ef23a57
to
198fce4
Compare
Test using WordPress PlaygroundThe changes in this pull request can previewed and tested using a WordPress Playground instance. WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser. Some things to be aware of
For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like an appropriate fix, to simply extend the space replacement to cover the space class. Does this have tests? It would be good to add a case for the known case described in the ticket.
The PHP page on unicode character properties is here. They don't seem to support using the full character property name like JavaScript, which is unfortunate.
I'd consider linking here instead of the data file: https://www.unicode.org/reports/tr44/#General_Category_Values. It seems a bit friendlier for human documentation, but I don't feel strongly.
src/wp-includes/formatting.php
Outdated
* Replace all whitespace characters with a basic space (U+0020). | ||
* | ||
* Characters in the White_Space category are listed with “Zs” in | ||
* their entry in the UnicodeData file maintained at the linked URL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the correct name for this property is "space separator," not "white space".
* Replace all whitespace characters with a basic space (U+0020). | |
* | |
* Characters in the White_Space category are listed with “Zs” in | |
* their entry in the UnicodeData file maintained at the linked URL. | |
* Replace all whitespace characters with a basic space (U+0020). | |
* | |
* Characters in the “Space_Separator” category are listed with “Zs” in | |
* their entry in the UnicodeData file maintained at the linked URL. |
Zl
Line separator
Zp
Paragraph separator
Zs
Space separator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting. this document is confusing to me. I think the category of Space_Separator
(Zs
) is meant to include characters with the binary White_Space
property. I’m not entirely sure that all characters with the White_Space
property are in the Space_Separator
category.
since this is directly referencing Zs
in the PCRE pattern, I think your suggestion is best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I typed with the wording; would appreciate your thoughts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also find the document confusing. I'm reasonably confident that the Zs
category is the correct choice, but I'm less confident about the language we're using to describe it.
I think the modified comment as you've authored it is good 👍 it explains the intent and it links to the relevant documentation for readers to investigate further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after @adamziel raised his own similar questions I expanded the comment with more links
@dmsnell Would you be able to add some unit tests for this? Changing anything related to formatting without tests makes me pretty nervous. |
Concerning tests I will attempt to add a few but I started a meetup yesterday and don’t have guarantees on my timing. I welcome anyone wanting to submit some test cases — since this is a PR, feel free to fork my fork and propose a PR whose target branch is this one.
This will create a PR against this branch in my repo, which I can merge into this branch/PR. |
…dPress#9103) Resolves Core-62995 The `sanitize_file_name()` function normalizes the no-break space to a normal space (U+0020) in order to prevent issues saving files with the no-break space in it. This patch expands the replacement to all space characters, since it’s known that macOS stores a NARROW NO-BREAK SPACE (U+202F) in screenshot filenames between the time and the am/pm indicator. There are deeper issues with the way this function works, but this patch resolves a known and common problem without raising any of the deeper refactoring questions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a sufficient, minimal fix for the issue described. Thanks!
@desrosj what do you think? On this one I would like to have agreement of at least two separate committers. |
This does include tests now: phpunit --testdox --group=16330 Difference with trunk (+3 assertions): -OK (1 test, 4 assertions)
+OK (1 test, 7 assertions) If
|
…dPress#9103) Resolves Core-62995 The `sanitize_file_name()` function normalizes the no-break space to a normal space (U+0020) in order to prevent issues saving files with the no-break space in it. This patch expands the replacement to all space characters, since it’s known that macOS stores a NARROW NO-BREAK SPACE (U+202F) in screenshot filenames between the time and the am/pm indicator. There are deeper issues with the way this function works, but this patch resolves a known and common problem without raising any of the deeper refactoring questions.
…ption__suggestions' into filenames/avoid-collation-corruption
e3ce336
to
5d836bf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the work on this one, all! I think the approach of expanding to all Space_Separator
characters makes sense.
The `sanitize_file_name()` function normalizes the no-break space to a normal space (U+0020) in order to prevent issues saving files with the no-break space in it. This patch expands the replacement to all space characters, since it’s known that macOS stores a NARROW NO-BREAK SPACE (U+202F) in screenshot filenames between the time and the am/pm indicator. There are deeper issues with the way this function works, but this patch resolves a known and common problem without raising any of the deeper refactoring questions. Developed in #9103 Discussed in https://core.trac.wordpress.org/ticket/62995 Props audrasjb, desrosj, dmsnell, jonsurrell, matt, room34, siliconforks, zieladam. Fixes #62995. git-svn-id: https://develop.svn.wordpress.org/trunk@60399 602fd350-edb4-49c9-b593-d223f7449a82
The `sanitize_file_name()` function normalizes the no-break space to a normal space (U+0020) in order to prevent issues saving files with the no-break space in it. This patch expands the replacement to all space characters, since it’s known that macOS stores a NARROW NO-BREAK SPACE (U+202F) in screenshot filenames between the time and the am/pm indicator. There are deeper issues with the way this function works, but this patch resolves a known and common problem without raising any of the deeper refactoring questions. Developed in WordPress/wordpress-develop#9103 Discussed in https://core.trac.wordpress.org/ticket/62995 Props audrasjb, desrosj, dmsnell, jonsurrell, matt, room34, siliconforks, zieladam. Fixes #62995. Built from https://develop.svn.wordpress.org/trunk@60399 git-svn-id: http://core.svn.wordpress.org/trunk@59735 1a063a9b-81f0-0310-95a4-ce76da25c4cd
The `sanitize_file_name()` function normalizes the no-break space to a normal space (U+0020) in order to prevent issues saving files with the no-break space in it. This patch expands the replacement to all space characters, since it’s known that macOS stores a NARROW NO-BREAK SPACE (U+202F) in screenshot filenames between the time and the am/pm indicator. There are deeper issues with the way this function works, but this patch resolves a known and common problem without raising any of the deeper refactoring questions. Developed in WordPress/wordpress-develop#9103 Discussed in https://core.trac.wordpress.org/ticket/62995 Props audrasjb, desrosj, dmsnell, jonsurrell, matt, room34, siliconforks, zieladam. Fixes #62995. Built from https://develop.svn.wordpress.org/trunk@60399 git-svn-id: https://core.svn.wordpress.org/trunk@59735 1a063a9b-81f0-0310-95a4-ce76da25c4cd
Trac ticket: Core-62995
The
sanitize_file_name()
function normalizes the no-break space to a normal space (U+0020) in order to prevent issues saving files with the no-break space in it.This patch expands the replacement to all space characters, since it’s known that macOS stores a NARROW NO-BREAK SPACE (U+202F) in screenshot filenames between the time and the am/pm indicator.
There are deeper issues with the way this function works, but this patch resolves a known and common problem without raising any of the deeper refactoring questions.