Skip to content

sanitize_file_name(): Normalize all space characters to a space. #9103

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

dmsnell
Copy link
Member

@dmsnell dmsnell commented Jun 27, 2025

Trac ticket: Core-62995

The sanitize_file_name() function normalizes the no-break space to a normal space (U+0020) in order to prevent issues saving files with the no-break space in it.

This patch expands the replacement to all space characters, since it’s known that macOS stores a NARROW NO-BREAK SPACE (U+202F) in screenshot filenames between the time and the am/pm indicator.

There are deeper issues with the way this function works, but this patch resolves a known and common problem without raising any of the deeper refactoring questions.

Copy link

github-actions bot commented Jun 27, 2025

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, jonsurrell, zieladam, desrosj.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

…dPress#9103)

Resolves Core-62995

The `sanitize_file_name()` function normalizes the no-break space
to a normal space (U+0020) in order to prevent issues saving files
with the no-break space in it.

This patch expands the replacement to all space characters, since
it’s known that macOS stores a NARROW NO-BREAK SPACE (U+202F) in
screenshot filenames between the time and the am/pm indicator.

There are deeper issues with the way this function works, but this
patch resolves a known and common problem without raising any of
the deeper refactoring questions.
@dmsnell dmsnell force-pushed the filenames/avoid-collation-corruption branch from ef23a57 to 198fce4 Compare June 27, 2025 21:55
Copy link

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

Copy link
Member

@sirreal sirreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an appropriate fix, to simply extend the space replacement to cover the space class. Does this have tests? It would be good to add a case for the known case described in the ticket.

The PHP page on unicode character properties is here. They don't seem to support using the full character property name like JavaScript, which is unfortunate.

I'd consider linking here instead of the data file: https://www.unicode.org/reports/tr44/#General_Category_Values. It seems a bit friendlier for human documentation, but I don't feel strongly.

Comment on lines 2039 to 2042
* Replace all whitespace characters with a basic space (U+0020).
*
* Characters in the White_Space category are listed with “Zs” in
* their entry in the UnicodeData file maintained at the linked URL.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the correct name for this property is "space separator," not "white space".

Suggested change
* Replace all whitespace characters with a basic space (U+0020).
*
* Characters in the White_Space category are listed with “Zs” in
* their entry in the UnicodeData file maintained at the linked URL.
* Replace all whitespace characters with a basic space (U+0020).
*
* Characters in the “Space_Separator” category are listed with “Zs” in
* their entry in the UnicodeData file maintained at the linked URL.

Zl Line separator
Zp Paragraph separator
Zs Space separator

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting. this document is confusing to me. I think the category of Space_Separator (Zs) is meant to include characters with the binary White_Space property. I’m not entirely sure that all characters with the White_Space property are in the Space_Separator category.

since this is directly referencing Zs in the PCRE pattern, I think your suggestion is best.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I typed with the wording; would appreciate your thoughts

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also find the document confusing. I'm reasonably confident that the Zs category is the correct choice, but I'm less confident about the language we're using to describe it.

I think the modified comment as you've authored it is good 👍 it explains the intent and it links to the relevant documentation for readers to investigate further.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after @adamziel raised his own similar questions I expanded the comment with more links

@desrosj
Copy link
Member

desrosj commented Jul 1, 2025

@dmsnell Would you be able to add some unit tests for this? Changing anything related to formatting without tests makes me pretty nervous.

@dmsnell
Copy link
Member Author

dmsnell commented Jul 1, 2025

Concerning tests I will attempt to add a few but I started a meetup yesterday and don’t have guarantees on my timing.

I welcome anyone wanting to submit some test cases — since this is a PR, feel free to fork my fork and propose a PR whose target branch is this one.

git remote add dmsnell [email protected]:dmsnell/wordpress-develop.git
git fetch dmsnell
git checkout my-tests
git reset --hard dmsnell/filenames/avoid-collation-corruption
# make edits
git push -u origin my-tests

This will create a PR against this branch in my repo, which I can merge into this branch/PR.

dmsnell added a commit to dmsnell/wordpress-develop that referenced this pull request Jul 1, 2025
…dPress#9103)

Resolves Core-62995

The `sanitize_file_name()` function normalizes the no-break space
to a normal space (U+0020) in order to prevent issues saving files
with the no-break space in it.

This patch expands the replacement to all space characters, since
it’s known that macOS stores a NARROW NO-BREAK SPACE (U+202F) in
screenshot filenames between the time and the am/pm indicator.

There are deeper issues with the way this function works, but this
patch resolves a known and common problem without raising any of
the deeper refactoring questions.
Copy link
Member

@sirreal sirreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a sufficient, minimal fix for the issue described. Thanks!

@dmsnell
Copy link
Member Author

dmsnell commented Jul 1, 2025

@desrosj what do you think? On this one I would like to have agreement of at least two separate committers.

@sirreal
Copy link
Member

sirreal commented Jul 1, 2025

This does include tests now:

phpunit --testdox --group=16330

Difference with trunk (+3 assertions):

-OK (1 test, 4 assertions)
+OK (1 test, 7 assertions)

If wp-includes/formatting.php is reverted to trunk's version, the tests fail:

_Formatting_Sanitize File Name
 ✘ Replaces spaces
   ┐
   ├ Failed asserting that two strings are identical.
   ┊ ---·Expected
   ┊ +++·Actual
   ┊ @@ @@
   ┊ -'Screenshot-2025-02-19-at-2.17.33-PM.png'
   ┊ +'Screenshot-2025-02-19-at-2.17.33 PM.png'
   │
   ╵ …/tests/phpunit/tests/formatting/sanitizeFileName.php:52
   ┴

dmsnell added 5 commits July 1, 2025 17:03
…dPress#9103)

Resolves Core-62995

The `sanitize_file_name()` function normalizes the no-break space
to a normal space (U+0020) in order to prevent issues saving files
with the no-break space in it.

This patch expands the replacement to all space characters, since
it’s known that macOS stores a NARROW NO-BREAK SPACE (U+202F) in
screenshot filenames between the time and the am/pm indicator.

There are deeper issues with the way this function works, but this
patch resolves a known and common problem without raising any of
the deeper refactoring questions.
…ption__suggestions' into filenames/avoid-collation-corruption
@dmsnell dmsnell force-pushed the filenames/avoid-collation-corruption branch from e3ce336 to 5d836bf Compare July 1, 2025 15:03
Copy link
Member

@desrosj desrosj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work on this one, all! I think the approach of expanding to all Space_Separator characters makes sense.

pento pushed a commit that referenced this pull request Jul 1, 2025
The `sanitize_file_name()` function normalizes the no-break space to a normal space (U+0020) in order to prevent issues saving files with the no-break space in it.

This patch expands the replacement to all space characters, since it’s known that macOS stores a NARROW NO-BREAK SPACE (U+202F) in screenshot filenames between the time and the am/pm indicator.

There are deeper issues with the way this function works, but this patch resolves a known and common problem without raising any of the deeper refactoring questions.

Developed in #9103
Discussed in https://core.trac.wordpress.org/ticket/62995

Props audrasjb, desrosj, dmsnell, jonsurrell, matt, room34, siliconforks, zieladam.
Fixes #62995.


git-svn-id: https://develop.svn.wordpress.org/trunk@60399 602fd350-edb4-49c9-b593-d223f7449a82
Copy link

github-actions bot commented Jul 1, 2025

A commit was made that fixes the Trac ticket referenced in the description of this pull request.

SVN changeset: 60399
GitHub commit: 77e2afb

This PR will be closed, but please confirm the accuracy of this and reopen if there is more work to be done.

@github-actions github-actions bot closed this Jul 1, 2025
markjaquith pushed a commit to markjaquith/WordPress that referenced this pull request Jul 1, 2025
The `sanitize_file_name()` function normalizes the no-break space to a normal space (U+0020) in order to prevent issues saving files with the no-break space in it.

This patch expands the replacement to all space characters, since it’s known that macOS stores a NARROW NO-BREAK SPACE (U+202F) in screenshot filenames between the time and the am/pm indicator.

There are deeper issues with the way this function works, but this patch resolves a known and common problem without raising any of the deeper refactoring questions.

Developed in WordPress/wordpress-develop#9103
Discussed in https://core.trac.wordpress.org/ticket/62995

Props audrasjb, desrosj, dmsnell, jonsurrell, matt, room34, siliconforks, zieladam.
Fixes #62995.

Built from https://develop.svn.wordpress.org/trunk@60399


git-svn-id: http://core.svn.wordpress.org/trunk@59735 1a063a9b-81f0-0310-95a4-ce76da25c4cd
github-actions bot pushed a commit to platformsh/wordpress-performance that referenced this pull request Jul 1, 2025
The `sanitize_file_name()` function normalizes the no-break space to a normal space (U+0020) in order to prevent issues saving files with the no-break space in it.

This patch expands the replacement to all space characters, since it’s known that macOS stores a NARROW NO-BREAK SPACE (U+202F) in screenshot filenames between the time and the am/pm indicator.

There are deeper issues with the way this function works, but this patch resolves a known and common problem without raising any of the deeper refactoring questions.

Developed in WordPress/wordpress-develop#9103
Discussed in https://core.trac.wordpress.org/ticket/62995

Props audrasjb, desrosj, dmsnell, jonsurrell, matt, room34, siliconforks, zieladam.
Fixes #62995.

Built from https://develop.svn.wordpress.org/trunk@60399


git-svn-id: https://core.svn.wordpress.org/trunk@59735 1a063a9b-81f0-0310-95a4-ce76da25c4cd
@dmsnell dmsnell deleted the filenames/avoid-collation-corruption branch July 1, 2025 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants