Improve handling of non-ASCII filenames #26

Closed
yan12125 opened this issue Nov 24, 2017 · 2 comments

Comments

@yan12125
Contributor

Here are some examples that do not behave the same as in vanilla Firefox.

  1. https://drive.google.com/file/d/0B7pIvhrJqP6xaGNkVldaeUpuRG8/view

The filename is 測試.txt, while open-in-browser displays __.txt. That's because RFC 6266 is not correctly implemented. The Content-Disposition line for this file is:

attachment;filename="__.txt";filename*=UTF-8''%E6%B8%AC%E8%A9%A6.txt

According to RFC 6266:

when both "filename" and "filename*" are present in a single header field value, recipients SHOULD pick "filename*" and ignore "filename".

%E6%B8%AC%E8%A9%A6.txt should be used here. That's exactly 測試.txt.
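The RFC 6266 preference rule quoted above can be sketched roughly as follows. This is only an illustration in Python, not the extension's code: `pick_filename` is a hypothetical helper, the regular expressions are deliberately loose, and a real parser would have to tokenize the header properly.

```python
import re
from typing import Optional
from urllib.parse import unquote


def pick_filename(header: str) -> Optional[str]:
    """Prefer filename* (RFC 5987 ext-value) over filename, as RFC 6266 recommends."""
    # Extended form: filename*=charset'language'percent-encoded-value
    ext = re.search(r"filename\*\s*=\s*([\w.-]+)'[\w-]*'([^;\s]+)", header, re.I)
    if ext:
        charset, value = ext.group(1), ext.group(2)
        try:
            return unquote(value, encoding=charset, errors="strict")
        except (LookupError, UnicodeDecodeError):
            pass  # unknown charset or invalid bytes: fall back to plain filename

    # Plain form: filename="quoted string" or filename=token
    plain = re.search(r'filename\s*=\s*(?:"((?:[^"\\]|\\.)*)"|([^;\s]+))', header, re.I)
    if plain is None:
        return None
    quoted, token = plain.group(1), plain.group(2)
    if quoted is not None:
        return re.sub(r"\\(.)", r"\1", quoted)  # undo backslash escapes
    return token


header = "attachment;filename=\"__.txt\";filename*=UTF-8''%E6%B8%AC%E8%A9%A6.txt"
print(pick_filename(header))
```

With the Google Drive header from this report, the sketch picks the `filename*` value and ignores the `__.txt` fallback.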

Similar bug reports and fixes:

By the way, here is one of the new test cases in wget's commit:

"filename**0=\"A\"; filename**1=\"A.ext\"; filename*0=\"B\";filename*1=\"B\"", "AA.ext"

I bet correctly implementing RFC 6266 is not easy.

  2. https://www.csie.ntu.edu.tw/download.php?filename=13101_7da5e585.pdf&dir=news&title=%E5%9C%8B%E7%AB%8B%E8%87%BA%E7%81%A3%E5%A4%A7%E5%AD%B8%E5%AD%B8%E7%94%9F%E9%80%95%E8%A1%8C%E4%BF%AE%E8%AE%80%E5%8D%9A%E5%A3%AB%E5%AD%B8%E4%BD%8D%E8%BE%A6%E6%B3%951060609

This website is misconfigured and returns filenames as raw UTF-8 without quoting:

attachment; filename=國立臺灣大學學生逕行修讀博士學位辦法1060609.pdf

If I disable the open-in-browser extension, Firefox uses 國立臺灣大學學生逕行修讀博士學位辦法1060609.pdf as the filename, while open-in-browser says:

åç«èºç£å¤§å¸å¸çéè¡ä¿®è®å士å¸ä½è¾¦æ³1060609.pdf

That's because Firefox re-encodes the header with ISO-8859-1. I guess Firefox has some heuristic for recoding filenames back to UTF-8. In my PR for est31's version, I recode raw filenames back to UTF-8 unconditionally. I'm not sure whether that is a good approach.
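For illustration, one guarded variant of the recoding idea is sketched below in Python. It is an assumption-laden sketch, not what Firefox or the PR actually does: `recode_latin1_as_utf8` is a hypothetical helper that reverses the ISO-8859-1 step only when the resulting bytes happen to be valid UTF-8, and otherwise leaves the name alone.

```python
def recode_latin1_as_utf8(raw: str) -> str:
    """Undo an ISO-8859-1 decoding of UTF-8 bytes, but only if that is lossless."""
    try:
        # Map each character back to the byte it came from, then read as UTF-8.
        return raw.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the string has characters outside ISO-8859-1, or the bytes are
        # not valid UTF-8; in both cases assume the name was already correct.
        return raw


original = "國立臺灣大學.pdf"
mangled = original.encode("utf-8").decode("latin-1")  # simulate the mojibake
print(recode_latin1_as_utf8(mangled) == original)
```

The trade-off is that a genuine ISO-8859-1 name whose bytes also form valid UTF-8 would be recoded incorrectly, although such names are presumably rare.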

Rob--W added a commit that referenced this issue Nov 27, 2017
The current Content-Disposition parser is very poor (see e.g. #26).
Let the browser determine the file name since they are probably better
at it.
@Rob--W Rob--W closed this as completed in 6f3bbb8 Nov 27, 2017
Rob--W added a commit that referenced this issue Nov 27, 2017
- Recognize non-ASCII file names (#26)
- Parse Content-Disposition according to RFC 2047, 2231, 5987, 6266
- Fix default action on Linux/macOS when pressing Enter (#27)
- Work around issue that prematurely closed the dialog (#28)
@Rob--W
Owner

Rob--W commented Nov 27, 2017

I decided to roll my own Content-Disposition parser (6f3bbb8) because the library that you suggested is incomplete.
In particular it does not support RFC 2047 (which is obsolete but still supported in Firefox), and also lacks support for parameter continuations (jshttp/content-disposition#2).

The test case that you referenced from wget is not valid either; I opened a bug for that at https://savannah.gnu.org/bugs/index.php?52531
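For readers unfamiliar with the parameter continuations mentioned above, a minimal Python sketch of joining RFC 2231 segments (`filename*0`, `filename*1`, ...) follows. It is not the parser from 6f3bbb8: `join_continuations` is a hypothetical helper, it assumes each percent-encoded segment decodes on its own (no multibyte character split across segments), and it raises `LookupError` on an unknown charset.

```python
import re
from typing import Optional
from urllib.parse import unquote


def join_continuations(header: str, name: str = "filename") -> Optional[str]:
    """Join name*0, name*1, ... segments; a trailing '*' marks a percent-encoded one."""
    pattern = re.compile(
        re.escape(name) + r'\*(\d+)(\*?)\s*=\s*(?:"([^"]*)"|([^;\s]+))', re.I
    )
    parts = {}
    for m in pattern.finditer(header):
        value = m.group(3) if m.group(3) is not None else m.group(4)
        # Keep the first occurrence of each index.
        parts.setdefault(int(m.group(1)), (m.group(2) == "*", value))

    charset = "utf-8"  # assumed default when segment 0 declares none
    pieces = []
    index = 0
    while index in parts:  # stop at the first gap in the numbering
        extended, value = parts[index]
        if extended and index == 0 and value.count("'") >= 2:
            # Only the first segment carries charset'language'value.
            declared, _language, value = value.split("'", 2)
            charset = declared or charset
        if extended:
            value = unquote(value, encoding=charset, errors="replace")
        pieces.append(value)
        index += 1
    return "".join(pieces) if pieces else None


print(join_continuations('attachment; filename*0="foo"; filename*1="bar.txt"'))
```

Segments are collected by index first, so their order in the header does not matter; only a gap in the numbering ends the join.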

@yan12125
Contributor Author

Thank you very much for the parser. It's useful and easy to understand.
