fix: HTTP filename parsing for mixed encodings and GBK-encoded #1189

Copilot · 2025-11-27T07:54:27Z

Fixes garbled Chinese filenames in Content-Disposition headers caused by encoding issues.

Problem

Chinese filenames like 下载地址.zip can appear as garbled characters (e.g., ???ص?ַ.zip) due to:

Mixed encoding headers where mime.ParseMediaType fails but filename*=UTF-8''... is still valid
GBK-encoded filenames from Chinese Windows servers being misinterpreted as UTF-8

Root Cause Analysis

The garbled filename ???ص?ַ.zip (hex: 3F 3F 3F D8 B5 3F D6 B7 2E 7A 69 70) is produced when:

Chinese text 下载地址.zip is encoded as GBK: CF C2 D4 D8 B5 D8 D6 B7 2E 7A 69 70
GBK bytes are interpreted as UTF-8
Invalid UTF-8 sequences become ? (0x3F), while valid sequences (like D8 B5 → Arabic ص) survive
Result: 3F 3F 3F D8 B5 3F D6 B7 2E 7A 69 70

Changes

parseFilename() - Centralized parsing that checks filename*= (RFC 5987) before mime.ParseMediaType
parseFilenameExtended() - Manually parses RFC 5987 format when header is malformed
tryDecodeGBK() - Detects invalid UTF-8 and attempts GBK decoding as fallback
simulateGBKCorruption() - Test helper that dynamically reproduces the real-world corruption pattern
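As a rough illustration of the manual RFC 5987 parsing described above, the sketch below pulls a `filename*=` value straight out of the raw header string. `extractFilenameStar` is a hypothetical stand-in for `parseFilenameExtended()`; the PR's actual implementation may differ, and this version only accepts the UTF-8 charset.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// extractFilenameStar (hypothetical) scans the raw header for a filename*=
// parameter and decodes its RFC 5987 value: charset'language'percent-encoded.
// It returns "" when the parameter is missing, malformed, or not UTF-8.
func extractFilenameStar(header string) string {
	const key = "filename*="
	start := -1
	// Byte-wise, case-insensitive search; the header may hold invalid UTF-8,
	// so avoid lowercasing the whole string (that could shift byte offsets).
	for i := 0; i+len(key) <= len(header); i++ {
		if strings.EqualFold(header[i:i+len(key)], key) {
			start = i + len(key)
			break
		}
	}
	if start < 0 {
		return ""
	}
	value := header[start:]
	if end := strings.IndexByte(value, ';'); end >= 0 {
		value = value[:end] // stop at the next parameter
	}
	parts := strings.SplitN(strings.TrimSpace(value), "'", 3)
	if len(parts) != 3 || !strings.EqualFold(parts[0], "UTF-8") {
		return ""
	}
	name, err := url.PathUnescape(parts[2])
	if err != nil {
		return ""
	}
	return name
}

func main() {
	h := `attachment; filename="<a>garbled</a>"; filename*=UTF-8''%E6%B5%8B%E8%AF%95.zip`
	fmt.Println(extractFilenameStar(h))
	// prints: 测试.zip
}
```

Because the search does not depend on the rest of the header being well formed, a broken `filename=` parameter does not prevent the `filename*=` value from being recovered.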

Test Endpoints

/mixed-encoding - Headers with both garbled filename= and valid filename*=UTF-8''...
/filename-star - RFC 5987 format only
/gbk-encoded - Pure GBK bytes (fixable case)
/corrupted-filename - Dynamically generated corrupted bytes matching real-world pattern

What the fix handles

| Scenario | Before | After |
| --- | --- | --- |
| Pure GBK bytes | `²âÊÔ.zip` (garbled) | `测试.zip` ✓ |
| Mixed encoding with `filename*=` | Empty or garbled | Correctly extracted ✓ |
| Already-corrupted bytes | `???ص?ַ.zip` | Cannot recover (data lost server-side) |

Fixes 文件名乱码 (Garbled filename) #1172

Original prompt

This section details the original issue you should resolve

<issue_title>Garbled filename (文件名乱码)</issue_title>
<issue_description>
</issue_description>

<agent_instructions>Handle parsing of filenames from Content-Disposition headers that mix multiple encodings, for example:
Content-Disposition: attachment;filename="ã€�ä¸�å¿˜åˆ�å¿ƒç¾ŽåŒ–ç‰ˆã€‘æ— å¿§ä¸“äº«Windows10_22H2_19045.6282_X64_æ— æ›´æ–°[çº¯å‡€ç²¾ç®€ç‰ˆ]1.6G.esd";filename*=UTF-8''%E3%80%90%E4%B8%8D%E5%BF%98%E5%88%9D%E5%BF%83%E7%BE%8E%E5%8C%96%E7%89%88%E3%80%91%E6%97%A0%E5%BF%A7%E4%B8%93%E4%BA%ABWindows10_22H2_19045.6282_X64_%E6%97%A0%E6%9B%B4%E6%96%B0%5B%E7%BA%AF%E5%87%80%E7%B2%BE%E7%AE%80%E7%89%88%5D%5B1.6G%5D(2025.8.28).esd

Also add the corresponding test endpoints in httptest.go.

In addition, find out why 下载地址.zip is parsed as ��u�.zip, and provide a correct parsing approach along with test cases.</agent_instructions>

Comments on the Issue (you are @copilot in this section)
@monkeyWie Could you provide a download link?



monkeyWie · 2025-11-27T08:35:29Z

@copilot The garbled-filename problem does not appear to be fixed yet, because this latin1-mangled test case was already parsed correctly by the old code.


Copilot · 2025-11-27T08:46:25Z

@copilot The garbled-filename problem does not appear to be fixed yet, because this latin1-mangled test case was already parsed correctly by the old code.

You're right - the latin1-mangled test case was working with the original code because mime.ParseMediaType already handles raw UTF-8 bytes correctly.

The actual issue is when mime.ParseMediaType fails due to invalid characters in the Content-Disposition header (like <a> tags in the original issue), but the filename*= parameter is still valid.

Fixed in 0d0be4b by:

Removing the incorrect latin1-mangled test case
Updating mixed-encoding test to use invalid chars that trigger mime.ParseMediaType failure
Simplifying the code by removing unused tryRecoverUTF8 function

monkeyWie · 2025-11-27T09:11:57Z

@copilot Still not right. I want you to reverse-engineer a test case: first produce a case that parses to ��u�.zip before the fix, and only then fix it.


Copilot · 2025-11-27T09:24:45Z

@copilot Still not right. I want you to reverse-engineer a test case: first produce a case that parses to ��u�.zip before the fix, and only then fix it.

I've identified and fixed the actual garbled filename issue in 7164448.

The problem was GBK-encoded filenames from Chinese Windows servers. When a server sends 测试.zip as GBK bytes [178 226 202 212 46 122 105 112], Go treats them as UTF-8 even though they are not valid UTF-8, so the name shows up garbled (for example as ²âÊÔ.zip when the bytes are rendered as Latin-1).

Test case added: /gbk-encoded endpoint that sends GBK-encoded TestChineseFileName - before the fix this would parse as garbled characters, after the fix it correctly decodes to 测试.zip.

The fix detects invalid UTF-8 strings and attempts GBK decoding as a fallback.

codecov · 2025-11-27T12:46:05Z

Codecov Report

❌ Patch coverage is 73.68421% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.61%. Comparing base (49b4d99) to head (948d1fd).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/protocol/http/fetcher.go | 70.76% | 13 Missing and 6 partials ⚠️ |
| internal/test/httptest.go | 80.00% | 3 Missing and 3 partials ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1189      +/-   ##
==========================================
+ Coverage   68.53%   68.61%   +0.08%     
==========================================
  Files          43       43              
  Lines        4348     4423      +75     
==========================================
+ Hits         2980     3035      +55     
- Misses       1043     1054      +11     
- Partials      325      334       +9


monkeyWie · 2025-11-27T12:51:26Z

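@copilot The hex dump of the response header value for the garbled 下载地址.zip is: 61 74 74 61 63 68 6D 65 6E 74 3B 20 66 69 6C 65 6E 61 6D 65 3D 22 3F 3F 3F D8 B5 3F D6 B7 2E 7A 69 70 22

Please reverse-engineer how this garbling is produced, and adjust the corresponding test cases and decoding logic.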


Copilot · 2025-11-27T12:58:21Z

@copilot The hex dump of the response header value for the garbled 下载地址.zip is: 61 74 74 61 63 68 6D 65 6E 74 3B 20 66 69 6C 65 6E 61 6D 65 3D 22 3F 3F 3F D8 B5 3F D6 B7 2E 7A 69 70 22

Please reverse-engineer how this garbling...

Analyzed in 948d1fd.

The hex bytes 3F 3F 3F D8 B5 3F D6 B7 show already-corrupted data:

3F = ? (replacement for lost bytes)
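D8 B5 and D6 B7 = byte pairs that survived because they happen to form valid UTF-8 sequences (D8 B5 straddles the second byte of 载 and the first byte of 地; D6 B7 is the GBK encoding of 址)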

Original 下载地址 in GBK: CF C2 D4 D8 B5 D8 D6 B7
Corrupted version: 3F 3F 3F D8 B5 3F D6 B7

The corruption happened server-side (some charset conversion replaced bytes with ?). Since the corrupted bytes are valid UTF-8, we cannot detect or recover the original.

What my fix handles:

Pure GBK bytes [B2 E2 CA D4] ("测试") → Invalid UTF-8 → GBK decoded → ✓
Corrupted bytes [3F 3F 3F D8 B5 3F D6 B7] → Valid UTF-8 → Cannot recover → ✗

The /gbk-encoded test verifies the fixable case (pure GBK bytes).
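The difference between the two cases comes down to a UTF-8 validity check, which the following sketch makes concrete. `looksLegacyEncoded` is a hypothetical name used only for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// looksLegacyEncoded (hypothetical) reports whether the bytes are NOT valid
// UTF-8, which is the only signal available for triggering a GBK fallback.
func looksLegacyEncoded(b []byte) bool {
	return !utf8.Valid(b)
}

func main() {
	pureGBK := []byte{0xB2, 0xE2, 0xCA, 0xD4}                           // 测试 in GBK
	corrupted := []byte{0x3F, 0x3F, 0x3F, 0xD8, 0xB5, 0x3F, 0xD6, 0xB7} // reported header bytes

	fmt.Println(looksLegacyEncoded(pureGBK))   // prints: true
	fmt.Println(looksLegacyEncoded(corrupted)) // prints: false
}
```

Since the corrupted sequence passes the validity check, a decoder that relies on this signal has no reason to treat it differently from an ordinary UTF-8 filename.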

monkeyWie · 2025-11-27T13:31:19Z

@copilot The bytes in the test case's response still do not match the ones I provided:
Target: 3F 3F 3F D8 B5 3F D6 B7 2E 7A 69 70
Actual: cf c2 d4 d8 b5 d8 d6 b7 2e 7a 69 70

Copilot · 2025-11-27T13:37:18Z

@copilot The bytes in the test case's response still do not match the ones I provided:
Target: 3F 3F 3F D8 B5 3F D6 B7 2E 7A 69 70
Actual: cf c2 d4 d8 b5 d8 d6 b7 2e 7a 69 70

Added in 51f98c1.

Created new /corrupted-filename endpoint that sends the exact bytes you provided:

Target: 3F 3F 3F D8 B5 3F D6 B7 2E 7A 69 70 ✓

The /gbk-encoded endpoint still sends pure GBK bytes (CF C2 D4 D8 B5 D8 D6 B7) which is the fixable case.

Note: The corrupted bytes [3F 3F 3F D8 B5 3F D6 B7] are valid UTF-8 (they render as ???ص?ַ), so we cannot programmatically detect or recover the original 下载地址 from them. The corruption happened before the data was sent.

Initial plan

605eee3

Copilot AI assigned Copilot and monkeyWie Nov 27, 2025

Copilot started work on behalf of monkeyWie November 27, 2025 07:54 View session

Copilot AI and others added 2 commits November 27, 2025 08:09

Fix Content-Disposition parsing for mixed encoding and Latin-1 mangle…

c9b88f9

…d UTF-8 Co-authored-by: monkeyWie <[email protected]>

Address code review feedback: use unicode.ReplacementChar constant an…

71d7d7b

…d preserve original filename Co-authored-by: monkeyWie <[email protected]>

Copilot AI changed the title ~~[WIP] Fix filename garbling in Content-Disposition header~~ Fix Content-Disposition filename parsing for mixed encodings and Latin-1 mangled UTF-8 Nov 27, 2025

Copilot AI requested a review from monkeyWie November 27, 2025 08:15

Copilot finished work on behalf of monkeyWie November 27, 2025 08:15

Copilot started work on behalf of monkeyWie November 27, 2025 08:35 View session

Fix: Remove incorrect latin1-mangled test, focus on actual issue (fil…

0d0be4b

…ename*= parsing when mime.ParseMediaType fails) Co-authored-by: monkeyWie <[email protected]>

Copilot AI changed the title ~~Fix Content-Disposition filename parsing for mixed encodings and Latin-1 mangled UTF-8~~ Fix Content-Disposition filename parsing when mime.ParseMediaType fails Nov 27, 2025

Copilot finished work on behalf of monkeyWie November 27, 2025 08:49

Copilot started work on behalf of monkeyWie November 27, 2025 09:12 View session

Fix GBK-encoded filename parsing (reproduces garbled character issue)

7164448

Co-authored-by: monkeyWie <[email protected]>

Copilot AI changed the title ~~Fix Content-Disposition filename parsing when mime.ParseMediaType fails~~ Fix Content-Disposition filename parsing for mixed encodings and GBK-encoded filenames Nov 27, 2025

Copilot finished work on behalf of monkeyWie November 27, 2025 10:10

Merge branch 'main' into copilot/handle-content-disposition-encoding

7236630

Copilot started work on behalf of monkeyWie November 27, 2025 12:51 View session

Clarify GBK test case - fix handles pure GBK bytes, corrupted data ca…

948d1fd

…nnot be recovered Co-authored-by: monkeyWie <[email protected]>

Copilot stopped work on behalf of monkeyWie due to an error November 27, 2025 13:34
Copilot has encountered an error. See logs for additional details.

Copilot started work on behalf of monkeyWie November 27, 2025 13:34 View session

Copilot finished work on behalf of monkeyWie November 27, 2025 13:59

monkeyWie force-pushed the copilot/handle-content-disposition-encoding branch from c3bad0a to 948d1fd Compare November 27, 2025 14:14

monkeyWie marked this pull request as ready for review November 27, 2025 14:14

monkeyWie added the bug Something isn't working label Nov 27, 2025

monkeyWie changed the title ~~Fix Content-Disposition filename parsing for mixed encodings and GBK-encoded filenames~~ fix: HTTP filename parsing for mixed encodings and GBK-encoded Nov 27, 2025

monkeyWie merged commit 2f5fbfb into main Nov 29, 2025
41 checks passed

monkeyWie deleted the copilot/handle-content-disposition-encoding branch November 29, 2025 03:34
