
Conversation

Contributor

Copilot AI commented Nov 27, 2025

Fixes garbled Chinese filenames in Content-Disposition headers caused by encoding issues.

Problem

Chinese filenames like 下载地址.zip can appear as garbled characters (e.g., ???ص?ַ.zip) due to:

  1. Mixed encoding headers where mime.ParseMediaType fails but filename*=UTF-8''... is still valid
  2. GBK-encoded filenames from Chinese Windows servers being misinterpreted as UTF-8

Root Cause Analysis

The garbled filename ???ص?ַ.zip (hex: 3F 3F 3F D8 B5 3F D6 B7 2E 7A 69 70) is produced when:

  1. Chinese text 下载地址.zip is encoded as GBK: CF C2 D4 D8 B5 D8 D6 B7 2E 7A 69 70
  2. GBK bytes are interpreted as UTF-8
  3. Invalid UTF-8 sequences become ? (0x3F), while valid sequences (like D8 B5 → Arabic ص) survive
  4. Result: 3F 3F 3F D8 B5 3F D6 B7 2E 7A 69 70
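The four steps above can be reproduced with the standard library alone. This is a minimal sketch, not the PR's simulateGBKCorruption() helper: the names replaceInvalidUTF8 and hexDump are hypothetical, and the GBK bytes are hard-coded from step 1 rather than produced by an encoder.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// replaceInvalidUTF8 is a hypothetical helper: it copies valid UTF-8
// sequences through unchanged and replaces each undecodable byte with '?'.
func replaceInvalidUTF8(raw []byte) []byte {
	out := make([]byte, 0, len(raw))
	for len(raw) > 0 {
		r, size := utf8.DecodeRune(raw)
		if r == utf8.RuneError && size == 1 {
			// A size of 1 with RuneError marks an invalid byte
			// (a genuine U+FFFD would decode with size 3).
			out = append(out, '?')
		} else {
			out = append(out, raw[:size]...)
		}
		raw = raw[size:]
	}
	return out
}

// hexDump formats bytes as space-separated upper-case hex pairs.
func hexDump(b []byte) string {
	return fmt.Sprintf("% X", b)
}

func main() {
	// GBK encoding of 下载地址.zip, copied from step 1 above.
	gbk := []byte{0xCF, 0xC2, 0xD4, 0xD8, 0xB5, 0xD8, 0xD6, 0xB7, 0x2E, 0x7A, 0x69, 0x70}
	fmt.Println(hexDump(replaceInvalidUTF8(gbk)))
	// prints: 3F 3F 3F D8 B5 3F D6 B7 2E 7A 69 70
}
```

Note that D8 B5 straddles two GBK characters (the second byte of 载 and the first byte of 地); it survives only because that pair happens to be a valid two-byte UTF-8 sequence.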

Changes

  • parseFilename() function - Centralized parsing that checks filename*= (RFC 5987) first before mime.ParseMediaType
  • parseFilenameExtended() - Manually parses RFC 5987 format when header is malformed
  • tryDecodeGBK() - Detects invalid UTF-8 and attempts GBK decoding as fallback
  • simulateGBKCorruption() - Test helper that dynamically reproduces the real-world corruption pattern
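To illustrate the idea behind the manual filename*= parsing, here is a rough sketch using only the standard library. It is not the code merged in this PR: the function name extendedFilename is hypothetical, it accepts only the UTF-8 charset, and it assumes that splitting the header on ";" is good enough (a ";" inside a garbled quoted filename= value only produces extra parts that are skipped).

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
	"unicode/utf8"
)

// extendedFilename is a hypothetical helper that looks for a
// filename*=UTF-8''... parameter (RFC 5987 style) without relying on
// mime.ParseMediaType, so it still works when the rest of the header
// is malformed.
func extendedFilename(header string) (string, bool) {
	const prefix = "filename*="
	for _, part := range strings.Split(header, ";") {
		part = strings.TrimSpace(part)
		if len(part) < len(prefix) || !strings.EqualFold(part[:len(prefix)], prefix) {
			continue
		}
		value := strings.Trim(part[len(prefix):], `"`)
		// Expected shape: charset ' language ' percent-encoded-value
		fields := strings.SplitN(value, "'", 3)
		if len(fields) != 3 || !strings.EqualFold(fields[0], "UTF-8") {
			return "", false
		}
		name, err := url.PathUnescape(fields[2])
		if err != nil || name == "" || !utf8.ValidString(name) {
			return "", false
		}
		return name, true
	}
	return "", false
}

func main() {
	header := `attachment; filename="<bad>.zip"; filename*=UTF-8''%E6%B5%8B%E8%AF%95.zip`
	name, ok := extendedFilename(header)
	fmt.Println(name, ok)
}
```

url.PathUnescape is used rather than url.QueryUnescape so that a literal "+" in the value is not turned into a space.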

Test Endpoints

  • /mixed-encoding - Headers with both garbled filename= and valid filename*=UTF-8''...
  • /filename-star - RFC 5987 format only
  • /gbk-encoded - Pure GBK bytes (fixable case)
  • /corrupted-filename - Dynamically generated corrupted bytes matching real-world pattern

What the fix handles

Scenario | Before | After
Pure GBK bytes | ²âÊÔ.zip (garbled) | 测试.zip
Mixed encoding with filename*= | Empty or garbled | Correctly extracted ✓
Already-corrupted bytes | ???ص?ַ.zip | Cannot recover (data lost server-side)
Original prompt

This section details the original issue you should resolve

<issue_title>Garbled filename</issue_title>
<issue_description>Image
</issue_description>

<agent_instructions>Handle parsing filenames from a Content-Disposition header that mixes several encodings, for example:
Content-Disposition: attachment;filename="ã€�ä¸�忘åˆ�å¿ƒç¾ŽåŒ–ç‰ˆã€‘æ— å¿§ä¸“äº«Windows10_22H2_19045.6282_X64_æ— æ›´æ–°[纯净精简版]1.6G.esd";filename*=UTF-8''%E3%80%90%E4%B8%8D%E5%BF%98%E5%88%9D%E5%BF%83%E7%BE%8E%E5%8C%96%E7%89%88%E3%80%91%E6%97%A0%E5%BF%A7%E4%B8%93%E4%BA%ABWindows10_22H2_19045.6282_X64_%E6%97%A0%E6%9B%B4%E6%96%B0%5B%E7%BA%AF%E5%87%80%E7%B2%BE%E7%AE%80%E7%89%88%5D%5B1.6G%5D(2025.8.28).esd

Also add a corresponding test endpoint in httptest.go.

In addition, find out why 下载地址.zip is parsed as ����u�.zip, and provide a correct parsing approach and test cases.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@monkeyWie Could you provide a download link?


Copilot AI changed the title from "[WIP] Fix filename garbling in Content-Disposition header" to "Fix Content-Disposition filename parsing for mixed encodings and Latin-1 mangled UTF-8" Nov 27, 2025
Copilot AI requested a review from monkeyWie November 27, 2025 08:15
Copilot finished work on behalf of monkeyWie November 27, 2025 08:15
@monkeyWie
Member

@copilot The garbling issue is probably still not fixed, because this latin1-mangled test case already parses correctly with the old code.

…ename*= parsing when mime.ParseMediaType fails)

Co-authored-by: monkeyWie <[email protected]>
Contributor Author

Copilot AI commented Nov 27, 2025

@copilot The garbling issue is probably still not fixed, because this latin1-mangled test case already parses correctly with the old code.

You're right - the latin1-mangled test case was working with the original code because mime.ParseMediaType already handles raw UTF-8 bytes correctly.

The actual issue is when mime.ParseMediaType fails due to invalid characters in the Content-Disposition header (like <a> tags in the original issue), but the filename*= parameter is still valid.

Fixed in 0d0be4b by:

  1. Removing the incorrect latin1-mangled test case
  2. Updating mixed-encoding test to use invalid chars that trigger mime.ParseMediaType failure
  3. Simplifying the code by removing unused tryRecoverUTF8 function
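The failure mode described in this comment can be illustrated with a small sketch. The header values below are made up for illustration (they are not the headers from the original issue), and parseName is a hypothetical wrapper: an unquoted filename= value containing "<" makes mime.ParseMediaType return an error, even though the filename*= parameter in the same header is valid.

```go
package main

import (
	"fmt"
	"mime"
)

// parseName is a hypothetical wrapper that returns the "filename"
// parameter reported by mime.ParseMediaType, or the parse error.
func parseName(header string) (string, error) {
	_, params, err := mime.ParseMediaType(header)
	if err != nil {
		return "", err
	}
	return params["filename"], nil
}

func main() {
	// "<" is not a valid token character, so parsing stops with an error
	// before the valid filename*= parameter is ever reached.
	malformed := `attachment; filename=a<b>.zip; filename*=UTF-8''%E6%B5%8B%E8%AF%95.zip`
	_, err := parseName(malformed)
	fmt.Println("malformed header error:", err)

	// Without the invalid parameter the same filename*= value decodes fine.
	wellFormed := `attachment; filename*=UTF-8''%E6%B5%8B%E8%AF%95.zip`
	name, err := parseName(wellFormed)
	fmt.Println(name, err)
}
```

This is why checking filename*= before (or independently of) mime.ParseMediaType helps in the mixed-encoding case.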

Copilot AI changed the title from "Fix Content-Disposition filename parsing for mixed encodings and Latin-1 mangled UTF-8" to "Fix Content-Disposition filename parsing when mime.ParseMediaType fails" Nov 27, 2025
Copilot finished work on behalf of monkeyWie November 27, 2025 08:49
@monkeyWie
Member

@copilot Still not right. I want you to reverse-engineer a test case: first produce a case that parses to ����u�.zip before the fix, and then fix it.

Contributor Author

Copilot AI commented Nov 27, 2025

@copilot Still not right. I want you to reverse-engineer a test case: first produce a case that parses to ����u�.zip before the fix, and then fix it.

I've identified and fixed the actual garbled filename issue in 7164448.

The problem was GBK-encoded filenames from Chinese Windows servers. When a server sends 测试.zip as GBK bytes [178 226 202 212 46 122 105 112], Go treats them as a UTF-8 string even though they are not valid UTF-8, so the name shows up garbled (for example as ²âÊÔ.zip when the bytes are viewed as Latin-1).

Test case added: /gbk-encoded endpoint that sends GBK-encoded TestChineseFileName - before the fix this would parse as garbled characters, after the fix it correctly decodes to 测试.zip.

The fix detects invalid UTF-8 strings and attempts GBK decoding as a fallback.

Copilot AI changed the title from "Fix Content-Disposition filename parsing when mime.ParseMediaType fails" to "Fix Content-Disposition filename parsing for mixed encodings and GBK-encoded filenames" Nov 27, 2025
Copilot finished work on behalf of monkeyWie November 27, 2025 10:10
@codecov

codecov bot commented Nov 27, 2025

Codecov Report

❌ Patch coverage is 73.68421% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.61%. Comparing base (49b4d99) to head (948d1fd).
⚠️ Report is 1 commits behind head on main.

Files with missing lines | Patch % | Lines
internal/protocol/http/fetcher.go | 70.76% | 13 Missing and 6 partials ⚠️
internal/test/httptest.go | 80.00% | 3 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1189      +/-   ##
==========================================
+ Coverage   68.53%   68.61%   +0.08%     
==========================================
  Files          43       43              
  Lines        4348     4423      +75     
==========================================
+ Hits         2980     3035      +55     
- Misses       1043     1054      +11     
- Partials      325      334       +9     


@monkeyWie
Member

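@copilot The response header value for the garbled 下载地址.zip, in hex, is: 61 74 74 61 63 68 6D 65 6E 74 3B 20 66 69 6C 65 6E 61 6D 65 3D 22 3F 3F 3F D8 B5 3F D6 B7 2E 7A 69 70 22

Please reverse-engineer how this garbled value is produced, and adjust the test cases and decoding logic accordingly.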

Contributor Author

Copilot AI commented Nov 27, 2025

@copilot The response header value for the garbled 下载地址.zip, in hex, is: 61 74 74 61 63 68 6D 65 6E 74 3B 20 66 69 6C 65 6E 61 6D 65 3D 22 3F 3F 3F D8 B5 3F D6 B7 2E 7A 69 70 22

Please reverse-engineer how this garbled...

Analyzed in 948d1fd.

The hex bytes 3F 3F 3F D8 B5 3F D6 B7 show already-corrupted data:

  • 3F = ? (replacement for lost bytes)
  • D8 B5 and D6 B7 = byte pairs from the original GBK data that survived because each happens to be a valid two-byte UTF-8 sequence (ص and ַ)

Original 下载地址 in GBK: CF C2 D4 D8 B5 D8 D6 B7
Corrupted version: 3F 3F 3F D8 B5 3F D6 B7

The corruption happened server-side (some charset conversion replaced bytes with ?). Since the corrupted bytes are valid UTF-8, we cannot detect or recover the original.

What my fix handles:

  • Pure GBK bytes [B2 E2 CA D4] ("测试") → Invalid UTF-8 → GBK decoded → ✓
  • Corrupted bytes [3F 3F 3F D8 B5 3F D6 B7] → Valid UTF-8 → Cannot recover → ✗

The /gbk-encoded test verifies the fixable case (pure GBK bytes).
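The distinction between the two cases comes down to a UTF-8 validity check, which can be sketched as follows. The function name needsLegacyDecode is hypothetical and only illustrates the detection step; the actual GBK decoding in the fix is not reproduced here.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// needsLegacyDecode is a hypothetical detection step: a filename whose
// bytes are not valid UTF-8 is a candidate for a GBK fallback.
func needsLegacyDecode(raw []byte) bool {
	return !utf8.Valid(raw)
}

func main() {
	// Pure GBK bytes for 测试.zip (B2 E2 CA D4 plus ".zip").
	pureGBK := []byte{0xB2, 0xE2, 0xCA, 0xD4, 0x2E, 0x7A, 0x69, 0x70}
	// Already-corrupted bytes from the reported header.
	corrupted := []byte{0x3F, 0x3F, 0x3F, 0xD8, 0xB5, 0x3F, 0xD6, 0xB7, 0x2E, 0x7A, 0x69, 0x70}

	fmt.Println(needsLegacyDecode(pureGBK))   // true: invalid UTF-8, fallback can run
	fmt.Println(needsLegacyDecode(corrupted)) // false: valid UTF-8, nothing to detect
}
```

Because the corrupted bytes pass the validity check, a fallback keyed on invalid UTF-8 never fires for them, and the bytes replaced by ? are gone in any case.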

@monkeyWie
Member

@copilot The bytes in the test case response still do not match what I provided:
Target: 3F 3F 3F D8 B5 3F D6 B7 2E 7A 69 70
Actual: cf c2 d4 d8 b5 d8 d6 b7 2e 7a 69 70

Contributor Author

Copilot AI commented Nov 27, 2025

@copilot The bytes in the test case response still do not match what I provided:
Target: 3F 3F 3F D8 B5 3F D6 B7 2E 7A 69 70
Actual: cf c2 d4 d8 b5 d8 d6 b7 2e 7a 69 70

Added in 51f98c1.

Created new /corrupted-filename endpoint that sends the exact bytes you provided:

  • Target: 3F 3F 3F D8 B5 3F D6 B7 2E 7A 69 70

The /gbk-encoded endpoint still sends pure GBK bytes (CF C2 D4 D8 B5 D8 D6 B7) which is the fixable case.

Note: The corrupted bytes [3F 3F 3F D8 B5 3F D6 B7] are valid UTF-8 (they render as ???ص?ַ), so we cannot programmatically detect or recover the original 下载地址 from them. The corruption happened before the data was sent.
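A handler that emits those exact bytes might look roughly like the sketch below. It is not the endpoint from httptest.go in this PR: corruptedFilenameHandler and recordedHeaderHex are hypothetical names, and the check uses httptest.NewRecorder, so it verifies only what the handler sets, not how the header travels over a real connection.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// corruptedName holds the filename bytes reported in this thread.
var corruptedName = []byte{0x3F, 0x3F, 0x3F, 0xD8, 0xB5, 0x3F, 0xD6, 0xB7, 0x2E, 0x7A, 0x69, 0x70}

// corruptedFilenameHandler is a hypothetical test endpoint that puts the
// raw bytes into a quoted filename= parameter.
func corruptedFilenameHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Disposition", `attachment; filename="`+string(corruptedName)+`"`)
	w.WriteHeader(http.StatusOK)
}

// recordedHeaderHex runs the handler against a recorder and returns the
// Content-Disposition value as space-separated upper-case hex.
func recordedHeaderHex() string {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/corrupted-filename", nil)
	corruptedFilenameHandler(rec, req)
	value := rec.Result().Header.Get("Content-Disposition")
	return fmt.Sprintf("% X", []byte(value))
}

func main() {
	fmt.Println(recordedHeaderHex())
}
```

Comparing the hex output with the header dump supplied by the reviewer is a quick way to confirm that a test endpoint reproduces the reported response byte for byte.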

Copilot finished work on behalf of monkeyWie November 27, 2025 13:59
@monkeyWie monkeyWie force-pushed the copilot/handle-content-disposition-encoding branch from c3bad0a to 948d1fd Compare November 27, 2025 14:14
@monkeyWie monkeyWie marked this pull request as ready for review November 27, 2025 14:14
@monkeyWie monkeyWie added the bug Something isn't working label Nov 27, 2025
@monkeyWie monkeyWie changed the title from "Fix Content-Disposition filename parsing for mixed encodings and GBK-encoded filenames" to "fix: HTTP filename parsing for mixed encodings and GBK-encoded" Nov 27, 2025
@monkeyWie monkeyWie merged commit 2f5fbfb into main Nov 29, 2025
41 checks passed
@monkeyWie monkeyWie deleted the copilot/handle-content-disposition-encoding branch November 29, 2025 03:34

Development

Successfully merging this pull request may close these issues.

Garbled filename (文件名乱码)

2 participants