✨ feat: Add configurable PDF processing method with Unstructured #5927

Open
wants to merge 15 commits into
base: main


@fzlzjerry fzlzjerry commented Feb 9, 2025

💻 Change Type

  • ✨ feat
  • 🐛 fix
  • ♻️ refactor
  • 💄 style
  • 👷 build
  • ⚡️ perf
  • 📝 docs
  • 🔨 chore

🔀 Description of Change

📝 Additional Information


vercel bot commented Feb 9, 2025

@fzlzjerry is attempting to deploy a commit to the LobeHub Team on Vercel.

A member of the Team first needs to authorize it.

@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. 🌠 Feature Request New feature or request | 特性与建议 labels Feb 9, 2025
@lobehubbot
Member

👍 @fzlzjerry

Thank you for raising your pull request and contributing to our Community
Please make sure you have followed our contributing guidelines. We will review it as soon as possible.
If you encounter any problems, please feel free to connect with us.

Contributor

gru-agent bot commented Feb 9, 2025

TestGru Assignment

Summary

Link CommitId Status Reason
Detail 06cbc84 ✅ Finished

Files

File Pull Request
src/server/modules/ContentChunk/index.ts ❌ Failure (I failed to write the unit tests for the file.)
src/server/utils/env.ts 🔴 Closed #5930

Tip

You can @gru-agent and leave your feedback. TestGru will make adjustments based on your input

src/server/modules/ContentChunk/index.ts
@@ -26,7 +31,10 @@ export class ContentChunk {
}

isUsingUnstructured(params: ChunkContentParams) {
- return params.fileType === 'application/pdf' && params.mode === 'hi-res';
+ return params.fileType === 'application/pdf' &&
Contributor

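There is one more point I would like to discuss here. Do you think only PDFs should go through Unstructured, or could we offer an extensible syntax, for example:

CHUNK_BY_UNSTRUCTURED=pdf,xlsx

That would let us flexibly control which file types go through Unstructured and which use the native parser.

Or should we keep the current USE_UNSTRUCTURED_FOR_PDF and only consider the PDF use case?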
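
For illustration, here is a minimal sketch of how a comma-separated value such as CHUNK_BY_UNSTRUCTURED=pdf,xlsx might be read. Both helper names are hypothetical, and matching on the file extension is an assumption made for brevity; the code in this PR matches on the MIME type instead.

```typescript
// Hypothetical helper: turn "pdf,xlsx" into a set of lowercase file types.
// Empty entries and surrounding whitespace are ignored.
function parseUnstructuredFileTypes(raw: string | undefined): Set<string> {
  return new Set(
    (raw ?? '')
      .split(',')
      .map((type) => type.trim().toLowerCase())
      .filter(Boolean),
  );
}

// Hypothetical helper: decide by file extension whether a file should be
// sent to Unstructured. Files without an extension use the native parser.
function shouldUseUnstructured(filename: string, enabled: Set<string>): boolean {
  const dot = filename.lastIndexOf('.');
  if (dot === -1) return false;
  return enabled.has(filename.slice(dot + 1).toLowerCase());
}
```

With this shape, leaving the variable unset yields an empty set, so every file would fall through to native parsing.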

Author

Flexible control, then? After all, there are files in other formats as well, such as docs.

Contributor

@arvinxx arvinxx Feb 13, 2025

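In that case, how about changing this to a rule-based implementation?

The variable could be renamed to FILE_TYPE_CHUNKING_RULES

and written in key=value form?

For example, if PDF and XLSX should go through UnstructuredIO, the variable would be:

FILE_TYPE_CHUNKING_RULES="pdf=unstructured;xlsx=unstructured"

If other PDF chunking providers appear in the future, this still allows flexible control:

FILE_TYPE_CHUNKING_RULES="pdf=unstructured,doc2x,default;xlsx=unstructured"

This means there are three services for PDF parsing: unstructured, doc2x, and default. If unstructured fails, parsing falls through to doc2x; if doc2x also fails, it falls back to the built-in default parser. Since xlsx does not list default, it is parsed with unstructured only.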
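
The rule syntax and fallback order described above could be sketched roughly as follows. This is an illustration only, not the implementation in this PR: parseChunkingRules, chunkWithFallback, and the Chunker type are hypothetical names, and the real chunkers would wrap the Unstructured client and the built-in parser.

```typescript
// Hypothetical types for this sketch.
type ChunkingRules = Record<string, string[]>;
type Chunker<T> = () => Promise<T>;

// Parse "pdf=unstructured,doc2x,default;xlsx=unstructured" into
// { pdf: ['unstructured', 'doc2x', 'default'], xlsx: ['unstructured'] }.
function parseChunkingRules(raw: string | undefined): ChunkingRules {
  const rules: ChunkingRules = {};
  if (!raw) return rules;

  for (const entry of raw.split(';')) {
    const [fileType, services] = entry.split('=').map((part) => part.trim());
    if (!fileType || !services) continue; // skip malformed entries

    rules[fileType.toLowerCase()] = services
      .split(',')
      .map((service) => service.trim().toLowerCase())
      .filter(Boolean);
  }
  return rules;
}

// Try each configured service in order. The first one that succeeds wins;
// if all of them fail, the last error is rethrown. Service names without a
// registered chunker are skipped.
async function chunkWithFallback<T>(
  services: string[],
  chunkers: Record<string, Chunker<T>>,
): Promise<T> {
  let lastError: unknown = new Error('no chunking service configured');
  for (const name of services) {
    const chunker = chunkers[name];
    if (!chunker) continue;
    try {
      return await chunker();
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}
```

Under these assumptions, a file type that does not appear in the rules has no configured services, so the caller would need to decide whether that means "use the built-in parser" or "reject the file".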

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Feb 13, 2025
@fzlzjerry fzlzjerry requested a review from arvinxx February 13, 2025 07:37
src/server/utils/env.ts (outdated)

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Feb 13, 2025
return await this.chunkByLangChain(params.filename, params.content);
}

private canUseUnstructured(): boolean {
return !!(
knowledgeEnv.USE_UNSTRUCTURED_FOR_PDF &&
Contributor

@arvinxx arvinxx Feb 13, 2025

Is the USE_UNSTRUCTURED_FOR_PDF variable still needed?

UNSTRUCTURED_API_KEY: process.env.UNSTRUCTURED_API_KEY,
UNSTRUCTURED_SERVER_URL: process.env.UNSTRUCTURED_SERVER_URL,
USE_UNSTRUCTURED_FOR_PDF: process.env.USE_UNSTRUCTURED_FOR_PDF,
FILE_TYPE_CHUNKING_RULES: process.env.FILE_TYPE_CHUNKING_RULES || 'pdf=unstructured',
Contributor

We shouldn't add a 'pdf=unstructured' default value here, should we?
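
One way to read this comment is that Unstructured should stay opt-in, so an unset variable means "no rules" rather than "pdf=unstructured". A minimal sketch of that reading, using a hypothetical readChunkingRulesEnv helper that is not part of the PR:

```typescript
// Hypothetical helper: return the raw rules string only when it is set to a
// non-blank value. No default is injected, so an unset or blank variable
// leaves every file type on the built-in parser.
function readChunkingRulesEnv(
  env: Record<string, string | undefined>,
): string | undefined {
  const raw = env.FILE_TYPE_CHUNKING_RULES?.trim();
  return raw ? raw : undefined;
}
```

It could be called as readChunkingRulesEnv(process.env) in a Node.js environment.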

Labels
🌠 Feature Request New feature or request | 特性与建议 size:L This PR changes 100-499 lines, ignoring generated files.

3 participants