✨ feat: Add configurable PDF processing method with Unstructured #5927
@fzlzjerry is attempting to deploy a commit to the LobeHub Team on Vercel. A member of the Team first needs to authorize it.
Thank you for raising your pull request and contributing to our Community.
@@ -26,7 +31,10 @@ export class ContentChunk {
   }

   isUsingUnstructured(params: ChunkContentParams) {
-    return params.fileType === 'application/pdf' && params.mode === 'hi-res';
+    return params.fileType === 'application/pdf' &&
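One more point I'd like to discuss here: do you think only PDFs should go through Unstructured, or could we offer an extensible syntax, for example:
CHUNK_BY_UNSTRUCTURED=pdf,xlsx
That would let us flexibly control which file types go through Unstructured and which use the native parser.
Or should we keep the current USE_UNSTRUCTURED_FOR_PDF and only consider the PDF use case?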
Flexible control? After all, there are other file types too, such as docs.
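How about changing this to a rule-based implementation, then?
The variable could be renamed FILE_TYPE_CHUNKING_RULES and written in key=value form.
For example, if PDF and XLSX should go through UnstructuredIO, the variable would be:
FILE_TYPE_CHUNKING_RULES="pdf=unstructured;xlsx=unstructured"
If other PDF chunking providers are added in the future, they can be controlled flexibly:
FILE_TYPE_CHUNKING_RULES="pdf=unstructured,doc2x,default;xlsx=unstructured"
This means there are three PDF parsing services: unstructured, doc2x, and default. If unstructured fails, doc2x is called next; if doc2x also fails, parsing falls back to the built-in default. Since xlsx does not list default, it is parsed with unstructured only.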
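The rule syntax proposed above could be parsed and applied roughly as follows. This is only a minimal sketch, not the PR's implementation: `parseChunkingRules`, `chunkWithFallback`, the `Chunker` type, and the `chunkers` registry are all hypothetical names, keys are assumed to be file extensions, and treating a file type with no rule as `default` is an assumption the thread does not settle.

```typescript
// Hypothetical sketch: none of these names come from the PR.
type ChunkingRules = Map<string, string[]>;
type Chunker = (content: string) => Promise<string[]>;

// Parse "pdf=unstructured,doc2x,default;xlsx=unstructured" into a map
// from file extension to an ordered list of service names.
function parseChunkingRules(raw: string | undefined): ChunkingRules {
  const rules: ChunkingRules = new Map();
  if (!raw) return rules;

  for (const entry of raw.split(';')) {
    const [key, value] = entry.split('=');
    if (!key || !value) continue; // skip malformed entries

    const services = value
      .split(',')
      .map((s) => s.trim().toLowerCase())
      .filter((s) => s.length > 0);
    if (services.length > 0) rules.set(key.trim().toLowerCase(), services);
  }
  return rules;
}

// Try each configured service in order; the first one that succeeds wins.
async function chunkWithFallback(
  fileType: string,
  content: string,
  rules: ChunkingRules,
  chunkers: Record<string, Chunker>,
): Promise<string[]> {
  // Assumption: a file type without a rule uses the built-in parser.
  const services = rules.get(fileType.toLowerCase()) ?? ['default'];
  let lastError: unknown;

  for (const name of services) {
    const chunker = chunkers[name];
    if (!chunker) continue; // service named in the rule but not registered
    try {
      return await chunker(content);
    } catch (error) {
      lastError = error; // remember the failure and try the next service
    }
  }
  throw lastError ?? new Error(`No chunking service available for ${fileType}`);
}
```

With the second example value from the comment, a PDF would be tried against unstructured, then doc2x, then default, while an xlsx file would fail outright if unstructured fails, because its rule lists no fallback.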
    return await this.chunkByLangChain(params.filename, params.content);
  }

  private canUseUnstructured(): boolean {
    return !!(
      knowledgeEnv.USE_UNSTRUCTURED_FOR_PDF &&
Is the USE_UNSTRUCTURED_FOR_PDF variable no longer needed?
src/config/knowledge.ts
Outdated
    UNSTRUCTURED_API_KEY: process.env.UNSTRUCTURED_API_KEY,
    UNSTRUCTURED_SERVER_URL: process.env.UNSTRUCTURED_SERVER_URL,
    USE_UNSTRUCTURED_FOR_PDF: process.env.USE_UNSTRUCTURED_FOR_PDF,
    FILE_TYPE_CHUNKING_RULES: process.env.FILE_TYPE_CHUNKING_RULES || 'pdf=unstructured',
We shouldn't add 'pdf=unstructured' as the default value here, should we?
💻 Change Type
🔀 Description of Change
📝 Additional Information