✨ feat: Add configurable PDF processing method with Unstructured #5927

Merged: 20 commits, Feb 15, 2025
Changes from 9 commits
Commits
8745ca5
✨ feat: Add configurable PDF processing method with Unstructured
fzlzjerry Feb 9, 2025
06cbc84
🔧 fix: Update import path for env utility in ContentChunk module
fzlzjerry Feb 9, 2025
76bdec6
Merge branch 'main' into fix/unstructured_io
fzlzjerry Feb 9, 2025
d868e7b
Merge branch 'main' into fix/unstructured_io
fzlzjerry Feb 9, 2025
f9b7751
Merge branch 'main' into fix/unstructured_io
fzlzjerry Feb 10, 2025
c67fb33
Merge branch 'main' into fix/unstructured_io
fzlzjerry Feb 10, 2025
e8f0b03
Merge branch 'lobehub:main' into fix/unstructured_io
fzlzjerry Feb 13, 2025
4de5511
feat: add USE_UNSTRUCTURED_FOR_PDF environment variable to knowledge …
fzlzjerry Feb 13, 2025
79a6da5
Merge branch 'main' into fix/unstructured_io
fzlzjerry Feb 13, 2025
d6301bb
Delete src/server/utils/env.ts
fzlzjerry Feb 13, 2025
c132ff3
feat: implement ChunkingRuleParser for file type and service mapping
fzlzjerry Feb 13, 2025
6c3f3f4
Merge branch 'main' into fix/unstructured_io
fzlzjerry Feb 14, 2025
b82f9fe
refactor: remove USE_UNSTRUCTURED_FOR_PDF from knowledge environment …
fzlzjerry Feb 14, 2025
32771f5
test: add unit tests for ChunkingRuleParser functionality
fzlzjerry Feb 15, 2025
ef63c8a
Merge branch 'main' into fix/unstructured_io
fzlzjerry Feb 15, 2025
e878342
refactor: remove isUsingUnstructured method from ContentChunk class
fzlzjerry Feb 15, 2025
bd7d700
Merge branch 'main' into fix/unstructured_io
fzlzjerry Feb 15, 2025
044c33d
refactor: update ChunkingService type and clean up ContentChunk rules
fzlzjerry Feb 15, 2025
4df73f7
refactor: simplify ChunkingRuleParser and update ContentChunk module
fzlzjerry Feb 15, 2025
9b4038a
refactor: update ContentChunk module import for ChunkingService
fzlzjerry Feb 15, 2025
30 changes: 14 additions & 16 deletions src/config/knowledge.ts
@@ -1,19 +1,17 @@
 import { createEnv } from '@t3-oss/env-nextjs';
 import { z } from 'zod';

-export const getKnowledgeConfig = () => {
-  return createEnv({
-    runtimeEnv: {
-      DEFAULT_FILES_CONFIG: process.env.DEFAULT_FILES_CONFIG,
-      UNSTRUCTURED_API_KEY: process.env.UNSTRUCTURED_API_KEY,
-      UNSTRUCTURED_SERVER_URL: process.env.UNSTRUCTURED_SERVER_URL,
-    },
-    server: {
-      DEFAULT_FILES_CONFIG: z.string().optional(),
-      UNSTRUCTURED_API_KEY: z.string().optional(),
-      UNSTRUCTURED_SERVER_URL: z.string().optional(),
-    },
-  });
-};
-
-export const knowledgeEnv = getKnowledgeConfig();
+export const knowledgeEnv = createEnv({
+  runtimeEnv: {
+    DEFAULT_FILES_CONFIG: process.env.DEFAULT_FILES_CONFIG,
+    UNSTRUCTURED_API_KEY: process.env.UNSTRUCTURED_API_KEY,
+    UNSTRUCTURED_SERVER_URL: process.env.UNSTRUCTURED_SERVER_URL,
+    USE_UNSTRUCTURED_FOR_PDF: process.env.USE_UNSTRUCTURED_FOR_PDF,
+  },
+  server: {
+    DEFAULT_FILES_CONFIG: z.string().optional(),
+    UNSTRUCTURED_API_KEY: z.string().optional(),
+    UNSTRUCTURED_SERVER_URL: z.string().optional(),
+    USE_UNSTRUCTURED_FOR_PDF: z.string().optional(),
+  },
+});
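Note on the new variable: `USE_UNSTRUCTURED_FOR_PDF` is validated as an optional string, so any non-empty value, including `"false"` or `"0"`, counts as enabled once it is coerced with `!!` in `ContentChunk`. The following is a minimal sketch of one way to read such a flag explicitly. `parseBooleanFlag` is a hypothetical helper that is not part of this PR, and the list of accepted spellings is an assumption.

```typescript
// Hypothetical helper (not part of this PR): interpret an env string as a boolean.
// The accepted "true" spellings below are an assumption chosen for illustration.
const TRUE_VALUES = new Set(['1', 'true', 'yes', 'on']);

const parseBooleanFlag = (value: string | undefined): boolean => {
  if (value === undefined) return false;
  return TRUE_VALUES.has(value.trim().toLowerCase());
};

// Plain `!!` coercion treats every non-empty string as enabled.
const coerced = !!'false';
// Explicit parsing only enables the flag for the listed spellings.
const parsed = parseBooleanFlag('false');

console.log({ coerced, parsed });
```

Whether stricter parsing is wanted depends on how the project documents the variable; the sketch only makes the difference between the two readings visible.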
6 changes: 5 additions & 1 deletion src/server/modules/ContentChunk/index.ts
@@ -3,6 +3,7 @@ import { Strategy } from 'unstructured-client/sdk/models/shared';

 import { NewChunkItem, NewUnstructuredChunkItem } from '@/database/schemas';
 import { ChunkingStrategy, Unstructured } from '@/libs/unstructured';
+import { knowledgeEnv } from '@/config/knowledge';

 export interface ChunkContentParams {
   content: Uint8Array;
@@ -26,7 +27,10 @@ export class ContentChunk {
   }

   isUsingUnstructured(params: ChunkContentParams) {
-    return params.fileType === 'application/pdf' && params.mode === 'hi-res';
+    return params.fileType === 'application/pdf' &&
+      !!knowledgeEnv.USE_UNSTRUCTURED_FOR_PDF &&
+      !!knowledgeEnv.UNSTRUCTURED_API_KEY &&
+      !!knowledgeEnv.UNSTRUCTURED_SERVER_URL;
   }

   async chunkContent(params: ChunkContentParams): Promise<ChunkResult> {
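Note on the changed check: the new `isUsingUnstructured` no longer looks at `params.mode`. It requires a PDF file type plus all three Unstructured settings to be non-empty. A minimal sketch of the same decision as a standalone function follows. `shouldUseUnstructured`, `UnstructuredSettings`, and the sample values are hypothetical names introduced here, and the settings are passed in rather than read from `knowledgeEnv` so the logic can be exercised in isolation.

```typescript
// Sketch only: settings are passed in instead of being read from `knowledgeEnv`.
interface UnstructuredSettings {
  USE_UNSTRUCTURED_FOR_PDF?: string;
  UNSTRUCTURED_API_KEY?: string;
  UNSTRUCTURED_SERVER_URL?: string;
}

// Hypothetical standalone version of the check in this hunk.
const shouldUseUnstructured = (fileType: string, env: UnstructuredSettings): boolean =>
  fileType === 'application/pdf' &&
  !!env.USE_UNSTRUCTURED_FOR_PDF &&
  !!env.UNSTRUCTURED_API_KEY &&
  !!env.UNSTRUCTURED_SERVER_URL;

// Placeholder values, not real credentials or endpoints.
const full: UnstructuredSettings = {
  USE_UNSTRUCTURED_FOR_PDF: '1',
  UNSTRUCTURED_API_KEY: 'example-key',
  UNSTRUCTURED_SERVER_URL: 'https://unstructured.example.invalid',
};

console.log(shouldUseUnstructured('application/pdf', full)); // true
console.log(shouldUseUnstructured('text/plain', full)); // false
console.log(shouldUseUnstructured('application/pdf', { ...full, UNSTRUCTURED_API_KEY: '' })); // false
```

Keeping the decision free of module-level state is one possible design; the commit list shows the method was later removed from `ContentChunk` in favor of a `ChunkingRuleParser`, whose details are not part of this view.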
12 changes: 12 additions & 0 deletions src/server/utils/env.ts
@@ -0,0 +1,12 @@
+export const isDev = process.env.NODE_ENV === 'development';
+
+export const isOnServerSide = typeof window === 'undefined';
+/**
+ * Get environment variable value
+ * @param key - Environment variable key
+ * @returns Environment variable value or empty string if not found
+ */
+export const getEnvironment = (key: string): string => {
+  if (typeof process === 'undefined') return '';
+  return process.env[key] || '';
+};
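Note on `getEnvironment`: the `|| ''` fallback means a variable that is set to an empty string and one that is missing both come back as `''`. The sketch below reproduces that fallback against a passed-in record. `getFrom`, `EnvRecord`, and `sampleEnv` are hypothetical names used so the example does not depend on `process`. Per the commit list, this file was deleted again in d6301bb.

```typescript
// Sketch: same fallback logic as `getEnvironment`, but reading from a
// passed-in record (an assumption made so the example is self-contained).
type EnvRecord = Record<string, string | undefined>;

const getFrom = (env: EnvRecord, key: string): string => env[key] || '';

// Placeholder values for illustration only.
const sampleEnv: EnvRecord = { UNSTRUCTURED_API_KEY: 'example-key', EMPTY_VALUE: '' };

console.log(getFrom(sampleEnv, 'UNSTRUCTURED_API_KEY')); // example-key
console.log(getFrom(sampleEnv, 'EMPTY_VALUE') === ''); // true
console.log(getFrom(sampleEnv, 'MISSING') === ''); // true
```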