[REQUEST] Are You Using a Max Length Chunking Strategy for All File Types? #422

Open
QuangTQV opened this issue Oct 22, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@QuangTQV

Reference Issues

No response

Summary

It seems that a max-length chunking strategy is used for all file types. I believe each file type should have its own chunking strategy to improve accuracy.

Chunking strategies tailored to each file type could improve the overall precision of the system, because they can take the structure and content of each format into account.
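
One way the per-file-type idea could be wired up is a small registry keyed by file extension, with the max-length behavior kept as the fallback. This is only a sketch: `chunk_by_max_length`, `chunk_by_blank_lines`, `CHUNKERS`, and `chunk_file` are hypothetical names chosen for illustration and are not part of this project's API.

```python
from pathlib import Path
from typing import Callable, Dict, List

Chunker = Callable[[str], List[str]]


def chunk_by_max_length(text: str, max_len: int = 1000) -> List[str]:
    """Fallback: cut text into consecutive pieces of at most max_len characters."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


def chunk_by_blank_lines(text: str) -> List[str]:
    """Illustrative structure-aware strategy: one chunk per blank-line-separated block."""
    return [block.strip() for block in text.split("\n\n") if block.strip()]


# Hypothetical registry mapping a lowercase file extension to a strategy.
CHUNKERS: Dict[str, Chunker] = {
    ".txt": chunk_by_blank_lines,
}


def chunk_file(filename: str, text: str) -> List[str]:
    """Pick a strategy by extension; unknown types fall back to max-length chunking."""
    suffix = Path(filename).suffix.lower()
    chunker = CHUNKERS.get(suffix, chunk_by_max_length)
    return chunker(text)
```

New formats would then be supported by registering another function in `CHUNKERS`, without touching the fallback path.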

Basic Example

For example:

Markdown files could be chunked based on headers.
DOCX files could be split into sections or paragraphs, with any paragraph that is too small merged into an adjacent one. Semantic similarity between two chunks could also be used to decide whether they should be combined.
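
The two examples above can be sketched roughly as follows. Several things here are assumptions: only ATX-style (`#`) Markdown headers are recognized, the DOCX paragraphs are assumed to be already extracted into plain strings, and `min_chars` is an arbitrary illustrative threshold. The function names are hypothetical. The semantic-similarity check is left out because it would need an embedding model.

```python
import re
from typing import List

HEADER_RE = re.compile(r"^#{1,6}\s")  # ATX-style headers only (assumption)


def chunk_markdown_by_headers(text: str) -> List[str]:
    """Start a new chunk at every header line, ignoring '#' lines inside code fences."""
    chunks: List[str] = []
    current: List[str] = []
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        if not in_fence and HEADER_RE.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


def merge_small_paragraphs(paragraphs: List[str], min_chars: int = 200) -> List[str]:
    """Merge each paragraph into the previous chunk while that chunk is under min_chars."""
    merged: List[str] = []
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
        if merged and len(merged[-1]) < min_chars:
            merged[-1] = merged[-1] + "\n\n" + para
        else:
            merged.append(para)
    # A short trailing paragraph has no successor, so fold it backward.
    if len(merged) > 1 and len(merged[-1]) < min_chars:
        tail = merged.pop()
        merged[-1] = merged[-1] + "\n\n" + tail
    return merged
```

Both functions return plain lists of strings, so a max-length pass could still be applied afterward to any chunk that ends up too long.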

Drawbacks

None

Additional information

Choosing the chunking strategy per file type is important for accuracy: chunks that follow each format's own structure should be more meaningful than chunks cut at a fixed maximum length, which in turn should improve overall performance.

QuangTQV added the enhancement (New feature or request) label Oct 22, 2024