add support for chunking #7
base: main
Conversation
Signed-off-by: Jiffin Tony Thottan <[email protected]>
pythonvectordbceph.py (outdated)
    fields = [
        FieldSchema(name='url', dtype=DataType.VARCHAR, max_length=2048, is_primary=True),  # VARCHARs need a maximum length; here it is set to 2048 characters
        FieldSchema(name='embedded_vector', dtype=DataType.FLOAT_VECTOR, dim=int(os.getenv("VECTOR_DIMENSION"))),
        FieldSchema(name='start_offset', dtype=DataType.INT64, default_value=0),
can you try and add is_primary=True to the start_offset field?
probably not needed for the end_offset, as we don't expect overlaps
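Keying chunks on (url, start_offset) implies each chunk row carries its character offsets in the original object. A minimal plain-Python sketch (an illustration, not the PR's actual code) of how start/end offsets could be derived when splitting on ".":

```python
def chunk_offsets(text, separator="."):
    """Split text on a separator and return (chunk, start, end) tuples,
    where start/end are character offsets into the original text."""
    chunks = []
    pos = 0
    for piece in text.split(separator):
        if not piece:
            # empty piece (e.g. trailing separator): just advance past it
            pos += len(separator)
            continue
        start = text.index(piece, pos)
        end = start + len(piece)
        chunks.append((piece, start, end))
        pos = end
    return chunks

rows = chunk_offsets("First sentence. Second sentence.")
```

Since chunks cannot share a start offset, (url, start_offset) is unique per row, while end_offset adds no extra uniqueness, matching the comment above.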
    MILVUS_ENDPOINT : "http://my-release-milvus.default.svc:19530"
    OBJECT_TYPE : "TEXT"
    VECTOR_DIMENSION : "384"
    # CHUNK_SIZE : "500"
why is CHUNK_SIZE commented out?
    app.logger.debug("object size zero cannot be chunked")
    return
    text_splitter = CharacterTextSplitter(
        separator=".",
do you split by . or by size?
is it possible to demo chunking done by content (by the language model itself)?
> do you split by . or by size?

It first checks for "."; if the separator is not found, chunking happens based on size.
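The "separator first, then size" behavior described above can be sketched in plain Python. This is a simplified stand-in for what LangChain's CharacterTextSplitter does (the real class also supports chunk overlap, which is omitted here):

```python
def split_text(text, separator=".", chunk_size=500):
    # Break on the separator first, then greedily merge pieces back
    # together until adding the next piece would exceed chunk_size.
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = (current + separator + piece) if current else piece
        if len(candidate) > chunk_size and current:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

So a document with no "." at all degrades to size-only behavior, and a document shorter than chunk_size comes back as a single chunk.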
        objectlist = text_splitter.split_text(object_content)
        app.logger.debug("chunk size " + str(chunk_size) + " no of chunks " + str(len(objectlist)))
    else:
        objectlist.append(object_content)
why do you append the entire object content to the object list?
If chunking is disabled, the entire content is added as a single entry.
so, chunk_size=1 is the indication that chunking is disabled?
why not "0"?
also, what would be the value if the env var is not set?
I will set it to 1 if it is not defined; that handling is missing in this PR.
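One common way to handle the unset-env-var case raised above is a single helper with a fallback. A sketch (not the PR's code; the default of 1 and the "1 means disabled" convention follow the discussion above):

```python
import os

def get_chunk_size():
    # Fall back to 1 (chunking disabled) when CHUNK_SIZE is not set
    # or does not parse as an integer greater than 1.
    raw = os.getenv("CHUNK_SIZE", "1")
    try:
        value = int(raw)
    except ValueError:
        return 1
    return value if value > 1 else 1

chunking_enabled = get_chunk_size() > 1
```

Using 0 as the sentinel instead of 1 would work the same way; the only requirement is that the disabled value can never be a legitimate chunk size.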
IMO, the main issue here is that we use delimiter/size-based chunking.
Signed-off-by: Jiffin Tony Thottan <[email protected]>
Please read https://zilliz.com/learn/pandas-dataframe-chunking-anf-vectorizing-with-milvus and go to the "Content-Aware Chunking" part at the end of the page.
They have an example based on the token chunk length. Since we know the model we use for embedding, we can probably use that method?
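A rough sketch of token-length-based chunking. The whitespace tokenizer here is a stand-in assumption; in practice you would use the embedding model's own tokenizer (e.g. the one behind the 384-dim model in the config above) so chunk length is measured in the same tokens the model sees:

```python
def chunk_by_tokens(text, max_tokens=128):
    # Whitespace split is only a placeholder for the embedding model's
    # real tokenizer; swap in that tokenizer's encode() in practice.
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```

Unlike the "." splitter, this bounds every chunk by token count, which matches the embedding model's input limit rather than an arbitrary character size.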