add support for chunking #7
base: main
Conversation
Signed-off-by: Jiffin Tony Thottan <[email protected]>
pythonvectordbceph.py (outdated)
    fields = [
        FieldSchema(name='url', dtype=DataType.VARCHAR, max_length=2048, is_primary=True),  # VARCHARs need a maximum length; here it is set to 2048 characters
        FieldSchema(name='embedded_vector', dtype=DataType.FLOAT_VECTOR, dim=int(os.getenv("VECTOR_DIMENSION"))),
        FieldSchema(name='start_offset', dtype=DataType.INT64, default_value=0),
can you try and add is_primary=True to the start_offset field?
probably not needed for the end_offset, as we don't expect overlaps
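Keying chunks on (url, start_offset) implies each chunk row carries its character offsets in the original object. A minimal plain-Python sketch (an illustration, not the PR's actual code) of how start/end offsets could be derived when splitting on ".":

```python
def chunk_offsets(text, separator="."):
    """Split text on a separator and return (chunk, start, end) tuples,
    where start/end are character offsets into the original text."""
    chunks = []
    pos = 0
    for piece in text.split(separator):
        if not piece:
            # empty piece (e.g. trailing separator): just advance past it
            pos += len(separator)
            continue
        start = text.index(piece, pos)
        end = start + len(piece)
        chunks.append((piece, start, end))
        pos = end
    return chunks

rows = chunk_offsets("First sentence. Second sentence.")
```

Since chunks cannot share a start offset, (url, start_offset) is unique per row, while end_offset adds no extra uniqueness, matching the comment above.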
    MILVUS_ENDPOINT : "http://my-release-milvus.default.svc:19530"
    OBJECT_TYPE : "TEXT"
    VECTOR_DIMENSION : "384"
    # CHUNK_SIZE : "500"
why is CHUNK_SIZE commented out?
    app.logger.debug("object size zero cannot be chunked")
    return
    text_splitter = CharacterTextSplitter(
        separator=".",
do you split by . or by size?
is it possible to demo chunking done by content (by the language model itself)?
> do you split by . or by size?

It first checks for "."; if the separator is not found, chunking happens based on size.
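The "separator first, then size" behavior described above can be sketched in plain Python. This is a simplified stand-in for what LangChain's CharacterTextSplitter does (the real class also supports chunk overlap, which is omitted here):

```python
def split_text(text, separator=".", chunk_size=500):
    # Break on the separator first, then greedily merge pieces back
    # together until adding the next piece would exceed chunk_size.
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = (current + separator + piece) if current else piece
        if len(candidate) > chunk_size and current:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

So a document with no "." at all degrades to size-only behavior, and a document shorter than chunk_size comes back as a single chunk.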
        objectlist = text_splitter.split_text(object_content)
        app.logger.debug("chunk size " + str(chunk_size) + " no of chunks " + str(len(objectlist)))
    else:
        objectlist.append(object_content)
why do you append the entire object content to the object list?
If chunking is disabled, the entire content is added as a single entry.
so, chunk_size=1 is the indication that chunking is disabled?
why not "0"?
also, what would be the value if the env var is not set?
I will set it to 1 if it is not defined; that handling is missing in this PR.
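One common way to handle the unset-env-var case raised above is a single helper with a fallback. A sketch (not the PR's code; the default of 1 and the "1 means disabled" convention follow the discussion above):

```python
import os

def get_chunk_size():
    # Fall back to 1 (chunking disabled) when CHUNK_SIZE is not set
    # or does not parse as an integer greater than 1.
    raw = os.getenv("CHUNK_SIZE", "1")
    try:
        value = int(raw)
    except ValueError:
        return 1
    return value if value > 1 else 1

chunking_enabled = get_chunk_size() > 1
```

Using 0 as the sentinel instead of 1 would work the same way; the only requirement is that the disabled value can never be a legitimate chunk size.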
IMO, the main issue here is that we use delimiter/size-based chunking.
Signed-off-by: Jiffin Tony Thottan <[email protected]>
Please read https://zilliz.com/learn/pandas-dataframe-chunking-anf-vectorizing-with-milvus and go to the "Content-Aware Chunking" part at the end of the page.
They have an example based on the token chunk length. Since we know the model we use for embedding, we can probably use that method?
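A rough sketch of token-length-based chunking. The whitespace tokenizer here is a stand-in assumption; in practice you would use the embedding model's own tokenizer (e.g. the one behind the 384-dim model in the config above) so chunk length is measured in the same tokens the model sees:

```python
def chunk_by_tokens(text, max_tokens=128):
    # Whitespace split is only a placeholder for the embedding model's
    # real tokenizer; swap in that tokenizer's encode() in practice.
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```

Unlike the "." splitter, this bounds every chunk by token count, which matches the embedding model's input limit rather than an arbitrary character size.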