Skip to content

Conversation

@thotz
Copy link
Owner

@thotz thotz commented Feb 28, 2025

No description provided.

Signed-off-by: Jiffin Tony Thottan <[email protected]>
fields = [
FieldSchema(name='url', dtype=DataType.VARCHAR, max_length=2048, is_primary=True), # VARCHARS need a maximum length, so for this example they are set to 200 characters
FieldSchema(name='embedded_vector', dtype=DataType.FLOAT_VECTOR, dim=int(os.getenv("VECTOR_DIMENSION"))),
FieldSchema(name='start_offset', dtype=DataType.INT64, default_value=0),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you try and add is_primary=True to the start_offset field?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably not needed for the end_offset as we dont expect overlaps

MILVUS_ENDPOINT : "http://my-release-milvus.default.svc:19530"
OBJECT_TYPE : "TEXT"
VECTOR_DIMENSION: "384"
# CHUNK_SIZE : "500"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why under comment?

app.logger.debug("object size zero cannot be chunked")
return
text_splitter = CharacterTextSplitter(
separator=".",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you split by . or by size?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to demo chunking done by content (by the language model itself)?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you split by . or by size?

First check for ".", if not then chunking happen based on size

objectlist = text_splitter.split_text(object_content)
app.logger.debug("chunk size " + str(chunk_size) + " no of chunks " + str(len(objectlist)))
else :
objectlist.append(object_content)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you append the entire object content to the object list?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if Chunking is disabled entire content is added together

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, chunk_size=1 is the indication that chunking is disabled?
why not "0"?
also, what would be the value if the env var is not set?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will set it to one if it is not defined, missing in this PR

@yuvalif
Copy link
Contributor

yuvalif commented Mar 17, 2025

IMO, the main issue here, is that we use delimiter/size based chunking.
is there a way to use the LLM to do the chunking?

@thotz
Copy link
Owner Author

thotz commented Mar 17, 2025

IMO, the main issue here, is that we use delimiter/size based chunking. is there a way to use the LLM to do the chunking?

Please read https://zilliz.com/learn/pandas-dataframe-chunking-anf-vectorizing-with-milvus, goto last part of Content-Aware Chunking in the page

@yuvalif
Copy link
Contributor

yuvalif commented Mar 17, 2025

IMO, the main issue here, is that we use delimiter/size based chunking. is there a way to use the LLM to do the chunking?

Please read https://zilliz.com/learn/pandas-dataframe-chunking-anf-vectorizing-with-milvus, goto last part of Content-Aware Chunking in the page

they have an example based on the token chunk length. since we know the model we use for embedding, we can probably use that method?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants