[BUG]: Error in chunking document in RAG demo "In[8]" #1258

Closed
@D-aisukeY-oshida

Description

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Notebook: https://github.com/milvus-io/bootcamp/blob/master/bootcamp/RAG/readthedocs_zilliz_langchain.ipynb
Section: "Chunking"
Code block: "In[8]"
Running this code block raises the following error:

///
ValueError Traceback (most recent call last)
Cell In[11], line 29
27 for doc in docs:
28 soup = BeautifulSoup(doc.page_content, 'html.parser')
---> 29 splits = html_splitter.split_text(str(soup))
30 for split in splits:
31 # Add the source URL and header values to the metadata
32 metadata = {}

File /usr/local/lib/python3.8/site-packages/langchain/text_splitter.py:589, in HTMLHeaderTextSplitter.split_text(self, text)
583 def split_text(self, text: str) -> List[Document]:
584 """Split HTML text string
585
586 Args:
587 text: HTML text
588 """
--> 589 return self.split_text_from_file(StringIO(text))

File /usr/local/lib/python3.8/site-packages/langchain/text_splitter.py:617, in HTMLHeaderTextSplitter.split_text_from_file(self, file)
615 xslt_tree = etree.parse(xslt_path)
616 transform = etree.XSLT(xslt_tree)
--> 617 result = transform(tree)
618 result_dom = etree.fromstring(str(result))
620 # create filter and mapping for header metadata

File src/lxml/xslt.pxi:509, in lxml.etree.XSLT.__call__()

File src/lxml/apihelpers.pxi:50, in lxml.etree._documentOrRaise()

ValueError: Input object has no document: lxml.etree._ElementTree
///

Expected Behavior

One of the sample HTML documents appears to be empty (it has no h1 or h2 elements):
 'rtdocs/pymilvus.readthedocs.io/en/latest/search.html'

I think this is what makes the cell fail, since the following step does not seem to expect an empty string. Shouldn't doc.page_content be checked for an empty string before it is parsed?
soup = BeautifulSoup(doc.page_content, 'html.parser')
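
The check suggested above could look roughly like the sketch below. It is only an illustration, not the notebook's code: `Doc`, `split_skipping_empty`, and `fake_splitter` are hypothetical names, and `fake_splitter` is a toy stand-in that raises on empty input to mimic the reported failure. It uses only the standard library so it can be run without LangChain or BeautifulSoup installed.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document (page_content plus metadata)."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def split_skipping_empty(
    docs: Iterable[Doc],
    split_text: Callable[[str], List[Any]],
) -> List[Any]:
    """Apply split_text to each document, skipping empty or whitespace-only content."""
    chunks: List[Any] = []
    for doc in docs:
        if not doc.page_content or not doc.page_content.strip():
            # Nothing to parse, so the splitter is never called for this document.
            print(f"Skipping empty document: {doc.metadata.get('source', '<unknown>')}")
            continue
        chunks.extend(split_text(doc.page_content))
    return chunks


def fake_splitter(text: str) -> List[str]:
    """Toy splitter (assumption): raises on empty input, like the error in this report."""
    if not text.strip():
        raise ValueError("Input object has no document")
    return text.split("\n\n")


docs = [
    Doc("<h1>Intro</h1>\n\n<p>Body</p>", {"source": "a.html"}),
    Doc("", {"source": "rtdocs/pymilvus.readthedocs.io/en/latest/search.html"}),
]
chunks = split_skipping_empty(docs, fake_splitter)
print(chunks)
```

In the notebook, the same guard would presumably go at the top of the `for doc in docs:` loop, before `BeautifulSoup(doc.page_content, 'html.parser')` and `html_splitter.split_text(...)` are called. Whether to skip such documents silently or log them is a design choice; logging the source path, as here, makes it easier to spot pages that were downloaded empty.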

Steps To Reproduce

Open the notebook: https://github.com/milvus-io/bootcamp/blob/master/bootcamp/RAG/readthedocs_zilliz_langchain.ipynb
Go to the "Chunking" section.
Run code block "In[8]".

Software version

Milvus   : 2.3
OS       : Debian GNU/Linux 12 (bookworm)
langchain: 0.1.5

Anything else?

No response
