Description
Is there an existing issue for this?
- I have searched the existing issues
Current Behavior
https://github.com/milvus-io/bootcamp/blob/master/bootcamp/RAG/readthedocs_zilliz_langchain.ipynb
Chapter on "Chunking."
Code block "In[8]"
Running the code block above raises the following error:
///
ValueError Traceback (most recent call last)
Cell In[11], line 29
27 for doc in docs:
28 soup = BeautifulSoup(doc.page_content, 'html.parser')
---> 29 splits = html_splitter.split_text(str(soup))
30 for split in splits:
31 # Add the source URL and header values to the metadata
32 metadata = {}
File /usr/local/lib/python3.8/site-packages/langchain/text_splitter.py:589, in HTMLHeaderTextSplitter.split_text(self, text)
583 def split_text(self, text: str) -> List[Document]:
584 """Split HTML text string
585
586 Args:
587 text: HTML text
588 """
--> 589 return self.split_text_from_file(StringIO(text))
File /usr/local/lib/python3.8/site-packages/langchain/text_splitter.py:617, in HTMLHeaderTextSplitter.split_text_from_file(self, file)
615 xslt_tree = etree.parse(xslt_path)
616 transform = etree.XSLT(xslt_tree)
--> 617 result = transform(tree)
618 result_dom = etree.fromstring(str(result))
620 # create filter and mapping for header metadata
File src/lxml/xslt.pxi:509, in lxml.etree.XSLT.__call__()
File src/lxml/apihelpers.pxi:50, in lxml.etree._documentOrRaise()
ValueError: Input object has no document: lxml.etree._ElementTree
///
Expected Behavior
One of the sample HTML documents appears to be empty and contains no h1 or h2 elements:
'rtdocs/pymilvus.readthedocs.io/en/latest/search.html'
I suspect this is what makes the process fail, since the splitting step does not seem to handle empty input. Shouldn't doc.page_content be checked for empty content before it reaches the following line?
soup = BeautifulSoup(doc.page_content, 'html.parser')
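A possible guard is sketched below, assuming that empty page_content really is the trigger. `Doc` and `fake_splitter` are hypothetical stand-ins invented for illustration so the snippet runs without LangChain; in the notebook, `split_text` would presumably be `html_splitter.split_text` and the documents would be the loaded `docs`.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_non_empty(
    docs: Iterable[Doc], split_text: Callable[[str], List[str]]
) -> Tuple[List[str], List[str]]:
    """Apply split_text to each document, skipping ones with blank content.

    Returns (splits, skipped_sources) so the skipped files can be inspected.
    """
    splits: List[str] = []
    skipped: List[str] = []
    for doc in docs:
        # Guard: blank or whitespace-only content never reaches the splitter.
        if not doc.page_content or not doc.page_content.strip():
            skipped.append(doc.metadata.get("source", "<unknown>"))
            continue
        splits.extend(split_text(doc.page_content))
    return splits, skipped


def fake_splitter(text: str) -> List[str]:
    """Hypothetical splitter that mimics the suspected failure on empty input."""
    if not text.strip():
        raise ValueError("Input object has no document")
    return [text]


docs = [
    Doc("<h1>Intro</h1><p>Hello</p>", {"source": "index.html"}),
    Doc("", {"source": "search.html"}),
]
splits, skipped = split_non_empty(docs, fake_splitter)
```

Returning the skipped sources alongside the splits makes it easy to confirm which files were empty instead of silently dropping them.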
Steps To Reproduce
https://github.com/milvus-io/bootcamp/blob/master/bootcamp/RAG/readthedocs_zilliz_langchain.ipynb
Chapter on "Chunking."
Code block "In[8]"
Run the code block above.
Software version
Milvus : 2.3
OS : Debian GNU/Linux 12 (bookworm)
langchain: 0.1.5
Anything else?
No response