Current Behavior
///
ValueError                                Traceback (most recent call last)
Cell In[11], line 29
     27 for doc in docs:
     28     soup = BeautifulSoup(doc.page_content, 'html.parser')
---> 29     splits = html_splitter.split_text(str(soup))
     30     for split in splits:
     31         # Add the source URL and header values to the metadata
     32         metadata = {}

File /usr/local/lib/python3.8/site-packages/langchain/text_splitter.py:589, in HTMLHeaderTextSplitter.split_text(self, text)
    583 def split_text(self, text: str) -> List[Document]:
    584     """Split HTML text string
    585
    586     Args:
    587         text: HTML text
    588     """
--> 589     return self.split_text_from_file(StringIO(text))

File /usr/local/lib/python3.8/site-packages/langchain/text_splitter.py:617, in HTMLHeaderTextSplitter.split_text_from_file(self, file)
    615 xslt_tree = etree.parse(xslt_path)
    616 transform = etree.XSLT(xslt_tree)
--> 617 result = transform(tree)
    618 result_dom = etree.fromstring(str(result))
    620 # create filter and mapping for header metadata

File src/lxml/xslt.pxi:509, in lxml.etree.XSLT.__call__()

File src/lxml/apihelpers.pxi:50, in lxml.etree._documentOrRaise()

ValueError: Input object has no document: lxml.etree._ElementTree
///
Expected Behavior
One of the sample HTML documents appears to be empty and contains no h1 or h2 elements:
'rtdocs/pymilvus.readthedocs.io/en/latest/search.html'
I suspect this is what makes the process fail, since the splitter does not seem to expect empty input. Shouldn't doc.page_content be checked for empty content before this line?
soup = BeautifulSoup(doc.page_content, 'html.parser')
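The empty-content check asked about above could look roughly like this minimal sketch. It assumes each loaded document exposes a page_content string, as in the traceback. Doc and drop_empty_docs are hypothetical names used only for illustration; they are not part of LangChain or the notebook.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document with a page_content string."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def drop_empty_docs(docs: Iterable[Doc]) -> List[Doc]:
    """Keep only documents whose page_content has non-whitespace characters."""
    return [doc for doc in docs if doc.page_content and doc.page_content.strip()]


# Illustrative inputs: one real page, one empty page, one whitespace-only page.
docs = [
    Doc("<h1>Search</h1><p>Vector search basics.</p>", {"source": "a.html"}),
    Doc("", {"source": "empty.html"}),
    Doc("  \n ", {"source": "blank.html"}),
]
kept = drop_empty_docs(docs)
print([doc.metadata["source"] for doc in kept])
```

Filtering before the loop keeps the splitting code unchanged; whether silently skipping empty pages is acceptable depends on how the chunks are used later.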
Steps To Reproduce
https://github.com/milvus-io/bootcamp/blob/master/bootcamp/RAG/readthedocs_zilliz_langchain.ipynb
Chapter on "Chunking."
Code block "In[8]"
Run the code block above.
I just re-ran the notebook and saw that I had missed a pip install unstructured: from langchain.document_loaders import DirectoryLoader requires the unstructured module. I just ran !pip install unstructured and the code works.
In the meantime, I'll add another try..except in case there are no h1 or h2 headers.
I'll also add unstructured to the Line 1 pip installs.
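A rough sketch of the try..except idea, not the actual notebook change. It assumes the splitter exposes a split_text(text) method and fails with ValueError, as the traceback shows. split_or_skip and FakeSplitter are hypothetical names; FakeSplitter only stands in for the real splitter so the example runs without LangChain installed.

```python
from typing import Any, List


def split_or_skip(splitter: Any, html: str, source: str = "") -> List[Any]:
    """Return the splitter's output, or an empty list if it raises ValueError."""
    try:
        return splitter.split_text(html)
    except ValueError as err:
        # The reported failure is a ValueError raised while transforming the page.
        print(f"Skipping {source or 'document'}: {err}")
        return []


class FakeSplitter:
    """Hypothetical stand-in that mimics the reported failure on empty input."""

    def split_text(self, text: str) -> List[str]:
        if not text.strip():
            raise ValueError("Input object has no document")
        return [text]


splitter = FakeSplitter()
good = split_or_skip(splitter, "<h1>Title</h1><p>Body</p>", "ok.html")
bad = split_or_skip(splitter, "", "search.html")
```

Catching only ValueError keeps unrelated failures, such as a missing module, visible instead of hiding them behind the fallback.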