[BUG]: Error in chunking document in RAG demo "In[8]" #1258

Closed
1 task done
D-aisukeY-oshida opened this issue Feb 6, 2024 · 4 comments

Comments

@D-aisukeY-oshida

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

https://github.com/milvus-io/bootcamp/blob/master/bootcamp/RAG/readthedocs_zilliz_langchain.ipynb
Chapter on "Chunking."
Code block "In[8]"
Running that cell raises the following error:

///
ValueError                                Traceback (most recent call last)
Cell In[11], line 29
     27 for doc in docs:
     28     soup = BeautifulSoup(doc.page_content, 'html.parser')
---> 29     splits = html_splitter.split_text(str(soup))
     30     for split in splits:
     31         # Add the source URL and header values to the metadata
     32         metadata = {}

File /usr/local/lib/python3.8/site-packages/langchain/text_splitter.py:589, in HTMLHeaderTextSplitter.split_text(self, text)
    583 def split_text(self, text: str) -> List[Document]:
    584     """Split HTML text string
    585
    586     Args:
    587         text: HTML text
    588     """
--> 589     return self.split_text_from_file(StringIO(text))

File /usr/local/lib/python3.8/site-packages/langchain/text_splitter.py:617, in HTMLHeaderTextSplitter.split_text_from_file(self, file)
    615 xslt_tree = etree.parse(xslt_path)
    616 transform = etree.XSLT(xslt_tree)
--> 617 result = transform(tree)
    618 result_dom = etree.fromstring(str(result))
    620 # create filter and mapping for header metadata

File src/lxml/xslt.pxi:509, in lxml.etree.XSLT.__call__()

File src/lxml/apihelpers.pxi:50, in lxml.etree._documentOrRaise()

ValueError: Input object has no document: lxml.etree._ElementTree
///

Expected Behavior

One of the sample HTML documents appears to be empty, with no h1 or h2 elements:
 'rtdocs/pymilvus.readthedocs.io/en/latest/search.html'

I suspect this is what makes the cell fail, since the splitting step does not seem to expect empty input. Shouldn't doc.page_content be checked for empty content before this line?
soup = BeautifulSoup(doc.page_content, 'html.parser')
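
A minimal sketch of such a check, under stated assumptions: `Doc` is a hypothetical stand-in for the notebook's document objects (only `page_content` and `metadata` are modelled), and `non_empty_docs` is an illustrative helper name, not something taken from the notebook or from LangChain. It uses only the standard library so it can be run on its own.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document (not the LangChain class)."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def non_empty_docs(docs: Iterable[Doc]) -> List[Doc]:
    """Keep only documents whose page_content contains non-whitespace text."""
    return [d for d in docs if d.page_content and d.page_content.strip()]


# Illustrative inputs: one normal page, one empty page, one whitespace-only page.
docs = [
    Doc("<h1>Search</h1><p>text</p>", {"source": "a.html"}),
    Doc("", {"source": "rtdocs/pymilvus.readthedocs.io/en/latest/search.html"}),
    Doc("   \n", {"source": "b.html"}),
]

kept = non_empty_docs(docs)
for d in kept:
    print("will split:", d.metadata["source"])
```

Filtering before the loop keeps the splitting code unchanged; whether an empty page should be skipped silently or reported is a separate choice for the notebook author.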

Steps To Reproduce

https://github.com/milvus-io/bootcamp/blob/master/bootcamp/RAG/readthedocs_zilliz_langchain.ipynb
Chapter on "Chunking."
Code block "In[8]"
Run that cell.

Software version

Milvus   : 2.3
OS       : Debian GNU/Linux 12 (bookworm)
langchain: 0.1.5

Anything else?

No response

@D-aisukeY-oshida D-aisukeY-oshida changed the title from [BUG]: Error in chunking document in RAG demo to [BUG]: Error in chunking document in RAG demo "In[8]" on Feb 6, 2024
@christy
Collaborator

christy commented Feb 6, 2024

Hey D-aisukeY-oshida! Thank you for the bug report! I notice you're running LangChain 0.1.5; I was running 0.1.0. I'll upgrade and look into it.

@christy
Collaborator

christy commented Feb 6, 2024

I just re-ran the notebook and saw that I had missed a pip install unstructured: from langchain.document_loaders import DirectoryLoader requires the unstructured module. I just ran !pip install unstructured and the code works.
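
As a hedged illustration of catching a missing dependency like this before running the notebook, the sketch below checks whether top-level modules can be found in the current environment. `missing_modules` is a hypothetical helper name, and the list of modules to check is an assumption based only on what this thread mentions.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "unstructured" is the module named in this thread; "json" is a stdlib control
# that should always be found.
missing = missing_modules(["json", "unstructured"])
if missing:
    print("Install before running the notebook:", ", ".join(missing))
else:
    print("All checked modules are available.")
```

The check only looks the module up and does not import it, so it stays cheap even for heavy packages.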

In the meantime, I'll add another try..except in case there are no h1 or h2 headers.
I'll also add unstructured to the pip installs on the first line.
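
A rough sketch of what such a try..except could look like, not the code that was actually pushed. `split_pages` and `fake_split_text` are hypothetical names; `fake_split_text` only imitates a splitter that raises ValueError on empty input, as in the traceback above, so the example runs without LangChain or lxml installed.

```python
from typing import Callable, Iterable, List, Tuple


def split_pages(
    pages: Iterable[Tuple[str, str]],
    split_text: Callable[[str], List[str]],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Split each (source, html) page, collecting failures instead of aborting."""
    chunks: List[Tuple[str, str]] = []
    skipped: List[Tuple[str, str]] = []
    for source, html in pages:
        try:
            parts = split_text(html)
        except ValueError as err:
            # Record the page and the reason, then continue with the next page.
            skipped.append((source, str(err)))
            continue
        chunks.extend((source, part) for part in parts)
    return chunks, skipped


def fake_split_text(html: str) -> List[str]:
    """Hypothetical stand-in for the real splitter: fails on empty input."""
    if not html.strip():
        raise ValueError("Input object has no document")
    return [line for line in html.split("\n") if line]


pages = [
    ("a.html", "<h1>Title</h1>\n<p>Body</p>"),
    ("search.html", ""),
]
chunks, skipped = split_pages(pages, fake_split_text)
print(len(chunks), "chunks;", len(skipped), "page(s) skipped")
```

Catching only ValueError keeps unrelated errors visible, and returning the skipped pages makes it easy to see which documents were left out of the index.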

@christy
Collaborator

christy commented Feb 6, 2024

Pushed #1260

@D-aisukeY-oshida
Author

Hello @christy ! The issue has been successfully resolved. I appreciate your prompt assistance in this matter. Thank you.
