[BUG]: Error in chunking document in RAG demo "In[8]" #1258

### Is there an existing issue for this?

- [X] I have searched the existing issues

### Current Behavior

Notebook: https://github.com/milvus-io/bootcamp/blob/master/bootcamp/RAG/readthedocs_zilliz_langchain.ipynb
Section: "Chunking"
Code block: "In[8]"

Running this code block raises the following error:

```
ValueError                                Traceback (most recent call last)
Cell In[11], line 29
     27 for doc in docs:
     28     soup = BeautifulSoup(doc.page_content, 'html.parser')
---> 29     splits = html_splitter.split_text(str(soup))
     30     for split in splits:
     31         # Add the source URL and header values to the metadata
     32         metadata = {}

File /usr/local/lib/python3.8/site-packages/langchain/text_splitter.py:589, in HTMLHeaderTextSplitter.split_text(self, text)
    583 def split_text(self, text: str) -> List[Document]:
    584     """Split HTML text string
    585 
    586     Args:
    587         text: HTML text
    588     """
--> 589     return self.split_text_from_file(StringIO(text))

File /usr/local/lib/python3.8/site-packages/langchain/text_splitter.py:617, in HTMLHeaderTextSplitter.split_text_from_file(self, file)
    615 xslt_tree = etree.parse(xslt_path)
    616 transform = etree.XSLT(xslt_tree)
--> 617 result = transform(tree)
    618 result_dom = etree.fromstring(str(result))
    620 # create filter and mapping for header metadata

File src/lxml/xslt.pxi:509, in lxml.etree.XSLT.__call__()

File src/lxml/apihelpers.pxi:50, in lxml.etree._documentOrRaise()

ValueError: Input object has no document: lxml.etree._ElementTree
```
![image](https://github.com/milvus-io/bootcamp/assets/65534112/76be78c3-6183-4fce-83d9-c638afda8f8f)


### Expected Behavior

One of the sample HTML documents appears to be empty and contains no h1 or h2 elements:
`rtdocs/pymilvus.readthedocs.io/en/latest/search.html`

I suspect this is what makes the cell fail, since the subsequent processing does not seem to expect empty content. Shouldn't `doc.page_content` be checked for empty content before this line?
`soup = BeautifulSoup(doc.page_content, 'html.parser')`
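A minimal sketch of such a check, assuming the failure really is caused by empty page content (I have not confirmed this against the langchain source). The helper name `split_non_empty` is hypothetical and not part of the notebook or of langchain; it takes the splitter function as an argument, so it only relies on the `html_splitter.split_text` call that already appears in the traceback.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def split_non_empty(
    pages: Iterable[str],
    split_text: Callable[[str], List[T]],
) -> List[T]:
    """Split each page, skipping pages that are empty or whitespace-only.

    `split_text` is whatever splitter function the caller already uses;
    this hypothetical helper only adds the emptiness guard.
    """
    chunks: List[T] = []
    for page in pages:
        if not page or not page.strip():
            # Assumed cause of the ValueError: nothing to parse, so skip.
            continue
        chunks.extend(split_text(page))
    return chunks
```

In the notebook loop, the equivalent guard would be an `if not doc.page_content.strip(): continue` placed before the `BeautifulSoup(...)` line, which keeps the rest of the metadata handling unchanged.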

### Steps To Reproduce

```markdown
1. Open https://github.com/milvus-io/bootcamp/blob/master/bootcamp/RAG/readthedocs_zilliz_langchain.ipynb
2. Go to the "Chunking" section.
3. Run code block "In[8]".
```


### Software version

```markdown
Milvus   : 2.3
OS       : Debian GNU/Linux 12 (bookworm)
langchain: 0.1.5
```


### Anything else?

_No response_
