fix: Part of the docx document is parsed incorrectly #1981
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected. Please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
         except BaseException as e:
             traceback.print_exception(e)
    -        return f'{e}'
    \ No newline at end of file
    +        return f'{e}'
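The handler in the diff above catches `BaseException`, which also covers `KeyboardInterrupt` and `SystemExit`. A minimal sketch of a narrower variant follows, assuming the caller only wants the error text back as a string; `safe_get_content`, `parse`, and `source` are hypothetical names used for illustration and are not part of this PR.

```python
import traceback


def safe_get_content(parse, source):
    """Run parse(source); on failure, log the traceback and return the error text.

    Hypothetical helper: `parse` stands in for whatever callable reads the document.
    """
    try:
        return parse(source)
    except Exception as e:
        # The three-argument form works on every Python 3 release;
        # the single-argument form used in the diff needs Python 3.10+.
        traceback.print_exception(type(e), e, e.__traceback__)
        return f"{e}"
```

Catching `Exception` instead of `BaseException` lets interpreter-level signals such as Ctrl+C propagate. Whether that is desirable here depends on how the surrounding service is expected to shut down.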
There are a few suggestions for optimizing and fixing this code:

- Remove Redundant Characters: The code currently uses `replace` to remove different versions of "标题". Consider normalizing these to ensure consistency.
- Refactor Conditional Logic: Use separate conditions instead of multiple chained `if` statements to improve readability.
- Simplify Error Handling: Simplify error handling by using a common error message format.

Here’s an optimized version of the code with these considerations:
class DocSplitHandle(BaseSplitHandle):
    def paragraph_to_md(self, paragraph: Paragraph, doc: Document, images_list, get_image_id):
        try:
            psn = paragraph.style.name
            if psn.startswith(('Heading', ' TOC 标题', '标题')):
                levels = sum(1 for c in psn[psn.index(' ') + 1:].split()) + 1
                title = self._build_heading(levels, paragraph.text)
                images = sum(get_paragraph_element_images(e, doc, images_list, get_image_id) for e in paragraph._element)
            else:
                title = paragraph.text
                images = []
            return f"#{title}\n\n{images}"
        except BaseException as e:
            traceback.print_exception(e)
            return f'Error processing {e}'

    def _build_heading(self, level, text):
        # Build heading string based on level
        return '#' * level + ' ' + text

    def get_content(self, file_path, save_image):
        try:
            document_manager = load_document(file_path)
            content = ''
            for section in document_manager.sections:
                content += self.paragraph_to_md(section.headings[0], document_manager.document, [], lambda x, y: [])
                if not all(img['path'] is None for img in section.images):
                    content += '\n

Changes Made:

- Normalized condition checking for titles by splitting the logic into `_build_heading`.
- Created helper functions for clarity.
- Unified error handling messages.
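The heading normalization discussed in this review can be sketched on its own, without python-docx. This is a rough illustration and not the PR's implementation. It assumes the heading level is the trailing number in the style name (for example "Heading 2" or "标题 3") and that a heading style with no number is level 1. `HEADING_PREFIXES`, `heading_level`, `build_heading`, and `style_to_md` are hypothetical names.

```python
import re
from typing import Optional

# Assumed style-name prefixes; real documents may use others.
HEADING_PREFIXES = ("Heading", "TOC 标题", "标题")


def heading_level(style_name: str) -> Optional[int]:
    """Return the heading level encoded in a style name, or None for non-headings."""
    name = style_name.strip()
    if not name.startswith(HEADING_PREFIXES):
        return None
    match = re.search(r"(\d+)\s*$", name)
    # Assumption: a heading style with no trailing number counts as level 1.
    return int(match.group(1)) if match else 1


def build_heading(level: int, text: str) -> str:
    """Build a Markdown heading, clamping the level to the 1..6 range."""
    level = max(1, min(level, 6))
    return "#" * level + " " + text.strip()


def style_to_md(style_name: str, text: str) -> str:
    """Render text as a Markdown heading if the style is a heading, else unchanged."""
    level = heading_level(style_name)
    return text if level is None else build_heading(level, text)
```

For example, `style_to_md("Heading 2", "Intro")` yields a level-2 heading, while a paragraph styled "Normal" passes through unchanged. Stripping the style name first means variants with stray leading whitespace are handled by one check instead of several `replace` calls.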
(cherry picked from commit d9df013)
fix: Part of the docx document is parsed incorrectly