Skip to content

Conversation

Girik1105
Copy link
Contributor

@Girik1105 Girik1105 commented Apr 17, 2025

Guidelines for Pull Requests

If you haven't yet read our code review guidelines, please do so, You can find them here.

Please confirm the following by adding an x for each item (turn [ ] into [x]).

  • I have removed all code style changes that are not necessary (e.g. changing blanks across the whole file that don’t need to be changed, adding empty lines in parts other than your own code)
  • I am not making any changes to files that don’t have any effect (e.g. imports added that don’t need to be added)
  • I do not have any sysout statements in my code or commented out code that isn’t needed anymore
  • I am not reformatting any files in the wrong format or without cause.
  • I am not changing file encoding or line endings to something else than UTF-8, LF
  • My pull request does not show an insane amount of files being changed although my ticket only requires a few files being changed
  • I have added Javadoc/documentation where appropriate
  • I have added test cases where appropriate
  • I have explained any part of my code/implementation decisions that is not be self-explanatory

Please provide a brief description of your ticket

Vogon should be able to annotate TEI-XML files

TEI is an XML format to annotate texts (it defines what paragraphs there are, line breaks, etc, see example). Vogon should be able to show the text (without the XML markup) with proper layout (respecting paragraphs, line breaks, etc), and then let the user annotate the text. Here, pointers would need to be xpointers (referencing by xpath and character count within tag probably).

VOGRE-73

Are there any other pull requests that this one depends on?

Anything else the reviewer needs to know?

... describe here ...

@diging-jenkins
Copy link
Collaborator

Can one of the admins verify this patch?

@jophals
Copy link
Contributor

jophals commented Apr 22, 2025

Am i doing something wrong? I used the example xml file in Jira but my text in the annotation view still had all the xml tags and headers
Screenshot 2025-04-21 at 11 12 05 PM

@Girik1105 Girik1105 reopened this Apr 24, 2025
@Girik1105
Copy link
Contributor Author

Am i doing something wrong? I used the example xml file in Jira but my text in the annotation view still had all the xml tags and headers Screenshot 2025-04-21 at 11 12 05 PM

I was missing checks for some namespace tags, I have added code in our TEI-XML utility files to strip the namespace tags and display the text in appropriate formatting


tokenized_content = tei_utils.tokenize_tei_content(tei_data['display_html'])

self.tei_data = tei_data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge with line 241

is_xml = False
is_tei = False

if text_content.strip().startswith('<?xml') or text_content.strip().startswith('<'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first condition is included in the second, isn't it? If the text starts with "<?xml", then it also starts with "<".

return None

# ── create namespace-free copy of *this* element ────────────────────
if node.tag.startswith('{'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when would a tag name start with {?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I see, a better way is probably to do what the documentation suggests and to include namespaces when searching the xml: https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this still

"""
try:
# Use a parser that removes processing instructions (XML declarations, etc.)
# This prevents issues with <?xml ...?> and <?xml-model ...?> tags
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are there issues with those tags?

Copy link
Contributor Author

@Girik1105 Girik1105 May 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

Before this try/catch block, XML processing instructions like <?xml ...?> slipped through the parser. The parser is now configured with remove_pis=True (remove processing instructions), which specifically removes XML declarations and other processing instructions like<?xml ...?>and<?xml-model ...?>.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand the comment then. Are you trying to tell people, they should be using a parser that removes those tags, but if not, this try/catch will take care of it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, not, the parser comes after. I'm confused.

'start_pos': len("".join(html_parts))})
if element.text:
html_parts.append(_escape_html(element.text))
for ch in element:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does ch stand for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was short for child_elements, I have changed it to use better variable names

html_parts.append(_escape_html(ch.tail))
html_parts.append(f'</span>')

return {'html_parts': html_parts, 'element_map': element_map}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be taken apart. The most elegant solutions is probably to create a method for each tag, something like _process_cell or _process_item and then dynamically call the methods given the tag.

@jdamerow jdamerow closed this May 1, 2025
@Girik1105 Girik1105 reopened this May 2, 2025
"""
try:
# Use a parser that removes processing instructions (XML declarations, etc.)
# This prevents issues with <?xml ...?> and <?xml-model ...?> tags
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand the comment then. Are you trying to tell people, they should be using a parser that removes those tags, but if not, this try/catch will take care of it?

"""
try:
# Use a parser that removes processing instructions (XML declarations, etc.)
# This prevents issues with <?xml ...?> and <?xml-model ...?> tags
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, not, the parser comes after. I'm confused.

return None

# ── create namespace-free copy of *this* element ────────────────────
if node.tag.startswith('{'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this still

@jdamerow
Copy link
Member

Make it so, Jenkins.

@jdamerow jdamerow closed this May 23, 2025
@Girik1105 Girik1105 reopened this Jun 5, 2025
for prefix, uri in namespace_map.items():
if 'tei-c.org' in uri:
tei_namespace_uri = uri
break
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you want to properly parse the xml, you'll need all namespaces, not just the TEI one I would think

# If root is TEI but no namespace declared, assume default
if not tei_namespace_uri and etree.QName(root).localname in ['TEI', 'tei']:
tei_namespace_uri = TEI_NAMESPACE
namespace_map[None] = TEI_NAMESPACE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should not be possible. If a prefix is used, then the prefix needs to be declared with the namespace uri i believe. Or is this doing something else?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and if there is a default namespace (elements are used without a prefix), then the namespace is specified with the xmlns attribute i believe.

@jdamerow
Copy link
Member

jdamerow commented Jun 6, 2025

also, there are conflicts

@jdamerow jdamerow closed this Jun 6, 2025
@Girik1105 Girik1105 reopened this Jun 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants