story/VOGRE-73 #77
base: develop
… created helper functions to parse TXML files in tei_utils.py
…mport to incorporate tei xml files
Can one of the admins verify this patch?
annotations/annotators.py
Outdated
tokenized_content = tei_utils.tokenize_tei_content(tei_data['display_html'])

self.tei_data = tei_data
merge with line 241
annotations/tei_utils.py
Outdated
is_xml = False
is_tei = False

if text_content.strip().startswith('<?xml') or text_content.strip().startswith('<'):
The first condition is included in the second, isn't it? If the text starts with "<?xml", then it also starts with "<".
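A minimal sketch of the reviewer's point, assuming a hypothetical helper name (looks_like_xml is not part of tei_utils.py): every string that starts with "<?xml" also starts with "<", so the single check gives the same answer as the two-part condition.

```python
def looks_like_xml(text_content: str) -> bool:
    """Single prefix check (hypothetical helper).

    This is only a cheap heuristic, not a well-formedness test.
    """
    return text_content.strip().startswith('<')


def looks_like_xml_two_checks(text_content: str) -> bool:
    """The two-part condition quoted in the diff, kept for comparison."""
    stripped = text_content.strip()
    return stripped.startswith('<?xml') or stripped.startswith('<')


samples = ['<?xml version="1.0"?><TEI/>', '  <TEI/>', 'plain text', '']

# Both versions agree on every sample, so the first condition is redundant.
agreement = [looks_like_xml(s) == looks_like_xml_two_checks(s) for s in samples]
```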
annotations/tei_utils.py
Outdated
return None

# ── create namespace-free copy of *this* element ────────────────────
if node.tag.startswith('{'):
when would a tag name start with { ?
ah I see, a better way is probably to do what the documentation suggests and to include namespaces when searching the xml: https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces
this still
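For reference, a minimal sketch of the namespace-aware search described in the linked documentation, using the standard-library xml.etree.ElementTree. The sample document and the "tei" prefix are assumptions made for illustration; they are not taken from the PR. It also shows why a tag name can start with "{": the parser stores qualified names as "{namespace-uri}localname".

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Assumed sample document with a default TEI namespace.
SAMPLE = f"""<TEI xmlns="{TEI_NS}">
  <text><body><p>First paragraph.</p><p>Second paragraph.</p></body></text>
</TEI>"""

root = ET.fromstring(SAMPLE)

# Qualified names are stored as "{uri}local", hence the leading "{".
root_tag = root.tag

# The prefix used in the query is our own choice; it only has to map
# to the right URI in the dictionary passed to findall().
namespaces = {"tei": TEI_NS}
paragraphs = root.findall(".//tei:p", namespaces)
texts = [p.text for p in paragraphs]
```

With this approach the tree does not need to be copied into a namespace-free form before searching it.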
annotations/tei_utils.py
Outdated
"""
try:
    # Use a parser that removes processing instructions (XML declarations, etc.)
    # This prevents issues with <?xml ...?> and <?xml-model ...?> tags
why are there issues with those tags?
Before this try/catch block, XML processing instructions like <?xml ...?> slipped through the parser. The parser is now configured with remove_pis=True (remove processing instructions), which specifically removes XML declarations and other processing instructions like <?xml ...?> and <?xml-model ...?>.
I don't think I understand the comment then. Are you trying to tell people that they should be using a parser that removes those tags, but that if they don't, this try/catch will take care of it?
wait, no, the parser comes after. I'm confused.
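As background for this thread, a small sketch using the standard-library xml.etree.ElementTree rather than the parser configured in the PR (which, going by the remove_pis option, appears to be lxml). The sample document and the page-break instruction are made up for illustration. With the default tree builder, the XML declaration and processing instructions are consumed during parsing and do not appear as elements of the tree.

```python
import xml.etree.ElementTree as ET

# Assumed sample: an XML declaration, an xml-model processing instruction
# before the root, and one more processing instruction inside the body.
SAMPLE = (
    '<?xml version="1.0"?>\n'
    '<?xml-model href="tei_all.rng" type="application/xml"?>\n'
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    '<text><body><?page-break?><p>Hello</p></body></text>'
    '</TEI>'
)

root = ET.fromstring(SAMPLE)

# Local names of all elements in document order; the default TreeBuilder
# does not insert processing instructions into the tree.
tags = [element.tag.split('}', 1)[-1] for element in root.iter()]
```

Whether the same holds for the parser used in the PR depends on how that parser is configured, which is the question the code comment should answer.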
annotations/tei_utils.py
Outdated
'start_pos': len("".join(html_parts))})
if element.text:
    html_parts.append(_escape_html(element.text))
for ch in element:
what does ch stand for?
This was short for child_elements. I have changed it to use better variable names.
annotations/tei_utils.py
Outdated
html_parts.append(_escape_html(ch.tail))
html_parts.append(f'</span>')

return {'html_parts': html_parts, 'element_map': element_map}
this needs to be taken apart. The most elegant solution is probably to create a method for each tag, something like _process_cell or _process_item, and then dynamically call the methods given the tag.
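One way the suggested per-tag dispatch could look, as a rough sketch only. The class name TeiHtmlRenderer, the _process_list and _process_default methods, and the HTML each handler emits are assumptions for illustration; only the names _process_cell and _process_item come from the review comment.

```python
import html
import xml.etree.ElementTree as ET


class TeiHtmlRenderer:
    """Hypothetical renderer with one _process_<tag> method per element type."""

    def render(self, element):
        # Drop a leading "{uri}" so the lookup uses the local tag name.
        local_name = element.tag.split('}', 1)[-1]
        # Look up the handler by name; unknown tags use the default handler.
        handler = getattr(self, f'_process_{local_name}', self._process_default)
        return handler(element)

    def _render_children(self, element):
        # Element text, then each child followed by its tail text.
        parts = [html.escape(element.text or '')]
        for child in element:
            parts.append(self.render(child))
            parts.append(html.escape(child.tail or ''))
        return ''.join(parts)

    def _process_default(self, element):
        return self._render_children(element)

    def _process_cell(self, element):
        return f'<td>{self._render_children(element)}</td>'

    def _process_item(self, element):
        return f'<li>{self._render_children(element)}</li>'

    def _process_list(self, element):
        return f'<ul>{self._render_children(element)}</ul>'


sample = ET.fromstring('<list><item>a &amp; b</item><item>c</item></list>')
rendered = TeiHtmlRenderer().render(sample)
```

Supporting a new tag then means adding one method, and the traversal logic stays in a single place.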
Make it so, Jenkins.
…tion, parsing failing
…, added debug statements - stand alone tags still slipping
for prefix, uri in namespace_map.items():
    if 'tei-c.org' in uri:
        tei_namespace_uri = uri
        break
if you want to properly parse the xml, you'll need all namespaces, not just the TEI one I would think
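A sketch of how all declared namespaces could be collected instead of only the TEI one, using the standard library's iterparse with the "start-ns" event. The function name collect_namespaces and the sample document are assumptions for illustration; the PR's own code builds namespace_map in a way not shown here.

```python
import io
import xml.etree.ElementTree as ET

# Assumed sample with a default namespace and one prefixed namespace.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:xi="http://www.w3.org/2001/XInclude">
  <text><body><p>Hello</p></body></text>
</TEI>"""


def collect_namespaces(xml_text: str) -> dict:
    """Return every prefix-to-URI declaration found in the document."""
    namespaces = {}
    source = io.BytesIO(xml_text.encode('utf-8'))
    # Each "start-ns" event yields a (prefix, uri) tuple.
    for _event, (prefix, uri) in ET.iterparse(source, events=('start-ns',)):
        # Keep the first declaration seen for a prefix.
        namespaces.setdefault(prefix, uri)
    return namespaces


namespace_map = collect_namespaces(SAMPLE)
```

ElementTree reports the default namespace under the empty-string prefix, so code that expects a different key for the default namespace would need to translate it.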
# If root is TEI but no namespace declared, assume default
if not tei_namespace_uri and etree.QName(root).localname in ['TEI', 'tei']:
    tei_namespace_uri = TEI_NAMESPACE
    namespace_map[None] = TEI_NAMESPACE
This should not be possible. If a prefix is used, then the prefix needs to be declared with the namespace URI, I believe. Or is this doing something else?
And if there is a default namespace (elements are used without a prefix), then the namespace is specified with the xmlns attribute, I believe.
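The three cases discussed in this thread can be checked with a short sketch using the standard-library parser (the one-element documents are made up for illustration): a default namespace declared with xmlns, no namespace at all, and a prefix that is used without being declared.

```python
import xml.etree.ElementTree as ET

# Default namespace declared with xmlns: the tag carries the URI.
with_default_ns = ET.fromstring('<TEI xmlns="http://www.tei-c.org/ns/1.0"/>')

# No namespace declared: the tag is the bare local name.
without_ns = ET.fromstring('<TEI/>')

# A prefix used without a declaration is not well-formed for a
# namespace-aware parser, so parsing fails instead of returning a tree.
try:
    ET.fromstring('<tei:TEI/>')
    undeclared_prefix_error = None
except ET.ParseError as exc:
    undeclared_prefix_error = str(exc)
```

A root named TEI with no namespace is therefore a legal document whose elements are in no namespace, which is different from a TEI document with a missing declaration.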
also, there are conflicts
…ues wit tei xml files not getting extracted
Guidelines for Pull Requests
If you haven't yet read our code review guidelines, please do so. You can find them here.
Please confirm the following by adding an x for each item (turn [ ] into [x]).
Please provide a brief description of your ticket
Vogon should be able to annotate TEI-XML files
TEI is an XML format to annotate texts (it defines what paragraphs there are, line breaks, etc, see example). Vogon should be able to show the text (without the XML markup) with proper layout (respecting paragraphs, line breaks, etc), and then let the user annotate the text. Here, pointers would need to be xpointers (referencing by xpath and character count within tag probably).
VOGRE-73
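The pointer idea in the ticket description (a path to an element plus a character range within it) could be modelled roughly as follows. This is a sketch under stated assumptions: TextPointer, resolve, the "tei" prefix, and the sample document are all hypothetical, and the path syntax is the limited XPath subset supported by xml.etree.ElementTree, not full XPointer.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


@dataclass(frozen=True)
class TextPointer:
    """Hypothetical pointer: an element path plus character offsets."""
    path: str   # ElementTree-style path relative to the root element
    start: int  # inclusive offset into the element's text content
    end: int    # exclusive offset into the element's text content


def resolve(root, pointer):
    """Return the text selected by the pointer, ignoring inner markup."""
    element = root.find(pointer.path, TEI_NS)
    if element is None:
        raise ValueError(f"No element matches {pointer.path!r}")
    # Concatenate the text of the element and all of its descendants.
    text = "".join(element.itertext())
    if not 0 <= pointer.start <= pointer.end <= len(text):
        raise ValueError("Character offsets fall outside the element's text")
    return text[pointer.start:pointer.end]


# Assumed sample document.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p>The first paragraph.</p>
<p>Charles <hi>Darwin</hi> sailed on the Beagle.</p>
</body></text></TEI>"""

root = ET.fromstring(SAMPLE)
pointer = TextPointer("./tei:text/tei:body/tei:p[2]", 8, 14)
selected = resolve(root, pointer)
```

Counting characters over the concatenated text of an element keeps annotations stable when inline markup such as hi is added or removed, at the cost of not recording which child element the characters sit in.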
Are there any other pull requests that this one depends on?
Anything else the reviewer needs to know?
... describe here ...