story/VOGRE-73 #77

Girik1105 · 2025-04-17T17:27:57Z

Guidelines for Pull Requests

If you haven't yet read our code review guidelines, please do so. You can find them here.

Please confirm the following by adding an x for each item (turn [ ] into [x]).

I have removed all code style changes that are not necessary (e.g. changing blanks across the whole file that don’t need to be changed, adding empty lines in parts other than your own code)
I am not making any changes to files that don’t have any effect (e.g. imports added that don’t need to be added)
I do not have any sysout statements in my code or commented out code that isn’t needed anymore
I am not reformatting any files in the wrong format or without cause.
I am not changing file encoding or line endings to something other than UTF-8, LF
My pull request does not show an insane amount of files being changed although my ticket only requires a few files being changed
I have added Javadoc/documentation where appropriate
I have added test cases where appropriate
I have explained any part of my code/implementation decisions that is not self-explanatory

Please provide a brief description of your ticket

Vogon should be able to annotate TEI-XML files

TEI is an XML format to annotate texts (it defines what paragraphs there are, line breaks, etc, see example). Vogon should be able to show the text (without the XML markup) with proper layout (respecting paragraphs, line breaks, etc), and then let the user annotate the text. Here, pointers would need to be xpointers (referencing by xpath and character count within tag probably).

VOGRE-73
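
The pointer idea in the description (an XPath to the element plus a character offset inside it) could look roughly like the sketch below. It uses the standard library's `xml.etree.ElementTree`; the sample document, the `resolve_pointer` helper, and the `(xpath, start, length)` layout are illustrative assumptions, not the format Vogon actually stores.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal TEI document used only for illustration.
TEI_SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p>First paragraph.</p>
    <p>Second paragraph with a name.</p>
  </body></text>
</TEI>"""

# Prefix-to-URI mapping used when evaluating the path expressions.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def resolve_pointer(root, xpath, start, length):
    """Return the text addressed by an element path plus a character range."""
    element = root.find(xpath, NS)
    if element is None:
        raise ValueError(f"no element matches {xpath!r}")
    # Offsets count characters of the element's text content, markup excluded.
    text = "".join(element.itertext())
    return text[start:start + length]


root = ET.fromstring(TEI_SAMPLE)
print(resolve_pointer(root, "./tei:text/tei:body/tei:p[2]", 0, 6))  # Second
```

Counting offsets against the markup-free text keeps a pointer stable even if the HTML used for display changes.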

Are there any other pull requests that this one depends on?

Anything else the reviewer needs to know?

... describe here ...


diging-jenkins · 2025-04-17T17:28:00Z

Can one of the admins verify this patch?

jophals · 2025-04-22T06:13:42Z

Am I doing something wrong? I used the example XML file in Jira, but my text in the annotation view still had all the XML tags and headers.

Girik1105 · 2025-04-24T23:00:09Z

> Am I doing something wrong? I used the example XML file in Jira, but my text in the annotation view still had all the XML tags and headers.

I was missing checks for some namespace tags. I have added code to our TEI-XML utility files to strip the namespace tags and display the text with appropriate formatting.

jdamerow · 2025-05-01T17:09:44Z

annotations/annotators.py

+
+            tokenized_content = tei_utils.tokenize_tei_content(tei_data['display_html'])
+
+            self.tei_data = tei_data


merge with line 241

jdamerow · 2025-05-01T17:18:37Z

annotations/tei_utils.py

+    is_xml = False
+    is_tei = False
+
+    if text_content.strip().startswith('<?xml') or text_content.strip().startswith('<'):


The first condition is included in the second, isn't it? If the text starts with "<?xml", then it also starts with "<".
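
A quick way to confirm the point above is to compare both forms of the check on a few inputs. This is only a sketch; `original_check` and `simplified_check` are hypothetical names, not functions from `tei_utils.py`.

```python
def original_check(text_content):
    """The condition as written in the diff."""
    stripped = text_content.strip()
    return stripped.startswith('<?xml') or stripped.startswith('<')


def simplified_check(text_content):
    """Equivalent check: anything starting with '<?xml' also starts with '<'."""
    return text_content.strip().startswith('<')


samples = [
    '<?xml version="1.0"?><TEI/>',
    '  <TEI></TEI>',
    'plain text',
    '',
    'a < b',
]

# Both checks agree on every sample, so the first condition is redundant.
results = [(original_check(s), simplified_check(s)) for s in samples]
print(all(a == b for a, b in results))
```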

jdamerow · 2025-05-01T17:23:28Z

annotations/tei_utils.py

+        return None
+
+    # ── create namespace-free copy of *this* element ────────────────────
+    if node.tag.startswith('{'):


when would a tag name start with {?

ah I see, a better way is probably to do what the documentation suggests and to include namespaces when searching the xml: https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces
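
For context, a small sketch of what the linked documentation describes, using the standard library's `xml.etree.ElementTree`. The sample document and the `tei` prefix are assumptions chosen for illustration. ElementTree stores a namespaced tag as `{uri}localname`, which is why a tag name can start with `{`; passing a prefix map to `findall` avoids stripping the namespace by hand.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample document with a default namespace declaration.
doc = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    '<text><body><p>Hello</p><p>World</p></body></text>'
    '</TEI>'
)

root = ET.fromstring(doc)

# The namespace URI is embedded in the tag name.
print(root.tag)  # {http://www.tei-c.org/ns/1.0}TEI

# Search with a prefix map instead of rewriting the tags.
ns = {"tei": TEI_NS}
paragraphs = [p.text for p in root.findall(".//tei:p", ns)]
print(paragraphs)  # ['Hello', 'World']
```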

jdamerow · 2025-05-01T17:35:43Z

annotations/tei_utils.py

+    """
+    try:
+        # Use a parser that removes processing instructions (XML declarations, etc.)
+        # This prevents issues with <?xml ...?> and <?xml-model ...?> tags 


why are there issues with those tags?

Before this try/catch block, XML processing instructions like <?xml ...?> slipped through the parser. The parser is now configured with remove_pis=True (remove processing instructions), which specifically removes XML declarations and other processing instructions like <?xml ...?> and <?xml-model ...?>.

I don't think I understand the comment then. Are you trying to tell people, they should be using a parser that removes those tags, but if not, this try/catch will take care of it?

wait, no, the parser comes after. I'm confused.
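
As a point of comparison for this thread, the sketch below uses the standard library's `xml.etree.ElementTree` rather than the parser configured in the PR (the `remove_pis` option mentioned above appears to belong to lxml). With the stdlib parser's defaults, the XML declaration and a leading `<?xml-model ...?>` instruction do not end up in the element tree, so they cannot appear in text extracted from the tree. The sample document is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with an XML declaration and an xml-model instruction.
doc = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<?xml-model href="tei_all.rng" type="application/xml"?>\n'
    b'<TEI><text>Hello</text></TEI>'
)

root = ET.fromstring(doc)

# Text gathered from the tree contains only character data.
extracted_text = "".join(root.itertext())
print(extracted_text)  # Hello

# Re-serializing the tree shows that neither instruction is part of it.
serialized = ET.tostring(root, encoding="unicode")
print("<?xml" in serialized)  # False
```

If such instructions still show up in the annotation view, one possible explanation is that the raw file content is displayed somewhere instead of text taken from the parsed tree.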

jdamerow · 2025-05-01T17:42:37Z

annotations/tei_utils.py

+                            'start_pos': len("".join(html_parts))})
+        if element.text:
+            html_parts.append(_escape_html(element.text))
+        for ch in element:


what does ch stand for?

This was short for child_elements, I have changed it to use better variable names

jdamerow · 2025-05-01T17:44:36Z

annotations/tei_utils.py

+            html_parts.append(_escape_html(ch.tail))
+    html_parts.append(f'</span>')
+
+    return {'html_parts': html_parts, 'element_map': element_map}


this needs to be taken apart. The most elegant solution is probably to create a method for each tag, something like _process_cell or _process_item, and then dynamically call the methods given the tag.
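
One way the suggested per-tag dispatch might be structured is sketched below. The class name `TeiHtmlRenderer`, the HTML produced for each tag, and the fallback handler are assumptions for illustration; only the `_process_<tag>` naming follows the review comment.

```python
import xml.etree.ElementTree as ET
from html import escape


class TeiHtmlRenderer:
    """Render an element tree to HTML by dispatching on the tag name."""

    def render(self, element):
        # Drop a '{namespace}' prefix if present, keeping the local name.
        local_name = element.tag.rsplit('}', 1)[-1]
        # Look up _process_<tag>; fall back to a generic handler.
        handler = getattr(self, f"_process_{local_name}", self._process_default)
        return handler(element)

    def _render_children(self, element):
        # Element text, then each child followed by its tail text.
        parts = [escape(element.text or "")]
        for child in element:
            parts.append(self.render(child))
            parts.append(escape(child.tail or ""))
        return "".join(parts)

    def _process_default(self, element):
        return f"<span>{self._render_children(element)}</span>"

    def _process_p(self, element):
        return f"<p>{self._render_children(element)}</p>"

    def _process_item(self, element):
        return f"<li>{self._render_children(element)}</li>"

    def _process_cell(self, element):
        return f"<td>{self._render_children(element)}</td>"


renderer = TeiHtmlRenderer()
html = renderer.render(ET.fromstring('<p>a <hi>b</hi> &amp; c</p>'))
print(html)  # <p>a <span>b</span> &amp; c</p>
```

Supporting a new tag then means adding one small method rather than another branch in a long function.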


jdamerow · 2025-05-23T18:08:15Z

Make it so, Jenkins.


jdamerow · 2025-06-06T16:36:22Z

annotations/tei_utils.py

+    for prefix, uri in namespace_map.items():
+        if 'tei-c.org' in uri:
+            tei_namespace_uri = uri
+            break


if you want to properly parse the xml, you'll need all namespaces, not just the TEI one, I would think
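
A sketch of collecting every declared namespace rather than only the TEI one, using the standard library's `iterparse` with `start-ns` events. The `collect_namespaces` helper, the sample document, and the `default` key used for the unprefixed namespace are assumptions, not code from the PR.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document declaring a default namespace and one prefixed namespace.
SAMPLE = (
    b'<TEI xmlns="http://www.tei-c.org/ns/1.0" '
    b'xmlns:xi="http://www.w3.org/2001/XInclude">'
    b'<text><body><p>Hi</p></body></text></TEI>'
)


def collect_namespaces(xml_bytes, default_prefix="default"):
    """Map every declared prefix to its URI.

    The default namespace has an empty prefix, so it is stored under
    `default_prefix`. If a prefix is declared more than once, the first
    declaration wins.
    """
    namespaces = {}
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns",))
    for _event, (prefix, uri) in events:
        namespaces.setdefault(prefix or default_prefix, uri)
    return namespaces


namespaces = collect_namespaces(SAMPLE)
root = ET.fromstring(SAMPLE)
paragraphs = [p.text for p in root.findall(".//default:p", namespaces)]
print(namespaces)
print(paragraphs)
```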

jdamerow · 2025-06-06T16:37:14Z

annotations/tei_utils.py

+    # If root is TEI but no namespace declared, assume default
+    if not tei_namespace_uri and etree.QName(root).localname in ['TEI', 'tei']:
+        tei_namespace_uri = TEI_NAMESPACE
+        namespace_map[None] = TEI_NAMESPACE


this should not be possible. If a prefix is used, then the prefix needs to be declared with the namespace uri, I believe. Or is this doing something else?

and if there is a default namespace (elements are used without a prefix), then the namespace is specified with the xmlns attribute, I believe.
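
The three cases discussed here can be checked with a short sketch using the standard library's `xml.etree.ElementTree`; the sample strings are assumptions. An undeclared prefix is rejected by the parser, a default namespace comes from the `xmlns` attribute, and a root without any declaration is simply in no namespace.

```python
import xml.etree.ElementTree as ET

with_default = '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text/></TEI>'
without_namespace = '<TEI><text/></TEI>'
undeclared_prefix = '<tei:TEI><tei:text/></tei:TEI>'

# Default namespace declared via xmlns: unprefixed tags carry the URI.
print(ET.fromstring(with_default).tag)       # {http://www.tei-c.org/ns/1.0}TEI

# No declaration at all: the element is in no namespace.
print(ET.fromstring(without_namespace).tag)  # TEI

# A prefix that was never declared is a parse error.
try:
    ET.fromstring(undeclared_prefix)
    prefix_error = None
except ET.ParseError as error:
    prefix_error = error
    print("parse error:", error)
```

So the branch in the diff can only be reached for files whose root is named TEI but declares no namespace at all, which is the second case.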

jdamerow · 2025-06-06T16:44:09Z

also, there are conflicts


Girik1105 added 10 commits April 11, 2025 12:51

[VOGRE-73] Added TXML as a supported type in document position model, created helper functions to parse TXML files in tei_utils.py

d91269a

[VOGRE-73] added missing tei help functions, upated repository text import to incorporate tei xml files

9a83755

Merge branch 'develop' into story/VOGRE-73

e50a3f1

[VOGRE-73] Fixed progress id check and implemnted it in managers item function

21ea495

[VOGRE-73] Added check to list xml files

9c0632c

[VOGRE-73] Updated tei utils to better handle xml files

b478330

[VOGRE-73] Added xml annotator

9b2ca21

[VOGRE-73] added better xml processing, still few tags are showing up in frontend

5e9f8ac

[VOGRE-73] Added html check in tokenize function

66ac84b

[VOGRE-73] Code cleanup

5880b1c

jophals closed this Apr 22, 2025

jophals requested a review from jdamerow April 22, 2025 06:13

jophals assigned Girik1105 Apr 22, 2025

Girik1105 added 3 commits April 24, 2025 14:58

[VOGRE-73] Fixed namespace tag issues while xml parsing

fbd7e55

[VOGRE-7] fixced footnote and other misc processing

718090e

[VOGRE-73] Code clean up

6fb35a7

Girik1105 reopened this Apr 24, 2025

jdamerow requested changes May 1, 2025


jdamerow closed this May 1, 2025

Girik1105 added 4 commits May 2, 2025 14:55

[VOGRE-73] better variable naming, removed duplicate check, removed redundant code

c9e3cda

[VOGRE-73] Made a xml parsing class

61de61b

[VOGRE-73] used documentation to search xml instead of manual functions

468e7ea

Merge branch 'develop' into story/VOGRE-73

af2b3bc

Girik1105 reopened this May 2, 2025

Merge branch 'develop' into story/VOGRE-73

14708a0

jdamerow requested changes May 23, 2025


jdamerow closed this May 23, 2025

Girik1105 added 8 commits May 23, 2025 14:17

Merge branch 'develop' into story/VOGRE-73

dc41e41

[VOGRE-73] fixed comments, attempted to use namespace using documentation, parsing failing

5bef64b

[VOHRE-73] update tei xml parsing functions

c54cf4c

[VOGRE-73] XML functions refactoring - tags still slipping through needs a fix

5d9c58d

[VOGRE-73 added new tei xml strucutres, refactored existing functions, added debug statements - stand alone tags still slipping

ac8b0e2

[VOGRE-73] Fixed functions for tei xml processing, tei extraction successful

0effb65

Merge branch 'develop' into story/VOGRE-73

590a695

[VOGrE-73] Code refactored, tested files from giles, everything works

76b3f24

Girik1105 reopened this Jun 5, 2025

jdamerow requested changes Jun 6, 2025


jdamerow closed this Jun 6, 2025

[VOGRE-73] fixed all namespaces issue, prefix declarations, fixed issues wit tei xml files not getting extracted

7796fb6

Girik1105 reopened this Jun 7, 2025

