xml2 read_html removes closing tags from JSON-LD when using a single option #373

Open
sbha opened this issue Oct 3, 2022 · 2 comments
Labels: bug (an unexpected problem or unintended behavior)

Comments

sbha commented Oct 3, 2022

xml2::read_html(x) returns the HTML within a linked data JSON object as expected:

library(xml2)
library(magrittr)
library(rvest)

test_ld <- '<script type="application/ld+json">{"@context":"http://schema.org","@type":"ReproducibleExample", "description":"<p><strong>text within tags</strong>text after closing tag</p>"'

# tags preserved
test_ld %>% 
  read_html() %>% 
  html_node('script[type="application/ld+json"]') %>% 
  as.character()

[1] "<script type=\"application/ld+json\">{\"@context\":\"http://schema.org\",\"@type\":\"ReproducibleExample\", \"description\":\"<p><strong>text within tags</strong>text after closing tag</p>\"</script>"

Here, description contains the HTML <p><strong>text within tags</strong>text after closing tag</p>

However, with xml2::read_html(x, options = 'HUGE'), or with any other single option (I've tested 5 or 6), the closing tags are removed from the HTML text in the JSON-LD object.

# tags removed
test_ld %>% 
  read_html(options = 'HUGE') %>% 
  html_node('script[type="application/ld+json"]') %>% 
  as.character()

# removed
test_ld %>% 
  read_html(options = "NOBLANKS") %>% 
  html_node('script[type="application/ld+json"]') %>% 
  as.character()

# removed
test_ld %>% 
  read_html(options = '') %>% 
  html_node('script[type="application/ld+json"]') %>% 
  as.character()

# all return:
[1] "<script type=\"application/ld+json\">{\"@context\":\"http://schema.org\",\"@type\":\"ReproducibleExample\", \"description\":\"<p><strong>text within tagstext after closing tag\"</script>"

description is now <p><strong>text within tagstext after closing tag (the closing </strong> and </p> tags are gone).

Setting options is necessary for some of the HTML I'm parsing. Is it possible to use options and preserve properly formatted HTML from a linked data object?
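
To narrow down which options matter, a small comparison along the following lines may help. It is only a sketch: `keeps_closing_tags()` and `option_sets` are hypothetical names introduced here, and the result for each set will likely depend on the installed xml2 and libxml2 versions. It also tries "RECOVER" on its own, because every failing call above leaves it out while the default option set, c("RECOVER", "NOERROR", "NOBLANKS"), includes it.

```r
library(xml2)

test_ld <- '<script type="application/ld+json">{"@context":"http://schema.org","@type":"ReproducibleExample", "description":"<p><strong>text within tags</strong>text after closing tag</p>"'

# Hypothetical helper: TRUE if "</strong>" is still present in the script node
# after parsing `html` with the given parser options.
keeps_closing_tags <- function(html, opts) {
  doc <- read_html(html, options = opts)
  node <- xml_find_first(doc, "//script")
  grepl("</strong>", as.character(node), fixed = TRUE)
}

# Option sets to compare; "RECOVER" alone is an extra case not tested in the report.
option_sets <- list(
  default      = c("RECOVER", "NOERROR", "NOBLANKS"),
  HUGE         = "HUGE",
  NOBLANKS     = "NOBLANKS",
  RECOVER      = "RECOVER",
  HUGE_RECOVER = c("HUGE", "RECOVER")
)

# One logical per option set: were the closing tags preserved?
result <- vapply(option_sets, function(opts) keeps_closing_tags(test_ld, opts), logical(1))
print(result)
```

If the single-option "RECOVER" case preserves the tags, that would suggest the behaviour depends on whether RECOVER is set rather than on how many options are passed.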

sbha commented Oct 3, 2022

If multiple options are set, the HTML is preserved correctly:

test_ld %>% 
  read_html(options = c("RECOVER", "NOERROR", "NOBLANKS")) %>% 
  html_node('script[type="application/ld+json"]') %>% 
  as.character()

# or
test_ld %>% 
  read_html(options = c("HUGE", "RECOVER")) %>% 
  html_node('script[type="application/ld+json"]') %>% 
  as.character()


[1] "<script type=\"application/ld+json\">{\"@context\":\"http://schema.org\",\"@type\":\"ReproducibleExample\", \"description\":\"<p><strong>text within tags</strong>text after closing tag</p>\"</script>"

description is as it should be: <p><strong>text within tags</strong>text after closing tag</p>
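
Based on the two working calls above, one possible workaround is to keep RECOVER in the option set whenever another option is needed. This is a minimal sketch, assuming RECOVER is acceptable for the documents being parsed; `with_recover()` is a hypothetical helper, not part of xml2.

```r
library(xml2)

# Hypothetical helper: make sure "RECOVER" is part of the requested options.
with_recover <- function(opts = character()) {
  union("RECOVER", opts)
}

test_ld <- '<script type="application/ld+json">{"@context":"http://schema.org","@type":"ReproducibleExample", "description":"<p><strong>text within tags</strong>text after closing tag</p>"'

# Parse with HUGE plus RECOVER instead of HUGE alone.
doc <- read_html(test_ld, options = with_recover("HUGE"))
script_text <- as.character(xml_find_first(doc, "//script"))
print(script_text)
```

Here `with_recover("HUGE")` yields c("RECOVER", "HUGE"), the same set of options as the second working call above.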

sbha changed the title from "xml2 read_html removes closing tags from JSON-LD when using options" to "xml2 read_html removes closing tags from JSON-LD when using a single option" on Oct 3, 2022
hadley added the bug label (an unexpected problem or unintended behavior) on Oct 30, 2023

hadley commented Oct 30, 2023

I'm not sure there's much we can do here, but leaving open because I have some suspicions that something is going wrong with the way we pass the options from R to C.
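
One way to start checking that suspicion is to work out by hand the bitmask each option set from this thread should produce. The sketch below is plain R and does not call xml2. The flag values are taken from libxml2's xmlParserOption enum; `option_mask()` is a hypothetical helper, not the function xml2 uses internally, and it assumes the option names are simply OR-ed together before being handed to the parser.

```r
# Flag values from libxml2's xmlParserOption enum
# (only the options that appear in this thread).
flags <- c(
  RECOVER  = 1L,       # 1 << 0
  NOERROR  = 32L,      # 1 << 5
  NOBLANKS = 256L,     # 1 << 8
  HUGE     = 524288L   # 1 << 19
)

# Hypothetical helper: combine option names into a single integer bitmask.
option_mask <- function(opts) {
  stopifnot(all(opts %in% names(flags)))
  Reduce(bitwOr, flags[opts], 0L)
}

option_sets <- list(
  default      = c("RECOVER", "NOERROR", "NOBLANKS"),
  HUGE         = "HUGE",
  NOBLANKS     = "NOBLANKS",
  HUGE_RECOVER = c("HUGE", "RECOVER")
)

masks <- vapply(option_sets, option_mask, integer(1))

# Is the RECOVER bit (lowest bit) set in each mask?
has_recover <- vapply(masks, function(m) bitwAnd(m, flags[["RECOVER"]]) != 0L, logical(1))

print(data.frame(mask = masks, has_recover = has_recover))
```

Comparing these hand-computed masks with the integer that actually reaches the C layer would show whether the options are being passed as intended. Among these sets, the ones reported as working are the ones with the RECOVER bit set.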
