read_html() doesn't report parsing failure on very very long lines #440

Open
@hadley

Description

library(xml2)

path <- tempfile()

long <- paste0("start", strrep("x", 12e6), "end")
nchar(long)
#> [1] 12000008

cat(
  "<html><body>\n<script type=\"application/json\">",
  long,
  "</script>\n</body></html>\n",
  file = path,
  sep = ""
)

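# read_html() returns silently: no warning or error is raised
html <- read_html(path)
# read_xml() reports the problem on the same file
xml <- read_xml(path)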
#> Warning in read_xml.character(path): xmlSAX2Characters: huge text nod [2]
#> Error in read_xml.character(path): Extra content at the end of the document [5]

Created on 2024-02-27 with reprex v2.1.0

From tidyverse/rvest#399
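
Until `read_html()` surfaces the failure itself, one way to notice the problem is to compare the length of the parsed `<script>` text with the length of what was written. The sketch below is illustrative only: `script_is_complete()` is a hypothetical helper, not part of xml2, and adding `"HUGE"` to `options` is an assumption about a possible workaround (the option is listed in the xml2 documentation, but whether it lifts the limit here depends on the installed libxml2).

```r
library(xml2)

# Hypothetical helper (not part of xml2): TRUE when the first <script> node
# in `doc` holds exactly `expected_nchar` characters of text.
script_is_complete <- function(doc, expected_nchar) {
  node <- xml_find_first(doc, "//script")
  if (inherits(node, "xml_missing")) {
    return(FALSE)
  }
  isTRUE(nchar(xml_text(node)) == expected_nchar)
}

# Same document as in the reprex above
path <- tempfile(fileext = ".html")
long <- paste0("start", strrep("x", 12e6), "end")
cat(
  "<html><body>\n<script type=\"application/json\">",
  long,
  "</script>\n</body></html>\n",
  file = path,
  sep = ""
)

# Default options, as in the reprex
default_doc <- read_html(path)

# Assumption: "HUGE" asks libxml2 to relax its hard-coded size limits
huge_doc <- read_html(
  path,
  options = c("RECOVER", "NOERROR", "NOBLANKS", "HUGE")
)

# Compare the parsed script length with what was written
script_is_complete(default_doc, nchar(long))
script_is_complete(huge_doc, nchar(long))
```

If the first check is `FALSE` while `read_html()` raised nothing, the document was silently truncated, which is the behaviour this issue describes.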

Labels: bug (an unexpected problem or unintended behavior)
