print(xml_document) does not scale with large documents #441

Open
@zeehio

After loading a fairly large XML file (~500 MB), calling print() on the document takes a long time and cannot be interrupted.

However, printing the child nodes individually is fast.

I believe the commented-out print() call in the reprex below eventually reaches show_nodes, which calls as.character in the following line. That call takes a long time and blocks the interpreter.

contents <- vapply(x, as.character, FUN.VALUE = character(1L))
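
To illustrate why that line could be the bottleneck, here is a minimal sketch on a small synthetic document (the element names `root`, `big-list` and `item`, and the size `n_items`, are made up for illustration and are not taken from the Cellosaurus file). It assumes that `format()` on a node only builds the opening tag while `as.character()` serializes the whole subtree, which is consistent with the output shown in the reprex further down.

```r
library(xml2)

# Synthetic stand-in for a document with one very large child
# (hypothetical names, not the real Cellosaurus structure).
n_items <- 20000
xml_text <- paste0(
  "<root><big-list>",
  paste0("<item id=\"", seq_len(n_items), "\">value</item>", collapse = ""),
  "</big-list></root>"
)
doc <- read_xml(xml_text)
big_child <- xml_children(doc)[[1]]

# format() builds only the opening tag, so its cost should not depend
# on how many descendants the node has.
tag_only <- format(big_child)

# as.character() serializes the entire subtree into a single string,
# so its cost grows with the size of the subtree.
full_text <- as.character(big_child)

# Compare the size of the two strings.
nchar(tag_only)
nchar(full_text)
```

If this assumption holds, a preview that is truncated to the console width anyway still pays for serializing the full subtree of each node it shows.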

library(xml2)
# Download the file (about 490 MB) if it is not already present:
url <- "https://ftp.expasy.org/databases/cellosaurus/cellosaurus.xml"
if (!file.exists("cellosaurus.xml")) {
  download.file(url, "cellosaurus.xml")
}
# Read XML:
cellosaurus_xml <- xml2::read_xml("cellosaurus.xml")

# My print (a fast version, closer to what I would expect)

cat(format(cellosaurus_xml))
#> <Cellosaurus>
# xml_children() is exported; show_nodes() is internal, hence the `:::`
children <- xml2::xml_children(cellosaurus_xml)
for (child in children) {
  cat(format(child), "\n")
  xml2:::show_nodes(xml2::xml_children(child))
}
#> <header> 
#> [1] <terminology-name>Cellosaurus</terminology-name>
#> [2] <description>Cellosaurus: a controlled vocabulary of cell lines</descript ...
#> [3] <release version="48.0" updated="2024-01-30" nb-cell-lines="152231" nb-pu ...
#> [4] <terminology-list>\n  <terminology name="NCBI-Taxonomy" source="National  ...
#> <cell-line-list> 
#>  [1] <cell-line category="Hybridoma" created="2021-09-23" last-updated="2024- ...
#>  [2] <cell-line category="Hybridoma" created="2021-09-23" last-updated="2024- ...
#>  [3] <cell-line category="Transformed cell line" created="2012-10-22" last-up ...
#>  [4] <cell-line category="Hybridoma" created="2017-08-22" last-updated="2023- ...
#>  [5] <cell-line category="Cancer cell line" created="2017-05-15" last-updated ...
#>  [6] <cell-line category="Hybridoma" created="2012-06-06" last-updated="2023- ...
#>  [7] <cell-line category="Hybridoma" created="2014-07-17" last-updated="2023- ...
#>  [8] <cell-line category="Hybridoma" created="2022-12-15" last-updated="2023- ...
#>  [9] <cell-line category="Transformed cell line" created="2012-10-22" last-up ...
#> [10] <cell-line category="Hybridoma" created="2013-02-11" last-updated="2023- ...
#> [11] <cell-line category="Cancer cell line" created="2018-05-14" last-updated ...
#> [12] <cell-line category="Finite cell line" created="2012-04-04" last-updated ...
#> [13] <cell-line category="Finite cell line" created="2012-04-04" last-updated ...
#> [14] <cell-line category="Finite cell line" created="2013-11-05" last-updated ...
#> [15] <cell-line category="Finite cell line" created="2012-04-04" last-updated ...
#> [16] <cell-line category="Cancer cell line" created="2012-04-04" last-updated ...
#> [17] <cell-line category="Cancer cell line" created="2012-04-04" last-updated ...
#> [18] <cell-line category="Spontaneously immortalized cell line" created="2019 ...
#> [19] <cell-line category="Transformed cell line" created="2021-12-16" last-up ...
#> [20] <cell-line category="Cancer cell line" created="2024-01-30" last-updated ...
#> ...
#> <publication-list> 
#>  [1] <publication date="2005" type="article" journal-name="AAPS J." volume="7 ...
#>  [2] <publication date="2011" type="article" journal-name="AAPS J." volume="1 ...
#>  [3] <publication date="2011" type="article" journal-name="AAPS J." volume="1 ...
#>  [4] <publication date="2016" type="article" journal-name="AAPS J." volume="1 ...
#>  [5] <publication date="2000" type="article" journal-name="AAPS PharmSci" vol ...
#>  [6] <publication date="2004" type="article" journal-name="AAPS PharmSci" vol ...
#>  [7] <publication date="2008" type="article" journal-name="ACS Chem. Biol." v ...
#>  [8] <publication date="2014" type="article" journal-name="ACS Chem. Biol." v ...
#>  [9] <publication date="2018" type="article" journal-name="ACS Infect. Dis."  ...
#> [10] <publication date="2023" type="article" journal-name="ACS Materials Au"  ...
#> [11] <publication date="2022" type="article" journal-name="ACS Omega" volume= ...
#> [12] <publication date="2017" type="article" journal-name="ACS Synth. Biol."  ...
#> [13] <publication date="2001" type="article" journal-name="Acta Astronaut." v ...
#> [14] <publication date="2013" type="article" journal-name="Acta Astronaut." v ...
#> [15] <publication date="2005" type="article" journal-name="Acta Biochim. Biop ...
#> [16] <publication date="2004" type="article" journal-name="Acta Biochim. Pol. ...
#> [17] <publication date="1988" type="article" journal-name="Acta Biol. Hung."  ...
#> [18] <publication date="2015" type="article" journal-name="Acta Biol. Hung."  ...
#> [19] <publication date="2016" type="article" journal-name="Acta Crystallogr.  ...
#> [20] <publication date="2001" type="article" journal-name="Acta Cytol." volum ...
#> ...
#> <copyright>

# This is extremely slow, and non-interruptible:
# print(cellosaurus_xml)

Created on 2024-03-12 with reprex v2.1.0

Is this expected? Or should the print() function scale better with larger XML files?
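
In the meantime, a possible workaround can be sketched as a small helper that only uses exported xml2 functions. `xml_overview()` is a hypothetical name, not part of xml2, and the truncation at `max_n` children simply mirrors the 20-row cutoff visible in the output above. It shows opening tags only, so it is less detailed than the current print() output.

```r
library(xml2)

# Hypothetical helper (not part of xml2): returns the root tag, one line
# per top-level child, and up to `max_n` opening tags of that child's
# children. No subtree is serialized in full.
xml_overview <- function(doc, max_n = 20) {
  lines <- format(doc)
  for (child in xml_children(doc)) {
    grandchildren <- xml_children(child)
    shown <- head(seq_along(grandchildren), max_n)
    tags <- vapply(
      shown,
      function(i) format(grandchildren[[i]]),
      character(1L)
    )
    lines <- c(lines, format(child), sprintf("  [%d] %s", shown, tags))
    # Mark truncated listings the same way the print output does.
    if (length(grandchildren) > max_n) {
      lines <- c(lines, "  ...")
    }
  }
  lines
}

# Usage on a tiny inline document:
small_doc <- read_xml("<root><a><x/><y/><z/></a><b/></root>")
overview <- xml_overview(small_doc, max_n = 2)
cat(overview, sep = "\n")
```

On the large file this would be called as `cat(xml_overview(cellosaurus_xml), sep = "\n")`.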

Labels: bug (an unexpected problem or unintended behavior)
