Description
When I load a fairly large XML file (~500 MB) and print() the document, printing takes a long time and cannot be interrupted. Printing the child nodes individually, however, is fast.
I believe print() eventually calls show_nodes(), which calls as.character() (line 73 in ab73051); that call takes a long time and blocks the interpreter. The reprex below shows the difference.
library(xml2)
# Download 490 MB:
if (!file.exists("cellosaurus.xml")) {
  download.file(
    "https://ftp.expasy.org/databases/cellosaurus/cellosaurus.xml",
    "cellosaurus.xml"
  )
}
# Read XML:
cellosaurus_xml <- xml2::read_xml("cellosaurus.xml")
# My print (a fast version, closer to what I would expect)
cat(format(cellosaurus_xml))
#> <Cellosaurus>
# xml_children() is exported, so `::` is enough; show_nodes() is internal and needs `:::`
children <- xml2::xml_children(cellosaurus_xml)
for (child in children) {
  cat(format(child), "\n")
  xml2:::show_nodes(xml2::xml_children(child))
}
#> <header>
#> [1] <terminology-name>Cellosaurus</terminology-name>
#> [2] <description>Cellosaurus: a controlled vocabulary of cell lines</descript ...
#> [3] <release version="48.0" updated="2024-01-30" nb-cell-lines="152231" nb-pu ...
#> [4] <terminology-list>\n <terminology name="NCBI-Taxonomy" source="National ...
#> <cell-line-list>
#> [1] <cell-line category="Hybridoma" created="2021-09-23" last-updated="2024- ...
#> [2] <cell-line category="Hybridoma" created="2021-09-23" last-updated="2024- ...
#> [3] <cell-line category="Transformed cell line" created="2012-10-22" last-up ...
#> [4] <cell-line category="Hybridoma" created="2017-08-22" last-updated="2023- ...
#> [5] <cell-line category="Cancer cell line" created="2017-05-15" last-updated ...
#> [6] <cell-line category="Hybridoma" created="2012-06-06" last-updated="2023- ...
#> [7] <cell-line category="Hybridoma" created="2014-07-17" last-updated="2023- ...
#> [8] <cell-line category="Hybridoma" created="2022-12-15" last-updated="2023- ...
#> [9] <cell-line category="Transformed cell line" created="2012-10-22" last-up ...
#> [10] <cell-line category="Hybridoma" created="2013-02-11" last-updated="2023- ...
#> [11] <cell-line category="Cancer cell line" created="2018-05-14" last-updated ...
#> [12] <cell-line category="Finite cell line" created="2012-04-04" last-updated ...
#> [13] <cell-line category="Finite cell line" created="2012-04-04" last-updated ...
#> [14] <cell-line category="Finite cell line" created="2013-11-05" last-updated ...
#> [15] <cell-line category="Finite cell line" created="2012-04-04" last-updated ...
#> [16] <cell-line category="Cancer cell line" created="2012-04-04" last-updated ...
#> [17] <cell-line category="Cancer cell line" created="2012-04-04" last-updated ...
#> [18] <cell-line category="Spontaneously immortalized cell line" created="2019 ...
#> [19] <cell-line category="Transformed cell line" created="2021-12-16" last-up ...
#> [20] <cell-line category="Cancer cell line" created="2024-01-30" last-updated ...
#> ...
#> <publication-list>
#> [1] <publication date="2005" type="article" journal-name="AAPS J." volume="7 ...
#> [2] <publication date="2011" type="article" journal-name="AAPS J." volume="1 ...
#> [3] <publication date="2011" type="article" journal-name="AAPS J." volume="1 ...
#> [4] <publication date="2016" type="article" journal-name="AAPS J." volume="1 ...
#> [5] <publication date="2000" type="article" journal-name="AAPS PharmSci" vol ...
#> [6] <publication date="2004" type="article" journal-name="AAPS PharmSci" vol ...
#> [7] <publication date="2008" type="article" journal-name="ACS Chem. Biol." v ...
#> [8] <publication date="2014" type="article" journal-name="ACS Chem. Biol." v ...
#> [9] <publication date="2018" type="article" journal-name="ACS Infect. Dis." ...
#> [10] <publication date="2023" type="article" journal-name="ACS Materials Au" ...
#> [11] <publication date="2022" type="article" journal-name="ACS Omega" volume= ...
#> [12] <publication date="2017" type="article" journal-name="ACS Synth. Biol." ...
#> [13] <publication date="2001" type="article" journal-name="Acta Astronaut." v ...
#> [14] <publication date="2013" type="article" journal-name="Acta Astronaut." v ...
#> [15] <publication date="2005" type="article" journal-name="Acta Biochim. Biop ...
#> [16] <publication date="2004" type="article" journal-name="Acta Biochim. Pol. ...
#> [17] <publication date="1988" type="article" journal-name="Acta Biol. Hung." ...
#> [18] <publication date="2015" type="article" journal-name="Acta Biol. Hung." ...
#> [19] <publication date="2016" type="article" journal-name="Acta Crystallogr. ...
#> [20] <publication date="2001" type="article" journal-name="Acta Cytol." volum ...
#> ...
#> <copyright>
# This is extremely slow, and non-interruptible:
# print(cellosaurus_xml)
Created on 2024-03-12 with reprex v2.1.0
Is this expected, or should print() scale better with larger XML files?
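To check the as.character() hypothesis without downloading the full file, something like the sketch below could be used. It builds a small synthetic document (the `<root>`, `<list>` and `<item>` element names and the size `n` are made up for illustration, not taken from Cellosaurus) and compares serializing the whole document with serializing only the first 20 nodes. It only illustrates that the cost of as.character() on the root grows with document size; it does not confirm what print() does internally.

```r
library(xml2)

# Hypothetical stand-in for cellosaurus.xml: one <list> holding n <item> nodes.
n <- 20000
items <- sprintf("<item id=\"%d\">value %d</item>", seq_len(n), seq_len(n))
doc <- read_xml(paste0("<root><list>", paste(items, collapse = ""), "</list></root>"))

# Serializing the document yields a single string that grows with n.
t_root <- system.time(root_chr <- as.character(doc))

# Serializing only the first 20 <item> nodes touches a small part of the tree.
first_items <- xml_children(xml_children(doc)[[1]])[1:20]
t_head <- system.time(head_chr <- as.character(first_items))

nchar(root_chr)        # total size of the serialized document
sum(nchar(head_chr))   # size of the 20 serialized nodes
t_root[["elapsed"]]
t_head[["elapsed"]]
```

Increasing `n` should make the first timing grow while the second stays roughly flat, which would be consistent with the slowdown coming from serializing far more than the few lines print() actually displays.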