too large memory footprint in READER node.attributes when millions of child nodes are present #1283

@robert-v-simon

Description

Environment:

  • Windows 7 64-bit
  • Ruby 2.0.0 p353 32-bit
  • Nokogiri 1.6.1 x86-mingw32

I have a 5 GB XML file with the following structure:

<BATCH>
    <BATCH_TYPE>ALL</BATCH_TYPE>
    <BATCH_UPDATE>RELOAD</BATCH_UPDATE>
    <BATCH_ID>0815</BATCH_ID>
    <BATCH_CHANGE TYPE="UPDATE_CONTENT_A">
        <CONTENT_A>
        ...
        </CONTENT_A>
    </BATCH_CHANGE>
    <BATCH_CHANGE TYPE="UPDATE_CONTENT_B">
        <CONTENT_B OBJECT_A="abcdefg" OBJECT_B="0123456" BEGIN="000000000" END="000000500">
        ...
        </CONTENT_B>
    </BATCH_CHANGE>
</BATCH>

total count of CONTENT_A: 1,261,642
total count of CONTENT_B: 10,707,587

I use the Reader to go through these XML files and analyse the data of CONTENT_A and CONTENT_B. For this analysis I also need the values of the attributes on the nodes within CONTENT_A and CONTENT_B.

The following Reader code blows up (runs out of memory) when it reaches the first <BATCH_CHANGE> node:

xmlReader = Nokogiri::XML::Reader(fileXML)
xmlReader.each do |node|
    case node.node_type
        when 1
            @xmlTree.push(node.name)
            if node.attributes? 
                @nodeAttrib = node.attributes 
            else 
                @nodeAttrib = {} 
            end
...

The following Reader code works fine for the entire document, but it loses the keys of the attributes, which are essential for my analysis:

xmlReader = Nokogiri::XML::Reader(fileXML)
xmlReader.each do |node|
    case node.node_type
        when 1
            @xmlTree.push(node.name)
            if node.attributes? 
                g = 0
                @nodeAttrib = {}
                while g < node.attribute_count do
                    @nodeAttrib[g] = node.attribute_at(g)
                    g += 1
                end
            else 
                @nodeAttrib = {} 
            end
            ...

It appears that node.attributes also inspects subsequent (child) nodes, which makes the memory footprint grow beyond what Ruby can handle, while node.attribute_count and node.attribute_at() read only the current node's attribute data and therefore behave as expected.

Since there is no node.attribute_key_at() available, I currently exclude the troublesome node from the node.attributes lookup, which lets the Reader get through the XML file:

xmlReader = Nokogiri::XML::Reader(fileXML)
xmlReader.each do |node|
    case node.node_type
        when 1
            @xmlTree.push(node.name)
            if node.attributes? && (node.name != "BATCH_CHANGE")
                @nodeAttrib = node.attributes 
            else 
                @nodeAttrib = {} 
            end
            ...

Since the last code example works, there seems to be a problem with node.attributes when millions of child nodes are present that also carry attributes. Strangely, only node.attributes is affected, while node.attribute_count and node.attribute_at() work fine.
