Description
Environment:
- Windows 7 64-bit
- Ruby 2.0.0 p353 32-bit
- Nokogiri 1.6.1 x86-mingw32
I have a 5 GB XML file with the following structure:
<BATCH>
  <BATCH_TYPE>ALL</BATCH_TYPE>
  <BATCH_UPDATE>RELOAD</BATCH_UPDATE>
  <BATCH_ID>0815</BATCH_ID>
  <BATCH_CHANGE TYPE="UPDATE_CONTENT_A">
    <CONTENT_A>
      ...
    </CONTENT_A>
  </BATCH_CHANGE>
  <BATCH_CHANGE TYPE="UPDATE_CONTENT_B">
    <CONTENT_B OBJECT_A="abcdefg" OBJECT_B="0123456" BEGIN="000000000" END="000000500">
      ...
    </CONTENT_B>
  </BATCH_CHANGE>
</BATCH>
total count of CONTENT_A: 1,261,642
total count of CONTENT_B: 10,707,587
I use the Reader to go through these XML files and analyse the data of CONTENT_A and CONTENT_B. The analysis also needs the values stored in attributes on the nodes within CONTENT_A and CONTENT_B.
The following reader code runs out of memory when it reaches the first <BATCH_CHANGE> node:
xmlReader = Nokogiri::XML::Reader(fileXML)
xmlReader.each do |node|
  case node.node_type
  when 1
    @xmlTree.push(node.name)
    if node.attributes?
      @nodeAttrib = node.attributes
    else
      @nodeAttrib = {}
    end
    ...
The following reader code works fine for the entire document but does not keep the attribute keys, which are essential for my analysis:
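xmlReader = Nokogiri::XML::Reader(fileXML)
xmlReader.each do |node|
  case node.node_type
  when 1 # element node
    @xmlTree.push(node.name)
    if node.attributes?
      g = 0
      while g < node.attribute_count do
        # attribute_at returns only the value, so the key is lost here
        @nodeAttrib[g] = node.attribute_at(g)
        g += 1 # advance to the next attribute; without this the loop never ends
      end
    else
      @nodeAttrib = {}
    end
    ...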
It appears that node.attributes also looks at the subsequent nodes, which makes the memory footprint grow beyond the limit Ruby can handle, whereas node.attribute_count and node.attribute_at() read only the attribute data of the current node and therefore behave as expected.
As there is no node.attribute_key_at() available, I currently exclude the node that causes the trouble from the node.attributes lookup. With that exclusion the reader gets through the whole XML file:
xmlReader = Nokogiri::XML::Reader(fileXML)
xmlReader.each do |node|
  case node.node_type
  when 1
    @xmlTree.push(node.name)
    if node.attributes? && (node.name != "BATCH_CHANGE")
      @nodeAttrib = node.attributes
    else
      @nodeAttrib = {}
    end
    ...
Since the last code example works, there seems to be a problem with node.attributes when a node has millions of child nodes that also carry attributes. Strangely, only node.attributes is affected; node.attribute_count and node.attribute_at() work fine.