Skip to content

explore: make NodeSet a subclass of Array #2184

@flavorjones

Description

@flavorjones

NodeSet today

Over the years, NodeSet has slowly approached being API-compatible with Enumerable or Array. This is good, and it validates the mental model of libxml2's xmlNodeSet as an augmented ordered set, especially given that the underlying implementation is literally a C array:

typedef struct _xmlNodeSet xmlNodeSet;
typedef xmlNodeSet *xmlNodeSetPtr;
struct _xmlNodeSet {
    int nodeNr;			/* number of nodes in the set */
    int nodeMax;		/* size of the array as allocated */
    xmlNodePtr *nodeTab;	/* array of nodes in no particular order */
    /* @@ with_ns to check whether namespace nodes should be looked at @@ */
};

However, we find ourselves at an interesting point, where NodeSet is not completely an Enumerable or Array, and there are open issues pointing this out:

Further, NodeSet has baggage, namely the associated Document object which makes simple operations harder:

or even causes bugs:

Finally, the NodeSet class is bigger and more complex than necessary (in both CRuby and JRuby), and so is a bit of a maintenance burden at this point.

NodeSet Tomorrow

As mentioned in #1952, it would be simpler if NodeSet was a subclass of Array, which would free us from using libxml2's xmlNodeSet and unify the JRuby and CRuby implementations

The memory model could be updated so that it was independent of any Document, thereby bringing it into alignment with the memory model of all the standard Ruby collection classes.

The Enumerable API would be perfectly conformed to.

The API would be extended with Searchable to support current API usage.

The API could also implement Document decorators at creation time by optionally inheriting them from an existing NodeSet or the creating Document. Decorators are a rarely-used and ill-documented feature which I suspect is buggy and would be improved by moving to a simpler implementation.

DocumentFragment tomorrow

Finally, this opens the door to a long-time roadmap item, which is to re-implement DocumentFragment on top of NodeSet, thereby avoiding use of libxml2's underlying conventions (and further unifying the JRuby and CRuby implementations). This would further be a simplifying change and would potentially allow us to fix the quirks with how XPath searches work in fragments differently than in Documents and NodeSets.

Risks

Primarily, the risks are:

The first risk exists because we'd be making an invasive change to the current codebase which has been tested thoroughly by many applications over many years. This can be mitigated by continuing to run valgrind in the CI suite, and potentially extending coverage to use ASan. We may want to consider implementing a new class entirely to allow applications the ability to "flip back to the previous implementation" at runtime if any surprising problems occur (i.e., by setting an environment variable or global constant before Nokogiri is loaded).

The second risk exists because a NodeSet may now contain nodes from many documents, and the highly-connected DOM graph may then mean that many unused objects would be prevented from being GCed. This perhaps shouldn't be surprising to anyone who's thought deeply about directed graphs.

Metadata

Metadata

Assignees

No one assigned

    Labels

    topic/memorySegfaults, memory leaks, valgrind testing, etc.

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions