explore: make NodeSet a subclass of Array #2184

## NodeSet today

Over the years, `NodeSet` has slowly approached being API-compatible with `Enumerable` or `Array`. This is good, and it validates the mental model of libxml2's `xmlNodeSet` as an augmented ordered set, especially given that the underlying implementation is literally a C array:

```c
typedef struct _xmlNodeSet xmlNodeSet;
typedef xmlNodeSet *xmlNodeSetPtr;
struct _xmlNodeSet {
    int nodeNr;			/* number of nodes in the set */
    int nodeMax;		/* size of the array as allocated */
    xmlNodePtr *nodeTab;	/* array of nodes in no particular order */
    /* @@ with_ns to check whether namespace nodes should be looked at @@ */
};
```

However, we find ourselves at an interesting point, where `NodeSet` is _not_ completely an `Enumerable` or `Array`, and there are open issues pointing this out:

- [NodeSet does not follow ruby conventions for enumerable methods · Issue #1677 · sparklemotion/nokogiri](https://github.com/sparklemotion/nokogiri/issues/1677)

Further, `NodeSet` has baggage, namely the associated `Document` object, which makes simple operations harder:

- [NodeSet#dup needs deep/recursive option · Issue #924 · sparklemotion/nokogiri](https://github.com/sparklemotion/nokogiri/issues/924)
- [NodeSet.dup does not deep copy · Issue #1678 · sparklemotion/nokogiri](https://github.com/sparklemotion/nokogiri/issues/1678)

or even causes bugs:

- [segfault in node_set.rb · Issue #1952 · sparklemotion/nokogiri](https://github.com/sparklemotion/nokogiri/issues/1952)

Finally, the `NodeSet` class is bigger and more complex than necessary (in both CRuby and JRuby), and so is a bit of a maintenance burden at this point.

## NodeSet tomorrow

As mentioned in https://github.com/sparklemotion/nokogiri/issues/1952, it would be simpler if `NodeSet` were a subclass of `Array`, which would free us from using libxml2's `xmlNodeSet` and unify the JRuby and CRuby implementations.

The memory model could be updated so that it was independent of any `Document`, thereby bringing it into alignment with the memory model of all the standard Ruby collection classes.

`NodeSet` would conform fully to the `Enumerable` API.

The API would be extended with `Searchable` to support current API usage.
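
As a rough illustration of the shape this could take, here is a minimal sketch of an `Array` subclass that mixes in a search module. `FakeNode`, `SketchSearchable`, and `SketchNodeSet` are hypothetical stand-ins invented for this example, not Nokogiri classes, and the name-matching `search` is a toy in place of real CSS or XPath queries.

```ruby
# Hypothetical stand-in for a DOM node (not Nokogiri::XML::Node).
class FakeNode
  attr_reader :name, :children

  def initialize(name, children = [])
    @name = name
    @children = children
  end

  # Toy search: every descendant whose name matches, in document order.
  def search(wanted)
    children.flat_map do |child|
      (child.name == wanted ? [child] : []) + child.search(wanted)
    end
  end
end

# Hypothetical mixin playing the role of Searchable for a collection.
module SketchSearchable
  def search(wanted)
    # flat_map returns a plain Array, so re-wrap it in the receiver's class.
    self.class.new(flat_map { |node| node.search(wanted) }.uniq)
  end
end

class SketchNodeSet < Array
  include SketchSearchable
end

leaf  = FakeNode.new("a")
inner = FakeNode.new("b", [leaf])
root  = FakeNode.new("root", [inner, FakeNode.new("a")])

set   = SketchNodeSet.new([root, inner])
# `leaf` is reachable from both `root` and `inner`, but appears only once.
found = set.search("a")
```

Because the set is an `Array`, the `Enumerable` methods come for free. One wrinkle worth keeping in mind: many inherited methods (`map`, `select`, and on Ruby 3.0+ also `take`, `uniq`, and slicing) return plain `Array` instances, so any method that should return a `NodeSet` would need to re-wrap its result, as `search` does in this sketch.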

The API could also implement `Document` decorators at creation time by optionally inheriting them from an existing `NodeSet` or the creating `Document`. Decorators are a rarely-used and ill-documented feature which I suspect is buggy and would be improved by moving to a simpler implementation.
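
One way to picture decorator inheritance at creation time is sketched below. This is only an assumption about how it might look: `DecoratedSet`, the `inherit_from:` keyword, the `decorate` method, and the `Shouting` module are all hypothetical names, and strings stand in for nodes.

```ruby
# Hypothetical decorator: an ordinary module mixed into individual sets.
module Shouting
  def names
    map { |node| node.to_s.upcase }
  end
end

class DecoratedSet < Array
  attr_reader :decorators

  # `inherit_from` is a hypothetical keyword: any object responding to
  # #decorators (an existing set, or a document) can donate its decorators.
  def initialize(nodes = [], inherit_from: nil)
    super(nodes)
    @decorators = inherit_from ? inherit_from.decorators.dup : []
    @decorators.each { |mod| extend(mod) }
  end

  # Add a decorator to this set only.
  def decorate(mod)
    @decorators << mod
    extend(mod)
    self
  end
end

parent = DecoratedSet.new(%w[a b]).decorate(Shouting)
child  = DecoratedSet.new(%w[c], inherit_from: parent) # picks up Shouting
plain  = DecoratedSet.new(%w[d])                       # no decorators
```

Keeping the decorator list on the set itself means a new set only needs a donor object at construction time and holds no reference to it afterwards.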

## DocumentFragment tomorrow

Finally, this opens the door to a long-time roadmap item, which is to re-implement `DocumentFragment` on top of `NodeSet`, thereby avoiding use of libxml2's underlying conventions (and further unifying the JRuby and CRuby implementations). This would further be a simplifying change and would potentially allow us to fix the quirks with how XPath searches work in fragments differently than in `Document`s and `NodeSet`s.


## Risks

Primarily, the risks are:

- GC implementation correctness
- Potentially unexpected memory usage, enabled by the fix to #1952

The first risk exists because we'd be making an invasive change to the current codebase, which has been tested thoroughly by many applications over many years. This can be mitigated by continuing to run `valgrind` in the CI suite, and potentially extending coverage to use `ASan`. We may want to consider implementing a new class entirely, so that applications can "flip back to the previous implementation" at runtime if any surprising problems occur (e.g., by setting an environment variable or global constant before Nokogiri is loaded).
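
The "flip back" mechanism could be as small as a load-time switch. The sketch below assumes an environment variable; the name `SKETCH_LEGACY_NODE_SET` and the classes `LegacyNodeSet` and `ArrayNodeSet` are invented for illustration and do not exist in Nokogiri.

```ruby
# Hypothetical stand-in for the existing implementation.
class LegacyNodeSet
  include Enumerable

  def initialize(nodes = [])
    @nodes = nodes.dup
  end

  def each(&block)
    @nodes.each(&block)
  end
end

# Hypothetical stand-in for the new Array-backed implementation.
class ArrayNodeSet < Array
end

# Hypothetical switch: the variable name is invented for this sketch.
def node_set_class(env = ENV)
  env["SKETCH_LEGACY_NODE_SET"] == "1" ? LegacyNodeSet : ArrayNodeSet
end

# Resolved once at load time, so the variable must be set before loading.
NodeSetImpl = node_set_class
```

Resolving the choice once, at load time, keeps the two implementations from being mixed within a single process.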

The second risk exists because a `NodeSet` may now contain nodes from many documents, and the highly-connected DOM graph may then mean that many unused objects would be prevented from being GCed. This perhaps shouldn't be surprising to anyone who's thought deeply about directed graphs.
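
To make the retention concern concrete, here is a toy model in which every node references its owning document and every document references all of its nodes. `SketchDocument` and `SketchNode` are hypothetical stand-ins, and the node counts are arbitrary.

```ruby
# Hypothetical stand-ins: each document holds all of its nodes,
# and each node holds a reference back to its document.
class SketchDocument
  attr_reader :nodes

  def initialize(node_count)
    @nodes = Array.new(node_count) { SketchNode.new(self) }
  end
end

class SketchNode
  attr_reader :document

  def initialize(document)
    @document = document
  end
end

docs = [1_000, 2_000, 3_000].map { |count| SketchDocument.new(count) }

# A mixed set holding a single node from each document.
mixed = docs.map { |doc| doc.nodes.first }

# Every node reachable from the set stays alive for as long as the set does.
retained = mixed.map(&:document).uniq.sum { |doc| doc.nodes.size }
```

In this model, the three nodes in `mixed` keep all 6,000 nodes reachable, because each one pins its entire document.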
