<ol> tag is dropped from an ordered list #913

Closed
@beala

Description

Sometimes the <ol> element of an ordered list is dropped from the content output: the <li> children survive, but their <ol> parent is replaced by a <div>.

Here is a repro:

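import { JSDOM } from 'jsdom'
import { Readability } from '@mozilla/readability'

test('JSDOM and Readability', () => {
    // Minimal document: a single-item ordered list directly under <body>.
    const html = `
<html>
    <body>
        <ol>
            <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
        </ol>
    </body>
</html>`
    const dom: JSDOM = new JSDOM(html)
    const extracted = new Readability(dom.window.document, {
        debug: true
    }).parse()

    // Fails: the extracted content has no <ol> wrapper around the <li>.
    expect(extracted?.content).toContain("<ol>")
})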

Here is the failure:

  ● JSDOM and Readability

    expect(received).toContain(expected) // indexOf

    Expected substring: "<ol>"
    Received string:    "<div id=\"readability-page-1\" class=\"page\"><div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div></div>"

      82 |     }).parse()
      83 |     
    > 84 |     expect(extracted?.content).toContain("<ol>")
         |                                ^
      85 | })

      at Object.<anonymous> (__tests__/email/emailParser.test.ts:84:32)
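
The substring check above only catches a missing <ol>. A stricter check, sketched below, is that every <li> in the extracted content still has a list parent. `everyListItemHasListParent` and `VOID_TAGS` are hypothetical names, not part of Readability or jsdom, and the scanner assumes simple, well-formed markup (no comments, scripts, or `>` inside attribute values).

```typescript
// Hypothetical helper (not part of Readability or jsdom): reports whether
// every <li> in an HTML fragment sits directly inside an <ol> or <ul>.
// Assumption: the markup is simple and well-formed, so a regex tag scanner
// plus a stack of open elements is good enough for a test assertion.
const VOID_TAGS = new Set(["br", "hr", "img", "input", "meta", "link"]);

function everyListItemHasListParent(html: string): boolean {
  const openTags: string[] = [];
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(\/?)>/g;
  let match: RegExpExecArray | null;

  while ((match = tagPattern.exec(html)) !== null) {
    const isClosing = match[1] === "/";
    const name = match[2].toLowerCase();
    const isSelfClosing = match[3] === "/" || VOID_TAGS.has(name);

    if (isClosing) {
      // Pop back to the matching open tag, if there is one.
      const index = openTags.lastIndexOf(name);
      if (index !== -1) openTags.length = index;
      continue;
    }

    if (name === "li") {
      const parent = openTags[openTags.length - 1];
      if (parent !== "ol" && parent !== "ul") return false;
    }

    if (!isSelfClosing) openTags.push(name);
  }
  return true;
}
```

In the repro this could replace the `toContain("<ol>")` assertion, for example `expect(everyListItemHasListParent(extracted?.content ?? "")).toBe(true)`. In a jsdom-based test, querying the parsed content for `li` elements and checking `parentElement` would be the more robust option.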

This is using

"dependencies": {
    "jsdom": "^24.0.0",
    "@mozilla/readability": "^0.5.0"
}

Here is the debug output:

  console.log
    Reader: (Readability) **** grabArticle ****

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Starting grabArticle loop

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Candidate: <li > with score 1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Candidate: <ol > with score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Candidate: <body > with score 0.6666666666666666

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Looking at sibling node: <ol > with score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Sibling has score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Appending node: <ol >

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Altering sibling: <ol > to div.

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) _setNodeTag <ol > DIV

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content pre-prep: <div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Cleaning Conditionally <div >

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content post-prep: <div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content after paging: <div id="readability-page-1" class="page"><div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div></div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Starting grabArticle loop

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Candidate: <li > with score 1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Candidate: <ol > with score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Looking at sibling node: <ol > with score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Sibling has score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Appending node: <ol >

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Altering sibling: <ol > to div.

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) _setNodeTag <ol > DIV

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content pre-prep: <div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Cleaning Conditionally <div >

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content post-prep: <div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content after paging: <div id="readability-page-1" class="page"><div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div></div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Starting grabArticle loop

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Candidate: <li > with score 1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Candidate: <ol > with score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Looking at sibling node: <ol > with score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Sibling has score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Appending node: <ol >

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Altering sibling: <ol > to div.

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) _setNodeTag <ol > DIV

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content pre-prep: <div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Cleaning Conditionally <div >

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content post-prep: <div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content after paging: <div id="readability-page-1" class="page"><div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div></div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Starting grabArticle loop

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Candidate: <li > with score 1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Candidate: <ol > with score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Looking at sibling node: <ol > with score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Sibling has score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Appending node: <ol >

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Altering sibling: <ol > to div.

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) _setNodeTag <ol > DIV

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content pre-prep: <div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content post-prep: <div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content after paging: <div id="readability-page-1" class="page"><div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div></div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Grabbed: <div id="readability-page-1" class="page"><div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div></div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)
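
The log lines "Altering sibling: <ol > to div." and "_setNodeTag <ol > DIV" suggest that the <ol> is renamed when it is appended to the article content. Below is a minimal model of that decision as a pure function. It is a sketch under assumptions, not Readability's code: `tagAfterAppend`, `KEEP_TAG_AS_IS`, and `KEEP_WITH_LISTS` are hypothetical names, and the tag sets are illustrative.

```typescript
// Assumed set of tags that are appended without being renamed.
// Illustrative only; not copied from Readability.
const KEEP_TAG_AS_IS = new Set<string>(["DIV", "ARTICLE", "SECTION", "P"]);

// One possible direction for a fix (an assumption, not a confirmed patch):
// also keep list containers, so their <li> children stay valid.
const KEEP_WITH_LISTS = new Set<string>(["DIV", "ARTICLE", "SECTION", "P", "OL", "UL"]);

// Hypothetical model of the "Altering sibling ... to div." step seen in the log:
// any appended node whose tag is not in the keep set is renamed to DIV.
function tagAfterAppend(tagName: string, keep: Set<string> = KEEP_TAG_AS_IS): string {
  const upper = tagName.toUpperCase();
  return keep.has(upper) ? upper : "DIV";
}

console.log(tagAfterAppend("ol"));                  // DIV
console.log(tagAfterAppend("ol", KEEP_WITH_LISTS)); // OL
```

Under this model the <li> ends up inside a <div>, which matches the "Article content pre-prep" output in every pass of the log.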
