<ol> tag is dropped from an ordered list #913

Closed
beala opened this issue Oct 11, 2024 · 0 comments · Fixed by #914

beala commented Oct 11, 2024

Sometimes the <ol> element of an ordered list is dropped from the content output: Readability rewrites the <ol> as a <div>, which leaves the <li> children without a list parent.

Here is a repro:

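import { JSDOM } from 'jsdom'
import { Readability } from '@mozilla/readability'

test('JSDOM and Readability', () => {
    // Minimal document: a single ordered list with one item.
    const html = `
<html>
    <body>
        <ol>
            <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
        </ol>
    </body>
</html>`
    const dom: JSDOM = new JSDOM(html)
    // debug: true produces the log output shown further down.
    const extracted = new Readability(dom.window.document, {
        debug: true
    }).parse()

    // Fails: the extracted content wraps the <li> in a <div> instead of an <ol>.
    expect(extracted?.content).toContain("<ol>")
})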

Here is the failure:

  ● JSDOM and Readability

    expect(received).toContain(expected) // indexOf

    Expected substring: "<ol>"
    Received string:    "<div id=\"readability-page-1\" class=\"page\"><div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div></div>"

      82 |     }).parse()
      83 |     
    > 84 |     expect(extracted?.content).toContain("<ol>")
         |                                ^
      85 | })

      at Object.<anonymous> (__tests__/email/emailParser.test.ts:84:32)
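
The assertion in the repro only checks for the literal `<ol>` substring. A slightly broader check, sketched here, looks for any `<li>` whose enclosing element is not a list container. `hasOrphanedListItem` is a hypothetical helper, not part of Readability or JSDOM, and it relies on a naive regex tag scanner that assumes well-formed markup with no `>` inside attribute values.

```typescript
// Hypothetical helper for illustration only; not part of any library.
const LIST_CONTAINERS: readonly string[] = ["ol", "ul", "menu"];
const VOID_TAGS: readonly string[] = ["br", "hr", "img", "input", "meta", "link"];

function hasOrphanedListItem(html: string): boolean {
  // Matches opening, closing, and self-closing tags; captures the tag name.
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(\/?)>/g;
  const openTags: string[] = [];
  let match: RegExpExecArray | null;

  while ((match = tagPattern.exec(html)) !== null) {
    const isClosing = match[1] === "/";
    const name = match[2].toLowerCase();
    const isSelfClosing = match[3] === "/";

    if (isClosing) {
      // Pop back to the matching open tag, if there is one.
      const index = openTags.lastIndexOf(name);
      if (index !== -1) {
        openTags.length = index;
      }
      continue;
    }

    if (name === "li") {
      const parent = openTags[openTags.length - 1];
      if (parent === undefined || LIST_CONTAINERS.indexOf(parent) === -1) {
        return true; // <li> sits directly inside a non-list element
      }
    }

    if (!isSelfClosing && VOID_TAGS.indexOf(name) === -1) {
      openTags.push(name);
    }
  }
  return false;
}
```

In the Jest test, this could be used as `expect(hasOrphanedListItem(extracted?.content ?? "")).toBe(false)`, which would also catch a dropped `<ul>`.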

This is using the following dependency versions:

"dependencies": {
    "jsdom": "^24.0.0",
    "@mozilla/readability": "^0.5.0"
}

Here is the debug output:

  console.log
    Reader: (Readability) **** grabArticle ****

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Starting grabArticle loop

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Candidate: <li > with score 1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Candidate: <ol > with score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Candidate: <body > with score 0.6666666666666666

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Looking at sibling node: <ol > with score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Sibling has score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Appending node: <ol >

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Altering sibling: <ol > to div.

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) _setNodeTag <ol > DIV

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content pre-prep: <div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Cleaning Conditionally <div >

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content post-prep: <div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content after paging: <div id="readability-page-1" class="page"><div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div></div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Starting grabArticle loop

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Candidate: <li > with score 1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Candidate: <ol > with score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Looking at sibling node: <ol > with score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Sibling has score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Appending node: <ol >

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Altering sibling: <ol > to div.

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) _setNodeTag <ol > DIV

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content pre-prep: <div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Cleaning Conditionally <div >

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content post-prep: <div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content after paging: <div id="readability-page-1" class="page"><div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div></div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Starting grabArticle loop

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Candidate: <li > with score 1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Candidate: <ol > with score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Looking at sibling node: <ol > with score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Sibling has score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Appending node: <ol >

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Altering sibling: <ol > to div.

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) _setNodeTag <ol > DIV

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content pre-prep: <div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Cleaning Conditionally <div >

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content post-prep: <div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content after paging: <div id="readability-page-1" class="page"><div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div></div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Starting grabArticle loop

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Candidate: <li > with score 1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Candidate: <ol > with score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Looking at sibling node: <ol > with score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Sibling has score -1

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Appending node: <ol >

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Altering sibling: <ol > to div.

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) _setNodeTag <ol > DIV

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content pre-prep: <div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content post-prep: <div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Article content after paging: <div id="readability-page-1" class="page"><div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div></div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)

  console.log
    Reader: (Readability) Grabbed: <div id="readability-page-1" class="page"><div>
                <li><p>AI hasn’t meaningfully changed anything in cybersecurity so far. Deep fake phishing is still rare, L</p></li>
            </div></div>

      at Readability.log (node_modules/@mozilla/readability/Readability.js:84:21)