Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extracting author from DOM could be more surgical if we detected itemprop="name" #935

Closed
danielnixon opened this issue Dec 27, 2024 · 3 comments · Fixed by #943
Closed
Labels
metadata Issues with the metadata generated by readability

Comments

@danielnixon
Copy link
Contributor

Take https://lithub.com/why-are-writers-particularly-drawn-to-tarot/ as an example.

There are no meta tags or json-ld from which to extract the author, so Readability looks in the DOM. It matches this snippet:

<div class="author_info" itemprop="author" itemscope="" itemtype="http://schema.org/Person">
  <div class="author_name">By&nbsp;<a href="https://lithub.com/author/rochelle/" itemprop="url"><span itemprop="name">Rochelle&nbsp;Spencer</span></a></div>
  <div class="author_hr"><hr></div>
  <div class="publish_date">August 27, 2019</div>
</div>

It greedily takes the whole text contents, resulting in a byline of

By Rochelle Spencer August 27, 2019

If we were a bit more surgical and noticed that the [itemprop="author"] contained an [itemprop="name"] and then only took that name, we'd get an byline of just

Rochelle Spencer

Here's a patch. It uses querySelector -- which I realise is rarely used in Readability -- but I think it's well enough established now that we can rely on it being available.

       var itemprop = node.getAttribute("itemprop");
     }
 
-    if ((rel === "author" || (itemprop && itemprop.indexOf("author") !== -1) || this.REGEXPS.byline.test(matchString)) && this._isValidByline(node.textContent)) {
+    if (itemprop && itemprop.indexOf("author") !== -1) {
+      const nameItem = node.querySelector && node.querySelector('[itemprop="name"]');
+      if (nameItem && this._isValidByline(nameItem.textContent)) {
+        this._articleByline = nameItem.textContent.trim();
+        return true;
+      } else if (this._isValidByline(node.textContent)) {
+        this._articleByline = node.textContent.trim();
+        return true;
+      }
+    }
+
+    if ((rel === "author" || this.REGEXPS.byline.test(matchString)) && this._isValidByline(node.textContent)) {
       this._articleByline = node.textContent.trim();
       return true;
     }
@gijsk
Copy link
Contributor

gijsk commented Dec 31, 2024

querySelector is not available in the JSDOMParser.js implementation, which is what gets used inside Firefox (on a worker thread, where there is no "real" DOM). So we can't really rely on it from that perspective. Building a fully-fledged CSS selector parsing and matching engine is also somewhat out of scope for the library.

It looks like this patch provides graceful fallback though, so we may still be able to take it... would you mind submitting it as a PR instead? That would also offer some visibility as to what (if anything) this change would do to the existing testcases.

@gijsk gijsk added the metadata Issues with the metadata generated by readability label Dec 31, 2024
@danielnixon
Copy link
Contributor Author

Thanks for taking the time to explain that @gijsk.

I did notice precedent for querySelectorAll in _getAllNodesWithTag but that looks like it's just a performance boost if querySelectorAll happens to be available. It would probably be unwise to wind up in a state where the actual outcome is materially affected by whether querySelectorAll is available or not (which would happen with my patch above).

I can probably reformulate this in terms of repeated calls to _getNextNode with a check for itemprop === name applied to each node. I'll do that and get a PR up.

@danielnixon
Copy link
Contributor Author

PR in terms of _getNextNode: #943

gijsk pushed a commit that referenced this issue Jan 9, 2025
* Extract author name from itemprop='name'. Fixes #935

* De-dupe textContent.trim()
@gijsk gijsk closed this as completed in #943 Jan 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
metadata Issues with the metadata generated by readability
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants