Extracting author from DOM could be more surgical if we detected itemprop="name" #935

Closed
@danielnixon

Take https://lithub.com/why-are-writers-particularly-drawn-to-tarot/ as an example.

There are no meta tags or json-ld from which to extract the author, so Readability looks in the DOM. It matches this snippet:

<div class="author_info" itemprop="author" itemscope="" itemtype="http://schema.org/Person">
  <div class="author_name">By&nbsp;<a href="https://lithub.com/author/rochelle/" itemprop="url"><span itemprop="name">Rochelle&nbsp;Spencer</span></a></div>
  <div class="author_hr"><hr></div>
  <div class="publish_date">August 27, 2019</div>
</div>

It greedily takes the whole text contents, resulting in a byline of

By Rochelle Spencer August 27, 2019

If we were a bit more surgical and noticed that the [itemprop="author"] contained an [itemprop="name"] and then only took that name, we'd get a byline of just

Rochelle Spencer

Here's a patch. It uses querySelector -- which I realise is rarely used in Readability -- but I think it's well enough established now that we can rely on it being available.

       var itemprop = node.getAttribute("itemprop");
     }
 
-    if ((rel === "author" || (itemprop && itemprop.indexOf("author") !== -1) || this.REGEXPS.byline.test(matchString)) && this._isValidByline(node.textContent)) {
+    if (itemprop && itemprop.indexOf("author") !== -1) {
+      const nameItem = node.querySelector && node.querySelector('[itemprop="name"]');
+      if (nameItem && this._isValidByline(nameItem.textContent)) {
+        this._articleByline = nameItem.textContent.trim();
+        return true;
+      } else if (this._isValidByline(node.textContent)) {
+        this._articleByline = node.textContent.trim();
+        return true;
+      }
+    }
+
+    if ((rel === "author" || this.REGEXPS.byline.test(matchString)) && this._isValidByline(node.textContent)) {
       this._articleByline = node.textContent.trim();
       return true;
     }
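
To see the effect of that branch outside Readability, here is a minimal sketch. `pickItempropByline` and the stand-in `isValidByline` are hypothetical names written for this illustration (they are not Readability's API), the validity rule is an assumption, and plain objects take the place of DOM nodes so it runs without a browser.

```javascript
// Hypothetical helper mirroring only the itemprop="author" branch of the patch.
// Returns the chosen byline, or null if the node is not an itemprop author
// or no valid byline text is found.
function pickItempropByline(node, isValidByline) {
  var itemprop = node.getAttribute ? node.getAttribute("itemprop") : null;
  if (!itemprop || itemprop.indexOf("author") === -1) {
    return null;
  }
  // Prefer a nested [itemprop="name"] when querySelector is available.
  var nameItem = node.querySelector && node.querySelector('[itemprop="name"]');
  if (nameItem && isValidByline(nameItem.textContent)) {
    return nameItem.textContent.trim();
  }
  // Otherwise fall back to the whole text content, as before the patch.
  return isValidByline(node.textContent) ? node.textContent.trim() : null;
}

// Stand-in validity check (assumption): non-empty and shorter than 100 characters.
function isValidByline(text) {
  var trimmed = (typeof text === "string" ? text : "").trim();
  return trimmed.length > 0 && trimmed.length < 100;
}

// Plain-object stand-ins for the lithub.com snippet quoted in this issue.
var nameSpan = { textContent: "Rochelle\u00a0Spencer" };
var authorInfo = {
  textContent: "By\u00a0Rochelle\u00a0Spencer August 27, 2019",
  getAttribute: function (name) {
    return name === "itemprop" ? "author" : null;
  },
  querySelector: function (selector) {
    return selector === '[itemprop="name"]' ? nameSpan : null;
  }
};

// Same author block, but without a nested itemprop="name" element.
var authorInfoWithoutName = {
  textContent: authorInfo.textContent,
  getAttribute: authorInfo.getAttribute,
  querySelector: function () {
    return null;
  }
};

var surgicalByline = pickItempropByline(authorInfo, isValidByline);
var fallbackByline = pickItempropByline(authorInfoWithoutName, isValidByline);
console.log(surgicalByline);
console.log(fallbackByline);
```

With the nested name present, only the name text is returned. Without it, the whole text content is used, which matches the current greedy behaviour. The non-breaking space inside the name survives, since `trim()` only strips whitespace at the ends.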

Labels: metadata (Issues with the metadata generated by readability)
