Extracting author from DOM could be more surgical if we detected itemprop="name"

Take https://lithub.com/why-are-writers-particularly-drawn-to-tarot/ as an example.

There are no meta tags or JSON-LD data from which to extract the author, so Readability falls back to the DOM. It matches this snippet:

```html
<div class="author_info" itemprop="author" itemscope="" itemtype="http://schema.org/Person">
  <div class="author_name">By&nbsp;<a href="https://lithub.com/author/rochelle/" itemprop="url"><span itemprop="name">Rochelle&nbsp;Spencer</span></a></div>
  <div class="author_hr"><hr></div>
  <div class="publish_date">August 27, 2019</div>
</div>
```

It greedily takes the whole text content of the matched element, resulting in a byline of

```
By Rochelle Spencer August 27, 2019
```

If we were a bit more surgical and noticed that the `[itemprop="author"]` element contained an `[itemprop="name"]` element, and then took only that name, we'd get a byline of just

```
Rochelle Spencer
```

Here's a patch. It uses `querySelector` -- which I realise is rarely used in Readability -- but I think it's well enough established now that we can rely on it being available.

```diff
       var itemprop = node.getAttribute("itemprop");
     }
 
-    if ((rel === "author" || (itemprop && itemprop.indexOf("author") !== -1) || this.REGEXPS.byline.test(matchString)) && this._isValidByline(node.textContent)) {
+    if (itemprop && itemprop.indexOf("author") !== -1) {
+      const nameItem = node.querySelector && node.querySelector('[itemprop="name"]');
+      if (nameItem && this._isValidByline(nameItem.textContent)) {
+        this._articleByline = nameItem.textContent.trim();
+        return true;
+      } else if (this._isValidByline(node.textContent)) {
+        this._articleByline = node.textContent.trim();
+        return true;
+      }
+    }
+
+    if ((rel === "author" || this.REGEXPS.byline.test(matchString)) && this._isValidByline(node.textContent)) {
       this._articleByline = node.textContent.trim();
       return true;
     }
```
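
To sanity-check the new branch outside a browser, here is a rough stand-alone sketch of the same logic run against fake nodes. `bylineFromAuthorNode`, `isValidByline` and `fakeNode` are hypothetical names made up for this illustration and are not part of Readability. The validity check is only an assumed approximation of `_isValidByline` (a non-empty, reasonably short string).

```javascript
// Assumption: approximates Readability's _isValidByline (non-empty, shortish string).
function isValidByline(text) {
  if (typeof text !== "string") {
    return false;
  }
  var trimmed = text.trim();
  return trimmed.length > 0 && trimmed.length < 100;
}

// Hypothetical helper mirroring the itemprop="author" branch of the patch.
// Returns the byline string, or null if the node is not a usable author node.
function bylineFromAuthorNode(node) {
  var itemprop = node.getAttribute("itemprop");
  if (!itemprop || itemprop.indexOf("author") === -1) {
    return null;
  }
  // Prefer the nested itemprop="name" element when there is one.
  var nameItem = node.querySelector && node.querySelector('[itemprop="name"]');
  if (nameItem && isValidByline(nameItem.textContent)) {
    return nameItem.textContent.trim();
  }
  // Otherwise fall back to the node's whole text, as before.
  if (isValidByline(node.textContent)) {
    return node.textContent.trim();
  }
  return null;
}

// Minimal stand-in for a DOM element: just enough surface for the helper above.
function fakeNode(attrs, textContent, nameChild) {
  return {
    textContent: textContent,
    getAttribute: function (name) {
      return Object.prototype.hasOwnProperty.call(attrs, name) ? attrs[name] : null;
    },
    querySelector: function (selector) {
      return selector === '[itemprop="name"]' ? (nameChild || null) : null;
    }
  };
}

var nameSpan = fakeNode({ itemprop: "name" }, "Rochelle\u00a0Spencer");
var authorInfo = fakeNode(
  { itemprop: "author" },
  "By\u00a0Rochelle\u00a0Spencer August 27, 2019",
  nameSpan
);
var plainAuthor = fakeNode({ itemprop: "author" }, "  Jane Doe  ");

console.log(bylineFromAuthorNode(authorInfo));  // name taken from the nested element
console.log(bylineFromAuthorNode(plainAuthor)); // no nested name, falls back to full text
```

In this sketch the extracted name still contains the non-breaking space from the markup (`Rochelle\u00a0Spencer`), since `trim()` only strips whitespace at the ends. Whether that should be normalised is a separate question from this patch.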

Extracting author from DOM could be more surgical if we detected itemprop="name" #935

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Extracting author from DOM could be more surgical if we detected itemprop="name" #935

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions