Take https://lithub.com/why-are-writers-particularly-drawn-to-tarot/ as an example.
The page has no meta tags or JSON-LD from which to extract the author, so Readability falls back to the DOM, where it matches this snippet:
<div class="author_info" itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<div class="author_name">By <a href="https://lithub.com/author/rochelle/" itemprop="url"><span itemprop="name">Rochelle Spencer</span></a></div>
<div class="author_hr"><hr></div>
<div class="publish_date">August 27, 2019</div>
</div>
It greedily takes the node's entire text content, resulting in a byline of:
By Rochelle Spencer August 27, 2019
If we were a bit more surgical, noticing that the [itemprop="author"] element contains an [itemprop="name"] element and taking only that name, we'd get a byline of just:
Rochelle Spencer
Here's a patch. It uses querySelector, which I realise is rarely used in Readability, but I think it's well enough established now that we can rely on it being available.
       var itemprop = node.getAttribute("itemprop");
     }
-    if ((rel === "author" || (itemprop && itemprop.indexOf("author") !== -1) || this.REGEXPS.byline.test(matchString)) && this._isValidByline(node.textContent)) {
+    if (itemprop && itemprop.indexOf("author") !== -1) {
+      const nameItem = node.querySelector && node.querySelector('[itemprop="name"]');
+      if (nameItem && this._isValidByline(nameItem.textContent)) {
+        this._articleByline = nameItem.textContent.trim();
+        return true;
+      } else if (this._isValidByline(node.textContent)) {
+        this._articleByline = node.textContent.trim();
+        return true;
+      }
+    }
+
+    if ((rel === "author" || this.REGEXPS.byline.test(matchString)) && this._isValidByline(node.textContent)) {
       this._articleByline = node.textContent.trim();
       return true;
     }
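
To see the intended behaviour in isolation, here's a minimal sketch of the same name-first logic as a standalone function. The helper name extractByline, the stub isValidByline, and the plain-object stand-ins for DOM nodes are all hypothetical, made up for illustration; they are not part of Readability, and the real validity check may differ from the stub.

```javascript
// Hypothetical standalone helper mirroring the patch; not Readability API.
// Returns the byline string, or null if the node is not an author item.
function extractByline(node, isValidByline) {
  const itemprop = node.getAttribute("itemprop");
  if (!itemprop || itemprop.indexOf("author") === -1) {
    return null;
  }

  // Prefer the nested [itemprop="name"] element when it exists and is valid.
  const nameItem = node.querySelector
    ? node.querySelector('[itemprop="name"]')
    : null;
  if (nameItem && isValidByline(nameItem.textContent)) {
    return nameItem.textContent.trim();
  }

  // Otherwise fall back to the whole text content, as before the patch.
  if (isValidByline(node.textContent)) {
    return node.textContent.trim();
  }
  return null;
}

// Stub validity check, assumed for illustration only:
// a non-empty string shorter than 100 characters after trimming.
function isValidByline(text) {
  const trimmed = typeof text === "string" ? text.trim() : "";
  return trimmed.length > 0 && trimmed.length < 100;
}

// Plain-object stand-ins for the author block in the snippet above.
const nameNode = { textContent: "Rochelle Spencer" };
const authorNode = {
  textContent: "By Rochelle Spencer August 27, 2019",
  getAttribute: (attr) => (attr === "itemprop" ? "author" : null),
  querySelector: (selector) =>
    selector === '[itemprop="name"]' ? nameNode : null,
};

console.log(extractByline(authorNode, isValidByline));
```

With the stand-in above, the nested name wins over the surrounding "By ... August 27, 2019" text; a node with itemprop="author" but no nested name still falls back to its full text content.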