Extracting author from DOM could be more surgical if we detected itemprop="name" #935
Comments
It looks like this patch provides graceful fallback though, so we may still be able to take it... would you mind submitting it as a PR instead? That would also offer some visibility as to what (if anything) this change would do to the existing test cases.
Thanks for taking the time to explain that @gijsk. I did notice precedent for I can probably reformulate this in terms of repeated calls to
PR in terms of
* Extract author name from itemprop='name'. Fixes #935
* De-dupe textContent.trim()
Take https://lithub.com/why-are-writers-particularly-drawn-to-tarot/ as an example.
There are no meta tags or json-ld from which to extract the author, so Readability looks in the DOM. It matches this snippet:
It greedily takes the element's whole text content, so the resulting byline includes more than just the author's name.
If we were a bit more surgical and noticed that the [itemprop="author"] element contained an [itemprop="name"] element, and then took only that name, we'd get a byline of just the author's name.
Here's a patch. It uses querySelector, which I realise is rarely used in Readability, but I think it's well enough established now that we can rely on it being available.
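To make the idea concrete, here is a minimal sketch of the fallback behaviour described above. It is not the actual patch or Readability's code: extractByline and fakeElement are hypothetical names, the fake elements only stand in for real DOM nodes so the snippet runs outside a browser, and the author text is invented for illustration.

```javascript
// Hypothetical helper, not Readability's actual code: prefer a nested
// [itemprop="name"] element, otherwise fall back to the node's full text.
function extractByline(authorNode) {
  // querySelector returns null when nothing matches the selector.
  var nameNode = authorNode.querySelector('[itemprop="name"]');
  var source = nameNode || authorNode;
  return source.textContent.trim();
}

// Minimal stand-in for a DOM element so the sketch runs without a browser.
// It only understands the single selector used above.
function fakeElement(textContent, nameChild) {
  return {
    textContent: textContent,
    querySelector: function (selector) {
      return selector === '[itemprop="name"]' && nameChild ? nameChild : null;
    },
  };
}

// Invented sample data: one author node with a nested name, one without.
var withName = fakeElement(
  "  By Jane Doe, Staff Writer  ",
  fakeElement(" Jane Doe ", null)
);
var withoutName = fakeElement("  By Jane Doe  ", null);

console.log(extractByline(withName));    // Jane Doe
console.log(extractByline(withoutName)); // By Jane Doe
```

When no [itemprop="name"] descendant exists, the result is the same trimmed full text as before, which is the graceful fallback mentioned in the comments.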