Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle en dashes (–) and em dashes (—) in titles #954

Open
robinmetral opened this issue Feb 15, 2025 · 2 comments · May be fixed by #960
Open

Handle en dashes (–) and em dashes (—) in titles #954

robinmetral opened this issue Feb 15, 2025 · 2 comments · May be fixed by #960

Comments

@robinmetral
Copy link

To determine an article's title1, the first thing that Readability does is to attempt to split the document's <title> by a separator and keep the first part:

readability/Readability.js

Lines 595 to 598 in 118f015

// If there's a separator in the title, first remove the final part
if (/ [\|\-\\\/>»] /.test(curTitle)) {
titleHadHierarchicalSeparators = / [\\\/>»] /.test(curTitle);
curTitle = origTitle.replace(/(.*)[\|\-\\\/>»] .*/gi, "$1");

This is a good heuristic and seems to extract the right title for a variety of documents. However the separators only cover the following regex: / [\|\-\\\/>»] /. This means that the following separator patterns are covered:

  • Title | Site name
  • Title - Site name
  • Title \ Site name
  • Title > Site name
  • Title » Site name

Goal

The heuristic described above leaves out some other relatively common separators. The ones I'm seeing most often are en and em dashes:

  • Title – Site name
  • Title — Site name

I'd like to add support for these two separators. This would be a two-line change and wouldn't change any of the existing logic.

Here are example titles that this would fix:2

  • Your weekly Web Directions reading – Web Directions (article)
  • Partisan Politics and the Road to Plutocracy – Economics from the Top Down (article)
  • Pluralistic: Premature Internet Activists (13 Feb 2025) – Pluralistic: Daily links from Cory Doctorow (article)3

Can I open a PR?

Non-goals (unless you want them to be)

  • Handling cases where em dashes are used as separators without spaces: Title—Site name. This is relatively common (example 1, example 2), but it would require treating an em dash differently from all other separators (a hyphen without spaces around cannot be treated as a separator, it would split hyphenated words). If you don't mind the extra complexity, I'm happy to cover this as well.
  • Handling less common separators. For example, I've seen various types of interpuncts being used (·, , , , etc.), but I don't expect Readability to cover all of these since the list could become endless (though it could consider covering the most common, perhaps &middot; and &bullet;—I leave this up to you).

Footnotes

  1. Unless there is explicit title metadata in the document: https://github.com/mozilla/readability/blob/118f01538e167218bd86ffd493bd3466aec4870a/Readability.js#L1813-L1815

  2. These are all using en dashes but I think including em dashes makes sense as well. There may not be many English-language documents using em dashes with spaces around them, but that is a common thing in French for example.

  3. Readability currently thinks that the title here is Daily links from Cory Doctorow, which is the publication name. This happens because of the semicolons: if Readability doesn't find a separator but finds a semicolon instead, it takes whatever is after the last semicolon.

@gijsk
Copy link
Contributor

gijsk commented Mar 3, 2025

Can I open a PR?

Of course! Sorry for the slow response. I'm unfortunately not able to spend tons of time on readability amongst other work and some ongoing health concerns, so sometimes responses may be delayed - but PRs are definitely welcome!

@robinmetral robinmetral linked a pull request Mar 4, 2025 that will close this issue
@robinmetral
Copy link
Author

Thanks for the reply @gijsk, here it is: #960. Happy to discuss implementation over there (though the PR is fairly straightforward, as you will see).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants