`html2rss` is a Ruby gem that generates RSS 2.0 feeds from websites.
Its `auto_source` scraper finds items for the RSS feed automatically.
Additionally, you can use the `selectors` scraper and control the information extraction.
It takes plain old CSS selectors and extracts the information with help from Extractors and chainable post processors.
It supports scraping JSON responses.
To scrape websites that require JavaScript, html2rss can request them using a headless browser (Puppeteer / browserless.io). Regardless of the request strategy, you can set HTTP request headers.
Like it? | Star it! |
Endorse it? | Sponsor it! |
Tip
Want to retrieve your RSS feeds via HTTP? Check out html2rss-web!
Install Ruby (latest version is recommended) on your machine and run `gem install html2rss` in your terminal.
After the installation has finished, `html2rss help` will print usage information.
html2rss offers an automatic RSS generation feature. Try it on the CLI with:
html2rss auto https://unmatchedstyle.com/
If the results are not to your satisfaction, you can create a feed config file.
Create a file called `my_config_file.yml` with this sample content:
channel:
  url: https://unmatchedstyle.com
selectors:
  items:
    selector: "article[id^='post-']"
  title:
    selector: h2
  url:
    selector: a
    extractor: href
  description:
    selector: ".post-content"
auto_source: {} # this enables auto_source additionally. Remove if you don't want that.
Build the feed from this config with: `html2rss feed ./my_config_file.yml`.
Html2rss is configured using `channel`, `selectors`, `strategy`, `headers`, `stylesheets` and `auto_source`.
The possible options of each are explained below.
Good to know:
- You'll find extensive example feed configs at `spec/*.test.yml`.
- See `html2rss-configs` for ready-made feed configs!
- If you've created feed configs, you're invited to send a PR to `html2rss-configs` to make your config available to the public.
Alright, let's dive in.
attribute | required? | type | default | remark |
---|---|---|---|---|
`url` | required | String | | |
`title` | optional | String | auto-generated | |
`description` | optional | String | auto-generated | Retrieved from meta description tags |
`author` | optional | String | blank | Format: `email (Name)` |
`ttl` | optional | Integer | auto-generated | The response's max-age; falls back to 360 (minutes) |
`language` | optional | String | auto-generated | Determined by the `lang` attribute |
`time_zone` | optional | String | `'UTC'` | TimeZone name |
The `auto_source` scraper finds items automatically. To find them, it searches the website for:
- `<script type="application/ld+json">` tags which contain Schema.org objects like Article.
- Semantic HTML, i.e. tags like `<article>`.
- As a last resort, frequently repeated HTML patterns.
It's a good idea to give `auto_source` a try before starting to configure the `selectors` scraper.
The `selectors` scraper requires you to specify CSS selectors.
You must give an `items` selector hash, which contains the CSS selector. The items selector selects a collection of HTML tags from which the RSS feed items are built. Except for the `items` selector, all other keys are scoped to each item of the collection.
To build a valid RSS 2.0 item, you need at least a `title` or a `description` in your item. You can, of course, have both.
Having an `items` and a `title` selector is enough to build a simple feed:
channel:
  url: "https://example.com"
selectors:
  items:
    selector: ".article"
  title:
    selector: "h1"
Your `selectors` hash can contain arbitrarily named selectors, but only a few will make it into the RSS feed (due to the RSS 2.0 specification):
RSS 2.0 tag | name in html2rss | remark |
---|---|---|
`title` | `title` | |
`description` | `description` | Will be sanitized when it contains HTML. |
`link` | `url` | A URL. |
`author` | `author` | |
`category` | `categories` | See notes below. |
`guid` | `guid` | Generated automatically. See notes below. |
`enclosure` | `enclosure` | See notes below. |
`pubDate` | `published_at` | An instance of `Time`. |
`comments` | `comments` | A URL. |
`source` | | Not yet supported. |
Every named selector (i.e. `title`, `description`, see above) in your `selectors` can have these attributes:
name | value |
---|---|
`selector` | The CSS selector to select the tag with the information. |
`extractor` | Name of the extractor. See notes below. |
`post_process` | An array. See notes below. |
Extractors help with extracting the information from the selected HTML tag.
- The default extractor is `text`, which returns the tag's inner text.
- The `html` extractor returns the tag's outer HTML.
- The `href` extractor returns a URL from the tag's `href` attribute and corrects relative ones to absolute ones.
- The `attribute` extractor returns the value of that tag's attribute.
- The `static` extractor returns the configured static value (it doesn't extract anything).
- See file list of extractors.
Extractors might need extra attributes on the selector hash. Read their docs for usage examples.
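As a rough illustration of what "corrects relative ones to absolute ones" means for the `href` extractor, the following sketch resolves an extracted link against the channel URL using Ruby's standard `URI.join`. The helper name `absolute_url` is hypothetical and not part of html2rss; how the gem does this internally may differ.

```ruby
require 'uri'

# Hypothetical helper (not html2rss API): resolves an extracted href
# against the channel URL, leaving absolute URLs untouched.
def absolute_url(channel_url, href)
  URI.join(channel_url, href).to_s
end

puts absolute_url('https://example.com/blog/', '/posts/1')
puts absolute_url('https://example.com/blog/', 'posts/1')
puts absolute_url('https://example.com/blog/', 'https://other.example/x')
```

Root-relative links replace the whole path, while plain relative links are appended to the directory of the channel URL.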
See a Ruby example
Html2rss.feed(
  channel: {},
  selectors: {
    url: { selector: 'a', extractor: 'href' }
  }
)
See a YAML feed config example
channel:
  # ... omitted
selectors:
  # ... omitted
  url:
    selector: "a"
    extractor: "href"
Extracted information can be further manipulated with post processors. You can specify one or more post processors; they are applied in the order given.
name | description |
---|---|
`gsub` | Allows global substitution operations on Strings (Regexp or simple pattern). |
`html_to_markdown` | Converts HTML to Markdown, using reverse_markdown. |
`markdown_to_html` | Converts Markdown to HTML, using kramdown. |
`parse_time` | Parses a String containing a time in a time zone. |
`parse_uri` | Parses a String as URL. |
`sanitize_html` | Strips unsafe and unneeded HTML and adds security-related attributes. |
`substring` | Cuts a part out of a String, starting at a position. |
`template` | Based on a template, creates a new String filled with the values of other selectors. |
Always use the `sanitize_html` post processor for HTML content. Never trust the internet!
If the `description` contains HTML, it will be sanitized automatically.
YAML example: build the description from a template String (in Markdown) and convert that Markdown to HTML
channel:
  # ... omitted
selectors:
  # ... omitted
  price:
    selector: '.price'
  description:
    selector: '.section'
    post_process:
      - name: template
        string: |
          # %{self}
          Price: %{price}
      - name: markdown_to_html
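The `%{self}` and `%{price}` placeholders follow the named-reference syntax of Ruby's `format`. The sketch below shows how such a template string could be filled; the helper `render_template` and the sample values are hypothetical, and it is an assumption that the `template` post processor behaves like plain `format`.

```ruby
# Hypothetical helper (not html2rss API): fills a template string with
# the values of other selectors, using Ruby's named format references.
def render_template(template, values)
  format(template, values)
end

# `self` stands for the value of the selector being post-processed.
values = { self: 'Blue Widget', price: '9.99 EUR' }
puts render_template("# %{self}\nPrice: %{price}", values)
```

With plain `format`, a placeholder that has no matching key raises a `KeyError`, so every referenced selector needs a value.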
The post processor `gsub` makes use of Ruby's `gsub` method.
key | type | required | note |
---|---|---|---|
`pattern` | String | yes | Can be Regexp or String. |
`replacement` | String | yes | Can be a backreference. |
See a Ruby example
Html2rss.feed(
  channel: {},
  selectors: {
    title: { selector: 'a', post_process: [{ name: 'gsub', pattern: 'foo', replacement: 'bar' }] }
  }
)
See a YAML feed config example
channel:
  # ... omitted
selectors:
  # ... omitted
  title:
    selector: "a"
    post_process:
      - name: "gsub"
        pattern: "foo"
        replacement: "bar"
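Since the post processor builds on Ruby's `String#gsub`, it may help to see the three kinds of substitution mentioned in the table (simple pattern, Regexp, backreference) in plain Ruby. The helper `substitute` and the sample strings are illustrative only; how html2rss turns the configured `pattern` String into a Regexp is not shown here.

```ruby
# Illustrative wrapper around String#gsub (not html2rss API).
def substitute(value, pattern, replacement)
  value.gsub(pattern, replacement)
end

# Simple String pattern: every occurrence is replaced.
puts substitute('foo fighters', 'foo', 'bar')

# Regexp pattern: strip a leading counter such as "1. ".
puts substitute('1. The Shawshank Redemption', /^(\d+.)\s/, '')

# Backreferences in the replacement: swap the two captured words.
puts substitute('Doe, Jane', /(\w+), (\w+)/, '\2 \1')
```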
The `categories` selector takes an array of selector names. The value of each of those selectors will become a `<category>` on the RSS item.
See a Ruby example
Html2rss.feed(
  channel: {},
  selectors: {
    genre: {
      # ... omitted
      selector: '.genre'
    },
    branch: { selector: '.branch' },
    categories: %i[genre branch]
  }
)
See a YAML feed config example
channel:
  # ... omitted
selectors:
  # ... omitted
  genre:
    selector: ".genre"
  branch:
    selector: ".branch"
  categories:
    - genre
    - branch
By default, html2rss generates a stable GUID automatically, based on the item's `url`, or ultimately on `title` or `description`.
If this is not stable (i.e. your RSS reader frequently shows already read articles as new/unread), you can choose from which attributes the GUID will be built. The principle is the same as for the categories: pass an array of selector names.
See a Ruby example
Html2rss.feed(
  channel: {},
  selectors: {
    title: {
      # ... omitted
      selector: 'h1'
    },
    url: { selector: 'a', extractor: 'href' },
    guid: %i[url]
  }
)
See a YAML feed config example
channel:
  # ... omitted
selectors:
  # ... omitted
  title:
    selector: "h1"
  url:
    selector: "a"
    extractor: "href"
  guid:
    - url
In all cases, the GUID is eventually encoded as a base-36 CRC32 checksum.
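To make "base-36 CRC32 checksum" concrete, here is a minimal sketch using Ruby's standard `Zlib.crc32`. The helper `guid_for` is hypothetical, and joining the selected values into one String before hashing is an assumption about how the inputs are combined.

```ruby
require 'zlib'

# Hypothetical helper (not html2rss API): derives a short, stable GUID
# from the values of the chosen selectors.
# Assumption: the values are simply concatenated before hashing.
def guid_for(*values)
  Zlib.crc32(values.join).to_s(36)
end

puts guid_for('https://example.com/articles/1')
puts guid_for('A title', 'A description')
```

The same input always yields the same GUID, which is why an item whose URL or title changes between requests shows up as new in a feed reader.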
An enclosure can be any file, e.g. an image, audio or video (think podcast).
The `enclosure` selector needs to return a URL of the content to enclose. If the extracted URL is relative, it will be converted to an absolute one using the channel's URL as base.
Since `html2rss` does no further inspection of the enclosure, its support comes with trade-offs:
- The content-type is guessed from the file extension of the URL, unless one is specified in `content_type`.
- If the content-type guessing fails, it will default to `application/octet-stream`.
- The content-length will always be undetermined and therefore stated as `0` bytes.
Read the RSS 2.0 spec for further information on enclosing content.
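The trade-offs listed above can be sketched as follows. Everything here is illustrative: `enclosure_for`, `KNOWN_TYPES` and `FALLBACK_TYPE` are hypothetical names, and the tiny extension table stands in for whatever MIME lookup html2rss actually uses.

```ruby
require 'uri'

# Hypothetical fallback and lookup table (not html2rss API).
FALLBACK_TYPE = 'application/octet-stream'
KNOWN_TYPES = {
  '.mp3' => 'audio/mpeg',
  '.mp4' => 'video/mp4',
  '.jpg' => 'image/jpeg',
  '.png' => 'image/png'
}.freeze

# Builds the three enclosure attributes of an RSS 2.0 item.
def enclosure_for(url, channel_url, content_type: nil)
  absolute = URI.join(channel_url, url) # relative URLs use the channel URL as base
  extension = File.extname(absolute.path).downcase
  {
    url: absolute.to_s,
    type: content_type || KNOWN_TYPES.fetch(extension, FALLBACK_TYPE),
    length: 0 # the content is never inspected, so the size stays unknown
  }
end

p enclosure_for('/media/episode-1.mp3', 'https://example.com/podcast')
p enclosure_for('/media/episode-2', 'https://example.com/podcast')
p enclosure_for('/media/episode-3', 'https://example.com/podcast', content_type: 'audio/ogg')
```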
See a Ruby example
Html2rss.feed(
  channel: {},
  selectors: {
    enclosure: {
      selector: 'audio',
      extractor: 'attribute',
      attribute: 'src',
      content_type: 'audio/mp3'
    }
  }
)
See a YAML feed config example
channel:
  # ... omitted
selectors:
  # ... omitted
  enclosure:
    selector: "audio"
    extractor: "attribute"
    attribute: "src"
    content_type: "audio/mp3"
See the more complex formatting options of the `sprintf` method.
When the requested website returns a response with the `application/json` content type (e.g. because you sent an `Accept: application/json` header in the request), the selectors scraper naively converts that JSON to XML. You can then query that XML using CSS selectors.
Note
The JSON response must be an Array or Hash for this to work.
See example of a converted JSON object
This JSON object:
{
"data": [{ "title": "Headline", "url": "https://example.com" }]
}
converts to:
<object>
  <data>
    <array>
      <object>
        <title>Headline</title>
        <url>https://example.com</url>
      </object>
    </array>
  </data>
</object>
Your items selector would be `array > object`, the item's URL selector would be `url`.
See example of a converted JSON array
This JSON array:
[{ "title": "Headline", "url": "https://example.com" }]
converts to:
<array>
  <object>
    <title>Headline</title>
    <url>https://example.com</url>
  </object>
</array>
Your items selector would be `array > object`, the item's URL selector would be `url`.
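The two conversions above follow a simple recursive rule: a Hash becomes `<object>` with one child tag per key, an Array becomes `<array>`, and everything else becomes text. The sketch below reproduces that rule (without indentation) to help with predicting selectors; `json_to_xml` is a hypothetical function, not the converter html2rss ships.

```ruby
require 'json'

# Hypothetical sketch of the naive JSON-to-XML mapping (not html2rss API).
def json_to_xml(value)
  case value
  when Hash
    children = value.map { |key, child| "<#{key}>#{json_to_xml(child)}</#{key}>" }
    "<object>#{children.join}</object>"
  when Array
    "<array>#{value.map { |child| json_to_xml(child) }.join}</array>"
  else
    # Escape characters that would otherwise break the XML.
    value.to_s.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;')
  end
end

json = '{ "data": [{ "title": "Headline", "url": "https://example.com" }] }'
puts json_to_xml(JSON.parse(json))
```

This sketch assumes that all JSON keys are valid XML tag names; keys containing spaces or starting with digits would need extra handling.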
See a Ruby example
Html2rss.feed(
  headers: {
    Accept: 'application/json'
  },
  channel: {
    url: 'http://domainname.tld/whatever.json'
  },
  selectors: {
    title: { selector: 'foo' }
  }
)
See a YAML feed config example
channel:
  url: "http://domainname.tld/whatever.json"
headers:
  Accept: application/json
selectors:
  title:
    selector: "foo"
By default, html2rss issues a naive HTTP request and extracts information from the response. That is performant and works for many websites. Under the hood, the faraday gem is used, which gives the default strategy its name: `faraday`.
Modern websites often do not render much HTML on the server, but evaluate JavaScript on the client to create the HTML. Because the default strategy does not execute any JavaScript, the faraday strategy will not find the "juicy content". For this scenario, try the browserless strategy.
You can write your own custom strategy and make use of it. Consult the docs of `Html2rss::RequestService.register_strategy()`.
You can use Browserless.io to run a headless Chrome browser and return the website's source code after the website generated it. For this, you can either run your own Browserless.io instance (Docker image available -- read their license!) or pay them for a hosted instance.
To run a local Browserless.io instance, you can use the following Docker command:
docker run \
--rm \
-p 3000:3000 \
-e "CONCURRENT=10" \
-e "TOKEN=6R0W53R135510" \
ghcr.io/browserless/chromium
To make html2rss use your instance, specify the `browserless` strategy.
# auto:
BROWSERLESS_IO_WEBSOCKET_URL="ws://127.0.0.1:3000" BROWSERLESS_IO_API_TOKEN="6R0W53R135510" \
  html2rss auto --strategy=browserless https://example.com
# feed:
BROWSERLESS_IO_WEBSOCKET_URL="ws://127.0.0.1:3000" BROWSERLESS_IO_API_TOKEN="6R0W53R135510" \
  html2rss feed --strategy=browserless the_feed_config.yml
Tip
When running locally with the commands above, you can skip setting the environment variables, as the values in the example match the defaults.
In your config, set `strategy: browserless`.
See a YAML feed config example
strategy: browserless
headers:
  User-Agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
channel:
  url: https://www.imdb.com/user/ur67728460/ratings
  ttl: 1440
selectors:
  items:
    selector: "li.ipc-metadata-list-summary-item"
  title:
    selector: ".ipc-title__text"
    post_process:
      - name: gsub
        pattern: "/^(\\d+.)\\s/"
        replacement: ""
      - name: template
        string: "%{self} rated with: %{user_rating}"
  url:
    selector: "a.ipc-title-link-wrapper"
    extractor: "href"
  user_rating:
    selector: "[data-testid='ratingGroup--other-user-rating'] > .ipc-rating-star--rating"
To set HTTP request headers, add them to `headers`. This is useful, e.g., for APIs that require an `Authorization` header, or if you'd like to send `Accept: application/json`.
headers:
  Authorization: "Bearer YOUR_TOKEN"
  Accept: application/json
channel:
  url: "https://example.com/api/resource"
selectors:
  # ... omitted
Or for setting a User-Agent:
headers:
  User-Agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
channel:
  url: "https://example.com"
selectors:
  # ... omitted
auto_source: {}
Sometimes there are structurally similar pages with different URLs, or you need to pass some values into the headers.
In such cases, you can add dynamic parameters to the `channel` and `headers` values.
Example of a dynamic parameter `id` in the channel URL:
channel:
  url: "http://domainname.tld/whatever/%<id>s.html"
headers:
  X-Something: "%<foo>s"
Command line usage example:
html2rss feed the_feed_config.yml --params id:42 foo:bar
See a Ruby example
Html2rss.feed(channel: { url: 'http://domainname.tld/whatever/%<id>s.html' },
headers: { 'X-Something': '%<foo>s' },
params: { id: 42, foo: 'bar' })
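The `%<id>s` placeholders use the named-reference syntax of Ruby's `format` (alias `sprintf`), so the usual formatting flags are available. The sketch below shows the substitution on its own; `apply_params` is a hypothetical helper, and it is an assumption that html2rss applies the parameters exactly this way.

```ruby
# Hypothetical helper (not html2rss API): substitutes dynamic parameters
# into a config value using Ruby's format with named references.
def apply_params(template, params)
  format(template, params)
end

params = { id: 42, foo: 'bar' }

puts apply_params('http://domainname.tld/whatever/%<id>s.html', params)
puts apply_params('%<foo>s', params)

# A more complex format: zero-pad the id to five digits.
puts apply_params('http://domainname.tld/whatever/%<id>05d.html', params)
```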
To display RSS feeds nicely in a web browser, you can:
- add a plain old CSS stylesheet, or
- use XSLT (eXtensible Stylesheet Language Transformations).
A web browser will apply these stylesheets and show the contents as described.
In a CSS stylesheet, you'd use `element` selectors to apply styles.
If you want to do more, you need to create an XSLT. XSLT allows you to use an HTML template and to freely design the presentation of the RSS content, including using JavaScript and external resources.
You can add as many stylesheets and types as you like. Just add them to your global configuration.
Ruby: a stylesheet config example
Html2rss.feed(
  stylesheets: [
    {
      href: '/relative/base/path/to/style.xls', media: :all, type: 'text/xsl'
    },
    {
      href: 'http://example.com/rss.css', media: :all, type: 'text/css'
    }
  ],
  channel: {},
  selectors: {}
)
YAML: a stylesheet config example
stylesheets:
  - href: "/relative/base/path/to/style.xls"
    media: "all"
    type: "text/xsl"
  - href: "http://example.com/rss.css"
    media: "all"
    type: "text/css"
feeds:
  # ... omitted
Recommended further readings:
- How to format RSS with CSS on lifewire.com
- XSLT: Extensible Stylesheet Language Transformations on MDN
- The XSLT used by html2rss-web
This step is not required to work with this gem, but is helpful when you plan to use the CLI or html2rss-web.
First, create a YAML file, e.g. `feeds.yml`. This file will contain your multiple feed configs under the key `feeds`. Everything you specify outside of this key will be applied to every feed you're building.
Example:
headers:
  "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"
  "Accept": "text/html"
feeds:
  myfeed:
    channel:
    selectors:
    auto_source:
  myotherfeed:
    headers:
    strategy:
    channel:
    selectors:
Your feed configs go below `feeds`.
Find a full example of a `feeds.yml` at `spec/fixtures/feeds.test.yml`.
If you prefer to have a single feed defined in a YAML file, just omit the `feeds` key. Check out `single.test.yml`.
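The idea that everything outside of `feeds` applies to every feed can be pictured as a merge of the global keys with one feed's own config. The sketch below illustrates this with Ruby's standard YAML library; `feed_config` and `FEEDS_YAML` are hypothetical, and the shallow merge is an assumption (the gem's own `Html2rss.config_from_yaml_file` may combine nested keys differently).

```ruby
require 'yaml'

# Hypothetical sample file content, kept inline to stay self-contained.
FEEDS_YAML = <<~YAML
  headers:
    Accept: "text/html"
  feeds:
    myfeed:
      channel:
        url: "https://example.com"
      auto_source: {}
YAML

# Hypothetical helper (not html2rss API): combines the global keys
# with the config of a single named feed.
def feed_config(yaml, feed_name)
  config = YAML.safe_load(yaml)
  global = config.reject { |key, _| key == 'feeds' }
  # Assumption: shallow merge, keys of the feed win over global keys.
  global.merge(config.fetch('feeds').fetch(feed_name))
end

p feed_config(FEEDS_YAML, 'myfeed')
```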
Now you can build your feeds like this:
Build feeds in Ruby
require 'html2rss'
myfeed = Html2rss.config_from_yaml_file('feeds.yml', 'myfeed')
Html2rss.feed(myfeed)
myotherfeed = Html2rss.config_from_yaml_file('feeds.yml', 'myotherfeed')
Html2rss.feed(myotherfeed)
single = Html2rss.config_from_yaml_file('single.test.yml')
Html2rss.feed(single)
Build feeds on the command line
html2rss feed feeds.yml myfeed
html2rss feed feeds.yml myotherfeed
html2rss feed single.test.yml
You can also install it as a dependency in your Ruby project:
step | command |
---|---|
Add this line to your `Gemfile`: | `gem 'html2rss'` |
Then execute: | `bundle` |
In your code: | `require 'html2rss'` |
Here's a minimal working example using Ruby:
require 'html2rss'
rss = Html2rss.feed(
  channel: { url: 'https://stackoverflow.com/questions' },
  auto_source: {}
)
puts rss
Alternatively, instead of `auto_source`, provide `selectors` (you can also use both simultaneously):
require 'html2rss'
rss = Html2rss.feed(
  channel: { url: 'https://stackoverflow.com/questions' },
  selectors: {
    items: { selector: '#hot-network-questions > ul > li' },
    title: { selector: 'a' },
    url: { selector: 'a', extractor: 'href' }
  }
)
puts rss
- Check that the channel URL does not redirect to a mobile page with a different markup structure.
- Do not rely on your web browser's developer console when using the standard strategy, because that strategy does not execute JavaScript. In such cases, fiddling with `curl` and `pup` to find the selectors seems efficient (`curl URL | pup`).
- CSS selectors are versatile. Here's an overview.
Find ideas for what to contribute in:
- https://github.com/orgs/html2rss/discussions
- the issues tracker: https://github.com/html2rss/html2rss/issues
To submit changes:
- Fork this repo ( https://github.com/html2rss/html2rss/fork )
- Create your feature branch (`git checkout -b my-new-feature`)
- Implement and commit your changes (`git commit -am 'feat: add XYZ'`)
- Push to the branch (`git push origin my-new-feature`)
- Create a new Pull Request using the GitHub web UI
- `bin/setup`: installs dependencies and sets up the development environment.
- For a modern Ruby development experience: install `ruby-lsp` and integrate it into your IDE. For example: Ruby in Visual Studio Code.