Almost lost all content on mp.weixin.qq.com #921

Closed
kfstorm opened this issue Nov 1, 2024 · 3 comments

kfstorm commented Nov 1, 2024

I use monolith to get a single-file HTML version of a web page, then pass it through readability (run with Node.js) to keep only the article. This mostly works well.

However, it seems broken with articles on WeChat. For example, https://mp.weixin.qq.com/s/koaLJvsFLkfi_j3HKIi6Dw.

The screenshot of rendered HTML generated by monolith is like this:

[Screenshot: SCR-20241101-rixf]

And the rendered HTML generated by "monolith -> readability" is like this:

[Screenshot: Clipboard_Screenshot_1730462296]

Almost all of the meaningful article text is lost.

However, if I use Firefox's Reader View on the monolith-generated HTML, everything looks great:

[Screenshot: Clipboard_Screenshot_1730462455]

I'm confused. What's the gap?

My environment
2b70f29bf000:/app$ node --version
v20.15.1
2b70f29bf000:/app$ npm --version
10.8.0
2b70f29bf000:/app$ uname -a
Linux 2b70f29bf000 6.1.0-21-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.90-1 (2024-05-03) x86_64 Linux
cat carnivore/app/readability/package-lock.json | grep -w version -A 1 -B 1
    "node_modules/@mozilla/readability": {
      "version": "0.5.0",
      "resolved": "https://registry.npmjs.org/@mozilla/readability/-/readability-0.5.0.tgz",
--
    "node_modules/agent-base": {
      "version": "7.1.1",
      "resolved": "https://registry.npmjs.org/agent-base/-/agent-base-7.1.1.tgz",
--
    "node_modules/asynckit": {
      "version": "0.4.0",
      "resolved": "https://registry.npmjs.org/asynckit/-/asynckit-0.4.0.tgz",
--
    "node_modules/combined-stream": {
      "version": "1.0.8",
      "resolved": "https://registry.npmjs.org/combined-stream/-/combined-stream-1.0.8.tgz",
--
    "node_modules/cssstyle": {
      "version": "4.1.0",
      "resolved": "https://registry.npmjs.org/cssstyle/-/cssstyle-4.1.0.tgz",
--
    "node_modules/data-urls": {
      "version": "5.0.0",
      "resolved": "https://registry.npmjs.org/data-urls/-/data-urls-5.0.0.tgz",
--
    "node_modules/debug": {
      "version": "4.3.7",
      "resolved": "https://registry.npmjs.org/debug/-/debug-4.3.7.tgz",
--
    "node_modules/decimal.js": {
      "version": "10.4.3",
      "resolved": "https://registry.npmjs.org/decimal.js/-/decimal.js-10.4.3.tgz",
--
    "node_modules/delayed-stream": {
      "version": "1.0.0",
      "resolved": "https://registry.npmjs.org/delayed-stream/-/delayed-stream-1.0.0.tgz",
--
    "node_modules/entities": {
      "version": "4.5.0",
      "resolved": "https://registry.npmjs.org/entities/-/entities-4.5.0.tgz",
--
    "node_modules/form-data": {
      "version": "4.0.1",
      "resolved": "https://registry.npmjs.org/form-data/-/form-data-4.0.1.tgz",
--
    "node_modules/html-encoding-sniffer": {
      "version": "4.0.0",
      "resolved": "https://registry.npmjs.org/html-encoding-sniffer/-/html-encoding-sniffer-4.0.0.tgz",
--
    "node_modules/http-proxy-agent": {
      "version": "7.0.2",
      "resolved": "https://registry.npmjs.org/http-proxy-agent/-/http-proxy-agent-7.0.2.tgz",
--
    "node_modules/https-proxy-agent": {
      "version": "7.0.5",
      "resolved": "https://registry.npmjs.org/https-proxy-agent/-/https-proxy-agent-7.0.5.tgz",
--
    "node_modules/iconv-lite": {
      "version": "0.6.3",
      "resolved": "https://registry.npmjs.org/iconv-lite/-/iconv-lite-0.6.3.tgz",
--
    "node_modules/is-potential-custom-element-name": {
      "version": "1.0.1",
      "resolved": "https://registry.npmjs.org/is-potential-custom-element-name/-/is-potential-custom-element-name-1.0.1.tgz",
--
    "node_modules/jsdom": {
      "version": "25.0.1",
      "resolved": "https://registry.npmjs.org/jsdom/-/jsdom-25.0.1.tgz",
--
    "node_modules/mime-db": {
      "version": "1.52.0",
      "resolved": "https://registry.npmjs.org/mime-db/-/mime-db-1.52.0.tgz",
--
    "node_modules/mime-types": {
      "version": "2.1.35",
      "resolved": "https://registry.npmjs.org/mime-types/-/mime-types-2.1.35.tgz",
--
    "node_modules/ms": {
      "version": "2.1.3",
      "resolved": "https://registry.npmjs.org/ms/-/ms-2.1.3.tgz",
--
    "node_modules/nwsapi": {
      "version": "2.2.13",
      "resolved": "https://registry.npmjs.org/nwsapi/-/nwsapi-2.2.13.tgz",
--
    "node_modules/parse5": {
      "version": "7.2.1",
      "resolved": "https://registry.npmjs.org/parse5/-/parse5-7.2.1.tgz",
--
    "node_modules/punycode": {
      "version": "2.3.1",
      "resolved": "https://registry.npmjs.org/punycode/-/punycode-2.3.1.tgz",
--
    "node_modules/rrweb-cssom": {
      "version": "0.7.1",
      "resolved": "https://registry.npmjs.org/rrweb-cssom/-/rrweb-cssom-0.7.1.tgz",
--
    "node_modules/safer-buffer": {
      "version": "2.1.2",
      "resolved": "https://registry.npmjs.org/safer-buffer/-/safer-buffer-2.1.2.tgz",
--
    "node_modules/saxes": {
      "version": "6.0.0",
      "resolved": "https://registry.npmjs.org/saxes/-/saxes-6.0.0.tgz",
--
    "node_modules/symbol-tree": {
      "version": "3.2.4",
      "resolved": "https://registry.npmjs.org/symbol-tree/-/symbol-tree-3.2.4.tgz",
--
    "node_modules/tldts": {
      "version": "6.1.57",
      "resolved": "https://registry.npmjs.org/tldts/-/tldts-6.1.57.tgz",
--
    "node_modules/tldts-core": {
      "version": "6.1.57",
      "resolved": "https://registry.npmjs.org/tldts-core/-/tldts-core-6.1.57.tgz",
--
    "node_modules/tough-cookie": {
      "version": "5.0.0",
      "resolved": "https://registry.npmjs.org/tough-cookie/-/tough-cookie-5.0.0.tgz",
--
    "node_modules/tr46": {
      "version": "5.0.0",
      "resolved": "https://registry.npmjs.org/tr46/-/tr46-5.0.0.tgz",
--
    "node_modules/w3c-xmlserializer": {
      "version": "5.0.0",
      "resolved": "https://registry.npmjs.org/w3c-xmlserializer/-/w3c-xmlserializer-5.0.0.tgz",
--
    "node_modules/webidl-conversions": {
      "version": "7.0.0",
      "resolved": "https://registry.npmjs.org/webidl-conversions/-/webidl-conversions-7.0.0.tgz",
--
    "node_modules/whatwg-encoding": {
      "version": "3.1.1",
      "resolved": "https://registry.npmjs.org/whatwg-encoding/-/whatwg-encoding-3.1.1.tgz",
--
    "node_modules/whatwg-mimetype": {
      "version": "4.0.0",
      "resolved": "https://registry.npmjs.org/whatwg-mimetype/-/whatwg-mimetype-4.0.0.tgz",
--
    "node_modules/whatwg-url": {
      "version": "14.0.0",
      "resolved": "https://registry.npmjs.org/whatwg-url/-/whatwg-url-14.0.0.tgz",
--
    "node_modules/ws": {
      "version": "8.18.0",
      "resolved": "https://registry.npmjs.org/ws/-/ws-8.18.0.tgz",
--
    "node_modules/xml-name-validator": {
      "version": "5.0.0",
      "resolved": "https://registry.npmjs.org/xml-name-validator/-/xml-name-validator-5.0.0.tgz",
--
    "node_modules/xmlchars": {
      "version": "2.2.0",
      "resolved": "https://registry.npmjs.org/xmlchars/-/xmlchars-2.2.0.tgz",

This is how I use readability to generate the polished HTML: https://github.com/kfstorm/carnivore/blob/bbfd67930223787e58338a16d2d2dffd5d074998/carnivore/app/readability/index.mjs

kfstorm commented Nov 1, 2024

I just tried the main branch version of readability. Still the same.

kfstorm commented Nov 1, 2024

I just found that older versions of readability don't have this issue. A quick binary search over the commits showed that the issue was introduced by commit 522eb4b, a one-line change that excludes elements styled with visibility: hidden.
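
For anyone who wants to repeat that kind of search, here is a minimal sketch of the binary search itself. It assumes the commits are ordered oldest first and that every commit from the first bad one onward is bad. The `findFirstBad` helper, the placeholder commit labels, and the `losesContent` check are all hypothetical; only 522eb4b comes from this thread, and a real check would run the pipeline against each commit.

```javascript
// Find the first entry in an ordered list for which `isBad` returns true.
// Assumes all entries before it are good and all entries from it onward are bad.
function findFirstBad(versions, isBad) {
  let lo = 0;               // lowest index that could still be the first bad one
  let hi = versions.length; // one past the last candidate
  while (lo < hi) {
    const mid = Math.floor((lo + hi) / 2);
    if (isBad(versions[mid])) {
      hi = mid;     // the first bad entry is at mid or earlier
    } else {
      lo = mid + 1; // the first bad entry is after mid
    }
  }
  return lo < versions.length ? versions[lo] : null;
}

// Placeholder labels, oldest first; only "522eb4b" appears in this thread.
const commits = ["older-1", "older-2", "older-3", "522eb4b", "newer-1", "newer-2"];

// Stand-in for "build this commit, run the pipeline, check whether the text is lost".
const losesContent = (commit) => commits.indexOf(commit) >= commits.indexOf("522eb4b");

console.log(findFirstBad(commits, losesContent));
```

With six candidates this takes three checks instead of six. The same idea is what `git bisect` automates.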

I then found out that there are no elements with the visibility: hidden style when the page is rendered in Firefox, but all the article text is indeed wrapped in a container element with visibility: hidden in the monolith-generated HTML file. The visibility style must be changed by some JavaScript code at runtime.
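
A rough way to check a saved file for this is to scan its start tags for an inline visibility: hidden declaration. The sketch below is only a heuristic: it uses a regular expression rather than a real HTML parser, it only sees inline `style` attributes (not stylesheet rules), and the function names and sample markup are made up for illustration. It is not how readability itself performs the check.

```javascript
// Return true if an inline style string declares `visibility: hidden`.
function declaresHiddenVisibility(styleText) {
  return styleText
    .split(";")
    .map((declaration) => declaration.split(":"))
    .some(([property, ...valueParts]) =>
      property.trim().toLowerCase() === "visibility" &&
      valueParts.join(":").replace(/!important/i, "").trim().toLowerCase() === "hidden");
}

// Collect start tags whose quoted style attribute hides the element.
// Regex-based and approximate; unquoted attributes are not handled.
function findHiddenStartTags(html) {
  const tagWithStyle = /<([a-zA-Z][\w-]*)\b[^>]*?\sstyle\s*=\s*(["'])(.*?)\2[^>]*>/gs;
  const hits = [];
  for (const match of html.matchAll(tagWithStyle)) {
    if (declaresHiddenVisibility(match[3])) hits.push(match[0]);
  }
  return hits;
}

// Hypothetical markup resembling the situation described above.
const sample =
  '<div id="article" style="visibility: hidden; opacity: 0;">' +
  '<p style="color: red">text</p></div>';

console.log(findHiddenStartTags(sample));
```

If this reports the article container in the monolith output but nothing in the DOM serialized from a live browser, that supports the theory that a script flips the style at runtime.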

kfstorm commented Nov 3, 2024

Problem solved by running a headless browser to render the HTML file and get the rendered HTML content.
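
The comment above doesn't say which headless browser was used. As one possible sketch, Chrome and Chromium can print the DOM of a page from the command line with `--headless --dump-dom`. The helper names, the default `chromium` binary name, and the POSIX-path assumption below are mine, not taken from this thread, and whether the page's scripts have finished by the time the DOM is dumped depends on the page.

```javascript
// Build the arguments for a headless Chrome/Chromium run that prints the DOM.
// Assumes `absolutePath` is an absolute POSIX path without "#" or "?" in it.
function buildDumpDomArgs(absolutePath) {
  const fileUrl = new URL(`file://${absolutePath}`).href;
  return ["--headless", "--dump-dom", fileUrl];
}

// Run the browser and resolve with the serialized DOM as a string.
// `browserBinary` is an assumption; the executable name varies by system.
async function renderToHtml(absolutePath, browserBinary = "chromium") {
  const { execFile } = await import("node:child_process");
  const { promisify } = await import("node:util");
  const run = promisify(execFile);
  const { stdout } = await run(browserBinary, buildDumpDomArgs(absolutePath), {
    maxBuffer: 64 * 1024 * 1024, // monolith output embeds assets, so pages can be large
  });
  return stdout;
}

console.log(buildDumpDomArgs("/tmp/page.html"));
```

The string returned by `renderToHtml` can then be handed to readability in place of the raw monolith output.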

kfstorm closed this as completed Nov 3, 2024