
Can it crawl urls within js? #1

Open
Unkn0wnCreator opened this issue Jun 11, 2024 · 5 comments

Comments

@Unkn0wnCreator

Unkn0wnCreator commented Jun 11, 2024

So, there are websites that load stuff with JS.

Is it possible to save web apps with it? Like Photopea or hexed.it.

@titoBouzout
Member

titoBouzout commented Jun 11, 2024

It doesn't look for URLs in JavaScript files; it only does so in CSS files, by looking for url(...).

The original idea is to save the HTML generated by JavaScript, so it can be indexed by search engines. This will break some applications like the ones you have mentioned. I have just added a spa mode that mostly fixes this problem for simple applications.

mpa https://example.net --spa

In this mode, it will not save the rendered/generated HTML; it will save the original HTML as served by the web server. Web apps also have the issue that some JavaScript modules are loaded only when you click around buttons or links. I just had the idea of fetching from origin any requests not found in the zip; the zip is then updated with the fetched file.
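The fetch-on-miss idea above could be sketched roughly like this; the `Map` stands in for the real zip archive, and all names here are assumptions, not the actual mpa-archive API:

```javascript
// Hedged sketch of fetch-on-miss: serve from the archive when the file
// exists, otherwise fetch it from origin and store the result so the
// archive grows as the app is used.
const archive = new Map() // path -> Buffer

async function serve(origin, path) {
  if (archive.has(path)) return archive.get(path) // hit: serve from the "zip"
  const res = await fetch(origin + path) // miss: fall back to origin
  const body = Buffer.from(await res.arrayBuffer())
  archive.set(path, body) // update the "zip" with the fetched file
  return body
}
```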

So mostly, backing up a simple application requires crawling it and then using it a bit to fetch the missing modules that are only requested when using the app. The spa mode seems to work well with hexed.it, but it doesn't seem to work with Photopea; I haven't really looked into why. Let me know how that goes.

@Unkn0wnCreator
Author

Ah cool, okay, I will try it, but currently I do not have much time, so please do not explicitly wait for my response.

"Web apps also have the issue that some JavaScript modules are loaded only when you click around buttons or links"

Maybe JS files could be crawled too, following anything in them that points to JS, CSS, fonts, etc., but I think this would not work every time either.

Maybe it would be an idea to look into PWAs. Photopea allows installing it as a PWA; could it be possible to grab the installed files? It is just an idea.

@Unkn0wnCreator
Author

Unkn0wnCreator commented Jun 14, 2024

I've been encountering a few problems while using files from the archive on a different web server (Caddy):

1. When using the files from the archive on another web server like Caddy, some scripts, for example /wavacity.com/js/amplitude-8.1.0-min.js, are incorrectly served with the content type image/png. Even hexed.it has this problem.
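On the Caddy side, a request matcher can force the header as a workaround; a hypothetical Caddyfile sketch (the site name and root path are assumptions):

```
example.local {
	root * /srv/wavacity.com
	@js path *.js
	header @js Content-Type text/javascript
	file_server
}
```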

2. Some websites use the integrity attribute and do not work properly. For example, you might see an error like this:

None of the "sha384" hashes in the "integrity" attribute match the content of the subresource. The calculated hash is "OLBgp1GsljhM2TJ+sbHjaiH9txEUvgdDTAzHv2P24donTt6/529l+9Ua0vFImLlb".

It would be helpful if these could be filtered out, possibly even filtering out Google Analytics. Or maybe there could be a way to filter out all external domains, with exceptions like CDNs.

3. The mpa command should support an option like mpa 0.0.0.0 to allow any PC to connect, not just localhost. Because of that, I cannot use this tool in my case: I run mpa in an LXC container and do not want to set up something to proxy it. (Maybe Docker would be an idea; I might make a Dockerfile and send it here.)

4. Wavacity does not run because it cannot detect WebAssembly. But maybe this is a problem with the web server? I don't know.

5. When trying to download Wavacity, this error message is presented:

```
✔ https://wavacity.com/css/wavacity_0.1.35.css
🧽 https://wavacity.com/contact.html
🔗 https://wavacity.com/'/fonts/OpenSans-Light.ttf'
🛑 https://wavacity.com/'/fonts/OpenSans-Light.ttf'
node:buffer:319
throw new ERR_INVALID_ARG_TYPE(
^

TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer, ArrayBuffer, or Array or an Array-like Object. Received undefined
    at Function.from (node:buffer:319:9)
    at writeFile (file:///home/dirtydev/.nvm/versions/node/v22.3.0/lib/node_modules/mpa-archive/src/archive.js:101:19)
    at onFile (file:///home/dirtydev/.nvm/versions/node/v22.3.0/lib/node_modules/mpa-archive/src/archive.js:264:3)
    at fetchURL (file:///home/dirtydev/.nvm/versions/node/v22.3.0/lib/node_modules/mpa-archive/src/archive.js:313:2)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
  code: 'ERR_INVALID_ARG_TYPE'
}

Node.js v22.3.0
```
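The crash happens because Buffer.from(undefined) throws ERR_INVALID_ARG_TYPE. A defensive sketch of the kind of guard that would avoid it (the function name is an assumption, not the actual archive.js code):

```javascript
// Guard against undefined/null bodies before handing them to
// Buffer.from, which throws ERR_INVALID_ARG_TYPE on undefined.
function toBuffer(data) {
  if (data === undefined || data === null) return Buffer.alloc(0) // empty file
  return Buffer.isBuffer(data) ? data : Buffer.from(data)
}
```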


I know your tool was made to serve the archives with mpa itself, but maybe you could document what needs to be done when using another web server, such as which headers are needed. I want to archive some websites/web tools on my local network so that if the original is down, or my internet is down, I can still use them. And it would be cool to use the web server I already run on my local network.


But this is only what I observe and how I use it; if it does not match your vision, please do not change anything :) This is purely what I would like.

@titoBouzout
Member

1. Caddy content type: this seems like a problem to report to Caddy; I just rechecked, and the mpa web server sends the correct content type. Serving the files from a different web server requires a few tweaks, see https://github.com/potahtml/mpa-archive/blob/master/src/server.js#L43-L49; it is not that simple to make the URLs mappable to files. With that tweak it should in theory work the same. The only difference is that the mpa server will fetch a file from origin when the requested file is not found in the zip, and then update the zip with the fetched file. This is done for "buttons" in applications that trigger loading of JS modules that weren't crawled.
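For anyone serving the dump from another web server, the gist of a content-type mapping is an extension lookup; a minimal sketch (this table is an assumption, not mpa's actual list):

```javascript
// Map a URL path to a Content-Type by file extension, falling back
// to a generic binary type for unknown extensions.
const MIME = {
  '.html': 'text/html',
  '.js': 'text/javascript',
  '.css': 'text/css',
  '.png': 'image/png',
  '.wasm': 'application/wasm',
}

function contentType(pathname) {
  const dot = pathname.lastIndexOf('.')
  const ext = dot === -1 ? '' : pathname.slice(dot).toLowerCase()
  return MIME[ext] || 'application/octet-stream'
}
```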

2. Re integrity check: this happens because mpa, when NOT in --spa mode, rewrites the hard-coded URLs found in documents; for example, it replaces https://example.net with /. In --spa mode this rewriting doesn't happen, so the integrity check seems to pass. I am willing to figure out a workaround, but I am not sure what to do. I have attempted before to remove the integrity data, but it seems the check sometimes happens in scripting; maybe we could replace .integrity with .integrityIgnoreMe, but this may also cause a new set of issues. Maybe we can discuss this with examples in a different issue?
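One possible experiment is stripping the integrity attribute while rewriting documents; a hedged regex sketch (a real implementation would use an HTML parser, and this will not cover every markup variation):

```javascript
// Remove integrity="..." attributes from archived HTML so rewritten
// URLs don't fail subresource integrity checks. Regex-based, sketch only.
function stripIntegrity(html) {
  return html.replace(/\sintegrity=("[^"]*"|'[^']*')/g, '')
}
```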

3. 0.0.0.0: I am unsure how binding to 0.0.0.0 works; I just naively changed the IP to 0.0.0.0, and I cannot open it in a browser. Suggestions welcome.

4. SharedArrayBuffer requires two special headers from the web server; I just added them. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/SharedArrayBuffer#security_requirements
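The two headers in question enable cross-origin isolation, per the MDN page linked above; a sketch of setting them on a Node response (the helper itself is an assumption, not the mpa server code):

```javascript
// Cross-origin isolation headers required for SharedArrayBuffer.
const ISOLATION_HEADERS = {
  'Cross-Origin-Opener-Policy': 'same-origin',
  'Cross-Origin-Embedder-Policy': 'require-corp',
}

function withIsolation(res) {
  for (const [name, value] of Object.entries(ISOLATION_HEADERS)) {
    res.setHeader(name, value)
  }
  return res
}
```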

5. url() properties in CSS files may have quotes; I just changed something to remove the quotes. I wasn't able to reproduce the other error, but I have just added a check for undefined.
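The quoting issue is visible in the log above, where url('/fonts/OpenSans-Light.ttf') was crawled with the quotes kept in the path. A hedged sketch of extracting and unquoting url(...) values (the regex is an assumption, not the actual mpa code):

```javascript
// Extract url(...) references from CSS, trimming optional single or
// double quotes so the path resolves correctly.
function cssUrls(css) {
  const urls = []
  for (const m of css.matchAll(/url\(\s*(['"]?)([^'")]+)\1\s*\)/g)) {
    urls.push(m[2])
  }
  return urls
}
```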

When crawled in --spa mode, the app now kind of works, but there are a bunch of errors in the console that I do not have time to investigate right now. I do not see any request being canceled or erroring. Maybe you can suggest what's wrong.

@titoBouzout
Member

3, 4 and 5 are solved. It now listens on 0.0.0.0, and the random port is seeded from the zip file name instead of the path, to make it more predictable.

2 (the integrity check) may be worth an investigation.
