
Can it crawl urls within js? #1

Open
Unkn0wnCreator opened this issue Jun 11, 2024 · 5 comments

Comments

@Unkn0wnCreator

Unkn0wnCreator commented Jun 11, 2024

So, there are websites that load stuff with JS.

Is it possible to save web apps with it? Like Photopea or hexed.it.

@titoBouzout
Member

titoBouzout commented Jun 11, 2024

It doesn't look for URLs in JavaScript files; it only does so in CSS files, by looking for url(...).

The original idea is to save the HTML generated by JavaScript, so it can be indexed by search engines. This will break some applications like the ones you have mentioned. I have just added a spa mode that mostly fixes this problem for simple applications.

mpa https://example.net --spa

In this mode, it will not save the rendered/generated HTML; it will save the original HTML as served by the web server. Web apps also have the issue that some JavaScript modules are loaded only when you click around buttons or links. I just had the idea of fetching from origin any requests not found in the zip; the zip is then updated with the fetched file.
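The fetch-on-miss idea above could be sketched roughly like this; the `Map` stands in for the real zip archive, and all names here are assumptions, not the actual mpa-archive API:

```javascript
// Hedged sketch of fetch-on-miss: serve from the archive when the file
// exists, otherwise fetch it from origin and store the result so the
// archive grows as the app is used.
const archive = new Map() // path -> Buffer

async function serve(origin, path) {
  if (archive.has(path)) return archive.get(path) // hit: serve from the "zip"
  const res = await fetch(origin + path) // miss: fall back to origin
  const body = Buffer.from(await res.arrayBuffer())
  archive.set(path, body) // update the "zip" with the fetched file
  return body
}
```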

So mostly, backing up a simple application requires crawling it and then using it a bit to fetch the missing modules that are only requested when using the app. The spa mode seems to work well with hexed.it, but it doesn't seem to work with Photopea; I haven't really looked into why. Let me know how that goes.

@Unkn0wnCreator
Author

Ah cool, okay, I will try it, but currently I do not have much time, so please do not explicitly wait for my response.

"Web apps also have the issue that some JavaScript modules are loaded only when you click around buttons or links"

Maybe JS files could be crawled too, following anything in them that points to JS, CSS, fonts, etc., but I think this would not work every time either.

Maybe it would be an idea to look into PWAs. Photopea allows installing it as a PWA; could it be possible to grab the installed files? It is just an idea.

@Unkn0wnCreator
Author

Unkn0wnCreator commented Jun 14, 2024

I've been encountering a few problems while using files from the archive on a different web server (Caddy):

1. When using the files from the archive on another web server like Caddy, some scripts, for example /wavacity.com/js/amplitude-8.1.0-min.js, are incorrectly served with the content type image/png. Even hexed.it has this problem.
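On the Caddy side, a request matcher can force the header as a workaround; a hypothetical Caddyfile sketch (the site name and root path are assumptions):

```
example.local {
	root * /srv/wavacity.com
	@js path *.js
	header @js Content-Type text/javascript
	file_server
}
```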

2. Some websites use the integrity attribute and do not work properly. For example, you might see an error like this:

None of the "sha384" hashes in the "integrity" attribute match the content of the subresource. The calculated hash is "OLBgp1GsljhM2TJ+sbHjaiH9txEUvgdDTAzHv2P24donTt6/529l+9Ua0vFImLlb".

It would be helpful if these could be filtered out, possibly even filtering out Google Analytics. Or maybe there could be a way to filter out all external domains, with exceptions like CDNs.

3. The mpa command should support an option like mpa 0.0.0.0 to allow any PC to connect, not just localhost. Because of that, I cannot use this tool in my case: I run mpa in an LXC container and do not want to set up something to proxy it. (Maybe Docker would be an idea; I might make a Dockerfile and send it here.)

4. Wavacity does not run because it cannot detect WebAssembly. But maybe this is a problem with the web server? I don't know.

5. When trying to download Wavacity, this error message is presented:

```
✔ https://wavacity.com/css/wavacity_0.1.35.css
🧽 https://wavacity.com/contact.html
🔗 https://wavacity.com/'/fonts/OpenSans-Light.ttf'
🛑 https://wavacity.com/'/fonts/OpenSans-Light.ttf'
node:buffer:319
throw new ERR_INVALID_ARG_TYPE(
^

TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer, ArrayBuffer, or Array or an Array-like Object. Received undefined
    at Function.from (node:buffer:319:9)
    at writeFile (file:///home/dirtydev/.nvm/versions/node/v22.3.0/lib/node_modules/mpa-archive/src/archive.js:101:19)
    at onFile (file:///home/dirtydev/.nvm/versions/node/v22.3.0/lib/node_modules/mpa-archive/src/archive.js:264:3)
    at fetchURL (file:///home/dirtydev/.nvm/versions/node/v22.3.0/lib/node_modules/mpa-archive/src/archive.js:313:2)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
  code: 'ERR_INVALID_ARG_TYPE'
}

Node.js v22.3.0
```
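The crash happens because Buffer.from(undefined) throws ERR_INVALID_ARG_TYPE. A defensive sketch of the kind of guard that would avoid it (the function name is an assumption, not the actual archive.js code):

```javascript
// Guard against undefined/null bodies before handing them to
// Buffer.from, which throws ERR_INVALID_ARG_TYPE on undefined.
function toBuffer(data) {
  if (data === undefined || data === null) return Buffer.alloc(0) // empty file
  return Buffer.isBuffer(data) ? data : Buffer.from(data)
}
```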


I know your tool was made to serve the archives with mpa itself, but maybe you could document what needs to be done when using another web server, such as which headers are needed. I want to archive some websites/web tools on my local network so that if the original is down, or my internet is down, I can still use them. And it would be cool to use the web server I already run on my local network.


But this is only what I observe and how I use it; if it does not match your vision, please do not change anything :) This is purely what I would like.

@titoBouzout
Member

1. Caddy content type: this seems like a problem to report to Caddy; I just rechecked, and the mpa web server sends the correct content type. Serving the files from a different web server requires a few tweaks, see https://github.com/potahtml/mpa-archive/blob/master/src/server.js#L43-L49; it is not that simple to make the URLs mappable to files. With that tweak it should in theory work the same. The only difference is that the mpa server will fetch a file from origin when the requested file is not found in the zip, and then update the zip with the fetched file. This is done for "buttons" in applications that trigger loading of JS modules that weren't crawled.
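For anyone serving the dump from another web server, the gist of a content-type mapping is an extension lookup; a minimal sketch (this table is an assumption, not mpa's actual list):

```javascript
// Map a URL path to a Content-Type by file extension, falling back
// to a generic binary type for unknown extensions.
const MIME = {
  '.html': 'text/html',
  '.js': 'text/javascript',
  '.css': 'text/css',
  '.png': 'image/png',
  '.wasm': 'application/wasm',
}

function contentType(pathname) {
  const dot = pathname.lastIndexOf('.')
  const ext = dot === -1 ? '' : pathname.slice(dot).toLowerCase()
  return MIME[ext] || 'application/octet-stream'
}
```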

2. Re integrity check: this happens because mpa, when NOT in --spa mode, rewrites the hard-coded URLs found in documents; for example, it replaces https://example.net with /. In --spa mode this rewriting doesn't happen, so the integrity check seems to pass. I am willing to figure out a workaround, but I am not sure what to do. I have attempted before to remove the integrity data, but it seems the check sometimes happens in scripting; maybe we could replace .integrity with .integrityIgnoreMe, but this may also cause a new set of issues. Maybe we can discuss this with examples in a different issue?
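One possible experiment is stripping the integrity attribute while rewriting documents; a hedged regex sketch (a real implementation would use an HTML parser, and this will not cover every markup variation):

```javascript
// Remove integrity="..." attributes from archived HTML so rewritten
// URLs don't fail subresource integrity checks. Regex-based, sketch only.
function stripIntegrity(html) {
  return html.replace(/\sintegrity=("[^"]*"|'[^']*')/g, '')
}
```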

3. 0.0.0.0: I am unsure how binding to 0.0.0.0 works; I just naively changed the IP to 0.0.0.0, and I cannot open it in a browser. Suggestions welcome.

4. SharedArrayBuffer requires two special headers from the web server; I just added them. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/SharedArrayBuffer#security_requirements
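The two headers in question enable cross-origin isolation, per the MDN page linked above; a sketch of setting them on a Node response (the helper itself is an assumption, not the mpa server code):

```javascript
// Cross-origin isolation headers required for SharedArrayBuffer.
const ISOLATION_HEADERS = {
  'Cross-Origin-Opener-Policy': 'same-origin',
  'Cross-Origin-Embedder-Policy': 'require-corp',
}

function withIsolation(res) {
  for (const [name, value] of Object.entries(ISOLATION_HEADERS)) {
    res.setHeader(name, value)
  }
  return res
}
```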

5. url() properties in CSS files may have quotes; I just changed something to remove the quotes. I wasn't able to reproduce the other error, but I have just added a check for undefined.
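The quoting issue is visible in the log above, where url('/fonts/OpenSans-Light.ttf') was crawled with the quotes kept in the path. A hedged sketch of extracting and unquoting url(...) values (the regex is an assumption, not the actual mpa code):

```javascript
// Extract url(...) references from CSS, trimming optional single or
// double quotes so the path resolves correctly.
function cssUrls(css) {
  const urls = []
  for (const m of css.matchAll(/url\(\s*(['"]?)([^'")]+)\1\s*\)/g)) {
    urls.push(m[2])
  }
  return urls
}
```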

When crawled in --spa mode, the app now kind of works, but there are a bunch of errors in the console that I do not have time to investigate right now. I do not see any request being canceled or erroring. Maybe you can suggest what's wrong.

@titoBouzout
Member

3, 4 and 5 are solved. It now listens on 0.0.0.0, and the random port is seeded from the zip file name instead of the path, to make it more predictable.

2 (the integrity check) may be worth an investigation.
