I am filing this as a separate issue because, while it is related to the linked issue, this one concerns the currently implemented root-relative-only traversal mechanism.
Python web server for testing:

`malweb.py`:

```python
from flask import Flask

app = Flask(__name__)

@app.route("/", methods=["GET"])
def root():
    return """
    <a href="/catch_all_root_relative/1/">Go deeper</a>
    """

@app.route("/catch_all_root_relative/<path:text>", methods=["GET"])
def catch_all_root_relative(text=None):
    count = sum([int(x) for x in text if x == "1"]) + 1
    text = "/catch_all_root_relative" + ("/1" * count) + "/"
    return f"""
    <a href="{text}">Go deeper</a>
    """
```

Run with:

```shell
python3 -m flask --app malweb run
```
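Every fetch of the trap page yields a link one `/1` segment longer than the page's own path, so a naive crawler never runs out of "new" URLs. A minimal sketch of that growth (a standalone simulation mirroring `malweb.py`, not part of spider):

```rust
// Simulate malweb.py's link generation: count the "1" characters in the
// current path and emit a path containing one more "/1" segment.
fn next_link(path: &str) -> String {
    let count = path.matches('1').count() + 1;
    format!("/catch_all_root_relative{}/", "/1".repeat(count))
}

fn main() {
    // Each iteration produces a strictly longer, never-before-seen URL,
    // so a visited-set alone cannot terminate the crawl.
    let mut path = String::from("/catch_all_root_relative/1/");
    for _ in 0..3 {
        path = next_link(&path);
        println!("{path}");
    }
}
```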
Spider:

```rust
extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("http://127.0.0.1:5000/")
        .build()
        .unwrap();
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            println!("{:?}", res.get_url());
        }
    });

    website.crawl_smart().await;
    println!("Links found {:?}", website.get_size().await);
}
```
This infinite-recursion problem affects both root-relative and base-relative URLs. Spider should handle it by tracking link depth and stopping once a configured limit is reached, and ideally by detecting this trap behavior directly, rather than relying on the current depth mechanism that only counts path segments.
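One cheap heuristic for detecting this kind of trap, independent of an absolute depth limit, is to reject URLs whose path repeats the same segment too many times. A hypothetical sketch (the function name and threshold are illustrative, not part of spider's API):

```rust
use std::collections::HashMap;

// Hypothetical helper: flag URL paths that repeat any single segment
// more than `max_repeats` times, a heuristic for self-referential
// "go deeper" traps like the one served by malweb.py.
fn exceeds_segment_repeats(url_path: &str, max_repeats: usize) -> bool {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for segment in url_path.split('/').filter(|s| !s.is_empty()) {
        let c = counts.entry(segment).or_insert(0);
        *c += 1;
        if *c > max_repeats {
            return true;
        }
    }
    false
}

fn main() {
    // A shallow trap URL passes; a deeply repeated one is flagged.
    assert!(!exceeds_segment_repeats("/catch_all_root_relative/1/", 3));
    assert!(exceeds_segment_repeats("/catch_all_root_relative/1/1/1/1/", 3));
    println!("repeat-segment check ok");
}
```

A check like this would run before enqueueing each discovered URL, so legitimate deep paths with distinct segments are unaffected while degenerate repeating paths are cut off early.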