Root relative path infinite loop #247

@phughesion

I am opening a separate issue for this because, while it is related to this issue, it concerns the currently implemented root-relative-only traversal mechanism.

Python web server for testing:
malweb.py:

from flask import Flask

app = Flask(__name__)

@app.route("/", methods=["GET"])
def root():
    # Entry page: a single root-relative link into the trap.
    return """
        <a href="/catch_all_root_relative/1/">Go deeper</a>
    """

@app.route("/catch_all_root_relative/<path:text>", methods=["GET"])
def catch_all_root_relative(text=None):
    # Count the "1" segments in the requested path and link one level deeper,
    # so every response points at a URL that has not been requested before.
    count = text.count("1") + 1
    text = "/catch_all_root_relative" + ("/1" * count) + "/"
    return f"""
        <a href="{text}">Go deeper</a>
    """

python3 -m flask --app malweb run
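
To see why this server traps a crawler, the link computation can be reproduced without Flask. The sketch below is only an illustration: `next_href` and `follow` are hypothetical helper names that mirror the route logic above, assuming Flask passes everything after `/catch_all_root_relative/` as `text`.

```python
PREFIX = "/catch_all_root_relative/"

def next_href(text):
    """Mirror the link computed by catch_all_root_relative (no Flask needed)."""
    count = text.count("1") + 1
    return "/catch_all_root_relative" + ("/1" * count) + "/"

def follow(start, hops):
    """Follow the single link on each page `hops` times, starting at `start`."""
    urls = [start]
    for _ in range(hops):
        text = urls[-1][len(PREFIX):]  # the part Flask would bind to <path:text>
        urls.append(next_href(text))
    return urls

chain = follow("/catch_all_root_relative/1/", 3)
for url in chain:
    print(url)
```

Every URL in the chain is new, so a visited-URL set alone never ends the crawl.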

Spider:

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("http://127.0.0.1:5000/")
        .build()
        .unwrap();
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            println!("{:?}", res.get_url());
        }
    });

    website.crawl_smart().await;

    println!("Links found {:?}", website.get_size().await);
}

Infinite recursion affects both root-relative and base-relative URLs. Spider should handle it by tracking the link depth of each page and stopping the crawl once the configured depth limit is reached. This behavior should be detected explicitly, rather than relying on the current depth check, which is based only on the number of path segments.
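
As a rough illustration of the link-depth idea, here is a minimal sketch in Python. It is not Spider's implementation or API: `crawl_with_hop_limit`, `get_links`, `max_depth`, and `trap_links` are hypothetical names, and `trap_links` only simulates the test server above instead of making HTTP requests. Depth here is assumed to mean the number of links followed from the start URL, independent of how many path segments a URL has.

```python
from collections import deque

def crawl_with_hop_limit(start, get_links, max_depth):
    """Breadth-first crawl that stops following links after max_depth hops."""
    visited = {start}
    queue = deque([(start, 0)])  # (url, hops from the start URL)
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: record the page, do not expand it
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order

def trap_links(url):
    """Hypothetical stand-in for fetching a page from the test server."""
    prefix = "/catch_all_root_relative/"
    if url == "/":
        return [prefix + "1/"]
    count = url[len(prefix):].count("1") + 1
    return ["/catch_all_root_relative" + ("/1" * count) + "/"]

pages = crawl_with_hop_limit("/", trap_links, max_depth=3)
print(len(pages), pages[-1])
```

With a limit of 3, the simulated crawl visits the root plus three trap pages and then terminates, even though each page still advertises a deeper link.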
