Description
When URLs are filtered out during a scan by url_filter_not_same_domain() and url_filter_prevent_intersections(), they're logged in a way that makes it unclear which URL is the base_url and which is the filtered URL:
[{2025-09-01T09:26:08+1200} INFO filters.py : 160 ] url_filter_not_same_domain url out due to domain mismatch https://www.msd.govt.nz/ https://findajob.msd.govt.nz/ ThreadPoolExecutor-0_3
[{2025-09-01T09:26:08+1200} INFO filters.py : 160 ] url_filter_not_same_domain url out due to domain mismatch https://www.workandincome.govt.nz https://findajob.msd.govt.nz/ ThreadPoolExecutor-0_3
[{2025-09-01T09:26:32+1200} INFO crawler.py : 188 ] url_filter_prevent_intersections URL filtered out due to not starting with base_url https://www.familyservices.govt.nz/directory/ https://www.familyservices.govt.nz/directory-help/ ThreadPoolExecutor-0_14
[{2025-09-29T20:11:28+1300} INFO crawler.py : 207 ] url_filter_prevent_intersections URL filtered out due to being within the scope of another base_url https://www.companiesoffice.govt.nz/all-registers/contributory-mortgage-brokers/ https://www.companiesoffice.govt.nz/all-registers/contributory-mortgage-brokers/current-registrations/ ThreadPoolExecutor-0_15
Consider prefacing the base_url with "base_url: " and the filtered URL with "filtered_url: ".
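
A minimal sketch of what the labelled output could look like. The helper names `format_filter_message` and `log_filtered_url`, the logger name, and the example URLs are all hypothetical and not taken from the project's code; the existing filters may build their messages differently.

```python
import logging

# Hypothetical logger name; the project's own logger setup may differ.
logger = logging.getLogger("crawler")


def format_filter_message(filter_name: str, reason: str,
                          base_url: str, filtered_url: str) -> str:
    """Build a log message that labels each URL explicitly."""
    return (
        f"{filter_name} {reason} "
        f"base_url: {base_url} filtered_url: {filtered_url}"
    )


def log_filtered_url(filter_name: str, reason: str,
                     base_url: str, filtered_url: str) -> None:
    """Log a filtered URL at INFO level with both URLs labelled."""
    logger.info(format_filter_message(filter_name, reason, base_url, filtered_url))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Placeholder URLs for illustration only.
    log_filtered_url(
        "url_filter_not_same_domain",
        "URL filtered out due to domain mismatch",
        base_url="https://a.example/",
        filtered_url="https://b.example/",
    )
```

Passing the URLs as keyword arguments at the call site also makes it harder to swap them by accident.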