distinguish base_url when logging url_filter_not_same_domain() and url_filter_prevent_intersections() #164

@a11ya11y

Description

When URLs are filtered out during a scan by url_filter_not_same_domain() and url_filter_prevent_intersections(), they're logged in a way that makes it unclear which URL is the base_url:

[{2025-09-01T09:26:08+1200} INFO filters.py : 160 ] url_filter_not_same_domain url out due to domain mismatch https://www.msd.govt.nz/ https://findajob.msd.govt.nz/ ThreadPoolExecutor-0_3
[{2025-09-01T09:26:08+1200} INFO filters.py : 160 ] url_filter_not_same_domain url out due to domain mismatch https://www.workandincome.govt.nz https://findajob.msd.govt.nz/ ThreadPoolExecutor-0_3
[{2025-09-01T09:26:32+1200} INFO crawler.py : 188 ] url_filter_prevent_intersections URL filtered out due to not starting with base_url https://www.familyservices.govt.nz/directory/ https://www.familyservices.govt.nz/directory-help/ ThreadPoolExecutor-0_14
[{2025-09-29T20:11:28+1300} INFO crawler.py : 207 ] url_filter_prevent_intersections URL filtered out due to being within the scope of another base_url https://www.companiesoffice.govt.nz/all-registers/contributory-mortgage-brokers/ https://www.companiesoffice.govt.nz/all-registers/contributory-mortgage-brokers/current-registrations/ ThreadPoolExecutor-0_15

Consider prefacing the base_url with "base_url: " and the filtered URL with "filtered_url: ".
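
A minimal sketch of what the labelled output could look like, assuming the filter has both URLs in hand when it logs. The helper `format_filter_log`, the logger name, the host comparison, and the example.org URLs are all hypothetical and not taken from filters.py or crawler.py:

```python
import logging
from urllib.parse import urlparse

# Hypothetical logger name; the project's real logger setup is not shown in this issue.
logger = logging.getLogger("filters_sketch")


def format_filter_log(filter_name: str, reason: str, base_url: str, filtered_url: str) -> str:
    """Build a message in which each URL carries an explicit label."""
    return f"{filter_name} {reason} base_url: {base_url} filtered_url: {filtered_url}"


def url_filter_not_same_domain(base_url: str, url: str) -> bool:
    """Stand-in for the real filter (assumed behaviour): keep url only if its host matches base_url's."""
    if urlparse(url).netloc == urlparse(base_url).netloc:
        return True
    logger.info(
        format_filter_log(
            "url_filter_not_same_domain",
            "url out due to domain mismatch",
            base_url,
            url,
        )
    )
    return False


# Example with made-up URLs:
message = format_filter_log(
    "url_filter_not_same_domain",
    "url out due to domain mismatch",
    "https://a.example.org/",
    "https://b.example.org/",
)
```

With this shape, the message reads `... base_url: https://a.example.org/ filtered_url: https://b.example.org/`, so the role of each URL no longer depends on argument order. The same helper could be reused for both messages logged by url_filter_prevent_intersections().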
