
Wayback upgrade #2909

Draft
liquidsec wants to merge 29 commits into 3.0 from wayback-upgrade

Conversation

@liquidsec
Collaborator

TBA

@liquidsec liquidsec marked this pull request as draft February 19, 2026 03:01
assert "archive_url" in finding.data, (
f"Hunt FINDING should have archive_url for provenance, got: {finding.data}"
)
assert "web.archive.org" in finding.data["archive_url"], (

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization High test

The string web.archive.org may be at an arbitrary position in the sanitized URL.

Copilot Autofix

AI 17 days ago

In general, the way to fix incomplete URL substring sanitization is to parse the URL using a standard library, extract the hostname, and then compare that hostname (or a suffix of it) to the expected allowed host, instead of checking for a substring in the raw URL string.

In this specific case, we should change the assertion that currently does assert "web.archive.org" in finding.data["archive_url"] so that it parses archive_url with urllib.parse.urlparse, extracts .hostname, and asserts that the hostname is exactly web.archive.org. This preserves the intended functionality (“archive_url should be archive.org URL”) while avoiding arbitrary substring matches. Concretely, within TestWaybackParameters.check, around lines 309–315, we will introduce a local variable such as archive_url_host = urlparse(finding.data["archive_url"]).hostname and assert archive_url_host == "web.archive.org". To do this, we must import urlparse from urllib.parse at the top of the test file, alongside the existing unquote import. No other behavior in the tests needs to change.
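To make the difference concrete, here is a minimal sketch comparing the substring check with the hostname check. The helper names and the example URLs (`evil.example` and so on) are made up for illustration and are not part of the test file.

```python
from urllib.parse import urlparse

EXPECTED_HOST = "web.archive.org"


def substring_check(url: str) -> bool:
    # The original style of check: passes if the text appears anywhere in the URL.
    return EXPECTED_HOST in url


def hostname_check(url: str) -> bool:
    # The suggested style of check: parse first, then compare the hostname exactly.
    return urlparse(url).hostname == EXPECTED_HOST


# Illustrative URLs only (assumptions, not taken from the PR).
genuine = "https://web.archive.org/web/2020/http://example.com/"
host_spoof = "https://web.archive.org.evil.example/page"
path_spoof = "https://evil.example/web.archive.org/page"

for url in (genuine, host_spoof, path_spoof):
    print(url, substring_check(url), hostname_check(url))
```

All three URLs pass the substring check, but only the genuine one passes the hostname comparison.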

Suggested changeset 1
bbot/test/test_step_2/module_tests/test_module_wayback.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/bbot/test/test_step_2/module_tests/test_module_wayback.py b/bbot/test/test_step_2/module_tests/test_module_wayback.py
--- a/bbot/test/test_step_2/module_tests/test_module_wayback.py
+++ b/bbot/test/test_step_2/module_tests/test_module_wayback.py
@@ -1,5 +1,5 @@
 import re
-from urllib.parse import unquote
+from urllib.parse import unquote, urlparse
 
 from werkzeug.wrappers import Response
 
@@ -310,8 +310,10 @@
             assert "archive_url" in finding.data, (
                 f"Hunt FINDING should have archive_url for provenance, got: {finding.data}"
             )
-            assert "web.archive.org" in finding.data["archive_url"], (
-                f"Hunt FINDING archive_url should be archive.org URL, got: {finding.data['archive_url']}"
+            archive_url_host = urlparse(finding.data["archive_url"]).hostname
+            assert archive_url_host == "web.archive.org", (
+                f"Hunt FINDING archive_url should be archive.org URL, got host: {archive_url_host}, "
+                f"full URL: {finding.data['archive_url']}"
             )
 
         # WEB_PARAMETERs from archived content should also have archive_url
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
Collaborator Author

bro its a draft step off

@github-actions
Contributor

github-actions bot commented Feb 19, 2026

📊 Performance Benchmark Report

Comparing 3.0 (baseline) vs wayback-upgrade (current)

📈 Detailed Results (All Benchmarks)

📋 Complete results for all benchmarks - includes both significant and insignificant changes

| 🧪 Test Name | 📏 Base | 📏 Current | 📈 Change | 🎯 Status |
|---|---|---|---|---|
| Bloom Filter Dns Mutation Tracking Performance | 4.28ms | 4.27ms | -0.4% | |
| Bloom Filter Large Scale Dns Brute Force | 17.73ms | 17.54ms | -1.1% | |
| Large Closest Match Lookup | 361.27ms | 346.04ms | -4.2% | |
| Realistic Closest Match Workload | 192.35ms | 191.43ms | -0.5% | |
| Event Memory Medium Scan | 1770 B/event | 1774 B/event | +0.2% | |
| Event Memory Large Scan | 1757 B/event | 1757 B/event | +0.0% | |
| Event Validation Full Scan Startup Small Batch | 483.46ms | 488.10ms | +1.0% | |
| Event Validation Full Scan Startup Large Batch | 762.02ms | 766.36ms | +0.6% | |
| Make Event Autodetection Small | 30.71ms | 30.22ms | -1.6% | |
| Make Event Autodetection Large | 314.35ms | 311.22ms | -1.0% | |
| Make Event Explicit Types | 13.80ms | 13.74ms | -0.4% | |
| Excavate Single Thread Small | 4.043s | 4.029s | -0.3% | |
| Excavate Single Thread Large | 9.539s | 9.748s | +2.2% | |
| Excavate Parallel Tasks Small | 4.155s | 4.148s | -0.2% | |
| Excavate Parallel Tasks Large | 7.261s | 7.321s | +0.8% | |
| Is Ip Performance | 3.19ms | 3.17ms | -0.6% | |
| Make Ip Type Performance | 11.35ms | 11.28ms | -0.6% | |
| Mixed Ip Operations | 4.51ms | 4.47ms | -0.8% | |
| Typical Queue Shuffle | 62.80µs | 60.65µs | -3.4% | |
| Priority Queue Shuffle | 699.81µs | 697.49µs | -0.3% | |

🎯 Performance Summary

No significant performance changes detected (all changes <10%)
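For reference, the Change column can be reproduced with a relative-difference calculation. The sketch below is an assumption about how such a report is typically computed, not the benchmark workflow's actual code; the function names are hypothetical, and the 10% threshold is taken from the summary line above.

```python
def percent_change(base: float, current: float) -> float:
    # Relative change of the current measurement against the baseline, in percent.
    return (current - base) / base * 100.0


def is_significant(base: float, current: float, threshold_pct: float = 10.0) -> bool:
    # Assumed rule: a change counts as significant once it reaches the threshold.
    return abs(percent_change(base, current)) >= threshold_pct


# Two rows from the report: Large Closest Match Lookup (ms) and
# Excavate Single Thread Large (s). Units cancel, so they need not match.
rows = {
    "Large Closest Match Lookup": (361.27, 346.04),
    "Excavate Single Thread Large": (9.539, 9.748),
}

for name, (base, current) in rows.items():
    print(f"{name}: {percent_change(base, current):+.1f}%",
          "significant" if is_significant(base, current) else "not significant")
```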


🐍 Python Version 3.11.14

@liquidsec liquidsec changed the base branch from dev to 3.0 February 28, 2026 18:31
- wayback: override _incoming_dedup_hash for URL events to prevent
  subdomain_enum's domain-based dedup from collapsing distinct URLs
- wayback: fix FINDING confidence "MODERATE" -> "MEDIUM" (valid level)
- wayback: use individual requests instead of request_batch for
  interesting file HEAD checks
- subdomain_enum: revert is_target exemption from wildcard rejection
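The dedup problem described in the first bullet can be illustrated with a standalone sketch. This is not BBOT's `_incoming_dedup_hash` implementation; the key functions, the `dedup` helper, and the example URLs are all hypothetical and only show why a domain-based key collapses distinct URLs while a URL-based key does not.

```python
from urllib.parse import urlparse


def domain_key(url: str) -> str:
    # Hypothetical domain-based dedup key: every URL on a host shares one key.
    return urlparse(url).hostname or ""


def url_key(url: str) -> str:
    # Hypothetical URL-based dedup key: each distinct URL gets its own key.
    return url


def dedup(urls, key_fn):
    # Keep the first item seen for each key, preserving input order.
    seen = set()
    kept = []
    for url in urls:
        key = key_fn(url)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept


# Illustrative URLs only (assumptions, not taken from the PR).
urls = [
    "http://example.com/login.php?id=1",
    "http://example.com/admin/backup.zip",
    "http://example.com/api/v1/users",
]

print(len(dedup(urls, domain_key)))  # distinct URLs collapse to one per host
print(len(dedup(urls, url_key)))     # all distinct URLs survive
```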
@codecov

codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 86.64773% with 94 lines in your changes missing coverage. Please review.
✅ Project coverage is 91%. Comparing base (b90f1d3) to head (9b12e26).
⚠️ Report is 10 commits behind head on 3.0.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| bbot/modules/wayback.py | 76% | 87 Missing ⚠️ |
| bbot/test/test_step_1/test_helpers.py | 77% | 3 Missing ⚠️ |
| bbot/modules/httpx.py | 60% | 2 Missing ⚠️ |
| bbot/core/helpers/web/engine.py | 75% | 1 Missing ⚠️ |
| ...st/test_step_2/module_tests/test_module_wayback.py | 100% | 1 Missing ⚠️ |
Additional details and impacted files
@@          Coverage Diff           @@
##             3.0   #2909    +/-   ##
======================================
- Coverage     92%     91%    -0%     
======================================
  Files        436     436            
  Lines      36320   37004   +684     
======================================
+ Hits       33059   33638   +579     
- Misses      3261    3366   +105     
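As a sanity check, the figures in the diff above can be recomputed from the raw line counts. This is a minimal sketch using only the numbers shown in the report; the variable and function names are illustrative, and how Codecov rounds the displayed whole percentages is not stated here.

```python
def coverage_pct(hits: int, lines: int) -> float:
    # Coverage is the share of tracked lines that were executed, in percent.
    return hits / lines * 100.0


# Line counts copied from the coverage diff (base = 3.0, head = #2909).
base = {"lines": 36320, "hits": 33059, "misses": 3261}
head = {"lines": 37004, "hits": 33638, "misses": 3366}

# Per-row deltas, matching the right-hand column of the diff.
deltas = {key: head[key] - base[key] for key in base}

print(deltas)
print(f"base coverage: {coverage_pct(base['hits'], base['lines']):.2f}%")
print(f"head coverage: {coverage_pct(head['hits'], head['lines']):.2f}%")
```

In both columns hits plus misses equals the line total, so the table is internally consistent.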
