Implement deadlink checker for vald web #2643

Merged
kpango merged 7 commits into main from impl/hack/implement-deadlink-checker
Sep 30, 2024

Conversation

@vankichi vankichi commented Sep 25, 2024

Description

I have implemented a new command-line tool that checks for dead links in the target files.
I have also added a make target for it: make deadlink-checker.

Usage:

# make deadlink-checker DEADLINK_CHECK_PATH=<path> DEADLINK_IGNORE_PATH=<ignore_path> DEADLINK_CHECK_FORMAT=<format>

make deadlink-checker DEADLINK_CHECK_PATH=./ DEADLINK_IGNORE_PATH=v[0-9]+ DEADLINK_CHECK_FORMAT=html

Related Issue

Versions

  • Vald Version: v1.7.13
  • Go Version: v1.23.1
  • Rust Version: v1.81.0
  • Docker Version: v27.2.1
  • Kubernetes Version: v1.31.0
  • Helm Version: v3.16.0
  • NGT Version: v2.2.4
  • Faiss Version: v1.8.0

Checklist

Special notes for your reviewer

Summary by CodeRabbit

  • New Features
    • Introduced a command-line tool for validating links within HTML files, enhancing link management and ensuring content integrity.
    • Added a new deadlink-checker target in the build process for easier link validation.
    • New variables for dead link checking configurations have been added to streamline the process.
  • Bug Fixes
    • Corrected syntax in the usearch/install target for better build accuracy.

cloudflare-workers-and-pages bot commented Sep 25, 2024

Deploying vald with Cloudflare Pages

Latest commit: c0f1802
Status: ✅  Deploy successful!
Preview URL: https://6e9c5ac2.vald.pages.dev
Branch Preview URL: https://impl-hack-implement-deadlink.vald.pages.dev


coderabbitai bot commented Sep 25, 2024

📝 Walkthrough

The changes introduce a command-line tool for validating links within HTML files, implemented in hack/tools/deadlink/main.go. This tool identifies various link types, collects file paths, constructs URLs, checks against a blacklist, and verifies link statuses through HTTP requests. It processes files concurrently, logging results and summarizing the outcomes of the link checks. Additionally, modifications to the Makefile facilitate the integration of this tool, including new variables and a dedicated target for running the link checker.
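The concurrent status check described in the walkthrough could look roughly like the sketch below. It is not the code from hack/tools/deadlink/main.go: the names `result`, `checkLinks`, `countDead`, and `demoResults` are hypothetical, the worker limit is an assumed parameter, and a local `httptest` server stands in for the real site so the example runs without network access.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// result records the outcome of checking one URL (hypothetical type).
type result struct {
	url    string
	status int   // HTTP status code; 0 if the request itself failed
	err    error // transport-level error, if any
}

// checkLinks requests every URL concurrently, with at most `workers`
// requests in flight, and returns the results in input order.
func checkLinks(client *http.Client, urls []string, workers int) []result {
	if workers < 1 {
		workers = 1
	}
	results := make([]result, len(urls))
	sem := make(chan struct{}, workers) // counting semaphore
	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		go func(i int, u string) {
			defer wg.Done()
			sem <- struct{}{}
			defer func() { <-sem }()
			res := result{url: u}
			resp, err := client.Get(u)
			if err != nil {
				res.err = err
			} else {
				res.status = resp.StatusCode
				resp.Body.Close()
			}
			results[i] = res // each goroutine writes only its own slot
		}(i, u)
	}
	wg.Wait()
	return results
}

// countDead counts results that failed outright or returned status >= 400.
func countDead(rs []result) int {
	n := 0
	for _, r := range rs {
		if r.err != nil || r.status >= 400 {
			n++
		}
	}
	return n
}

// demoResults checks one live and one missing path on a local test server.
func demoResults() []result {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/missing" {
			http.NotFound(w, r)
			return
		}
		fmt.Fprintln(w, "ok")
	}))
	defer srv.Close()
	client := &http.Client{Timeout: 5 * time.Second}
	return checkLinks(client, []string{srv.URL + "/", srv.URL + "/missing"}, 4)
}

func main() {
	rs := demoResults()
	for _, r := range rs {
		fmt.Println(r.url, r.status, r.err)
	}
	fmt.Println("dead links:", countDead(rs))
}
```

Writing each result into its own slice slot avoids a mutex, and the buffered channel caps the number of simultaneous requests so a large site is not flooded.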

Changes

File | Change Summary
Makefile | Adds new variables for dead link checking and a new target deadlink-checker, updates existing target syntax.
Makefile.d/functions.mk | Introduces a new function gen-deadlink-checker for building and executing the dead link checker binary.
hack/tools/deadlink/main.go | Implements a command-line tool for checking link validity in HTML files, including URL extraction, blacklist checking, and concurrent HTTP requests.

Possibly related PRs

  • add reviewer guideline #2507: The changes in the main PR introduce a new function gen-deadlink-checker and related variables in the Makefile, which may require reviewers to understand the context of these changes, making the reviewer guidelines relevant for assessing the new functionality.

Suggested labels

type/documentation, size/M, actions/backport/release/v1.7


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 8942da1 and c0f1802.

📒 Files selected for processing (1)
  • Makefile (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • Makefile


@vdaas-ci

[CHATOPS:HELP] ChatOps commands.

  • 🙆‍♀️ /approve - approve
  • 🍱 /format - format codes and add licenses
  • /gen-test - generate test codes
  • 🏷️ /label - add labels
  • 🔚 2️⃣ 🔚 /label actions/e2e-deploy - run E2E deploy & integration test


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 686c16a and b143230.

📒 Files selected for processing (1)
  • hack/tools/deadlink/main.go (1 hunks)
🔇 Additional comments not posted (1)
hack/tools/deadlink/main.go (1)

47-50: Validate regular expressions for accurate link extraction

The regular expressions used to extract links might not cover all valid cases or could produce incorrect matches, especially if attributes contain additional properties or are formatted differently.

Run the first script listed under "Scripts executed" below to test and verify the regular expressions against sample HTML content.

Ensure the patterns match attribute values enclosed in quotes and handle different cases properly.

✅ Verification successful

Further Validate Regular Expressions with Diverse HTML Samples

To ensure the regex patterns handle various attribute formats and edge cases, please run the second script listed under "Scripts executed" below.

Regular Expressions Validated Successfully

The regular expressions correctly extract src, href, and srcset attributes when values are enclosed in quotes. No issues were found with the current patterns.
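The same check can be sketched in Go with the standard `regexp` package. This is an illustration that mirrors the grep patterns used in the scripts, not the patterns from main.go: `attrRe` and `extractLinks` are hypothetical names, and splitting `srcset` on commas is an assumption that does not hold for URLs that themselves contain commas.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// attrRe mirrors the grep patterns: a src, href, or srcset attribute whose
// value is wrapped in single or double quotes.
var attrRe = regexp.MustCompile(`\b(src|href|srcset)=["']([^"']+)["']`)

// extractLinks returns every quoted src/href value and, for srcset, the URL
// part of each comma-separated candidate (the "1x"/"2x" descriptor is dropped).
func extractLinks(html string) []string {
	var links []string
	for _, m := range attrRe.FindAllStringSubmatch(html, -1) {
		name, value := m[1], m[2]
		if name != "srcset" {
			links = append(links, value)
			continue
		}
		for _, candidate := range strings.Split(value, ",") {
			if fields := strings.Fields(candidate); len(fields) > 0 {
				links = append(links, fields[0])
			}
		}
	}
	return links
}

// sampleHTML reuses lines from the review scripts, including an unquoted href.
const sampleHTML = `<a href='https://example.com/page?param=value'>Example Link</a>
<img src="/images/logo.png" alt="Logo" />
<source srcset="image1.jpg 1x, image2.jpg 2x">
<a href=https://example.com/missing-quotes>Missing Quotes</a>`

func main() {
	for _, link := range extractLinks(sampleHTML) {
		fmt.Println(link)
	}
}
```

As with the grep patterns, the unquoted `href=https://example.com/missing-quotes` is not extracted, so a checker built this way would silently skip links written without quotes.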

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that regular expressions correctly extract links from HTML files.

# Test: Use sample HTML content to test regex patterns.
# Expect: Correct extraction of 'src', 'href', and 'srcset' attributes.

echo '<a href="https://example.com">Link</a><img src="/images/logo.png" /><source srcset="image1.jpg 1x, image2.jpg 2x">' > sample.html

# Patterns to test
grep -Eo 'src=["'\''][^"'\'']+["'\'']' sample.html
grep -Eo 'href=["'\''][^"'\'']+["'\'']' sample.html
grep -Eo 'srcset=["'\''][^"'\'']+["'\'']' sample.html

Length of output: 374


Script:

#!/bin/bash
# Description: Further verify that regular expressions correctly extract links from diverse HTML files.

# Create a sample HTML file with varied attribute formats
cat <<EOF > sample_complex.html
<a href='https://example.com/page?param=value'>Example Link</a>
<img src="/images/logo.png" alt="Logo" />
<source srcset="image1.jpg 1x, image2.jpg 2x, image3.jpg 3x">
<a href="https://example.com/another-page" class="nav-link">Another Link</a>
<img src='/images/banner.png' data-info="banner" />
<source srcset='image4.jpg 1x, image5.jpg 2x'>
<a href=https://example.com/missing-quotes>Missing Quotes</a>
EOF

# Patterns to test
echo "Testing src:"
grep -Eo 'src=["'\''][^"'\'']+["'\'']' sample_complex.html

echo "Testing href:"
grep -Eo 'href=["'\''][^"'\'']+["'\'']' sample_complex.html

echo "Testing srcset:"
grep -Eo 'srcset=["'\''][^"'\'']+["'\'']' sample_complex.html

Length of output: 528

deepsource-autofix bot added a commit that referenced this pull request Sep 26, 2024
This commit fixes the style issues introduced in 4f5a181 according to the output
from Gofumpt and Prettier.

Details: #2643
@vankichi vankichi changed the title ✨ Implement deadlink checker for vald web Implement deadlink checker for vald web Sep 26, 2024
vankichi pushed a commit that referenced this pull request Sep 26, 2024
This commit fixes the style issues introduced in 4f5a181 according to the output
from Gofumpt and Prettier.

Details: #2643
@vankichi vankichi force-pushed the impl/hack/implement-deadlink-checker branch from 471f562 to 223aac3 Compare September 26, 2024 03:34
coderabbitai bot previously approved these changes Sep 26, 2024
@vankichi

@coderabbitai please review


coderabbitai bot commented Sep 26, 2024

@vankichi: I will review the changes.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@vankichi

@coderabbitai full review


coderabbitai bot commented Sep 26, 2024

✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
Makefile (2)

52-54: LGTM! Consider adding comments for clarity.

The new variables for the deadlink checker are well-defined. The use of ?= for DEADLINK_CHECK_PATH and DEADLINK_IGNORE_PATH allows for flexibility in setting these values.

Consider adding brief comments explaining the purpose of each variable to improve maintainability:

-DEADLINK_CHECK_PATH            ?= ""
-DEADLINK_IGNORE_PATH           ?= ""
-DEADLINK_CHECK_FORMAT           = html
+# Path to check for dead links
+DEADLINK_CHECK_PATH            ?= ""
+# Regex pattern for paths to ignore during the check
+DEADLINK_IGNORE_PATH           ?= ""
+# Output format for the results (e.g., html, json)
+DEADLINK_CHECK_FORMAT           = html

Line range hint 691-706: LGTM! Improved cross-platform support for usearch installation.

The changes to the usearch/install target greatly improve cross-platform compatibility by providing specific installation steps for Linux and macOS. The use of conditional statements and appropriate cleanup steps is commendable.

Consider adding error handling for the curl commands to make the installation process more robust:

 usearch/install:
 ifeq ($(OS),linux)
-	curl -sSL https://github.com/unum-cloud/usearch/releases/download/v$(USEARCH_VERSION)/usearch_$(OS)_$(GOARCH)_$(USEARCH_VERSION).deb -o usearch_$(OS)_$(USEARCH_VERSION).deb
+	curl -sSL -f https://github.com/unum-cloud/usearch/releases/download/v$(USEARCH_VERSION)/usearch_$(OS)_$(GOARCH)_$(USEARCH_VERSION).deb -o usearch_$(OS)_$(USEARCH_VERSION).deb || (echo "Failed to download usearch package" && exit 1)
 	dpkg -i usearch_$(OS)_$(USEARCH_VERSION).deb
 	rm usearch_$(OS)_$(USEARCH_VERSION).deb
 	ldconfig
 else ifeq ($(OS),macos)
-	curl -sSL https://github.com/unum-cloud/usearch/releases/download/v$(USEARCH_VERSION)/usearch_macos_$(GOARCH)_$(USEARCH_VERSION).zip -o usearch_macos_$(OS)_$(USEARCH_VERSION).zip
+	curl -sSL -f https://github.com/unum-cloud/usearch/releases/download/v$(USEARCH_VERSION)/usearch_macos_$(GOARCH)_$(USEARCH_VERSION).zip -o usearch_macos_$(OS)_$(USEARCH_VERSION).zip || (echo "Failed to download usearch package" && exit 1)
 	unzip usearch_macos_$(OS)_$(USEARCH_VERSION).zip
 	sudo mv libusearch_c.dylib /usr/local/lib && sudo mv usearch.h /usr/local/include
 	rm -rf usearch_macos_$(OS)_$(USEARCH_VERSION).zip
 	ldconfig
 endif

This addition of the -f flag to curl and the error handling will ensure that the installation fails gracefully if the download doesn't succeed.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 223aac3 and e4bcbb2.

📒 Files selected for processing (3)
  • Makefile (3 hunks)
  • Makefile.d/functions.mk (1 hunks)
  • hack/tools/deadlink/main.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • hack/tools/deadlink/main.go
🔇 Additional comments (1)
Makefile (1)

454-458: LGTM! Well-structured target for deadlink checker.

The new deadlink-checker target is correctly defined as phony and properly calls the gen-deadlink-checker function with the necessary parameters. The comment above the target is helpful for users.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (2)
Makefile (1)

Line range hint 691-703: Improvement suggestion for usearch installation

The usearch/install target has platform-specific installation steps for Linux and macOS. While the implementation is correct, it could be improved for better maintainability and readability.

Consider refactoring the usearch/install target to use a more concise and maintainable structure. Here's a suggested improvement:

.PHONY: usearch/install
## install usearch
usearch/install:
	@case "$(OS)" in \
		linux) \
			curl -sSL https://github.com/unum-cloud/usearch/releases/download/v$(USEARCH_VERSION)/usearch_$(OS)_$(GOARCH)_$(USEARCH_VERSION).deb -o usearch_$(OS)_$(USEARCH_VERSION).deb && \
			dpkg -i usearch_$(OS)_$(USEARCH_VERSION).deb && \
			rm usearch_$(OS)_$(USEARCH_VERSION).deb ;; \
		darwin) \
			curl -sSL https://github.com/unum-cloud/usearch/releases/download/v$(USEARCH_VERSION)/usearch_macos_$(GOARCH)_$(USEARCH_VERSION).zip -o usearch_macos_$(USEARCH_VERSION).zip && \
			unzip usearch_macos_$(USEARCH_VERSION).zip && \
			sudo mv libusearch_c.dylib /usr/local/lib && sudo mv usearch.h /usr/local/include && \
			rm -rf usearch_macos_$(USEARCH_VERSION).zip ;; \
		*) echo "Unsupported OS: $(OS)" && exit 1 ;; \
	esac
	ldconfig

This refactored version:

  1. Uses a case statement for better readability.
  2. Combines commands with && to ensure all steps are executed successfully.
  3. Adds an error case for unsupported operating systems.
  4. Keeps the ldconfig command outside the case statement as it's common to both platforms.
hack/tools/deadlink/main.go (1)

24-33: Adhere to Go's naming conventions for constants

In Go, constants are typically named using CamelCase rather than ALL_CAPS. Renaming the constants improves readability and aligns with Go's idiomatic style.

Apply this diff to rename the constants:

 const (
-	BASE_URL      = "https://vald.vdaas.org"
-	PREFIX_PROP   = `property="og:url" content="`
-	PREFIX_SRC    = "src="
-	PREFIX_SRCSET = "srcset="
-	PREFIX_HREF   = "href="
+	baseURL      = "https://vald.vdaas.org"
+	prefixProp   = `property="og:url" content="`
+	prefixSrc    = "src="
+	prefixSrcset = "srcset="
+	prefixHref   = "href="
 
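-	BASE_REGEXP = `[\w!\?/\+\-_~=;:\.,\*&@#\$%\(\)'\[\]]+`
-	URL_REGEXP  = `https?://[\w!\?/\+\-_~=;\.,\*&@#\$%\(\)'\[\]]+`
+	baseRegexp = `[\w!\?/\+\-_~=;:\.,\*&@#\$%\(\)'\[\]]+`
+	urlRegexp  = `https?://[\w!\?/\+\-_~=;\.,\*&@#\$%\(\)'\[\]]+`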
 )

Ensure to update all references to these constants throughout the code.
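As a small illustration of the suggested convention, the sketch below declares unexported mixedCaps constants and compiles one of them. The constant values are copied from the diff above; the variable `urlRe` and the `main` function are assumptions added only to make the example runnable.

```go
package main

import (
	"fmt"
	"regexp"
)

// Unexported constants use mixedCaps; an exported one would be BaseURL.
const (
	baseURL      = "https://vald.vdaas.org"
	prefixSrc    = "src="
	prefixSrcset = "srcset="
	prefixHref   = "href="

	urlRegexp = `https?://[\w!\?/\+\-_~=;\.,\*&@#\$%\(\)'\[\]]+`
)

// urlRe is compiled once at package initialization (hypothetical name).
var urlRe = regexp.MustCompile(urlRegexp)

func main() {
	text := "see " + baseURL + "/docs/ for details"
	fmt.Println(urlRe.FindString(text))
}
```

Because the identifiers start with a lowercase letter, they stay private to the package, which suits a single-file tool under hack/tools.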

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 686c16a and e4bcbb2.

📒 Files selected for processing (3)
  • Makefile (3 hunks)
  • Makefile.d/functions.mk (1 hunks)
  • hack/tools/deadlink/main.go (1 hunks)
🔇 Additional comments (3)
Makefile (3)

52-54: New variables added for deadlink checker

Three new variables have been added for the deadlink checker functionality:

  • DEADLINK_CHECK_PATH: Path to check for dead links.
  • DEADLINK_IGNORE_PATH: Regex pattern for paths to ignore during the check.
  • DEADLINK_CHECK_FORMAT: Output format for the results, with a default value of "html".

These variables provide good flexibility for configuring the deadlink checker.
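To make the interplay of the three settings concrete, here is a rough sketch of how a tool might collect the files to check. It is not taken from main.go: `collectFiles` and `demoCollect` are hypothetical names, and the sketch assumes the format value selects files by extension and the ignore value is a regular expression matched against paths relative to the check path, neither of which this thread confirms.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"regexp"
	"strings"
)

// collectFiles walks root and returns the slash-separated relative paths of
// files whose extension matches format, skipping anything whose relative
// path matches ignore. A nil ignore skips nothing.
func collectFiles(root, format string, ignore *regexp.Regexp) ([]string, error) {
	ext := "." + strings.TrimPrefix(format, ".")
	var files []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return err
		}
		if ignore != nil && ignore.MatchString(rel) {
			if d.IsDir() {
				return filepath.SkipDir // prune the whole ignored directory
			}
			return nil
		}
		if !d.IsDir() && filepath.Ext(path) == ext {
			files = append(files, filepath.ToSlash(rel))
		}
		return nil
	})
	return files, err
}

// demoCollect builds a throwaway tree and collects its HTML files while
// ignoring versioned directories, as in the usage example v[0-9]+.
func demoCollect() ([]string, error) {
	root, err := os.MkdirTemp("", "deadlink-demo")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)
	names := []string{"index.html", "docs/api.html", "docs/notes.txt", "v1.7/index.html"}
	for _, name := range names {
		p := filepath.Join(root, filepath.FromSlash(name))
		if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
			return nil, err
		}
		if err := os.WriteFile(p, []byte("<html></html>"), 0o644); err != nil {
			return nil, err
		}
	}
	return collectFiles(root, "html", regexp.MustCompile(`v[0-9]+`))
}

func main() {
	files, err := demoCollect()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(files)
}
```

Matching the ignore pattern against the relative path rather than the absolute one keeps an unrelated parent directory name from excluding the entire tree.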


454-458: New phony target added for deadlink checker

A new phony target deadlink-checker has been added to generate the deadlink checker. It uses the gen-deadlink-checker function, passing the root directory, maintainer information, and the newly defined variables as parameters.

This addition aligns well with the project's structure and naming conventions.


Line range hint 1-703: Summary of changes: Deadlink checker implementation

The changes to this Makefile primarily focus on implementing a deadlink checker functionality. Key additions include:

  1. New variables for configuring the deadlink checker (lines 52-54).
  2. A new phony target deadlink-checker for generating the deadlink checker (lines 454-458).

These additions are well-integrated into the existing Makefile structure and provide useful functionality for checking dead links in the project. The new variables offer good configurability for the deadlink checker.

While not directly related to the deadlink checker, an improvement suggestion was made for the usearch/install target to enhance its maintainability and readability.

Overall, these changes appear to be a positive addition to the project's build and maintenance capabilities.

@vankichi vankichi requested review from a team, kpango and datelier and removed request for a team September 26, 2024 08:50
datelier previously approved these changes Sep 27, 2024
@vankichi vankichi marked this pull request as ready for review September 30, 2024 03:02
vankichi and others added 4 commits September 30, 2024 12:02
Signed-off-by: vankichi <[email protected]>
This commit fixes the style issues introduced in 4f5a181 according to the output
from Gofumpt and Prettier.

Details: #2643
Signed-off-by: vankichi <[email protected]>
Signed-off-by: vankichi <[email protected]>
Signed-off-by: vankichi <[email protected]>
@vankichi vankichi force-pushed the impl/hack/implement-deadlink-checker branch from 4466d68 to 8942da1 Compare September 30, 2024 03:03
coderabbitai bot previously approved these changes Sep 30, 2024
datelier previously approved these changes Sep 30, 2024
@kpango kpango merged commit 413b5d7 into main Sep 30, 2024
151 of 153 checks passed
@kpango kpango deleted the impl/hack/implement-deadlink-checker branch September 30, 2024 09:34
4 participants