html-validate #26
I fear the package is something to be run with Apache: https://askubuntu.com/questions/471523/install-wc3-markup-validator-locally
Based on that, it seems that using the package in a workflow would require changing some configuration files. One would then need to serve both the website under scrutiny and the validator, send the link to the website under scrutiny to the validator, then parse the results, which would be an HTML file. Or maybe, if one serves the validator, there's an API. I hope to find better docs somewhere.
Apparently any instance would have the API: https://github.com/validator/validator/wiki/Service-%C2%BB-Input-%C2%BB-GET
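For reference, a minimal sketch of what a GET call against that API might look like from R, based on the wiki page above: `doc` is the URL of the page to check and `out=json` requests machine-readable output. (The blog URL here is just an example of a deployed page; this is untested against a live instance.)

```r
# Hedged sketch: ask the Nu checker to fetch and validate a public URL.
resp <- httr2::request("https://validator.w3.org/nu/") |>
  httr2::req_url_query(
    doc = "https://blog.cynkra.com/", # page under scrutiny (example URL)
    out = "json"                      # JSON instead of the HTML results page
  ) |>
  httr2::req_perform() |>
  httr2::resp_body_json()

# resp$messages is a list of error/warning/info messages, empty if the page is valid.
```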
I was hoping to find a ready-made action but didn't find one.
Found https://www.npmjs.com/package/html-validator by chance (I was working on some other invalid HTML 😂).
But it would use the API.
The Quarto team doesn't treat this validator as an authority, but they do follow the W3C one. If the W3C validator is difficult to operate, we could also validate once with the W3C validator, then come up with exclusions for our validator that lead to a green build. To recap why I think validation is important: I've heard that search engines treat well-formed websites better than broken ones. Happy to revisit this stance if it's irrelevant or wrong.
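If we go the exclusion route, one option would be to filter the validator's JSON messages against an allow-list before failing the build. A sketch, assuming the `messages` structure shown in the reprexes later in this thread; the patterns in `ignored_patterns` are hypothetical examples:

```r
# Hypothetical allow-list: regular expressions matching messages we accept.
ignored_patterns <- c(
  "Unicode Private Use Area",
  "Consider adding a .lang. attribute"
)

# Keep only messages that match none of the ignored patterns;
# a green build would mean this returns an empty list.
blocking_messages <- function(messages) {
  Filter(
    function(msg) {
      !any(vapply(ignored_patterns, grepl, logical(1), x = msg$message))
    },
    messages
  )
}
```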
A first step would be to identify which pages were modified, so as not to send the whole site to the API. 🤔 It's probably not just a Git thing, because a page's metadata might have changed (so it differs for Git) without it being worth sending to the API. Maybe a sitemap thing: download the current sitemap, get the new one, and send the new pages to the API.
To me, detecting changes is an independent problem, and could also be postponed?
We need to know which pages to send to the API.
Current script; something is wrong with how I send the document, as it's not properly detected.

```r
current_sitemap <- xml2::read_xml("https://blog.cynkra.com/sitemap.xml")
current_links <- xml2::xml_find_all(current_sitemap, ".//d1:loc") |>
  xml2::xml_text()

# quarto::quarto_render()
new_sitemap <- xml2::read_xml(file.path("docs", "sitemap.xml"))
new_links <- xml2::xml_find_all(new_sitemap, ".//d1:loc") |>
  xml2::xml_text()

added_links <- setdiff(new_links, current_links)

validate_page <- function(url) {
  file <- file.path("docs", urltools::path(url))
  httr2::request("http://validator.w3.org/nu/?out=json") |>
    httr2::req_method("POST") |>
    httr2::req_headers(
      `Content-Type` = "text/html",
      charset = "utf-8"
    ) |>
    httr2::req_body_file(file) |>
    httr2::req_perform() |>
    httr2::resp_body_json()
}
```
ah, using
Still not there yet.

```r
current_sitemap <- xml2::read_xml("https://blog.cynkra.com/sitemap.xml")
current_links <- xml2::xml_find_all(current_sitemap, ".//d1:loc") |>
  xml2::xml_text()

# quarto::quarto_render()
new_sitemap <- xml2::read_xml(file.path("docs", "sitemap.xml"))
new_links <- xml2::xml_find_all(new_sitemap, ".//d1:loc") |>
  xml2::xml_text()

added_links <- setdiff(new_links, current_links)

validate_page <- function(url) {
  file <- file.path("docs", urltools::path(url))
  httr2::request("http://validator.w3.org/nu/") |>
    httr2::req_url_query(out = "json") |>
    httr2::req_method("POST") |>
    httr2::req_headers(
      `Content-Type` = "text/html",
      charset = "utf-8"
    ) |>
    httr2::req_body_raw(paste(brio::read_lines(file), collapse = "\n")) |>
    httr2::req_perform() |>
    httr2::resp_body_json()
}

validate_page(added_links[1])
#> $messages
#> $messages[[1]]
#> $messages[[1]]$type
#> [1] "error"
#>
#> $messages[[1]]$message
#> [1] "The character encoding was not declared. Proceeding using “windows-1252”."
#>
#>
#> $messages[[2]]
#> $messages[[2]]$type
#> [1] "error"
#>
#> $messages[[2]]$message
#> [1] "End of file seen without seeing a doctype first. Expected “<!DOCTYPE html>”."
#>
#>
#> $messages[[3]]
#> $messages[[3]]$type
#> [1] "error"
#>
#> $messages[[3]]$message
#> [1] "Element “head” is missing a required instance of child element “title”."
#>
#>
#> $messages[[4]]
#> $messages[[4]]$type
#> [1] "info"
#>
#> $messages[[4]]$subType
#> [1] "warning"
#>
#> $messages[[4]]$message
#> [1] "Consider adding a “lang” attribute to the “html” start tag to declare the language of this document."
```

Created on 2024-02-19 with reprex v2.1.0
The errors make no sense given the actual content of index.html, which means I am sending it in the wrong way.
Indeed, if I use showsource, it shows I sent nothing.
But the dry run of httr2 shows a content length.
I'm putting this aside for now. 😞
The wiki page https://github.com/validator/validator/wiki/Service-%C2%BB-Input-%C2%BB-POST-body was last updated in 2016, so maybe it's no longer valid?
I tried a bit more, without success.

```r
current_sitemap <- xml2::read_xml("https://blog.cynkra.com/sitemap.xml")
current_links <- xml2::xml_find_all(current_sitemap, ".//d1:loc") |>
  xml2::xml_text()

# quarto::quarto_render()
new_sitemap <- xml2::read_xml(file.path("docs", "sitemap.xml"))
new_links <- xml2::xml_find_all(new_sitemap, ".//d1:loc") |>
  xml2::xml_text()

added_links <- setdiff(new_links, current_links)

validate_page <- function(url) {
  file <- file.path("docs", urltools::path(url))
  httr2::request("http://validator.w3.org/nu/") |>
    httr2::req_url_query(out = "json", showsource = "yes", parser = "html5") |>
    httr2::req_method("POST") |>
    httr2::req_headers(
      `Content-Type` = "text/html",
      charset = "utf-8"
    ) |>
    httr2::req_body_raw(paste(brio::read_lines(file), collapse = "\n"), "text/html; charset=utf-8") |>
    httr2::req_perform() |>
    httr2::resp_body_json()
}

validate_page(added_links[1])
#> $messages
#> $messages[[1]]
#> $messages[[1]]$type
#> [1] "error"
#>
#> $messages[[1]]$message
#> [1] "The character encoding was not declared. Proceeding using “windows-1252”."
#>
#>
#> $messages[[2]]
#> $messages[[2]]$type
#> [1] "error"
#>
#> $messages[[2]]$message
#> [1] "End of file seen without seeing a doctype first. Expected “<!DOCTYPE html>”."
#>
#>
#> $messages[[3]]
#> $messages[[3]]$type
#> [1] "error"
#>
#> $messages[[3]]$message
#> [1] "Element “head” is missing a required instance of child element “title”."
#>
#>
#> $messages[[4]]
#> $messages[[4]]$type
#> [1] "info"
#>
#> $messages[[4]]$subType
#> [1] "warning"
#>
#> $messages[[4]]$message
#> [1] "Consider adding a “lang” attribute to the “html” start tag to declare the language of this document."
#>
#>
#>
#> $source
#> $source$type
#> [1] "text/html"
#>
#> $source$code
#> [1] ""
```

Created on 2024-02-26 with reprex v2.1.0
What text are you sending to the API?
A whole HTML file.
The file has
the API sees
Can you upload a file manually to https://validator.w3.org/nu/about.html? I'm forgetting again why this is so complicated.
What am I missing?
I wanted to use the API instead of trying to deploy the thing on GHA, but it's not working. I had been able to use the web interface.
pfff it was actually easy, what was I thinking.
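For the record, a sketch of what presumably fixed it: letting `req_body_file()` declare the `Content-Type` itself via its `type` argument, instead of setting headers manually, so the body and its declared type stay consistent. This is my reconstruction of the working call, not the exact final script:

```r
# Hedged reconstruction of the working validator call (untested here).
validate_page <- function(url) {
  file <- file.path("docs", urltools::path(url))
  httr2::request("https://validator.w3.org/nu/") |>
    httr2::req_url_query(out = "json") |>
    # Setting a body makes the request a POST; the type argument sets the
    # Content-Type header for us, matching the file we actually send.
    httr2::req_body_file(file, type = "text/html; charset=utf-8") |>
    httr2::req_perform() |>
    httr2::resp_body_json()
}
```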
This is due to how Quarto handles dark mode. Both files are present in the source.
This is about
quarto-dev/quarto-cli#6987
This is for lines such as
This refers to "file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":1131.54-1151.17: info warning: Document uses the Unicode Private Use Area(s), which should not be used in publicly exchanged documents. (Charmod C073)
Maybe
@DivadNojnarg do the very last two lines of the comment above make sense to you? How could we tweak the script you created to avoid the validator error?
I'll come back to this issue next week, now that I can run the validator. 😸
quarto-dev/quarto-cli#7489