html-validate #26

Open
maelle opened this issue Nov 6, 2023 · 33 comments
Comments

maelle commented Nov 6, 2023

quarto-dev/quarto-cli#7489

maelle commented Nov 27, 2023

I fear the package is something that has to be served with Apache: https://askubuntu.com/questions/471523/install-wc3-markup-validator-locally

maelle commented Nov 27, 2023

Based on that, it seems that using the package in a workflow would require changing some configuration files.

Then one would need to serve both the website under scrutiny and the validator, send the link to the website to the validator, and parse the results, which would be an HTML file.

Or maybe, once the validator is served, there is an API.

I hope to find some better docs somewhere.

@krlmlr

maelle commented Nov 27, 2023

maelle commented Nov 27, 2023

apparently any instance would have the API https://github.com/validator/validator/wiki/Service-%C2%BB-Input-%C2%BB-GET
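
A minimal sketch of what calling that GET interface might look like from a shell, assuming the page to check is publicly reachable and the instance lives at https://validator.w3.org/nu/. The `validate_url` helper name and the `VALIDATOR_URL` variable are hypothetical; `doc` and `out` are the query parameters of the checker's web service.

```shell
# Hypothetical helper: ask a validator instance to fetch and check a public page.
# Assumption: the instance is reachable at VALIDATOR_URL (default: the W3C one).
validate_url() {
  # $1: public URL of the page to check
  # --get turns the --data-urlencode values into an encoded query string.
  curl --silent --show-error --get \
    --data-urlencode "doc=$1" \
    --data-urlencode "out=json" \
    "${VALIDATOR_URL:-https://validator.w3.org/nu/}"
}
```

Something like `validate_url "https://blog.cynkra.com/"` would then print the JSON report, which could be parsed for its `messages`.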

maelle commented Nov 27, 2023

I was hoping to find a ready-made action but didn't find one.

maelle commented Dec 1, 2023

found https://www.npmjs.com/package/html-validator by chance (was working on some other invalid HTML 😂 )

maelle commented Dec 1, 2023

but it would use the API

pat-s commented Dec 18, 2023

krlmlr commented Dec 18, 2023

The Quarto team doesn't agree that this validator is an authority, but they do follow the W3C one.

quarto-dev/quarto-cli#7489

If the W3C validator is difficult to operate, we could also validate once with the W3C validator, and then come up with exclusions for our validator that lead to a green build.

To recap, why I think validation is important: I've heard that search engines treat well-formatted websites better than crappy ones. Happy to revisit this stance if it's irrelevant or wrong.

maelle commented Feb 19, 2024

A first step would be to identify which pages are modified so as not to send the whole site to the API. 🤔

Probably not just a Git diff, because a page's metadata might have changed (so Git sees a difference) without the page being worth sending to the API.

Maybe a sitemap thing. Download the current sitemap, get the new one, send the new pages to the API.

krlmlr commented Feb 19, 2024

To me, detecting changes is independent, and could also be postponed?

maelle commented Feb 19, 2024

We need to know which pages to send to the API.

maelle commented Feb 19, 2024

Current script below. Something is wrong with how I send the document, as it's not properly detected.

current_sitemap <- xml2::read_xml("https://blog.cynkra.com/sitemap.xml")
current_links <- xml2::xml_find_all(current_sitemap, ".//d1:loc") |>
  xml2::xml_text()

# quarto::quarto_render()

new_sitemap <-  xml2::read_xml(file.path("docs", "sitemap.xml"))
new_links <- xml2::xml_find_all(new_sitemap, ".//d1:loc") |>
  xml2::xml_text()
added_links <- setdiff(new_links, current_links)

validate_page <- function(url) {
  file <- file.path("docs", urltools::path(url))
  httr2::request("http://validator.w3.org/nu/?out=json") |>
    httr2::req_method("POST") |>
    httr2::req_headers(
      `Content-Type` = "text/html",
      "charset"="utf-8"
    ) |>
    httr2::req_body_file(file) |>
    httr2::req_perform() |>
    httr2::resp_body_json()
  
}

maelle commented Feb 19, 2024

maelle commented Feb 19, 2024

ah, using httr2::curl_translate() helped

maelle commented Feb 19, 2024

Still not there yet.

current_sitemap <- xml2::read_xml("https://blog.cynkra.com/sitemap.xml")
current_links <- xml2::xml_find_all(current_sitemap, ".//d1:loc") |>
  xml2::xml_text()

# quarto::quarto_render()

new_sitemap <-  xml2::read_xml(file.path("docs", "sitemap.xml"))
new_links <- xml2::xml_find_all(new_sitemap, ".//d1:loc") |>
  xml2::xml_text()
added_links <- setdiff(new_links, current_links)

validate_page <- function(url) {
  file <- file.path("docs", urltools::path(url))
  httr2::request("http://validator.w3.org/nu/") |> 
    httr2::req_url_query(out = "json") |>
    httr2::req_method("POST") |>
    httr2::req_headers(
      `Content-Type` = "text/html",
      "charset"="utf-8"
    ) |>
    httr2::req_body_raw(paste(brio::read_lines(file), collapse = "\n")) |>
    httr2::req_perform() |>
    httr2::resp_body_json()
  
}

validate_page(added_links[1])
#> $messages
#> $messages[[1]]
#> $messages[[1]]$type
#> [1] "error"
#> 
#> $messages[[1]]$message
#> [1] "The character encoding was not declared. Proceeding using “windows-1252”."
#> 
#> 
#> $messages[[2]]
#> $messages[[2]]$type
#> [1] "error"
#> 
#> $messages[[2]]$message
#> [1] "End of file seen without seeing a doctype first. Expected “<!DOCTYPE html>”."
#> 
#> 
#> $messages[[3]]
#> $messages[[3]]$type
#> [1] "error"
#> 
#> $messages[[3]]$message
#> [1] "Element “head” is missing a required instance of child element “title”."
#> 
#> 
#> $messages[[4]]
#> $messages[[4]]$type
#> [1] "info"
#> 
#> $messages[[4]]$subType
#> [1] "warning"
#> 
#> $messages[[4]]$message
#> [1] "Consider adding a “lang” attribute to the “html” start tag to declare the language of this document."

Created on 2024-02-19 with reprex v2.1.0

maelle commented Feb 19, 2024

The errors make no sense given the actual content of index.html, which means I am sending it incorrectly.

maelle commented Feb 19, 2024

Indeed, if I use showsource, it shows I sent nothing.

maelle commented Feb 19, 2024

But the dry-run of httr2 shows content length.

maelle commented Feb 19, 2024

I'm putting this aside for now. 😞

maelle commented Feb 26, 2024

The last time https://github.com/validator/validator/wiki/Service-%C2%BB-Input-%C2%BB-POST-body was updated was in 2016, so maybe it's no longer valid?

maelle commented Feb 26, 2024

I tried a bit more without success.

current_sitemap <- xml2::read_xml("https://blog.cynkra.com/sitemap.xml")
current_links <- xml2::xml_find_all(current_sitemap, ".//d1:loc") |>
  xml2::xml_text()

# quarto::quarto_render()

new_sitemap <-  xml2::read_xml(file.path("docs", "sitemap.xml"))
new_links <- xml2::xml_find_all(new_sitemap, ".//d1:loc") |>
  xml2::xml_text()
added_links <- setdiff(new_links, current_links)

validate_page <- function(url) {
  file <- file.path("docs", urltools::path(url))
  httr2::request("http://validator.w3.org/nu/") |>
    httr2::req_url_query(out = "json", showsource = "yes", parser = "html5") |>
    httr2::req_method("POST") |>
    httr2::req_headers(
      `Content-Type` = "text/html",
      "charset"="utf-8"
    ) |>
    httr2::req_body_raw(paste(brio::read_lines(file), collapse = "\n"), "text/html; charset=utf-8") |>
    httr2::req_perform() |>
    httr2::resp_body_json()

}

validate_page(added_links[1])
#> $messages
#> $messages[[1]]
#> $messages[[1]]$type
#> [1] "error"
#> 
#> $messages[[1]]$message
#> [1] "The character encoding was not declared. Proceeding using “windows-1252”."
#> 
#> 
#> $messages[[2]]
#> $messages[[2]]$type
#> [1] "error"
#> 
#> $messages[[2]]$message
#> [1] "End of file seen without seeing a doctype first. Expected “<!DOCTYPE html>”."
#> 
#> 
#> $messages[[3]]
#> $messages[[3]]$type
#> [1] "error"
#> 
#> $messages[[3]]$message
#> [1] "Element “head” is missing a required instance of child element “title”."
#> 
#> 
#> $messages[[4]]
#> $messages[[4]]$type
#> [1] "info"
#> 
#> $messages[[4]]$subType
#> [1] "warning"
#> 
#> $messages[[4]]$message
#> [1] "Consider adding a “lang” attribute to the “html” start tag to declare the language of this document."
#> 
#> 
#> 
#> $source
#> $source$type
#> [1] "text/html"
#> 
#> $source$code
#> [1] ""

Created on 2024-02-26 with reprex v2.1.0
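
One thing that might be worth ruling out (an assumption, not something confirmed in this thread): the scripts above post to an `http://` address. If the server answers with a redirect to `https://`, an HTTP client can end up repeating the request without its body, which would match the empty `source$code`. Below is a minimal curl sketch that posts straight to the `https://` endpoint, with the charset folded into the `Content-Type` header instead of a separate `charset` header. The `validate_file` helper name and the `VALIDATOR_URL` variable are hypothetical.

```shell
# Hypothetical helper: POST a local HTML file as the request body.
# Assumption: the validator instance is reachable over https at VALIDATOR_URL.
VALIDATOR_URL="${VALIDATOR_URL:-https://validator.w3.org/nu/}"

validate_file() {
  # $1: path to the rendered HTML file, e.g. docs/index.html
  # --data-binary sends the file bytes unchanged (no newline stripping).
  curl --silent --show-error \
    --header "Content-Type: text/html; charset=utf-8" \
    --data-binary "@$1" \
    "${VALIDATOR_URL}?out=json"
}
```

Calling `validate_file docs/index.html` would print the JSON report. If this works where the R version does not, comparing it with the output of `httr2::req_dry_run()` might show what differs.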

krlmlr commented Feb 26, 2024

What text are you sending to the API?

maelle commented Feb 26, 2024

a whole HTML file. httr2::req_dry_run() shows the content is not empty... but the API output states I sent nothing.

krlmlr commented Feb 26, 2024

The file has <!DOCTYPE html> but the API doesn't see it?

maelle commented Feb 26, 2024

the API sees "" apparently.

krlmlr commented Feb 26, 2024

Can you upload a file manually to https://validator.w3.org/nu/about.html ?

I'm forgetting again why this is so complicated.

What am I missing?

maelle commented Feb 27, 2024

I wanted to use the API instead of trying to deploy the thing on GHA, but it's not working.

I had been able to use the web interface.

maelle commented Mar 4, 2024

  • Installed the package from npm but was then unable to run it.

maelle commented Mar 4, 2024

pfff it was actually easy, what was I thinking.

maelle commented Mar 4, 2024

vnu-runtime-image/bin/vnu docs/index.html
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":42.1-42.174: error: Duplicate ID “quarto-text-highlighting-styles”.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":41.1-41.148: info warning: The first occurrence of ID “quarto-text-highlighting-styles” was here.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":46.1-46.161: error: Duplicate ID “quarto-bootstrap”.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":45.1-45.136: info warning: The first occurrence of ID “quarto-bootstrap” was here.

This is due to how Quarto handles dark mode. Both files are present in the source.

"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":106.3-106.100: info warning: The “type” attribute is unnecessary for JavaScript resources.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":108.1-108.31: info warning: The “type” attribute is unnecessary for JavaScript resources.

This is about <script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml-full.js" type="text/javascript"></script> and the lines below

"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":258.1-258.113: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":300.1-300.129: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":339.1-339.110: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":375.1-375.117: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":414.1-414.109: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":447.1-447.106: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":483.1-483.127: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":522.1-522.117: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":557.25-557.106: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":557.25-557.106: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":592.1-592.110: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":628.1-628.109: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":663.25-663.160: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":695.1-695.121: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":728.1-728.100: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":767.1-767.108: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":802.25-802.161: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":837.1-837.100: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":875.25-875.106: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":875.25-875.106: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":909.25-909.163: error: Element “img” is missing required attribute “src”.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":909.25-909.163: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":943.25-943.111: error: Element “img” is missing required attribute “src”.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":943.25-943.111: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.

quarto-dev/quarto-cli#6987
Plus, I need to apply https://quarto.org/docs/websites/website-listings.html#listing-fields to the current post.

"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":943.25-943.111: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.

This is for lines such as <p class="card-img-top"><img data-src="mountain.jpg" style="height: 150px;" class="thumbnail-image card-img"/></p>

"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":987.1-987.66: info warning: The “type” attribute is unnecessary for JavaScript resources.

This refers to <script id="quarto-html-after-body" type="application/javascript">

"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":1131.54-1151.17: info warning: Document uses the Unicode Private Use Area(s), which should not be used in publicly exchanged documents. (Charmod C073)

"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":1527.1-1527.17: error: Element “script” must not have attribute “async” unless attribute “src” is also specified or unless attribute “type” is specified with value “module”.

Maybe <script async="">

maelle commented Mar 4, 2024

@DivadNojnarg do the very last two lines of the comment above make sense to you? How could we tweak the script you created to avoid the validator error?

maelle commented Mar 4, 2024

I'll come back to this issue next week, now that I can run the validator. 😸
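
With the validator running locally, the exclusions idea from earlier in the thread could be sketched roughly as below. This is only an illustration: `unexpected_errors` and the `vnu-exclusions.txt` file are hypothetical names, and the sketch assumes the messages arrive one per line in the text format shown above. It keeps the `error:` lines and drops those matching a known pattern, such as the duplicate Quarto IDs.

```shell
# Hypothetical helper: read validator messages on stdin and print only the
# errors that match none of the regular expressions in the file given as $1.
# Exit status is 0 when unexpected errors remain, non-zero otherwise.
unexpected_errors() {
  grep -F ': error: ' | grep -v -E -f "$1"
}

# Possible use in a workflow step (paths are assumptions):
#   if vnu-runtime-image/bin/vnu docs/index.html 2>&1 \
#       | unexpected_errors vnu-exclusions.txt; then
#     echo "Unexpected validation errors" >&2
#     exit 1
#   fi
```

An exclusions file could start with one pattern per line, for example `Duplicate ID .*quarto-bootstrap`. The checker also appears to document `--filterfile` and `--filterpattern` options, which may make the grep step unnecessary.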
