
url_absolute fails with spaces in url #401

@jonthegeek

xml_attr(x, "href") returns URLs un-encoded if that is how they appear in the source, but those URLs then fail in url_absolute().

url <- "/filename with spaces.pdf" 
xml2::url_absolute(
  url,
  base = "https://example.com/"
)
#> [1] NA
xml2::url_absolute(
  utils::URLencode(url),
  base = "https://example.com/"
)
#> [1] "https://example.com/filename%20with%20spaces.pdf"

Created on 2023-08-23 with reprex v2.0.2
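
The workaround above depends on what utils::URLencode() changes. A minimal check, assuming its default arguments (reserved = FALSE, repeated = FALSE) and using a made-up already-encoded path for the second call:

```r
# Spaces are percent-encoded; "/" and "." are left alone by default.
utils::URLencode("/filename with spaces.pdf")
#> [1] "/filename%20with%20spaces.pdf"

# With repeated = FALSE, a string that already contains a %xx escape
# is treated as encoded and returned unchanged.
utils::URLencode("/already%20encoded.pdf")
#> [1] "/already%20encoded.pdf"
```

One consequence of the second behaviour is that an href mixing escapes and raw spaces would presumably pass through untouched, so the workaround may not cover every case.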

url_absolute() silently returns NA when the URL contains spaces. It should at least warn the user, though it might be preferable for it to encode such URLs itself.
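
A rough sketch of what "encode, and warn if it still fails" could look like from the user side. The helper name url_absolute_encoded() is hypothetical and not part of xml2 or rvest; it only wraps the two calls from the reprex above, and assumes xml2 is installed.

```r
# Hypothetical helper, not part of xml2 or rvest.
url_absolute_encoded <- function(x, base) {
  out <- rep(NA_character_, length(x))
  present <- !is.na(x)
  if (any(present)) {
    # Encode element by element so this does not rely on URLencode()
    # being vectorised.
    encoded <- vapply(x[present], utils::URLencode, character(1),
                      USE.NAMES = FALSE)
    out[present] <- xml2::url_absolute(encoded, base)
  }
  # Warn about inputs that were supplied but still could not be resolved.
  failed <- present & is.na(out)
  if (any(failed)) {
    warning(sum(failed), " URL(s) could not be made absolute; returning NA.",
            call. = FALSE)
  }
  out
}

url_absolute_encoded("/filename with spaces.pdf", base = "https://example.com/")
#> [1] "https://example.com/filename%20with%20spaces.pdf"
```

Missing inputs are passed through as NA without a warning, so the warning only fires for the silent-failure case described above.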

This is where I found it in the wild:

base_url <- "https://www.copyright.gov/fair-use/fair-index.html"

pdf_urls <-
  rvest::read_html(base_url) |> 
  rvest::html_element("table") |> 
  rvest::html_elements("tr>td:first-of-type>a:first-of-type") |>
  rvest::html_attr("href")

pdf_urls[[10]] |> 
  rvest::url_absolute(base_url)
#> [1] NA

pdf_urls[[10]] |> 
  utils::URLencode() |> 
  rvest::url_absolute(base_url)
#> [1] "https://www.copyright.gov/fair-use/summaries/ONeil%20v.%20Ratajkowski%20No.%2019%20CIV.%209769%20(S.D.N.Y.%202021).pdf"

Created on 2023-08-23 with reprex v2.0.2


Labels: bug (an unexpected problem or unintended behavior), url 👑
