Pattern matches does not work for google search results #74

Open
@brandonbrown5

Description

Parsing a Google search result URL raises an Invalid URI error. It appears the regular expression here does not recognize this as a valid URL, even though a browser can navigate to it.

URL: https://www.google.com/search?q=capt.%20jacks%20family%20buffet&rlz=1C2CHBF_enUS902US902&sxsrf=APwXEdehG3ObQHEcqZT0clDT-XUDJ2iaXg:1681756568453&source=hp&ei=jpE9ZIioNOvGkPIPzP2ayAE&iflsig=AOEireoAAAAAZD2fnm-EI4rFn06RvhHNRndJIcwCmIRY&oq=capt.+jack&gs_lcp=Cgdnd3Mtd2l6EAEYADIFCAAQgAQyCgguEIAEENQCEAoyBwgAEIAEEAoyCwguEIAEEMcBEK8BMgUIABCABDIFCAAQgAQyBQgAEIAEMgcIABCABBAKMgoILhCABBDUAhAKMggIABCKBRCGAzoHCCMQ6gIQJzoECCMQJzoICAAQigUQkQI6CAgAEIAEELEDOhEILhCABBCxAxCDARDHARDRAzoOCC4QgAQQsQMQxwEQ0QM6DgguEIoFEMcBENEDEJECOg4ILhCABBDJAxDHARCvAToFCC4QgAQ6DgguEIoFEMcBEK8BEJECOgsIABCKBRCxAxCRAjoOCC4QgAQQsQMQgwEQ1AI6CwguEK8BEMcBEIAEOg0ILhCABBDHARCvARAKOgcILhCABBAKOggILhCABBDUAlC3DliGK2DeN2gBcAB4AIABnwGIAaIJkgEDMi44mAEAoAEBsAEK&sclient=gws-wiz&tbs=lf:1,lf_ui:4&tbm=lcl&rflfq=1&num=10&rldimm=425615111808136386&lqi=ChljYXB0LiBqYWNrcyBmYW1pbHkgYnVmZmV0IgOIAQFI2Oj5sryugIAIWioQABABEAIQAxgAGAEYAhgDIhhjYXB0IGphY2tzIGZhbWlseSBidWZmZXSSARFidWZmZXRfcmVzdGF1cmFudJoBI0NoWkRTVWhOTUc5blMwVkpRMEZuU1VOUGNUUTJXa0pSRUFFqgEjEAEyHxABIhtkYOxvAEEUmUj2WhSyHC6JH-F_P7crMXEaKS_gAQA&ved=2ahUKEwjS1_G2x7H-AhUxtTEKHclvB_cQvS56BAgWEAE&sa=X&rlst=f#rlfi=hd:;si:425615111808136386,l,ChljYXB0LiBqYWNrcyBmYW1pbHkgYnVmZmV0IgOIAQFI2Oj5sryugIAIWioQABABEAIQAxgAGAEYAhgDIhhjYXB0IGphY2tzIGZhbWlseSBidWZmZXSSARFidWZmZXRfcmVzdGF1cmFudJoBI0NoWkRTVWhOTUc5blMwVkpRMEZuU1VOUGNUUTJXa0pSRUFFqgEjEAEyHxABIhtkYOxvAEEUmUj2WhSyHC6JH-F_P7crMXEaKS_gAQA;mv:[[30.1955067,-85.7794086],[30.161907099999993,-85.8386264]];tbs:lrf:!1m4!1u3!2m2!3m1!1e1!2m1!1e3!3sIAE,lf:1,lf_ui:4

URI.parse(url) raises the following error: lib/uri/rfc3986_parser.rb:66:in `split'. I believe this is caused by the regular expression not matching this URL.

Activity

duerst commented on May 17, 2023

Member

And the reason that the regular expression does not match the URI is that the relevant grammar (in RFC 3986) does not allow '[' or ']' in the fragment part (the part after the '#'). See https://www.rfc-editor.org/rfc/rfc3986#appendix-A, and look for 'fragment' and 'gen-delims'. The '[' and ']' characters are in gen-delims, and the only gen-delims that the fragment rule admits are ':', '@', '/', and '?'. As the filename where the error message originates makes clear, it's a parser for RFC 3986 URIs, so it had better follow that spec. That means that we can close this issue, because the Regexp matches the spec.
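To make the grammar point concrete, here is a minimal sketch that lists the characters in a fragment which the RFC 3986 `fragment` rule does not accept. The helper name `disallowed_fragment_chars` and the shortened sample fragment are assumptions made for illustration; neither is part of the uri library or the reported URL verbatim.

```ruby
# Literal characters accepted by the RFC 3986 `fragment` rule:
# unreserved / sub-delims / ":" / "@" / "/" / "?".
# Everything else (outside pct-encoded triplets) is disallowed.
DISALLOWED_IN_FRAGMENT = %r{[^A-Za-z0-9\-._~!$&'()*+,;=:@/?]}

# Hypothetical helper (not part of the uri library): returns the distinct
# characters in +fragment+ that the RFC 3986 grammar rejects.
def disallowed_fragment_chars(fragment)
  without_escapes = fragment.gsub(/%\h\h/, "") # drop well-formed pct-encoded triplets
  without_escapes.scan(DISALLOWED_IN_FRAGMENT).uniq
end

# Shortened stand-in for the fragment in the reported URL.
p disallowed_fragment_chars("mv:[[30.19,-85.77],[30.16,-85.83]]")
```

For the sample fragment, only the square brackets are reported; ':', ',' and '-' are all permitted by the grammar.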

The grammar in RFC 2396 (https://www.rfc-editor.org/rfc/rfc2396) is more lenient, and is available in lib/uri/rfc2396_parser.rb, so you may want to try it.
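As a rough sketch of that suggestion, the following tries the default parser first and falls back to `URI::RFC2396_Parser` when the URL is rejected. The helper name `parse_leniently` and the shortened URL are assumptions for illustration; the full URL from the report has not been checked here.

```ruby
require "uri"

# Hypothetical helper: try the default parser first and fall back to the
# more lenient RFC 2396 parser when the URL is rejected as invalid.
def parse_leniently(url)
  URI.parse(url)
rescue URI::InvalidURIError
  URI::RFC2396_Parser.new.parse(url)
end

# Shortened stand-in for the reported URL; the fragment contains '[' and ']'.
uri = parse_leniently("https://www.google.com/search?q=buffet#mv:[[30.19,-85.77],[30.16,-85.83]]")
puts uri.host
puts uri.fragment
```

The fallback keeps strict RFC 3986 parsing for well-formed input and only relaxes the rules for URLs that would otherwise raise, which may or may not be the right trade-off for a given application.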

[In Thunderbird, where I saw your message first, the URI is colored up to just before the first ':' in the fragment, and when I click on it, only the part before that ':' is sent to the browser, but both RFC 3986 and RFC 2396 allow ':' in fragments, so this behavior is difficult to explain.]

Pattern matches does not work for google search results · Issue #74 · ruby/uri