Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

url.parse seems to change depending on the characters used in the domain name #5832

Closed
sam-github opened this issue Mar 21, 2016 · 10 comments
Closed
Labels
url Issues and PRs related to the legacy built-in url module.

Comments

@sam-github
Copy link
Contributor

First, the behaviour:

URL host path
http://*/path /*/path
http://./path . /path
http://=/path /=/path
http://-/path - /path
http://0/path 0 /path
http://,/path /,/path
http://@/path /path
http://;/path ;/path
http://[::1]/path [::1] /path

I would expect that in all of the above, that the path would be /path. From my point of view, random non-URL syntax characters are being pushed into the path, and its pretty surprising.

There are some statements in the test code that make this appear
to be deliberate, but they don't justify the behaviour.

While it is true that * is not a valid domain, according to the host parsing rules quoted, neither is -, or 0 or .. I would expect . to be treated as ., returned in the host string.

I would not expect the url parser to validate that domain names are well formed, though I would expect characters that are defined as part of the URL syntax to of course not be valid.

I can implement my own url parser that allows *, and I will for backwards compat, but I think this is a bit odd. URL is generally very lax in its parsing, it gives you the syntactic bits, and you get to validate whether they are correct for your use-case, this is the first time its failed my expectations.

  • Version: 0.10+
  • Platform: all
  • Subsystem: url
@sam-github sam-github added url Issues and PRs related to the legacy built-in url module. semver-major PRs that contain breaking changes and should be released in the next major version. labels Mar 21, 2016
@Fishrock123
Copy link
Contributor

I think some characters are disallowed by spec or potentially dangerous etc.

cc @nodejs/http probably.

@sam-github
Copy link
Contributor Author

For what its worth, chrome also parses http://*/path with host/hostname being *, and /path being the pathname.

@Fishrock123 Fishrock123 removed the semver-major PRs that contain breaking changes and should be released in the next major version. label Mar 21, 2016
@sam-github
Copy link
Contributor Author

@Fishrock123 there are multiple specs, and we reference two behavioural sources (I hestitate to call them specs), browsers (and we don't do it like Chrome, for example), and the other I link above, where we also follow a fairly random subset of it. * isn't dangerous.

If you look at the code, you can see the code seems to be written as to work as I expected it would... and then near the end there is a call to a fairly incomplete validate function that then rearranges what we just parsed :-(

@dougwilson
Copy link
Member

I would expect that the parser either following the liberal-ness of the WhatWG spec, or completely reject the URL, rather than be in the weird in-between state it is currently.

@jasnell
Copy link
Member

jasnell commented Mar 24, 2016

At this point, for better or worse, the WhatWG spec is likely the most authoritative source. Specifically, this: https://url.spec.whatwg.org/

A fairly comprehensive corpus of tests have been put together here: https://github.com/w3c/web-platform-tests/blob/master/url/urltestdata.json

We should definitely be working to validate against that suite.

@MylesBorins
Copy link
Contributor

/cc @nodejs/testing

@jasnell
Copy link
Member

jasnell commented Mar 24, 2016

Ref: #5885

@sam-github
Copy link
Contributor Author

Unfortunately, the tests in #5858 don't includ a host with * in it.

From my reading of https://url.spec.whatwg.org/#host-parsing:

If asciiDomain contains U+0000, U+0009, U+000A, U+000D, U+0020, "#", "%", "/", ":", "?", "@", "[", "", or "]", syntax violation, return failure.

* should not be rejected syntactically as a host. The chars above are all pretty obviously ones that have syntactic meaning in URLs (unlike *).

@sam-github
Copy link
Contributor Author

Unfortunately

which was not intended to suggest importing them isn't a great idea, @jasnell

@jasnell
Copy link
Member

jasnell commented May 30, 2017

The current answer to this issue is: use the new WHATWG URL parser as an alternative as this is not likely to be fixed. Closing. We can reopen if necessary.

@jasnell jasnell closed this as completed May 30, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
url Issues and PRs related to the legacy built-in url module.
Projects
None yet
Development

No branches or pull requests

5 participants