url.parse seems to change depending on the characters used in the domain name #5832

sam-github · 2016-03-21T18:37:55Z

First, the behaviour:

URL	host	path
http://*/path		/*/path
http://./path	.	/path
http://=/path		/=/path
http://-/path	-	/path
http://0/path	0	/path
http://,/path		/,/path
http://@/path		/path
http://;/path		;/path
http://[::1]/path	[::1]	/path

I would expect that in all of the above, that the path would be /path. From my point of view, random non-URL syntax characters are being pushed into the path, and its pretty surprising.

There are some statements in the test code that make this appear
to be deliberate, but they don't justify the behaviour.

While it is true that * is not a valid domain, according to the host parsing rules quoted, neither is -, or 0 or .. I would expect . to be treated as ., returned in the host string.

I would not expect the url parser to validate that domain names are well formed, though I would expect characters that are defined as part of the URL syntax to of course not be valid.

I can implement my own url parser that allows *, and I will for backwards compat, but I think this is a bit odd. URL is generally very lax in its parsing, it gives you the syntactic bits, and you get to validate whether they are correct for your use-case, this is the first time its failed my expectations.

Version: 0.10+
Platform: all
Subsystem: url

The text was updated successfully, but these errors were encountered:

Fishrock123 · 2016-03-21T22:31:53Z

I think some characters are disallowed by spec or potentially dangerous etc.

cc @nodejs/http probably.

sam-github · 2016-03-21T22:31:53Z

For what its worth, chrome also parses http://*/path with host/hostname being *, and /path being the pathname.

sam-github · 2016-03-21T22:35:31Z

@Fishrock123 there are multiple specs, and we reference two behavioural sources (I hestitate to call them specs), browsers (and we don't do it like Chrome, for example), and the other I link above, where we also follow a fairly random subset of it. * isn't dangerous.

If you look at the code, you can see the code seems to be written as to work as I expected it would... and then near the end there is a call to a fairly incomplete validate function that then rearranges what we just parsed :-(

dougwilson · 2016-03-24T02:17:21Z

I would expect that the parser either following the liberal-ness of the WhatWG spec, or completely reject the URL, rather than be in the weird in-between state it is currently.

jasnell · 2016-03-24T05:11:06Z

At this point, for better or worse, the WhatWG spec is likely the most authoritative source. Specifically, this: https://url.spec.whatwg.org/

A fairly comprehensive corpus of tests have been put together here: https://github.com/w3c/web-platform-tests/blob/master/url/urltestdata.json

We should definitely be working to validate against that suite.

MylesBorins · 2016-03-24T06:06:53Z

/cc @nodejs/testing

jasnell · 2016-03-24T06:08:20Z

Ref: #5885

sam-github · 2016-04-01T17:07:54Z

Unfortunately, the tests in #5858 don't includ a host with * in it.

From my reading of https://url.spec.whatwg.org/#host-parsing:

If asciiDomain contains U+0000, U+0009, U+000A, U+000D, U+0020, "#", "%", "/", ":", "?", "@", "[", "", or "]", syntax violation, return failure.

* should not be rejected syntactically as a host. The chars above are all pretty obviously ones that have syntactic meaning in URLs (unlike *).

sam-github · 2016-04-01T17:08:31Z

Unfortunately

which was not intended to suggest importing them isn't a great idea, @jasnell

jasnell · 2017-05-30T04:39:48Z

The current answer to this issue is: use the new WHATWG URL parser as an alternative as this is not likely to be fixed. Closing. We can reopen if necessary.

sam-github added url Issues and PRs related to the legacy built-in url module. semver-major PRs that contain breaking changes and should be released in the next major version. labels Mar 21, 2016

Fishrock123 removed the semver-major PRs that contain breaking changes and should be released in the next major version. label Mar 21, 2016

jasnell mentioned this issue Apr 1, 2016

test: add failing url parse tests as known_issue #5885

Closed

4 tasks

jasnell mentioned this issue Jun 1, 2016

proposal: WHATWG URL standard implementation nodejs/node-eps#28

Closed

jasnell closed this as completed May 30, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

url.parse seems to change depending on the characters used in the domain name #5832

url.parse seems to change depending on the characters used in the domain name #5832

sam-github commented Mar 21, 2016

Fishrock123 commented Mar 21, 2016

sam-github commented Mar 21, 2016

sam-github commented Mar 21, 2016

dougwilson commented Mar 24, 2016

jasnell commented Mar 24, 2016

MylesBorins commented Mar 24, 2016

jasnell commented Mar 24, 2016

sam-github commented Apr 1, 2016

sam-github commented Apr 1, 2016

jasnell commented May 30, 2017

url.parse seems to change depending on the characters used in the domain name #5832

url.parse seems to change depending on the characters used in the domain name #5832

Comments

sam-github commented Mar 21, 2016

Fishrock123 commented Mar 21, 2016

sam-github commented Mar 21, 2016

sam-github commented Mar 21, 2016

dougwilson commented Mar 24, 2016

jasnell commented Mar 24, 2016

MylesBorins commented Mar 24, 2016

jasnell commented Mar 24, 2016

sam-github commented Apr 1, 2016

sam-github commented Apr 1, 2016

jasnell commented May 30, 2017