Skip to content
This repository has been archived by the owner on Jul 31, 2018. It is now read-only.

proposal: WHATWG URL standard implementation #28

Closed
wants to merge 1 commit into from

Conversation

jasnell
Copy link
Member

@jasnell jasnell commented Jun 1, 2016

The WHATWG URL Standard specifies updated syntax,
parsing and serialization of URLs as currently implemented by the main Web Browsers.
The existing Node.js url module parsing and serialization implementation currently
does not support the URL standard and fails to pass about 160 of the standard tests.

This proposal is to implement the WHATWG URL Standard by introducing a new
URL class off the url module (e.g. require('url').URL).

The existing url module would remain unchanged and there should be no
backwards compatibility concerns.

Example

const url = new URL('http://user:pass@example.org:1234/p/a/t/h?xyz=abc#hash');

console.log(url.protocol;      // http:
console.log(url.username);     // user
console.log(url.password);     // password
console.log(url.host);         // example.org:1234
console.log(url.hostname);     // example.org
console.log(url.port);         // 1234
console.log(url.pathname);     // /p/a/t/h
console.log(url.search);       // ?xyz=abc
console.log(url.searchParams); // SearchParams object
console.log(url.hash);         // hash

// The SearchParams object is defined by the WhatWG spec also
url.searchParams.append('key', 'value');

console.log(url);
 // http://user:pass@example.org:1234/p/a/t/h?xyz=abc&key=value#hash

@jasnell
Copy link
Member Author

jasnell commented Jun 1, 2016

@nodejs/collaborators

@seishun
Copy link

seishun commented Jun 1, 2016

Perhaps we could raise interest in this by providing examples of failing tests.

@jasnell
Copy link
Member Author

jasnell commented Jun 1, 2016

@seishun ... https://github.com/nodejs/node/blob/master/test/known_issues/test-url-parse-conformance.js :-)

We currently fail somewhere around 140+ of the test cases in the WhatWG set.

@MylesBorins
Copy link

+1

Considering that the browser are exposing the global I think it makes a whole bunch of sense. I'd be interested in helping out with this. Have you broken ground on implementation @jasnell?

Can we borrow from implementations at all?

```js
const url = new URL('http://user:pass@example.org:1234/p/a/t/h?xyz=abc#hash');

console.log(url.protocol; // http:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing ')'

@jasnell
Copy link
Member Author

jasnell commented Jun 1, 2016

@thealphanerd ... yeah, I've got it mostly implemented already. The next step is to start running it through it's paces with tests and benchmarks and to find ways of optimizing the implementation. It's currently quite a bit slower than the existing require('url') module parsing.

@jasnell
Copy link
Member Author

jasnell commented Jun 1, 2016

Regarding borrowing from other impls, it's entirely possible that we could borrow from chromes implementation. I'm not sure yet if theirs is a pure JS impl or not. I'll look into that.

addition of a single `require('url').URL` export that can be used an an
alternative way of the getting a reference to the new `URL` constructor.

Existing locations within core that currently accept a URL string or the
Copy link

@yorkie yorkie Jun 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need to use the lower case "url" to stand for the url string to keep consistence with IDL source in https://url.spec.whatwg.org/

@yorkie
Copy link

yorkie commented Jun 1, 2016

@jasnell How about the following 2 static methods which are defined at standard IDL:

  • domainToASCII(domain)
  • domainToUnicode(domain)

I didn't see those in the proposal :-(

@jasnell
Copy link
Member Author

jasnell commented Jun 1, 2016

Still considering those. They are easy enough to implement given the
punycode module but I'm not sure how extensively they're used.
On Jun 1, 2016 9:40 AM, "Yorkie Liu" notifications@github.com wrote:

@jasnell https://github.com/jasnell How about the following 2 static
methods which is defined at IDL:

  • domainToASCII(domain)
  • domainToUnicode(domain)

I didn't see those in the proposal :-(


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#28 (comment), or mute
the thread
https://github.com/notifications/unsubscribe/AAa2eVA3NG0j4lEZnCUSro0RREhVL5wBks5qHbXqgaJpZM4IrtaC
.

@jasnell
Copy link
Member Author

jasnell commented Jun 1, 2016

There will also be other differences. For instance, I'm not sure if we need
the host parsing component.
On Jun 1, 2016 9:42 AM, "James M Snell" jasnell@gmail.com wrote:

Still considering those. They are easy enough to implement given the
punycode module but I'm not sure how extensively they're used.
On Jun 1, 2016 9:40 AM, "Yorkie Liu" notifications@github.com wrote:

@jasnell https://github.com/jasnell How about the following 2 static
methods which is defined at IDL:

  • domainToASCII(domain)
  • domainToUnicode(domain)

I didn't see those in the proposal :-(


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#28 (comment), or mute
the thread
https://github.com/notifications/unsubscribe/AAa2eVA3NG0j4lEZnCUSro0RREhVL5wBks5qHbXqgaJpZM4IrtaC
.

@targos
Copy link
Member

targos commented Jun 1, 2016

I'm not sure yet if theirs is a pure JS impl or not.

There is https://github.com/jsdom/whatwg-url

cc @domenic

@jasnell
Copy link
Member Author

jasnell commented Jun 1, 2016

Yeah, I'm familiar with (and use) whatwg-url. @domenic, how would you feel about the possibility of pulling the whatwg-url implementation into core? I'm not yet that familiar with how it is implemented internally but having a url standard compliant url parser built into core would be very good.

@domenic
Copy link

domenic commented Jun 1, 2016

Yep, that's @Sebmaster's most excellent work. It's not super optimized, but would be a good starting point.

This thread is very exciting and I'm glad there is an appetite for the idea!! The idea of a global, the same as in browsers, is great.

I think there are two separable problems here:

  • Parsing URLs according to the URL Standard. This could possibly even be done for require("url") in a semver-major bump, as mostly it will change edge cases.
  • Introducing a URL Standard-compliant API for URL manipulation. There are a couple things to note here:
    • The URL Standard uses the USVString type, which (per a convoluted series of specs) essentially means that unpaired surrogates get censored. This is generally a good property, and recommended by Unicode, as network boundaries do not expect unpaired surrogates. But it's not immediately obvious from the algorithms.
    • The URL Standard takes great pains to ensure that the URL is always consistent and usable, by using setters and getters. So e.g. if you do url.pathname = "\uDC01", then that will initiate a parse in the pathname override state, and then if you trigger the getter, you'll find url.pathname === "/%EF%BF%BD". Note how this both appends the leading slash, and also performs the USVString surrogate pair censoring (decodeURIComponent("%EF%BF%BD").charCodeAt(0) === 0xFFFD, not 0xDC01). This is a fairly important property IMO.

There are tests of the URL Standard at https://github.com/w3c/web-platform-tests/tree/master/url, and whatwg-url has a runner. The coverage is pretty reasonable; see web-platform-tests/wpt#3018

@domenic
Copy link

domenic commented Jun 1, 2016

@domenic, how would you feel about the possibility of pulling the whatwg-url implementation into core?

Oh, that'd be very cool! I guess my only concern is that we weren't concerned with speed when writing it, so there is probably lots of low-hanging fruit for performance improvements.

You'd also need to do a bit of work to decouple it from webidl-conversions and webidl2js. If you npm install it you'll find that it follows the same impl/wrapper strategy as browsers do, where there's a "wrapper" that takes care of USVString conversion and so on, delegating to the "impl" where parsing occurs, which in turn delegates to the URL state machine code. At least one of these layers could be disintermediated, although perhaps benchmarking should be done to see if that's actually the area where most improvement could be made.

@jasnell
Copy link
Member Author

jasnell commented Jun 1, 2016

Ok, I'll dig in and explore the whatwg-url internals and see what can be done reasonably. Before getting too deep into this I'd definitely like to get more +1's from collaborators tho. I'd really like to see this happen tho.

@jasnell
Copy link
Member Author

jasnell commented Jun 1, 2016

I think there are two separable problems here

Definitely agree that it's worth separating these. Not sure about modifying the existing require('url') too much tho -- ideally once the global URL is there for a while we could simply deprecate require('url') (and possibly even require('querystring')) entirely with an eye towards reducing the Node.js-specific API surface area.

@Qard
Copy link
Member

Qard commented Jun 1, 2016

Well, you can have a 👍 from me! More browser/server unity would be great. 😺

@Qard
Copy link
Member

Qard commented Jun 1, 2016

BTW, I like the idea of the global, since that's what browsers do, but I'd recommend not exposing it in require('url') because then people might get used to it being there rather than available globally, making future removal of the module more difficult.

@jasnell
Copy link
Member Author

jasnell commented Jun 1, 2016

The only concern I would have with that, @Qard, is that if there is existing code that does URL = whatever, there would be no way of recovering the original global. That should be a limited edge case, however.

@chrisdickinson
Copy link

This is not an argument for or against, but a request for more background. You've described what work you would like to do, and how you would do it — but I don't see much in the way of "why" we'd want to do this. What is not working about the status quo that we will rectify by adding a new global object? What value do Node users gain, concretely, from the URL object?

@Qard
Copy link
Member

Qard commented Jun 1, 2016

Maybe store it on process.URL for safe keeping?

@MylesBorins
Copy link

@chrisdickinson personally I see quite a bit of benefit in minimizing the delta between node + the browser as far as utility API's like this are concerned.

@jasnell
Copy link
Member Author

jasnell commented Jun 1, 2016

@chrisdickinson ... the why is straightforward: Currently Node.js' URL parsing has a number of issues in terms of not following the standardized behavior implemented by browsers. Examples of those failures can be seen in the test case I referenced here. There are also differences in the Node.js provided API that are largely unnecessary. This work would give us an opportunity to not only provide more robust URL parsing, but to provide a unified, non-Node.js specific API.

@Qard ... that's certainly a possibility

@jasnell
Copy link
Member Author

jasnell commented Jun 1, 2016

FWIW, looks like I mis-remembered the number of failures ;-) ... here's the exact count:

bash-3.2$ ./node test/known_issues/test-url-parse-conformance.js 
Unknown globals: URL

assert.js:90
  throw new assert.AssertionError({
  ^
AssertionError: 160 failed tests (out of 352)
    at Object.<anonymous> (/Users/james/Node/main/node/test/known_issues/test-url-parse-conformance.js:57:8)
    at Module._compile (module.js:541:32)
    at Object.Module._extensions..js (module.js:550:10)
    at Module.load (module.js:458:32)
    at tryModuleLoad (module.js:417:12)
    at Function.Module._load (module.js:409:3)
    at Function.Module.runMain (module.js:575:10)
    at startup (node.js:152:18)
    at node.js:449:3
bash-3.2$ 


## Description

The WhatWG URL Standard specifies updated syntax, parsing and serialization of
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/WhatWG/WHATWG/g

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

acknowledged. Will fix that in the next update.

@jasnell
Copy link
Member Author

jasnell commented Jul 1, 2016

EPS proposal updated. /cc @nodejs/ctc

APIs.

Initially, the implementation would be introduced as an undocumented
experimental feature exposed via a new `URL` property in the `url` module.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's already a Url property that points to the old constructor/parser — this is likely to cause confusion. If we go down the route of supporting a new URL parser, we may as well expose it as a global. Code written for the browser expects it to be a global, we may as well avoid artificially increasing the delta between browser JS and server JS.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm all for it being a global.
On Jul 2, 2016 10:18 AM, "Chris Dickinson" notifications@github.com wrote:

In XXX-url.md
#28 (comment):

+## Description
+
+The WHATWG URL Standard specifies updated syntax, parsing and serialization of
+URLs as currently implemented by the main Web Browsers. The existing Node.js
+url module parsing and serialization implementation currently does not support
+the URL standard and fails to pass 160 of the WHATWG URL parsing tests.
+
+This proposal is to implement the WHATWG URL Standard by modifying the existing
+url module to provide an implementation of the URL object and associated
+APIs. Doing so improves the robustness of URL parsing, provides consistency
+with browser js code, and can eventually allow the reduction of node.js specific
+APIs.
+
+Initially, the implementation would be introduced as an undocumented
+experimental feature exposed via a new URL property in the url module.

There's already a Url property that points to the old constructor/parser
— this is likely to cause confusion. If we go down the route of
supporting a new URL parser
, we may as well expose it as a global. Code
written for the browser expects it to be a global, we may as well avoid
artificially increasing the delta between browser JS and server JS.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/nodejs/node-eps/pull/28/files/a0a892210e9fa20f684247dd58e838a9cd333df1#r69380482,
or mute the thread
https://github.com/notifications/unsubscribe/AAa2ecg3l3xtEecYClWoO40LgnxvIjUEks5qRp1pgaJpZM4IrtaC
.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also fine with that. Already have encodeURI(), etc.

@jasnell
Copy link
Member Author

jasnell commented Jul 6, 2016

Putting this on the ctc-agenda as I believe it's ready for discussion/review.

@evanlucas
Copy link

I would be against this. Browsers don't parse special schemes this way, and we should not either. git should be parsed the same as data, I believe.

Does npm rely on the parsing of urls with support for the git protocol? If so, do we have a migration path for that?

@trevnorris
Copy link
Contributor

There's a certain loss of functionality by using new URL() over url.parse(). Specifically that url.parse() isn't just for URLs but also for URIs. For example: url.parse('github.com') works fine but new URL('github.com') throws.

Taking the following excerpt from the WHATWG URL specification:

Standardize on the term URL. URI and IRI are just confusing. In practice a single algorithm is used for both so keeping them distinct is not helping anyone. URL also easily wins the search result popularity contest.

It seems that URL is not meant to mean only URL. Even though there a scheme is always necessary (whether absolute or relative URL). Because of this any scheme is acceptable, but possibly parsed as expected. For example:

new URL('git://github.com/foo/bar').pathname == '//github.com/foo/bar';

Even though git: has been provisionally registered with the IANA it is still not parsed as a standard domain. It should be possible to properly properly parse the pathname from the above.

Taking from RFC3986 we can see a simple generic way of parsing the URL that I think could be used at minimum to parse unfamiliar schemes:

         foo://example.com:8042/over/there?name=ferret#nose
         \_/   \______________/\_________/ \_________/ \__/
          |           |            |            |        |
       scheme     authority       path        query   fragment
          |   _____________________|__
         / \ /                        \
         urn:example:animal:ferret:nose

@domenic

I would be against this. Browsers don't parse special schemes this way, and we should not either. git should be parsed the same as data, I believe.

Not to be rude, but check your own work first. Chrome has added an exception to URL() for its own needs:

new URL('chrome-extension://adlfkjaslkfj/foo/bar.html').pathname === '/foo/bar.html';

Additionally, URL() today doesn't parse unknown schemes as data:

const u = new URL('git://github.com/foo/bar');
u.protocol === 'git:';
u.origin === 'git://';

All this said, we should support other/unknown schemes properly with URL().

@domenic
Copy link

domenic commented Jul 13, 2016

It's already been clarified upthread that I was incorrect about special schemes. There are no special schemes besides file:

@trevnorris
Copy link
Contributor

@domenic oop. sorry about that. completely glanced over that comment.

@jasnell
Copy link
Member Author

jasnell commented Jul 16, 2016

Adding git as a special scheme would be trivial if necessary. If necessary
we should first try to upstream such a change into the URL spec itself.

On Jul 13, 2016 10:13 PM, "Trevor Norris" notifications@github.com wrote:

@domenic https://github.com/domenic oop. sorry about that. completely
glanced over that comment.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#28 (comment), or mute
the thread
https://github.com/notifications/unsubscribe/AAa2eczDLckC0l3fAaMSqdd27bHxSOJVks5qVcV-gaJpZM4IrtaC
.

@jasnell
Copy link
Member Author

jasnell commented Jul 25, 2016

After further testing, git URLs are handled by this implementation without any problems whatsoever using the implementation in #28.

@jasnell
Copy link
Member Author

jasnell commented Aug 10, 2016

The CTC discussed this today and has decided to advance this to DRAFT status with the exception of the URL constructor not being a global. I will get the PR updated to remove part about it being a global and will get it landed. Thank you all.

@yorkie
Copy link

yorkie commented Aug 11, 2016

By the way, after the core implemented the WHATWG URL standard, I think also should start deprecating the path in the http module at here where we were using the path in request option.

@jasnell
Copy link
Member Author

jasnell commented Aug 11, 2016

@yorkie ... eventually, once the new URL object is no longer experimental, my goal is to replace use of the existing url.parse within the http module entirely. That's not going to be easy tho. It's going to take a long time.

jasnell added a commit that referenced this pull request Aug 11, 2016
@jasnell
Copy link
Member Author

jasnell commented Aug 11, 2016

Landed in f16f770

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.