incorrect URL parsing #2114

Closed
trevnorris opened this issue Jul 6, 2015 · 15 comments
Labels
confirmed-bug: Issues with confirmed bugs.
http: Issues or PRs related to the http subsystem.
url: Issues and PRs related to the legacy built-in url module.

Comments

@trevnorris
Contributor

A regression was introduced between v0.10 and v0.12 in how the URL of an http.get() request is handled. Multi-byte characters are encoded as 'binary' instead of either

  1. Being encoded as UTF-8
  2. Being properly percent-encoded into their '%' counterparts.

Test and additional information is located at nodejs/node-v0.x-archive#25634 (comment)
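
For readers who want to see the symptom in isolation, here is a minimal sketch of the three encodings mentioned above. It assumes a modern Node.js with `Buffer.from` (the 2015 code used `new Buffer`), the sample path and the helper names are made up for illustration, and it does not go through `http.get()` itself.

```javascript
// Hypothetical sample path: 'é' is 2 bytes in UTF-8, '日' is 3 bytes.
const samplePath = '/é日';

// 'latin1' (alias 'binary') keeps only the low byte of each UTF-16 code unit,
// so anything above U+00FF is silently truncated.
function encodeAsBinary(str) {
  return Buffer.from(str, 'latin1');
}

// UTF-8 keeps every byte of every character.
function encodeAsUtf8(str) {
  return Buffer.from(str, 'utf8');
}

console.log(encodeAsBinary(samplePath)); // 3 bytes, one per character
console.log(encodeAsUtf8(samplePath));   // 6 bytes
console.log(encodeURI(samplePath));      // ASCII-only, percent-encoded UTF-8
```

The 'binary' result cannot be turned back into the original string, which is why the two alternatives in the list are preferable.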

@trevnorris added the http label Jul 6, 2015
@rvagg
Member

rvagg commented Jul 7, 2015

@trevnorris so to clarify, you're saying that io.js is impacted by this also (since we're not strictly an 0.12 fork that's not obvious)?

@trevnorris
Contributor Author

Yes. The test in the linked issue has the same result in io.js. A possible solution would be to simply pass the string through encodeURI() before turning it into a buffer using binary encoding.
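
A rough sketch of that idea, under the assumption that the path is a plain string that has not been escaped yet. `pathToRequestBytes` is a hypothetical helper name, not something in the http module.

```javascript
// Hypothetical helper: percent-encode first, then encode as 'latin1'.
function pathToRequestBytes(path) {
  const escaped = encodeURI(path);        // result contains only ASCII
  return Buffer.from(escaped, 'latin1');  // lossless for ASCII input
}

const bytes = pathToRequestBytes('/café');
console.log(bytes.toString('latin1')); // '/caf%C3%A9'

// Caveat: encodeURI() also escapes '%', so a path that is already
// escaped gets escaped a second time.
console.log(encodeURI('/caf%C3%A9')); // '/caf%25C3%25A9'
```

The double-escaping caveat is one reason doing this automatically could change behaviour for callers who already escape their paths.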

@vkurchatkin
Contributor

This could be related: #1693

@bnoordhuis
Member

It's probably the result of this change. Before f674b09, headers and the status line were parsed as UTF-8, now they're parsed as ISO-8859-1.
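
To illustrate the difference between the two decodings, here is a small sketch that decodes the same bytes both ways. It only uses `Buffer` directly and does not exercise the HTTP parser; `decodeBoth` is a hypothetical helper.

```javascript
// Decode the same bytes with the two encodings discussed above.
function decodeBoth(bytes) {
  return {
    utf8: bytes.toString('utf8'),
    latin1: bytes.toString('latin1'), // ISO-8859-1: one character per byte
  };
}

// Bytes a client would put on the wire if it sent the path as raw UTF-8.
const wire = Buffer.from('/café', 'utf8');
const decoded = decodeBoth(wire);

console.log(decoded.utf8);   // '/café'
console.log(decoded.latin1); // '/cafÃ©'
```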

@trevnorris
Contributor Author

@bnoordhuis Parsing the string this way follows the spec more closely. Even though browsers may show the Unicode characters in the URL bar, checking the network request shows that they also percent-encode them before firing the request. io.js' http module would also barf on this request since we decode incoming headers using ISO-8859-1.

@vkurchatkin It does look like the same issue. IMO the options are to let the user know they need to encode their header strings before sending them, or we should consider doing that automatically before turning them into a buffer.

@Fishrock123 added the confirmed-bug label Aug 27, 2015
@Fishrock123
Contributor

Duplicate of #1693. See also: #2629

@Fishrock123 added the duplicate label Aug 31, 2015
@vkurchatkin
Contributor

I wouldn't say it's a duplicate. There are two problems with UTF-8: parsing and writing, and they seem unrelated.

@Fishrock123 reopened this Aug 31, 2015
@Fishrock123 removed the duplicate label Aug 31, 2015
Flimm added a commit to Flimm/node that referenced this issue Sep 25, 2015
http would previously accept paths with non-ASCII characters. This
proved problematic, because multi-byte characters were encoded as
'binary', that is, the first byte was taken and the remaining bytes were
dropped for that character.

There is no sensible way to fix this without breaking backwards
compatibility for paths containing U+0080 to U+00FF characters.

We already reject paths with unescaped spaces with an exception. This
commit does the same for paths with non-ASCII characters too.

The alternative would have been to encode paths in UTF-8, but this would
cause the behaviour to silently change for paths with single-byte
non-ASCII characters (e.g. the copyright character U+00A9 ©). I find it
preferable to add to the existing prohibition of bad paths with
spaces.

Bug report: nodejs#2114
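
The kind of check the commit message describes could look roughly like the sketch below. This is not the code from the referenced commit; the function name, the regular expression, and the error message are all assumptions made for illustration.

```javascript
// Hypothetical check: reject unescaped spaces and any non-ASCII code unit.
const INVALID_PATH = /[ \u0080-\uffff]/;

function assertValidRequestPath(path) {
  if (INVALID_PATH.test(path)) {
    throw new TypeError('Request path contains unescaped characters');
  }
  return path;
}

// An escaped path passes; a raw one throws instead of being mangled.
console.log(assertValidRequestPath('/caf%C3%A9'));
try {
  assertValidRequestPath('/café');
} catch (err) {
  console.log(err.message);
}
```

Throwing makes the problem visible to the caller, at the cost of breaking code that currently sends such paths.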
@Trott
Member

Trott commented Apr 5, 2016

> IMO the options are to let the user know they need to encode their header strings before sending them, or we should consider doing that automatically before turning them into a buffer.

Is there consensus at this time that one of these two options is superior to the other?

@trevnorris
Contributor Author

@Trott Nope. May just want to throw this into the CTC meeting for a quick vote and a fast resolution.

@jasnell added the url label Jun 7, 2016
@jasnell
Member

jasnell commented Jul 3, 2016

Should be addressed by the WHATWG URL impl here: #7448

@ChALkeR
Member

ChALkeR commented Jul 3, 2016

@jasnell Does that change what http.get sends?

@jasnell
Member

jasnell commented Jul 20, 2016

@ChALkeR ... actually no, it doesn't, you're right.
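
For context, the WHATWG URL parser does percent-encode non-ASCII characters when it parses a string, so a caller can use it to build an escaped request path by hand. The sketch below assumes the global `URL` class available in current Node.js releases; `escapedRequestPath` is a hypothetical helper, and the snippet says nothing about what `http.get()` itself does with a raw string.

```javascript
// Hypothetical helper: let the WHATWG parser escape the path and query.
function escapedRequestPath(input) {
  const parsed = new URL(input);
  return parsed.pathname + parsed.search;
}

console.log(escapedRequestPath('http://example.com/café?q=日'));
// '/caf%C3%A9?q=%E6%97%A5'
```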

@jasnell
Member

jasnell commented May 1, 2017

Does this need to remain open?

@jasnell
Member

jasnell commented May 30, 2017

Closing given the lack of any further progress on this. It's not even clear if this is still an issue.

@jasnell closed this as completed May 30, 2017
@Flimm
Contributor

Flimm commented May 30, 2017

I've created a very similar issue (with a failing test-case) here: #13296
