Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http: simplify checkIsHttpToken() #17399

Closed
wants to merge 1 commit into from
Closed

http: simplify checkIsHttpToken() #17399

wants to merge 1 commit into from

Conversation

Trott
Copy link
Member

@Trott Trott commented Nov 30, 2017

Replace code optimized for older versions of V8 with more
straightforward code in checkIsHttpToken().

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • commit message follows commit guidelines
Affected core subsystem(s)

http

@nodejs-github-bot nodejs-github-bot added the http Issues or PRs related to the http subsystem. label Nov 30, 2017
@Trott
Copy link
Member Author

Trott commented Nov 30, 2017

Benchmark results:

                                                                 improvement confidence      p.value
 http/check_is_http_token.js n=1000000 key=":"                      -67.14 %        *** 4.065668e-34
 http/check_is_http_token.js n=1000000 key=":alternate-protocol"    -54.61 %        *** 5.705110e-28
 http/check_is_http_token.js n=1000000 key="((((())))"              -66.34 %        *** 2.636118e-33
 http/check_is_http_token.js n=1000000 key="@@"                     -67.24 %        *** 2.329542e-28
 http/check_is_http_token.js n=1000000 key="Accept-Ranges"          146.79 %        *** 2.298935e-45
 http/check_is_http_token.js n=1000000 key="alt-svc"                 -3.79 %        *** 3.200379e-04
 http/check_is_http_token.js n=1000000 key="alternate-protocol:"    150.80 %        *** 1.859951e-41
 http/check_is_http_token.js n=1000000 key="alternate-protocol"     239.27 %        *** 2.887460e-42
 http/check_is_http_token.js n=1000000 key="Cache-Control"          148.05 %        *** 1.588821e-41
 http/check_is_http_token.js n=1000000 key="Connection"              38.53 %        *** 9.847708e-29
 http/check_is_http_token.js n=1000000 key="Content-Encoding"       205.91 %        *** 1.416987e-37
 http/check_is_http_token.js n=1000000 key="content-length"         157.29 %        *** 1.049744e-37
 http/check_is_http_token.js n=1000000 key="Content-Location"       203.87 %        *** 1.005885e-38
 http/check_is_http_token.js n=1000000 key="content-type"            60.62 %        *** 1.051226e-40
 http/check_is_http_token.js n=1000000 key="Content-Type"            61.86 %        *** 1.993077e-30
 http/check_is_http_token.js n=1000000 key="date"                   -58.40 %        *** 4.130459e-34
 http/check_is_http_token.js n=1000000 key="ETag"                   -58.21 %        *** 2.484249e-43
 http/check_is_http_token.js n=1000000 key="Expires"                 -1.77 %            8.884764e-02
 http/check_is_http_token.js n=1000000 key="Keep-Alive"              38.08 %        *** 3.190244e-35
 http/check_is_http_token.js n=1000000 key="Last-Modified"          150.14 %        *** 4.677389e-55
 http/check_is_http_token.js n=1000000 key="location"                12.19 %        *** 1.952609e-14
 http/check_is_http_token.js n=1000000 key="server"                 -18.56 %        *** 6.947429e-20
 http/check_is_http_token.js n=1000000 key="Server"                 -18.91 %        *** 1.729007e-19
 http/check_is_http_token.js n=1000000 key="status"                 -19.17 %        *** 4.218003e-27
 http/check_is_http_token.js n=1000000 key="TCN"                    -61.80 %        *** 1.219346e-23
 http/check_is_http_token.js n=1000000 key="Transfer-Encoding"      223.55 %        *** 1.359534e-40
 http/check_is_http_token.js n=1000000 key="Vary"                   -59.13 %        *** 3.830565e-45
 http/check_is_http_token.js n=1000000 key="version"                 -6.03 %        *** 2.214569e-04
 http/check_is_http_token.js n=1000000 key="x-frame-options"        164.26 %        *** 1.243066e-30
 http/check_is_http_token.js n=1000000 key="x-xss-protection"       204.88 %        *** 1.000083e-35
 http/check_is_http_token.js n=1000000 key="中文呢"                 -31.09 %        *** 6.800282e-23

@Trott
Copy link
Member Author

Trott commented Nov 30, 2017

Benchmark summary seems to be:

  • Longer valid tokens perform much better with this change.
  • Shorter valid tokens take a perf hit with this change.
  • Invalid tokens where the invalid character appear early on in the string perform better with the existing code, but you get the opposite result if it appears late in the string.

I'm tempted to log arguments sent to checkIsHttpToken() in a real world app to see what kinds of strings are processed and how often.

@Trott
Copy link
Member Author

Trott commented Nov 30, 2017

@nodejs/v8

Copy link
Contributor

@cjihrig cjihrig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for readability.

@mscdex
Copy link
Contributor

mscdex commented Nov 30, 2017

I think I'd prefer a hybrid + expanded loop unrolling solution. The expanded unrolling would cover the most common http headers (based on headers listed on Wikipedia for example and also the list we use in _http_incoming.js when converting headers to lowercase) and the regexp would be used for larger, less common header names.

We could dynamically generate the function to avoid the lengthy function source code, it performs the same as the inline version:

const validTokens = [
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0 - 15
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 16 - 31
  0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, // 32 - 47
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, // 48 - 63
  0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 64 - 79
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, // 80 - 95
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 96 - 111
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, // 112 - 127
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 128 ...
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0  // ... 255
];
const tokenRegExp = /^[\^_`a-zA-Z\-0-9!#$%&'*+.|~]+$/;
function checkIsHttpToken(val) {
  if (val.length > 19) return tokenRegExp.test(val);
  if (!validTokens[val.charCodeAt(0)]) return false;
  if (val.length < 2) return true;
  if (!validTokens[val.charCodeAt(1)]) return false;
  if (val.length < 3) return true;
  if (!validTokens[val.charCodeAt(2)]) return false;
  if (val.length < 4) return true;
  if (!validTokens[val.charCodeAt(3)]) return false;
  if (val.length < 5) return true;
  if (!validTokens[val.charCodeAt(4)]) return false;
  if (val.length < 6) return true;
  if (!validTokens[val.charCodeAt(5)]) return false;
  if (val.length < 7) return true;
  if (!validTokens[val.charCodeAt(6)]) return false;
  if (val.length < 8) return true;
  if (!validTokens[val.charCodeAt(7)]) return false;
  if (val.length < 9) return true;
  if (!validTokens[val.charCodeAt(8)]) return false;
  if (val.length < 10) return true;
  if (!validTokens[val.charCodeAt(9)]) return false;
  if (val.length < 11) return true;
  if (!validTokens[val.charCodeAt(10)]) return false;
  if (val.length < 12) return true;
  if (!validTokens[val.charCodeAt(11)]) return false;
  if (val.length < 13) return true;
  if (!validTokens[val.charCodeAt(12)]) return false;
  if (val.length < 14) return true;
  if (!validTokens[val.charCodeAt(13)]) return false;
  if (val.length < 15) return true;
  if (!validTokens[val.charCodeAt(14)]) return false;
  if (val.length < 16) return true;
  if (!validTokens[val.charCodeAt(15)]) return false;
  if (val.length < 17) return true;
  if (!validTokens[val.charCodeAt(16)]) return false;
  if (val.length < 18) return true;
  if (!validTokens[val.charCodeAt(17)]) return false;
  if (val.length < 19) return true;
  if (!validTokens[val.charCodeAt(18)]) return false;
  return true;
}

Benchmarking the various solutions with 1e8 iterations, I see these results for various headers:

master this PR hybrid+expanded unroll
If-Unmodified-Since (19 chars) 10.7s 5.6s 3.4s
Content-Type (12 chars) 6.1s 4.3s 2.4s
Date (4 chars) 0.96s 3.0s 1.1s

@ofrobots
Copy link
Contributor

It would be good to get some real world benchmarking data; but lacking that, my preference would be for readability.

@apapirovski
Copy link
Member

Ignore my — now deleted — post re: for loop, testing on the wrong V8 version... It seems like the version @mscdex proposed is currently the best (albeit ugly) it gets.

@Trott
Copy link
Member Author

Trott commented Dec 1, 2017

@mscdex or anyone else: Do you have an explanation as to why your benchmarking shows the hybrid+expanded unroll faster for a 19-character value than this PR? I'm mystified as to how that could be and TBH it's making me look at those timings with a bit of side-eye....

@mscdex
Copy link
Contributor

mscdex commented Dec 1, 2017

@Trott because a regexp (or just a loop for that matter) has a lot more overhead than a series of if statements? That was why I had unrolled the loop a bit originally, although it was not much because of Crankshaft's inlining requirements.

@Trott
Copy link
Member Author

Trott commented Dec 1, 2017

@Trott because a regexp (or just a loop for that matter) has a lot more overhead than a series of if statements? That was why I had unrolled the loop a bit originally, although it was not much because of Crankshaft's inlining requirements.

@mscdex But for a 19-character string, your unrolled version uses a regexp too. So how is it faster for a 19-character string? Something isn't right...

@apapirovski
Copy link
Member

@Trott it's > 19 so that test string still uses the unrolled checks.

@Trott
Copy link
Member Author

Trott commented Dec 1, 2017

@Trott it's > 19 so that test string still uses the unrolled checks.

Ah! Off-by-one error in my brain.

@psmarshall
Copy link
Contributor

@Trott It looks like the regexp code is faster at processing 8 or more characters. By 'processing' I mean how many characters it actually looks at, not how long the input string is, e.g. it bails on out the first-character of ":alternate-protocol" and so is slower than the handwritten JS code. It looks like every string under 7 characters in length regresses.

My guess is that that is caused by the overhead of calling to the regexp builtin or some setup/initialization/allocation that we have to do for each match. I'm not an expert on regexps in v8 though.

The manual loop unrolling is concerning - this should definitely not be necessary to do by hand.

I like this code much better just based on readability. Parsing time will also be lower which is nice. Inlining decisions are more complicated than just the pure length of the function now - I think we have a higher budget for extremely small functions, so this could potentially help there too.

One more thing to think about - is this function even run as optimized code on a server? i.e. is it actually called enough times with reasonably stable input types. I don't have any intuition there. The microbenchmarks probably aren't stressing this code in the same way a real server would.

I'd suggest the following:

  1. Log a v8 issue for the regexp being slower for < 7 characters. It could just be the overhead of calling to a builtin and we might not be able to do much about it - but worth confirming that at least, because it looks like that is stopping people from writing more idiomatic code.
  2. Log a v8 issue for the manual loop unrolling/iteration peeling being faster. I've seen this elsewhere in node code so maybe it is a wider issue. Ideally the optimizing compiler should be better at doing this than humans, so it is definitely worth looking into as well.

@apapirovski
Copy link
Member

apapirovski commented Dec 8, 2017

I'm pretty in favour of landing this very soon (with the caveat we don't use it for v8.x or lower). @mscdex would you still like this to use unrolled checks? Could you make your request/objection more explicit if so, as otherwise this will end up landing eventually given the 3 approvals (incl 2 from TSC).

@mscdex
Copy link
Contributor

mscdex commented Dec 8, 2017

@apapirovski to be honest a lot of the performance suggestions I make are too much for most people, so just take the suggestions/benchmark results I posted as some food for thought.

@apapirovski
Copy link
Member

/**
* Verifies that the given val is a valid HTTP token
* per the rules defined in RFC 7230
* See https://tools.ietf.org/html/rfc7230#section-3.2.6
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to keep this comment. When looking through the code it is good to have a reference handy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BridgeAR Restored.

Replace code optimized for older versions of V8 with more
straightforward code in checkIsHttpToken().
@apapirovski
Copy link
Member

Landed in 9f55eac 🎉

apapirovski pushed a commit that referenced this pull request Dec 10, 2017
Replace code optimized for older versions of V8 with more
straightforward code in checkIsHttpToken().

PR-URL: #17399
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Anatoli Papirovski <apapirovski@mac.com>
Reviewed-By: Timothy Gu <timothygu99@gmail.com>
@Trott
Copy link
Member Author

Trott commented Dec 10, 2017

Any volunteers to open the two issues for V8 as suggested by @psmarshall? I'll do it if not one else does, but I feel like someone who better understands V8 and benchmarking would do a better job... @nodejs/v8

@bmeurer
Copy link
Member

bmeurer commented Dec 11, 2017

Great job! 👍

MylesBorins pushed a commit that referenced this pull request Dec 12, 2017
Replace code optimized for older versions of V8 with more
straightforward code in checkIsHttpToken().

PR-URL: #17399
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Anatoli Papirovski <apapirovski@mac.com>
Reviewed-By: Timothy Gu <timothygu99@gmail.com>
MylesBorins pushed a commit that referenced this pull request Dec 12, 2017
Replace code optimized for older versions of V8 with more
straightforward code in checkIsHttpToken().

PR-URL: #17399
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Anatoli Papirovski <apapirovski@mac.com>
Reviewed-By: Timothy Gu <timothygu99@gmail.com>
@MylesBorins MylesBorins mentioned this pull request Dec 12, 2017
@mathiasbynens
Copy link
Contributor

mathiasbynens commented Dec 12, 2017

Any volunteers to open the two issues for V8 as suggested by @psmarshall? I'll do it if no one else does, but I feel like someone who better understands V8 and benchmarking would do a better job…

@Trott Please go ahead and file those issues. You’ve got this! 👍

@Trott
Copy link
Member Author

Trott commented Dec 12, 2017

V8 issues opened at https://bugs.chromium.org/p/v8/issues/detail?id=7200 and https://bugs.chromium.org/p/v8/issues/detail?id=7201.

sethbrenith added a commit to sethbrenith/node that referenced this pull request Feb 8, 2018
In the spirit of [17399](nodejs#17399), we
can also simplify checkInvalidHeaderChar to use regex matching instead
of a loop. This makes it faster on long matches and slower on short
matches or non-matches. This change also includes some sample data from
an AcmeAir benchmark run, as a rough proxy for real-world data.
BridgeAR pushed a commit to BridgeAR/node that referenced this pull request Feb 16, 2018
In the spirit of [17399](nodejs#17399),
we can also simplify checkInvalidHeaderChar to use regex matching
instead of a loop. This makes it faster on long matches and slower
on short matches or non-matches. This change also includes some
sample data from an AcmeAir benchmark run, as a rough proxy for
real-world data.

PR-URL: nodejs#18381
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: Benedikt Meurer <benedikt.meurer@gmail.com>
Reviewed-By: Tiancheng "Timothy" Gu <timothygu99@gmail.com>
MylesBorins pushed a commit that referenced this pull request Feb 21, 2018
In the spirit of [17399](#17399),
we can also simplify checkInvalidHeaderChar to use regex matching
instead of a loop. This makes it faster on long matches and slower
on short matches or non-matches. This change also includes some
sample data from an AcmeAir benchmark run, as a rough proxy for
real-world data.

PR-URL: #18381
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: Benedikt Meurer <benedikt.meurer@gmail.com>
Reviewed-By: Tiancheng "Timothy" Gu <timothygu99@gmail.com>
MylesBorins pushed a commit that referenced this pull request Feb 21, 2018
In the spirit of [17399](#17399),
we can also simplify checkInvalidHeaderChar to use regex matching
instead of a loop. This makes it faster on long matches and slower
on short matches or non-matches. This change also includes some
sample data from an AcmeAir benchmark run, as a rough proxy for
real-world data.

PR-URL: #18381
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: Benedikt Meurer <benedikt.meurer@gmail.com>
Reviewed-By: Tiancheng "Timothy" Gu <timothygu99@gmail.com>
MayaLekova pushed a commit to MayaLekova/node that referenced this pull request May 8, 2018
In the spirit of [17399](nodejs#17399),
we can also simplify checkInvalidHeaderChar to use regex matching
instead of a loop. This makes it faster on long matches and slower
on short matches or non-matches. This change also includes some
sample data from an AcmeAir benchmark run, as a rough proxy for
real-world data.

PR-URL: nodejs#18381
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: Benedikt Meurer <benedikt.meurer@gmail.com>
Reviewed-By: Tiancheng "Timothy" Gu <timothygu99@gmail.com>
@Trott Trott deleted the istoken branch January 13, 2022 22:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
http Issues or PRs related to the http subsystem.
Projects
None yet
Development

Successfully merging this pull request may close these issues.