Status of running node core benchmarks #127

Open

gareth-ellis opened this issue Aug 4, 2017 · 29 comments

@gareth-ellis
Member

Since the PR was closed once I'd merged in my changes, I figured I'd open this to track current progress on getting this working.

This is a continuation of #58 .

Current status is that I have the job set up; however, it currently fails because Rscript is missing. I have opened nodejs/build#821 to get this installed.

In the meantime I am going to create a branch in benchmarking with a change to not use Rscript, which we can then use for continued testing.

@gareth-ellis
Member Author

gareth-ellis commented Aug 10, 2017

This now seems to be working, and is ready for testing. @mcollina you mentioned you had some PRs to try it on.

At the moment it is set up to test a PR's impact against master.

There are four parameters available:
BRANCH - defaults to master
PULL_ID - the PR number
CATEGORY - the folder of benchmarks you want to run from node/benchmark
FILTER - to limit the selection of benchmarks in that folder.

Please take a look and let me know how it goes!

Example of output:
https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/8/

URL to launch from:
https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/build?delay=0sec

NOTE: Be careful to understand how long the run you are submitting will take.
Here is a guide that @AndreasMadsen wrote a year or so ago: each number (time in seconds) that you intend to run needs multiplying by 60, as the script will run 30 iterations of the "old" build and 30 of the "new" build.
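As a rough worked example (illustrative numbers only): a benchmark file with six configurations that each run for 5 seconds works out to about 6 × 5 × 60 = 1800 seconds, i.e. roughly half an hour for that one file.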

@mcollina
Member

I'm currently on vacation; I'll get this going ASAP when I come back (1 week).

@AndreasMadsen
Member

ProTip: to multiply by 60, just read the seconds as minutes.

@gareth-ellis
Member Author

Thoughts: currently it's taking ages to clone node core. What are thoughts on cloning once on the machine, then copying that over to the workspace and doing a pull of origin master?

Perhaps we could update the central clone once a week?

Investigating why the run @mcollina submitted failed - will update with findings.

@gareth-ellis
Member Author

gareth-ellis commented Aug 28, 2017

How to: Running core benchmarks on Node.js CI

Current function

The current Jenkins job aims to allow the impact of a PR on a particular branch to be compared. For example, how does PR XXXX impact performance on master?

How to configure the job

To start with, visit https://ci.nodejs.org/job/benchmark-node-micro-benchmarks/build?delay=0sec
Here you have four values to enter:

BRANCH -

the branch or tag to compare against; defaults to master

PULL_ID -

the pull request to test
These two values provide the two copies of Node.js to compare: we build BRANCH, then take a copy, apply the diff of PULL_ID, and build again.

CATEGORY -

This is a category of benchmarks in the node/benchmark folder from core

FILTER -

this further reduces the benchmarks in scope, including only benchmarks where FILTER appears in the filename.

On running, a minimum of 30 iterations of each benchmark/variation is executed per build; once they have all finished, Rscript will summarise the results, suggesting improvements/regressions as well as confidence intervals.
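As an illustration (the values below are hypothetical - substitute the pull request you actually want to test), a run scoped to the net benchmarks whose filenames contain tcp-raw would use:

BRANCH=master
PULL_ID=<number of the pull request under test>
CATEGORY=net
FILTER=tcp-raw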

@mhdawson
Member

mhdawson commented Sep 8, 2017

@gareth-ellis It would probably make sense to PR this as a doc somewhere in the repo.

@gareth-ellis
Member Author

@AndreasMadsen @mcollina @mscdex, if you have some cycles, do you want to have a go with the job using the instructions above and see how you get on?

I've also got a PR open for the above doc, with a view to getting something together that we can send out to a wider audience to get people using this job and core benchmarks (PR in #144).

@AndreasMadsen
Member

I would love to see it in action but right now I don't have anything I would like to benchmark :(

@mcollina
Member

I have started https://ci.nodejs.org/job/benchmark-node-micro-benchmarks/14/

@mcollina
Member

I think the script should automatically rebase the branch, because a difference in the V8 engine could skew the results and provide wrong feedback. Maybe this should be a checkbox, so we can verify across different release lines.

I ran the tests on my laptop in June and got no regressions (nodejs/node#12857 (comment)) - this was at the time of V8 5.9.

However the branch is still using an extremely old V8 engine, and we got some very weird results running it now:

                                                   improvement confidence     p.value
 net/net-c2s-cork.js dur=5 type="buf" len=1024          2.56 %            0.330368195
 net/net-c2s-cork.js dur=5 type="buf" len=128          -5.73 %          * 0.010873902
 net/net-c2s-cork.js dur=5 type="buf" len=16           -0.34 %            0.869100857
 net/net-c2s-cork.js dur=5 type="buf" len=32            3.41 %            0.218088510
 net/net-c2s-cork.js dur=5 type="buf" len=4            -1.04 %            0.491132059
 net/net-c2s-cork.js dur=5 type="buf" len=512          -5.45 %          * 0.032530436
 net/net-c2s-cork.js dur=5 type="buf" len=64           -1.61 %            0.579751020
 net/net-c2s-cork.js dur=5 type="buf" len=8            -2.84 %            0.242407708
 net/net-c2s.js dur=5 type="asc" len=102400            -0.86 %            0.500537810
 net/net-c2s.js dur=5 type="asc" len=16777216          -0.73 %            0.762753668
 net/net-c2s.js dur=5 type="buf" len=102400             0.55 %            0.664537769
 net/net-c2s.js dur=5 type="buf" len=16777216           1.11 %            0.557571579
 net/net-c2s.js dur=5 type="utf" len=102400            -0.39 %            0.822950308
 net/net-c2s.js dur=5 type="utf" len=16777216           1.38 %            0.453838521
 net/net-pipe.js dur=5 type="asc" len=102400            0.41 %            0.712092837
 net/net-pipe.js dur=5 type="asc" len=16777216          0.81 %            0.775137756
 net/net-pipe.js dur=5 type="buf" len=102400            0.76 %            0.466008198
 net/net-pipe.js dur=5 type="buf" len=16777216          0.90 %            0.558028886
 net/net-pipe.js dur=5 type="utf" len=102400            0.54 %            0.671323747
 net/net-pipe.js dur=5 type="utf" len=16777216          3.82 %          * 0.032465465
 net/net-s2c.js dur=5 type="asc" len=102400             0.84 %            0.509765096
 net/net-s2c.js dur=5 type="asc" len=16777216          -2.78 %            0.217136214
 net/net-s2c.js dur=5 type="buf" len=102400             0.11 %            0.925209854
 net/net-s2c.js dur=5 type="buf" len=16777216           0.03 %            0.986008241
 net/net-s2c.js dur=5 type="utf" len=102400            -2.64 %            0.129138848
 net/net-s2c.js dur=5 type="utf" len=16777216           2.26 %            0.088823248
 net/tcp-raw-c2s.js dur=5 type="asc" len=102400         2.51 %            0.292369145
 net/tcp-raw-c2s.js dur=5 type="asc" len=16777216      -0.08 %            0.972487579
 net/tcp-raw-c2s.js dur=5 type="buf" len=102400         6.07 %            0.052679575
 net/tcp-raw-c2s.js dur=5 type="buf" len=16777216       1.04 %            0.562458368
 net/tcp-raw-c2s.js dur=5 type="utf" len=102400         5.49 %            0.217685094
 net/tcp-raw-c2s.js dur=5 type="utf" len=16777216      -0.06 %            0.884432672
 net/tcp-raw-pipe.js dur=5 type="asc" len=102400        0.93 %            0.731389654
 net/tcp-raw-pipe.js dur=5 type="asc" len=16777216    -18.48 %            0.304436728
 net/tcp-raw-pipe.js dur=5 type="buf" len=102400       -0.78 %            0.781388921
 net/tcp-raw-pipe.js dur=5 type="buf" len=16777216     75.96 %            0.165276028
 net/tcp-raw-pipe.js dur=5 type="utf" len=102400       -5.66 %         ** 0.006928205
 net/tcp-raw-pipe.js dur=5 type="utf" len=16777216     -1.05 %            0.669799761
 net/tcp-raw-s2c.js dur=5 type="asc" len=102400         0.48 %            0.848273086
 net/tcp-raw-s2c.js dur=5 type="asc" len=16777216      -1.09 %            0.664140660
 net/tcp-raw-s2c.js dur=5 type="buf" len=102400         5.85 %          * 0.046938660
 net/tcp-raw-s2c.js dur=5 type="buf" len=16777216      -0.38 %            0.884173926
 net/tcp-raw-s2c.js dur=5 type="utf" len=102400        -4.80 %            0.174611084
 net/tcp-raw-s2c.js dur=5 type="utf" len=16777216      -0.71 %            0.364776237
Notifying upstream projects of job completion
Finished: SUCCESS

I kicked off the job again: https://ci.nodejs.org/job/benchmark-node-micro-benchmarks/15/.

It would be good to access the raw data so I can use R to plot graphs. Is that pushed somewhere?

@AndreasMadsen
Member

AndreasMadsen commented Sep 14, 2017

I think the script should automatically rebase the branch, because a difference in the V8 engine could skew the results and provide wrong feedback. Maybe this should be a checkbox, so we can verify across different release lines.

The script checks out $BRANCH (master) and builds node, then cherry-picks $PULL_ID on top and builds again. So it should effectively be rebased.

However the branch is still using an extremely old V8 engine, and we got some very weird results running it now:

What are the weird results? You have to remember that there is a small risk of a false positive, and with many independent tests that risk becomes large. At a 5% significance level (one star), the risk of getting 5 false positives in 44 tests is 4.590962%.

[rplot: binomial distribution of the number of false positives among 44 tests, per significance level]

The plot is only valid for 44 independent tests.

plot code
library(ggplot2)

# probability of exactly k false positives among 44 independent tests,
# at each of the three significance levels shown in the benchmark output
# (e.g. dbinom(5, 44, 0.05) is ~0.0459, the ~4.59 % mentioned above)
c1 = data.frame(
    false.positive=0:44,
    probability=dbinom(0:44, 44, 0.05),
    confidence='*'
)
c2 = data.frame(
    false.positive=0:44,
    probability=dbinom(0:44, 44, 0.01),
    confidence='**'
)
c3 = data.frame(
    false.positive=0:44,
    probability=dbinom(0:44, 44, 0.001),
    confidence='***'
)
c = rbind(c1, c2, c3)

p = ggplot(c, aes(x=false.positive)) +
  geom_bar(aes(y=probability, fill=confidence), stat="identity") +
  facet_grid(confidence ~ .) +
  scale_x_continuous(breaks=0:44, limits=c(-0.5, 8))

print(p)

@mcollina
Member

My bad. So, this is fine.

The script works beautifully then. I think we should document it in the main repo, it would be of great help to everyone.

@gareth-ellis
Member Author

Excellent - I will see what I can do about providing the raw data.

Yes, we plan to get the info out to collaborators so they can start using it, but I thought it would be good to have some people test it first :)

I'll add a bit of a description similar to what @AndreasMadsen wrote to explain how it works, and include in the doc in #144 .

Are there any other issues or quirks you noticed that should be documented?

@mcollina
Member

@gareth-ellis +1 to all that you just said.

Running all the http benchmarks without a filter takes forever (44 hours on my laptop and I only got to 60%). We should not allow that, or at least strongly suggest that users add a filter. Running a job for 3-4 days does not sound right. IMHO the filter might well be mandatory for this script.

Posting the result to the PR would be a huge plus, but I do not know if that is easy to do.

@AndreasMadsen
Member

Running all the http benchmarks without a filter takes forever (44 hours on my laptop and I only got to 60%). We should not allow that, or at least strongly suggest that users add a filter. Running a job for 3-4 days does not sound right. IMHO the filter might well be mandatory for this script.

I agree. Statistically, due to the false-positive risk, it is also not very useful to run that many tests. I think we should just check the expected benchmark time using a precompiled list, as shown earlier in #58 (comment).
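A minimal sketch of that check (R only for illustration; the file name, columns, and threshold here are hypothetical, not something the job has today):

# sketch: refuse a job whose selected benchmarks would exceed a time budget
# "expected-times.csv" is a hypothetical precompiled list with columns
# file (benchmark path, e.g. "net/tcp-raw-pipe.js") and seconds (expected
# time for a single run of that file)
expected <- read.csv("expected-times.csv", stringsAsFactors = FALSE)

category <- "net"      # hypothetical CATEGORY value
filter   <- "tcp-raw"  # hypothetical FILTER value
budget   <- 4 * 3600   # e.g. refuse anything expected to take over 4 hours

selected <- expected[grepl(paste0("^", category, "/"), expected$file) &
                     grepl(filter, expected$file, fixed = TRUE), ]

# 30 runs of the old build plus 30 runs of the new build per benchmark
total <- sum(selected$seconds) * 60
if (total > budget) stop("selected benchmarks exceed the time budget")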

@gareth-ellis
Member Author

@mcollina , for getting to the raw data, if you click on your job, you should be able to open the workspace, and then browse to
e.g. https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/ws/benchmarking/experimental/benchmarks/community-benchmark/node/output.csv.

I'll modify the script to move it to the top level so there is less moving around. I think the gotcha here could be that the workspace may be cleared before each run; I will rename the file so I can be sure. Otherwise, another option may be copying it to a folder on the benchmarking machine and having it synced over to the website for download. We do something similar for the charts (at https://benchmarking.nodejs.org), so it should in theory be possible. Then we could post a link in the job output to where the file will be.
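For reference, plotting from that CSV in R might look roughly like this (a sketch only - the column names binary, filename, configuration and rate are assumptions about the raw output format, so adjust to whatever the file actually contains):

library(ggplot2)

# read the raw per-run data downloaded from the job workspace
dat <- read.csv("output.csv", stringsAsFactors = FALSE)

# one boxplot of ops/sec per configuration, split by old/new binary,
# with a panel per benchmark file
p <- ggplot(dat, aes(x = configuration, y = rate, fill = binary)) +
  geom_boxplot() +
  facet_wrap(~ filename, scales = "free") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p)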

For posting to a PR, I know there are some bots in the org - are there any I can pass data to, to post to a PR? In the job I already have the PR number, so I have the information to send and where it needs to go.

@mcollina
Member

Both approaches make sense. Even uploading it to a gist on GitHub (together with a chart, maybe) would be fantastic. Something that lasts for a bit of time, with some meta-info (the SHAs being tested).

@gareth-ellis
Member Author

OK, will look into that.

In the meantime, CSVs now get saved in the root of the workspace with a date/timestamp:

https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/ws/

@mhdawson
Member

Are we ready to announce to the rest of the team that they can start using this, or is there any other testing feedback we should get first?

@AndreasMadsen
Member

@mhdawson I don't think there is anything to lose. It is not like making this public can cause anything else to fail or break.

@refack

refack commented Nov 9, 2017

@gareth-ellis is there a reason that RUNS wasn't exposed as a parameter to the job?

@gareth-ellis
Member Author

The main reason was that we were using the default number of iterations set per benchmark. I can add the RUNS parameter if you like; just appreciate that if you set the number too low the results may not be particularly useful, and if you set it too high with enough benchmarks in scope, it could take a long time to run.

@refack

refack commented Nov 9, 2017

I looked at the code and saw that the script will use it, so I added it, because we have a PR (nodejs/node#16888 (comment)) that is very noisy, and AFAICT it would give better results with more runs but shorter iterations.

@AndreasMadsen
Member

AFAICT it would give better results with more runs but shorter iterations

In theory, it will only make a tiny difference. After 30-100 runs, the difference should be less than the numerical precision. In practice, decreasing the number of iterations can change the optimization behavior.
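For intuition (a toy calculation, not job data): the standard error of the mean only shrinks with the square root of the number of runs, so going from 30 to 120 runs merely halves it (1/√30 ≈ 0.18 versus 1/√120 ≈ 0.09 of a single run's standard deviation).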

@refack

refack commented Nov 9, 2017

Look at this distribution:
[plot: unpaired t-test data distribution]
I think all the outliers are cases where the GC kicks in, and that's not what we're trying to look at.

@AndreasMadsen
Member

@refack what are the outliers?

@AndreasMadsen
Member

@refack I have made a long comment in the other issue (nodejs/node#16888 (comment)), hopefully that will clarify things.

@mhdawson
Member

mhdawson commented Dec 4, 2017

Gareth opened nodejs/node#17359 to let collaborators know about it.

@mhdawson
Member

Anything more to be discussed in this issue? The benchmark job is now available and publicized, and the initial issues have been resolved.
