Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve fabtest latency tests #10394

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

iziemba
Copy link
Contributor

@iziemba iziemba commented Sep 18, 2024

The issue with current fi_rdm_pingpong and fi_rdm_tagged_pingpong logic is reported latency is higher than running OSU latency with with MPI + libfabric. Because of this, these pingpong benchmarks are not particularly useful.

Looking at how OSU latency + MPI + libfabric works, OSU latency initiator issues an MPI_Send before the MPI_Recv. Taking a similar approach in fabtests shows decreased latency. The following are examples with rxm;tcp.

Default (pre-post RX) pingpong

$ taskset -c 32 ./fi_rdm_pingpong -p "tcp;ofi_rxm" -s 169.254.0.35 -S all -w 1000 -O 169.254.1.36 -b 169.254.0.36
bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
1       10k     19k         0.30s      0.07      14.94       0.07
2       10k     39k         0.30s      0.13      15.24       0.07
3       10k     58k         0.30s      0.20      15.25       0.07
4       10k     78k         0.30s      0.26      15.25       0.07
6       10k     117k        0.31s      0.38      15.63       0.06
8       10k     156k        0.31s      0.51      15.63       0.06
12      10k     234k        0.31s      0.77      15.62       0.06
16      10k     312k        0.31s      1.02      15.63       0.06
24      10k     468k        0.31s      1.54      15.63       0.06
32      10k     625k        0.31s      2.05      15.63       0.06
48      10k     937k        0.31s      3.07      15.62       0.06
64      10k     1.2m        0.31s      4.10      15.63       0.06
96      10k     1.8m        0.32s      6.05      15.88       0.06
128     10k     2.4m        0.34s      7.59      16.87       0.06
192     10k     3.6m        0.34s     11.31      16.98       0.06
256     10k     4.8m        0.34s     15.17      16.88       0.06
384     10k     7.3m        0.34s     22.57      17.01       0.06
512     10k     9.7m        0.34s     30.15      16.98       0.06
768     10k     14m         0.34s     45.05      17.05       0.06
1k      10k     19m         0.34s     59.70      17.15       0.06
...

TX First pingpong

$ taskset -c 32 ./fi_rdm_pingpong -p "tcp;ofi_rxm" -s 169.254.0.35 -S all -w 1000 -O 169.254.1.36 -b -T 169.254.0.36
bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
1       10k     19k         0.26s      0.08      13.18       0.08
2       10k     39k         0.27s      0.15      13.49       0.07
3       10k     58k         0.26s      0.23      13.11       0.08
4       10k     78k         0.26s      0.30      13.11       0.08
6       10k     117k        0.27s      0.44      13.49       0.07
8       10k     156k        0.27s      0.59      13.51       0.07
12      10k     234k        0.27s      0.89      13.51       0.07
16      10k     312k        0.27s      1.19      13.49       0.07
24      10k     468k        0.27s      1.78      13.48       0.07
32      10k     625k        0.28s      2.31      13.86       0.07
48      10k     937k        0.27s      3.56      13.47       0.07
64      10k     1.2m        0.27s      4.74      13.49       0.07
96      10k     1.8m        0.27s      6.99      13.73       0.07
128     10k     2.4m        0.29s      8.76      14.61       0.07
192     10k     3.6m        0.29s     13.08      14.67       0.07
256     10k     4.8m        0.29s     17.42      14.70       0.07
384     10k     7.3m        0.31s     24.85      15.45       0.06
512     10k     9.7m        0.29s     35.58      14.39       0.07
768     10k     14m         0.29s     53.24      14.43       0.07
1k      10k     19m         0.30s     68.58      14.93       0.07
...

The -T option is used to enable this behavior. Since this option requires no pre-posting of buffers, inband address exchange cannot be used.

Copy link
Contributor

@aingerson aingerson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fun change! Have you run with other providers? Which see an improvement and which see degradation?

fabtests/common/shared.c Outdated Show resolved Hide resolved
fabtests/benchmarks/benchmark_shared.c Show resolved Hide resolved
fabtests/benchmarks/benchmark_shared.c Outdated Show resolved Hide resolved
fabtests/benchmarks/benchmark_shared.c Outdated Show resolved Hide resolved
fabtests/benchmarks/benchmark_shared.c Outdated Show resolved Hide resolved
@ooststep
Copy link
Contributor

not necessarily related to this change, but I notice with the fabtests you're requesting tcp;rxm. tcp does directly support rdm endpoints now, so this is not necessary and could remove some overheads if the rxm layering is not needed (it should likely only be needed for backward compatibility). additionally, if your mpi runs are not using the rxm layering (I'm not sure how you're requesting provider in that case) that could contribute to some differences as well.

@iziemba
Copy link
Contributor Author

iziemba commented Sep 18, 2024

Fun change! Have you run with other providers? Which see an improvement and which see degradation?

I have run with RxM + TCP and CXI providers. Both saw improvements with posting TX before RX.

@iziemba
Copy link
Contributor Author

iziemba commented Sep 18, 2024

Thinking out loud..... The reason why I did not want to delete the existing logic is because I cannot guarantee the new logic will not introduce a regression for some providers (hence the -T arg). Do we want to consider removing old pingpong logic?

@j-xiong
Copy link
Contributor

j-xiong commented Sep 18, 2024 via email

@shefty
Copy link
Member

shefty commented Sep 19, 2024

Posting the Tx prior to posting the Rx effectively hides the latency of posting the Rx, assuming the RTT is longer than the Rx posting time. If an Rx is posted prior to entering the pingpong loop, this should work for verbs without verbs hitting RNR NACK.

Move the current pingpong logic into a pingpong pre-posted RX function.
This better describes the behavior of this pingpong test. In addition,
this change will allow for a pingpong without pre-posted RX buffers to
be defined.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
The run() logic in rdm_pingpong and rdm_tagged_pingpong is the same.
Consolidate this logic into a single run_pingpong() function.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
Split out the ft_sync logic into two separate functions: inband sync
(ft_sync_inband) and out-of-band sync (ft_sync_oob). The inband sync
supports the option to conditionally repost buffers after the sync.

The breaking out of the sync logic is needed to support
fi_rdm_pingpong/fi_rdm_tagged_pingpong with a no pre-posted RX buffer
option.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
The new pingpong test allows for TX operations to be posted and
processed, if necessary, before post the RX buffer. This better aligns
to how OSU latency works. By doing this, fi_rdm_tagged_latency is now
lower than OSU latency which makes sense since less SW is involved.

The no prepost RX pingpong test can be enabled by using the -r option.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
@zachdworkin
Copy link
Contributor

@iziemba I am fixing Intel CI and will replay your PR when I am finished. Sorry for the delays

Copy link
Contributor

@aingerson aingerson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants