[GraphBolt] sample_neighbors() on CPU with prob/mask is 14x slower than w/o prob/mask #7462

Open
Rhett-Ying opened this issue Jun 17, 2024 · 16 comments
Labels: Work Item (Work items tracked in project tracker)

@Rhett-Ying
Collaborator

Rhett-Ying commented Jun 17, 2024

🔨Work Item

IMPORTANT:

  • This template is only for the dev team to track project progress. For feature requests or bug reports, please use the corresponding issue templates.
  • DO NOT create a new work item if the purpose is to fix an existing issue or feature request. We will directly use the issue in the project tracker.

Project tracker: https://github.com/orgs/dmlc/projects/2

Description

The numbers below are from the node classification example on the latest master branch (2024.06.14) with CPU sampling.

prob data

# Add prob data.
num_edges = dataset.graph.total_num_edges
prob_data = torch.rand(num_edges)
prob_data[torch.randperm(len(prob_data))[: int(len(prob_data) * 0.5)]] = 0.0
dataset.graph.add_edge_attribute("prob", prob_data)

non-dist, gb, homo, nc, no prob
Training...
Training: 3it [00:05, 1.55s/it]---- Average time for sampling: 0.082120824418962
Training: 10it [00:13, 1.19s/it]---- Average time for sampling: 0.0836272995453328
Training: 16it [00:19, 1.12s/it]---- Average time for sampling: 0.08516605570912361
Training: 23it [00:27, 1.14s/it]---- Average time for sampling: 0.0846009589266032
Training: 30it [00:35, 1.13s/it]---- Average time for sampling: 0.0846672392077744
Training: 36it [00:42, 1.12s/it]---- Average time for sampling: 0.08583860701570908

non-dist, gb, homo, nc, prob
Training: 4it [00:22, 5.03s/it]---- Average time for sampling: 1.1615502193570137
Training: 11it [00:52, 4.32s/it]---- Average time for sampling: 1.228120240289718
Training: 18it [01:22, 4.19s/it]---- Average time for sampling: 1.2553682390290002
Training: 24it [01:47, 4.19s/it]---- Average time for sampling: 1.2268523721955717
Training: 31it [02:16, 4.18s/it]---- Average time for sampling: 1.2365567354112863
Training: 38it [02:46, 4.21s/it]---- Average time for sampling: 1.2496669557721665

mask data

# Add prob and mask data.
num_edges = dataset.graph.total_num_edges
prob_data = torch.rand(num_edges)
prob_data[torch.randperm(len(prob_data))[: int(len(prob_data) * 0.5)]] = 0.0
dataset.graph.add_edge_attribute("prob", prob_data)
mask_data = (prob_data > 0.2).to(torch.float32)
dataset.graph.add_edge_attribute("mask", mask_data)

GB + NoMask
Training...
Training: 3it [00:00, 3.37it/s]---- Average time for sampling: 0.0064875221811234954
Training: 10it [00:02, 4.05it/s]---- Average time for sampling: 0.006722455704584717
Training: 16it [00:04, 3.92it/s]---- Average time for sampling: 0.008144200344880422
Training: 23it [00:06, 3.68it/s]---- Average time for sampling: 0.008010426000691951
Training: 30it [00:07, 4.15it/s]---- Average time for sampling: 0.008361106105148793
Training: 36it [00:09, 4.31it/s]---- Average time for sampling: 0.00815806492852668

GB + Mask
Training...
Training: 3it [00:01, 2.61it/s]---- Average time for sampling: 0.04655262678861618
Training: 10it [00:03, 3.52it/s]---- Average time for sampling: 0.05128640588372946
Training: 16it [00:04, 3.61it/s]---- Average time for sampling: 0.054167306640495856
Training: 23it [00:06, 3.64it/s]---- Average time for sampling: 0.052452172990888356
Training: 30it [00:08, 3.74it/s]---- Average time for sampling: 0.05273009695112705
Training: 36it [00:10, 3.49it/s]---- Average time for sampling: 0.05279933324394127

critical call

/** @brief An operator to perform non-uniform sampling. */
static torch::Tensor NonUniformPickOp(
    torch::Tensor probs, int64_t fanout, bool replace) {
  auto positive_probs_indices = probs.nonzero().squeeze(1);
  auto num_positive_probs = positive_probs_indices.size(0);
  if (num_positive_probs == 0) return torch::empty({0}, torch::kLong);
  if ((fanout == -1) || (num_positive_probs <= fanout && !replace)) {

Depending work items or issues

@Rhett-Ying Rhett-Ying added the Work Item (Work items tracked in project tracker) label Jun 17, 2024
@Rhett-Ying
Collaborator Author

As @mfbalin suggested, x86simdsort::qselect is an equivalent of std::nth_element that we could use for weighted sampling to obtain a speed-up: https://github.com/intel/x86-simd-sort
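For illustration, here is a minimal standalone sketch (plain C++/STL, not the GraphBolt code; the function name WeightedPickWithoutReplacement and the toy data are made up for this example) of how a partial-selection primitive such as std::nth_element can drive weighted sampling without replacement via exponential keys. x86simdsort::qselect could be swapped in where std::nth_element is called.

// Minimal sketch (not GraphBolt code): weighted sampling without replacement
// via exponential keys, using std::nth_element as the selection primitive.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

// Pick `fanout` indices from `probs` (unnormalized weights; zero entries are
// never picked) without replacement. Each candidate i gets the key
// -log(u)/probs[i]; the `fanout` smallest keys form a weighted sample.
std::vector<int64_t> WeightedPickWithoutReplacement(
    const std::vector<float>& probs, int64_t fanout, std::mt19937& rng) {
  std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
  std::vector<int64_t> candidates;
  std::vector<float> keys;
  for (int64_t i = 0; i < static_cast<int64_t>(probs.size()); ++i) {
    if (probs[i] <= 0.0f) continue;
    candidates.push_back(i);
    keys.push_back(-std::log(uniform(rng)) / probs[i]);
  }
  const int64_t k = std::min(fanout, static_cast<int64_t>(candidates.size()));
  std::vector<int64_t> order(candidates.size());
  std::iota(order.begin(), order.end(), 0);
  // Partial selection: only the k smallest keys are needed, not a full sort.
  // x86simdsort::qselect targets exactly this operation.
  std::nth_element(
      order.begin(), order.begin() + k, order.end(),
      [&](int64_t a, int64_t b) { return keys[a] < keys[b]; });
  std::vector<int64_t> picked(k);
  for (int64_t j = 0; j < k; ++j) picked[j] = candidates[order[j]];
  return picked;
}

int main() {
  std::mt19937 rng(0);
  std::vector<float> probs = {0.0f, 0.3f, 0.0f, 0.9f, 0.5f, 0.1f};
  for (int64_t idx : WeightedPickWithoutReplacement(probs, 3, rng))
    std::cout << idx << " ";
  std::cout << "\n";
  return 0;
}

The exponential-key trick gives the same distribution as repeatedly drawing without replacement, so a single partial selection replaces a loop of weighted draws.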

@az15240
Collaborator

az15240 commented Jul 23, 2024

I'm trying to reproduce your result. I didn't change any configuration or parameter options; I assume those options (non-dist, gb, homo, nc, no prob, etc.) match the current example code.

For the prob/mask data, do we regenerate it: (a) once for the whole code execution, (b) once per batch of training, or (c) once per iteration? I tried all three generation plans, but I put the generation code directly in the batch or iteration loop, so it might not be optimal.

I ran on the ogbn-products dataset:

                             None        prob        mask
Sample Once per Execution    68.09it/s   57.52it/s   56.79it/s
Sample Once per Batch        68.09it/s   56.92it/s   57.45it/s
Sample Once per Iteration    68.09it/s    3.43it/s   55.09it/s

One hypothesis for the huge performance difference is that we recompute the whole prob/mask data every iteration, which can be really costly for a huge dataset. For prob data, 57.52/3.43 = 16.76, which is roughly the reported 14x difference, but it confuses me that the mask data doesn't show a similarly large gap.

I'm also confused by your result: "GB + NoMask" takes >10x less time than "non-dist, gb, homo, nc, no prob". I assume "GB + NoMask" only adds the mask data on top of "non-dist, gb, homo, nc, no prob" without using it.


P.S. shall we update the link to the node classification example in the description? It has moved to a different path.

@az15240
Collaborator

az15240 commented Jul 24, 2024

@Rhett-Ying

@Rhett-Ying
Collaborator Author

In the issue I reported, the prob/mask is generated once when we load the dataset, so it is not changed during model training. According to the results you shared above, you didn't reproduce the issue. Could you file a draft PR with your reproduction code changes? Then I could review it and find out where the discrepancy is.

Feel free to update the example path accordingly.

@mfbalin
Collaborator

mfbalin commented Jul 24, 2024


@Rhett-Ying @az15240 You probably ran the GPU version of the kernels. It is expected that the GPU results are this fast. You need to move the graph to the CPU to measure CPU sampling time.

@az15240
Collaborator

az15240 commented Jul 24, 2024

Please check out PR #7580 for more information!

@az15240
Collaborator

az15240 commented Jul 24, 2024

@Rhett-Ying How did you calculate the average time for sampling? I think it is more direct if we focus on the number of iterations per second. This is what I get using the CPU, and it seems like the runtime isn't too bad.

None: 2.32it/s
mask: 1.02it/s
weight: 1.59it/s

@mfbalin
Collaborator

mfbalin commented Jul 24, 2024


Try using cpu-cuda instead of cpu-cpu.

@mfbalin
Collaborator

mfbalin commented Jul 24, 2024

You might want to use --mode=cpu-pinned-cuda in the pyg/node_classification_advanced.py example if you want to focus on the CPU sampling runtime. --mode=cuda-pinned-cuda would give you the GPU sampling runtime.

@Rhett-Ying
Collaborator Author

@az15240 As mentioned above, please make sure you're using the CPU for sampling; this issue happens with CPU sampling only. I think we can already see the inefficiency of CPU sampling with prob/mask from the regression benchmark you added previously.

@Rhett-Ying Rhett-Ying changed the title [GraphBolt] sample_neighbors() with prob/mask is 14x slower than w/o prob/mask [GraphBolt] sample_neighbors() on CPU with prob/mask is 14x slower than w/o prob/mask Jul 25, 2024
@az15240
Collaborator

az15240 commented Jul 25, 2024

I don't know why, but I got a Segmentation fault (core dumped) when running with the suggested cpu-cuda mode; all other modes run normally for me, and I've restarted my machine as well. 😿 For now I'll focus mainly on the benchmark regression results and on the performance of graph.sample_neighbors.

@az15240
Collaborator

az15240 commented Jul 29, 2024

Notes:

  1. early stop in NumPick ([Temporal] Optimize num pick by early stop #7370)
  2. uniform sampling for mask ([Temporal] Pick function optimization #7417)
  3. preprocess: filter down the edges for mask, or precompute the sum of neighboring edges' probs per node (decent for one-time usage, bad for the dynamic edge_attribute generation case); see the sketch below
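As a concrete (hypothetical) illustration of the "preprocess" idea in item 3 for the mask case, the following plain C++ sketch builds a filtered CSC once so that later sampling can take the uniform fast path. The Csc struct and the FilterByMask name are made up here, not GraphBolt's actual data structures.

// Sketch of the "filter down the edges for mask" preprocessing idea (names
// and layout are illustrative only).
#include <cstdint>
#include <vector>

struct Csc {
  std::vector<int64_t> indptr;   // size num_nodes + 1
  std::vector<int64_t> indices;  // neighbor ids, addressed by edge id
};

Csc FilterByMask(const Csc& g, const std::vector<uint8_t>& edge_mask) {
  Csc out;
  out.indptr.push_back(0);
  for (int64_t v = 0; v + 1 < static_cast<int64_t>(g.indptr.size()); ++v) {
    for (int64_t e = g.indptr[v]; e < g.indptr[v + 1]; ++e) {
      if (edge_mask[e]) out.indices.push_back(g.indices[e]);  // keep edge
    }
    out.indptr.push_back(static_cast<int64_t>(out.indices.size()));
  }
  return out;
}

This pays a one-time pass over all edges, which matches the note that it is decent for one-time usage but bad when the edge attribute changes dynamically.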

@az15240
Collaborator

az15240 commented Jul 29, 2024

NumPick decides the number of nodes/edges to pick, for example when there is a conflict between the fanout and the number of non-zero prob/mask entries.
Pick actually performs the picking of edges.
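A rough sketch of that split (simplified plain C++; the real GraphBolt functions have different signatures and operate on CSC slices, and the Pick body here is only a placeholder):

// Rough illustration (not the real GraphBolt signatures) of the
// NumPick / Pick split described above.
#include <algorithm>
#include <cstdint>
#include <vector>

// NumPick: decide HOW MANY edges to pick for one node, resolving conflicts
// between the requested fanout and the number of eligible neighbors
// (those with non-zero prob/mask).
int64_t NumPick(int64_t fanout, int64_t num_eligible, bool replace) {
  if (num_eligible == 0) return 0;
  if (fanout == -1) return num_eligible;   // -1 means "take all neighbors"
  if (replace) return fanout;              // duplicates allowed
  return std::min(fanout, num_eligible);   // without replacement
}

// Pick: actually choose the edge ids. This placeholder just takes the first
// `num_pick` eligible edges to show the control flow; the real Pick draws
// them randomly (uniformly, or weighted by prob).
std::vector<int64_t> Pick(const std::vector<float>& probs, int64_t num_pick) {
  std::vector<int64_t> picked;
  for (int64_t i = 0; i < static_cast<int64_t>(probs.size()); ++i) {
    if (static_cast<int64_t>(picked.size()) == num_pick) break;
    if (probs[i] > 0.0f) picked.push_back(i);
  }
  return picked;
}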

@az15240
Collaborator

az15240 commented Aug 5, 2024

Notes:

  1. early stop in NumPick ([Temporal] Optimize num pick by early stop #7370)
  2. uniform sampling for mask ([Temporal] Pick function optimization #7417)
  3. preprocess: filter down the edges for mask, or precompute the sum of neighboring edges' probs per node (decent for one-time usage, bad for the dynamic edge_attribute generation case)

The first option makes no change to the performance, since it modifies TemporalNumPick instead of NumPick.
The second option gives a 1% to 3% improvement in the sampling time of NonUniformPick, which is the path used for prob/mask.

@az15240
Collaborator

az15240 commented Aug 5, 2024

One hypothesis to explain the huge sampling time is that, in the NonUniformPickOp function, the two lines

  auto positive_probs_indices = probs.nonzero().squeeze(1);
  auto num_positive_probs = positive_probs_indices.size(0);

take too long to execute.

Meanwhile, when replace=False and no edge attribute is given, the uniform path returns immediately:

  if ((fanout == -1) || (num_neighbors <= fanout && !replace)) {
    std::iota(picked_data_ptr, picked_data_ptr + num_neighbors, offset);
    return num_neighbors;
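To make the contrast concrete, here is a small standalone sketch (plain C++, not the GraphBolt code; function names are illustrative) of the extra work the prob/mask path must do before any picking can start, compared with the uniform fast path:

// Standalone sketch contrasting the two paths quoted above for a single
// node's neighborhood.
#include <cstdint>
#include <numeric>
#include <vector>

// Uniform fast path: no per-edge scan, just fill 0..num_neighbors-1.
std::vector<int64_t> UniformFastPath(int64_t num_neighbors) {
  std::vector<int64_t> picked(num_neighbors);
  std::iota(picked.begin(), picked.end(), 0);
  return picked;
}

// Prob/mask path: scan the per-node prob slice and materialize the indices of
// the non-zero entries, mirroring what probs.nonzero().squeeze(1) must do
// (an extra pass plus an allocation) before any picking can happen.
std::vector<int64_t> CollectPositiveIndices(const std::vector<float>& probs) {
  std::vector<int64_t> positive;
  positive.reserve(probs.size());
  for (int64_t i = 0; i < static_cast<int64_t>(probs.size()); ++i) {
    if (probs[i] > 0.0f) positive.push_back(i);
  }
  return positive;
}

That per-node pass and allocation is the kind of overhead the hypothesis above points at.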

@az15240
Collaborator

az15240 commented Sep 10, 2024

Some progress that I currently have:

  1. I changed the number of workers, but if we sum up the runtime of all the subprocesses, it is very similar to the runtime with num_workers=0.
  2. I tried manually setting the number of threads around these two lines, but I got a very similar runtime regardless of the thread count. Since the probs tensor is large, I hypothesize that these two lines are implemented single-threaded.
  3. I tried Peiyuan's ideas ([Temporal] Optimize num pick by early stop #7370 and [Temporal] Pick function optimization #7417); neither improved the runtime by more than 3%.
  4. I tried four changes:
    4.1. auto positive_probs_indices = probs.nonzero().select(1, 0); (no major improvement)
    4.2. auto positive_probs_indices = torch::where(probs > 0)[0]; (made it worse)
    4.3. auto positive_probs_indices = torch::argwhere(probs > 0).flatten(); (made it worse)
    4.4. auto temp = probs.contiguous().nonzero(); (slight improvement: prob/mask sampling improves by 4.12% on average)

Data for 4.4: https://docs.google.com/spreadsheets/d/1QrH-A4Fch0McHxux7pwaNgSkaoH4HslwFHSYdZ_ADwc/edit?usp=sharing, by running benchmark_graphbolt_sampling.py

I think we can take the 4.4 change.
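For reference, a sketch of how the 4.4 variant would slot into the quoted NonUniformPickOp lines (only the first line changes; the rest is the existing code):

  // 4.4: make probs contiguous before calling nonzero().
  auto temp = probs.contiguous().nonzero();
  auto positive_probs_indices = temp.squeeze(1);
  auto num_positive_probs = positive_probs_indices.size(0);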

az15240 pushed a commit that referenced this issue Sep 10, 2024