[GraphBolt] sample_neighbors() on CPU with prob/mask is 14x slower than w/o prob/mask #7462

Open
Rhett-Ying opened this issue Jun 17, 2024 · 16 comments
Labels: Work Item (Work items tracked in project tracker)

@Rhett-Ying
Collaborator

Rhett-Ying commented Jun 17, 2024

🔨Work Item

IMPORTANT:

  • This template is only for the dev team to track project progress. For feature requests or bug reports, please use the corresponding issue templates.
  • DO NOT create a new work item if the purpose is to fix an existing issue or feature request. We will directly use the issue in the project tracker.

Project tracker: https://github.com/orgs/dmlc/projects/2

Description

The numbers below are from the node classification example on the latest master branch (2024.06.14) with CPU sampling.

prob data

# Add prob data.
num_edges = dataset.graph.total_num_edges
prob_data = torch.rand(num_edges)
prob_data[torch.randperm(len(prob_data))[: int(len(prob_data) * 0.5)]] = 0.0
dataset.graph.add_edge_attribute("prob", prob_data)

non-dist, gb, homo, nc, no prob
Training...
Training: 3it [00:05, 1.55s/it]---- Average time for sampling: 0.082120824418962
Training: 10it [00:13, 1.19s/it]---- Average time for sampling: 0.0836272995453328
Training: 16it [00:19, 1.12s/it]---- Average time for sampling: 0.08516605570912361
Training: 23it [00:27, 1.14s/it]---- Average time for sampling: 0.0846009589266032
Training: 30it [00:35, 1.13s/it]---- Average time for sampling: 0.0846672392077744
Training: 36it [00:42, 1.12s/it]---- Average time for sampling: 0.08583860701570908

non-dist, gb, homo, nc, prob
Training: 4it [00:22, 5.03s/it]---- Average time for sampling: 1.1615502193570137
Training: 11it [00:52, 4.32s/it]---- Average time for sampling: 1.228120240289718
Training: 18it [01:22, 4.19s/it]---- Average time for sampling: 1.2553682390290002
Training: 24it [01:47, 4.19s/it]---- Average time for sampling: 1.2268523721955717
Training: 31it [02:16, 4.18s/it]---- Average time for sampling: 1.2365567354112863
Training: 38it [02:46, 4.21s/it]---- Average time for sampling: 1.2496669557721665

mask data

# Add prob and mask data.
num_edges = dataset.graph.total_num_edges
prob_data = torch.rand(num_edges)
prob_data[torch.randperm(len(prob_data))[: int(len(prob_data) * 0.5)]] = 0.0
dataset.graph.add_edge_attribute("prob", prob_data)
mask_data = (prob_data > 0.2).to(torch.float32)
dataset.graph.add_edge_attribute("mask", mask_data)

GB + NoMask
Training...
Training: 3it [00:00, 3.37it/s]---- Average time for sampling: 0.0064875221811234954
Training: 10it [00:02, 4.05it/s]---- Average time for sampling: 0.006722455704584717
Training: 16it [00:04, 3.92it/s]---- Average time for sampling: 0.008144200344880422
Training: 23it [00:06, 3.68it/s]---- Average time for sampling: 0.008010426000691951
Training: 30it [00:07, 4.15it/s]---- Average time for sampling: 0.008361106105148793
Training: 36it [00:09, 4.31it/s]---- Average time for sampling: 0.00815806492852668

GB + Mask
Training...
Training: 3it [00:01, 2.61it/s]---- Average time for sampling: 0.04655262678861618
Training: 10it [00:03, 3.52it/s]---- Average time for sampling: 0.05128640588372946
Training: 16it [00:04, 3.61it/s]---- Average time for sampling: 0.054167306640495856
Training: 23it [00:06, 3.64it/s]---- Average time for sampling: 0.052452172990888356
Training: 30it [00:08, 3.74it/s]---- Average time for sampling: 0.05273009695112705
Training: 36it [00:10, 3.49it/s]---- Average time for sampling: 0.05279933324394127

critical call

/** @brief An operator to perform non-uniform sampling. */
static torch::Tensor NonUniformPickOp(
    torch::Tensor probs, int64_t fanout, bool replace) {
  auto positive_probs_indices = probs.nonzero().squeeze(1);
  auto num_positive_probs = positive_probs_indices.size(0);
  if (num_positive_probs == 0) return torch::empty({0}, torch::kLong);
  if ((fanout == -1) || (num_positive_probs <= fanout && !replace)) {

Depending work items or issues

@Rhett-Ying Rhett-Ying added the Work Item (Work items tracked in project tracker) label Jun 17, 2024
@Rhett-Ying
Collaborator Author

As @mfbalin suggested, x86simdsort::qselect is an equivalent of std::nth_element that we could use for weighted sampling to obtain a speed-up: https://github.com/intel/x86-simd-sort
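For illustration, here is a minimal standalone sketch (plain C++/STL, not the GraphBolt code; the function name WeightedPickWithoutReplacement and the toy data are made up for this example) of how a partial-selection primitive such as std::nth_element can drive weighted sampling without replacement via exponential keys. x86simdsort::qselect could be swapped in where std::nth_element is called.

// Minimal sketch (not GraphBolt code): weighted sampling without replacement
// via exponential keys, using std::nth_element as the selection primitive.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

// Pick `fanout` indices from `probs` (unnormalized weights; zero entries are
// never picked) without replacement. Each candidate i gets the key
// -log(u)/probs[i]; the `fanout` smallest keys form a weighted sample.
std::vector<int64_t> WeightedPickWithoutReplacement(
    const std::vector<float>& probs, int64_t fanout, std::mt19937& rng) {
  std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
  std::vector<int64_t> candidates;
  std::vector<float> keys;
  for (int64_t i = 0; i < static_cast<int64_t>(probs.size()); ++i) {
    if (probs[i] <= 0.0f) continue;
    candidates.push_back(i);
    keys.push_back(-std::log(uniform(rng)) / probs[i]);
  }
  const int64_t k = std::min(fanout, static_cast<int64_t>(candidates.size()));
  std::vector<int64_t> order(candidates.size());
  std::iota(order.begin(), order.end(), 0);
  // Partial selection: only the k smallest keys are needed, not a full sort.
  // x86simdsort::qselect targets exactly this operation.
  std::nth_element(
      order.begin(), order.begin() + k, order.end(),
      [&](int64_t a, int64_t b) { return keys[a] < keys[b]; });
  std::vector<int64_t> picked(k);
  for (int64_t j = 0; j < k; ++j) picked[j] = candidates[order[j]];
  return picked;
}

int main() {
  std::mt19937 rng(0);
  std::vector<float> probs = {0.0f, 0.3f, 0.0f, 0.9f, 0.5f, 0.1f};
  for (int64_t idx : WeightedPickWithoutReplacement(probs, 3, rng))
    std::cout << idx << " ";
  std::cout << "\n";
  return 0;
}

The exponential-key trick gives the same distribution as repeatedly drawing without replacement, so a single partial selection replaces a loop of weighted draws.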

@az15240
Collaborator

az15240 commented Jul 23, 2024

I'm trying to reproduce your result. I didn't change any configuration or parameter options; I assume those options (non-dist, gb, homo, nc, no prob, etc.) match the current example code.

For the prob/mask data, do we regenerate it: (a) once for the whole code execution, (b) once per batch of training, or (c) once per iteration? I tried all three generation plans, but I put the generation code directly in the batch or iteration loop, so it might not be optimal.

I ran on the ogbn-products dataset:

                             None        prob        mask
Sample Once per Execution    68.09it/s   57.52it/s   56.79it/s
Sample Once per Batch        68.09it/s   56.92it/s   57.45it/s
Sample Once per Iteration    68.09it/s    3.43it/s   55.09it/s

One hypothesis for the huge performance difference is that we recompute the whole prob/mask data every iteration, which can be really costly for a huge dataset. For prob data, 57.52/3.43 = 16.76, which is roughly the reported 14x difference, but it confuses me that the mask data doesn't show a similarly large gap.

I'm also confused by your result: "GB + NoMask" takes >10x less time than "non-dist, gb, homo, nc, no prob". I assume "GB + NoMask" only adds the mask data on top of "non-dist, gb, homo, nc, no prob" without using it.


P.S. shall we update the link to the node classification example in the description? It has moved to a different path.

@az15240
Collaborator

az15240 commented Jul 24, 2024

@Rhett-Ying

@Rhett-Ying
Collaborator Author

In the issue I reported, the prob/mask is generated once when we load the dataset, so it is not changed during model training. According to the results you shared above, you didn't reproduce the issue. Could you file a draft PR with your reproduction code changes? Then I could review it and find out where the discrepancy is.

Feel free to update the example path accordingly.

@mfbalin
Collaborator

mfbalin commented Jul 24, 2024


@Rhett-Ying @az15240 You probably ran the GPU version of the kernels. It is expected that the GPU results are this fast. You need to move the graph to the CPU to measure CPU sampling time.

@az15240
Collaborator

az15240 commented Jul 24, 2024

Please check out PR #7580 for more information!

@az15240
Collaborator

az15240 commented Jul 24, 2024

@Rhett-Ying How did you calculate the average time for sampling? I think it is more direct if we focus on the number of iterations per second. This is what I get using the CPU, and it seems like the runtime isn't too bad.

None: 2.32it/s
mask: 1.02it/s
weight: 1.59it/s

@mfbalin
Collaborator

mfbalin commented Jul 24, 2024


Try using cpu-cuda instead of cpu-cpu.

@mfbalin
Collaborator

mfbalin commented Jul 24, 2024

You might want to use --mode=cpu-pinned-cuda in the pyg/node_classification_advanced.py example if you want to focus on the CPU sampling runtime. --mode=cuda-pinned-cuda would give you the GPU sampling runtime.

@Rhett-Ying
Collaborator Author

@az15240 As mentioned above, please make sure you're using the CPU for sampling; this issue happens with CPU sampling only. I think we can already see the inefficiency of CPU sampling with prob/mask from the regression benchmark you added previously.

@Rhett-Ying Rhett-Ying changed the title [GraphBolt] sample_neighbors() with prob/mask is 14x slower than w/o prob/mask [GraphBolt] sample_neighbors() on CPU with prob/mask is 14x slower than w/o prob/mask Jul 25, 2024
@az15240
Collaborator

az15240 commented Jul 25, 2024

I don't know why, but I got a Segmentation fault (core dumped) when running with the suggested cpu-cuda mode; all other modes run normally for me, and I've restarted my machine as well. 😿 For now I'll focus mainly on the benchmark regression results and on the performance of graph.sample_neighbors.

@az15240
Collaborator

az15240 commented Jul 29, 2024

Notes:

  1. early stop in NumPick ([Temporal] Optimize num pick by early stop #7370)
  2. uniform sampling for mask ([Temporal] Pick function optimization #7417)
  3. preprocess: filter down the edges for mask, or precompute the sum of neighboring edges' probs per node (decent for one-time usage, bad for the dynamic edge_attribute generation case); see the sketch below
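As a concrete (hypothetical) illustration of the "preprocess" idea in item 3 for the mask case, the following plain C++ sketch builds a filtered CSC once so that later sampling can take the uniform fast path. The Csc struct and the FilterByMask name are made up here, not GraphBolt's actual data structures.

// Sketch of the "filter down the edges for mask" preprocessing idea (names
// and layout are illustrative only).
#include <cstdint>
#include <vector>

struct Csc {
  std::vector<int64_t> indptr;   // size num_nodes + 1
  std::vector<int64_t> indices;  // neighbor ids, addressed by edge id
};

Csc FilterByMask(const Csc& g, const std::vector<uint8_t>& edge_mask) {
  Csc out;
  out.indptr.push_back(0);
  for (int64_t v = 0; v + 1 < static_cast<int64_t>(g.indptr.size()); ++v) {
    for (int64_t e = g.indptr[v]; e < g.indptr[v + 1]; ++e) {
      if (edge_mask[e]) out.indices.push_back(g.indices[e]);  // keep edge
    }
    out.indptr.push_back(static_cast<int64_t>(out.indices.size()));
  }
  return out;
}

This pays a one-time pass over all edges, which matches the note that it is decent for one-time usage but bad when the edge attribute changes dynamically.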

@az15240
Collaborator

az15240 commented Jul 29, 2024

NumPick decides the number of nodes/edges to pick, for example when there is a conflict between the fanout and the number of non-zero prob/mask entries.
Pick actually performs the picking of edges.
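A rough sketch of that split (simplified plain C++; the real GraphBolt functions have different signatures and operate on CSC slices, and the Pick body here is only a placeholder):

// Rough illustration (not the real GraphBolt signatures) of the
// NumPick / Pick split described above.
#include <algorithm>
#include <cstdint>
#include <vector>

// NumPick: decide HOW MANY edges to pick for one node, resolving conflicts
// between the requested fanout and the number of eligible neighbors
// (those with non-zero prob/mask).
int64_t NumPick(int64_t fanout, int64_t num_eligible, bool replace) {
  if (num_eligible == 0) return 0;
  if (fanout == -1) return num_eligible;   // -1 means "take all neighbors"
  if (replace) return fanout;              // duplicates allowed
  return std::min(fanout, num_eligible);   // without replacement
}

// Pick: actually choose the edge ids. This placeholder just takes the first
// `num_pick` eligible edges to show the control flow; the real Pick draws
// them randomly (uniformly, or weighted by prob).
std::vector<int64_t> Pick(const std::vector<float>& probs, int64_t num_pick) {
  std::vector<int64_t> picked;
  for (int64_t i = 0; i < static_cast<int64_t>(probs.size()); ++i) {
    if (static_cast<int64_t>(picked.size()) == num_pick) break;
    if (probs[i] > 0.0f) picked.push_back(i);
  }
  return picked;
}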

@az15240
Collaborator

az15240 commented Aug 5, 2024

Notes:

  1. early stop in NumPick ([Temporal] Optimize num pick by early stop #7370)
  2. uniform sampling for mask ([Temporal] Pick function optimization #7417)
  3. preprocess: filter down the edges for mask, or precompute the sum of neighboring edges' probs per node (decent for one-time usage, bad for the dynamic edge_attribute generation case)

The first option makes no change to the performance, since it modifies TemporalNumPick instead of NumPick.
The second option gives a 1% to 3% improvement in the sampling time of NonUniformPick, which is the path used for prob/mask.

@az15240
Collaborator

az15240 commented Aug 5, 2024

One hypothesis to explain the huge sampling time is that, in the NonUniformPickOp function, the two lines

  auto positive_probs_indices = probs.nonzero().squeeze(1);
  auto num_positive_probs = positive_probs_indices.size(0);

take too long to execute.

Meanwhile, when replace=False and no edge attribute is given, the uniform path returns immediately:

  if ((fanout == -1) || (num_neighbors <= fanout && !replace)) {
    std::iota(picked_data_ptr, picked_data_ptr + num_neighbors, offset);
    return num_neighbors;
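To make the contrast concrete, here is a small standalone sketch (plain C++, not the GraphBolt code; function names are illustrative) of the extra work the prob/mask path must do before any picking can start, compared with the uniform fast path:

// Standalone sketch contrasting the two paths quoted above for a single
// node's neighborhood.
#include <cstdint>
#include <numeric>
#include <vector>

// Uniform fast path: no per-edge scan, just fill 0..num_neighbors-1.
std::vector<int64_t> UniformFastPath(int64_t num_neighbors) {
  std::vector<int64_t> picked(num_neighbors);
  std::iota(picked.begin(), picked.end(), 0);
  return picked;
}

// Prob/mask path: scan the per-node prob slice and materialize the indices of
// the non-zero entries, mirroring what probs.nonzero().squeeze(1) must do
// (an extra pass plus an allocation) before any picking can happen.
std::vector<int64_t> CollectPositiveIndices(const std::vector<float>& probs) {
  std::vector<int64_t> positive;
  positive.reserve(probs.size());
  for (int64_t i = 0; i < static_cast<int64_t>(probs.size()); ++i) {
    if (probs[i] > 0.0f) positive.push_back(i);
  }
  return positive;
}

That per-node pass and allocation is the kind of overhead the hypothesis above points at.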

@az15240
Collaborator

az15240 commented Sep 10, 2024

Some progress that I currently have:

  1. I changed the number of workers, but if we sum up the runtime of all the subprocesses, it is very similar to the runtime with num_workers=0.
  2. I tried manually setting the number of threads around these two lines, but I got a very similar runtime regardless of the thread count. Since the probs tensor is large, I hypothesize that these two lines are implemented single-threaded.
  3. I tried Peiyuan's ideas ([Temporal] Optimize num pick by early stop #7370 and [Temporal] Pick function optimization #7417); neither improved the runtime by more than 3%.
  4. I tried four changes:
    4.1. auto positive_probs_indices = probs.nonzero().select(1, 0); (no major improvement)
    4.2. auto positive_probs_indices = torch::where(probs > 0)[0]; (made it worse)
    4.3. auto positive_probs_indices = torch::argwhere(probs > 0).flatten(); (made it worse)
    4.4. auto temp = probs.contiguous().nonzero(); (slight improvement: prob/mask sampling improves by 4.12% on average)

Data for 4.4: https://docs.google.com/spreadsheets/d/1QrH-A4Fch0McHxux7pwaNgSkaoH4HslwFHSYdZ_ADwc/edit?usp=sharing, by running benchmark_graphbolt_sampling.py

I think we can take the 4.4 change.
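For reference, a sketch of how the 4.4 variant would slot into the quoted NonUniformPickOp lines (only the first line changes; the rest is the existing code):

  // 4.4: make probs contiguous before calling nonzero().
  auto temp = probs.contiguous().nonzero();
  auto positive_probs_indices = temp.squeeze(1);
  auto num_positive_probs = positive_probs_indices.size(0);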

az15240 pushed a commit that referenced this issue Sep 10, 2024