Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Performance] Track sorted status of COO from creation #2645

Merged
merged 17 commits into from
Apr 16, 2021

Conversation

nv-dlasalle
Copy link
Collaborator

Description

This PR tries to make sure we check if a COO is sorted, when it is created via the CAPI. This avoids adding a synchronization during forward/backward to check if its sorted.

Because the unsorted path on the CPU was the only parallel path for COO->CSR, this also parallelizes the sorted path, to ensure we don't see a regression for accurately marking matrices as sorted.

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature]])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the my best knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change

Changes

Performance on COO->CSR conversion:

pre #2391:

=== livejournal ===
CPU: 0.40786104202270507 s
GPU: 0.009723591804504394 s

post #2391:

=== livejournal ===
CPU: 0.229170560836792 s
GPU: 0.08134148120880128 s

This PR:

=== livejournal ===
CPU: 0.10662269592285156 s
GPU: 0.005210304260253906 s

@@ -267,7 +267,8 @@ class CSR : public GraphInterface {
class COO : public GraphInterface {
public:
// Create a coo graph that shares the given src and dst
COO(int64_t num_vertices, IdArray src, IdArray dst);
COO(int64_t num_vertices, IdArray src, IdArray dst,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jermainewang @zheng-da is immutable_graph data structure still in use?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, but we could leave the change here.

const int64_t nz_end = std::min(NNZ, nz_start+nz_chunk);

// each thread searchs the row array for a change, and marks it's
// location in Bp. Threads have overlapping ranges from nz_start-1 to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please elaborate more on the overlapping ranges?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expanded it and added an example in d824573

341         // Each thread searches the row array for a change, and marks it's
342         // location in Bp. Threads, other than the first, start at the last
343         // index covered by the previous, in order to detect changes in the row
344         // array between thread partitions. This means that each thread after
345         // the first, searches the range [nz_start-1, nz_end). That is,
346         // if we had 10 non-zeros, and 4 threads, the indexes searched by each
347         // thread would be:
348         // 0: [0, 1, 2]
349         // 1: [2, 3, 4, 5]
350         // 2: [5, 6, 7, 8]
351         // 3: [8, 9]
352         //
353         // That way, if the row array were [0, 0, 1, 2, 2, 2, 4, 5, 5, 6], each
354         // change in row would be captured by one thread:
355         //
356         // 0: [0, 0, 1] - row 0
357         // 1: [1, 2, 2, 2] - row 1
358         // 2: [2, 4, 5, 5] - rows 2, 3, and 4
359         // 3: [5, 6] - rows 5 and 6
360         //

@@ -312,26 +313,65 @@ CSRMatrix COOToCSR(COOMatrix coo) {
NDArray ret_indices;
NDArray ret_data;

bool row_sorted = coo.row_sorted;
bool col_sorted = coo.col_sorted;
const bool row_sorted = coo.row_sorted;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also change cuda/spmat_op_impl_coo.cu?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation in cuda/coo2csr.cu only operates on the sorted COO (and sorts it if its not).

I updated this, as properly tracking the sorted status of a COO was causing a regression, where the conversion was being done serially. Now that both paths are parallel, properly tracking the COO to be sorted does not result in a slowdown.

// setup sorted flags
bool row_sorted, col_sorted;
std::tie(row_sorted, col_sorted) = COOIsSorted(
aten::COOMatrix(num_src, num_dst, row, col));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this check here could potentially slowdown graph construction. We should expose the sorted flags up to the user-facing API:

g = dgl.graph((src, dst), row_sorted=True, col_sorted=True)

Probably also need to expose a utility function for checking sorting status:

row_sorted, col_sorted = dgl.utils.is_sorted_srcdst(src, dst)

We could push the later utility function to next PR. When creating the livejournal graph, we manually set both flags to true.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jermainewang Given that previously COOIsSorted() was being called at each format conversion, and that it's roughly the cost of copying the graph, would it be fine to make it the default to check?

We could have a flag check_sorted that would default to True, but would be ignored if row_sorted was true. That way power users and internal calls could skip the check:

g = dgl.graph((src, dst)) # IsSorted check is invoked
g = dgl.graph((src, dst), check_sorted=False) # IsSorted is not invoked
g = dgl.graph((src, dst), row_sorted=True, col_sorted=True) # IsSorted is not invoked
g = dgl.graph((src, dst), row_sorted=True) # IsSorted is not invoked

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the dgl.utils.is_sorted_srcdst in #2685

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jermainewang Given that previously COOIsSorted() was being called at each format conversion, and that it's roughly the cost of copying the graph, would it be fine to make it the default to check?

It is possible that users only want to perform operations on COO thus not involving any format conversion. For example, in graph classification, once each sample is converted to a graph (likely from a COO format), users call dgl.batch immediately before any computation. In this case, adding the sorted check will likely slowdown the graph construction.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In 26e7364 I changed the check_sorted parameter to default to false.

@jermainewang
Copy link
Member

@nv-dlasalle I pushed some changes myself and approved the PR. I further removed the check_sorted flag in dgl.graph. I think users should explicitly call is_srcdst_sorted to get those flags if they wish to optimize format conversion. DGL just trusts any flags user provided and by default assumes nothing (not sorted). For the next step, we should refactor all our graph datasets to return sorted graphs by default.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants